Spark WordCount in Scala, Explained

This article walks through writing a WordCount program for Spark in Scala, from project setup to cluster submission; it should serve as a useful reference. If anything here is explained incorrectly or incompletely, please leave a comment pointing it out. Thanks!

1. Create a Maven project
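
Any standard Maven project works; the only requirement is that the Scala sources live under src/main/scala so the scala-maven-plugin configured below picks them up. A typical layout (the project name matches the jar used in step 6):

spark-in-action/
├── pom.xml
└── src/main/scala/cn/_51doit/day01/WordCount.scala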

2. Add the dependencies

<!-- Common version constants -->
<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <scala.version>2.12.10</scala.version>
    <spark.version>3.0.0</spark.version>
    <encoding>UTF-8</encoding>
</properties>
<dependencies>
    <!-- Scala standard library -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
        <!-- provided: not bundled into the jar at package time -->
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>${spark.version}</version>
        <!-- provided: not bundled into the jar at package time -->
        <scope>provided</scope>
    </dependency>
</dependencies>
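
Two things are easy to miss here. First, the Scala binary version must line up: spark-core_2.12 is built for Scala 2.12, which is why scala.version is 2.12.10. Second, provided scope means these jars are not bundled by the shade plugin; the cluster's Spark installation supplies them at runtime. To run the program inside the IDE, the provided-scope dependencies must be put on the run classpath (IntelliJ IDEA, for example, has a run-configuration option to include provided-scope dependencies).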
<build>
    <pluginManagement>
        <plugins>
            <!-- Plugin that compiles Scala -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
            </plugin>
            <!-- Plugin that compiles Java -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
            </plugin>
        </plugins>
    </pluginManagement>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
                <execution>
                    <id>scala-test-compile</id>
                    <phase>process-test-resources</phase>
                    <goals>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <executions>
                <execution>
                    <phase>compile</phase>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- Plugin that builds the fat (shaded) jar -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
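
The shade filter above strips signature files (*.SF, *.DSA, *.RSA) that come from signed dependencies; left inside a fat jar, they typically cause a java.lang.SecurityException ("Invalid signature file digest for Manifest main attributes") when the jar is run.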

3. Write the program

package cn._51doit.day01

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {
    // Create the SparkContext; it is the entry point for creating the initial RDDs
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // Create an RDD from the input path (args(0))
    val lines: RDD[String] = sc.textFile(args(0))
    // Split each line into words and flatten
    val words: RDD[String] = lines.flatMap(_.split(" "))
    // Pair each word with the count 1
    val wordAndOne = words.map((_, 1))
    // Aggregate the counts per word
    val reduced = wordAndOne.reduceByKey(_ + _)
    // Sort by count in descending order
    val sorted = reduced.sortBy(_._2, false)
    // saveAsTextFile is an action, so it triggers job execution;
    // write the result to HDFS (args(1))
    sorted.saveAsTextFile(args(1))
    // Release resources
    sc.stop()
  }
}
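
For debugging without a cluster, the same pipeline can be run in local mode straight from the IDE. A minimal sketch, assuming hypothetical local paths ./wc-input and ./wc-output, and assuming the provided-scope dependencies are on the run classpath:

package cn._51doit.day01

import org.apache.spark.{SparkConf, SparkContext}

// Local-mode test sketch; local[*] uses all cores of this machine.
// In the cluster version above, the master is passed by spark-submit instead.
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountLocal").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile("./wc-input")             // hypothetical input path
      .flatMap(_.split(" "))              // split lines into words
      .map((_, 1))                        // pair each word with 1
      .reduceByKey(_ + _)                 // sum counts per word
      .sortBy(_._2, ascending = false)    // descending by count
      .saveAsTextFile("./wc-output")      // hypothetical output path
    sc.stop()
  }
}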

4. Package the jar
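
With the shade plugin bound to the package phase, a plain Maven build is enough; run from the project root:

mvn clean package
# the shaded jar is written to target/, e.g. target/spark-in-action-1.0-SNAPSHOT.jar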

5. Upload to the cluster
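
For example, copy the jar to the node you submit from and load the input data into HDFS. The host names and paths match the submit command in the next step; the local input file name is illustrative:

scp target/spark-in-action-1.0-SNAPSHOT.jar root@linux01:/root/
hdfs dfs -put words.txt hdfs://linux01:9000/sp-data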

6. Submit the job

/opt/apps/spark-3.0.0-bin-hadoop2.7/bin/spark-submit \
  --master spark://linux01:7077 \
  --executor-memory 1g \
  --total-executor-cores 5 \
  --class cn._51doit.day01.WordCount \
  /root/spark-in-action-1.0-SNAPSHOT.jar \
  hdfs://linux01:9000/sp-data hdfs://linux01:9000/out-spark/out2
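
Once the job finishes, the result files can be inspected with the standard HDFS shell (output path as above):

hdfs dfs -cat hdfs://linux01:9000/out-spark/out2/part-*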
