Writing WordCount with Spark (in Scala): A Detailed Big Data Walkthrough

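This article walks through writing WordCount with Spark in Scala. I hope it serves as a useful reference; if any explanation is wrong or incomplete, please leave a comment to point it out. Thank you!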

1. Create a Maven project

2. Add the dependencies

    <!-- Define some constants -->
    <properties> 
        <maven.compiler.source>1.8</maven.compiler.source> 
        <maven.compiler.target>1.8</maven.compiler.target> 
        <scala.version>2.12.10</scala.version> 
        <spark.version>3.0.0</spark.version> 
        <encoding>UTF-8</encoding> 
    </properties> 
 
    <dependencies> 
        <!-- Scala dependency -->
        <dependency> 
            <groupId>org.scala-lang</groupId> 
            <artifactId>scala-library</artifactId> 
            <version>${scala.version}</version> 
            <!-- provided scope: this dependency is not bundled into the jar when packaging -->
            <scope>provided</scope> 
        </dependency> 
 
        <dependency> 
            <groupId>org.apache.spark</groupId> 
            <artifactId>spark-core_2.12</artifactId> 
            <version>${spark.version}</version> 
            <!-- provided scope: this dependency is not bundled into the jar when packaging -->
            <scope>provided</scope> 
        </dependency> 
    </dependencies> 
 
    <build> 
        <pluginManagement> 
            <plugins> 
                <!-- Plugin that compiles Scala -->
                <plugin> 
                    <groupId>net.alchim31.maven</groupId> 
                    <artifactId>scala-maven-plugin</artifactId> 
                    <version>3.2.2</version> 
                </plugin> 
                <!-- Plugin that compiles Java -->
                <plugin> 
                    <groupId>org.apache.maven.plugins</groupId> 
                    <artifactId>maven-compiler-plugin</artifactId> 
                    <version>3.5.1</version> 
                </plugin> 
            </plugins> 
        </pluginManagement> 
        <plugins> 
            <plugin> 
                <groupId>net.alchim31.maven</groupId> 
                <artifactId>scala-maven-plugin</artifactId> 
                <executions> 
                    <execution> 
                        <id>scala-compile-first</id> 
                        <phase>process-resources</phase> 
                        <goals> 
                            <goal>add-source</goal> 
                            <goal>compile</goal> 
                        </goals> 
                    </execution> 
                    <execution> 
                        <id>scala-test-compile</id> 
                        <phase>process-test-resources</phase> 
                        <goals> 
                            <goal>testCompile</goal> 
                        </goals> 
                    </execution> 
                </executions> 
            </plugin> 
 
            <plugin> 
                <groupId>org.apache.maven.plugins</groupId> 
                <artifactId>maven-compiler-plugin</artifactId> 
                <executions> 
                    <execution> 
                        <phase>compile</phase> 
                        <goals> 
                            <goal>compile</goal> 
                        </goals> 
                    </execution> 
                </executions> 
            </plugin> 
 
            <!-- Plugin that builds the jar -->
            <plugin> 
                <groupId>org.apache.maven.plugins</groupId> 
                <artifactId>maven-shade-plugin</artifactId> 
                <version>2.4.3</version> 
                <executions> 
                    <execution> 
                        <phase>package</phase> 
                        <goals> 
                            <goal>shade</goal> 
                        </goals> 
                        <configuration> 
                            <filters> 
                                <filter> 
                                    <artifact>*:*</artifact> 
                                    <excludes> 
                                        <exclude>META-INF/*.SF</exclude> 
                                        <exclude>META-INF/*.DSA</exclude> 
                                        <exclude>META-INF/*.RSA</exclude> 
                                    </excludes> 
                                </filter> 
                            </filters> 
                        </configuration> 
                    </execution> 
                </executions> 
            </plugin> 
        </plugins> 
    </build> 

3. Write the program

package cn._51doit.day01

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    // Create the SparkContext
    val conf = new SparkConf().setAppName("WordCount")
    // SparkContext is what creates the initial RDD
    val sc = new SparkContext(conf)

    // Create an RDD from the input path
    val lines: RDD[String] = sc.textFile(args(0))

    // Split each line into words and flatten
    val words: RDD[String] = lines.flatMap(_.split(" "))

    // Pair each word with 1
    val wordAndOne = words.map((_, 1))

    // Group by word and aggregate the counts
    val reduced = wordAndOne.reduceByKey(_ + _)

    // Sort by count, descending
    val sorted = reduced.sortBy(_._2, false)

    // Action operator: this triggers job execution
    // Save the data to HDFS
    sorted.saveAsTextFile(args(1))

    // Release resources
    sc.stop()

  }
}
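
To see what each transformation contributes without starting a cluster, the same steps can be sketched on an ordinary in-memory Scala collection. This is only an illustration under assumptions: `LocalWordCount`, `countWords`, and the sample lines are hypothetical names and data, not part of the article's project. `groupBy` followed by a per-key sum stands in for `reduceByKey`.

```scala
// Local sketch of the WordCount steps on a plain Scala collection (no Spark).
// LocalWordCount and countWords are hypothetical names used for illustration.
object LocalWordCount {

  def countWords(lines: Seq[String]): Seq[(String, Int)] = {
    // Split each line on spaces and flatten (flatMap in the Spark job)
    val words = lines.flatMap(_.split(" "))
    // Pair each word with 1 (map in the Spark job)
    val wordAndOne = words.map((_, 1))
    // Group by word and sum the ones (the per-key result reduceByKey produces)
    val reduced = wordAndOne.groupBy(_._1).map {
      case (word, pairs) => (word, pairs.map(_._2).sum)
    }
    // Sort by count, descending (sortBy(_._2, false) in the Spark job)
    reduced.toSeq.sortBy(-_._2)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample input, standing in for the lines read by sc.textFile
    val sample = Seq("spark hadoop spark", "flink spark hadoop")
    countWords(sample).foreach(println)
  }
}
```

The difference in the real job is that each step runs on a distributed RDD and nothing executes until an action such as `saveAsTextFile` is called.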
 

4. Package

5. Upload to the cluster

6. Launch

/opt/apps/spark-3.0.0-bin-hadoop2.7/bin/spark-submit \
  --master spark://linux01:7077 \
  --executor-memory 1g \
  --total-executor-cores 5 \
  --class cn._51doit.day01.WordCount \
  /root/spark-in-action-1.0-SNAPSHOT.jar \
  hdfs://linux01:9000/sp-data \
  hdfs://linux01:9000/out-spark/out2
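
The two HDFS paths that follow the jar in this command arrive in `main` as `args(0)` (input) and `args(1)` (output). The program as written reads them without checking, so a missing argument surfaces as an `ArrayIndexOutOfBoundsException`. A minimal sketch of validating them first might look like the following; `WordCountArgs`, `parse`, and the usage message are hypothetical and do not appear in the original program.

```scala
// Hypothetical helper: validates the two positional arguments passed by spark-submit.
object WordCountArgs {

  // Returns Right((inputPath, outputPath)) when exactly two non-empty
  // arguments are given, otherwise Left with a usage message.
  def parse(args: Array[String]): Either[String, (String, String)] =
    args match {
      case Array(input, output) if input.nonEmpty && output.nonEmpty =>
        Right((input, output))
      case _ =>
        Left("Usage: WordCount <input path> <output path>")
    }

  def main(args: Array[String]): Unit =
    parse(args) match {
      case Right((in, out)) => println(s"input=$in output=$out")
      case Left(msg)        => System.err.println(msg)
    }
}
```

In the job itself, such a check would run before the `SparkContext` is created, so a bad invocation fails fast. Note also that `saveAsTextFile` fails if the output directory already exists, so each run needs a fresh output path.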

Original article by Maggie-Hunter. If you repost it, please credit the source: https://blog.ytso.com/228146.html
