[BigDataHadoop:Hadoop&Spark.V05] [BigDataHadoop.Spark内存级快速计算引擎][|章节四|Hadoop|spark|spark sql:spark sql编程&Transformation操作|]
一、Transformation 操作
### --- select * from tab where ... group by ... having... order by...
# --- 1、RDD类似的操作持久化
~~~ 缓存与checkpoint
~~~ select
~~~ where
~~~ group by / 聚合
~~~ order by
~~~ join
~~~ 集合操作
~~~ 空值操作(函数)
~~~ 函数
### --- 2、与RDD类似的操作
map、filter、flatMap、mapPartitions、sample、 randomSplit、
limit、distinct、dropDuplicates、describe
scala> df1.map(row=>row.getAs[Int](0)).show
+-----+
|value|
+-----+
| 7369|
| 7499|
| 7521|
| 7566|
| 7654|
| 7698|
| 7782|
| 7788|
| 7839|
| 7844|
| 7876|
| 7900|
| 7902|
| 7934|
+-----+
~~~ # randomSplit(与RDD类似,将DF、DS按给定参数分成多份)
scala> val df2 = df1.randomSplit(Array(0.5, 0.6, 0.7))
df2: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Array([EMPNO: int, ENAME: string ... 6 more fields], [EMPNO: int, ENAME: string ... 6 more fields], [EMPNO: int, ENAME: string ... 6 more fields])
scala> df2(0).count
res76: Long = 2
scala> df2(1).count
res77: Long = 4
scala> df2(2).count
res78: Long = 8
~~~ # 取10行数据生成新的DataSet
scala> val df2 = df1.limit(10)
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [EMPNO: int, ENAME: string ... 6 more fields]
~~~ # distinct,去重
scala> val df2 = df1.union(df1)
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df2.distinct.count
res79: Long = 14
~~~ # dropDuplicates,按列值去重
scala> df2.dropDuplicates.show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
+-----+------+---------+----+-------------------+----+----+------+
scala> df2.dropDuplicates("mgr", "deptno").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df2.dropDuplicates("mgr").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df2.dropDuplicates("deptno").show
+-----+-----+--------+----+-------------------+----+----+------+
|EMPNO|ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+-----+--------+----+-------------------+----+----+------+
| 7369|SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7782|CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499|ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
+-----+-----+--------+----+-------------------+----+----+------+
~~~ # 返回全部列的统计(count、mean、stddev、min、max)
scala> ds1.describe().show
+-------+----+----------------+------------------+
|summary|name| age| height|
+-------+----+----------------+------------------+
| count| 3| 3| 3|
| mean|null| 18.0|164.33333333333334|
| stddev|null|9.16515138991168|20.008331597945226|
| min|Andy| 10| 144|
| max| Tom| 28| 184|
+-------+----+----------------+------------------+
~~~ # 返回指定列的统计
scala> ds1.describe("*").show
+-------+----+----------------+------------------+
|summary|name| age| height|
+-------+----+----------------+------------------+
| count| 3| 3| 3|
| mean|null| 18.0|164.33333333333334|
| stddev|null|9.16515138991168|20.008331597945226|
| min|Andy| 10| 144|
| max| Tom| 28| 184|
+-------+----+----------------+------------------+
### --- 3、存储相关
~~~ cacheTable、persist、checkpoint、unpersist、cache
~~~ 备注:Dataset 默认的存储级别是 MEMORY_AND_DISK
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel
scala> spark.sparkContext.setCheckpointDir("hdfs://hadoop01:9000/checkpoint")
scala> df1.show()
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.checkpoint()
res36: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df1.cache()
res37: df1.type = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df1.persist(StorageLevel.MEMORY_ONLY)
21/10/20 15:45:46 WARN CacheManager: Asked to cache already cached data.
res38: df1.type = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df1.count()
res39: Long = 14
scala> df1.unpersist(true)
res40: df1.type = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df1.createOrReplaceTempView("t1")
scala> spark.catalog.cacheTable("t1")
scala> spark.catalog.uncacheTable("t1")
### --- 4、select相关
~~~ 列的多种表示、select、selectExpr
~~~ drop、withColumn、withColumnRenamed、cast(内置函数)
~~~ # 列的多种表示方法。使用""、$""、'、col()、ds("")
~~~ # 注意:不要混用;必要时使用spark.implicitis._;并非每个表示在所有的地方都有效
scala> df1.select($"ename", $"hiredate", $"sal").show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
scala> df1.select("ename", "hiredate", "sal").show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
scala> df1.select('ename, 'hiredate, 'sal).show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
scala> df1.select(col("ename"), col("hiredate"), col("sal")).show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
scala> df1.select(df1("ename"), df1("hiredate"), df1("sal")).show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
~~~ # 下面的写法无效,其他列的表示法有效
scala> df1.select("ename", "hiredate", "sal"+100).show
scala> df1.select("ename", "hiredate", "sal+100").show
~~~ # 这样写才符合语法
scala> df1.select($"ename", $"hiredate", $"sal"+100).show
+------+-------------------+-----------+
| ename| hiredate|(sal + 100)|
+------+-------------------+-----------+
| SMITH|2001-01-02 22:12:13| 900|
| ALLEN|2002-01-02 22:12:13| 1700|
| WARD|2003-01-02 22:12:13| 1350|
| JONES|2004-01-02 22:12:13| 3075|
|MARTIN|2005-01-02 22:12:13| 1350|
| BLAKE|2005-04-02 22:12:13| 2950|
| CLARK|2006-03-02 22:12:13| 2550|
| SCOTT|2007-03-02 22:12:13| 3100|
| KING|2006-03-02 22:12:13| 5100|
|TURNER|2009-07-02 22:12:13| 1600|
| ADAMS|2010-05-02 22:12:13| 1200|
| JAMES|2011-06-02 22:12:13| 1050|
| FORD|2011-07-02 22:12:13| 3100|
|MILLER|2012-11-02 22:12:13| 1400|
+------+-------------------+-----------+
scala> df1.select('ename, 'hiredate, 'sal+100).show
+------+-------------------+-----------+
| ename| hiredate|(sal + 100)|
+------+-------------------+-----------+
| SMITH|2001-01-02 22:12:13| 900|
| ALLEN|2002-01-02 22:12:13| 1700|
| WARD|2003-01-02 22:12:13| 1350|
| JONES|2004-01-02 22:12:13| 3075|
|MARTIN|2005-01-02 22:12:13| 1350|
| BLAKE|2005-04-02 22:12:13| 2950|
| CLARK|2006-03-02 22:12:13| 2550|
| SCOTT|2007-03-02 22:12:13| 3100|
| KING|2006-03-02 22:12:13| 5100|
|TURNER|2009-07-02 22:12:13| 1600|
| ADAMS|2010-05-02 22:12:13| 1200|
| JAMES|2011-06-02 22:12:13| 1050|
| FORD|2011-07-02 22:12:13| 3100|
|MILLER|2012-11-02 22:12:13| 1400|
+------+-------------------+-----------+
~~~ # 可使用expr表达式(expr里面只能使用引号)
scala> df1.select(expr("comm+100"), expr("sal+100"), expr("ename")).show
+------------+-----------+------+
|(comm + 100)|(sal + 100)| ename|
+------------+-----------+------+
| null| 900| SMITH|
| 400| 1700| ALLEN|
| 600| 1350| WARD|
| null| 3075| JONES|
| 1500| 1350|MARTIN|
| null| 2950| BLAKE|
| null| 2550| CLARK|
| null| 3100| SCOTT|
| null| 5100| KING|
| 100| 1600|TURNER|
| null| 1200| ADAMS|
| null| 1050| JAMES|
| null| 3100| FORD|
| null| 1400|MILLER|
+------------+-----------+------+
scala> df1.selectExpr("ename as name").show
+------+
| name|
+------+
| SMITH|
| ALLEN|
| WARD|
| JONES|
|MARTIN|
| BLAKE|
| CLARK|
| SCOTT|
| KING|
|TURNER|
| ADAMS|
| JAMES|
| FORD|
|MILLER|
+------+
scala> df1.selectExpr("power(sal, 2)", "sal").show
+---------------------------------------------+----+
|POWER(CAST(sal AS DOUBLE), CAST(2 AS DOUBLE))| sal|
+---------------------------------------------+----+
| 640000.0| 800|
| 2560000.0|1600|
| 1562500.0|1250|
| 8850625.0|2975|
| 1562500.0|1250|
| 8122500.0|2850|
| 6002500.0|2450|
| 9000000.0|3000|
| 2.5E7|5000|
| 2250000.0|1500|
| 1210000.0|1100|
| 902500.0| 950|
| 9000000.0|3000|
| 1690000.0|1300|
+---------------------------------------------+----+
scala> df1.selectExpr("round(sal, -3) as newsal", "sal", "ename").show
+------+----+------+
|newsal| sal| ename|
+------+----+------+
| 1000| 800| SMITH|
| 2000|1600| ALLEN|
| 1000|1250| WARD|
| 3000|2975| JONES|
| 1000|1250|MARTIN|
| 3000|2850| BLAKE|
| 2000|2450| CLARK|
| 3000|3000| SCOTT|
| 5000|5000| KING|
| 2000|1500|TURNER|
| 1000|1100| ADAMS|
| 1000| 950| JAMES|
| 3000|3000| FORD|
| 1000|1300|MILLER|
+------+----+------+
~~~ # drop、withColumn、 withColumnRenamed、casting
~~~ # drop 删除一个或多个列,得到新的DF
scala> df1.drop("mgr")
res42: org.apache.spark.sql.DataFrame = [EMPNO: int, ENAME: string ... 5 more fields]
scala> df1.drop("empno", "mgr")
res43: org.apache.spark.sql.DataFrame = [ENAME: string, JOB: string ... 4 more fields]
~~~ # withColumn,修改列值
scala> val df2 = df1.withColumn("sal", $"sal"+1000)
df2: org.apache.spark.sql.DataFrame = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df2.show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| sal|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13|1800|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|2600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|2250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|3975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|2250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|3850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|3450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|4000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|6000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|2500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|2100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13|1950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|4000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|2300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # withColumnRenamed,更改列名
~~~ 备注:drop、withColumn、withColumnRenamed返回的是DF
scala> df1.withColumnRenamed("sal", "newsal")
res45: org.apache.spark.sql.DataFrame = [EMPNO: int, ENAME: string ... 6 more fields]
~~~ # cast,类型转换
scala> df1.selectExpr("cast(empno as string)").printSchema
root
|-- empno: string (nullable = true)
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> df1.select('empno.cast(StringType)).printSchema
root
|-- empno: string (nullable = true)
### --- 5、where相关
~~~ where == filter
~~~ # where操作
scala> df1.filter("sal>1000").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.filter("sal>1000 and job=='MANAGER'").show
+-----+-----+-------+----+-------------------+----+----+------+
|EMPNO|ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+-----+-------+----+-------------------+----+----+------+
| 7566|JONES|MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698|BLAKE|MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782|CLARK|MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
+-----+-----+-------+----+-------------------+----+----+------+
~~~ # filter操作
scala> df1.where("sal>1000").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.where("sal>1000 and job=='MANAGER'").show
+-----+-----+-------+----+-------------------+----+----+------+
|EMPNO|ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+-----+-------+----+-------------------+----+----+------+
| 7566|JONES|MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698|BLAKE|MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782|CLARK|MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
+-----+-----+-------+----+-------------------+----+----+------+
### --- 6、groupBy相关
~~~ groupBy、agg、max、min、avg、sum、count(后面5个为内置函数)
~~~ # groupBy、max、min、mean、sum1 、count(与df1.count不同)
scala> df1.groupBy("Job").sum("sal").show
+---------+--------+
| Job|sum(sal)|
+---------+--------+
| ANALYST| 6000|
| SALESMAN| 5600|
| CLERK| 4150|
| MANAGER| 8275|
|PRESIDENT| 5000|
+---------+--------+
scala> df1.groupBy("Job").max("sal").show
+---------+--------+
| Job|max(sal)|
+---------+--------+
| ANALYST| 3000|
| SALESMAN| 1600|
| CLERK| 1300|
| MANAGER| 2975|
|PRESIDENT| 5000|
+---------+--------+
scala> df1.groupBy("Job").min("sal").show
+---------+--------+
| Job|min(sal)|
+---------+--------+
| ANALYST| 3000|
| SALESMAN| 1250|
| CLERK| 800|
| MANAGER| 2450|
|PRESIDENT| 5000|
+---------+--------+
scala> df1.groupBy("Job").avg("sal").show
+---------+------------------+
| Job| avg(sal)|
+---------+------------------+
| ANALYST| 3000.0|
| SALESMAN| 1400.0|
| CLERK| 1037.5|
| MANAGER|2758.3333333333335|
|PRESIDENT| 5000.0|
+---------+------------------+
scala> df1.groupBy("Job").count.show
+---------+-----+
| Job|count|
+---------+-----+
| ANALYST| 2|
| SALESMAN| 4|
| CLERK| 4|
| MANAGER| 3|
|PRESIDENT| 1|
+---------+-----+
~~~ # 类似having子句
scala> df1.groupBy("Job").avg("sal").where("avg(sal) > 2000").show
+---------+------------------+
| Job| avg(sal)|
+---------+------------------+
| ANALYST| 3000.0|
| MANAGER|2758.3333333333335|
|PRESIDENT| 5000.0|
+---------+------------------+
scala> df1.groupBy("Job").avg("sal").where($"avg(sal)" > 2000).show
+---------+------------------+
| Job| avg(sal)|
+---------+------------------+
| ANALYST| 3000.0|
| MANAGER|2758.3333333333335|
|PRESIDENT| 5000.0|
+---------+------------------+
~~~ # agg
scala> df1.groupBy("Job").agg("sal"->"max", "sal"->"min", "sal"->"avg", "sal"->"sum", "sal"->"count").show
+---------+--------+--------+------------------+--------+----------+
| Job|max(sal)|min(sal)| avg(sal)|sum(sal)|count(sal)|
+---------+--------+--------+------------------+--------+----------+
| ANALYST| 3000| 3000| 3000.0| 6000| 2|
| SALESMAN| 1600| 1250| 1400.0| 5600| 4|
| CLERK| 1300| 800| 1037.5| 4150| 4|
| MANAGER| 2975| 2450|2758.3333333333335| 8275| 3|
|PRESIDENT| 5000| 5000| 5000.0| 5000| 1|
+---------+--------+--------+------------------+--------+----------+
scala> df1.groupBy("deptno").agg("sal"->"max", "sal"->"min", "sal"->"avg", "sal"->"sum", "sal"->"count").show
+------+--------+--------+------------------+--------+----------+
|deptno|max(sal)|min(sal)| avg(sal)|sum(sal)|count(sal)|
+------+--------+--------+------------------+--------+----------+
| 20| 3000| 800| 2175.0| 10875| 5|
| 10| 5000| 1300|2916.6666666666665| 8750| 3|
| 30| 2850| 950|1566.6666666666667| 9400| 6|
+------+--------+--------+------------------+--------+----------+
~~~ # 这种方式更好理解
scala> df1.groupBy("Job").agg(max("sal"), min("sal"), avg("sal"), sum("sal"), count("sal")).show
+---------+--------+--------+------------------+--------+----------+
| Job|max(sal)|min(sal)| avg(sal)|sum(sal)|count(sal)|
+---------+--------+--------+------------------+--------+----------+
| ANALYST| 3000| 3000| 3000.0| 6000| 2|
| SALESMAN| 1600| 1250| 1400.0| 5600| 4|
| CLERK| 1300| 800| 1037.5| 4150| 4|
| MANAGER| 2975| 2450|2758.3333333333335| 8275| 3|
|PRESIDENT| 5000| 5000| 5000.0| 5000| 1|
+---------+--------+--------+------------------+--------+----------+
~~~ # 给列取别名
scala> df1.groupBy("Job").agg(max("sal"), min("sal"), avg("sal"),
| sum("sal"), count("sal")).withColumnRenamed("min(sal)",
| "min1").show
+---------+--------+----+------------------+--------+----------+
| Job|max(sal)|min1| avg(sal)|sum(sal)|count(sal)|
+---------+--------+----+------------------+--------+----------+
| ANALYST| 3000|3000| 3000.0| 6000| 2|
| SALESMAN| 1600|1250| 1400.0| 5600| 4|
| CLERK| 1300| 800| 1037.5| 4150| 4|
| MANAGER| 2975|2450|2758.3333333333335| 8275| 3|
|PRESIDENT| 5000|5000| 5000.0| 5000| 1|
+---------+--------+----+------------------+--------+----------+
~~~ # 给列取别名,最简便
scala> df1.groupBy("Job").agg(max("sal").as("max1"),
| min("sal").as("min2"), avg("sal").as("avg3"),
| sum("sal").as("sum4"), count("sal").as("count5")).show
+---------+----+----+------------------+----+------+
| Job|max1|min2| avg3|sum4|count5|
+---------+----+----+------------------+----+------+
| ANALYST|3000|3000| 3000.0|6000| 2|
| SALESMAN|1600|1250| 1400.0|5600| 4|
| CLERK|1300| 800| 1037.5|4150| 4|
| MANAGER|2975|2450|2758.3333333333335|8275| 3|
|PRESIDENT|5000|5000| 5000.0|5000| 1|
+---------+----+----+------------------+----+------+
### --- 7、orderBy相关
~~~ orderBy == sort
~~~ # orderBy
scala> df1.orderBy("sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy($"sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy($"sal".asc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # 降序
scala> df1.orderBy(-$"sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy('sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy(col("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy(df1("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy($"sal".desc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy(-'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy(-'deptno, -'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # sort,以下语句等价
scala> df1.sort("sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort($"sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort($"sal".asc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort('sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort(col("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort(df1("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort($"sal".desc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort(-'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort(-'deptno, -'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
### --- 8、join相关
~~~ # 1、笛卡尔积
df1.crossJoin(df1).count
~~~ # 2、等值连接(单字段)(连接字段empno,仅显示了一次)
df1.join(df1, "empno").count
~~~ # 3、等值连接(多字段)(连接字段empno、ename,仅显示了一次)
df1.join(df1, Seq("empno", "ename")).show
~~~ # 定义第一个数据集
case class StudentAge(sno: Int, name: String, age: Int)
val lst = List(StudentAge(1,"Alice", 18), StudentAge(2,"Andy", 19), StudentAge(3,"Bob", 17), StudentAge(4,"Justin", 21),
StudentAge(5,"Cindy", 20))
val ds1 = spark.createDataset(lst)
ds1.show()
~~~ # 定义第二个数据集
case class StudentHeight(sname: String, height: Int)
val rdd = sc.makeRDD(List(StudentHeight("Alice", 160),
StudentHeight("Andy", 159), StudentHeight("Bob", 170),
StudentHeight("Cindy", 165), StudentHeight("Rose", 160)))
val ds2 = rdd.toDS
~~~ # 备注:不能使用双引号,而且这里是 ===
ds1.join(ds2, $"name"===$"sname").show
ds1.join(ds2, 'name==='sname).show
ds1.join(ds2, ds1("name")===ds2("sname")).show
ds1.join(ds2, ds1("sname")===ds2("sname"), "inner").show
~~~ # 多种连接方式
ds1.join(ds2, $"name"===$"sname").show
ds1.join(ds2, $"name"===$"sname", "inner").show
ds1.join(ds2, $"name"===$"sname", "left").show
ds1.join(ds2, $"name"===$"sname", "left_outer").show
ds1.join(ds2, $"name"===$"sname", "right").show
ds1.join(ds2, $"name"===$"sname", "right_outer").show
ds1.join(ds2, $"name"===$"sname", "outer").show
ds1.join(ds2, $"name"===$"sname", "full").show
ds1.join(ds2, $"name"===$"sname", "full_outer").show
~~~ # 备注:DS在join操作之后变成了DF
### --- 9、集合相关
~~~ union==unionAll(过期)、intersect、except
~~~ # union、unionAll、intersect、except。集合的交、并、差
scala> val ds3 = ds1.select("name")
scala> val ds4 = ds2.select("sname")
~~~ # union 求并集,不去重
scala> ds3.union(ds4).show
~~~ # unionAll、union 等价;unionAll过期方法,不建议使用
scala> ds3.unionAll(ds4).show
~~~ # intersect 求交
scala> ds3.intersect(ds4).show
~~~ # except 求差
scala> ds3.except(ds4).show
### --- 10、空值处理
~~~ na.fill、na.drop
~~~ # NaN (Not a Number)
scala> math.sqrt(-1.0)
res90: Double = NaN
scala> math.sqrt(-1.0).isNaN()
res91: Boolean = true
scala> df1.show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # 删除所有列的空值和NaN
scala> df1.na.drop.show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
+-----+------+--------+----+-------------------+----+----+------+
~~~ # 删除某列的空值和NaN
scala> df1.na.drop(Array("mgr")).show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+--------+----+-------------------+----+----+------+
~~~ # 对全部列填充;对指定单列填充;对指定多列填充
scala> df1.na.fill(1000).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|1000| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|1000| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|1000| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|1000| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|1000| 20|
| 7839| KING|PRESIDENT|1000|2006-03-02 22:12:13|5000|1000| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|1000| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|1000| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|1000| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|1000| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.na.fill(1000, Array("comm")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|1000| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|1000| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|1000| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|1000| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|1000| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|1000| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|1000| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|1000| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|1000| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|1000| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.na.fill(Map("mgr"->2000, "comm"->1000)).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|1000| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|1000| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|1000| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|1000| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|1000| 20|
| 7839| KING|PRESIDENT|2000|2006-03-02 22:12:13|5000|1000| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|1000| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|1000| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|1000| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|1000| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # 对指定的值进行替换
scala> df1.na.replace("comm" :: "deptno" :: Nil, Map(0 -> 100, 10 -> 100)).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 100|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 100|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 100| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 100|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # 查询空值列或非空值列。isNull、isNotNull为内置函数
scala> df1.filter("comm is null").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.filter($"comm".isNull).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.filter(col("comm").isNull).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.filter("comm is not null").show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
+-----+------+--------+----+-------------------+----+----+------+
scala> df1.filter(col("comm").isNotNull).show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
+-----+------+--------+----+-------------------+----+----+------+
### --- 11、窗口函数
~~~ 一般情况下窗口函数不用 DSL 处理,直接用SQL更方便
~~~ 参考源码Window.scala、WindowSpec.scala(主要)
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val w1 = Window.partitionBy("cookieid").orderBy("createtime")
w1: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@32a97ef8
scala> val w2 = Window.partitionBy("cookieid").orderBy("pv")
w2: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@dac633
scala> val w3 = w1.rowsBetween(Window.unboundedPreceding,
| Window.currentRow)
w3: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@2b1e2939
scala> val w4 = w1.rowsBetween(-1, 1)
w4: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@217d9e78
~~~ # 聚组函数【用分析函数的数据集】
scala> df.select($"cookieid", $"pv", sum("pv").over(w1).alias("pv1")).show
scala> df.select($"cookieid", $"pv", sum("pv").over(w3).alias("pv1")).show
scala> df.select($"cookieid", $"pv", sum("pv").over(w4).as("pv1")).show
~~~ # 排名
scala> df.select($"cookieid", $"pv", rank().over(w2).alias("rank")).show
scala> df.select($"cookieid", $"pv", dense_rank().over(w2).alias("denserank")).show
scala> df.select($"cookieid", $"pv", row_number().over(w2).alias("rownumber")).show
~~~ # lag、lead
scala> df.select($"cookieid", $"pv", lag("pv", 2).over(w2).alias("rownumber")).show
scala> df.select($"cookieid", $"pv", lag("pv", -2).over(w2).alias("rownumber")).show
### --- 12、内建函数
~~~ http://spark.apache.org/docs/latest/api/sql/index.html
二、编程代码实现
### --- 编程代码实现
package cn.yanqi.sparksql
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
object TransformationDemo {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("Demo1")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
sc.setLogLevel("warn")
import spark.implicits._
val df1: DataFrame = spark.read
.option("header", "true")
.option("inferschema", "true")
.csv("data/emp.dat")
// df1.printSchema()
// df1.map(row=>row.getAs[Int](0)).show
// // randomSplit(与RDD类似,将DF、DS按给定参数分成多份)
// val Array(dfx, dfy, dfz) = df1.randomSplit(Array(0.5, 0.6, 0.7))
// dfx.count
// dfy.count
// dfz.count
//
// // 取10行数据生成新的DataSet
// val df2 = df1.limit(10)
//
// // distinct,去重
// val df3 = df1.union(df1)
// df3.distinct.count
//
// // dropDuplicates,按列值去重
// df2.dropDuplicates.show
// df2.dropDuplicates("mgr", "deptno").show
// df2.dropDuplicates("mgr").show
// df2.dropDuplicates("deptno").show
//
// // 返回全部列的统计(count、mean、stddev、min、max)
// df1.describe().show
//
// // 返回指定列的统计
// df1.describe("sal").show
// df1.describe("sal", "comm").show
//
// df1.createOrReplaceTempView("t1")
// spark.sql("select * from t1").show
// spark.catalog.cacheTable("t1")
// spark.catalog.uncacheTable("t1")
import org.apache.spark.sql.functions._
df1.groupBy("Job").agg(min("sal").as("minsal"), max("sal").as("maxsal")).where($"minsal" > 2000).show
val lst = List(StudentAge(1,"Alice", 18), StudentAge(2,"Andy", 19), StudentAge(3,"Bob", 17), StudentAge(4,"Justin", 21), StudentAge(5,"Cindy", 20))
val ds1 = spark.createDataset(lst)
ds1.show()
// 定义第二个数据集
val rdd = sc.makeRDD(List(StudentHeight("Alice", 160), StudentHeight("Andy", 159), StudentHeight("Bob", 170), StudentHeight("Cindy", 165), StudentHeight("Rose", 160)))
val ds2 = rdd.toDS
spark.close()
}
}
case class StudentAge(sno: Int, sname: String, age: Int)
case class StudentHeight(sname: String, height: Int)
### --- 编译打印
~~~ # 准备数据文件:data/emp.dat
~~~ # 编译打印
D:/JAVA/jdk1.8.0_231/bin/java.exe "-javaagent:D:/IntelliJIDEA/IntelliJ IDEA 2019.3.3/lib/idea_rt.jar=56671:D:/IntelliJIDEA/IntelliJ IDEA 2019.3.3/bin" -Dfile.encoding=UTF-8 -classpath D:/JAVA/jdk1.8.0_231/jre/lib/charsets.jar;D:/JAVA/jdk1.8.0_231/jre/lib/deploy.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/access-bridge-64.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/cldrdata.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/dnsns.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/jaccess.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/jfxrt.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/localedata.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/nashorn.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/sunec.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/sunjce_provider.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/sunmscapi.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/sunpkcs11.jar;D:/JAVA/jdk1.8.0_231/jre/lib/ext/zipfs.jar;D:/JAVA/jdk1.8.0_231/jre/lib/javaws.jar;D:/JAVA/jdk1.8.0_231/jre/lib/jce.jar;D:/JAVA/jdk1.8.0_231/jre/lib/jfr.jar;D:/JAVA/jdk1.8.0_231/jre/lib/jfxswt.jar;D:/JAVA/jdk1.8.0_231/jre/lib/jsse.jar;D:/JAVA/jdk1.8.0_231/jre/lib/management-agent.jar;D:/JAVA/jdk1.8.0_231/jre/lib/plugin.jar;D:/JAVA/jdk1.8.0_231/jre/lib/resources.jar;D:/JAVA/jdk1.8.0_231/jre/lib/rt.jar;E:/NO.Z.80000.Hadoop.spark/SparkBigData/target/classes;C:/Users/Administrator/.m2/repository/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-core_2.12/2.4.5/spark-core_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/com/thoughtworks/paranamer/paranamer/2.8/paranamer-2.8.jar;C:/Users/Administrator/.m2/repository/org/apache/avro/avro/1.8.2/avro-1.8.2.jar;C:/Users/Administrator/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar;C:/Users/Administrator/.m2/repository/org/apache/commons/commons-compress/1.8.1/commons-compress-1.8.1.jar;C:/Users/Administrator/.m2/repository/org/tukaani/xz/1.5/xz-1.5.jar;C:/Users/Administrator/.m2/repository/org/apache/avro/avro-mapred/1.8.2/avro-mapred-1.8.2-hadoop2.jar;C:/Users/Administrator/.m2/repository/org/apache/avro/avro-ipc/1.8.2/avro-ipc-1.8.2.jar;C:/Users/Administrator/.m2/repository/com/twitter/chill_2.12/0.9.3/chill_2.12-0.9.3.jar;C:/Users/Administrator/.m2/repository/com/esotericsoftware/kryo-shaded/4.0.2/kryo-shaded-4.0.2.jar;C:/Users/Administrator/.m2/repository/com/esotericsoftware/minlog/1.3.0/minlog-1.3.0.jar;C:/Users/Administrator/.m2/repository/org/objenesis/objenesis/2.5.1/objenesis-2.5.1.jar;C:/Users/Administrator/.m2/repository/com/twitter/chill-java/0.9.3/chill-java-0.9.3.jar;C:/Users/Administrator/.m2/repository/org/apache/xbean/xbean-asm6-shaded/4.8/xbean-asm6-shaded-4.8.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-client/2.6.5/hadoop-client-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-common/2.6.5/hadoop-common-2.6.5.jar;C:/Users/Administrator/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar;C:/Users/Administrator/.m2/repository/commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.jar;C:/Users/Administrator/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar;C:/Users/Administrator/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar;C:/Users/Administrator/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar;C:/Users/Administrator/.m2/repository/com/google/code/gson/gson/2.2.4/gson-2.2.4.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-auth/2.6.5/hadoop-auth-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/directory/server/apacheds-kerberos-codec/2.0.0-M15/apacheds-kerberos-codec-2.0.0-M15.jar;C:/Users/Administrator/.m2/repository/org/apache/directory/server/apacheds-i18n/2.0.0-M15/apacheds-i18n-2.0.0-M15.jar;C:/Users/Administrator/.m2/repository/org/apache/directory/api/api-asn1-api/1.0.0-M20/api-asn1-api-1.0.0-M20.jar;C:/Users/Administrator/.m2/repository/org/apache/directory/api/api-util/1.0.0-M20/api-util-1.0.0-M20.jar;C:/Users/Administrator/.m2/repository/org/apache/curator/curator-client/2.6.0/curator-client-2.6.0.jar;C:/Users/Administrator/.m2/repository/org/htrace/htrace-core/3.0.4/htrace-core-3.0.4.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.6.5/hadoop-hdfs-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar;C:/Users/Administrator/.m2/repository/xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar;C:/Users/Administrator/.m2/repository/xml-apis/xml-apis/1.3.04/xml-apis-1.3.04.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.6.5/hadoop-mapreduce-client-app-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.6.5/hadoop-mapreduce-client-common-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.6.5/hadoop-yarn-client-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-yarn-server-common/2.6.5/hadoop-yarn-server-common-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.6.5/hadoop-mapreduce-client-shuffle-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.6.5/hadoop-yarn-api-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.6.5/hadoop-mapreduce-client-core-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.6.5/hadoop-yarn-common-2.6.5.jar;C:/Users/Administrator/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar;C:/Users/Administrator/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar;C:/Users/Administrator/.m2/repository/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar;C:/Users/Administrator/.m2/repository/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.6.5/hadoop-mapreduce-client-jobclient-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/hadoop/hadoop-annotations/2.6.5/hadoop-annotations-2.6.5.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-launcher_2.12/2.4.5/spark-launcher_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-kvstore_2.12/2.4.5/spark-kvstore_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.jar;C:/Users/Administrator/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.6.7/jackson-core-2.6.7.jar;C:/Users/Administrator/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.6.7/jackson-annotations-2.6.7.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-network-common_2.12/2.4.5/spark-network-common_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-network-shuffle_2.12/2.4.5/spark-network-shuffle_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/javax/activation/activation/1.1.1/activation-1.1.1.jar;C:/Users/Administrator/.m2/repository/org/apache/curator/curator-recipes/2.6.0/curator-recipes-2.6.0.jar;C:/Users/Administrator/.m2/repository/org/apache/curator/curator-framework/2.6.0/curator-framework-2.6.0.jar;C:/Users/Administrator/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar;C:/Users/Administrator/.m2/repository/org/apache/zookeeper/zookeeper/3.4.6/zookeeper-3.4.6.jar;C:/Users/Administrator/.m2/repository/javax/servlet/javax.servlet-api/3.1.0/javax.servlet-api-3.1.0.jar;C:/Users/Administrator/.m2/repository/org/apache/commons/commons-lang3/3.5/commons-lang3-3.5.jar;C:/Users/Administrator/.m2/repository/org/apache/commons/commons-math3/3.4.1/commons-math3-3.4.1.jar;C:/Users/Administrator/.m2/repository/com/google/code/findbugs/jsr305/1.3.9/jsr305-1.3.9.jar;C:/Users/Administrator/.m2/repository/org/slf4j/slf4j-api/1.7.16/slf4j-api-1.7.16.jar;C:/Users/Administrator/.m2/repository/org/slf4j/jul-to-slf4j/1.7.16/jul-to-slf4j-1.7.16.jar;C:/Users/Administrator/.m2/repository/org/slf4j/jcl-over-slf4j/1.7.16/jcl-over-slf4j-1.7.16.jar;C:/Users/Administrator/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar;C:/Users/Administrator/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar;C:/Users/Administrator/.m2/repository/com/ning/compress-lzf/1.0.3/compress-lzf-1.0.3.jar;C:/Users/Administrator/.m2/repository/org/xerial/snappy/snappy-java/1.1.7.3/snappy-java-1.1.7.3.jar;C:/Users/Administrator/.m2/repository/org/lz4/lz4-java/1.4.0/lz4-java-1.4.0.jar;C:/Users/Administrator/.m2/repository/com/github/luben/zstd-jni/1.3.2-2/zstd-jni-1.3.2-2.jar;C:/Users/Administrator/.m2/repository/org/roaringbitmap/RoaringBitmap/0.7.45/RoaringBitmap-0.7.45.jar;C:/Users/Administrator/.m2/repository/org/roaringbitmap/shims/0.7.45/shims-0.7.45.jar;C:/Users/Administrator/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar;C:/Users/Administrator/.m2/repository/org/json4s/json4s-jackson_2.12/3.5.3/json4s-jackson_2.12-3.5.3.jar;C:/Users/Administrator/.m2/repository/org/json4s/json4s-core_2.12/3.5.3/json4s-core_2.12-3.5.3.jar;C:/Users/Administrator/.m2/repository/org/json4s/json4s-ast_2.12/3.5.3/json4s-ast_2.12-3.5.3.jar;C:/Users/Administrator/.m2/repository/org/json4s/json4s-scalap_2.12/3.5.3/json4s-scalap_2.12-3.5.3.jar;C:/Users/Administrator/.m2/repository/org/scala-lang/modules/scala-xml_2.12/1.0.6/scala-xml_2.12-1.0.6.jar;C:/Users/Administrator/.m2/repository/org/glassfish/jersey/core/jersey-client/2.22.2/jersey-client-2.22.2.jar;C:/Users/Administrator/.m2/repository/javax/ws/rs/javax.ws.rs-api/2.0.1/javax.ws.rs-api-2.0.1.jar;C:/Users/Administrator/.m2/repository/org/glassfish/hk2/hk2-api/2.4.0-b34/hk2-api-2.4.0-b34.jar;C:/Users/Administrator/.m2/repository/org/glassfish/hk2/hk2-utils/2.4.0-b34/hk2-utils-2.4.0-b34.jar;C:/Users/Administrator/.m2/repository/org/glassfish/hk2/external/aopalliance-repackaged/2.4.0-b34/aopalliance-repackaged-2.4.0-b34.jar;C:/Users/Administrator/.m2/repository/org/glassfish/hk2/external/javax.inject/2.4.0-b34/javax.inject-2.4.0-b34.jar;C:/Users/Administrator/.m2/repository/org/glassfish/hk2/hk2-locator/2.4.0-b34/hk2-locator-2.4.0-b34.jar;C:/Users/Administrator/.m2/repository/org/javassist/javassist/3.18.1-GA/javassist-3.18.1-GA.jar;C:/Users/Administrator/.m2/repository/org/glassfish/jersey/core/jersey-common/2.22.2/jersey-common-2.22.2.jar;C:/Users/Administrator/.m2/repository/javax/annotation/javax.annotation-api/1.2/javax.annotation-api-1.2.jar;C:/Users/Administrator/.m2/repository/org/glassfish/jersey/bundles/repackaged/jersey-guava/2.22.2/jersey-guava-2.22.2.jar;C:/Users/Administrator/.m2/repository/org/glassfish/hk2/osgi-resource-locator/1.0.1/osgi-resource-locator-1.0.1.jar;C:/Users/Administrator/.m2/repository/org/glassfish/jersey/core/jersey-server/2.22.2/jersey-server-2.22.2.jar;C:/Users/Administrator/.m2/repository/org/glassfish/jersey/media/jersey-media-jaxb/2.22.2/jersey-media-jaxb-2.22.2.jar;C:/Users/Administrator/.m2/repository/javax/validation/validation-api/1.1.0.Final/validation-api-1.1.0.Final.jar;C:/Users/Administrator/.m2/repository/org/glassfish/jersey/containers/jersey-container-servlet/2.22.2/jersey-container-servlet-2.22.2.jar;C:/Users/Administrator/.m2/repository/org/glassfish/jersey/containers/jersey-container-servlet-core/2.22.2/jersey-container-servlet-core-2.22.2.jar;C:/Users/Administrator/.m2/repository/io/netty/netty-all/4.1.42.Final/netty-all-4.1.42.Final.jar;C:/Users/Administrator/.m2/repository/io/netty/netty/3.9.9.Final/netty-3.9.9.Final.jar;C:/Users/Administrator/.m2/repository/com/clearspring/analytics/stream/2.7.0/stream-2.7.0.jar;C:/Users/Administrator/.m2/repository/io/dropwizard/metrics/metrics-core/3.1.5/metrics-core-3.1.5.jar;C:/Users/Administrator/.m2/repository/io/dropwizard/metrics/metrics-jvm/3.1.5/metrics-jvm-3.1.5.jar;C:/Users/Administrator/.m2/repository/io/dropwizard/metrics/metrics-json/3.1.5/metrics-json-3.1.5.jar;C:/Users/Administrator/.m2/repository/io/dropwizard/metrics/metrics-graphite/3.1.5/metrics-graphite-3.1.5.jar;C:/Users/Administrator/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.6.7.3/jackson-databind-2.6.7.3.jar;C:/Users/Administrator/.m2/repository/com/fasterxml/jackson/module/jackson-module-scala_2.12/2.6.7.1/jackson-module-scala_2.12-2.6.7.1.jar;C:/Users/Administrator/.m2/repository/org/scala-lang/scala-reflect/2.12.1/scala-reflect-2.12.1.jar;C:/Users/Administrator/.m2/repository/com/fasterxml/jackson/module/jackson-module-paranamer/2.7.9/jackson-module-paranamer-2.7.9.jar;C:/Users/Administrator/.m2/repository/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar;C:/Users/Administrator/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar;C:/Users/Administrator/.m2/repository/net/razorvine/pyrolite/4.13/pyrolite-4.13.jar;C:/Users/Administrator/.m2/repository/net/sf/py4j/py4j/0.10.7/py4j-0.10.7.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-tags_2.12/2.4.5/spark-tags_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/org/apache/commons/commons-crypto/1.0.0/commons-crypto-1.0.0.jar;C:/Users/Administrator/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-sql_2.12/2.4.5/spark-sql_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/com/univocity/univocity-parsers/2.7.3/univocity-parsers-2.7.3.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-sketch_2.12/2.4.5/spark-sketch_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-catalyst_2.12/2.4.5/spark-catalyst_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/org/scala-lang/modules/scala-parser-combinators_2.12/1.1.0/scala-parser-combinators_2.12-1.1.0.jar;C:/Users/Administrator/.m2/repository/org/codehaus/janino/janino/3.0.9/janino-3.0.9.jar;C:/Users/Administrator/.m2/repository/org/codehaus/janino/commons-compiler/3.0.9/commons-compiler-3.0.9.jar;C:/Users/Administrator/.m2/repository/org/antlr/antlr4-runtime/4.7/antlr4-runtime-4.7.jar;C:/Users/Administrator/.m2/repository/org/apache/orc/orc-core/1.5.5/orc-core-1.5.5-nohive.jar;C:/Users/Administrator/.m2/repository/org/apache/orc/orc-shims/1.5.5/orc-shims-1.5.5.jar;C:/Users/Administrator/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar;C:/Users/Administrator/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar;C:/Users/Administrator/.m2/repository/io/airlift/aircompressor/0.10/aircompressor-0.10.jar;C:/Users/Administrator/.m2/repository/org/apache/orc/orc-mapreduce/1.5.5/orc-mapreduce-1.5.5-nohive.jar;C:/Users/Administrator/.m2/repository/org/apache/parquet/parquet-column/1.10.1/parquet-column-1.10.1.jar;C:/Users/Administrator/.m2/repository/org/apache/parquet/parquet-common/1.10.1/parquet-common-1.10.1.jar;C:/Users/Administrator/.m2/repository/org/apache/parquet/parquet-encoding/1.10.1/parquet-encoding-1.10.1.jar;C:/Users/Administrator/.m2/repository/org/apache/parquet/parquet-hadoop/1.10.1/parquet-hadoop-1.10.1.jar;C:/Users/Administrator/.m2/repository/org/apache/parquet/parquet-format/2.4.0/parquet-format-2.4.0.jar;C:/Users/Administrator/.m2/repository/org/apache/parquet/parquet-jackson/1.10.1/parquet-jackson-1.10.1.jar;C:/Users/Administrator/.m2/repository/org/apache/arrow/arrow-vector/0.10.0/arrow-vector-0.10.0.jar;C:/Users/Administrator/.m2/repository/org/apache/arrow/arrow-format/0.10.0/arrow-format-0.10.0.jar;C:/Users/Administrator/.m2/repository/org/apache/arrow/arrow-memory/0.10.0/arrow-memory-0.10.0.jar;C:/Users/Administrator/.m2/repository/com/carrotsearch/hppc/0.7.2/hppc-0.7.2.jar;C:/Users/Administrator/.m2/repository/com/vlkan/flatbuffers/1.2.0-3f79e055/flatbuffers-1.2.0-3f79e055.jar;C:/Users/Administrator/.m2/repository/joda-time/joda-time/2.9.7/joda-time-2.9.7.jar;C:/Users/Administrator/.m2/repository/mysql/mysql-connector-java/5.1.44/mysql-connector-java-5.1.44.jar;C:/Users/Administrator/.m2/repository/org/apache/spark/spark-hive_2.12/2.4.5/spark-hive_2.12-2.4.5.jar;C:/Users/Administrator/.m2/repository/com/twitter/parquet-hadoop-bundle/1.6.0/parquet-hadoop-bundle-1.6.0.jar;C:/Users/Administrator/.m2/repository/org/spark-project/hive/hive-exec/1.2.1.spark2/hive-exec-1.2.1.spark2.jar;C:/Users/Administrator/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar;C:/Users/Administrator/.m2/repository/javolution/javolution/5.5.1/javolution-5.5.1.jar;C:/Users/Administrator/.m2/repository/log4j/apache-log4j-extras/1.2.17/apache-log4j-extras-1.2.17.jar;C:/Users/Administrator/.m2/repository/org/antlr/antlr-runtime/3.4/antlr-runtime-3.4.jar;C:/Users/Administrator/.m2/repository/org/antlr/stringtemplate/3.2.1/stringtemplate-3.2.1.jar;C:/Users/Administrator/.m2/repository/antlr/antlr/2.7.7/antlr-2.7.7.jar;C:/Users/Administrator/.m2/repository/org/antlr/ST4/4.0.4/ST4-4.0.4.jar;C:/Users/Administrator/.m2/repository/com/googlecode/javaewah/JavaEWAH/0.3.2/JavaEWAH-0.3.2.jar;C:/Users/Administrator/.m2/repository/org/iq80/snappy/snappy/0.2/snappy-0.2.jar;C:/Users/Administrator/.m2/repository/stax/stax-api/1.0.1/stax-api-1.0.1.jar;C:/Users/Administrator/.m2/repository/net/sf/opencsv/opencsv/2.3/opencsv-2.3.jar;C:/Users/Administrator/.m2/repository/org/spark-project/hive/hive-metastore/1.2.1.spark2/hive-metastore-1.2.1.spark2.jar;C:/Users/Administrator/.m2/repository/com/jolbox/bonecp/0.8.0.RELEASE/bonecp-0.8.0.RELEASE.jar;C:/Users/Administrator/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar;C:/Users/Administrator/.m2/repository/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar;C:/Users/Administrator/.m2/repository/org/datanucleus/datanucleus-api-jdo/3.2.6/datanucleus-api-jdo-3.2.6.jar;C:/Users/Administrator/.m2/repository/org/datanucleus/datanucleus-rdbms/3.2.9/datanucleus-rdbms-3.2.9.jar;C:/Users/Administrator/.m2/repository/commons-pool/commons-pool/1.5.4/commons-pool-1.5.4.jar;C:/Users/Administrator/.m2/repository/commons-dbcp/commons-dbcp/1.4/commons-dbcp-1.4.jar;C:/Users/Administrator/.m2/repository/javax/jdo/jdo-api/3.0.1/jdo-api-3.0.1.jar;C:/Users/Administrator/.m2/repository/javax/transaction/jta/1.1/jta-1.1.jar;C:/Users/Administrator/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar;C:/Users/Administrator/.m2/repository/org/apache/calcite/calcite-avatica/1.2.0-incubating/calcite-avatica-1.2.0-incubating.jar;C:/Users/Administrator/.m2/repository/org/apache/calcite/calcite-core/1.2.0-incubating/calcite-core-1.2.0-incubating.jar;C:/Users/Administrator/.m2/repository/org/apache/calcite/calcite-linq4j/1.2.0-incubating/calcite-linq4j-1.2.0-incubating.jar;C:/Users/Administrator/.m2/repository/net/hydromatic/eigenbase-properties/1.1.5/eigenbase-properties-1.1.5.jar;C:/Users/Administrator/.m2/repository/org/apache/httpcomponents/httpclient/4.5.6/httpclient-4.5.6.jar;C:/Users/Administrator/.m2/repository/org/apache/httpcomponents/httpcore/4.4.10/httpcore-4.4.10.jar;C:/Users/Administrator/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar;C:/Users/Administrator/.m2/repository/commons-codec/commons-codec/1.10/commons-codec-1.10.jar;C:/Users/Administrator/.m2/repository/org/jodd/jodd-core/3.5.2/jodd-core-3.5.2.jar;C:/Users/Administrator/.m2/repository/org/datanucleus/datanucleus-core/3.2.10/datanucleus-core-3.2.10.jar;C:/Users/Administrator/.m2/repository/org/apache/thrift/libthrift/0.9.3/libthrift-0.9.3.jar;C:/Users/Administrator/.m2/repository/org/apache/thrift/libfb303/0.9.3/libfb303-0.9.3.jar;C:/Users/Administrator/.m2/repository/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar cn.yanqi.sparksql.TransformationDemo
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+---------+------+------+
| Job|minsal|maxsal|
+---------+------+------+
| ANALYST| 3000| 3000|
| MANAGER| 2450| 2975|
|PRESIDENT| 5000| 5000|
+---------+------+------+
+---+------+---+
|sno| sname|age|
+---+------+---+
| 1| Alice| 18|
| 2| Andy| 19|
| 3| Bob| 17|
| 4|Justin| 21|
| 5| Cindy| 20|
+---+------+---+
Process finished with exit code 0
===============================END===============================
Walter Savage Landor:strove with none,for none was worth my strife.Nature I loved and, next to Nature, Art:I warm’d both hands before the fire of life.It sinks, and I am ready to depart ——W.S.Landor
原创文章,作者:Carrie001128,如若转载,请注明出处:https://blog.ytso.com/tech/database/244794.html