SparkSQL002

Working with Hive

import org.apache.spark.sql.SparkSession

object SparkSessionApp {
  def main(args: Array[String]): Unit = {

    // Access Hadoop/Hive as this user
    System.setProperty("HADOOP_USER_NAME", "double_happy")
    val spark = SparkSession.builder()
      .master("local")
      .appName("SparkSessionApp")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()

    // The empty strings are placeholders: fill in a real query and table name before running
    spark.sql("").write.saveAsTable("")

    spark.sql("").write.insertInto("")

    spark.stop()
  }
}


/**
   * Saves the content of the `DataFrame` as the specified table.
   *
   * In the case the table already exists, behavior of this function depends on the
   * save mode, specified by the `mode` function (default to throwing an exception).
   * When `mode` is `Overwrite`, the schema of the `DataFrame` does not need to be
   * the same as that of the existing table.
   *
   * When `mode` is `Append`, if there is an existing table, we will use the format and options of
   * the existing table. The column order in the schema of the `DataFrame` doesn't need to be same
   * as that of the existing table. Unlike `insertInto`, `saveAsTable` will use the column names to
   * find the correct column positions. For example:
   *
   * {{{
   *    scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
   *    scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
   *    scala> sql("select * from t1").show
   *    +---+---+
   *    |  i|  j|
   *    +---+---+
   *    |  1|  2|
   *    |  4|  3|
   *    +---+---+
   * }}}
   *
   * In this method, save mode is used to determine the behavior if the data source table exists in
   * Spark catalog. We will always overwrite the underlying data of data source (e.g. a table in
   * JDBC data source) if the table doesn't exist in Spark catalog, and will always append to the
   * underlying data of data source if the table already exists.
   *
   * When the DataFrame is created from a non-partitioned `HadoopFsRelation` with a single input
   * path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
   * and Parquet), the table is persisted in a Hive compatible format, which means other systems
   * like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
   * specific format.
   *
   * @since 1.4.0
   */
  def saveAsTable(tableName: String): Unit = {
    saveAsTable(df.sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName))
  }

/**
   * Inserts the content of the `DataFrame` to the specified table. It requires that
   * the schema of the `DataFrame` is the same as the schema of the table.
   *
   * @note Unlike `saveAsTable`, `insertInto` ignores the column names and just uses position-based
   * resolution. For example:
   *
   * {{{
   *    scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
   *    scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
   *    scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
   *    scala> sql("select * from t1").show
   *    +---+---+
   *    |  i|  j|
   *    +---+---+
   *    |  5|  6|
   *    |  3|  4|
   *    |  1|  2|
   *    +---+---+
   * }}}
   *
   * Because it inserts data to an existing table, format or options will be ignored.
   *
   * @since 1.4.0
   */
  def insertInto(tableName: String): Unit = {
    insertInto(df.sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName))
  }


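You can also create a table directly with spark.sql("create table xxx"),
but I don't recommend it, because tables are usually created ahead of time.
In real production environments, tables are created through a web page, and creating them requires permissions.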

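Spark code for working with Hive

Most people who develop against Hive with Spark write spark.sql("sql").
That works, but I still prefer the API style. To each their own.

Global sort, written the SQL way: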
import org.apache.spark.sql.SparkSession

object LogApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("LogApp")
      .getOrCreate()

    import spark.implicits._

    val df = spark.read.textFile("file:///C:/IdeaProjects/spark/data/access.log")
      .map(x => {
        val splits = x.split("\t")
        val platform = splits(1)
        val traffic = splits(6).toLong
        val province = splits(8)
        val city = splits(9)
        val isp = splits(10)
        (platform, traffic, province, city, isp)
      }).toDF("platform", "traffic", "province", "city", "isp")   // column names passed to toDF

    // To process the data with SQL, register the DataFrame as a temporary view
    df.createOrReplaceTempView("log")

    // Requirement 1: total traffic per platform, province and city; order by performs a global sort
    val sql = "select platform, province, city, sum(traffic) as traffics from log group by platform, province, city order by traffics desc"
    spark.sql(sql).show()

    spark.stop()
  }
}

Result:
+--------+--------+----+--------+
|platform|province|city|traffics|
+--------+--------+----+--------+
|     mac|    香港|    | 2879982|
| windows|    香港|    | 2871537|
| Andriod|    香港|    | 2722363|
|   linux|    香港|    | 2696578|
| Symbain|    香港|    | 2444806|
| Andriod|    山西|忻州|  968255|
|   linux|    台湾|    |  898404|
| windows|    山西|忻州|  894966|
| Andriod|    湖北|武汉|  865758|
|     mac|    湖北|武汉|  848995|
|     mac|    山西|忻州|  837873|
| Symbain|    山西|忻州|  791524|
|   linux|    湖北|武汉|  781347|
| Andriod|    台湾|    |  776642|
|     mac|    台湾|    |  775977|
| Symbain|    台湾|    |  744858|
| windows|    台湾|    |  744558|
| windows|    湖北|武汉|  728412|
| Symbain|    湖北|武汉|  728034|
|   linux|    山西|忻州|  689405|
+--------+--------+----+--------+

Window Functions in Spark SQL

Global sort, the API way (the one I prefer)

 /**
   * Groups the Dataset using the specified columns, so that we can run aggregation on them.
   * See [[RelationalGroupedDataset]] for all the available aggregate functions.
   *
   * This is a variant of groupBy that can only group by existing columns using column names
   * (i.e. cannot construct expressions).
   *
   * {{{
   *   // Compute the average for all numeric columns grouped by department.
   *   ds.groupBy("department").avg()
   *
   *   // Compute the max age and average salary, grouped by department and gender.
   *   ds.groupBy($"department", $"gender").agg(Map(
   *     "salary" -> "avg",
   *     "age" -> "max"
   *   ))
   * }}}
   * @group untypedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
    val colNames: Seq[String] = col1 +: cols
    RelationalGroupedDataset(
      toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
  }


 /**
   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
   * default retains the grouping columns in its output. To not retain grouping columns, set
   * `spark.sql.retainGroupColumns` to false.
   *
   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
   *
   * {{{
   *   // Selects the age of the oldest employee and the aggregate expense for each department
   *
   *   // Scala:
   *   import org.apache.spark.sql.functions._
   *   df.groupBy("department").agg(max("age"), sum("expense"))
   *
   *   // Java:
   *   import static org.apache.spark.sql.functions.*;
   *   df.groupBy("department").agg(max("age"), sum("expense"));
   * }}}
   *
   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
   * {{{
   *   // Scala, 1.3.x:
   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
   *
   *   // Java, 1.3.x:
   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
   * }}}
   *
   * @since 1.3.0
   */
  @scala.annotation.varargs
  def agg(expr: Column, exprs: Column*): DataFrame = {
    toDF((expr +: exprs).map {
      case typed: TypedColumn[_, _] =>
        typed.withInputType(df.exprEnc, df.logicalPlan.output).expr
      case c => c.expr
    })
  }

Read the Scaladoc comments carefully and the code below will make sense.

groupBy: Groups the Dataset using the specified columns, so that we can run aggregation on them.
agg: Compute aggregates by specifying a series of aggregate columns. Note that this function by
     default retains the grouping columns in its output. To not retain grouping columns, set
     `spark.sql.retainGroupColumns` to false.


 /**
   * Returns a new Dataset sorted by the given expressions. For example:
   * {{{
   *   ds.sort($"col1", $"col2".desc)
   * }}}
   *
   * @group typedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def sort(sortExprs: Column*): Dataset[T] = {
    sortInternal(global = true, sortExprs)
  }

import org.apache.spark.sql.SparkSession

object LogApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("LogApp")
      .getOrCreate()

    import spark.implicits._

    val df = spark.read.textFile("file:///C:/IdeaProjects/spark/data/access.log")
      .map(x => {
        val splits = x.split("\t")
        val platform = splits(1)
        val traffic = splits(6).toLong
        val province = splits(8)
        val city = splits(9)
        val isp = splits(10)
        (platform, traffic, province, city, isp)
      }).toDF("platform", "traffic", "province", "city", "isp")   // column names passed to toDF

    // Requirement 1: total traffic per platform, province and city; sort performs a global sort
    import org.apache.spark.sql.functions._    // Spark's built-in functions
    df.groupBy("platform", "province", "city")
      .agg(sum("traffic").as("traffics"))
      .sort('traffics.desc)
      .show()
    spark.stop()
  }
}

Note:
The functions you know from Hive are available in Spark too, as Spark built-ins:
import org.apache.spark.sql.functions._   


Result:
+--------+--------+----+--------+
|platform|province|city|traffics|
+--------+--------+----+--------+
|     mac|    香港|    | 2879982|
| windows|    香港|    | 2871537|
| Andriod|    香港|    | 2722363|
|   linux|    香港|    | 2696578|
| Symbain|    香港|    | 2444806|
| Andriod|    山西|忻州|  968255|
|   linux|    台湾|    |  898404|
| windows|    山西|忻州|  894966|
| Andriod|    湖北|武汉|  865758|
|     mac|    湖北|武汉|  848995|
|     mac|    山西|忻州|  837873|
| Symbain|    山西|忻州|  791524|
|   linux|    湖北|武汉|  781347|
| Andriod|    台湾|    |  776642|
|     mac|    台湾|    |  775977|
| Symbain|    台湾|    |  744858|
| windows|    台湾|    |  744558|
| windows|    湖北|武汉|  728412|
| Symbain|    湖北|武汉|  728034|
|   linux|    山西|忻州|  689405|
+--------+--------+----+--------+
only showing top 20 rows


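When developing with the API, pay attention to:
	Column versus String: whether the column name you pass in has to be a String or a Column.

   My personal preference:
   		 Column  ==> 'colName
   		 String  ==> "colName"
   There are many ways to write it, so pick the one you like. It is the opposite of finding a girlfriend, where you look for someone who likes you. Got it?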
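
To make the Column versus String distinction concrete, here is a minimal sketch. The object name `ColumnStyleDemo`, its `run` helper and the three sample rows are made up for illustration; only `groupBy`, `agg`, `sort`, `col`, `$"..."` and the symbol syntax come from Spark. Note that the `'colName` symbol literal is deprecated in Scala 2.13, so `$"colName"` or `col("colName")` may be the safer habit on newer builds.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object ColumnStyleDemo {

  // Returns the platform order produced by each way of naming the sort column.
  def run(spark: SparkSession): Map[String, Seq[String]] = {
    import spark.implicits._

    // Hypothetical sample rows, not taken from access.log
    val totals = Seq(("mac", 10L), ("linux", 30L), ("mac", 5L))
      .toDF("platform", "traffic")
      .groupBy("platform")                 // String overload: plain column name
      .agg(sum("traffic").as("traffics"))

    def platforms(sorted: DataFrame): Seq[String] =
      sorted.select("platform").as[String].collect().toSeq

    Map(
      "symbol" -> platforms(totals.sort('traffics.desc)),      // Column built from a Symbol
      "dollar" -> platforms(totals.sort($"traffics".desc)),    // Column built with $"..."
      "col"    -> platforms(totals.sort(col("traffics").desc)), // Column built with col(...)
      "string" -> platforms(totals.sort("traffics"))           // String overload: ascending only
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("ColumnStyleDemo").getOrCreate()
    run(spark).foreach { case (style, order) => println(s"$style -> ${order.mkString(", ")}") }
    spark.stop()
  }
}
```

The three Column forms are interchangeable; the String overload is shorter but cannot express things like `.desc`, which is why the sort in the example above uses a Column.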

Grouped Top N
import org.apache.spark.sql.SparkSession

object LogApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("LogApp")
      .getOrCreate()

    import spark.implicits._

    val df = spark.read.textFile("file:///C:/IdeaProjects/spark/data/access.log")
      .map(x => {
        val splits = x.split("\t")
        val platform = splits(1)
        val traffic = splits(6).toLong
        val province = splits(8)
        val city = splits(9)
        val isp = splits(10)
        (platform, traffic, province, city, isp)
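      }).toDF("platform", "traffic", "province", "city", "isp")   // column names passed to toDF

    // To process the data with SQL, register the DataFrame as a temporary view
    df.createOrReplaceTempView("log")

    // Requirement 2: within each platform, the top N provinces by number of visits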
    val sql =
      """
        |
        |select * from
        |(
        |select t.*, row_number() over(partition by platform order by cnt desc) as r
        |from
        |(select platform,province,count(1) cnt from log group by platform,province) t
        |) a where a.r<=3
        |
      """.stripMargin

    spark.sql(sql).show()
    spark.stop()
  }
}

Result:
+--------+--------+---+---+
|platform|province|cnt|  r|
+--------+--------+---+---+
|   linux|    香港|606|  1|
|   linux|    广东|211|  2|
|   linux|    台湾|173|  3|
| Symbain|    香港|582|  1|
| Symbain|    广东|222|  2|
| Symbain|    福建|153|  3|
| Andriod|    香港|607|  1|
| Andriod|    广东|223|  2|
| Andriod|    北京|150|  3|
|     mac|    香港|646|  1|
|     mac|    广东|213|  2|
|     mac|    台湾|156|  3|
| windows|    香港|657|  1|
| windows|    广东|186|  2|
| windows|    河北|151|  3|
+--------+--------+---+---+

import org.apache.spark.sql.SparkSession

object LogApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("LogApp")
      .getOrCreate()

    import spark.implicits._

    val df = spark.read.textFile("file:///C:/IdeaProjects/spark/data/access.log")
      .map(x => {
        val splits = x.split("\t")
        val platform = splits(1)
        val traffic = splits(6).toLong
        val province = splits(8)
        val city = splits(9)
        val isp = splits(10)
        (platform, traffic, province, city, isp)
      }).toDF("platform", "traffic", "province", "city", "isp")   // column names passed to toDF

    // To process the data with SQL, register the DataFrame as a temporary view
    df.createOrReplaceTempView("log")
    val sql2 =
      """
        |
        |select a.* from
        |(
        |select t.*,
        |row_number() over(partition by platform order by cnt desc) as rn,
        |rank() over(partition by platform order by cnt desc) as r,
        |dense_rank() over(partition by platform order by cnt desc) as dn
        |from
        |(select platform,province,count(1) cnt from log group by platform,province) t
        |) a
        |
      """.stripMargin

    spark.sql(sql2).show()
    spark.stop()
  }
}

Result:
+--------+--------+---+---+---+---+
|platform|province|cnt| rn|  r| dn|
+--------+--------+---+---+---+---+
|   linux|    香港|606|  1|  1|  1|
|   linux|    广东|211|  2|  2|  2|
|   linux|    台湾|173|  3|  3|  3|
|   linux|    福建|147|  4|  4|  4|
|   linux|    北京|134|  5|  5|  5|
|   linux|    河北|128|  6|  6|  6|
|   linux|    湖北|115|  7|  7|  7|
|   linux|    山西|107|  8|  8|  8|
|   linux|    江西|104|  9|  9|  9|
|   linux|    上海|101| 10| 10| 10|
|   linux|    山东| 26| 11| 11| 11|
| Symbain|    香港|582|  1|  1|  1|
| Symbain|    广东|222|  2|  2|  2|
| Symbain|    福建|153|  3|  3|  3|
| Symbain|    河北|151|  4|  4|  4|
| Symbain|    台湾|146|  5|  5|  5|
| Symbain|    北京|133|  6|  6|  6|
| Symbain|    山西|121|  7|  7|  7|
| Symbain|    湖北|120|  8|  8|  8|
| Symbain|    上海|109|  9|  9|  9|
+--------+--------+---+---+---+---+
only showing top 20 rows

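So how do the three differ?
	row_number   1 2 3 4 5 6      consecutive numbering; rows with equal values still get different numbers
	rank         1 2 3 3 5 6 7    equal values share a rank, and the positions used up by the tie are skipped, so there is no 4 here
	dense_rank   1 2 3 3 4 5 6 7  equal values share a rank, and nothing is skipped, so 4 follows immediately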
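
The query output above contains no ties, so the three columns look identical. Below is a small sketch with a deliberate tie, written with the Window API rather than SQL. The object name `RankDemo`, the `numberings` helper and the five sample rows are assumptions for illustration; `Window.partitionBy`, `row_number`, `rank` and `dense_rank` are the standard Spark functions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, rank, row_number}

object RankDemo {

  // Returns (row_number, rank, dense_rank) for every row, ordered by row_number.
  def numberings(spark: SparkSession): Seq[(Int, Int, Int)] = {
    import spark.implicits._

    // Hypothetical counts; provinces C and D tie at 150 on purpose
    val df = Seq(
      ("linux", "A", 600L),
      ("linux", "B", 200L),
      ("linux", "C", 150L),
      ("linux", "D", 150L),
      ("linux", "E", 100L)
    ).toDF("platform", "province", "cnt")

    // Same window as the SQL: partition by platform order by cnt desc
    val w = Window.partitionBy("platform").orderBy(col("cnt").desc)

    df.withColumn("rn", row_number().over(w))
      .withColumn("r", rank().over(w))
      .withColumn("dn", dense_rank().over(w))
      .orderBy("rn")
      .select("rn", "r", "dn")
      .as[(Int, Int, Int)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("RankDemo").getOrCreate()
    numberings(spark).foreach(println)
    spark.stop()
  }
}
```

With this data the tied rows get row numbers 3 and 4, both get rank 3 and dense rank 3, and the last row gets rank 5 but dense rank 4. Which of the two tied provinces receives row number 3 is not defined unless you add a second sort column.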

Catalog
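This is very, very important. It has only been available since Spark 2.0; I used it when I built a CSV-to-Hive loader.

Hive's metadata is stored in MySQL.
To use that metadata in your code, you would have to fetch it over JDBC.

Since version 2.0, however, Spark provides Catalog, which can fetch Hive's metadata.

Start spark-shell with --jars pointing at the MySQL driver jar: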

scala> val catalog = spark.catalog
catalog: org.apache.spark.sql.catalog.Catalog = org.apache.spark.sql.internal.CatalogImpl@672c4e24

scala> catalog.listDatabases().show
+--------+--------------------+--------------------+
|    name|         description|         locationUri|
+--------+--------------------+--------------------+
| default|Default Hive data...|hdfs://hadoop101:...|
|homework|                    |hdfs://hadoop101:...|
+--------+--------------------+--------------------+


scala> catalog.listDatabases().show(false)
+--------+---------------------+-----------------------------------------------------+
|name    |description          |locationUri                                          |
+--------+---------------------+-----------------------------------------------------+
|default |Default Hive database|hdfs://hadoop101:8020/user/hive/warehouse            |
|homework|                     |hdfs://hadoop101:8020/user/hive/warehouse/homework.db|
+--------+---------------------+-----------------------------------------------------+


scala> catalog.listTables("homework").show(false)
+---------------------------------+--------+-----------+---------+-----------+
|name                             |database|description|tableType|isTemporary|
+---------------------------------+--------+-----------+---------+-----------+
|access_wide                      |homework|null       |EXTERNAL |false      |
|dwd_platform_stat_info           |homework|null       |MANAGED  |false      |
|jf_tmp                           |homework|null       |MANAGED  |false      |
|ods_domain_traffic_info          |homework|null       |EXTERNAL |false      |
|ods_log_info                     |homework|null       |EXTERNAL |false      |
|ods_uid_pid_compression_info     |homework|null       |MANAGED  |false      |
|ods_uid_pid_info                 |homework|null       |EXTERNAL |false      |
|ods_uid_pid_info_compression_test|homework|null       |EXTERNAL |false      |
+---------------------------------+--------+-----------+---------+-----------+


scala> catalog.listFunctions().show(5,false)
+----+--------+-----------+----------------------------------------------------+-----------+
|name|database|description|className                                           |isTemporary|
+----+--------+-----------+----------------------------------------------------+-----------+
|!   |null    |null       |org.apache.spark.sql.catalyst.expressions.Not       |true       |
|%   |null    |null       |org.apache.spark.sql.catalyst.expressions.Remainder |true       |
|&   |null    |null       |org.apache.spark.sql.catalyst.expressions.BitwiseAnd|true       |
|*   |null    |null       |org.apache.spark.sql.catalyst.expressions.Multiply  |true       |
|+   |null    |null       |org.apache.spark.sql.catalyst.expressions.Add       |true       |
+----+--------+-----------+----------------------------------------------------+-----------+
only showing top 5 rows


scala> catalog.listColumns("homework.dwd_platform_stat_info").show(false)
+--------+-----------+--------+--------+-----------+--------+
|name    |description|dataType|nullable|isPartition|isBucket|
+--------+-----------+--------+--------+-----------+--------+
|platform|null       |string  |true    |false      |false   |
|cnt     |null       |int     |true    |false      |false   |
|d       |null       |string  |true    |false      |false   |
|day     |null       |string  |true    |true       |false   |
+--------+-----------+--------+--------+-----------+--------+


scala> 

And that is not all: catalog can get at almost all of the metadata.

The return values of these calls are all Datasets, though, so let's talk about Dataset next.
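
Because the catalog calls return Datasets, their results can be filtered and projected like any other Dataset. The sketch below assumes a plain local session without Hive, and the view name `demo_view`, the object `CatalogDemo` and its helper are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object CatalogDemo {

  // Registers a throwaway temp view, then asks the catalog for all temporary tables.
  def tempViewNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Hypothetical view, only here so the catalog has something to list
    Seq((1, "a")).toDF("id", "v").createOrReplaceTempView("demo_view")

    spark.catalog.listTables()        // Dataset[org.apache.spark.sql.catalog.Table]
      .filter($"isTemporary")         // same column that show() printed above
      .select($"name")
      .as[String]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("CatalogDemo").getOrCreate()
    tempViewNames(spark).foreach(println)
    spark.stop()
  }
}
```

Against a Hive-enabled session the same pattern works with `listTables("homework")` and a filter on `tableType`, for example to pick out only the EXTERNAL tables.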

Here is one scenario for the catalog above:
building a web page on top of this metadata.


DataSet
This one is simple.
An untyped Dataset is a Dataset[Row], that is, a DataFrame.

You can work with a Dataset much as you would with an RDD.

import org.apache.spark.sql.SparkSession

object DSApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("DSApp")
      .getOrCreate()

    import spark.implicits._
    val df = spark.read.option("header","true")
      .option("inferSchema","true").csv("file:///C:/IdeaProjects/spark/data/sale.csv")
    val ds = df.as[Sales]

    ds.printSchema()
    ds.show()

    // A DataFrame is a Dataset[Row]: weakly typed, columns are looked up by name
    //    df.select("transactionId").show(false)

    ds.map(columns =>{
        columns.transactionId match {
        case 111 =>Sales(columns.transactionId,columns.customerId,columns.itemId,columns.amountPaid+200)
        case _ => Sales(columns.transactionId,columns.customerId,columns.itemId,columns.amountPaid)
      }
    }).show(false)

    spark.stop()
  }
  case class Sales(transactionId:Int,customerId:Int,itemId:Int,amountPaid:Double)
}

Result:
root
 |-- transactionId: integer (nullable = true)
 |-- customerId: integer (nullable = true)
 |-- itemId: integer (nullable = true)
 |-- amountPaid: double (nullable = true)

+-------------+----------+------+----------+
|transactionId|customerId|itemId|amountPaid|
+-------------+----------+------+----------+
|          111|         1|     1|     100.0|
|          112|         2|     2|     500.0|
|          113|         3|     3|     400.0|
|          114|         1|     4|     300.0|
|          115|         1|     1|     200.0|
|          116|         1|     2|     700.0|
|          117|         4|     3|     800.0|
|          118|         5|     1|     200.0|
|          119|         3|     4|     200.0|
|          120|         1|     1|     300.0|
+-------------+----------+------+----------+

+-------------+----------+------+----------+
|transactionId|customerId|itemId|amountPaid|
+-------------+----------+------+----------+
|111          |1         |1     |300.0     |
|112          |2         |2     |500.0     |
|113          |3         |3     |400.0     |
|114          |1         |4     |300.0     |
|115          |1         |1     |200.0     |
|116          |1         |2     |700.0     |
|117          |4         |3     |800.0     |
|118          |5         |1     |200.0     |
|119          |3         |4     |200.0     |
|120          |1         |1     |300.0     |
+-------------+----------+------+----------+

Interoperating with RDDs

DS --> DF: call ds.toDF("column names ...")
DF --> DS: use a case class, df.as[CaseClass]
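
A minimal sketch of both directions, assuming a `Sales` case class shaped like the one in DSApp; the object name `ConvertDemo`, the helper `roundTrip` and the short column names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object ConvertDemo {

  case class Sales(transactionId: Int, customerId: Int, itemId: Int, amountPaid: Double)

  // Goes DS -> DF -> DS and also returns the column names of a renamed DF.
  def roundTrip(spark: SparkSession): (Seq[String], Seq[Sales]) = {
    import spark.implicits._

    // Hypothetical single row
    val ds: Dataset[Sales] = Seq(Sales(111, 1, 1, 100.0)).toDS()

    // DS -> DF: toDF() keeps the field names, toDF(names...) renames the columns
    val df: DataFrame = ds.toDF()
    val renamed: DataFrame = ds.toDF("tid", "cid", "iid", "paid")

    // DF -> DS: column names and types must line up with the case class fields
    val back: Dataset[Sales] = df.as[Sales]

    (renamed.columns.toSeq, back.collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("ConvertDemo").getOrCreate()
    println(roundTrip(spark))
    spark.stop()
  }
}
```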

RDD --> DF: two ways
  Spark SQL supports two different methods for converting existing RDDs into Datasets.
	1. Reflection: use a case class; the case class you define describes the table's schema.
	2. The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD.
	The programmatic way:
		you simply specify the type of each field.

1. The reflection way: RDD -> DF
import org.apache.spark.sql.SparkSession

object RDDApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("CatalogApp")
      .getOrCreate()
    import spark.implicits._

    // RDD ==> DF/DS
        val peopleDF = spark.sparkContext
          .textFile("file:///C:/IdeaProjects/spark/data/data.txt")
          .map(_.split(","))
          .map(x => Person(x(0), x(1).trim.toInt))
          .toDF()

        peopleDF.show(false)

    spark.stop()
  }
  case class Person(name:String,age:Int)
}

Result:
+------------+---+
|name        |age|
+------------+---+
|double_happy|25 |
|Kairis      |25 |
|Kite        |32 |
+------------+---+

2. The programmatic way
When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.
  1.Create an RDD of Rows from the original RDD;
  2.Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
  3.Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.

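import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object RDDApp {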
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("CatalogApp")
      .getOrCreate()

    val peopleRDD = spark.sparkContext.textFile("file:///C:/IdeaProjects/spark/data/data.txt")

    val schemaString = "name age"

    val fields: Array[StructField] = schemaString.split(" ").map(fieldName => {
      StructField(fieldName, StringType)
    })

    val schema = StructType(fields)

    val rowRDD: RDD[Row] = peopleRDD.map(_.split(",")).map(x=>Row(x(0),x(1).trim))

    val peopleDF: DataFrame = spark.createDataFrame(rowRDD,schema)

    // TODO: business logic
    peopleDF.show()

    spark.stop()
  }
  case class Person(name:String,age:Int)
}

Result:
+------------+---+
|        name|age|
+------------+---+
|double_happy| 25|
|      Kairis| 25|
|        Kite| 32|
+------------+---+

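The example on the official site is not great. Is everything in your production data really a String? Well, at my previous company it actually was.
My advice: use proper int or double types for the fields you compute statistics on.
Back when everything I aggregated was a String, the metrics came out wrong.
A colleague once told me that using String everywhere is great. All I could say was "sure, sure".
In their eyes, isn't Spark just writing SQL?
That is fine for computing metrics, but what if you are asked to do infrastructure development?
Don't limit yourself to metric requirements, or as a big data engineer you are nothing more than a SQL jockey.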

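import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RDDApp {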
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("CatalogApp")
      .getOrCreate()
 
    val peopleRDD = spark.sparkContext.textFile("file:///C:/IdeaProjects/spark/data/data.txt")

    val schema = StructType(Array(
      StructField("name",StringType),
      StructField("age",IntegerType)
    ))

    val rowRDD = peopleRDD
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).trim.toInt))
      
    val peopleDF = spark.createDataFrame(rowRDD, schema)

    peopleDF.show()
    spark.stop()
  }
 case class Person(name:String,age:Int)
}

Note:
	The data types inside each Row must match the data types declared in the schema.

Result:
+------------+---+
|        name|age|
+------------+---+
|double_happy| 25|
|      Kairis| 25|
|        Kite| 32|
+------------+---+

UDF

import org.apache.spark.sql.SparkSession

object UDFApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("SparkSessionApp")
      .getOrCreate()

    import spark.implicits._
    /**
      * step 1: define and register the function
      * step 2: use it
      */
    spark.sparkContext.textFile("file:///C:/IdeaProjects/spark/data/udf.txt")
      .map(_.split(" "))
      .map(x => FootballTeam(x(0), x(1)))
      .toDF().createOrReplaceTempView("teams")

    spark.udf.register("teams_length",(input:String)=>{
      input.split("，").length
    })

    // Count how many teams each person likes

    spark.sql("select name,teams,teams_length(teams) from teams").show()

    spark.stop()
  }

  case class FootballTeam(name:String, teams:String)

}
Result:
+------+------------------+-----------------------+
|  name|             teams|UDF:teams_length(teams)|
+------+------------------+-----------------------+
|苍老师 |      喵喵喵，红魔  |                      2|
|    pk|小破车，国足，宅团   |                      3|
+------+------------------+-----------------------+



Note:
	  spark.udf.register("teams_length",(input:String)=>{
      input.split("，").length
    })
def register[RT: TypeTag, A1: TypeTag](name: String, func: Function1[A1, RT]): UserDefinedFunction


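Function1 here simply means that you pass in a function.

There is another way to use a UDF: the API style.

functions has a udf method that takes a function; combine it with the withColumn method and the effect is the same.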
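
A minimal sketch of that API style, under a few assumptions: the object name `UdfApiDemo`, the helper `teamCounts`, the output column `cnt` and the two sample rows are hypothetical, and the sample data is split on an ASCII comma instead of the full-width comma found in udf.txt.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfApiDemo {

  // Returns (name, number of teams) using a UDF applied through withColumn.
  def teamCounts(spark: SparkSession): Seq[(String, Int)] = {
    import spark.implicits._

    // Hypothetical rows standing in for udf.txt
    val teams = Seq(("pk", "a,b,c"), ("xx", "a,b")).toDF("name", "teams")

    // Step 1: wrap an ordinary Scala function; no registration by name is needed
    val teamsLength = udf((input: String) => input.split(",").length)

    // Step 2: apply it to a Column and attach the result as a new column
    teams.withColumn("cnt", teamsLength(col("teams")))
      .select("name", "cnt")
      .as[(String, Int)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("UdfApiDemo").getOrCreate()
    teamCounts(spark).foreach(println)
    spark.stop()
  }
}
```

A UDF created this way is only usable from the DataFrame API; to call it from spark.sql you still need spark.udf.register as in UDFApp.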