I raised this question back in SparkSQL001 as well.
Data format and content
object TextFileApp {
object UDFUtils {
Is the solution above clear?
How many containers can a single machine run, and who actually decides that?
One production machine:
Memory:
Production setup 1:
Production setup 2:
Production setup 3:
CPU
CPU:
So far the Spark Streaming programs have all been run inside IDEA.
Data comes in from Kafka, Spark Streaming consumes it and produces the word count:
Run result:
Submit command:
[double_happy@hadoop101 bin]$ ./spark-submit \
Note:
Deploying
As with any Spark applications, spark-submit is used to launch your application.
For Scala and Java applications, if you are using SBT or Maven for project management, then package spark-streaming-kafka-0-10_2.12 and its dependencies into the application JAR. Make sure spark-core_2.12 and spark-streaming_2.12 are marked as provided dependencies as those are already present in a Spark installation. Then use spark-submit to launch your application (see Deploying section in the main programming guide).
This approach is not great, so let's switch to another one,
because it means uploading the spark-streaming-kafka-0-10_2.11 jar to the server.
So the --packages option we just used has a small problem; what do we do then?
Note:
the spark-streaming-kafka package brings kafka-clients with it as a dependency.
[double_happy@hadoop101 bin]$ ./spark-submit \
  --master local[2] \
  --name StreamingKakfaDirectYarnApp \
  --jars /home/double_happy/lib/spark-streaming-kafka-0-10_2.11-2.4.4.jar,/home/double_happy/lib/kafka-clients-2.0.0.jar \
  --class com.ruozedata.spark.ss04.StreamingKakfaDirectYarnApp \
  /home/double_happy/lib/spark-core-1.0.jar \
  hadoop101:9092,hadoop101:9093,hadoop101:9094 double_happy_offset double_happy_group3
transformation
The operators we wrote before all process data batch by batch, or accumulate state across batches, and so on.
Window Operations
As shown in the figure, every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In this specific case, the operation is applied over the last 3 time units of data, and slides by 2 time units. This shows that any window operation needs to specify two parameters.
window length - The duration of the window (3 in the figure).
sliding interval - The interval at which the window operation is performed (2 in the figure).
These two parameters must be multiples of the batch interval of the source DStream (1 in the figure).
Example
/**
5-second batches: every 5 seconds, aggregate the previous 10 seconds.
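A minimal sketch of that windowed word count (10-second window, 5-second slide, on 5-second batches); the socket source, host, and delimiter are my assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWCSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowWCSketch")
    val ssc = new StreamingContext(conf, Seconds(5))              // 5-second batches

    val lines = ssc.socketTextStream("hadoop101", 9999)           // assumed source
    val counts = lines.flatMap(_.split(","))
      .map((_, 1))
      // window length 10s, slide 5s: every 5 seconds, aggregate the last 10 seconds
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}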
DataFrame and SQL Operations
This is a huge benefit of unifying batch and streaming.
object StreamingSqlApp {
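A minimal sketch of mixing DStreams with DataFrames/SQL, following the pattern from the official guide; the object name, source, and column/view names are placeholders, not necessarily what StreamingSqlApp does:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSqlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSqlSketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    val words = ssc.socketTextStream("hadoop101", 9999).flatMap(_.split(","))

    words.foreachRDD { rdd =>
      // reuse (or lazily create) a SparkSession from this RDD's SparkConf
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      val wordsDF = rdd.toDF("word")               // each batch becomes a DataFrame
      wordsDF.createOrReplaceTempView("words")
      spark.sql("select word, count(*) as total from words group by word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}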
Consumption semantics
Definitions
The semantics of streaming systems are often captured in terms of how many times each record can be processed by the system. There are three types of guarantees that a system can provide under all possible operating conditions (despite failures, etc.)
1.At most once: Each record will be either processed once or not processed at all.
2.At least once: Each record will be processed one or more times. This is stronger than at-most once as it ensures that no data will be lost. But there may be duplicates.
3.Exactly once: Each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is obviously the strongest guarantee of the three.
1. In a streaming system, records are classified by how many times they can be processed; this gives three broad categories.
Semantics of output operations
Output operations (like foreachRDD) have at-least once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. While this is acceptable for saving to file systems using the saveAs*Files operations (as the file will simply get overwritten with the same data), additional effort may be necessary to achieve exactly-once semantics. There are two approaches.
1.Idempotent updates: Multiple attempts always write the same data. For example, saveAs*Files always writes the same data to the generated files.
2.Transactional updates: All updates are made transactionally so that updates are made exactly once atomically. One way to do this would be the following.
Use the batch time (available in foreachRDD) and the partition index of the RDD to create an identifier. This identifier uniquely identifies a blob data in the streaming application.
Update external system with this blob transactionally (that is, exactly once, atomically) using the identifier. That is, if the identifier is not already committed, commit the partition data and the identifier atomically. Else, if this was already committed, skip the update.
dstream.foreachRDD { (rdd, time) => // time is the timestamp of the current batch
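A minimal sketch of the transactional/idempotent idea described above, continuing that foreachRDD signature; `dstream` comes from the surrounding job and saveToExternalStore is a hypothetical helper, not a real API:

dstream.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partitionIterator =>
    val partitionId = org.apache.spark.TaskContext.get.partitionId()
    // batch time + partition index uniquely identify this blob of data
    val uniqueId = s"${time.milliseconds}-$partitionId"
    // hypothetical helper: commit the data together with uniqueId atomically,
    // or skip the write if uniqueId has already been committed
    saveToExternalStore(uniqueId, partitionIterator)
  }
}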
Output operations (like foreachRDD) have at-least once semantics
1. Reduce the processing time of each batch
object StreamingTuningApp {
——————————————————————————————————
OK, let's write 10 records to Kafka.
——————————————————————————————————
Input Rate: the rate at which data comes in
Best practice:
Kafka consumption has a rate limit setting here.
Modify the code:
First test without the change:
Then test with the change:
That means the parameter did not take effect, hmm.
See, that shows the rate limiting worked; this way the very first batch is kept nicely within the range you can actually handle.
Backpressure mechanism: backpressure, introduced in version 1.5.
Backpressure: it uses the current batch to decide the rate of the following batch.
Going by the official description, the data comes in through a receiver.
I set it to 150.
It cannot be used.
So what should we do? Go and find the answer yourself.
spark.streaming.stopGracefullyOnShutdown
def getStreamingContext(appname: String, batch: Int, defalut: String = "local[2]") = {
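A minimal sketch of what such a helper might look like, built only from the signature above and folding in the rate-limit, backpressure, and graceful-shutdown settings just discussed; the object name and the concrete values are my assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ContextUtils {
  def getStreamingContext(appname: String, batch: Int, defalut: String = "local[2]"): StreamingContext = {
    val conf = new SparkConf()
      .setMaster(defalut)
      .setAppName(appname)
      // cap how much the very first batches read, per Kafka partition
      .set("spark.streaming.kafka.maxRatePerPartition", "150")
      // let backpressure adjust later batches based on the current one
      .set("spark.streaming.backpressure.enabled", "true")
      // finish the in-flight batches before shutting down
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    new StreamingContext(conf, Seconds(batch))
  }
}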
Unified batch and streaming: the future trend.
Spark Streaming provides two categories of built-in streaming sources.
Some of these advanced sources are as follows.
Kafka integration
Spark Streaming + Kafka Integration Guide
The Kafka project introduced a new consumer API between versions 0.8 and 0.10, so there are 2 separate corresponding Spark Streaming packages available. Please choose the correct package for your brokers and desired features; note that the 0.8 integration is compatible with later 0.9 and 0.10 brokers, but the 0.10 integration is not compatible with earlier brokers.
1. High-level API
Make sure to use this one:
spark-streaming-kafka-0-8:
———————————————————————————————————————
However, under default configuration, this approach can lose data under failures (see receiver reliability). To ensure zero-data loss, you have to additionally enable Write-Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write-ahead logs on a distributed file system (e.g. HDFS), so that all the data can be recovered on failure. See Deploying section in the streaming programming guide for more details on Write-Ahead Logs.
WAL mechanism: write the log down first; here the "log" is the data itself.
Note: the Kafka integration exposes just one utility class, KafkaUtils.
Points to remember:
Topic partitions in Kafka do not correlate to partitions of RDDs generated in Spark Streaming. So increasing the number of topic-specific partitions in the KafkaUtils.createStream() only increases the number of threads used to consume topics within a single receiver. It does not increase the parallelism of Spark in processing the data. Refer to the main document for more information on that.
A topic has partitions.
Multiple Kafka input DStreams can be created with different groups and topics for parallel receiving of data using multiple receivers.
Summary:
spark-streaming-kafka-0-10 (the important one)
spark-streaming-kafka-0-10
The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata. However, because the newer integration uses the new Kafka consumer API instead of the simple API, there are notable differences in usage. This version of the integration is marked as experimental, so the API is potentially subject to change.
Offset management is different.
Direct
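A minimal sketch of the 0.10 direct API; the brokers, topic, and group id are copied from the spark-submit command earlier in these notes and should be treated as placeholders, and `ssc` is assumed to exist already:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "hadoop101:9092,hadoop101:9093,hadoop101:9094",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "double_happy_group3",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)   // we manage offsets ourselves
)
val topics = Array("double_happy_offset")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,                       // spread Kafka partitions evenly over executors
  Subscribe[String, String](topics, kafkaParams)
)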
Example
Producer: the console producer (for testing), or implemented in code with the Kafka API.
KafkaProducer:
How do we send the data?
[double_happy@hadoop101 kafka]$ bin/kafka-console-consumer.sh \
Wiring it up
/**
object StreamingKakfaDirectApp {
object StreamingKakfaDirectApp {
hadoop101:6379> keys *
At this point, shut down the streaming job:
trait HasOffsetRanges {
Get the offsets:
Why does it throw an error? And the partition count is still 2, which shows something is off.
Get the offsets:
${x.topic} ${x.partition} ${x.fromOffset} ${x.untilOffset}
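The standard pattern from the integration guide for printing those fields, as a minimal sketch; `stream` is the direct stream created above, and the cast only works on the RDD handed directly to foreachRDD/transform, before any other transformation or repartitioning:

import org.apache.spark.streaming.kafka010.HasOffsetRanges

stream.foreachRDD { rdd =>
  // only the RDD produced directly by createDirectStream carries the offset ranges
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { x =>
    println(s"${x.topic} ${x.partition} ${x.fromOffset} ${x.untilOffset}")
  }
}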
There are many approaches:
2. Kafka itself:
After the second restart
The count is 0, right? Because the offsets have already been committed.
However, you can commit offsets to Kafka after you know your output has been stored, using the commitAsync API. The benefit as compared to checkpoints is that Kafka is a durable store regardless of changes to your application code. However, Kafka is not transactional, so your outputs must still be idempotent.
1. Commit the offsets only after your business logic has completed.
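A minimal sketch of that ordering with commitAsync, per the integration guide; the word-count logic in the middle is just a stand-in for the real business logic:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. run the business logic first (stand-in: a simple word count)
  rdd.map(_.value()).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).foreach(println)

  // 2. only then commit the offsets back to Kafka (asynchronous, stored under this group id)
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}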
3.Your own data store
Here I use Redis; MySQL works too. For testing, switch to a different group id so it consumes from scratch again.
Produce another batch of data and check the result:
I stop the program, produce two batches of data first, then start the program again and check the result.
Test: read the offsets back from Redis.
object RedisOffsetApp {
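A minimal sketch of a Redis-backed offset store; the hash layout (offset:<topic>:<group> with partition -> offset fields) and the Jedis usage are my assumptions, not necessarily what RedisOffsetApp does:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

object RedisOffsetStore {
  private def key(topic: String, group: String) = s"offset:$topic:$group"

  // save the untilOffset of every partition once the batch has been processed
  def save(offsetRanges: Array[OffsetRange], group: String): Unit = {
    val jedis = new Jedis("hadoop101", 6379)
    offsetRanges.foreach { o =>
      jedis.hset(key(o.topic, group), o.partition.toString, o.untilOffset.toString)
    }
    jedis.close()
  }

  // read the offsets back so the stream can resume where it left off
  def load(topic: String, group: String): Map[TopicPartition, Long] = {
    val jedis = new Jedis("hadoop101", 6379)
    val saved = jedis.hgetAll(key(topic, group)).asScala
    jedis.close()
    saved.map { case (partition, offset) =>
      new TopicPartition(topic, partition.toInt) -> offset.toLong
    }.toMap
  }
}

The map returned by load can be passed as the third argument of ConsumerStrategies.Subscribe so that createDirectStream starts from the saved offsets.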
So let's test the streaming code again:
Summary:
Summary
1 | 1. "auto.offset.reset" -> "earliest" |
1 | updateStateByKey: |
1 | [double_happy@hadoop101 ~]$ nc -lk 9999 |
1 | object StreamingWCApp01 { |
The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use this, you will have to do two steps.
1.Define the state - The state can be an arbitrary data type.
2.Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.
In every batch, Spark will apply the state update function for all existing keys, regardless of whether they have new data in a batch or not. If the update function returns None then the key-value pair will be eliminated.
Let’s illustrate this with an example. Say you want to maintain a running count of each word seen in a text data stream. Here, the running count is the state and it is an integer. We define the update function as:
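The running word count from the guide as a minimal sketch; updateStateByKey needs checkpointing, and the checkpoint path, host, and delimiter here are placeholders (`ssc` is assumed to exist):

// state = running count per word; each batch's new values are added on top of it
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(newValues.sum + runningCount.getOrElse(0))
}

ssc.checkpoint("hdfs://hadoop101:9000/spark/checkpoint")    // placeholder path
val pairs = ssc.socketTextStream("hadoop101", 9999).flatMap(_.split(",")).map((_, 1))
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
runningCounts.print()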
updateStateByKey operation:
Example:
1.
object StreamingWCApp01 {
Modify the code
[double_happy@hadoop101 ~]$ nc -lk 9999
object StreamingWCApp01 {
OK, now I stop the program; after restarting it, what will the count be?
It's best to read the official docs directly; I'm only excerpting what I consider important.
object StreamingWCApp02 {
[double_happy@hadoop101 ~]$ nc -lk 9999
Result:
Spark Streaming + Kafka == a perfect match
Writing the data out:
foreachRDD:
Writing data to MySQL
How many storage engines does MySQL have underneath, and how do they differ?
Let's go step by step, from worst to best.
object StreamingWCApp03 {
————————————————————————————————————
Once the error above is understood: what exactly is a closure?
Closure: a function that references a variable defined outside of it.
Fix:
result.foreachRDD( rdd =>{
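Continuing from that line, a minimal sketch of the usual fix: open the connection on the executor, once per partition, instead of closing over a driver-side connection. `result` is assumed to be a DStream[(String, Int)], and the JDBC URL, credentials, and wc table are placeholders:

result.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // created on the executor, per partition; never serialized from the driver
    val connection = java.sql.DriverManager.getConnection(
      "jdbc:mysql://hadoop101:3306/test", "user", "password")
    val stmt = connection.createStatement()
    partition.foreach { case (word, cnt) =>
      // real code should use a PreparedStatement and batching
      stmt.executeUpdate(s"insert into wc(word, cnt) values('$word', $cnt)")
    }
    stmt.close()
    connection.close()
  }
}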
Optimization
object StreamingWCApp03 {
Now we know the correct way to write it, but
there is one more approach, and I recommend using it:
scalikejdbc, which comes with its own connection pool.
object StreamingWCApp03 {
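A minimal sketch with scalikejdbc and its built-in connection pool; the pool-holder object, JDBC URL, credentials, and wc table are my assumptions:

import scalikejdbc._

object MySQLPool {
  // initialized at most once per JVM (the driver in local mode, each executor on a cluster)
  lazy val init: Unit = {
    Class.forName("com.mysql.jdbc.Driver")
    ConnectionPool.singleton("jdbc:mysql://hadoop101:3306/test", "user", "password")
  }
}

result.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    MySQLPool.init                         // make sure the pool exists in this JVM
    DB.autoCommit { implicit session =>
      partition.foreach { case (word, cnt) =>
        SQL("insert into wc(word, cnt) values (?, ?)").bind(word, cnt).update().apply()
      }
    }
  }
}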
Previously we used state to do the accumulation.
object StreamingWCApp03 {
Feed in some more data.
This shows the result is OK.
transform
transform(func) ;
Example
Blacklist
Goal:
whenever a record matches something on the blacklist, filter out all of the blacklisted data.
First do it the Spark Core way:
ssc: very important
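A minimal sketch of the blacklist filter using transform plus a Core-style leftOuterJoin; the input format (name in the second comma-separated field), the blacklist contents, and the `lines`/`ssc` variables are assumptions:

// blacklist as a pair RDD on the driver side: name -> true
val blacklistRDD = ssc.sparkContext.parallelize(Seq("zs", "ls")).map(name => (name, true))

// input lines assumed to look like "20230101,zs,...": key each line by the name field
val cleaned = lines
  .map(line => (line.split(",")(1), line))
  .transform { rdd =>
    rdd.leftOuterJoin(blacklistRDD)                               // (name, (line, Option[true]))
      .filter { case (_, (_, flag)) => !flag.getOrElse(false) }   // drop blacklisted names
      .map { case (_, (line, _)) => line }
  }

cleaned.print()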
Fault Tolerance:
Stream processing
1. Spark Streaming is an extension of the core Spark API
Spark Streaming (ss):
The Spark Streaming programming model:
So you absolutely must be fluent with the RDD operators.
Example code setup
Wrap it in a utility class:
object AppName {
Example
socket:
There are three of them. Which one should you use, and what is the difference? See below.
Data source: socket
Test:
[double_happy@hadoop101 ~]$ nc -lk 9999
object StreamingWCApp01 {
[double_happy@hadoop101 ~]$ nc -lk 9999
After a context is defined, you have to do the following.
Explanation of the example above:
Input DStreams are DStreams representing the stream of input data received from streaming sources. In the quick example, lines was an input DStream as it represented the stream of data received from the netcat server. Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.
1. lines was an input DStream
where it must be > n, because some jobs need to process multiple input streams (multiple receivers).
Active job: the receiver is there to receive data and runs the whole time.
This diagram will be explained in detail in the tuning section.
Operations explained
Transformations on DStreams
Only the last two differ from the RDD operators; everything else is the same.
[double_happy@hadoop101 ~]$ nc -lk 9999
object StreamingWCApp01 {
[double_happy@hadoop101 ~]$ nc -lk 9999
object StreamingWCApp01 {
/**
Spark Streaming provides two categories of built-in streaming sources.
Streaming systems usually consume from Kafka; reading from files is much less common.
1. Publish and subscribe
CDK:
The CDK official site
The CDH version of Kafka: Kafka has to be deployed as a custom add-on service.
The choice of Kafka version should be driven by Spark.
Here I download the latest version.
kafka:
Startup succeeded.
Check ZooKeeper:
[zk: localhost:2181(CONNECTED) 4] ls /
[double_happy@hadoop101 kafka]$ jps
Kafka concepts:
4. A few concepts
Common commands:
1. Create a topic
—————————————————————————————————————————
Note:
2. List the topics: see how many topics the current Kafka cluster has.
3. Describe a topic (the details)
Test: ISR
By coincidence, randomly killing a process really did kill broker 1.
Restart broker 1:
4. Delete a topic
Normally it should be marked with a delete flag; once Kafka is restarted, the folders marked delete are removed for good.
What I want to highlight is this: can deleting a topic cause a Kafka failure?
[zk: localhost:2181(CONNECTED) 0] ls /
5. Alter a topic
————————————————————————————————————
After adding partitions:
There is no g7-1 or g7-2.
Describe it:
6. Automatically migrate data to the new nodes:
1. Use the kafka-reassign-partitions.sh script.
This is rarely used at work; the machines are usually planned out ahead of time.
7. Console example, very simple.
Producer:
Consumer:
Now I kill the consumer and bring it back up:
Failure case:
Using Kafka across heterogeneous platforms
http://blog.itpub.net/30089851/viewspace-2152671/
Redis is NoSQL.
Official site
Redis:
Operations:
[double_happy@hadoop101 src]$ ls redis-server
Start it:
Run in the background: this needs a small change in redis.conf, namely configuring the logfile location.
Client access: remove the bind line in redis.conf, or set it to 0.0.0.0.
Redis's multi-database feature:
Switching databases in the Redis CLI:
1. Normally the databases are isolated from one another.
Basic Redis commands:
2. Check whether a given key exists.
3. Delete.
There are far too many commands; see the official site.
Commands
Check a key's type:
But what if you really cannot remember them all?
e.g.:
Data types:
You need to master this.
The String type:
Developing in IDEA
public class RedisApp {
In object LogApp {
Demonstrating possible problems with the code above
Where does this 200 come from?
One more thing:
ETL
ETL
Row-oriented storage: MySQL
Column-oriented: ORC, Parquet
With row-oriented storage, the official docs also list plenty of problems when Spark runs jobs on it.
Most of these failures force Spark to re-try by re-queuing tasks:
Storage formats are used together with compression, e.g. ORC + LZO.
This is what is used in production.
hiveserver2 and beeline/jdbc, from Hive
[double_happy@hadoop101 sbin]$ ./start-thriftserver.sh --jars ~/software/mysql-connector-java-5.1.47.jar
[double_happy@hadoop101 sbin]$ tail -200f /home/double_happy/app/spark/logs/spark-double_happy-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop101.out
This shows the thriftserver has started.
The thriftserver end is up, which means the server side is ready.
[double_happy@hadoop101 bin]$ ./beeline -u jdbc:hive2://hadoop101:10000/ruozedata_g7 -n double_happy
[double_happy@hadoop101 bin]$ ./beeline -u jdbc:hive2://hadoop101:10000/homework -n double_happy
So where is this useful?
Spark On Yarn
Running Spark on YARN
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
Client mode:
In Spark on YARN there is no concept of a Worker; Workers belong to Standalone mode.
Cluster mode:
Spark on YARN summary:
Test:
[double_happy@hadoop101 ~]$ spark-shell --help
--deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
Client mode: test
[double_happy@hadoop101 ~]$ spark-shell --master yarn
Note:
[double_happy@hadoop101 ~]$ spark-sql --jars ~/software/mysql-connector-java-5.1.47.jar --master yarn
Open this address and take a look:
spark-shell and spark-sql both work; that is not the main point, though. The main point is below.
[double_happy@hadoop101 ~]$ spark-shell --master yarn --deploy-mode cluster
object SparkSessionApp {
Most people who develop with Spark against Hive just use spark.sql("sql").
That is fine, but I don't like it; I still prefer the API approach. To each their own.
Global sort: this version is written the SQL way.
Global sort: the API way, which I prefer.
object LogApp {
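A minimal sketch of a global sort in both styles; the log DataFrame and its domain/traffic columns are assumptions, not the real LogApp schema:

import org.apache.spark.sql.functions._

// SQL way
logDF.createOrReplaceTempView("access_log")
spark.sql("select domain, traffic from access_log order by traffic desc").show(10)

// API way: orderBy/sort triggers a global sort across partitions
logDF.orderBy(desc("traffic")).show(10)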
With the API approach, what you need to pay attention to is:
Grouped Top N
object LogApp {
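A minimal sketch of a grouped Top N with a window function; again the domain/traffic columns and N = 2 are assumptions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("domain").orderBy(col("traffic").desc)

logDF
  .withColumn("rank", row_number().over(w))    // 1, 2, 3, ... within each domain
  .filter(col("rank") <= 2)                    // keep the top 2 per domain
  .show()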
Catalog
Very, very important, and only available since Spark 2.0. I built a CSV-into-Hive loader that relied on it.
Your Hive metadata lives in MySQL.
Start spark-shell with --jars pointing at the MySQL driver.
scala> val catalog = spark.catalog
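A few Catalog calls that are handy at this point; the database and table names are placeholders:

scala> spark.catalog.listDatabases().show(false)
scala> spark.catalog.listTables("default").show(false)
scala> spark.catalog.listColumns("default", "emp").show(false)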
Here is a scenario where the catalog above comes in handy.
Build a page:
DataSet
This one is quite simple.
Untyped Dataset = Dataset[Row], i.e. a DataFrame
A Dataset is something you can operate on much like an RDD.
Interoperating with RDDs
Interacting with RDDs
DS -> DF via DS.toDF("column names ...")
1. The reflection approach: RDD -> DF
2. The programmatic approach
object RDDApp {
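Minimal sketches of both approaches, following the SQL programming guide; the input path and the "name,age" line format are assumptions:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rdd = spark.sparkContext.textFile("file:///home/double_happy/data/people.txt")  // placeholder path

// 1. reflection: infer the schema from a case class
case class People(name: String, age: Int)
import spark.implicits._
val df1 = rdd.map(_.split(",")).map(x => People(x(0), x(1).trim.toInt)).toDF()

// 2. programmatic: build a schema by hand and attach it to an RDD[Row]
val schema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val rowRDD = rdd.map(_.split(",")).map(x => Row(x(0), x(1).trim.toInt))
val df2 = spark.createDataFrame(rowRDD, schema)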
UDF
object UDFApp {
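A minimal sketch of registering and using a UDF in both the SQL and the API style; the people view, df, and column names are placeholders:

import org.apache.spark.sql.functions.udf
import spark.implicits._

// SQL style: register it, then call it inside spark.sql(...)
spark.udf.register("str_len", (s: String) => s.length)
spark.sql("select name, str_len(name) as name_len from people").show()

// API style: wrap the same function with udf() and apply it to columns
val strLen = udf((s: String) => s.length)
df.select($"name", strLen($"name").alias("name_len")).show()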