SS01 | DoubleHappy or Jepson

Fault Tolerance：
Stateful exactly-once semantics out of the box.
Spark Streaming recovers both lost work and 
operator state (e.g. sliding windows) out of the box, 
without any extra code on your part.

Note: the fault-tolerance mechanism covers two things
1. recovering lost work (e.g. after an executor is lost)
2. recovering operator state

Spark Integration:
By running on Spark, Spark Streaming lets you reuse the same code 
for batch processing, join streams against historical data,
 or run ad-hoc queries on stream state. 
 Build powerful interactive applications, not just analytics.

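Stream processing
  Real-time: Storm, Flink   event-at-a-time (each record is processed as it arrives; this is true real-time)
  Near-real-time: Spark Streaming   mini-batch
  		Spark Streaming cuts the incoming data into batches, e.g. one batch every 5 s (micro-batch processing, not true real-time)
  		Spark Streaming processes the data as a series of small batch jobs

Batch processing: one bounded batch of data is processed in a single run. The data has a beginning and an end.
	e.g. process the data under one directory; once that is done the job is finished, and it does not wander into other directories (a useful way to think about it)

Stream processing: the stream never stops.
	e.g. a tap that has been turned on keeps running; if the water stops, either the tap is broken or the supply has run out
	
How low does the latency need to be in your production system?
Spark Streaming can go down to a batch interval of about 0.5 s.
Keep in mind how much data can arrive within 0.5 s.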

From the official site

1. Spark Streaming is an extension of the core Spark API 

2.Data can be ingested from many sources 
like Kafka, Flume, Kinesis, or TCP sockets,
 and can be processed using complex algorithms expressed 
 with high-level functions like map, reduce, join and window.

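Data sources:
 Kafka (the most important one): a stream-processing engine plus Kafka is the standard pairing
 Flume ==> stream-processing engine: workable, but there is no buffering in between
 HDFS
 TCP sockets ==> testing, plus telecom operators (they used sockets in the early days, around 2015)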


Spark Streaming (ss):
	Input: Kafka, Socket
    Transform: business logic
    Output:

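Spark Streaming does the following:
    1) receives live input data streams
    2) divides the data into batches
For example:
	1. Spark Streaming processes data once every 5 seconds; the 5 s interval elapses
	2. the data received during those 5 s is cut into a batch
    3. the batch is handed to the Spark engine for processing
    4. the processed result is also a batch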
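
The slicing step above can be pictured without Spark at all. The following is only a plain-Scala sketch of the idea, not how Spark implements it: `Event`, `arrivalMs` and `toBatches` are hypothetical names invented for this illustration, and real Spark Streaming assigns records to batches by receive time inside the receiver.

```scala
// Hypothetical event type: arrival time in milliseconds plus a payload.
final case class Event(arrivalMs: Long, payload: String)

object MiniBatchSketch {
  // An event arriving at time t falls into the batch window that starts at
  // (t / batchMs) * batchMs. Windows are returned in time order.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => (e.arrivalMs / batchMs) * batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (start, es) => (start, es.sortBy(_.arrivalMs).map(_.payload)) }
}
```

With a 5000 ms interval, events arriving at 1000 ms and 4999 ms land in the first batch and an event arriving at 5000 ms opens the second one.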


Spark Streaming programming model:
	DStream
        which represents a continuous stream of data

If that is hard to picture, read the source:
/**
 * A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
 * sequence of RDDs (of the same type) representing a continuous stream of data (see
 * org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
 * DStreams can either be created from live data (such as, data from TCP sockets, Kafka, Flume,
 * etc.) using a [[org.apache.spark.streaming.StreamingContext]] or it can be generated by
 * transforming existing DStreams using operations such as `map`,
 * `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream
 * periodically generates a RDD, either from live data or by transforming the RDD generated by a
 * parent DStream.
 *
 * This class contains the basic operations available on all DStreams, such as `map`, `filter` and
 * `window`. In addition, [[org.apache.spark.streaming.dstream.PairDStreamFunctions]] contains
 * operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and
 * `join`. These operations are automatically available on any DStream of pairs
 * (e.g., DStream[(Int, Int)] through implicit conversions.
 *
 * A DStream internally is characterized by a few basic properties:
 *  - A list of other DStreams that the DStream depends on
 *  - A time interval at which the DStream generates an RDD
 *  - A function that is used to generate an RDD after each time interval
 */

abstract class DStream[T: ClassTag] (
    @transient private[streaming] var ssc: StreamingContext
  ) extends Serializable with Logging {

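Note: this is very similar to RDD.
StreamingContext: the context (entry point) for stream processing
DStream: is a continuous sequence of RDDs
  A stream comes in and is split by batch interval (a few seconds per batch) into one RDD after another.
  A DStream is made up of a series of RDDs, and processing is done one RDD at a time.
  Underneath, it is all Spark Core.

Where does a DStream come from? Same as an RDD (see the source comment):
	1. live data
	2. a transformation of another DStream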


This class contains the basic operations available on all DStreams:
browse the DStream class to see how many operations there are.

So you need to know the RDD operators thoroughly.

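Three properties:
	1.A list of other DStreams that the DStream depends on    
		
	2.A time interval at which the DStream generates an RDD
		   An RDD is generated at every time interval, i.e. this is how often the data is processed.
	3. A function that is used to generate an RDD after each time interval
	       	A DStream consists of an ordered series of RDDs.
	       	In the end, an operation on a DStream is an operation on its RDDs,
	       	and an operation on an RDD is an operation on every element of that RDD.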
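
To make the "DStream operation = RDD operation = per-element operation" chain concrete, here is a small sketch. `PerBatchSketch`, `normalize` and `describe` are hypothetical names for this illustration; `map` and `foreachRDD` are the real DStream methods, and the function literal is fully typed so the two-argument `foreachRDD` overload is selected.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

object PerBatchSketch {
  // Pure per-element logic; it runs once for every element of every batch.
  def normalize(word: String): String = word.trim.toLowerCase

  def describe(words: DStream[String]): Unit = {
    // A DStream transformation is applied to each RDD the stream generates.
    val cleaned: DStream[String] = words.map(normalize)

    // foreachRDD exposes the RDD produced for each batch interval.
    cleaned.foreachRDD { (rdd: RDD[String], time: Time) =>
      println(s"batch $time: ${rdd.count()} elements in ${rdd.getNumPartitions} partitions")
    }
  }
}
```

`describe` would be called before `ssc.start()`, with any `DStream[String]` such as the socket stream used later in these notes.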

Preparing the example code

StreamingContext has several auxiliary constructors:
class StreamingContext private[streaming] (
    _sc: SparkContext,
    _cp: Checkpoint,
    _batchDur: Duration
  ) extends Logging {

  /**
   * Create a StreamingContext using an existing SparkContext.
   * @param sparkContext existing SparkContext
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(sparkContext: SparkContext, batchDuration: Duration) = {
    this(sparkContext, null, batchDuration)
  }

  /**
   * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
   * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
  }

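We use this(conf: SparkConf, batchDuration: Duration)

case class Duration (private val millis: Long)    the unit is milliseconds

Passing a raw Duration is clumsy: if you want to specify seconds you have to do the conversion yourself. Let's see whether a helper already exists.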

/**
 * Helper object that creates instance of [[org.apache.spark.streaming.Duration]] representing
 * a given number of seconds.
 */
object Seconds {
  def apply(seconds: Long): Duration = new Duration(seconds * 1000)
}
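
The `Seconds` helper has siblings in the same package. As a quick sketch (the object name `DurationSketch` is made up here), these three values cover the sub-second, second and minute cases; each one is an ordinary `Duration` measured in milliseconds.

```scala
import org.apache.spark.streaming.{Duration, Milliseconds, Minutes, Seconds}

object DurationSketch {
  val halfSecond: Duration  = Milliseconds(500) // a 0.5 s batch interval
  val fiveSeconds: Duration = Seconds(5)        // 5 * 1000 ms
  val oneMinute: Duration   = Minutes(1)        // 60 * 1000 ms
}
```

Any of these can be passed as the `batchDuration` argument of the `StreamingContext` constructors shown above.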

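Wrap this in a small utility class:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ContextUtils {

  /**
    * Build a SparkContext.
    */
  def getSparkContext(appname: String, master: String = "local[2]"): SparkContext = {

    val sparkConf = new SparkConf().setAppName(appname).setMaster(master)

    new SparkContext(sparkConf)
  }


  /**
    * Build a StreamingContext with a batch interval of `batch` seconds.
    * The default master is deliberately left as "local" here; the socket
    * example further down shows why a receiver-based stream needs local[2].
    */
  def getStreamingContext(appname: String, batch: Int, master: String = "local"): StreamingContext = {

    val sparkConf: SparkConf = new SparkConf().setAppName(appname).setMaster(master)

    new StreamingContext(sparkConf, Seconds(batch))
  }
}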

object AppName {

  def main(args: Array[String]): Unit = {

    println(this.getClass.getName)    // fully qualified name (package + class)
    println(this.getClass.getSimpleName)    // simple class name
  }
}

Output:
com.ruozedata.spark.ss01.AppName$         
AppName$

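Example
socket:

There are three socket-related methods on StreamingContext. Which one should you use, and how do they differ? See below.

Data source: socket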
 /**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @see [[socketStream]]
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

  /**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes it interpreted as object using the given
   * converter.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param converter     Function to convert the byte stream to objects
   * @param storageLevel  Storage level to use for storing the received objects
   * @tparam T            Type of the objects received (after converting bytes to objects)
   */
  def socketStream[T: ClassTag](
      hostname: String,
      port: Int,
      converter: (InputStream) => Iterator[T],
      storageLevel: StorageLevel
    ): ReceiverInputDStream[T] = {
    new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
  }

Note:
	socketTextStream
		calls socketStream underneath,
		and socketStream creates a SocketInputDStream.

socketTextStream and socketStream differ only in their parameters; they are used in the same way.
As for SocketInputDStream:

class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T]

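All of these are ReceiverInputDStreams (important).

The default StorageLevel here is MEMORY_AND_DISK_SER_2,
which differs from the Spark Core default (MEMORY_ONLY for cache()). Why the _2? The suffix means every received block is replicated to two nodes, so the received data survives the loss of one executor.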
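
The `_2` suffix on a storage level denotes the replication factor, which can be read back from the `StorageLevel` itself. The sketch below (object and method names are hypothetical) also shows how a different level could be passed through the third parameter of `socketTextStream`, for instance on a single-node test setup where a second replica has nowhere to go.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream

object StorageLevelSketch {
  // Replication factor of the streaming default versus the single-copy variant.
  val defaultReplicas: Int = StorageLevel.MEMORY_AND_DISK_SER_2.replication
  val singleReplicas: Int  = StorageLevel.MEMORY_AND_DISK_SER.replication

  // Same socket source, but each received block is stored only once.
  def singleCopyLines(ssc: StreamingContext, host: String, port: Int): ReceiverInputDStream[String] =
    ssc.socketTextStream(host, port, StorageLevel.MEMORY_AND_DISK_SER)
}
```

Dropping replication trades fault tolerance of the received data for less memory and network traffic, so it is a choice for testing rather than a general recommendation.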

A Quick Example

Test:

[double_happy@hadoop101 ~]$ nc -lk 9999
a,a,a,a
b,b,b,b

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object StreamingWCApp01 {

  def main(args: Array[String]): Unit = {

    // batch interval: 10 seconds
    val ssc = ContextUtils.getStreamingContext(this.getClass.getSimpleName, 10)

    // business logic goes here
    // input DStream
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop101",9999)

    // transformation: word count within each batch
    val result: DStream[(String, Int)] = lines.flatMap(_.split(","))
      .map((_, 1)).reduceByKey(_ + _)

    // output
    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
Output:
19/10/31 11:35:31 WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
19/10/31 11:35:40 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 11:35:40 WARN BlockManager: Block input-0-1572492940600 replicated to only 0 peer(s) instead of 1 peers
19/10/31 11:35:44 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 11:35:44 WARN BlockManager: Block input-0-1572492944000 replicated to only 0 peer(s) instead of 1 peers


Why is there no output?
Change the master from local to local[2] and test again.

[double_happy@hadoop101 ~]$ nc -lk 9999
a,a,a,a
b,b,b,b        <- input of the first test
               <- second test, first batch
a,a,a,a
b,b,b,b

               <- second batch
a,a,a,a
b,b,b,b
a,a,a,a
b,b,b,b

Output:
-------------------------------------------
Time: 1572493040000 ms
-------------------------------------------

19/10/31 11:37:24 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 11:37:24 WARN BlockManager: Block input-0-1572493044200 replicated to only 0 peer(s) instead of 1 peers
19/10/31 11:37:25 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 11:37:25 WARN BlockManager: Block input-0-1572493045000 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1572493050000 ms
-------------------------------------------
(b,4)
(a,4)

-------------------------------------------
Time: 1572493060000 ms
-------------------------------------------

19/10/31 11:37:42 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 11:37:42 WARN BlockManager: Block input-0-1572493062600 replicated to only 0 peer(s) instead of 1 peers
19/10/31 11:37:43 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 11:37:43 WARN BlockManager: Block input-0-1572493062800 replicated to only 0 peer(s) instead of 1 peers
19/10/31 11:37:43 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 11:37:43 WARN BlockManager: Block input-0-1572493063400 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1572493070000 ms
-------------------------------------------
(b,8)
(a,8)


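Note:
1.
 The code above only processes the current batch.
 It does not accumulate counts across batches; accumulation needs a different operator.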
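
The notes do not say which operator handles accumulation. One option that exists in the DStream API is `updateStateByKey`, so the sketch below uses it as an assumption about what is meant. `RunningCountSketch`, `updateCount` and `runningCounts` are hypothetical names, and the checkpoint directory is whatever path the caller supplies.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object RunningCountSketch {
  // Merge the counts seen for a key in this batch into its previous total.
  def updateCount(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(newValues.sum + previous.getOrElse(0))

  def runningCounts(ssc: StreamingContext,
                    lines: DStream[String],
                    checkpointDir: String): DStream[(String, Int)] = {
    // Stateful operators require a checkpoint directory to be set.
    ssc.checkpoint(checkpointDir)
    lines.flatMap(_.split(","))
      .map((_, 1))
      .updateStateByKey[Int](updateCount _)
  }
}
```

Under this sketch, the two batches shown above would print (a,4), (b,4) and then (a,12), (b,12) instead of (a,8), (b,8), because the second batch is added to the first.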

After a context is defined, you have to do the following.

    1.Define the input sources by creating input DStreams.
	2.Define the streaming computations by applying transformation and output operations to DStreams.
	3.Start receiving data and processing it using streamingContext.start().
	4.Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
	5.The processing can be manually stopped using streamingContext.stop().
Points to remember:
	1.Once a context has been started, no new streaming computations can be set up or added to it.
		In other words:
			 ssc.start()
			 // logic added here is not picked up; Spark rejects new operations once the context has started
            ssc.awaitTermination()
            
	2.Once a context has been stopped, it cannot be restarted.
	3.Only one StreamingContext can be active in a JVM at the same time.
	4.stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext
		 set the optional parameter of stop() called stopSparkContext to false.
	5.A SparkContext can be re-used to create multiple StreamingContexts,
	 as long as the previous StreamingContext is stopped (without stopping the SparkContext)
	  before the next StreamingContext is created.
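
Points 2, 4 and 5 can be seen together in a short sketch. The object name and the `run` method are hypothetical; `stop(stopSparkContext = ...)` and `SparkContext.isStopped` are the real calls. No inputs are defined here, so the contexts are stopped without ever being started, which is enough to show the lifecycle.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ContextLifecycleSketch {
  // Returns whether the SparkContext was stopped after the first and the second stop().
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("ContextLifecycleSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val first = new StreamingContext(sc, Seconds(5))
    // Stop only the streaming layer and keep the SparkContext alive.
    first.stop(stopSparkContext = false)
    val afterFirst = sc.isStopped

    // A stopped StreamingContext cannot be restarted, so build a new one
    // on the same SparkContext.
    val second = new StreamingContext(sc, Seconds(10))
    second.stop(stopSparkContext = true) // this one also stops sc
    (afterFirst, sc.isStopped)
  }
}
```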


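Walking through the example above:

1. Since the data is received through the context ssc,
there must be a receiver somewhere inside.

The socket is listening on port 9999, and a receiver is needed to pull the data in.

See below.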

Input DStreams and Receivers

Input DStreams are DStreams representing the stream of input data received from streaming sources. In the quick example, lines was an input DStream as it represented the stream of data received from the netcat server. Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.

1. lines was an input DStream
2. Receiver :receives the data from a source and stores it in Spark’s memory for processing.

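So if you are not sure whether an input method uses a receiver, look at its return type.
If the return type contains "Receiver", there is definitely a receiver.
e.g.:
			 val lines: ReceiverInputDStream[String]
 
 Not every input source needs a receiver (important).
 Why?
 	e.g. data on HDFS can be read directly through the file system API, so no receiver is needed.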

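So:
the master above was set to local, not local[2].
Why is 1 thread not enough? With one thread, job 0 (the receiver) occupies the only thread, so no thread is left to process the received data.

The official guide says to use local[n] with n > the number of receivers, because some applications consume several streams.
e.g. if one program opens several sockets, it has several receivers.

So the number of cores must be greater than the number of receivers; otherwise the program can only receive data and never process it.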
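
As a sketch of the "cores > receivers" rule, the following program opens two socket receivers and sizes the local master accordingly. The ports, the `localMaster` helper and the object name are assumptions made for this illustration; `ssc.union` is the real StreamingContext method for merging streams of the same type.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object MultiReceiverSketch {
  // One thread per receiver plus at least one thread for processing.
  def localMaster(receivers: Int): String = s"local[${receivers + 1}]"

  def main(args: Array[String]): Unit = {
    val ports = Seq(9998, 9999) // hypothetical ports; one receiver per port
    val conf = new SparkConf()
      .setAppName("MultiReceiverSketch")
      .setMaster(localMaster(ports.size))
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each socketTextStream call creates its own receiver.
    val streams: Seq[DStream[String]] = ports.map(p => ssc.socketTextStream("localhost", p))
    val all: DStream[String] = ssc.union(streams)
    all.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With two receivers this runs on local[3]; on local[2] both threads would be held by the receivers and no batch would ever be processed.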

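Active job: the receiver, which runs continuously to receive data.
It is always there in socket mode: because the return type is ReceiverInputDStream, the first job stays in the running state for the whole lifetime of the application, and its only duty is to receive data.

(Figure omitted; it is covered in detail in the tuning session.)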

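Operations
Transformations on DStreams

In the official table only the last two entries (transform and updateStateByKey) differ from the RDD operators; all the others behave the same.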

[double_happy@hadoop101 ~]$ nc -lk 9999
a,c,b,b,b
a,a,a,a,a
b,b,b,b,b

object StreamingWCApp01 {

  def main(args: Array[String]): Unit = {

    val ssc = ContextUtils.getStreamingContext(this.getClass.getSimpleName, 10)

    // business logic goes here


    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop101",9999)
    val result: DStream[(String, Int)] = lines.flatMap(_.split(","))
      .map((_, 1)).reduceByKey(_ + _)


//    result.print()
    

    // 1. count the lines and the words in each batch
    lines.count().print() // number of lines (records) in the batch
    lines.flatMap(_.split(",")).count().print()


    ssc.start()
    ssc.awaitTermination()
  }
}
Output:
-------------------------------------------
Time: 1572501470000 ms
-------------------------------------------
0

-------------------------------------------
Time: 1572501470000 ms
-------------------------------------------
0

19/10/31 13:57:55 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 13:57:55 WARN BlockManager: Block input-0-1572501475400 replicated to only 0 peer(s) instead of 1 peers
19/10/31 13:57:59 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 13:57:59 WARN BlockManager: Block input-0-1572501479200 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1572501480000 ms
-------------------------------------------
3

-------------------------------------------
Time: 1572501480000 ms
-------------------------------------------
15

-------------------------------------------
Time: 1572501490000 ms
-------------------------------------------

[double_happy@hadoop101 ~]$ nc -lk 9999
b,b,b,b,b
a,a,a,a,a

object StreamingWCApp01 {

  def main(args: Array[String]): Unit = {

    val ssc = ContextUtils.getStreamingContext(this.getClass.getSimpleName, 10)

    // business logic goes here


    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop101",9999)
    val result: DStream[(String, Int)] = lines.flatMap(_.split(","))
      .map((_, 1)).reduceByKey(_ + _)


//    result.print()


    // 1. count the lines in each batch, then count the occurrences of each word
    lines.count().print() // number of lines (records) in the batch
    lines.flatMap(_.split(",")).countByValue().print()


    ssc.start()
    ssc.awaitTermination()
  }
}

Output:
-------------------------------------------
Time: 1572501630000 ms
-------------------------------------------
0

-------------------------------------------
Time: 1572501630000 ms
-------------------------------------------

19/10/31 14:00:31 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 14:00:31 WARN BlockManager: Block input-0-1572501631200 replicated to only 0 peer(s) instead of 1 peers
19/10/31 14:00:38 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/10/31 14:00:38 WARN BlockManager: Block input-0-1572501638400 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1572501640000 ms
-------------------------------------------
2

-------------------------------------------
Time: 1572501640000 ms
-------------------------------------------
(b,5)
(a,5)

-------------------------------------------
Time: 1572501650000 ms
-------------------------------------------
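
The numbers printed in the two runs above can be checked by hand with plain Scala collections standing in for one batch. This is only a model of what `count` and `countByValue` compute per batch; `BatchCountsSketch` and its methods are hypothetical names and no Spark is involved.

```scala
object BatchCountsSketch {
  // lines.count(): number of records (lines) in the batch.
  def lineCount(batch: Seq[String]): Long = batch.size.toLong

  // lines.flatMap(_.split(",")).count(): number of words in the batch.
  def wordCount(batch: Seq[String]): Long =
    batch.flatMap(_.split(",")).size.toLong

  // lines.flatMap(_.split(",")).countByValue(): occurrences of each distinct word.
  def countByValue(batch: Seq[String]): Map[String, Long] =
    batch.flatMap(_.split(","))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size.toLong) }
}
```

For the first run, the three input lines give 3 lines and 15 words; for the second run, the two input lines give 2 lines and the pairs (a,5) and (b,5).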

Output Operations on DStreams

  /**
   * Save each RDD in this DStream as at text file, using string representation
   * of elements. The file name at each batch interval is generated based on
   * `prefix` and `suffix`: "prefix-TIME_IN_MS.suffix".
   */
  def saveAsTextFiles(prefix: String, suffix: String = ""): Unit = ssc.withScope {
    val saveFunc = (rdd: RDD[T], time: Time) => {
      val file = rddToFileName(prefix, suffix, time)
      rdd.saveAsTextFile(file)
    }
    this.foreachRDD(saveFunc, displayInnerRDDOps = false)
  }

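Can this method be used in production?
If you process a batch every 1 s and write a new output every 1 s, HDFS is quickly flooded with small files.
If you do use it, write the output in append mode, or merge the generated files periodically.


Most output operations are similar to their RDD counterparts.
The foreachRDD operator is covered later.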

Input DStreams and Receivers

Spark Streaming provides two categories of built-in streaming sources.

	1.Basic sources:
		 Sources directly available in the StreamingContext API. 
		 Examples: file systems, and socket connections.
	2.Advanced sources: 
		Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. 
		These require linking against extra dependencies as discussed in the linking section.

Stream-processing systems are usually connected to Kafka; reading files is much less common.

  /**
   * Create an input stream that monitors a Hadoop-compatible filesystem
   * for new files and reads them as text files (using key as LongWritable, value
   * as Text and input format as TextInputFormat). Files must be written to the
   * monitored directory by "moving" them from another location within the same
   * file system. File names starting with . are ignored.
   * @param directory HDFS directory to monitor for new file
   */
  def textFileStream(directory: String): DStream[String] = withNamedScope("text file stream") {
    fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
  }
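Underneath it calls fileStream; follow the source if you are interested (it was covered in an earlier post).

The return type is a plain DStream, not a ReceiverInputDStream, so no receiver is involved and a single-threaded local master is enough.

Note:
 Files must be written to the monitored directory by "moving" them
 from another location within the same file system.

All files must be in the same data format. (See the official documentation.)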
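
A minimal sketch of a file-based word count follows. The monitored directory is a hypothetical default that can be overridden by the first program argument, and `wordsOf` is a helper invented for this example; `textFileStream` is the method quoted above. Because no receiver is created, the master is set to plain `local`.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object FileStreamSketch {
  // Split one comma-separated line into words.
  def wordsOf(line: String): Seq[String] = line.split(",").toSeq

  def main(args: Array[String]): Unit = {
    val dir = args.headOption.getOrElse("/tmp/ss-input") // hypothetical directory
    val conf = new SparkConf().setAppName("FileStreamSketch").setMaster("local")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Only files moved into `dir` after the job has started are picked up.
    val lines: DStream[String] = ssc.textFileStream(dir)
    lines.flatMap(wordsOf).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```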
