Unified batch and stream processing: the future direction.
Spark Streaming provides two categories of built-in streaming sources.
Some of these advanced sources are as follows.
Kafka Integration
Spark Streaming + Kafka Integration Guide
The Kafka project introduced a new consumer API between versions 0.8 and 0.10, so there are 2 separate corresponding Spark Streaming packages available. Please choose the correct package for your brokers and desired features; note that the 0.8 integration is compatible with later 0.9 and 0.10 brokers, but the 0.10 integration is not compatible with earlier brokers.
1. High-level API
Be sure to use this package:
spark-streaming-kafka-0-8:
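For reference, pulling in either integration from a build file could look like the sbt sketch below. The Spark version shown is only an example and should match your cluster; depend on just one of the two artifacts.

```scala
// build.sbt -- pick ONE artifact, matching your broker version and desired features
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8"  % "2.4.0" // receiver + old direct API
// or
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.0" // new consumer API, direct only
```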
However, under the default configuration, this approach can lose data under failures (see receiver reliability). To ensure zero data loss, you have to additionally enable Write-Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write-ahead logs on a distributed file system (e.g. HDFS), so that all the data can be recovered on failure. See the Deploying section in the streaming programming guide for more details on Write-Ahead Logs.
WAL mechanism: the received records are written to a log first; here the "log" holds the data itself.
Note: the Kafka integration exposes a single utility class, KafkaUtils.
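Below is a minimal sketch of the receiver-based (high-level) approach with the write-ahead log switched on. The host names, topic, group id, and checkpoint path are placeholders, not values from the original notes.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverWithWALSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReceiverWithWALSketch")
      // the WAL must be switched on explicitly; received data is then persisted under the checkpoint dir
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs://hadoop101:8020/spark/checkpoint")   // hypothetical HDFS path

    // topic -> number of consumer threads inside the single receiver (NOT Spark parallelism)
    val topics = Map("test_topic" -> 2)
    val stream = KafkaUtils.createStream(
      ssc,
      "hadoop101:2181",                  // ZooKeeper quorum used by the 0.8 high-level consumer
      "test_group",                      // consumer group id
      topics,
      StorageLevel.MEMORY_AND_DISK_SER)  // un-replicated storage is enough once the WAL is on

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```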
Points to remember:
Topic partitions in Kafka do not correlate to partitions of the RDDs generated in Spark Streaming. So increasing the number of topic-specific partitions in KafkaUtils.createStream() only increases the number of threads used to consume topics within a single receiver. It does not increase the parallelism of Spark in processing the data. Refer to the main document for more information on that.
A topic is made up of partitions.
Multiple Kafka input DStreams can be created with different groups and topics for parallel receiving of data using multiple receivers.
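To get receive-side parallelism with this API, you create several receivers and union them; a short sketch reusing the ssc and placeholder names from the previous example:

```scala
// several receivers, each a separate high-level consumer in the same group
val numReceivers = 3
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "hadoop101:2181", "test_group", Map("test_topic" -> 1))
}
// union them into a single DStream before further processing
val unioned = ssc.union(kafkaStreams)
unioned.map(_._2).count().print()
```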
Summary:
spark-streaming-kafka-0-10 (the important one)
The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata. However, because the newer integration uses the new Kafka consumer API instead of the simple API, there are notable differences in usage. This version of the integration is marked as experimental, so the API is potentially subject to change.
Offset management works differently here.
Direct
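A minimal sketch of creating a 0-10 direct stream (reusing a StreamingContext ssc as above); the broker address, topic, and group id are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "hadoop101:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "test_group",
  "auto.offset.reset"  -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)  // we will manage offsets ourselves
)

// one Spark partition per Kafka partition; offsets and metadata are accessible on each batch RDD
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("test_topic"), kafkaParams)
)
```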
Example
Producer: the console producer is fine for testing; otherwise implement it in code with the Kafka producer API.
KafkaProducer:
How do we send messages?
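A sketch of a code-based producer using the standard Kafka client API; the broker and topic names are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "hadoop101:9092")
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    (1 to 10).foreach { i =>
      // fire-and-forget send; call .get() on the returned Future to block until acked
      producer.send(new ProducerRecord[String, String]("test_topic", s"message-$i"))
    }
    producer.close()
  }
}
```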
[double_happy@hadoop101 kafka]$ bin/kafka-console-consumer.sh \
Connecting Spark Streaming to Kafka
/**
object StreamingKakfaDirectApp {
hadoop101:6379> keys *
At this point, stop the streaming job:
trait HasOffsetRanges {
Getting the offsets:
Why does it throw an error? (And the partition count is still 2, which shows something is wrong.) The cast to HasOffsetRanges only succeeds when it is applied to the RDD produced directly by createDirectStream, not to an RDD further down a chain of transformations.
Getting the offsets:
${x.topic} ${x.partition} ${x.fromOffset} ${x.untilOffset}
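Putting the pieces together, reading the offset ranges inside foreachRDD could look like this sketch (the printed fields match the interpolation above; stream is the direct stream from the earlier sketch):

```scala
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // the cast must be done on the RDD that comes straight from createDirectStream
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  offsetRanges.foreach { x =>
    println(s"${x.topic} ${x.partition} ${x.fromOffset} ${x.untilOffset}")
  }

  // ... process the rdd here ...
}
```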
There are several ways to store the offsets:
1. Checkpoints
2. Kafka itself:
After the second restart
The count is 0, right? Because the offsets had already been committed.
However, you can commit offsets to Kafka after you know your output has been stored, using the commitAsync API. The benefit as compared to checkpoints is that Kafka is a durable store regardless of changes to your application code. However, Kafka is not transactional, so your outputs must still be idempotent.
1. Commit the offsets only after your business logic has finished (see the sketch below).
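A sketch of committing back to Kafka with commitAsync after the output step, reusing the stream from the earlier direct-stream sketch:

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1) run the business logic / write the output first (placeholder action)
  rdd.foreach(record => println(record.value()))

  // 2) only then commit; note the cast is on the DStream, not the RDD,
  //    and commitAsync is not transactional, so the output must still be idempotent
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```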
3. Your own data store
Here I use Redis; MySQL would work just as well. For testing, switch to a new group.id so the job consumes from the beginning again.
Produce another batch of data and check the result:
I stop the program, produce two batches of data, then start the program again and check the result.
Test: read the offsets back from Redis
object RedisOffsetApp {
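The original RedisOffsetApp is not reproduced in these notes; the following is only a sketch of what a Redis-backed offset store could look like, assuming the Jedis client and a hypothetical key layout (hash offsets:<groupId>:<topic>, field = partition, value = offset):

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange
import redis.clients.jedis.Jedis

import scala.collection.JavaConverters._

object RedisOffsetStoreSketch {

  private def jedis(): Jedis = new Jedis("hadoop101", 6379)  // placeholder host/port

  // save the end of each processed range so the next run can resume from there
  def saveOffsets(groupId: String, offsetRanges: Array[OffsetRange]): Unit = {
    val client = jedis()
    offsetRanges.foreach { or =>
      client.hset(s"offsets:$groupId:${or.topic}", or.partition.toString, or.untilOffset.toString)
    }
    client.close()
  }

  // read back whatever was stored; an empty map means "first run"
  def readOffsets(groupId: String, topic: String): Map[TopicPartition, Long] = {
    val client = jedis()
    val stored = client.hgetAll(s"offsets:$groupId:$topic").asScala
    client.close()
    stored.map { case (partition, offset) =>
      new TopicPartition(topic, partition.toInt) -> offset.toLong
    }.toMap
  }
}
```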
So we test the streaming code again:
Summary
1 | 1. "auto.offset.reset" -> "earliest" |