Flink02--double_happy

Overview

Function
    map ==> MapFunction
    filter ==> FilterFunction
    xxx ==>  XxxFunction
    RichXxxFunction  *****

SourceFunction     non-parallel (parallelism is always 1)
ParallelSourceFunction
RichParallelSourceFunction  *****

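// For testing: build a stream from an in-memory collection

import com.sx.bean.Domain.Access
import org.apache.flink.streaming.api.scala._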

object SourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // For testing only
    env.fromCollection(List(
      Access(201912120010L, "ruozedata.com", 2000),
      Access(201912120011L, "ruozedata.com", 3000)
    )
    ).print()

    env.execute(this.getClass.getSimpleName)
  }
}

Output:
6> Access(201912120011,ruozedata.com,3000)
5> Access(201912120010,ruozedata.com,2000)

/**
   * Creates a DataStream that contains the given elements. The elements must all be of the
   * same type.
   *
   * Note that this operation will result in a non-parallel data source, i.e. a data source with
   * a parallelism of one.
   */
  def fromElements[T: TypeInformation](data: T*): DataStream[T] = {
    fromCollection(data)
  }

Note:
	fromElements takes a type parameter T; if you spell it out, every element must conform to that type.

object SourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // For testing only
    env.fromElements(1, 2L, 3D, 4F, "5").print()

    env.execute(this.getClass.getSimpleName)
  }
}

Output:
1> 4.0
8> 3.0
2> 5
7> 2
6> 1

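Note:
	Without an explicit type parameter the mixed elements above are accepted (the element type is inferred as their common supertype); with one, for example fromElements[Long](...), the elements are restricted to that type.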
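
The effect of the type parameter can be seen in plain Scala, without Flink. The following is a minimal sketch under that assumption (the object name ElementTypeSketch and its helper are made up for illustration): left to itself, Scala 2 infers Any for the mixed arguments, and an explicit type parameter rejects a non-conforming element at compile time.

```scala
object ElementTypeSketch {
  // No type parameter: Int, Long, Double, Float and String only share Any,
  // so Scala 2 infers Seq[Any] and every element is boxed.
  val mixed = Seq(1, 2L, 3D, 4F, "5")
  val asAny: Seq[Any] = mixed

  // Explicit type parameter: every element must be a Long.
  val longs: Seq[Long] = Seq[Long](1L, 2L)
  // Seq[Long](1L, "5")   // would not compile: type mismatch

  // Runtime class of each element, to show what actually flows through
  def runtimeTypes(data: Seq[Any]): Seq[String] =
    data.map(_.getClass.getSimpleName)
}
```

Calling ElementTypeSketch.runtimeTypes(ElementTypeSketch.asAny) lists the boxed classes of the five elements, which is why the job above prints 1, 2, 3.0, 4.0 and 5 side by side.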

Custom:

env.addSource()

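1. SourceFunction
    cancel()
    run()
Note:
	1. To define a custom source,
	      implement SourceFunction and supply a type parameter (the element type).
	2. Implement these two methods:
		cancel()
			stops the source, typically by flipping a flag that ends the loop in run()
        run()  receives ctx, the source context
          produces the data:
	         a SourceFunction is a data source,
	         and records are emitted through ctx

Note:
	A plain SourceFunction is non-parallel: its parallelism is always 1.

This is very useful for generating test data.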

package com.sx.flink02

import com.sx.bean.Domain.Access
import org.apache.flink.streaming.api.functions.source.SourceFunction

import scala.util.Random

class AccessSource extends SourceFunction[Access]{

  // cancel() is called from a different thread than run(), hence @volatile
  @volatile var running = true

  override def cancel(): Unit = {
    running = false
  }

  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val random = new Random()
    val domains = Array("ruozedata.com","zhibo8.cc","dongqiudi.com")

    while(running) {
      val timestamp = System.currentTimeMillis()
      // emit 10 records per batch; foreach, since only the side effect matters
      1.to(10).foreach(x => {
        ctx.collect(Access(timestamp, domains(random.nextInt(domains.length)), random.nextInt(1000)+x))
      })

      // pause 5 s between batches
      Thread.sleep(5000)
    }
  }
}

object SourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // For testing only
    env.addSource(new AccessSource).print()
    
    env.execute(this.getClass.getSimpleName)
  }
}

Output:
7> Access(1577502637884,zhibo8.cc,991)
2> Access(1577502637884,ruozedata.com,252)
5> Access(1577502637884,dongqiudi.com,428)
8> Access(1577502637884,dongqiudi.com,565)
4> Access(1577502637884,ruozedata.com,140)
6> Access(1577502637884,ruozedata.com,718)
1> Access(1577502637884,dongqiudi.com,66)
3> Access(1577502637884,dongqiudi.com,475)
4> Access(1577502637884,zhibo8.cc,694)
3> Access(1577502637884,zhibo8.cc,972)

That is 10 records emitted every 5 s.
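
The run()/cancel() contract can be tried out without Flink. The sketch below is a simplified model, not Flink code: LoopingSource and runBriefly are hypothetical names, and a plain thread stands in for the task that executes the source. It shows why the flag should be volatile: cancel() is invoked from a different thread than the one looping in run().

```scala
import java.util.concurrent.ConcurrentLinkedQueue

object CancelFlagSketch {
  // Hypothetical stand-in for a source: emits until cancel() is called
  final class LoopingSource {
    @volatile private var running = true

    def cancel(): Unit = { running = false }

    def run(collect: Int => Unit): Unit = {
      var i = 0
      while (running) {
        collect(i)
        i += 1
        Thread.sleep(10)
      }
    }
  }

  // Runs the source on its own thread, cancels it from the caller's thread,
  // and returns how many records were emitted (-1 if the loop never stopped).
  def runBriefly(millis: Long): Int = {
    val out = new ConcurrentLinkedQueue[Int]()
    val source = new LoopingSource
    val worker = new Thread(new Runnable {
      override def run(): Unit = source.run(x => { out.add(x); () })
    })
    worker.start()
    Thread.sleep(millis)
    source.cancel()
    worker.join(1000)
    if (worker.isAlive) -1 else out.size
  }
}
```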

Test:
A plain SourceFunction cannot run with a parallelism greater than 1:
object SourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // For testing only

    env.addSource(new AccessSource).setParallelism(3).print()

    env.execute(this.getClass.getSimpleName)
  }
}
Output:
Exception in thread "main" java.lang.IllegalArgumentException: Source: 1 is not a parallel source
	at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:55)
	at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:31)
	at org.apache.flink.streaming.api.scala.DataStream.setParallelism(DataStream.scala:130)
	at com.sx.flink02.SourceApp$.main(SourceApp.scala:15)
	at com.sx.flink02.SourceApp.main(SourceApp.scala)

Note:
Source: 1 is not a parallel source

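Looking at the source (step through it in a debugger): the stack trace above shows that the check is made in DataStreamSource.setParallelism.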

So in big-data scenarios, where parallelism is needed, a plain SourceFunction is not usable.
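
The behaviour seen in the exception can be modelled in a few lines. This is a simplified model reconstructed from the error message and stack trace above, not a copy of Flink's source; the traits Source and ParallelSource and the class SourceHandle are hypothetical stand-ins.

```scala
object ParallelCheckSketch {
  // Simplified stand-ins for Flink's interfaces (not the real classes)
  trait Source
  trait ParallelSource extends Source

  final class SourceHandle(id: Int, source: Source) {
    // A source counts as parallel only if it implements the parallel interface
    private val isParallel = source.isInstanceOf[ParallelSource]
    var parallelism: Int = 1

    def setParallelism(p: Int): SourceHandle = {
      if (p != 1 && !isParallel)
        throw new IllegalArgumentException(s"Source: $id is not a parallel source")
      parallelism = p
      this
    }
  }
}
```

In this model, asking a plain Source for parallelism 3 throws, while the same call on a ParallelSource succeeds, which mirrors the difference between AccessSource and a source that implements ParallelSourceFunction.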

2. ParallelSourceFunction (supports parallelism)

class AccessSource02 extends ParallelSourceFunction[Access]{

  // cancel() is called from a different thread than run(), hence @volatile
  @volatile var running = true

  override def cancel(): Unit = {
    running = false
  }

  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val random = new Random()
    val domains = Array("ruozedata.com","zhibo8.cc","dongqiudi.com")

    while(running) {
      val timestamp = System.currentTimeMillis()
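      // emit 10 records per batch; foreach, since only the side effect matters
      1.to(10).foreach(x => {
        ctx.collect(Access(timestamp, domains(random.nextInt(domains.length)), random.nextInt(1000)+x))
      })

      // pause 5 s between batches
      Thread.sleep(5000)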
    }
  }
}

Note:
The code is identical; only the interface being implemented changes.

object SourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // For testing only
    env.addSource(new AccessSource02).setParallelism(3).print()

    env.execute(this.getClass.getSimpleName)
  }
}

Output:
6> Access(1577503201868,dongqiudi.com,704)
8> Access(1577503201868,zhibo8.cc,562)
2> Access(1577503201868,dongqiudi.com,526)
7> Access(1577503201868,ruozedata.com,952)
3> Access(1577503201868,ruozedata.com,525)
2> Access(1577503201868,dongqiudi.com,728)
7> Access(1577503201868,dongqiudi.com,185)
5> Access(1577503201868,ruozedata.com,737)
2> Access(1577503201868,zhibo8.cc,47)
1> Access(1577503201868,dongqiudi.com,77)
4> Access(1577503201868,ruozedata.com,275)
5> Access(1577503201868,ruozedata.com,47)
7> Access(1577503201868,dongqiudi.com,289)
3> Access(1577503201868,dongqiudi.com,934)
8> Access(1577503201868,ruozedata.com,69)
6> Access(1577503201868,dongqiudi.com,123)
4> Access(1577503201868,zhibo8.cc,719)
1> Access(1577503201868,zhibo8.cc,254)
6> Access(1577503201868,zhibo8.cc,1002)
3> Access(1577503201868,zhibo8.cc,358)
7> Access(1577503201868,dongqiudi.com,700)
5> Access(1577503201868,dongqiudi.com,692)
6> Access(1577503201868,ruozedata.com,408)
5> Access(1577503201868,ruozedata.com,986)
1> Access(1577503201868,ruozedata.com,284)
4> Access(1577503201868,zhibo8.cc,900)
5> Access(1577503201868,ruozedata.com,343)
8> Access(1577503201868,ruozedata.com,84)
4> Access(1577503201868,zhibo8.cc,967)
8> Access(1577503201868,ruozedata.com,14)

Note:
Each source instance emits 10 records every 5 s, so with a parallelism of 3
the job emits 10 * 3 = 30 records every 5 s.

3. RichParallelSourceFunction ****
Note:
	The core code is the same as above,
	but the Rich variant adds more, as its Javadoc shows:

/**
 * Base class for implementing a parallel data source. Upon execution, the runtime will
 * execute as many parallel instances of this function as configured parallelism
 * of the source.
 *
 * <p>The data source has access to context information (such as the number of parallel
 * instances of the source, and which parallel instance the current instance is)
 * via {@link #getRuntimeContext()}. It also provides additional life-cycle methods
 * ({@link #open(org.apache.flink.configuration.Configuration)} and {@link #close()}.</p>
 *
 * @param <OUT> The type of the records produced by this source.
 */
@Public
public abstract class RichParallelSourceFunction<OUT> extends AbstractRichFunction
		implements ParallelSourceFunction<OUT> {

	private static final long serialVersionUID = 1L;
}


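1. It extends AbstractRichFunction,
and AbstractRichFunction
provides the life-cycle methods
    open()
    close()

So:
	if the source works with things such as
	1. a file system
	2. initialization logic
	3. I/O
	4. MySQL, and so on
acquire the connection in open() and release it in close().

Because:
	open() and close() each run once per task (an earlier article has an example); there is one task per parallel instance.
	So use this Function when
	you need control over the life cycle.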

class AccessSource03 extends RichParallelSourceFunction[Access]{

  // cancel() is called from a different thread than run(), hence @volatile
  @volatile var running = true

  override def cancel(): Unit = {
    running = false
  }

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
  }

  override def close(): Unit = {
    super.close()
  }
  
  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val random = new Random()
    val domains = Array("ruozedata.com","zhibo8.cc","dongqiudi.com")

    while(running) {
      val timestamp = System.currentTimeMillis()
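      // emit 10 records per batch; foreach, since only the side effect matters
      1.to(10).foreach(x => {
        ctx.collect(Access(timestamp, domains(random.nextInt(domains.length)), random.nextInt(1000)+x))
      })

      // pause 5 s between batches
      Thread.sleep(5000)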
    }
  }
}

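That covers the basic use of sources.

Now a requirement:
  connect to MySQL.
  The official site lists no MySQL connector,

so either find one on GitHub or write a custom source.

The requirement is:
  read the data in from MySQL.

Working with MySQL requires the MySQL driver dependency in the pom.

Data in MySQL: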
mysql> select * from student;
+----+--------+----------+-----+
| id | name   | password | age |
+----+--------+----------+-----+
|  1 | kairis | wsx111   |  17 |
|  2 | dbh    | 11       |  90 |
|  3 | double | 44       |  12 |
|  4 | happy  | 11       |  48 |
+----+--------+----------+-----+
4 rows in set (0.00 sec)

mysql>

class MySQLSource extends RichSourceFunction[Student]{
  var connection:Connection = _
  var pstmt:PreparedStatement = _
  // Open the connection in open()
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    connection = MySQLUtils.getConnection()
    pstmt = connection.prepareStatement("select * from student")
  }

  // Release the statement and connection
  override def close(): Unit = {
    super.close()
    MySQLUtils.closeConnection(connection, pstmt)
  }
  override def cancel(): Unit = {

  }
  override def run(ctx: SourceFunction.SourceContext[Student]): Unit = {
    val rs = pstmt.executeQuery()
    while(rs.next()){
      val student = Student(rs.getInt("id"), rs.getString("name"), rs.getString("password"),rs.getInt("age"))
      ctx.collect(student)
    }
  }
}

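Note:
	In production, sources that connect to MySQL run with parallelism.
	Do not open the connection in run(). Why? run() is where the main record-producing logic lives, so resource setup should be kept out of it.
	This is similar to foreach versus foreachPartition in Spark.
Use the open() method instead:
	 acquire the connection there; it runs once per task.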
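
The foreach versus foreachPartition analogy comes down to how many connections get opened. The sketch below illustrates it in plain Scala under stated assumptions: FakeConnection is a hypothetical class that only counts how often it is created, and splitting ids with a modulo stands in for distributing work over parallel instances.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LifecycleSketch {
  // Hypothetical fake connection; it only counts how often it is opened
  final class FakeConnection(opened: AtomicInteger) {
    opened.incrementAndGet()
    def query(id: Int): String = s"row-$id"
    def close(): Unit = ()
  }

  // Anti-pattern: a new connection for every record
  def connectPerRecord(ids: Seq[Int]): Int = {
    val opened = new AtomicInteger(0)
    ids.foreach { id =>
      val conn = new FakeConnection(opened)
      try conn.query(id) finally conn.close()
    }
    opened.get()
  }

  // open()/close() pattern: one connection per parallel instance
  def connectPerInstance(ids: Seq[Int], parallelism: Int): Int = {
    val opened = new AtomicInteger(0)
    ids.groupBy(_ % parallelism).values.foreach { slice =>
      val conn = new FakeConnection(opened)        // what open() would do
      try slice.foreach(id => conn.query(id))      // what run() would do
      finally conn.close()                         // what close() would do
    }
    opened.get()
  }
}
```

For twelve ids, the per-record variant opens twelve connections, while the per-instance variant with a parallelism of 3 opens three.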

object SourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.addSource(new MySQLSource).print()
    env.execute(this.getClass.getSimpleName)
  }
}

Output:
~~~~run~~~~~~
5> Student(4,happy,11,48)
4> Student(3,double,44,12)
3> Student(2,dbh,11,90)
2> Student(1,kairis,wsx111,17)

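env.addSource(new MySQLSource).setParallelism(3).print()
fails with:
Exception in thread "main" java.lang.IllegalArgumentException: Source: 1 is not a parallel source
Why? MySQLSource extends RichSourceFunction, which is not a parallel source.
That said, this hand-written JDBC approach is rather crude.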


Using ScalikeJDBC is more elegant:
the connection pool can be set up in the configuration file.

class ScalikeJDBCMySQLSource extends RichSourceFunction[Student]{

  override def cancel(): Unit = {

  }

  override def run(ctx: SourceFunction.SourceContext[Student]): Unit = {
    println("~~~run~~~~")
    DBs.setupAll()  // parse configuration file

    DB.readOnly{ implicit session => {
      SQL("select * from student").map(rs => {
        val student = Student(rs.int("id"),rs.string("name"),rs.string("password"),rs.int("age"))
        ctx.collect(student)
      }).list().apply()
    }

    }
  }
}

Output:
~~~run~~~~
7> Student(4,happy,11,48)
5> Student(2,dbh,11,90)
4> Student(1,kairis,wsx111,17)
6> Student(3,double,44,12)

Reading data from Kafka

Note:
Different Kafka versions require different Flink connector APIs.


For Kafka versions after 0.11, however, the API is unified.


My Kafka is from the 2.2.1 series and my Flink is 1.9.0,
so the pom dependency I use is:
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.11</artifactId>
  <version>1.9.0</version>
</dependency>

Flink's Kafka consumer is called FlinkKafkaConsumer08 (or 09 for Kafka 0.9.0.x versions, etc., or just FlinkKafkaConsumer for Kafka >= 1.0.0 versions). It provides access to one or more Kafka topics.



Source:
	Flink acts as the Kafka consumer

/**
	 * Creates a new Kafka streaming source consumer.
	 *
	 * @param topic             The name of the topic that should be consumed.
	 * @param valueDeserializer The de-/serializer used to convert between Kafka's byte messages and Flink's objects.
	 * @param props
	 */
	public FlinkKafkaConsumer(String topic, DeserializationSchema<T> valueDeserializer, Properties props) {
		this(Collections.singletonList(topic), valueDeserializer, props);
	}

1. topic
2. DeserializationSchema   the consumer receives raw bytes from Kafka, so it needs a deserializer
3. Properties
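
What the deserializer does can be illustrated in plain Scala. The sketch below is not Flink's SimpleStringSchema; it only assumes that a string schema boils down to a byte/String conversion in some charset, here UTF-8, and the object name StringSchemaSketch is made up.

```scala
import java.nio.charset.StandardCharsets

object StringSchemaSketch {
  // Producer side: String -> bytes (what a string serializer does)
  def serialize(value: String): Array[Byte] =
    value.getBytes(StandardCharsets.UTF_8)

  // Consumer side: bytes -> String (what a string deserializer must do)
  def deserialize(message: Array[Byte]): String =
    new String(message, StandardCharsets.UTF_8)
}
```

A record survives the round trip only if both sides agree on the encoding, which is why the producer's value.serializer and the consumer's schema have to match.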

[double_happy@hadoop101 kafka]$ jps
Kafka
Kafka
Jps
QuorumPeerMain
Kafka

Start the consumer first:
object SourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "hadoop101:9092,hadoop101:9093,hadoop101:9094")
    properties.setProperty("group.id", "sxwang")

    val consumer = new FlinkKafkaConsumer[String]("double_happy_offset", new SimpleStringSchema(), properties)

    env.addSource(consumer).print()

    env.execute(this.getClass.getSimpleName)
  }
}
Output:
1> b
2> d
3> d
1> c
2> f
3> b

Send data to Kafka:
object DataGenerator {

  private val logger: Logger = LoggerFactory.getLogger(DataGenerator.getClass)


  def main(args: Array[String]): Unit = {

    val props = new Properties()
    props.put("bootstrap.servers", "hadoop101:9092,hadoop101:9093,hadoop101:9094")
    props.put("acks", "all")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    for (i <- 0 to 5) {
      Thread.sleep(100)

      // pick a random letter from a to f
      val word: String = String.valueOf((new Random().nextInt(6) + 'a').toChar)
      val part = i % 3 // target partition; the topic has three partitions

      logger.error("word : {}", word)

      val record = producer.send(new ProducerRecord[String, String]("double_happy_offset", part, "",word))

    }

    producer.close()
    println("double_happy 数据产生完毕..........")
  }
}

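Kafka offsets are very easy to manage in Flink.
Spark works in batches, so offsets are managed per batch;
Flink processes records one at a time as they arrive.
Controlling the start offset is just a matter of calling the API, which is simple: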

val myConsumer = new FlinkKafkaConsumer08[String](...)
myConsumer.setStartFromEarliest()      // start from the earliest record possible
myConsumer.setStartFromLatest()        // start from the latest record
myConsumer.setStartFromTimestamp(...)  // start from specified epoch timestamp (milliseconds)
myConsumer.setStartFromGroupOffsets()  // the default behaviour

val stream = env.addSource(myConsumer)

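The snippet above is from the official documentation.

This article covers usage only; state, checkpointing and fault tolerance will be covered in later articles.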

object SourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "hadoop101:9092,hadoop101:9093,hadoop101:9094")
    properties.setProperty("group.id", "sxwang")

    val consumer = new FlinkKafkaConsumer[String]("double_happy_offset", new SimpleStringSchema(), properties)

    consumer.setStartFromEarliest()

    env.addSource(consumer).print()

    env.execute(this.getClass.getSimpleName)
  }
}

Output:
1> a
2> f
3> b
2> f
1> b
2> d
3> e
2> f
1> c
3> d
3> b

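As you can see, this reads back the data written to Kafka earlier.

So offsets in Flink are simple to work with; the connector manages them well underneath.

Question:
 Can I specify a starting offset for an individual partition?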
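
One way to do this is the consumer's setStartFromSpecificOffsets method, which takes a map from KafkaTopicPartition to offset. The following is a sketch only, assuming the flink-connector-kafka dependency shown earlier; the offsets 23, 31 and 43 are made-up example values and the object and method names are hypothetical.

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition

object SpecificOffsetsSketch {
  // partition -> first offset to read; the values are illustrative only
  def startOffsets(topic: String): java.util.Map[KafkaTopicPartition, java.lang.Long] = {
    val offsets = new java.util.HashMap[KafkaTopicPartition, java.lang.Long]()
    offsets.put(new KafkaTopicPartition(topic, 0), java.lang.Long.valueOf(23L))
    offsets.put(new KafkaTopicPartition(topic, 1), java.lang.Long.valueOf(31L))
    offsets.put(new KafkaTopicPartition(topic, 2), java.lang.Long.valueOf(43L))
    offsets
  }

  // Builds a consumer for the topic used in this article that starts
  // from the offsets above instead of the committed group offsets
  def consumer(props: Properties): FlinkKafkaConsumer[String] = {
    val topic = "double_happy_offset"
    val c = new FlinkKafkaConsumer[String](topic, new SimpleStringSchema(), props)
    c.setStartFromSpecificOffsets(startOffsets(topic))
    c
  }
}
```

The resulting consumer is passed to env.addSource in the same way as in the examples above.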

DataStream Transformations

Most of these are much the same as in Spark.
The following covers the ones that Spark does not have.


object TranformationApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

        val stream = env.readTextFile("C:\\IdeaProjects\\flink\\data\\access.log").map(x => {
          val splits = x.split(",")
          Access(splits(0).toLong, splits(1), splits(2).toLong)
        })

    stream.keyBy(_.domain).sum("traffic").print("sum")

    env.execute(this.getClass.getSimpleName)
  }
}
Output:
sum:5> Access(201912120010,ruozedata.com,2000)
sum:6> Access(201912120010,dongqiudi.com,1000)
sum:6> Access(201912120010,zhibo8.com,5000)
sum:5> Access(201912120010,ruozedata.com,6000)
sum:6> Access(201912120010,dongqiudi.com,7000)

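Why does the output look like this?
   Because the sum is updated and emitted once per incoming record.
Note the difference from Spark, which computes per batch.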
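
The per-record behaviour can be reproduced in plain Scala. The sketch below models a keyed rolling sum, not Flink's implementation; the Access case class is a local copy with the fields used in this article, and the input values in the usage note are reconstructed from the output above, so treat them as an assumption.

```scala
import scala.collection.mutable

object RollingSumSketch {
  final case class Access(time: Long, domain: String, traffic: Long)

  // Emits one updated total per incoming record, tracked separately per domain
  def rollingSum(records: Seq[Access]): Seq[Access] = {
    val totals = mutable.Map.empty[String, Long]
    records.map { r =>
      val total = totals.getOrElse(r.domain, 0L) + r.traffic
      totals(r.domain) = total
      r.copy(traffic = total)
    }
  }
}
```

Feeding it ruozedata.com 2000, dongqiudi.com 1000, zhibo8.com 5000, ruozedata.com 4000 and dongqiudi.com 6000 yields the totals 2000, 1000, 5000, 6000 and 7000, one result per input record rather than one per key.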

object TranformationApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment


        val stream = env.readTextFile("C:\\IdeaProjects\\flink\\data\\access.log").map(x => {
          val splits = x.split(",")
          Access(splits(0).toLong, splits(1), splits(2).toLong)
        })

        stream.keyBy(_.domain).reduce((x,y) => {
          Access(x.time, x.domain, (x.traffic+y.traffic+100))
        }).print()

    env.execute(this.getClass.getSimpleName)
  }

}

Output:
5> Access(201912120010,ruozedata.com,4000)
6> Access(201912120010,dongqiudi.com,1000)
6> Access(201912120010,dongqiudi.com,7100)
5> Access(201912120010,ruozedata.com,6100)
6> Access(201912120010,zhibo8.com,5000)

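Note:
1. reduce is more flexible than sum. After keyBy, records with the same domain are grouped together.
2. Why does the zhibo8.com result not have the +100?

Results are emitted once per incoming record, so why no +100?
Because that key has only one record: reduce combines the accumulated value with the next record, and with a single record there is nothing to combine, so the function is never called.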
Splitting a stream:


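This feature may come in handy.
split is used here; later articles use side outputs instead, because split is deprecated. It still works for this demonstration.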

object TranformationApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val stream = env.readTextFile("C:\\IdeaProjects\\flink\\data\\access.log").map(x => {
      val splits = x.split(",")
      Access(splits(0).toLong, splits(1), splits(2).toLong)
    })

    // 5000 6000 7000
    val splitStream = stream.keyBy("domain").sum("traffic").split(x => {
      if (x.traffic > 6000) {
        Seq("大客户")
      } else {
        Seq("一般客户")
      }
    })

    splitStream.select("大客户").print("大客户")
    splitStream.select("一般客户").print("一般客户")
  
    env.execute(this.getClass.getSimpleName)
  }
}

Output:
一般客户:5> Access(201912120010,ruozedata.com,2000)
一般客户:6> Access(201912120010,dongqiudi.com,1000)
一般客户:6> Access(201912120010,zhibo8.com,5000)
一般客户:5> Access(201912120010,ruozedata.com,6000)
大客户:6> Access(201912120010,dongqiudi.com,7000)

splitStream.select("大客户", "一般客户").print("ALL")

Output:
一般客户:5> Access(201912120010,ruozedata.com,2000)
一般客户:6> Access(201912120010,dongqiudi.com,1000)
ALL:5> Access(201912120010,ruozedata.com,2000)
ALL:6> Access(201912120010,dongqiudi.com,1000)
一般客户:6> Access(201912120010,zhibo8.com,5000)
ALL:6> Access(201912120010,zhibo8.com,5000)
一般客户:5> Access(201912120010,ruozedata.com,6000)
大客户:6> Access(201912120010,dongqiudi.com,7000)
ALL:5> Access(201912120010,ruozedata.com,6000)
ALL:6> Access(201912120010,dongqiudi.com,7000)
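
Since split is deprecated, the same routing can be written with a side output. The following is a sketch only, assuming the Flink 1.9 Scala API; the tag name "big-customer", the object SideOutputSketch and its helpers are made up for illustration, and Access is redefined locally so the block stands alone.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object SideOutputSketch {
  case class Access(time: Long, domain: String, traffic: Long)

  // Tag identifying the side output for big customers (hypothetical name)
  val bigCustomers: OutputTag[Access] = OutputTag[Access]("big-customer")

  // Same threshold as the split example above
  def isBig(a: Access): Boolean = a.traffic > 6000

  // Returns (regular customers, big customers)
  def route(totals: DataStream[Access]): (DataStream[Access], DataStream[Access]) = {
    val regular = totals.process(new ProcessFunction[Access, Access] {
      override def processElement(value: Access,
                                  ctx: ProcessFunction[Access, Access]#Context,
                                  out: Collector[Access]): Unit =
        if (isBig(value)) ctx.output(bigCustomers, value) // side output
        else out.collect(value)                           // main output
    })
    (regular, regular.getSideOutput(bigCustomers))
  }
}
```

The two returned streams play the roles of select("一般客户") and select("大客户") in the split version.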

Merging streams

Two streams can be merged with:
union
connect

object TranformationApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment


    val stream = env.readTextFile("C:\\IdeaProjects\\flink\\data\\access.log").map(x => {
      val splits = x.split(",")
      Access(splits(0).toLong, splits(1), splits(2).toLong)
    })
    

        val stream1 = env.addSource(new AccessSource)
        val stream2 = env.addSource(new AccessSource)

// 1. Both streams have the same data type
        stream1.union(stream2).map(x=>{
          println("接收到的数据:" + x)
          x
        }).print()
  

    env.execute(this.getClass.getSimpleName)
  }

}

Output (the exact values do not matter):
接收到的数据:Access(1577510963163,dongqiudi.com,656)
接收到的数据:Access(1577510963163,dongqiudi.com,523)
接收到的数据:Access(1577510963163,dongqiudi.com,340)
接收到的数据:Access(1577510963163,ruozedata.com,450)
接收到的数据:Access(1577510963163,ruozedata.com,790)
2> Access(1577510963163,ruozedata.com,450)
接收到的数据:Access(1577510963163,dongqiudi.com,211)
接收到的数据:Access(1577510963163,ruozedata.com,79)
接收到的数据:Access(1577510963163,dongqiudi.com,673)
4> Access(1577510963163,ruozedata.com,79)
6> Access(1577510963163,dongqiudi.com,211)

  /**
   * Creates a new DataStream by merging DataStream outputs of
   * the same type with each other. The DataStreams merged using this operator
   * will be transformed simultaneously.
   *
   */
  def union(dataStreams: DataStream[T]*): DataStream[T] =
    asScalaStream(stream.union(dataStreams.map(_.javaStream): _*))
union: "the same type with each other" means the streams must have the same data type.

In production, however, two streams that need merging rarely have the same data type.

 /**
   * Creates a new ConnectedStreams by connecting
   * DataStream outputs of different type with each other. The
   * DataStreams connected using this operators can be used with CoFunctions.
   */
  def connect[T2](dataStream: DataStream[T2]): ConnectedStreams[T, T2] =
    asScalaStream(stream.connect(dataStream.javaStream))


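For two streams with different data types, use connect.

map on the connected streams takes two functions:

def map[R: TypeInformation](fun1: IN1 => R, fun2: IN2 => R):


Note:
fun1: IN1 => R, fun2: IN2 => R

fun1: IN1 => R   applied to records of the first stream
fun2: IN2 => R   applied to records of the second stream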

object TranformationApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment


    val stream = env.readTextFile("C:\\IdeaProjects\\flink\\data\\access.log").map(x => {
      val splits = x.split(",")
      Access(splits(0).toLong, splits(1), splits(2).toLong)
    })

    val stream1 = env.addSource(new AccessSource)
    val stream2 = env.addSource(new AccessSource)
   
   val stream2New = stream2.map(x => ("J哥", x))
   
   stream1.connect(stream2New).map(x=>x,y=>y).print()

    env.execute(this.getClass.getSimpleName)
  }

}

Output:
1> Access(1577511438837,zhibo8.cc,823)
8> Access(1577511438837,zhibo8.cc,379)
5> Access(1577511438837,zhibo8.cc,176)
5> Access(1577511438837,zhibo8.cc,414)
1> (J哥,Access(1577511438837,zhibo8.cc,553))
2> Access(1577511438837,ruozedata.com,775)
1> (J哥,Access(1577511438837,ruozedata.com,561))
6> Access(1577511438837,dongqiudi.com,657)
6> Access(1577511438837,zhibo8.cc,131)
6> (J哥,Access(1577511438837,dongqiudi.com,826))
7> Access(1577511438837,ruozedata.com,153)
7> (J哥,Access(1577511438837,ruozedata.com,98))
2> (J哥,Access(1577511438837,ruozedata.com,588))
2> (J哥,Access(1577511438837,ruozedata.com,351))
4> Access(1577511438837,ruozedata.com,24)
8> (J哥,Access(1577511438837,dongqiudi.com,812))
4> (J哥,Access(1577511438837,dongqiudi.com,686))
5> (J哥,Access(1577511438837,zhibo8.cc,825))
3> Access(1577511438837,dongqiudi.com,333)
3> (J哥,Access(1577511438837,zhibo8.cc,140))

connect:
     Connects two data streams retaining their types
     the data types may differ
     exactly two data streams  **
union:
       Union of two or more data streams
        the data types must be the same
        two or more data streams
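
The type-level difference between the two can be shown with ordinary collections. This is an analogy in plain Scala, not Flink code: the function names are made up, and concatenating sequences ignores the fact that real streams interleave their records.

```scala
object ConnectVsUnionSketch {
  // union: all inputs share one element type, and the result keeps it
  def union[T](first: Seq[T], others: Seq[T]*): Seq[T] =
    others.foldLeft(first)(_ ++ _)

  // connect + map(fun1, fun2): exactly two inputs whose types may differ;
  // each input gets its own function and both produce the common type R
  def connectMap[A, B, R](a: Seq[A], b: Seq[B])(fun1: A => R, fun2: B => R): Seq[R] =
    a.map(fun1) ++ b.map(fun2)
}
```

For example, connectMap(Seq(1, 2), Seq("a"))((i: Int) => i.toString, (s: String) => s.toUpperCase) combines an Int input and a String input into one Seq[String], whereas union only accepts inputs of a single type.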

Physical partitioning

class DoubleHappyPartitioner extends Partitioner[String]{
  
  override def partition(key: String, numPartitions: Int): Int = {
    println("partitions: " + numPartitions)
// Note: in Scala, compare strings with ==; there is no need for equals
    if(key == "ruozedata.com"){
      0
    } else if(key == "dongqiudi.com"){
      1
    } else {
      2
    }

  }
}


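Note:
Partitioner[String]

The type parameter is the type of the key.

Custom partitioning needs a key to partition on, so the stream is mapped to (key, value) tuples first.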

/**
   * Partitions a tuple DataStream on the specified key fields using a custom partitioner.
   * This method takes the key position to partition on, and a partitioner that accepts the key
   * type.
   *
   * Note: This method works only on single field keys.
   */
  def partitionCustom[K: TypeInformation](partitioner: Partitioner[K], field: Int) : DataStream[T] =
    asScalaStream(stream.partitionCustom(partitioner, field))

Note the field: Int parameter: it is the position of the key within the tuple.

object TranformationApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(3)

        env.addSource(new AccessSource)
          .map(x=>(x.domain, x))
          .partitionCustom(new DoubleHappyPartitioner, 0)
          .map(x => {
            println("current thread id is: " + Thread.currentThread().getId + " , value is: " + x)
            x._2
          }).print()


    env.execute(this.getClass.getSimpleName)
  }

}

Output:
partitions: 3
partitions: 3
partitions: 3
partitions: 3
partitions: 3
partitions: 3
partitions: 3
partitions: 3
partitions: 3
partitions: 3
current thread id is: 69 , value is: (zhibo8.cc,Access(1577512344598,zhibo8.cc,827))
current thread id is: 68 , value is: (ruozedata.com,Access(1577512344598,ruozedata.com,143))
1> Access(1577512344598,ruozedata.com,143)
3> Access(1577512344598,zhibo8.cc,827)
current thread id is: 68 , value is: (ruozedata.com,Access(1577512344598,ruozedata.com,672))
1> Access(1577512344598,ruozedata.com,672)
current thread id is: 70 , value is: (dongqiudi.com,Access(1577512344598,dongqiudi.com,450))
2> Access(1577512344598,dongqiudi.com,450)
current thread id is: 68 , value is: (ruozedata.com,Access(1577512344598,ruozedata.com,689))
current thread id is: 69 , value is: (zhibo8.cc,Access(1577512344598,zhibo8.cc,608))
current thread id is: 70 , value is: (dongqiudi.com,Access(1577512344598,dongqiudi.com,26))
3> Access(1577512344598,zhibo8.cc,608)
2> Access(1577512344598,dongqiudi.com,26)
1> Access(1577512344598,ruozedata.com,689)
current thread id is: 69 , value is: (zhibo8.cc,Access(1577512344598,zhibo8.cc,430))
current thread id is: 70 , value is: (dongqiudi.com,Access(1577512344598,dongqiudi.com,514))
3> Access(1577512344598,zhibo8.cc,430)
2> Access(1577512344598,dongqiudi.com,514)
current thread id is: 69 , value is: (zhibo8.cc,Access(1577512344598,zhibo8.cc,345))
3> Access(1577512344598,zhibo8.cc,345)

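Note the mapping to subtasks (threads):
1 -> ruozedata.com
2 -> dongqiudi.com
3 -> zhibo8.cc
The output is as expected.

That is the basic use of a custom partitioner:
each thread handles a different subset of the data.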
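
The mapping in the output can be checked without running a job. The sketch below repeats the partitioning rule in plain Scala and assumes that the prefix shown by print() is the 0-based subtask index plus one, which is what the output above suggests; the object and method names are hypothetical.

```scala
object PartitionerSketch {
  // Same rule as DoubleHappyPartitioner: returns a 0-based partition index
  def partition(key: String, numPartitions: Int): Int = key match {
    case "ruozedata.com" => 0
    case "dongqiudi.com" => 1
    case _               => 2
  }

  // Prefix expected in the print() output for a given key (1-based)
  def printPrefix(key: String, numPartitions: Int): String =
    s"${partition(key, numPartitions) + 1}>"
}
```

Because the rule hard-codes the indices 0 to 2, it only makes sense with a parallelism of at least 3, as set by env.setParallelism(3) in the job above.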