RDD Operations
Transformations
Transformations are not executed immediately; they do not trigger the execution of a job.
The official documentation's introduction to RDD operators:
RDD Operations:
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
A note before the Transformations walkthrough
First, a word about how we create the SparkContext in our programs. Every time we create one we have to write the appName and the master, and inspecting an RDD's contents on the Driver side always means writing foreach(println). Doing that over and over is tedious, so here we wrap it up in a small utility.
What it looks like in action:
If I don't want anything printed, I simply pass in a 1:
I wrote this utility myself; it was homework the instructor assigned, precisely to build hands-on skills. I won't hand over the whole thing, because turning you into someone who only copies code would be pointless.
```scala
package com.ruozedata.spark.homework.utils
```
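Since the full helper is not shared, here is a minimal sketch of what such a context utility might look like. The object name ContextUtils, the method names, and the default master are my own assumptions, not the original homework code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper: centralizes SparkContext creation so that appName and
// master do not have to be repeated in every small test program.
object ContextUtils {

  // Assumed signature; a local master is a reasonable default while learning.
  def getSparkContext(appName: String, master: String = "local[2]"): SparkContext = {
    val conf = new SparkConf().setAppName(appName).setMaster(master)
    new SparkContext(conf)
  }

  def stop(sc: SparkContext): Unit = {
    if (sc != null) sc.stop()
  }
}
```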
The next part mainly relies on implicit conversions; if it is hard to follow, see the Scala blog posts:
```scala
package com.ruozedata.spark.homework.utils
```
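And a sketch of the implicit-conversion part: an implicit class that pins a print method onto any RDD, so we never have to type collect().foreach(println) by hand. The names ImplicitUtils, RichRDD and printInfo are illustrative assumptions, not the author's code.

```scala
import org.apache.spark.rdd.RDD

object ImplicitUtils {

  // With ImplicitUtils._ imported, every RDD gains printInfo(), which
  // collects the RDD to the driver and prints each element.
  implicit class RichRDD[T](rdd: RDD[T]) {
    def printInfo(): Unit = rdd.collect().foreach(println)
  }
}
```

With `import ImplicitUtils._` in scope, `rdd.printInfo()` replaces `rdd.collect().foreach(println)`.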
Transformations
In front of the source code there are no secrets; we will learn these operators by reading the source.
(1) Map-related operators
1. makeRDD / parallelize: create an RDD from a local collection (makeRDD essentially delegates to parallelize).
2. map: processes each record.
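A quick spark-shell illustration of both, with the expected results as comments:

```scala
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))   // sc.makeRDD(List(1, 2, 3, 4, 5)) is equivalent

rdd.map(_ * 2).collect
// Array(2, 4, 6, 8, 10)
```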
3. mapPartitions: an RDD is made up of multiple partitions,
so one of the two operates on every element, while the other operates on a whole partition at a time.
Here the function is applied to every element inside each partition,
so the result is the same as with map.
Summary: map invokes the function once for every element, while mapPartitions invokes it once for every partition (see the sketch below).
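A small sketch of the difference; both produce the same output here, but mapPartitions calls the function only once per partition and hands it an Iterator:

```scala
val rdd = sc.parallelize(1 to 10, 2)   // 2 partitions

// map: the function runs once for every element
rdd.map(_ + 1).collect

// mapPartitions: the function runs once per partition, and we map over the
// partition's Iterator ourselves -- the final result is identical to map
rdd.mapPartitions(iter => iter.map(_ + 1)).collect
```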
4. mapPartitionsWithIndex: the same idea as mapPartitions, except the function also receives the index of the partition it is processing.
Why do the elements end up in those particular partitions? We'll come back to that later.
In production you normally don't care which partition an element sits in; this is purely for learning (see the sketch below).
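A sketch that tags each element with the partition it lives in:

```scala
sc.parallelize(1 to 10, 3).mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => s"partition: $index, element: $x")
}.collect.foreach(println)
```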
5. mapValues: transforms only the value of each (key, value) pair; the key stays untouched.
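For example (the values change, the keys stay put):

```scala
sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
  .mapValues(_ * 10)
  .collect
// Array((a,10), (b,20), (c,30))
```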
6. flatMap = map + flatten: map each element, then flatten the results.
Compared with map: map processes each element but never changes the nested structure of the result, whereas flatMap flattens it. The difference shows up clearly in the results below.
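A sketch using word splitting (the sample sentences are made up):

```scala
val lines = sc.parallelize(List("hello world", "hello spark"))

lines.map(_.split(" ")).collect
// Array(Array(hello, world), Array(hello, spark))   <- nested structure preserved

lines.flatMap(_.split(" ")).collect
// Array(hello, world, hello, spark)                 <- flattened
```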
There is also a show-off way of writing it, but it's not really useful.
(2)glom
glom: turns the data in each partition into an array; handier than mapPartitionsWithIndex for seeing what each partition holds.
```scala
scala> sc.parallelize(1 to 30).glom().collect
```
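With an explicit number of partitions the grouping becomes predictable, for example:

```scala
sc.parallelize(1 to 30, 3).glom().collect
// Array(Array(1, 2, ..., 10), Array(11, ..., 20), Array(21, ..., 30))
```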
(3)sample
sample: draws a random sample from the data.

Explanation:
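The signature is sample(withReplacement, fraction, seed). Without replacement, fraction is the probability of keeping each element, so the sample size is only approximately fraction * count and varies between runs unless a seed is fixed. A sketch:

```scala
val data = sc.parallelize(1 to 30)

data.sample(withReplacement = false, fraction = 0.3).collect   // ~30% of the elements, differs per run
data.sample(false, 0.3, seed = 10).collect                     // fixed seed => reproducible sample
```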
(4)filter
```scala
scala> sc.parallelize(1 to 30).filter(_ > 20).collect
```
(5) Other operators
union: a simple concatenation of two RDDs; it does not deduplicate.
intersection: the intersection, i.e. the elements the two RDDs have in common.
subtract: the difference, i.e. the elements that appear in a but not in b.

All three examples start from the same RDD a (see the combined sketch below):

```scala
scala> val a = sc.parallelize(List(1,2,3,4,5,6))
```
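A combined sketch; the contents of b are an assumption, and the element order of the shuffled results may differ:

```scala
val a = sc.parallelize(List(1, 2, 3, 4, 5, 6))
val b = sc.parallelize(List(4, 5, 6, 7, 8, 9))   // assumed contents for b

a.union(b).collect
// Array(1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9)     <- duplicates are kept

a.intersection(b).collect
// Array(4, 5, 6)                                <- present in both

a.subtract(b).collect
// Array(1, 2, 3)                                <- in a but not in b
```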
Deduplication: distinct

```scala
scala> val b = sc.parallelize(List(4,5,6,77,7,7))
```

distinct can also take a numPartitions argument; leave it out and the default number of partitions is used. With four partitions, how are the elements assigned? Each value v goes to partition v % numPartitions (see the sketch below).
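A sketch of both forms; the glom() at the end shows how the partitioning mentioned above distributes the distinct values:

```scala
val b = sc.parallelize(List(4, 5, 6, 77, 7, 7))

b.distinct().collect      // default number of partitions
b.distinct(4).collect     // explicit numPartitions = 4

// distinct shuffles with a HashPartitioner; an Int's hashCode is the value itself,
// so each value v lands in partition v % 4:
b.distinct(4).glom().collect
// roughly: Array(Array(4), Array(5, 77), Array(6), Array(7))
```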
Key-value operators
groupByKey: not used that much in practice.
```scala
scala> sc.parallelize(List(("a",1),("b",2),("c",3),("a",99))).groupByKey()
```
Continuing with the values above: which operator would you use to sum the values of identical keys? We've mentioned it before.
```scala
scala> sc.parallelize(List(("a",1),("b",2),("c",3),("a",99))).groupByKey().collect
```
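The grouped result, and one way to get the per-key sums on top of it (key order in the output may differ):

```scala
val kv = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3), ("a", 99)))

kv.groupByKey().collect
// Array((a,CompactBuffer(1, 99)), (b,CompactBuffer(2)), (c,CompactBuffer(3)))

// summing the grouped values with mapValues:
kv.groupByKey().mapValues(_.sum).collect
// Array((a,100), (b,2), (c,3))
```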
reduceByKey: aggregates the values of each key with the function you pass in.
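reduceByKey gets the same per-key sums in one step, and it combines values inside each partition before shuffling, which is why it is usually preferred over groupByKey:

```scala
sc.parallelize(List(("a", 1), ("b", 2), ("c", 3), ("a", 99)))
  .reduceByKey(_ + _)
  .collect
// Array((a,100), (b,2), (c,3))
```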
How distinct works under the hood
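The essential part of distinct in the RDD source is just a map to (element, null), a reduceByKey that keeps one value per key, and a map back to the element (the exact code differs slightly between Spark versions):

```scala
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)
}
```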
org.apache.spark.rdd.RDD[(Int, Null)]: pay close attention to the data structure that reduceByKey receives here.
So whenever you use a common operator, make a habit of clicking into it and reading the underlying implementation.
Personally, I think this operator is extremely important.
groupBy: custom grouping; the grouping key is whatever condition you pass in.
The parameter is a grouping function. Not sure how to pass one? No problem, just experiment.
Here's a small exercise: count how many times each letter appears (see the sketch below).
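A sketch of the exercise; the letter data below is an assumption:

```scala
val letters = sc.parallelize(List("a", "b", "a", "c", "b", "a"))

// the argument to groupBy is any function that computes the grouping key
letters.groupBy(x => x).collect
// Array((a,CompactBuffer(a, a, a)), (b,CompactBuffer(b, b)), (c,CompactBuffer(c)))

// counting the occurrences per letter:
letters.groupBy(x => x).mapValues(_.size).collect
// Array((a,3), (b,2), (c,1))
```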
sortBy: custom sorting; you decide how to sort, ascending by default.
1 | scala> sc.parallelize(List(("double_happy",30),("老哥",18),("娜娜",60))).sortBy(_._2) |
sortByKey: sorts by the key. Note the difference from sortBy: sortBy takes an arbitrary sort key and is far more flexible.
```scala
sc.parallelize(List(("double_happy",30),("老哥",18),("娜娜",60))).sortByKey().collect
```
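Side by side, with the descending variant thrown in (ascending is the default):

```scala
val people = sc.parallelize(List(("double_happy", 30), ("老哥", 18), ("娜娜", 60)))

people.sortBy(_._2).collect                      // ascending by the second field (age)
// Array((老哥,18), (double_happy,30), (娜娜,60))

people.sortBy(_._2, ascending = false).collect   // descending by age

people.sortByKey().collect                       // sorted by the key, i.e. the name
```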
Join
join: an inner join by default.
In the joined result, the first field is the name, the second the city, the third the age, e.g. (B,(上海,18)).
```scala
scala> j1.fullOuterJoin(j2).collect
```
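A sketch of the whole join family; the contents of j1 (name -> city) and j2 (name -> age) are assumptions chosen to match the (B,(上海,18)) example above, and the output order may differ:

```scala
val j1 = sc.parallelize(List(("A", "北京"), ("B", "上海"), ("C", "深圳")))   // (name, city)
val j2 = sc.parallelize(List(("B", 18), ("C", 30), ("D", 25)))              // (name, age)

j1.join(j2).collect            // inner join: only keys present on both sides survive
// Array((B,(上海,18)), (C,(深圳,30)))

j1.leftOuterJoin(j2).collect   // every key of j1; a missing right side becomes None
// Array((A,(北京,None)), (B,(上海,Some(18))), (C,(深圳,Some(30))))

j1.fullOuterJoin(j2).collect   // keys from both sides; both values wrapped in Option
// Array((A,(Some(北京),None)), (B,(Some(上海),Some(18))), (C,(Some(深圳),Some(30))), (D,(None,Some(25))))
```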