Spark001--double_happy

Speed

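Spark supports pipelined execution: a job is split into stages at shuffle boundaries, and intermediate results inside a stage are not written to disk.
From the runtime point of view:
Spark runs its tasks as threads.
MapReduce runs its map tasks and reduce tasks as separate processes.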
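
The pipelining idea can be illustrated without Spark. The sketch below is only an analogy built on plain Scala iterators, not Spark's implementation: each element flows through map and then filter before the next element is touched, so no intermediate collection is materialized. The object name PipelineSketch and the log format are made up for illustration.

```scala
import scala.collection.mutable.ListBuffer

object PipelineSketch {
  // Records the order in which each step touches each element.
  def run(): List[String] = {
    val log = ListBuffer.empty[String]
    val result = Iterator(1, 2, 3)
      .map { x => log += s"map($x)"; x * 2 }       // lazy: nothing runs yet
      .filter { x => log += s"filter($x)"; x > 2 } // lazy: chained onto map
      .toList                                      // pulls elements one at a time
    log += s"result=$result"
    log.toList
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The log interleaves map and filter per element (map(1), filter(2), map(2), filter(4), ...), which is the same shape as narrow transformations chained inside a single stage.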

RDD

1. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
2.

Five main properties of an RDD (Resilient Distributed Dataset):
	1) A list of partitions; each partition has an index
      protected def getPartitions: Array[Partition]

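Explanation (from the RDD scaladoc):
	 *  - A list of partitions
	 *  - A function for computing each split
 In Scala, List(1, 2, 3, 4).map(_ * 2) runs on a single machine, because a Scala List is a local collection.
 In an RDD the data is partitioned, so rdd.map(_ * 2) computes over the elements of each partition, which makes it distributed.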

    2) A compute function: operating on an RDD really means operating on each of its underlying partitions
    rdd.map(_*2)
    def compute(split: Partition, context: TaskContext): Iterator[T]

	3) Dependencies between RDDs (lineage)
      protected def getDependencies: Seq[Dependency[_]] = deps

	4) A partitioner (for key-value RDDs)
      @transient val partitioner: Option[Partitioner] = None

	5) Preferred locations (schedule the task on the node that holds the data whenever possible)
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
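	The benefit: when a task is scheduled on the node that already holds its data, it can read that data locally.
	That is the ideal case.
	 Otherwise the task runs on one node while the data sits on another, and the data has to be shipped over the network to the node running the task. Property 5 exists to reduce that network transfer.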
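
The five properties can be observed on a live RDD through its public accessors (partitions, dependencies, partitioner, preferredLocations). The following is a minimal sketch, assuming a local[2] master; the object name RddPropertiesSketch, the RddInfo case class, and the app name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddPropertiesSketch {
  // Hypothetical container for what we observe on the RDDs below.
  final case class RddInfo(
      partitionIndexes: Seq[Int],
      perPartition: Seq[Seq[Int]],
      mappedParentCount: Int,
      plainHasPartitioner: Boolean,
      reducedHasPartitioner: Boolean,
      preferred: Seq[String])

  def run(): RddInfo = {
    val conf = new SparkConf().setAppName("rdd-properties").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(List(1, 2, 3, 4), 2)
      val doubled = rdd.map(_ * 2)
      val reduced = rdd.map(x => (x % 2, x)).reduceByKey(_ + _)
      RddInfo(
        // 1) a list of partitions, each with an index
        partitionIndexes = rdd.partitions.map(_.index).toSeq,
        // 2) the function runs per partition; glom() exposes each partition's contents
        perPartition = doubled.glom().collect().map(_.toSeq).toSeq,
        // 3) dependencies: the mapped RDD depends on exactly one parent
        mappedParentCount = doubled.dependencies.size,
        // 4) partitioner: absent on a plain RDD, present after reduceByKey
        plainHasPartitioner = rdd.partitioner.isDefined,
        reducedHasPartitioner = reduced.partitioner.isDefined,
        // 5) preferred locations: empty for a locally parallelized collection
        preferred = rdd.preferredLocations(rdd.partitions(0)))
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

With an RDD read from HDFS, preferredLocations would normally list the hosts that store the corresponding block, which is what lets the scheduler place the task next to the data.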

Application entry point

Developing a Spark application:
1) SparkConf
appName
master
2) SparkContext(sparkConf)
3) spark-shell --master local[2] creates a SparkContext for us automatically, available as sc
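
Outside spark-shell the same steps are written by hand. A minimal sketch, assuming a local[2] master; the object name SparkEntryPoint and the app name are placeholders chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkEntryPoint {
  def run(): Array[Int] = {
    // 1) SparkConf carries the application name and the master URL.
    val conf = new SparkConf().setAppName("Spark001").setMaster("local[2]")
    // 2) SparkContext is the entry point built from that configuration.
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(List(1, 2, 3, 4)).map(_ * 2).collect()
    } finally {
      sc.stop() // always release the context when the job is done
    }
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

When the application is launched with spark-submit, the master is usually passed on the command line instead of being hard-coded with setMaster.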

Operators

RDD creation:
 parallelize:
 	sc.parallelize(List(1,2,3,4))
 textFile:
 	sc.textFile(path)
 By transforming an existing RDD

scala> val rdd = sc.parallelize(List(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.collect
res0: Array[Int] = Array(1, 2, 3, 4)                                            

scala> val rdd1 = sc.textFile("file:///home/sxwang/data")
rdd1: org.apache.spark.rdd.RDD[String] = file:///home/sxwang/data MapPartitionsRDD[2] at textFile at <console>:24

scala> rdd1.collect
res1: Array[String] = Array(spark       flink   hadoop, spark   kafka   scala)

scala> val rdd2 = rdd.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map at <console>:26

scala> rdd2.collect
res2: Array[Int] = Array(2, 4, 6, 8)

scala> rdd2.toDebugString
res3: String =
(2) MapPartitionsRDD[3] at map at <console>:26 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
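
The lineage above contains only narrow transformations. To see where a shuffle enters the lineage, here is a word-count sketch. It assumes a local[2] master and uses sc.parallelize with the two lines from the transcript above as a stand-in for the text file; the object name LineageSketch is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Returns the word counts and the lineage string of the final RDD.
  def run(): (Map[String, Int], String) = {
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for sc.textFile(...): the same two lines, assumed tab-separated.
      val lines = sc.parallelize(Seq("spark\tflink\thadoop", "spark\tkafka\tscala"))
      val counts = lines
        .flatMap(_.split("\\s+")) // narrow: pipelined within one stage
        .map(word => (word, 1))   // narrow: still the same stage
        .reduceByKey(_ + _)       // wide: needs a shuffle, so a new stage starts here
      (counts.collect().toMap, counts.toDebugString)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (counts, lineage) = run()
    println(counts)
    println(lineage) // the lineage should now include a ShuffledRDD
  }
}
```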

Parallelism: the simple version

scala> rdd
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.collect
res8: Array[Int] = Array(1, 2, 3, 4)

Check the Spark web UI for this job.
Why does the RDD have 2 partitions?
Look at the source:

def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}

defaultParallelism:
	
	 /** Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */
  def defaultParallelism: Int = {
    assertNotStopped()
    taskScheduler.defaultParallelism
  }

Look at its implementation:

taskScheduler.defaultParallelism:
	override def defaultParallelism(): Int = backend.defaultParallelism()

Open the local implementation:
  override def defaultParallelism(): Int =
    scheduler.conf.getInt("spark.default.parallelism", totalCores)
It first reads spark.default.parallelism from the configuration; we did not set it, so it falls back to totalCores.

/**
 * Used when running a local version of Spark where the executor, backend, and master all run in
 * the same JVM. It sits behind a [[TaskSchedulerImpl]] and handles launching tasks on a single
 * Executor (created by the [[LocalSchedulerBackend]]) running locally.
 */
private[spark] class LocalSchedulerBackend(
    conf: SparkConf,
    scheduler: TaskSchedulerImpl,
    val totalCores: Int)
  extends SchedulerBackend with ExecutorBackend with Logging {

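totalCores is passed in when the backend is constructed. With --master local[2] it is 2, which is why the RDD has 2 partitions.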
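
The lookup order traced above (spark.default.parallelism first, then the core count of the local master) can be checked directly. A minimal sketch, assuming local mode; the object name DefaultParallelismSketch and the helper partitionsFor are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DefaultParallelismSketch {
  // Returns (sc.defaultParallelism, partition count of a parallelized list).
  def partitionsFor(master: String, configured: Option[Int]): (Int, Int) = {
    val conf = new SparkConf().setAppName("default-parallelism").setMaster(master)
    // Only set spark.default.parallelism when the caller asks for it.
    configured.foreach(n => conf.set("spark.default.parallelism", n.toString))
    val sc = new SparkContext(conf)
    try {
      (sc.defaultParallelism, sc.parallelize(List(1, 2, 3, 4)).getNumPartitions)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    println(partitionsFor("local[2]", None))    // falls back to the 2 local cores
    println(partitionsFor("local[2]", Some(4))) // the configured value wins
  }
}
```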

You can also pass the number of slices explicitly:
scala> sc.parallelize(List(1,2,3,4),3).collect
res9: Array[Int] = Array(1, 2, 3, 4)
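
How four elements end up in three slices can be sketched in plain Scala. The function below is modeled on my recollection of how ParallelCollectionRDD slices an ordinary sequence (integer arithmetic on start and end positions); treat it as an approximation, since Spark also has special handling for ranges. The object name SliceSketch is hypothetical.

```scala
object SliceSketch {
  // Split seq into numSlices contiguous chunks using integer position arithmetic.
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be at least 1")
    val length = seq.length.toLong
    (0 until numSlices).map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      seq.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    println(slice(List(1, 2, 3, 4), 3)) // three slices: (1), (2), (3, 4)
    println(slice(List(1, 2, 3, 4), 2)) // two slices: (1, 2), (3, 4)
  }
}
```

On a real RDD the same layout can be inspected with sc.parallelize(List(1,2,3,4), 3).glom().collect.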
