Spark003--Action (continued from Spark002)

2018-01-19

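Recap of the previous post:

RDD:
what it is
five properties, each backed by a method
ways to create one: 3
operations: 2 kinds, action & transformation

Spark job development flow:
data source --> a chain of transformations --> an action triggers the Spark job --> the output is written somewhere

However complex your business logic is, it follows this shape.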

Action

（1）collect

/**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

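Notes:
1. Return an array that contains all of the elements in this RDD.
2. The resulting array is expected to be small.
3. All the data is loaded into the driver's memory.
So in production, calling collect just to look at the data in an RDD is not realistic: on a large RDD it can cause an OOM on the driver (there are several kinds of OOM).
What if you still want to look at the elements of an RDD?
Two options:
1) take only part of the data
2) write the RDD out to a file system
In real production code there is only one place where collect is used: ???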

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5)

(2)foreach

/**
  * Applies a function f to all elements of this RDD.
  */
 def foreach(f: T => Unit): Unit = withScope {
   val cleanF = sc.clean(f)
   sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
 }

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.foreach(println)
1
2
3
4
5

scala>

Note:
Under spark-shell --master local[2], rdd.foreach(println) prints the results in the console. Would it
also print them under spark-shell --master yarn? Why or why not?

(3)foreachPartition

/**
  * Applies a function f to each partition of this RDD.
  */
 def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
   val cleanF = sc.clean(f)
   sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
 }

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd.foreachPartition(println)
non-empty iterator
non-empty iterator

scala> rdd.partitions.size
res5: Int = 2

scala> 

Note:
What gets printed is "non-empty iterator". How can we print the contents instead?

What if we write it like this?
 rdd.foreachPartition(paritition => paritition.map(println))

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd.foreachPartition(paritition => paritition.map(println))
scala>

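Can you work out why nothing is printed?
The function passed to foreachPartition receives an Iterator, and Iterator.map is lazy: it only builds a new iterator, and since nothing consumes that iterator, println never runs. Writing partition.foreach(println) does print the elements.
Separately, when the job runs on a cluster, whatever println writes shows up in the stdout of the executor that runs the task,
while the console only shows the driver's output.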
This is a good moment to revisit something from last time:
sortBy

val rdd2 = sc.parallelize(List(("a",1),("b",2),("c",3),("d",4)),2)
rdd2.sortBy(_._2,false)

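Note:
Is sortBy a global sort or a per-partition sort?

Look carefully at the two lines above: there are two partitions, and the sort is descending.

Result: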

scala> val rdd2 = sc.parallelize(List(("a",1),("b",2),("c",3),("d",4)),2)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> rdd2.sortBy(_._2,false).foreach(println)
(d,4)
(c,3)
(b,2)
(a,1)

scala> 

Running the same code once more:
scala> val rdd2 = sc.parallelize(List(("a",1),("b",2),("c",3),("d",4)),2)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:24

scala> rdd2.sortBy(_._2,false).foreach(println)
(b,2)
(a,1)
(d,4)
(c,3)

scala>

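So is sortBy a global sort or a per-partition sort? Do the tests above settle it? Not at all.
If anything, it starts to look like a per-partition sort.

Let's write the result out from IDEA and check: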

package com.ruozedata.spark.spark02

import com.ruozedata.spark.homework.utils.ContextUtils

object ActionApp {

  def main(args: Array[String]): Unit = {

    val sc = ContextUtils.getSparkContext(this.getClass.getSimpleName)

    val rdd2 = sc.parallelize(List(("a",1),("b",2),("c",3),("d",4)),2)
    rdd2.sortBy(_._2,false).saveAsTextFile("file:///Users/double_happy/zz/G7-03/工程/scala-spark/doc/out")

    sc.stop()
  }
}

Is it really a per-partition sort? Let's test again.

scala> rdd2.sortBy(_._2,false).foreach(println)
(b,2)
(a,1)
(d,4)
(c,3)

scala> rdd2.sortBy(_._2,false).foreach(println)
(d,4)
(c,3)
(b,2)
(a,1)

scala> rdd2.sortBy(_._2,false).foreach(println)
(d,4)
(c,3)
(b,2)
(a,1)

scala>

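Why does rdd2.sortBy(_._2,false).foreach(println) give different results from run to run?
Because foreach cannot show whether sortBy is a global sort or a per-partition sort.

rdd2 has two partitions, and when foreach runs there is no guarantee which task prints first.

So what kind of sort is sortBy?
It is a global sort, as the files written from IDEA show.

So when you test sortBy, do not follow it with foreach; write the output to files

and check the result by reading those files.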
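One way to picture why the files show a global order while the console sometimes does not is the plain-Scala sketch below. It is a stand-in, not Spark's implementation: it assumes that a descending global sort leaves the largest keys in the first output partition, and globallySorted is a hypothetical helper that mimics this by cutting a sorted list into contiguous chunks.

```scala
object GlobalSortDemo {
  type Pair = (String, Int)

  // Hypothetical stand-in for the output partitions of a descending global sort:
  // sort everything, then cut the sorted sequence into contiguous chunks.
  def globallySorted(data: Seq[Pair], numPartitions: Int): List[List[Pair]] = {
    val sorted = data.sortBy { case (_, value) => -value }.toList
    val chunk = math.max(1, math.ceil(sorted.size.toDouble / numPartitions).toInt)
    sorted.grouped(chunk).toList
  }

  def main(args: Array[String]): Unit = {
    val parts = globallySorted(List(("a", 1), ("b", 2), ("c", 3), ("d", 4)), 2)
    // Reading the partitions in index order (part-00000, part-00001, ...) gives the global order.
    println(parts.flatten)
    // If the task for the second partition happens to print first, the console shows this instead.
    println(parts.reverse.flatten)
  }
}
```

Each partition is sorted and the partitions are ordered relative to each other, so only the order in which tasks print can scramble what the console shows.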

(3)count

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> rdd.count
res5: Long = 5

scala>

(4) reduce: combines the elements pairwise

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> rdd.reduce(_+_)
res6: Int = 15

scala>

(5) first

/**
  * Return the first element in this RDD.
  */
 def first(): T = withScope {
   take(1) match {
     case Array(t) => t
     case _ => throw new UnsupportedOperationException("empty collection")
   }
 }

(6)take

/**
  * Take the first num elements of the RDD. It works by first scanning one partition, and use the
  * results from that partition to estimate the number of additional partitions needed to satisfy
  * the limit.
  *
  * @note This method should only be used if the resulting array is expected to be small, as
  * all the data is loaded into the driver's memory.
  *
  * @note Due to complications in the internal implementation, this method will raise
  * an exception if called on an RDD of `Nothing` or `Null`.
  */
 def take(num: Int): Array[T] = withScope {
   val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
   if (num == 0) {
     new Array[T](0)
   } else {
     val buf = new ArrayBuffer[T]
     val totalParts = this.partitions.length
     var partsScanned = 0
     while (buf.size < num && partsScanned < totalParts) {
       // The number of partitions to try in this iteration. It is ok for this number to be
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1L
       val left = num - buf.size
       if (partsScanned > 0) {
         // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate
         // it by 50%. We also cap the estimation in the end.
         if (buf.isEmpty) {
           numPartsToTry = partsScanned * scaleUpFactor
         } else {
           // As left > 0, numPartsToTry is always >= 1
           numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toInt
           numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
         }
       }

       val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
       val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)

       res.foreach(buf ++= _.take(num - buf.size))
       partsScanned += p.size
     }

     buf.toArray
   }
 }

Under the hood, first calls take.

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> rdd.first
res7: Int = 1

scala> rdd.take(2)
res8: Array[Int] = Array(1, 2)

scala>

(7) top
Sorting is certainly involved internally.

/**
   * Returns the top k (largest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of
   * [[takeOrdered]]. For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
   *   // returns Array(12)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
   *   // returns Array(6, 5)
   * }}}
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @param num k, the number of top elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    takeOrdered(num)(ord.reverse)
  }


Notes:
1. This does the opposite of [[takeOrdered]].

2. Under the hood, top calls takeOrdered.

3. top is curried and takes an implicit Ordering; the Scala posts cover this part in detail.

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> rdd.top(2)
res9: Array[Int] = Array(5, 4)

scala> rdd.takeOrdered(2)
res10: Array[Int] = Array(1, 2)
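The relationship between top and takeOrdered can be mimicked on a local collection. This is a sketch on plain Scala sequences, not the RDD implementation; takeOrdered and top below are hypothetical local helpers that only copy the shape of the source shown above (a curried signature with an implicit Ordering, and top delegating with the ordering reversed).

```scala
object TopDemo {
  // Smallest `num` elements according to the implicit ordering.
  def takeOrdered[T](data: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    data.sorted(ord).take(num)

  // Largest `num` elements: the same call with the ordering reversed.
  def top[T](data: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    takeOrdered(data, num)(ord.reverse)

  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4, 5)
    println(top(data, 2))
    println(takeOrdered(data, 2))
  }
}
```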

(8)zipWithIndex

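Given an operator, how do you tell whether it is an action or a transformation?

An action operator ends up calling sc.runJob(), either directly (as collect, foreach and count above do) or through another action.

The body of zipWithIndex only constructs a ZippedWithIndexRDD and returns an RDD, so it is not an action operator.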

/**
  * Zips this RDD with its element indices. The ordering is first based on the partition index
  * and then the ordering of items within each partition. So the first item in the first
  * partition gets index 0, and the last item in the last partition receives the largest index.
  *
  * This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
  * This method needs to trigger a spark job when this RDD contains more than one partitions.
  *
  * @note Some RDDs, such as those returned by groupBy(), do not guarantee order of
  * elements in a partition. The index assigned to each element is therefore not guaranteed,
  * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
  * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
  */
 def zipWithIndex(): RDD[(T, Long)] = withScope {
   new ZippedWithIndexRDD(this)
 }

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> rdd.zipWithIndex
res11: org.apache.spark.rdd.RDD[(Int, Long)] = ZippedWithIndexRDD[31] at zipWithIndex at <console>:26

scala> rdd.zipWithIndex.collect
res12: Array[(Int, Long)] = Array((1,0), (2,1), (3,2), (4,3), (5,4))
scala>

(9)countByKey

This is an action operator.

/**
  * Count the number of elements for each key, collecting the results to a local Map.
  *
  * @note This method should only be used if the resulting map is expected to be small, as
  * the whole thing is loaded into the driver's memory.
  * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
  * returns an RDD[T, Long] instead of a map.
  */
 def countByKey(): Map[K, Long] = self.withScope {
   self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
 }

(10) collectAsMap: for key-value RDDs

/**
  * Return the key-value pairs in this RDD to the master as a Map.
  *
  * Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
  *          one value per key is preserved in the map returned)
  *
  * @note this method should only be used if the resulting data is expected to be small, as
  * all the data is loaded into the driver's memory.
  */
 def collectAsMap(): Map[K, V] = self.withScope {
   val data = self.collect()
   val map = new mutable.HashMap[K, V]
   map.sizeHint(data.length)
   data.foreach { pair => map.put(pair._1, pair._2) }
   map
 }

scala> rdd.zipWithIndex.collect
res12: Array[(Int, Long)] = Array((1,0), (2,1), (3,2), (4,3), (5,4))

scala> rdd.zipWithIndex().countByKey()
res13: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 1, 2 -> 1, 3 -> 1, 4 -> 1)

scala> rdd.zipWithIndex().collectAsMap()
res14: scala.collection.Map[Int,Long] = Map(2 -> 1, 5 -> 4, 4 -> 3, 1 -> 0, 3 -> 2)

scala>
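The warning in the collectAsMap doc comment is easy to miss in the transcript above, because zipWithIndex is applied to distinct elements and every key is unique. Below is a minimal local sketch with duplicate keys, using plain Scala collections rather than Spark; countByKey and collectAsMap here are hypothetical local helpers that mimic the two source snippets, and the sample pairs are made up.

```scala
import scala.collection.mutable

object KeyedActionsDemo {
  // Number of elements per key, like mapValues(_ => 1L).reduceByKey(_ + _).
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (key, group) => key -> group.size.toLong }

  // One value per key: each put overwrites the previous value for that key.
  def collectAsMap[K, V](pairs: Seq[(K, V)]): Map[K, V] = {
    val map = new mutable.HashMap[K, V]
    pairs.foreach { case (key, value) => map.put(key, value) }
    map.toMap
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("a", 2), ("b", 3))
    println(countByKey(pairs))
    println(collectAsMap(pairs))
  }
}
```

In this local version the last value for a key wins; on a real RDD the surviving value depends on the order in which the collected data arrives, so do not rely on it.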

Official reference for action operators: the Actions section of the Spark RDD Programming Guide.

Spark001--double_happy

2018-01-15

Speed

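Spark supports pipelined execution: a job is split into stages at shuffle boundaries, and the intermediate results inside a stage are not written to disk.
From the execution point of view:
Spark tasks run as threads
MapReduce tasks (map task, reduce task) run as processes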

RDD

1. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

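Five main properties of a resilient distributed dataset:
	1) A list of partitions; each partition has an index
      protected def getPartitions: Array[Partition]

Explanation:
	 *  - A list of partitions
	 *  - A function for computing each split
 In Scala, List(1,2,3,4).map(_*2) works on a list that lives on a single machine
 In an RDD the data is partitioned, so rdd.map(_*2) computes on the elements of every partition and is distributed

    2) An operation on an RDD is really an operation on its underlying partitions
    rdd.map(_*2)
    def compute(split: Partition, context: TaskContext): Iterator[T]

	3) Dependencies between RDDs (lineage)
      protected def getDependencies: Seq[Dependency[_]] = deps

	4) A partitioner (for key-value RDDs)
      @transient val partitioner: Option[Partitioner] = None

	5) Preferred locations (schedule the task on the node that holds the data)
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
	The benefit is that the task can read its data locally.
	That is the ideal case.
	When a task is scheduled on one node and its data lives on another, the data has to be shipped over the network to the node running the task. Property 5 exists to reduce that network transfer.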

Application entry point

Developing a Spark application:
1) SparkConf
appName
master
2) SparkContext(sparkConf)
3) spark-shell --master local[2] creates a SparkContext named sc for us automatically

Operators

Creating an RDD:
 parallelize:
 	sc.parallelize(List(1,2,3,4))
 textFile:
 	sc.textFile(path)
 derived from another RDD through a transformation

scala> val rdd = sc.parallelize(List(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.collect
res0: Array[Int] = Array(1, 2, 3, 4)                                            

scala> val rdd1 = sc.textFile("file:///home/sxwang/data")
rdd1: org.apache.spark.rdd.RDD[String] = file:///home/sxwang/data MapPartitionsRDD[2] at textFile at <console>:24

scala> rdd1.collect
res1: Array[String] = Array(spark       flink   hadoop, spark   kafka   scala)

scala> val rdd2 = rdd.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map at <console>:26

scala> rdd2.collect
res2: Array[Int] = Array(2, 4, 6, 8)

scala> rdd2.toDebugString
res3: String =
(2) MapPartitionsRDD[3] at map at <console>:26 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

Parallelism (the short version)

scala> rdd
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.collect
res8: Array[Int] = Array(1, 2, 3, 4)

The web UI shows a parallelism of 2 for this job.
Why 2?
Let's look at the source:

def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}

defaultParallelism：
	
	 /** Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */
  def defaultParallelism: Int = {
    assertNotStopped()
    taskScheduler.defaultParallelism
  }

Look at its implementation:

taskScheduler.defaultParallelism：
	override def defaultParallelism(): Int = backend.defaultParallelism()

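Following the local implementation:
  override def defaultParallelism(): Int =
    scheduler.conf.getInt("spark.default.parallelism", totalCores)
It first reads spark.default.parallelism from the configuration; we did not set it, so it falls back to totalCores.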

/**
 * Used when running a local version of Spark where the executor, backend, and master all run in
 * the same JVM. It sits behind a [[TaskSchedulerImpl]] and handles launching tasks on a single
 * Executor (created by the [[LocalSchedulerBackend]]) running locally.
 */
private[spark] class LocalSchedulerBackend(
    conf: SparkConf,
    scheduler: TaskSchedulerImpl,
    val totalCores: Int)
  extends SchedulerBackend with ExecutorBackend with Logging {

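totalCores is a constructor parameter, so it is passed in when the backend is created; with --master local[2] it is 2.

You can also pass the number of partitions explicitly: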
	scala> sc.parallelize(List(1,2,3,4),3).collect
res9: Array[Int] = Array(1, 2, 3, 4)


Hadoop project workflow: Hive2MySQL--double_happy

2018-01-03

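Hive statistics tables

Whatever dimensions the Hive statistics cover, make the result table:
1. external
2. partitioned
that is, a partitioned external table.

How do other people access the platform reports?
	1. Connect to HiveServer2 over JDBC: this works, but it is not how production does it
	2. Better to export the data into MySQL   *****  the rest of this post follows this approach

HDFS --> MySQL: Sqoop is used here. Once we get to Spark we will drop Sqoop; it is too clumsy

Points to watch when exporting Hive data to MySQL with Sqoop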

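Notes:
Hive table design: when creating the Hive table, add one extra column that carries the partition value
	The data files on HDFS hold three columns, not counting the partition column, and the MySQL table must match them
	The partition column itself cannot be exported to MySQL as a column (Sqoop reads the data files, which do not contain it)
	So the Sqoop command must list the columns and also specify the field delimiter
	
	Hive's default field delimiter is \001
	Without these settings the export fails
	The fix is the combination of Hive table design and Sqoop options shown below: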

MySQL table: create it with exactly the columns you need from the Hive table
create table  platform_stat(
platform  varchar(20),
cnt  int,
d varchar(8)
)engine=innodb default charset=utf8;

Hive table design:
create table dwd_platform_stat_info(
platform string,
cnt int,
d string
) partitioned by(day string)
location '/hadoop/project/platform_stat';

Using Sqoop: Sqoop generates a .java file in the directory where you run it

sqoop export \
--connect jdbc:mysql://localhost:3306/hive_dwd \
--password wsx123$%^ \
--username root \
--mapreduce-job-name  Platform_info_Hive2MySQL \
--columns "platform,cnt,d"   \
--input-fields-terminated-by '\001' \
--table platform_stat \
--export-dir /hadoop/project/platform_stat/day=20190921

Check the result in the MySQL table.

Wrapped in a script:
	#!/bin/bash

if [ $# -eq 1 ]; then
    time=$1
else
    time=`date -d "yesterday" +%Y%m%d`
fi

# platform_stat statistics
hive -e "

use homework;
insert overwrite table dwd_platform_stat_info partition(day=${time})
select platform , count(1) ,day as d  from access_wide where day ='${time}'  group by platform,day ;
"

#Sqoop platforma_state_Hive2MySQL 
sqoop export \
--connect jdbc:mysql://localhost:3306/hive_dwd \
--password wsx123$%^ \
--username root \
--mapreduce-job-name  Platform_info_Hive2MySQL \
--columns "platform,cnt,d"   \
--input-fields-terminated-by '\001' \
--table platform_stat \
--export-dir /hadoop/project/platform_stat/day=${time}

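The problem:
	With the script above, exporting the same day's data more than once
	leaves duplicate rows in MySQL

Fixing the duplicate rows in MySQL:
	before exporting again, delete the rows for that date left by the previous export

Wrapped in a script: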
	#!/bin/bash

if [ $# -eq 1 ]; then
    time=$1
else
    time=`date -d "yesterday" +%Y%m%d`
fi

# platform_stat statistics
echo "-------------------------------------------------------------------------------------------"
echo "step  : hive -e --->  insert data to platforma_state table"
echo "-------------------------------------------------------------------------------------------"
hive -e "

use homework;
insert overwrite table dwd_platform_stat_info partition(day=${time})
select platform , count(1) ,day as d  from access_wide  where day ='${time}' group by platform,day ;
"

# Log in to MySQL and delete the rows for the processing date left by the previous run
echo "-------------------------------------------------------------------------------------------"
echo "step : MySQL ---> delete  last time process ${time} data"
echo "-------------------------------------------------------------------------------------------"

MySQL_USER='root'
MySQL_PASSWD='wsx123$%^'

SQL_RESULT=`
mysql \
--user="${MySQL_USER}" \
--password="${MySQL_PASSWD}"  \
-e "select count(1) as cnt from hive_dwd.platform_stat where d = '${time}';" | tail -1
`
echo "-------------------------------------------------------------------------------------------"
echo "MySQL last time ${time} data size ：${SQL_RESULT}"
echo "-------------------------------------------------------------------------------------------"
if [ "${SQL_RESULT}" -ne 0 ] ; then
    mysql --user="${MySQL_USER}" --password="${MySQL_PASSWD}"  \
    -e "delete from hive_dwd.platform_stat where d = '${time}';"
    echo "-------------------------------------------------------------------------------------------"
    echo "Delete Mysql last time  ${time} data done ,next step do Sqoop : Hive 2 MySQL"
    echo "-------------------------------------------------------------------------------------------"
fi




#Sqoop platforma_state_Hive2MySQL 
echo "-------------------------------------------------------------------------------------------"
echo "step : sqoop --> Hive 2 MySQL "
echo "-------------------------------------------------------------------------------------------"
sqoop export \
--connect jdbc:mysql://localhost:3306/hive_dwd \
--password wsx123$%^ \
--username root \
--mapreduce-job-name  Platform_info_Hive2MySQL \
--columns "platform,cnt,d"   \
--input-fields-terminated-by '\001' \
--table platform_stat \
--export-dir /hadoop/project/platform_stat/day=${time} 

error=$?
if [ $error == 0 ] ; then
 echo "-------------------------------------------------------------------------------------------"
 echo "${time} work success"
 echo "-------------------------------------------------------------------------------------------"
fi

Job scheduling

The next post covers Azkaban.

Hadoop compression---double_happy

2018-01-01

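Tuning point:

Why use compression?

1. Space: data on HDFS is stored with 3 replicas, so compressed data takes up less room
2. Time: less network IO and less disk IO
  (in MapReduce the difference between compressing and not compressing the map output is very visible)
  2.1 Map output reaches the reducers through the shuffle. If the map output is compressed,
     less data crosses the network, which reduces network IO.
     The data also has to be read from disk into memory before it is sent; compressed data on disk
     means fewer bytes to read, which reduces disk IO.
  Transfer time drops considerably, so compression is worth having.

Keep in mind that compression costs CPU, so it fits when:

1. storage space is short
2. the machines have enough cores

If the cores are not there, do not use compression.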

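Compression techniques

Lossy compression: suitable for images and video, where losing a few frames is acceptable
Lossless compression: decompressing gives back the original data with nothing lost

Symmetric vs. asymmetric: symmetric means compression and decompression take about the same time; asymmetric means they do not.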

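Where compression fits in MapReduce

Compression can be applied to the map output and to the reduce output as well

input   
  When the map reads its data, the InputFormat by default detects which compression format the input uses and picks the codec
     (see the TextInputFormat source)
map out         set one parameter
reduce out    set one parameter

The same applies to Spark and Flink

Everything has two sides

Space and time: good
CPU: utilisation goes up, and the whole job takes slightly longer

With compression:
	there is a decompression step, so the whole job takes slightly longer

So the savings in space and in network and disk IO time are paid for with more CPU and a somewhat longer job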

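Common compression formats

Gzip, Bzip2, LZO and Snappy are the ones compared below; LZ4 is another.

With so many formats, how do you choose between compression ratio and decompression speed?

Test them on machines with the same configuration

Compression ratio: the ratio between the size before and after compression (the percentages below read as compressed size relative to the original size)

Compression ratio    Bzip2 about 30%, Gzip in between, Snappy / LZO about 50%
Decompression speed    the reverse order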

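Can compressed files be split?

Hadoop jobs are IO-intensive, so they should use compression wherever possible
Spark and Flink jobs are pipelined

Note: some codecs are written in Java and some are native,
so to use LZO (native) in Hadoop you need to install some native dependencies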

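Splittable:
	Think of one file as being handled by one map task.
	1. Take a 5 GB file. If the format cannot be split, the whole file is handled by a single
	map task. If it can be split, say into 10 pieces, 10 map tasks process it
	in parallel, each handling 5*1024/10 = 512 MB.

Whether a format can be split determines how much data one map task has to process;
a splittable format lets several map tasks work in parallel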
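The arithmetic above as a tiny sketch. It follows the simplified model from the text (equal-sized splits, one task per split) and is not Hadoop's actual split computation; mapTasks and mbPerTask are hypothetical helpers, and the 512 MB split size is just the figure implied by cutting 5 GB into 10 pieces.

```scala
object SplitDemo {
  // Number of map tasks for one file under the simplified model:
  // an unsplittable file gets one task, a splittable one gets one task per split.
  def mapTasks(fileMb: Long, splitMb: Long, splittable: Boolean): Long =
    if (!splittable) 1L else math.max(1L, (fileMb + splitMb - 1) / splitMb)

  // Data handled by each task in MB, assuming equally sized splits.
  def mbPerTask(fileMb: Long, tasks: Long): Long = fileMb / tasks

  def main(args: Array[String]): Unit = {
    val fileMb = 5L * 1024
    val tasks = mapTasks(fileMb, 512L, splittable = true)
    println(s"splittable: $tasks tasks, ${mbPerTask(fileMb, tasks)} MB each")
    println(s"not splittable: ${mapTasks(fileMb, 512L, splittable = false)} task")
  }
}
```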

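Which formats support splitting

Splitting: this is about whether the compressed file itself can be split
gzip    not splittable
bzip2  splittable
LZO    splittable with an index (not splittable by default)
Snappy not splittable

Whether a format can be split matters a great deal when choosing a codec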

Compression can be used at three points in the MapReduce flow:

input
map out
reduce out

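Recommended codecs for the three points:
	input:
		Bzip2 (splittable, so one file is read by several map tasks in parallel). With a very large input in a format that cannot be split,
		a single map task does all the work and performance is poor.
	
	map out:
		For the shuffle, pick a codec that compresses and decompresses fast.
		Each map task writes its output to disk and the data then goes over the network,
		so a high compression ratio is not needed. Splittability no longer matters at this point:
		the large input file was already split across map tasks on the way in,
		and nothing needs to be split again on the way to the reducers. What matters most here is decompression speed.
	reduce out:
		1. A high compression ratio saves space (suited to archived files)
		2. If the output feeds the next map, I would choose Bzip2 or indexed LZO (both splittable)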

Using compression in a MapReduce job, hands-on

Register the codecs in Hadoop's core-site.xml, and configure where compression is applied (map output and reduce output) in mapred-site.xml

core-site.xml:

  <property>
    <name>io.compression.codecs</name>
  <value>
  org.apache.hadoop.io.compress.GzipCodec,
  org.apache.hadoop.io.compress.DefaultCodec,
  org.apache.hadoop.io.compress.BZip2Codec,
  org.apache.hadoop.io.compress.SnappyCodec,
  com.hadoop.compression.lzo.LzoCodec,
  com.hadoop.compression.lzo.LzopCodec
  </value>
</property>

mapred-site.xml:
	 <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>

<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

Test:
run wordcount
the result files on HDFS end with the .bz2 extension

Using compression in Hive

1. Create the table:
CREATE EXTERNAL TABLE `ods_uid_pid_info_compression_test`(
`uid` string, 
`pid` string
)
row format delimited fields terminated by '\t';

load data local inpath '/home/double_happy/data/user_pid.txt' overwrite into table ods_uid_pid_info_compression_test;

2. Check the data on HDFS

Enable compression:
	in the Hive client:
	set hive.exec.compress.output=true;
	set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;

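Check the data size on HDFS again:

Compared with before, the data is smaller.

Note:
	Setting compression directly in hive-site.xml is not recommended, because that applies globally;
	it is better to set it per session with the set command when you need it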
SQL-01

2017-12-20

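SQL question 1: aggregate-function style questions

Let's first go over the most basic ideas in Hive SQL. They are a little obscure, but very useful.

My own understanding:
 SQL has two core operations: (1) group, which is about aggregation, and (2) join, which is about choosing the columns to join on
 Note: avoid mixing group by and over in the same select; window functions are evaluated after the grouping, so the combination is easy to get wrong

Why aggregate at all?
e.g. a table holds the students of three classes, with name, score and class. Without aggregating,
can I write this?
select name , score from student group by class  
No. Why not?

With group by class, each class collapses into one row,
so for name and score the engine cannot know which name or which score to pick.
Every selected item must therefore either appear after group by
or be an aggregate column (e.g. sum(score), the total score of each class),
and an aggregate column is produced by an aggregate function (UDAF).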

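Case 1: [aggregate-function style question]

e.g. find, for every class, the name, score and class of the student with the highest score.
 Approaches:
 1. SQL: group by + join
 2. RDD
 3. SQL: window function

Window function: one-to-many. The "one" is whatever you partition by; it produces many values (a new column that holds them)
group by: many-to-one, merging many rows into one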

1. group by + join
1) Subquery a: select class,max(score)  score from student group by class 
    Subquery a cannot include name, so to meet the requirement it has to be joined back to the original table to pick up name
2)
    select b.name, a.class,a.score from student b 
join 
    (select class,max(score) score from student group by class ) a
on 
 b.class =a.class and a.score = b.score

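2. Window function
For the same problem, add one more column to the table. Which column? First partition the rows by class,
then order the scores within each class from high to low, and use the resulting rank as the new column.
That way there is no group by as above; you just select that column. This is what a window function does.
   select name,class,score,rank() over(partition by class order by score desc)  rank from student 
The point of this statement is its last column. over defines the window. How is the window defined?
partition by forms the groups (the window): rank() is applied once within one class, and once again within
the next class. rank() requires an order by. The result has one extra column, named rank.
Running this statement produces a new table with that extra column. Suppose the table is called aa.
Once aa exists, simply run
 select name,class,score from aa where rank =1   and the requirement is met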
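The same requirement on a local collection, as a rough sketch of what "rank = 1 within each class" means. This is plain Scala, not Hive or Spark; the Student case class, the topPerClass helper and the sample rows are all made up for illustration. Like rank(), it keeps every student tied for the top score in a class.

```scala
object TopPerClassDemo {
  // `clazz` stands for the class column, since `class` is a Scala keyword.
  case class Student(name: String, clazz: String, score: Int)

  // Local equivalent of "rank() over (partition by class order by score desc) = 1":
  // within each class keep every student whose score equals the class maximum.
  def topPerClass(students: Seq[Student]): Seq[Student] =
    students.groupBy(_.clazz).toSeq.sortBy(_._1).flatMap { case (_, members) =>
      val best = members.map(_.score).max
      members.filter(_.score == best)
    }

  def main(args: Array[String]): Unit = {
    val students = Seq(
      Student("Ann", "c1", 90), Student("Bob", "c1", 85),
      Student("Cid", "c2", 70), Student("Dee", "c2", 70))
    topPerClass(students).foreach(println)
  }
}
```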

rank() is not the only function that can be used with a window; there are many more.

SQL question 01

Approach first, then the results

SQL1: the table ods_domain_traffic_info
domain           time     traffic(T)
gifshow.com   2019/01/01    5
yy.com        2019/01/01    4
huya.com      2019/01/01    1
gifshow.com   2019/01/20    6
gifshow.com   2019/02/01    8
yy.com        2019/01/20    5
gifshow.com   2019/02/02    7
Requirement: compute the cumulative traffic of each domain, in a single SQL statement
Expected result:
domain          month     traffics   totals
gifshow.com     2019-01      11         11
gifshow.com     2019-02      15         26
yy.com          2019-01       9         9
huya.com        2019-01       1         1

Approach:
	The result needs, for each domain and each month, traffics and totals
Two steps:
1. domain + month + traffics
a.
month : extracted from time
traffics : sum(traffic)
b.
group by (domain, month) + sum(traffic) ===> gives domain   month    traffics

2. From domain   month    traffics, get to domain   month    traffics  totals

a. totals is a newly generated column, so think of the window function over()
	partition by what? order by what?
	From the expected result: partition by domain, order by month, with sum(traffics) in front of over()
That does it

How I build the SQL:
1. Total per domain per month  ==>    domain   month  traffics
tmp :
	select 
		domain,substr(regexp_replace(time,"/","-"),1,7) as month, 
		sum(traffic) as traffics
	from ods_domain_traffic_info
	group by domain,substr(regexp_replace(time,"/","-"),1,7)
	
2. totals is a new column (one-to-many), so use a window function: partition by domain, order by month, then sum
result:
	select
		domain,month,traffics,
		sum(traffics)over(partition by domain order by month) as totals
	from  tmp;

Combined:
	select
		domain,month,traffics,
		sum(traffics)over(partition by domain order by month) as totals
	from(
	select 
		domain,substr(regexp_replace(time,"/","-"),1,7) as month, 
		sum(traffic) as traffics
	from ods_domain_traffic_info
	group by domain,substr(regexp_replace(time,"/","-"),1,7)
	) as tmp;
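To check the logic without a Hive cluster, the two steps can be replayed on a local collection. This is a sketch in plain Scala, not what Hive executes; the Row case class and monthlyTotals helper are hypothetical names, and the input tuples are the sample rows from the table above.

```scala
object RunningTotalDemo {
  case class Row(domain: String, month: String, traffics: Long, totals: Long)

  def monthlyTotals(records: Seq[(String, String, Long)]): Seq[Row] = {
    // Step 1: group by (domain, month) and sum the traffic, like the inner query.
    val monthly = records
      .groupBy { case (domain, time, _) => (domain, time.replace("/", "-").substring(0, 7)) }
      .map { case ((domain, month), rows) => (domain, month, rows.map(_._3).sum) }
      .toSeq
    // Step 2: per domain, order by month and accumulate, like
    // sum(traffics) over (partition by domain order by month).
    monthly.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (domain, rows) =>
      val ordered = rows.sortBy(_._2)
      val running = ordered.map(_._3).scanLeft(0L)(_ + _).tail
      ordered.zip(running).map { case ((_, month, traffics), totals) =>
        Row(domain, month, traffics, totals)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      ("gifshow.com", "2019/01/01", 5L), ("yy.com", "2019/01/01", 4L),
      ("huya.com", "2019/01/01", 1L), ("gifshow.com", "2019/01/20", 6L),
      ("gifshow.com", "2019/02/01", 8L), ("yy.com", "2019/01/20", 5L),
      ("gifshow.com", "2019/02/02", 7L))
    monthlyTotals(records).foreach(println)
  }
}
```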

Note: I used tabs to make the SQL readable in this post; if you want to run it, remove the tabs from the SQL first.

Results:
1. domain +time + traffic ---> domain + month + traffics
Total per domain per month  ==>    domain   month  traffics
tmp :
	select 
		domain,substr(regexp_replace(time,"/","-"),1,7) as month, 
		sum(traffic) as traffics
	from ods_domain_traffic_info
	group by domain,substr(regexp_replace(time,"/","-"),1,7)

0: jdbc:hive2://hadoop101:10000> select 
. . . . . . . . . . . . . . . .> domain,substr(regexp_replace(time,"/","-"),1,7) as month, sum(traffic) as traffics
. . . . . . . . . . . . . . . .> from ods_domain_traffic_info
. . . . . . . . . . . . . . . .> group by domain,substr(regexp_replace(time,"/","-"),1,7);
INFO  : Compiling command(queryId=double_happy_20190917140000_1dab3570-5d32-449f-ab8f-4c896be57622): select
domain,substr(regexp_replace(time,"/","-"),1,7) as month, sum(traffic) as traffics
from ods_domain_traffic_info
group by domain,substr(regexp_replace(time,"/","-"),1,7)
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:domain, type:string, comment:null), FieldSchema(name:month, type:string, comment:null), FieldSchema(name:traffics, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=double_happy_20190917140000_1dab3570-5d32-449f-ab8f-4c896be57622); Time taken: 0.498 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=double_happy_20190917140000_1dab3570-5d32-449f-ab8f-4c896be57622): select
domain,substr(regexp_replace(time,"/","-"),1,7) as month, sum(traffic) as traffics
from ods_domain_traffic_info
group by domain,substr(regexp_replace(time,"/","-"),1,7)
INFO  : Query ID = double_happy_20190917140000_1dab3570-5d32-449f-ab8f-4c896be57622
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0001, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0001/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0001
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:00:23,046 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:00:28,411 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.44 sec
INFO  : 2019-09-17 14:00:34,894 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.51 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 510 msec
INFO  : Ended Job = job_1568699800773_0001
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.51 sec   HDFS Read: 9195 HDFS Write: 82 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 510 msec
INFO  : Completed executing command(queryId=double_happy_20190917140000_1dab3570-5d32-449f-ab8f-4c896be57622); Time taken: 23.04 seconds
INFO  : OK
+--------------+----------+-----------+--+
|    domain    |  month   | traffics  |
+--------------+----------+-----------+--+
| gifshow.com  | 2019-01  | 11        |
| gifshow.com  | 2019-02  | 15        |
| huya.com     | 2019-01  | 1         |
| yy.com       | 2019-01  | 9         |
+--------------+----------+-----------+--+

2. totals is a new column (one-to-many), so use a window function: partition by domain, order by month, then sum

0: jdbc:hive2://hadoop101:10000> select
. . . . . . . . . . . . . . . .> domain,month,traffics,
. . . . . . . . . . . . . . . .> sum(traffics)over(partition by domain order by month) as totals
. . . . . . . . . . . . . . . .> from(
. . . . . . . . . . . . . . . .> select 
. . . . . . . . . . . . . . . .> domain,substr(regexp_replace(time,"/","-"),1,7) as month, sum(traffic) as traffics
. . . . . . . . . . . . . . . .> from ods_domain_traffic_info
. . . . . . . . . . . . . . . .> group by domain,substr(regexp_replace(time,"/","-"),1,7)
. . . . . . . . . . . . . . . .> ) as tmp;
INFO  : Compiling command(queryId=double_happy_20190917140202_f7fde489-9ae9-4ab5-85fd-591382b70891): select
domain,month,traffics,
sum(traffics)over(partition by domain order by month) as totals
from(
select
domain,substr(regexp_replace(time,"/","-"),1,7) as month, sum(traffic) as traffics
from ods_domain_traffic_info
group by domain,substr(regexp_replace(time,"/","-"),1,7)
) as tmp
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:domain, type:string, comment:null), FieldSchema(name:month, type:string, comment:null), FieldSchema(name:traffics, type:bigint, comment:null), FieldSchema(name:totals, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=double_happy_20190917140202_f7fde489-9ae9-4ab5-85fd-591382b70891); Time taken: 0.156 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=double_happy_20190917140202_f7fde489-9ae9-4ab5-85fd-591382b70891): select
domain,month,traffics,
sum(traffics)over(partition by domain order by month) as totals
from(
select
domain,substr(regexp_replace(time,"/","-"),1,7) as month, sum(traffic) as traffics
from ods_domain_traffic_info
group by domain,substr(regexp_replace(time,"/","-"),1,7)
) as tmp
INFO  : Query ID = double_happy_20190917140202_f7fde489-9ae9-4ab5-85fd-591382b70891
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0002, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0002/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0002
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:02:38,123 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:02:44,384 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.64 sec
INFO  : 2019-09-17 14:02:50,678 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.26 sec
INFO  : MapReduce Total cumulative CPU time: 4 seconds 260 msec
INFO  : Ended Job = job_1568699800773_0002
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.26 sec   HDFS Read: 11518 HDFS Write: 92 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 4 seconds 260 msec
INFO  : Completed executing command(queryId=double_happy_20190917140202_f7fde489-9ae9-4ab5-85fd-591382b70891); Time taken: 21.812 seconds
INFO  : OK
+--------------+----------+-----------+---------+--+
|    domain    |  month   | traffics  | totals  |
+--------------+----------+-----------+---------+--+
| gifshow.com  | 2019-01  | 11        | 11      |
| gifshow.com  | 2019-02  | 15        | 26      |
| huya.com     | 2019-01  | 1         | 1       |
| yy.com       | 2019-01  | 9         | 9       |
+--------------+----------+-----------+---------+--+
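
To make the window semantics concrete, here is a minimal Python sketch of what sum(traffics) over (partition by domain order by month) computes. It is only an illustration, not how Hive executes the query; the function name running_totals is made up for this example, and it assumes one row per (domain, month), which the inner group by guarantees. The input rows are copied from the result table above.

```python
from itertools import groupby
from operator import itemgetter

# (domain, month, traffics) rows, copied from the result table above.
rows = [
    ("gifshow.com", "2019-01", 11),
    ("gifshow.com", "2019-02", 15),
    ("huya.com", "2019-01", 1),
    ("yy.com", "2019-01", 9),
]


def running_totals(rows):
    """Emulate sum(traffics) over (partition by domain order by month)."""
    out = []
    # Sort by the partition key first, then by the order key.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, partition in groupby(ordered, key=itemgetter(0)):
        total = 0  # the running total restarts for every domain
        for domain, month, traffics in partition:
            total += traffics
            out.append((domain, month, traffics, total))
    return out


result = running_totals(rows)
for row in result:
    print(row)
```

Every input row survives and only gains a fourth value, which is the difference from a plain group by.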

Alternatively, skip the extra column and let the running total take the place of traffics:

0: jdbc:hive2://hadoop101:10000> select
. . . . . . . . . . . . . . . .> domain,month,
. . . . . . . . . . . . . . . .> sum(traffics)over(partition by domain order by month) as traffics
. . . . . . . . . . . . . . . .> from(
. . . . . . . . . . . . . . . .> select 
. . . . . . . . . . . . . . . .> domain,substr(regexp_replace(time,"/","-"),1,7) as month, sum(traffic) as traffics
. . . . . . . . . . . . . . . .> from ods_domain_traffic_info
. . . . . . . . . . . . . . . .> group by domain,substr(regexp_replace(time,"/","-"),1,7)
. . . . . . . . . . . . . . . .> ) as tmp;
INFO  : Compiling command(queryId=double_happy_20190917140505_20c693e7-e3af-4233-966a-94cdfc50388d): select
domain,month,
sum(traffics)over(partition by domain order by month) as traffics
from(
select
domain,substr(regexp_replace(time,"/","-"),1,7) as month, sum(traffic) as traffics
from ods_domain_traffic_info
group by domain,substr(regexp_replace(time,"/","-"),1,7)
) as tmp
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:domain, type:string, comment:null), FieldSchema(name:month, type:string, comment:null), FieldSchema(name:traffics, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=double_happy_20190917140505_20c693e7-e3af-4233-966a-94cdfc50388d); Time taken: 0.05 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=double_happy_20190917140505_20c693e7-e3af-4233-966a-94cdfc50388d): select
domain,month,
sum(traffics)over(partition by domain order by month) as traffics
from(
select
domain,substr(regexp_replace(time,"/","-"),1,7) as month, sum(traffic) as traffics
from ods_domain_traffic_info
group by domain,substr(regexp_replace(time,"/","-"),1,7)
) as tmp
INFO  : Query ID = double_happy_20190917140505_20c693e7-e3af-4233-966a-94cdfc50388d
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0003, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0003/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0003
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:05:17,501 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:05:22,722 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.47 sec
INFO  : 2019-09-17 14:05:29,989 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.97 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 970 msec
INFO  : Ended Job = job_1568699800773_0003
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.97 sec   HDFS Read: 11443 HDFS Write: 82 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 970 msec
INFO  : Completed executing command(queryId=double_happy_20190917140505_20c693e7-e3af-4233-966a-94cdfc50388d); Time taken: 21.456 seconds
INFO  : OK
+--------------+----------+-----------+--+
|    domain    |  month   | traffics  |
+--------------+----------+-----------+--+
| gifshow.com  | 2019-01  | 11        |
| gifshow.com  | 2019-02  | 26        |
| huya.com     | 2019-01  | 1         |
| yy.com       | 2019-01  | 9         |
+--------------+----------+-----------+--+

SQL2

SQL2:
uid pid
user1 a
user2 b
user1 c
user2 c
user3 c
user3 c
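1) UV ==> uid cnt. Most likely this means: for each pid, count how many distinct uids visited it. Or is it the other way around?
2) For each product, find the top 3 users ==> pid  uid  cnt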

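Approach:
(1) The requirement is ambiguous, so work out both readings.
	a. How many uids does each pid have?  ==> pid + uid
	   group by pid + count(distinct(uid))   (uid has to be deduplicated)
	b. How many pids did each uid visit?  ==> uid + pid
	   group by uid + count(distinct(pid))   (pid has to be deduplicated)

This one is simple: group by one column and count the distinct values of the other.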
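
As a rough illustration of both readings, the sketch below emulates group by with count(distinct ...) in plain Python over the six sample rows from the problem statement. The helper name count_distinct is hypothetical and exists only for this example.

```python
from collections import defaultdict

# (uid, pid) rows from the SQL2 sample above.
visits = [
    ("user1", "a"), ("user2", "b"), ("user1", "c"),
    ("user2", "c"), ("user3", "c"), ("user3", "c"),
]


def count_distinct(pairs, key_index, value_index):
    """Emulate: select key, count(distinct value) ... group by key."""
    seen = defaultdict(set)
    for pair in pairs:
        # A set drops repeated values, which is the "distinct" part.
        seen[pair[key_index]].add(pair[value_index])
    return {key: len(values) for key, values in sorted(seen.items())}


uids_per_pid = count_distinct(visits, key_index=1, value_index=0)
pids_per_uid = count_distinct(visits, key_index=0, value_index=1)
print(uids_per_pid)  # {'a': 1, 'b': 1, 'c': 3}
print(pids_per_uid)  # {'user1': 2, 'user2': 2, 'user3': 1}
```

Note that user3 visited c twice but still counts once for c, which is why the distinct matters.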

Results for the group by uid reading (the test data I generated has only 3 pids, a/b/c, and 100 users):

0: jdbc:hive2://hadoop101:10000> select uid ,count(distinct(pid)) as cnt
. . . . . . . . . . . . . . . .> from ods_uid_pid_info
. . . . . . . . . . . . . . . .> group by uid;
INFO  : Compiling command(queryId=double_happy_20190917141414_3c87306d-3ebd-49b1-9371-8fdba5629b97): select uid ,count(distinct(pid)) as cnt
from ods_uid_pid_info
group by uid
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:uid, type:string, comment:null), FieldSchema(name:cnt, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=double_happy_20190917141414_3c87306d-3ebd-49b1-9371-8fdba5629b97); Time taken: 0.055 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=double_happy_20190917141414_3c87306d-3ebd-49b1-9371-8fdba5629b97): select uid ,count(distinct(pid)) as cnt
from ods_uid_pid_info
group by uid
INFO  : Query ID = double_happy_20190917141414_3c87306d-3ebd-49b1-9371-8fdba5629b97
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0004, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0004/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0004
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:14:52,471 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:14:57,670 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.05 sec
INFO  : 2019-09-17 14:15:03,926 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.69 sec
INFO  : MapReduce Total cumulative CPU time: 2 seconds 690 msec
INFO  : Ended Job = job_1568699800773_0004
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.69 sec   HDFS Read: 17780 HDFS Write: 846 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 2 seconds 690 msec
INFO  : Completed executing command(queryId=double_happy_20190917141414_3c87306d-3ebd-49b1-9371-8fdba5629b97); Time taken: 19.383 seconds
INFO  : OK
+----------+------+--+
|   uid    | cnt  |
+----------+------+--+
| user0    | 1    |
| user1    | 3    |
| user10   | 3    |
| user100  | 3    |
| user11   | 3    |
| user12   | 3    |
| user13   | 3    |
| user14   | 3    |
| user15   | 3    |
| user16   | 3    |
| user17   | 3    |
| user18   | 1    |
| user19   | 2    |
| user2    | 3    |
| user20   | 3    |
| user21   | 3    |
| user22   | 3    |
| user24   | 3    |
| user25   | 3    |
| user26   | 3    |
| user27   | 3    |
| user28   | 3    |
| user29   | 3    |
| user3    | 3    |
| user30   | 3    |
| user31   | 2    |
| user32   | 3    |
| user33   | 3    |
| user34   | 3    |
| user36   | 2    |
| user37   | 3    |
| user38   | 3    |
| user39   | 1    |
| user4    | 3    |
| user41   | 3    |
| user42   | 1    |
| user43   | 1    |
| user44   | 3    |
| user45   | 3    |
| user46   | 3    |
| user47   | 2    |
| user48   | 3    |
| user49   | 2    |
| user5    | 3    |
| user50   | 3    |
| user51   | 3    |
| user52   | 3    |
| user54   | 3    |
| user55   | 3    |
| user57   | 1    |
| user58   | 3    |
| user59   | 2    |
| user6    | 3    |
| user60   | 1    |
| user61   | 3    |
| user62   | 3    |
| user63   | 3    |
| user64   | 3    |
| user65   | 2    |
| user66   | 3    |
| user67   | 3    |
| user68   | 1    |
| user69   | 2    |
| user7    | 3    |
| user70   | 3    |
| user71   | 3    |
| user72   | 1    |
| user73   | 3    |
| user74   | 3    |
| user75   | 3    |
| user76   | 2    |
| user77   | 3    |
| user78   | 3    |
| user79   | 3    |
| user8    | 3    |
| user80   | 3    |
| user81   | 3    |
| user82   | 1    |
| user83   | 3    |
| user84   | 1    |
| user85   | 3    |
| user86   | 3    |
| user87   | 3    |
| user88   | 3    |
| user9    | 3    |
| user90   | 3    |
| user91   | 3    |
| user92   | 3    |
| user93   | 2    |
| user94   | 3    |
| user95   | 3    |
| user96   | 3    |
| user97   | 3    |
| user98   | 3    |
| user99   | 3    |
+----------+------+--+

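(2) For each product, find the top 3 users ==>  pid  uid  cnt

Approach:
1. pid  uid  cnt  top3
That is: for each pid, the visit count of each uid, then take the top 3.
Two steps:

step1: visit count for each (pid, uid) pair
step2: take the top 3 from the result of step1

step1: visit count for each (pid, uid) pair
group by (pid, uid) + count(uid)    (group by already collapses each (pid, uid) pair into a single row)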

tmp:
	select pid,uid,count(uid) as count 
	from ods_uid_pid_info
	group by pid,uid
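
A small Python sketch of what this step1 query produces, again using the six sample rows from the problem statement rather than the real table. It is only meant to show the shape of tmp; the variable names are illustrative.

```python
from collections import Counter

# (uid, pid) rows from the SQL2 sample above.
visits = [
    ("user1", "a"), ("user2", "b"), ("user1", "c"),
    ("user2", "c"), ("user3", "c"), ("user3", "c"),
]

# Emulate: select pid, uid, count(uid) ... group by pid, uid
pair_counts = Counter((pid, uid) for uid, pid in visits)

for (pid, uid), cnt in sorted(pair_counts.items()):
    print(pid, uid, cnt)
```

Six input rows become five output rows, because the two (user3, c) visits fold into one row with a count of 2.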

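step2: starting from step1 (pid, uid, count), take the top 3
Goal: Top 3 => on top of step1, partition by pid and order by count to produce a new column, rank (a window function, one-to-many)

This needs a window function: rank() or row_number() + over
partition by what? order by what?
The requirement is the top 3 users for each product,
so it is partition by pid order by count    (note: this count is the per-pid, per-uid count produced by step1)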

result_tmp:
	select pid,uid,count, rank()over(partition by pid order by count desc) as rank
	from tmp;  -- ties share the same rank

result_tmp:
	select pid,uid,count, row_number()over(partition by pid order by count desc) as rank
	from tmp;   -- no ties, every row gets its own number

step3: filter the result of step2 with where rank <= 3

result: two variants, one for each result_tmp above
	select pid,uid,count
	from result_tmp
	where  rank<=3;
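
The difference between the two variants only shows up when counts tie. The sketch below emulates both in plain Python, assuming a few (pid, uid, count) rows taken from the top of the pid b and pid c partitions in the rank() output further down. The function name top_n and its ties flag are invented for this illustration. In Hive, the order of tied rows under row_number() is not guaranteed; here ties simply keep their input order.

```python
from itertools import groupby
from operator import itemgetter

# (pid, uid, count) rows, a small subset of the step1 result.
tmp = [
    ("b", "user37", 10), ("b", "user61", 10), ("b", "user15", 9),
    ("b", "user79", 9), ("b", "user7", 8),
    ("c", "user66", 13), ("c", "user94", 12), ("c", "user81", 11),
    ("c", "user7", 10),
]


def top_n(rows, n, ties):
    """ties=True emulates rank(), ties=False emulates row_number()."""
    out = []
    # partition by pid, order by count desc
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    for _, partition in groupby(ordered, key=itemgetter(0)):
        prev_count, rank = None, 0
        for position, (pid, uid, count) in enumerate(partition, start=1):
            # rank() reuses the rank of the previous row on a tie,
            # row_number() always moves on to the next position.
            if not ties or count != prev_count:
                rank = position
            prev_count = count
            if rank <= n:  # the step3 filter: where rank <= 3
                out.append((pid, uid, count, rank))
    return out


with_rank = top_n(tmp, 3, ties=True)
with_row_number = top_n(tmp, 3, ties=False)
print(len(with_rank), len(with_row_number))  # 7 6
```

With rank(), pid b returns four rows (ranks 1, 1, 3, 3), while row_number() always returns exactly three per pid. Which one is right depends on whether tied users should all be reported.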

step1 demo: visit count for each (pid, uid) pair
0: jdbc:hive2://hadoop101:10000> select pid,uid,count(uid) as count 
. . . . . . . . . . . . . . . .> from ods_uid_pid_info
. . . . . . . . . . . . . . . .> group by pid,uid ;
INFO  : Compiling command(queryId=double_happy_20190917144747_4703dd87-bc51-4175-8551-b4ffcc78ed11): select pid,uid,count(uid) as count
from ods_uid_pid_info
group by pid,uid
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:pid, type:string, comment:null), FieldSchema(name:uid, type:string, comment:null), FieldSchema(name:count, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=double_happy_20190917144747_4703dd87-bc51-4175-8551-b4ffcc78ed11); Time taken: 0.046 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=double_happy_20190917144747_4703dd87-bc51-4175-8551-b4ffcc78ed11): select pid,uid,count(uid) as count
from ods_uid_pid_info
group by pid,uid
INFO  : Query ID = double_happy_20190917144747_4703dd87-bc51-4175-8551-b4ffcc78ed11
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0007, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0007/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0007
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:47:44,611 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:47:49,855 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.07 sec
INFO  : 2019-09-17 14:47:56,080 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.46 sec
INFO  : MapReduce Total cumulative CPU time: 2 seconds 460 msec
INFO  : Ended Job = job_1568699800773_0007
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.46 sec   HDFS Read: 17957 HDFS Write: 2767 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 2 seconds 460 msec
INFO  : Completed executing command(queryId=double_happy_20190917144747_4703dd87-bc51-4175-8551-b4ffcc78ed11); Time taken: 19.001 seconds
INFO  : OK
+------+----------+--------+--+
| pid  |   uid    | count  |
+------+----------+--------+--+
| a    | user1    | 4      |
| a    | user10   | 2      |
| a    | user100  | 5      |
| a    | user11   | 8      |
| a    | user12   | 4      |
| a    | user13   | 3      |
| a    | user14   | 6      |
| a    | user15   | 11     |
| a    | user16   | 6      |
| a    | user17   | 5      |
| a    | user19   | 2      |
| a    | user2    | 2      |
| a    | user20   | 3      |
| a    | user21   | 1      |
| a    | user22   | 5      |
| a    | user24   | 7      |
| a    | user25   | 4      |
| a    | user26   | 6      |
| a    | user27   | 2      |
| a    | user28   | 1      |
| a    | user29   | 4      |
| a    | user3    | 3      |
| a    | user30   | 4      |
| a    | user31   | 5      |
| a    | user32   | 4      |
| a    | user33   | 2      |
| a    | user34   | 1      |
| a    | user36   | 1      |
| a    | user37   | 7      |
| a    | user38   | 5      |
| a    | user4    | 3      |
| a    | user41   | 1      |
| a    | user43   | 1      |
| a    | user44   | 2      |
| a    | user45   | 3      |
| a    | user46   | 2      |
| a    | user47   | 2      |
| a    | user48   | 2      |
| a    | user49   | 3      |
| a    | user5    | 10     |
| a    | user50   | 3      |
| a    | user51   | 4      |
| a    | user52   | 1      |
| a    | user54   | 9      |
| a    | user55   | 6      |
| a    | user58   | 1      |
| a    | user59   | 3      |
| a    | user6    | 3      |
| a    | user61   | 2      |
| a    | user62   | 11     |
| a    | user63   | 3      |
| a    | user64   | 4      |
| a    | user65   | 2      |
| a    | user66   | 3      |
| a    | user67   | 2      |
| a    | user69   | 2      |
| a    | user7    | 4      |
| a    | user70   | 3      |
| a    | user71   | 6      |
| a    | user72   | 2      |
| a    | user73   | 6      |
| a    | user74   | 2      |
| a    | user75   | 2      |
| a    | user76   | 2      |
| a    | user77   | 3      |
| a    | user78   | 7      |
| a    | user79   | 7      |
| a    | user8    | 1      |
| a    | user80   | 2      |
| a    | user81   | 6      |
| a    | user82   | 1      |
| a    | user83   | 5      |
| a    | user85   | 3      |
| a    | user86   | 5      |
| a    | user87   | 8      |
| a    | user88   | 5      |
| a    | user9    | 1      |
| a    | user90   | 2      |
| a    | user91   | 4      |
| a    | user92   | 2      |
| a    | user93   | 1      |
| a    | user94   | 6      |
| a    | user95   | 2      |
| a    | user96   | 6      |
| a    | user97   | 6      |
| a    | user98   | 3      |
| a    | user99   | 4      |
| b    | user0    | 2      |
| b    | user1    | 7      |
| b    | user10   | 1      |
| b    | user100  | 4      |
| b    | user11   | 6      |
| b    | user12   | 3      |
| b    | user13   | 4      |
| b    | user14   | 4      |
| b    | user15   | 9      |
| b    | user16   | 6      |
| b    | user17   | 4      |
| b    | user19   | 6      |
| b    | user2    | 5      |
+------+----------+--------+--+
| pid  |   uid    | count  |
+------+----------+--------+--+
| b    | user20   | 1      |
| b    | user21   | 8      |
| b    | user22   | 8      |
| b    | user24   | 2      |
| b    | user25   | 1      |
| b    | user26   | 7      |
| b    | user27   | 2      |
| b    | user28   | 2      |
| b    | user29   | 4      |
| b    | user3    | 8      |
| b    | user30   | 3      |
| b    | user32   | 3      |
| b    | user33   | 2      |
| b    | user34   | 4      |
| b    | user36   | 1      |
| b    | user37   | 10     |
| b    | user38   | 3      |
| b    | user4    | 5      |
| b    | user41   | 1      |
| b    | user44   | 1      |
| b    | user45   | 1      |
| b    | user46   | 2      |
| b    | user47   | 1      |
| b    | user48   | 7      |
| b    | user49   | 2      |
| b    | user5    | 8      |
| b    | user50   | 6      |
| b    | user51   | 7      |
| b    | user52   | 3      |
| b    | user54   | 2      |
| b    | user55   | 1      |
| b    | user58   | 5      |
| b    | user59   | 1      |
| b    | user6    | 1      |
| b    | user60   | 4      |
| b    | user61   | 10     |
| b    | user62   | 4      |
| b    | user63   | 4      |
| b    | user64   | 1      |
| b    | user65   | 1      |
| b    | user66   | 7      |
| b    | user67   | 1      |
| b    | user68   | 1      |
| b    | user69   | 8      |
| b    | user7    | 8      |
| b    | user70   | 3      |
| b    | user71   | 4      |
| b    | user73   | 5      |
| b    | user74   | 1      |
| b    | user75   | 3      |
| b    | user77   | 8      |
| b    | user78   | 2      |
| b    | user79   | 9      |
| b    | user8    | 1      |
| b    | user80   | 6      |
| b    | user81   | 6      |
| b    | user83   | 3      |
| b    | user84   | 4      |
| b    | user85   | 6      |
| b    | user86   | 5      |
| b    | user87   | 6      |
| b    | user88   | 4      |
| b    | user9    | 1      |
| b    | user90   | 4      |
| b    | user91   | 2      |
| b    | user92   | 5      |
| b    | user94   | 3      |
| b    | user95   | 3      |
| b    | user96   | 4      |
| b    | user97   | 7      |
| b    | user98   | 3      |
| b    | user99   | 3      |
| c    | user1    | 5      |
| c    | user10   | 7      |
| c    | user100  | 2      |
| c    | user11   | 4      |
| c    | user12   | 5      |
| c    | user13   | 3      |
| c    | user14   | 2      |
| c    | user15   | 6      |
| c    | user16   | 3      |
| c    | user17   | 3      |
| c    | user18   | 1      |
| c    | user2    | 2      |
| c    | user20   | 2      |
| c    | user21   | 4      |
| c    | user22   | 4      |
| c    | user24   | 4      |
| c    | user25   | 4      |
| c    | user26   | 6      |
| c    | user27   | 4      |
| c    | user28   | 1      |
| c    | user29   | 7      |
| c    | user3    | 3      |
| c    | user30   | 1      |
| c    | user31   | 1      |
| c    | user32   | 3      |
| c    | user33   | 1      |
| c    | user34   | 2      |
| c    | user37   | 6      |
+------+----------+--------+--+
| pid  |   uid    | count  |
+------+----------+--------+--+
| c    | user38   | 6      |
| c    | user39   | 2      |
| c    | user4    | 9      |
| c    | user41   | 2      |
| c    | user42   | 2      |
| c    | user44   | 1      |
| c    | user45   | 3      |
| c    | user46   | 3      |
| c    | user48   | 6      |
| c    | user5    | 9      |
| c    | user50   | 6      |
| c    | user51   | 8      |
| c    | user52   | 4      |
| c    | user54   | 3      |
| c    | user55   | 2      |
| c    | user57   | 1      |
| c    | user58   | 2      |
| c    | user6    | 1      |
| c    | user61   | 6      |
| c    | user62   | 4      |
| c    | user63   | 3      |
| c    | user64   | 2      |
| c    | user66   | 13     |
| c    | user67   | 4      |
| c    | user7    | 10     |
| c    | user70   | 5      |
| c    | user71   | 6      |
| c    | user73   | 7      |
| c    | user74   | 4      |
| c    | user75   | 4      |
| c    | user76   | 1      |
| c    | user77   | 6      |
| c    | user78   | 3      |
| c    | user79   | 7      |
| c    | user8    | 2      |
| c    | user80   | 2      |
| c    | user81   | 11     |
| c    | user83   | 2      |
| c    | user85   | 2      |
| c    | user86   | 5      |
| c    | user87   | 4      |
| c    | user88   | 4      |
| c    | user9    | 1      |
| c    | user90   | 4      |
| c    | user91   | 2      |
| c    | user92   | 4      |
| c    | user93   | 1      |
| c    | user94   | 12     |
| c    | user95   | 3      |
| c    | user96   | 4      |
| c    | user97   | 6      |
| c    | user98   | 3      |
| c    | user99   | 5      |
+------+----------+--------+--+

step2: goal is Top 3 => on top of step1, partition by pid and order by count to produce a new column, rank

0: jdbc:hive2://hadoop101:10000> select pid,uid,count, rank()over(partition by pid order by count desc) as rank
. . . . . . . . . . . . . . . .> from(
. . . . . . . . . . . . . . . .> select pid,uid,count(uid) as count 
. . . . . . . . . . . . . . . .> from ods_uid_pid_info
. . . . . . . . . . . . . . . .> group by pid,uid 
. . . . . . . . . . . . . . . .> )as tmp;  
INFO  : Compiling command(queryId=double_happy_20190917144949_8fd22282-a9be-4e95-aa2a-b98121f6354d): select pid,uid,count, rank()over(partition by pid order by count desc) as rank
from(
select pid,uid,count(uid) as count
from ods_uid_pid_info
group by pid,uid
)as tmp
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:pid, type:string, comment:null), FieldSchema(name:uid, type:string, comment:null), FieldSchema(name:count, type:bigint, comment:null), FieldSchema(name:rank, type:int, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=double_happy_20190917144949_8fd22282-a9be-4e95-aa2a-b98121f6354d); Time taken: 0.054 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=double_happy_20190917144949_8fd22282-a9be-4e95-aa2a-b98121f6354d): select pid,uid,count, rank()over(partition by pid order by count desc) as rank
from(
select pid,uid,count(uid) as count
from ods_uid_pid_info
group by pid,uid
)as tmp
INFO  : Query ID = double_happy_20190917144949_8fd22282-a9be-4e95-aa2a-b98121f6354d
INFO  : Total jobs = 2
INFO  : Launching Job 1 out of 2
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0008, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0008/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0008
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:49:58,539 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:50:03,879 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.05 sec
INFO  : 2019-09-17 14:50:10,134 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.98 sec
INFO  : MapReduce Total cumulative CPU time: 2 seconds 980 msec
INFO  : Ended Job = job_1568699800773_0008
INFO  : Launching Job 2 out of 2
INFO  : Starting task [Stage-2:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0009, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0009/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0009
INFO  : Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:50:17,507 Stage-2 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:50:23,732 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.94 sec
INFO  : 2019-09-17 14:50:29,976 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.39 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 390 msec
INFO  : Ended Job = job_1568699800773_0009
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.98 sec   HDFS Read: 17068 HDFS Write: 6962 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.39 sec   HDFS Read: 14419 HDFS Write: 3494 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 6 seconds 370 msec
INFO  : Completed executing command(queryId=double_happy_20190917144949_8fd22282-a9be-4e95-aa2a-b98121f6354d); Time taken: 39.409 seconds
INFO  : OK
+------+----------+--------+-------+--+
| pid  |   uid    | count  | rank  |
+------+----------+--------+-------+--+
| a    | user15   | 11     | 1     |
| a    | user62   | 11     | 1     |
| a    | user5    | 10     | 3     |
| a    | user54   | 9      | 4     |
| a    | user87   | 8      | 5     |
| a    | user11   | 8      | 5     |
| a    | user37   | 7      | 7     |
| a    | user24   | 7      | 7     |
| a    | user78   | 7      | 7     |
| a    | user79   | 7      | 7     |
| a    | user55   | 6      | 11    |
| a    | user96   | 6      | 11    |
| a    | user94   | 6      | 11    |
| a    | user16   | 6      | 11    |
| a    | user81   | 6      | 11    |
| a    | user26   | 6      | 11    |
| a    | user97   | 6      | 11    |
| a    | user71   | 6      | 11    |
| a    | user14   | 6      | 11    |
| a    | user73   | 6      | 11    |
| a    | user86   | 5      | 21    |
| a    | user38   | 5      | 21    |
| a    | user88   | 5      | 21    |
| a    | user22   | 5      | 21    |
| a    | user83   | 5      | 21    |
| a    | user31   | 5      | 21    |
| a    | user100  | 5      | 21    |
| a    | user17   | 5      | 21    |
| a    | user12   | 4      | 29    |
| a    | user7    | 4      | 29    |
| a    | user25   | 4      | 29    |
| a    | user29   | 4      | 29    |
| a    | user30   | 4      | 29    |
| a    | user32   | 4      | 29    |
| a    | user99   | 4      | 29    |
| a    | user91   | 4      | 29    |
| a    | user51   | 4      | 29    |
| a    | user64   | 4      | 29    |
| a    | user1    | 4      | 29    |
| a    | user4    | 3      | 40    |
| a    | user50   | 3      | 40    |
| a    | user66   | 3      | 40    |
| a    | user3    | 3      | 40    |
| a    | user85   | 3      | 40    |
| a    | user77   | 3      | 40    |
| a    | user98   | 3      | 40    |
| a    | user59   | 3      | 40    |
| a    | user6    | 3      | 40    |
| a    | user20   | 3      | 40    |
| a    | user70   | 3      | 40    |
| a    | user63   | 3      | 40    |
| a    | user45   | 3      | 40    |
| a    | user13   | 3      | 40    |
| a    | user49   | 3      | 40    |
| a    | user65   | 2      | 55    |
| a    | user19   | 2      | 55    |
| a    | user2    | 2      | 55    |
| a    | user27   | 2      | 55    |
| a    | user33   | 2      | 55    |
| a    | user44   | 2      | 55    |
| a    | user46   | 2      | 55    |
| a    | user47   | 2      | 55    |
| a    | user48   | 2      | 55    |
| a    | user61   | 2      | 55    |
| a    | user67   | 2      | 55    |
| a    | user69   | 2      | 55    |
| a    | user72   | 2      | 55    |
| a    | user74   | 2      | 55    |
| a    | user75   | 2      | 55    |
| a    | user76   | 2      | 55    |
| a    | user80   | 2      | 55    |
| a    | user90   | 2      | 55    |
| a    | user92   | 2      | 55    |
| a    | user95   | 2      | 55    |
| a    | user10   | 2      | 55    |
| a    | user52   | 1      | 76    |
| a    | user9    | 1      | 76    |
| a    | user41   | 1      | 76    |
| a    | user93   | 1      | 76    |
| a    | user36   | 1      | 76    |
| a    | user34   | 1      | 76    |
| a    | user28   | 1      | 76    |
| a    | user21   | 1      | 76    |
| a    | user43   | 1      | 76    |
| a    | user82   | 1      | 76    |
| a    | user8    | 1      | 76    |
| a    | user58   | 1      | 76    |
| b    | user37   | 10     | 1     |
| b    | user61   | 10     | 1     |
| b    | user15   | 9      | 3     |
| b    | user79   | 9      | 3     |
| b    | user7    | 8      | 5     |
| b    | user21   | 8      | 5     |
| b    | user22   | 8      | 5     |
| b    | user69   | 8      | 5     |
| b    | user5    | 8      | 5     |
| b    | user3    | 8      | 5     |
| b    | user77   | 8      | 5     |
| b    | user51   | 7      | 12    |
| b    | user48   | 7      | 12    |
+------+----------+--------+-------+--+
| pid  |   uid    | count  | rank  |
+------+----------+--------+-------+--+
| b    | user26   | 7      | 12    |
| b    | user97   | 7      | 12    |
| b    | user66   | 7      | 12    |
| b    | user1    | 7      | 12    |
| b    | user50   | 6      | 18    |
| b    | user87   | 6      | 18    |
| b    | user85   | 6      | 18    |
| b    | user81   | 6      | 18    |
| b    | user80   | 6      | 18    |
| b    | user19   | 6      | 18    |
| b    | user16   | 6      | 18    |
| b    | user11   | 6      | 18    |
| b    | user2    | 5      | 26    |
| b    | user92   | 5      | 26    |
| b    | user58   | 5      | 26    |
| b    | user73   | 5      | 26    |
| b    | user4    | 5      | 26    |
| b    | user86   | 5      | 26    |
| b    | user88   | 4      | 32    |
| b    | user71   | 4      | 32    |
| b    | user29   | 4      | 32    |
| b    | user84   | 4      | 32    |
| b    | user13   | 4      | 32    |
| b    | user17   | 4      | 32    |
| b    | user100  | 4      | 32    |
| b    | user34   | 4      | 32    |
| b    | user14   | 4      | 32    |
| b    | user96   | 4      | 32    |
| b    | user90   | 4      | 32    |
| b    | user63   | 4      | 32    |
| b    | user62   | 4      | 32    |
| b    | user60   | 4      | 32    |
| b    | user12   | 3      | 46    |
| b    | user99   | 3      | 46    |
| b    | user94   | 3      | 46    |
| b    | user95   | 3      | 46    |
| b    | user98   | 3      | 46    |
| b    | user75   | 3      | 46    |
| b    | user83   | 3      | 46    |
| b    | user52   | 3      | 46    |
| b    | user32   | 3      | 46    |
| b    | user30   | 3      | 46    |
| b    | user70   | 3      | 46    |
| b    | user38   | 3      | 46    |
| b    | user54   | 2      | 58    |
| b    | user28   | 2      | 58    |
| b    | user27   | 2      | 58    |
| b    | user49   | 2      | 58    |
| b    | user46   | 2      | 58    |
| b    | user24   | 2      | 58    |
| b    | user33   | 2      | 58    |
| b    | user0    | 2      | 58    |
| b    | user91   | 2      | 58    |
| b    | user78   | 2      | 58    |
| b    | user67   | 1      | 68    |
| b    | user65   | 1      | 68    |
| b    | user64   | 1      | 68    |
| b    | user6    | 1      | 68    |
| b    | user59   | 1      | 68    |
| b    | user55   | 1      | 68    |
| b    | user74   | 1      | 68    |
| b    | user8    | 1      | 68    |
| b    | user47   | 1      | 68    |
| b    | user45   | 1      | 68    |
| b    | user44   | 1      | 68    |
| b    | user41   | 1      | 68    |
| b    | user36   | 1      | 68    |
| b    | user25   | 1      | 68    |
| b    | user9    | 1      | 68    |
| b    | user20   | 1      | 68    |
| b    | user10   | 1      | 68    |
| b    | user68   | 1      | 68    |
| c    | user66   | 13     | 1     |
| c    | user94   | 12     | 2     |
| c    | user81   | 11     | 3     |
| c    | user7    | 10     | 4     |
| c    | user5    | 9      | 5     |
| c    | user4    | 9      | 5     |
| c    | user51   | 8      | 7     |
| c    | user79   | 7      | 8     |
| c    | user10   | 7      | 8     |
| c    | user73   | 7      | 8     |
| c    | user29   | 7      | 8     |
| c    | user38   | 6      | 12    |
| c    | user37   | 6      | 12    |
| c    | user97   | 6      | 12    |
| c    | user15   | 6      | 12    |
| c    | user77   | 6      | 12    |
| c    | user61   | 6      | 12    |
| c    | user50   | 6      | 12    |
| c    | user26   | 6      | 12    |
| c    | user48   | 6      | 12    |
| c    | user71   | 6      | 12    |
| c    | user99   | 5      | 22    |
| c    | user12   | 5      | 22    |
| c    | user1    | 5      | 22    |
| c    | user70   | 5      | 22    |
| c    | user86   | 5      | 22    |
| c    | user75   | 4      | 27    |
| c    | user87   | 4      | 27    |
| c    | user74   | 4      | 27    |
| c    | user67   | 4      | 27    |
| c    | user21   | 4      | 27    |
| c    | user62   | 4      | 27    |
| c    | user88   | 4      | 27    |
| c    | user96   | 4      | 27    |
| c    | user92   | 4      | 27    |
| c    | user90   | 4      | 27    |
| c    | user27   | 4      | 27    |
| c    | user25   | 4      | 27    |
| c    | user24   | 4      | 27    |
| c    | user22   | 4      | 27    |
| c    | user52   | 4      | 27    |
| c    | user11   | 4      | 27    |
| c    | user63   | 3      | 43    |
| c    | user45   | 3      | 43    |
| c    | user95   | 3      | 43    |
| c    | user46   | 3      | 43    |
| c    | user32   | 3      | 43    |
| c    | user54   | 3      | 43    |
| c    | user3    | 3      | 43    |
| c    | user98   | 3      | 43    |
| c    | user17   | 3      | 43    |
| c    | user16   | 3      | 43    |
| c    | user78   | 3      | 43    |
| c    | user13   | 3      | 43    |
| c    | user83   | 2      | 55    |
| c    | user85   | 2      | 55    |
| c    | user58   | 2      | 55    |
| c    | user55   | 2      | 55    |
| c    | user42   | 2      | 55    |
| c    | user100  | 2      | 55    |
| c    | user91   | 2      | 55    |
| c    | user14   | 2      | 55    |
| c    | user8    | 2      | 55    |
| c    | user80   | 2      | 55    |
| c    | user39   | 2      | 55    |
| c    | user64   | 2      | 55    |
| c    | user20   | 2      | 55    |
| c    | user41   | 2      | 55    |
| c    | user2    | 2      | 55    |
| c    | user34   | 2      | 55    |
| c    | user18   | 1      | 71    |
| c    | user30   | 1      | 71    |
| c    | user28   | 1      | 71    |
| c    | user44   | 1      | 71    |
| c    | user57   | 1      | 71    |
| c    | user6    | 1      | 71    |
| c    | user76   | 1      | 71    |
| c    | user9    | 1      | 71    |
| c    | user93   | 1      | 71    |
| c    | user31   | 1      | 71    |
| c    | user33   | 1      | 71    |
+------+----------+--------+-------+--+

step3: filter the result of step 2 with where rank <= 3

Putting it together:
result (shown in two forms):
	select pid,uid,count
	from(
	select pid,uid,count, rank()over(partition by pid order by count desc) as rank
	from(
	select pid,uid,count(uid) as count 
	from ods_uid_pid_info
	group by pid,uid 
	)as tmp
	)as result_tmp
	where  rank<=3;

Result:

0: jdbc:hive2://hadoop101:10000> select pid,uid,count
. . . . . . . . . . . . . . . .> from(
. . . . . . . . . . . . . . . .> select pid,uid,count, rank()over(partition by pid order by count desc) as rank
. . . . . . . . . . . . . . . .> from(
. . . . . . . . . . . . . . . .> select pid,uid,count(uid) as count 
. . . . . . . . . . . . . . . .> from ods_uid_pid_info
. . . . . . . . . . . . . . . .> group by pid,uid 
. . . . . . . . . . . . . . . .> )as tmp
. . . . . . . . . . . . . . . .> )as result_tmp
. . . . . . . . . . . . . . . .> where  rank<=3;
INFO  : Compiling command(queryId=double_happy_20190917143131_3e5d7235-f034-4022-bea0-b86d6464a437): select pid,uid,count
from(
select pid,uid,count, rank()over(partition by pid order by count desc) as rank
from(
select pid,uid,count(uid) as count
from ods_uid_pid_info
group by pid,uid
)as tmp
)as result_tmp
where  rank<=3
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:pid, type:string, comment:null), FieldSchema(name:uid, type:string, comment:null), FieldSchema(name:count, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=double_happy_20190917143131_3e5d7235-f034-4022-bea0-b86d6464a437); Time taken: 0.096 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=double_happy_20190917143131_3e5d7235-f034-4022-bea0-b86d6464a437): select pid,uid,count
from(
select pid,uid,count, rank()over(partition by pid order by count desc) as rank
from(
select pid,uid,count(uid) as count
from ods_uid_pid_info
group by pid,uid
)as tmp
)as result_tmp
where  rank<=3
INFO  : Query ID = double_happy_20190917143131_3e5d7235-f034-4022-bea0-b86d6464a437
INFO  : Total jobs = 2
INFO  : Launching Job 1 out of 2
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0005, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0005/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0005
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:31:30,465 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:31:34,683 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.17 sec
INFO  : 2019-09-17 14:31:40,960 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.51 sec
INFO  : MapReduce Total cumulative CPU time: 2 seconds 510 msec
INFO  : Ended Job = job_1568699800773_0005
INFO  : Launching Job 2 out of 2
INFO  : Starting task [Stage-2:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0006, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0006/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0006
INFO  : Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:31:49,210 Stage-2 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:31:54,497 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.17 sec
INFO  : 2019-09-17 14:32:00,785 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.48 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 480 msec
INFO  : Ended Job = job_1568699800773_0006
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.51 sec   HDFS Read: 17079 HDFS Write: 6962 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.48 sec   HDFS Read: 14767 HDFS Write: 117 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 5 seconds 990 msec
INFO  : Completed executing command(queryId=double_happy_20190917143131_3e5d7235-f034-4022-bea0-b86d6464a437); Time taken: 37.896 seconds
INFO  : OK
+------+---------+--------+--+
| pid  |   uid   | count  |
+------+---------+--------+--+
| a    | user15  | 11     |
| a    | user62  | 11     |
| a    | user5   | 10     |
| b    | user37  | 10     |
| b    | user61  | 10     |
| b    | user15  | 9      |
| b    | user79  | 9      |
| c    | user66  | 13     |
| c    | user94  | 12     |
| c    | user81  | 11     |
+------+---------+--------+--+

With the rank column included, the result is easier to read:

0: jdbc:hive2://hadoop101:10000> select pid,uid,count,rank
. . . . . . . . . . . . . . . .> from(
. . . . . . . . . . . . . . . .> select pid,uid,count, rank()over(partition by pid order by count desc) as rank
. . . . . . . . . . . . . . . .> from(
. . . . . . . . . . . . . . . .> select pid,uid,count(uid) as count 
. . . . . . . . . . . . . . . .> from ods_uid_pid_info
. . . . . . . . . . . . . . . .> group by pid,uid 
. . . . . . . . . . . . . . . .> )as tmp
. . . . . . . . . . . . . . . .> )as result_tmp
. . . . . . . . . . . . . . . .> where  rank<=3;
INFO  : Compiling command(queryId=double_happy_20190917145252_c7075e67-87d3-4c1d-9d14-0944dfcdbdca): select pid,uid,count,rank
from(
select pid,uid,count, rank()over(partition by pid order by count desc) as rank
from(
select pid,uid,count(uid) as count
from ods_uid_pid_info
group by pid,uid
)as tmp
)as result_tmp
where  rank<=3
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:pid, type:string, comment:null), FieldSchema(name:uid, type:string, comment:null), FieldSchema(name:count, type:bigint, comment:null), FieldSchema(name:rank, type:int, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=double_happy_20190917145252_c7075e67-87d3-4c1d-9d14-0944dfcdbdca); Time taken: 0.047 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=double_happy_20190917145252_c7075e67-87d3-4c1d-9d14-0944dfcdbdca): select pid,uid,count,rank
from(
select pid,uid,count, rank()over(partition by pid order by count desc) as rank
from(
select pid,uid,count(uid) as count
from ods_uid_pid_info
group by pid,uid
)as tmp
)as result_tmp
where  rank<=3
INFO  : Query ID = double_happy_20190917145252_c7075e67-87d3-4c1d-9d14-0944dfcdbdca
INFO  : Total jobs = 2
INFO  : Launching Job 1 out of 2
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0010, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0010/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0010
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:52:46,500 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:52:50,688 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.06 sec
INFO  : 2019-09-17 14:52:55,872 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.8 sec
INFO  : MapReduce Total cumulative CPU time: 2 seconds 800 msec
INFO  : Ended Job = job_1568699800773_0010
INFO  : Launching Job 2 out of 2
INFO  : Starting task [Stage-2:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : Starting Job = job_1568699800773_0011, Tracking URL = http://hadoop101:8088/proxy/application_1568699800773_0011/
INFO  : Kill Command = /home/double_happy/app/hadoop/bin/hadoop job  -kill job_1568699800773_0011
INFO  : Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
INFO  : 2019-09-17 14:53:03,511 Stage-2 map = 0%,  reduce = 0%
INFO  : 2019-09-17 14:53:09,752 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.89 sec
INFO  : 2019-09-17 14:53:15,983 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.4 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 400 msec
INFO  : Ended Job = job_1568699800773_0011
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.8 sec   HDFS Read: 17080 HDFS Write: 6962 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.4 sec   HDFS Read: 14832 HDFS Write: 137 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 6 seconds 200 msec
INFO  : Completed executing command(queryId=double_happy_20190917145252_c7075e67-87d3-4c1d-9d14-0944dfcdbdca); Time taken: 36.913 seconds
INFO  : OK
+------+---------+--------+-------+--+
| pid  |   uid   | count  | rank  |
+------+---------+--------+-------+--+
| a    | user15  | 11     | 1     |
| a    | user62  | 11     | 1     |
| a    | user5   | 10     | 3     |
| b    | user37  | 10     | 1     |
| b    | user61  | 10     | 1     |
| b    | user15  | 9      | 3     |
| b    | user79  | 9      | 3     |
| c    | user66  | 13     | 1     |
| c    | user94  | 12     | 2     |
| c    | user81  | 11     | 3     |
+------+---------+--------+-------+--+
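
The four rows for pid b follow from how rank() treats ties: rows with the same count share a rank and the next rank is skipped, so two users at rank 1 and two at rank 3 all pass rank <= 3. Here is a minimal in-memory sketch of that rule in Scala, assuming the (pid, uid, count) rows are already aggregated. The names RankTopN and topN are illustrative and not part of the original job.

```scala
object RankTopN {
  // One aggregated row: (pid, uid, count), i.e. the output of the inner GROUP BY.
  type Row = (String, String, Long)

  // Emulates rank() over (partition by pid order by count desc), then keeps rank <= n.
  def topN(rows: Seq[Row], n: Int): Seq[(String, String, Long, Int)] =
    rows.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (pid, group) =>
      group
        .sortBy(row => -row._3) // highest count first
        .map { case (_, uid, cnt) =>
          // rank = 1 + number of rows in the partition with a strictly larger count
          val rank = 1 + group.count(_._3 > cnt)
          (pid, uid, cnt, rank)
        }
        .filter(_._4 <= n)
    }

  def main(args: Array[String]): Unit = {
    // Sample rows copied from the pid = b partition of the result above.
    val rows = Seq(
      ("b", "user37", 10L), ("b", "user61", 10L),
      ("b", "user15", 9L), ("b", "user79", 9L), ("b", "user5", 8L)
    )
    topN(rows, 3).foreach(println)
  }
}
```

If exactly three rows per pid are required regardless of ties, row_number() is the usual alternative, at the cost of picking arbitrarily among equal counts.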

Scala06--double_happy

2017-12-17

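Implicit conversions

Purpose: enhancing existing types

Scala has three kinds:
	implicit parameters
	implicit type conversions
	implicit classes

Implicit type conversions: ****

e.g.:
	type A ==> type B: B enhances what A already has (the caller does not notice the conversion)
In Scala, the File class (java.io.File) has no built-in methods such as count or read, but
an implicit conversion lets us add methods that File does not provide.

This is a double-edged sword: used carelessly, it can make the flow of the code hard to follow.

Requirement: how do you add a new method to an existing class?

Java: use a proxy
Scala: use implicit

1. Define an implicit conversion function

implicit def man2surperman(man: Man): Surperman = new Surperman(man.name)

Code: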
import scala.language.implicitConversions

object ImplicitApp {

  def main(args: Array[String]): Unit = {

    val bfx = new Surperman("bfx")

    bfx.fly()

    /**
      * Implicit type conversion
      * 1. Define an implicit conversion function
      */
    implicit def man2surperman(man: Man): Surperman = {
      new Surperman(man.name)
    }

    val double_happy = new Man("double_happy")
    double_happy.fly()


  }


  class Man(val name:String)

  class Surperman(val name : String){

    def fly(): Unit ={
      println(s"$name can fly...")
    }
  }

}

Requirement 2: in Scala, the File class has no built-in methods such as count or read, but
an implicit conversion lets us add methods that File does not provide.

Code:
import java.io.File
import scala.io.Source
import scala.language.implicitConversions

object ImplicitApp {

  def main(args: Array[String]): Unit = {

    val bfx = new Surperman("bfx")

    bfx.fly()

    /**
      * Implicit type conversion
      * 1. Define an implicit conversion function
      */
    implicit def man2surperman(man: Man): Surperman = {
      new Surperman(man.name)
    }

    val double_happy = new Man("double_happy")
    double_happy.fly()

    // 1.2 Enhance File
    implicit def file2RichFile(file: File): RichFile = new RichFile(file)

    val file = new File("/Users/double_happy/zz/G7-03/工程/scala-flink/doc/implicit/file.txt")

    println(file.read())
  }


  class Man(val name:String)

  class Surperman(val name : String){

    def fly(): Unit ={
      println(s"$name can fly...")
    }
  }


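  // 2. Wrapper that enhances File

  class RichFile(file: File) {

    def read(): String = {
      // file.getPath is the directory path plus the file name
      val source = Source.fromFile(file.getPath)
      try source.mkString finally source.close()
    }
  }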



}

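Written this way, implicit definitions end up scattered through the code, which gets messy; it is better to pull them out into a single object.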
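
One possible way to gather the conversions in a single place is sketched here. The holder name ImplicitAspect and the app object are hypothetical; Man and Surperman are redefined so the sketch compiles on its own, and fly() returns a String here only to make the result easy to check.

```scala
import scala.language.implicitConversions

class Man(val name: String)

class Surperman(val name: String) {
  def fly(): String = s"$name can fly..."
}

// Hypothetical holder object: every implicit conversion of the project lives here.
object ImplicitAspect {
  implicit def man2surperman(man: Man): Surperman = new Surperman(man.name)
}

object ImplicitAspectApp {
  // A single import brings all conversions into scope at the use site.
  import ImplicitAspect._

  def main(args: Array[String]): Unit = {
    // Man has no fly(); the compiler inserts man2surperman here.
    println(new Man("double_happy").fly())
  }
}
```

Keeping the import explicit also makes it easier to see which files rely on the conversions.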

Implicit type conversions are used heavily in spark-core's RDD code;
the RDD companion object is where most of those implicits are gathered.
e.g.:
object RDD {

  private[spark] val CHECKPOINT_ALL_MARKED_ANCESTORS =
    "spark.checkpoint.checkpointAllMarkedAncestors"

  // The following implicit functions were in SparkContext before 1.3 and users had to
  // `import SparkContext._` to enable them. Now we move them here to make the compiler find
  // them automatically. However, we still keep the old functions in SparkContext for backward
  // compatibility and forward to the following functions directly.

  implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }

  implicit def rddToAsyncRDDActions[T: ClassTag](rdd: RDD[T]): AsyncRDDActions[T] = {
    new AsyncRDDActions(rdd)
  }

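Implicit parameters

Implicit parameters
	A method/function parameter can be marked with implicit
	The effect: when the call site leaves that argument out, the compiler fills it in with an implicit value of the matching type found in scope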
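
A minimal sketch of an implicit parameter, assuming nothing beyond the standard library. The names greet and defaultName are made up for illustration.

```scala
object ImplicitParamSketch {
  // The second parameter list is implicit: callers may omit it.
  def greet(greeting: String)(implicit name: String): String = s"$greeting, $name"

  def main(args: Array[String]): Unit = {
    // An implicit value of the matching type must be in scope at the call site.
    implicit val defaultName: String = "double_happy"

    println(greet("hello"))        // the compiler passes defaultName
    println(greet("hello")("bfx")) // an explicit argument takes precedence
  }
}
```

In real code a dedicated type is usually safer than an implicit String, since any implicit String in scope would match.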

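e.g. (implicit classes):
import java.io.File
import scala.io.Source

// The wrapper object name is illustrative; the original note shows only its members
object ImplicitClassApp {

  // Int now appears to have an add method
  implicit class Cal(x: Int) {

    def add(a: Int): Int = a + x
  }

  // Enhance the File class
  implicit class FileEnhance(file: File) {

    def read2(): String = {
      val source = Source.fromFile(file.getPath)
      try source.mkString finally source.close()
    }
  }

  def main(args: Array[String]): Unit = {

    // Implicit classes: the calls below compile because Cal and FileEnhance are in scope

    val file2 = new File("/Users/double_happy/zz/G7-03/工程/scala-flink/doc/implicit/file.txt")

    println(file2.read2())

    println(4.add(2))
  }
}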

Scala generics

import CupEnum._

/**
  * Scala generics: constraints on types
  */
object GeneticApp {

  def main(args: Array[String]): Unit = {

    val mm1 = new MM[Int,CupEnum,Int](90,CupEnum.A,175)
    val mm2 = new MM[Int,CupEnum,Int](10,CupEnum.F,150)
    val mm3 = new MM[Int,CupEnum,Int](80,CupEnum.B,165)

    println(mm1)
    println(mm2)
    println(mm3)

  }
}

abstract class Msg[T](content:T)

class WeChatMsg(content:String) extends Msg(content)

class DigitMsg(content: Int) extends Msg(content)

class MM[A, B, C](val faceValue: A, val cap: B, val height: C) {

  override def toString: String = s"$faceValue\t$cap\t$height"
}

// Using an enumeration in Scala: this is the standard boilerplate

object CupEnum extends Enumeration {

  type  CupEnum = Value

  val A,B,C,D,E,F = Value

}

Sorting in Scala

Compared with Java

Two approaches:
1. pass a Comparator to sort
2. implement Comparable

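import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class MM implements Comparable<MM> {

    public static void main(String[] args) {

        MM mm1 = new MM("mm1", 32);
        MM mm2 = new MM("mm2", 34);
        MM mm3 = new MM("mm3", 31);

        List<MM> mms = new ArrayList<>();
        mms.add(mm1);
        mms.add(mm2);
        mms.add(mm3);
        // 1. First approach: sort with a Comparator, cpm01(mms);
        // 2. Second approach: sort through Comparable

        cmp02(mms);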


    }

    private static void cmp02(List<MM> mms) {
        Collections.sort(mms);
        for (MM mm : mms) {
            System.out.println(mm);
        }
    }

    private static void cpm01(List<MM> mms) {
        // Sort with an anonymous Comparator (ascending by cup)
        Collections.sort(mms, new Comparator<MM>() {
            @Override
            public int compare(MM o1, MM o2) {
                return o1.cup - o2.cup;
            }
        });

        for (MM mm : mms) {
            System.out.println(mm);
        }
    }

    private String name;

    private int cup;

    public MM() {
    }

    public MM(String name, int cup) {
        this.name = name;
        this.cup = cup;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getCup() {
        return cup;
    }

    public void setCup(int cup) {
        this.cup = cup;
    }

    @Override
    public String toString() {
        return "MM{" +
                "name='" + name + '\'' +
                ", cup=" + cup +
                '}';
    }

    @Override
    public int compareTo(MM o) {
        return -(this.cup - o.cup);
    }
}

Scala counterparts of the Java sorting interfaces:
	1. Ordering ==> Comparator
	2. Ordered ==> Comparable
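
The mapping can be sketched with the same three sample values as the Java listing. Member, byAgeDesc and SortSketch are illustrative names, and the age field stands in for cup.

```scala
object SortSketch {
  // Ordered is the counterpart of Comparable: the class carries its own natural order.
  case class Member(name: String, age: Int) extends Ordered[Member] {
    override def compare(that: Member): Int = Integer.compare(this.age, that.age)
  }

  // Ordering is the counterpart of Comparator: the order is supplied from outside.
  val byAgeDesc: Ordering[Member] = Ordering.by[Member, Int](_.age).reverse

  val members: List[Member] =
    List(Member("mm1", 32), Member("mm2", 34), Member("mm3", 31))

  def main(args: Array[String]): Unit = {
    println(members.sorted)            // natural order from Ordered: ascending age
    println(members.sorted(byAgeDesc)) // explicit Ordering: descending age
  }
}
```

Ordering is usually the more flexible choice, because several orders can coexist for one class without touching it.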

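Upper and lower bounds in Java
	Upper bound: <T extends Test>, T is Test or a subtype of Test; wildcard form <? extends Test>
	Lower bound: <? super Test>, the type is Test or a supertype of Test; Java allows super only on wildcards, so <T super Test> does not compile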

Upper and lower bounds in Scala

object UpperLowerBountsApp {

  def main(args: Array[String]): Unit = {

    // 3. Sorting Man

//    val man1 = new Man("double",24)
//    val man2 = new Man("xiao fang",18)
//
//    println(new MaxValue(man1,man2).compare)
//    println(new MaxValue2(man1,man2).compare)

    //3.2

    val man3 = new Man2("double",24)
    val man4 = new Man2("xiao fang",18)

   // println(new MaxValue2(man3,man4).compare)  // does not compile at this point: Man2 needs an implicit conversion, or must implement Comparable, or extend Ordered[Man2]

    implicit def man2ToOrderedMan2(man2: Man2): Ordered[Man2] = new Ordered[Man2] {
      override def compare(that: Man2): Int = that.age - man2.age
    }

    println(new MaxValue2(man3,man4).compare)

    // 3.3 Context bound

    implicit val cpmtor: Ordering[Man2] = new Ordering[Man2] {
      override def compare(x: Man2, y: Man2): Int = -(x.age -y.age)
    }
    println(new MaxValue3(man3,man4).compare())

    // 0. Plain Int version

    val maxInt = new MaxInt(3,6)

    println(maxInt.compare())

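    // 1. The type argument has to be written out here; for a type that implements Comparable itself it can be omitted
    val maxValue = new MaxValue[Integer](6,10) // Scala's Int does not implement Comparable, java.lang.Integer does

    println(maxValue.compare)

    // 2. With a view bound the type argument can be omitted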
    val maxValue2 = new MaxValue2(13,15)

    println(maxValue2.compare)

  }

}

// Maximum of two values, Int only
class  MaxInt(x :Int , y:Int){

  def compare()={
    if(x > y) x else y
  }
}

class  MaxLong(x :Long , y:Long){

  def compare()={
    if(x > y) x else y
  }
}

// Upper bound in Scala: T must be a subtype of Comparable[T]
class  MaxValue[T <: Comparable[T]] (x:T,y:T){

  def compare = if(x.compareTo(y) >0)  x else  y

}

// View bound: implemented with an implicit conversion under the hood (view bounds are no longer supported in Scala 3)
class  MaxValue2[T <% Comparable[T]] (x:T,y:T){

  def compare = if(x.compareTo(y) >0)  x else  y

}

// Context bound: requires an implicit Ordering[T] in scope
class  MaxValue3[T : Ordering](x : T,y:T){

  def compare() = if(implicitly[Ordering[T]].compare(x,y) > 0) x else y

}

class Man(val name :String , val age : Int) extends  Ordered[Man]{

  override def compare(that: Man): Int = that.age - age

  override def toString: String = name + "\t" + age

}

class Man2(val name :String , val age : Int) {

  override def toString: String = name + "\t" + age
}

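Summary of sorting in Scala: it is enough to be able to read these bounds
	Whichever bound you use, the type being converted to must provide compare
	Whether the bean class gets there through an implicit conversion or by extending/implementing Ordered, something like compare
	has to be available where the bean class and the comparing class are used together; otherwise compilation fails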

Contravariance and covariance

In Scala, generic types are invariant by default, but variance can be switched on explicitly

import CupEnum._

/**
  * Scala generics: constraints on types
  */
object GeneticApp {


  def main(args: Array[String]): Unit = {


    // 1. Generics
    val mm1 = new MM[Int,CupEnum,Int](90,CupEnum.A,175)
    val mm2 = new MM[Int,CupEnum,Int](10,CupEnum.F,150)
    val mm3 = new MM[Int,CupEnum,Int](80,CupEnum.B,165)

    println(mm1)
    println(mm2)
    println(mm3)


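    // 2. Contravariance and covariance

    /**
      * Generic types are invariant by default
      *
      *  Can a Test[UserA] reference hold a Test[Child]?
      *  e.g.: val test: Test[UserA] = new Test[Child]     // not allowed by default, because generic types are invariant
      *
      *  UserA ==> Child   covariance: a Test of the subtype is accepted; enabled by putting + on Test's type parameter
      *
      *  UserA ==> Person  contravariance: a Test of the supertype is accepted; enabled by putting - on Test's type parameter
      *
      */
    val test: Test[UserA] = new Test[UserA]
    val test1: Test[UserA] = new Test[Child]
    println(test)
    println(test1)

//    val test3: Test[UserA] = new Test[Person]
    println(test)


    // Where this shows up
    val list = List(1,2,3,4,5)
   // list.reduceLeft[UserA]()  open its definition to see: the type parameter has a lower bound, and the result type is that type parameter (UserA here)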
    

  }

  def test[T](t:T)=println(t)
}

class Person

class  UserA extends  Person

class Child extends  UserA

class Test[+T]

abstract class Msg[T](content:T)

class WeChatMsg(content:String) extends Msg(content)

class DigitMsg(content: Int) extends Msg(content)

class MM[A, B, C](val faceValue: A, val cap: B, val height: C) {

  override def toString: String = s"$faceValue\t$cap\t$height"
}

// Using an enumeration in Scala: this is the standard boilerplate

object CupEnum extends Enumeration {

  type  CupEnum = Value

  val A,B,C,D,E,F = Value

}
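
Both directions of variance can be seen side by side in a small sketch. Producer and Handler are illustrative classes added for the example; Person, UserA and Child mirror the hierarchy in the listing above.

```scala
object VarianceSketch {
  class Person
  class UserA extends Person
  class Child extends UserA

  // Covariant (+T): a Producer[Child] is accepted where a Producer[UserA] is expected.
  class Producer[+T](val value: T)

  // Contravariant (-T): a Handler[Person] is accepted where a Handler[UserA] is expected.
  class Handler[-T] {
    def handle(t: T): String = "handled"
  }

  val producer: Producer[UserA] = new Producer[Child](new Child)
  val handler: Handler[UserA] = new Handler[Person]

  def main(args: Array[String]): Unit = {
    println(handler.handle(producer.value))
  }
}
```

A rule of thumb: types that only hand values out can be covariant, while types that only take values in can be contravariant.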

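Using JDBC from Scala

Add the JDBC dependency to pom.xml
Development
	the most basic approach
using scalikejdbc

import scalikejdbc._
import scalikejdbc.config.DBs

object ScalalikJDBCAPP {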


  def  insert()={
    DB.autoCommit({
      implicit session =>{
        SQL("insert into tmp(topic,groupid,partitions,offset) values(?,?,?,?)")
          .bind("happy","test-happy-group",3,8)    // values to insert
          .update().apply()  // execute
      }
    })
  }

  def update(): Unit = {
    DB.autoCommit({
      implicit session =>{
        SQL("update tmp set offset=? where topic=? and groupid=? and partitions=?")
          .bind(18,"happy","test-happy-group",3)    // new offset first, then the row to match
          .update().apply()  // execute
      }
    })
  }

  def query(): Unit = {
    val queryresult = DB.readOnly({
      implicit session => {
        SQL("select * from tmp ").map(rs => {

          tmp(
            rs.string("topic"),
            rs.string("groupid"),
            rs.int("partitions"),
            rs.long("offset")
          )
        }).list().apply() // execute
      }
    })
    queryresult.foreach(println(_))
  }

  case class tmp(topic:String,groupid:String,partitions:Int,offset:Long)

  def delete(): Unit = {
    DB.autoCommit({
      implicit session =>{
        SQL("delete from tmp where partitions=? ")
          .bind(3)    // value to delete
          .update().apply()  // execute
      }
    })
  }

  def transaction(): Unit = {
    DB.localTx({
      implicit session =>{
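        SQL("delete from tmp where partitions=? ")
          .bind(3)    // value to delete
          .update().apply()  // execute

        // 1/0   for testing: with this line enabled and no transaction, 3 is deleted but 2 is not; inside localTx neither delete takes effect ***

        SQL("delete from tmp where partitions=? ")
          .bind(2)    // value to delete
          .update().apply()  // execute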

      }
    })
  }

  def main(args: Array[String]): Unit = {

    // 1. Load the configuration file
    DBs.setupAll()

    // Insert
    insert()

    // Update
    update()

    // Query
    query()

    // Delete
    delete()

    // Transaction
    transaction()

  }

}

Flume03--double_happy

2017-12-13

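Flume core components

Six components: Source, Interceptors, Channel Selectors, Channel, Sink, Sink Processors
Transactions, the three core components (source, channel, sink), and custom component development are the key points you must master

If transactions are not guaranteed across the whole flow, two problems can arise:
1. data loss
2. duplicated data
Why is data lost, and why is it duplicated? A later chapter demonstrates this

Multiple agents

Flume: when choosing the agent setup there is no right or wrong, only suitable or unsuitable

Case 1

[Figure: agents A1 and A2 forwarding to agent A3]

Configuration of the three agents
    to be continued...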

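Summary:
	1. For transport between multiple agents, choose:
			agent with an nc source: avro sink
			agent with a taildir source: avro sink
			receiving agent: avro source
	2. A1 and A2 may carry data from different business lines, possibly in different formats. They can be wired as in the figure, but once the data reaches A3 it needs a marker so that the log type is known:
	after an event comes in, it passes through the interceptor chain, which can set header fields to tell the logs apart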

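Improving the topology above

In A3 the data leaves the channel and goes straight into a single sink. That will not do in production: if the sink dies, the pipeline stops.

Sink Processors:
	1. load balance
	2. failover

load balance
One sink sends part of the data and another sink sends the rest (round robin or random)
Case 2:

Configuration to be continued...

Summary:
	1. A0 needs a sink group

failover

Summary:
	1. The sink with the higher priority sends the data and the other sink sends nothing; a larger value means a higher priority
		When the higher-priority sink goes down, traffic moves to the lower-priority sink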
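
A sink group with a failover processor might be configured along these lines. This is a sketch, not the configuration the note leaves "to be continued": the agent name a1, the sink names k1 and k2, and the numeric values are assumptions, while the property keys are the ones described in the Flume User Guide.

```properties
# Two sinks draining the same channel, grouped under one sink processor
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# failover: only the highest-priority healthy sink sends data
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# upper limit (ms) on the back-off applied to a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000

# load balancing instead would use:
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.selector = round_robin
```

With these values k1 carries all traffic and k2 takes over only while k1 is failing.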

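Things to watch when tuning Flume

Interview questions:

QA1: What Flume tuning have you done at work?
	Source:
		files: TailDir; note that it does not recurse into subdirectories by default

		network: avro
	Channel:
		Memory
		File
		capacity
		transactionCapacity
		source/sink: batchSize

	Sink: the number of sinks is a throughput question
	More is not always better: each sink runs on its own thread inside the agent JVM and uses resources
	The sink's batchSize decides how many events leave the channel per batch

	Flume agent architecture:
		failover and load balance: which setup gives the best availability?
		Using both together works but demands more of the machines; if you have to settle for less, failover is the one that must be configured

QA2: How do you monitor Flume in use?
    Mainly monitor the channel. If events pile up in the channel, data is backing up, which points to a problem at the sink.
    If the sink has a problem, the source keeps pushing data into the channel while no sink consumes it,
    so the channel fills up sooner or later, and then the source can no longer write either.
Ganglia, or JSON metrics -> ES -> Kibana dashboards, can all be used for monitoring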

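1. TailDir Source:
	Note:
	If one directory, or a few directories, contain a very large number of data files:
	1. increase the number of filegroups; throughput goes up because different files are handled separately
	2. the source has a batchSize (reference value 10000-50000)
	     Max number of lines to read and send to the channel at a time.

2. channel
      Choosing between memory and file depends on whether you can accept losing a small amount of data; the memory channel loses data only in rare cases (the note puts this at 99.99%)
      Avoid the Kafka channel: one framework, Flume, is enough. Pulling Kafka into the model makes the chance of failures grow exponentially

      capacity (how many events the channel can hold)
      		The maximum number of events stored in the channel
	  transactionCapacity (how many events the source puts into, or the sink takes from, the channel in one transaction)
	  		The maximum number of events the channel will take from a source or give to a sink per transaction

transactionCapacity must not be larger than capacity; the memory channel rejects a configuration where it is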
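
The relationship between these settings can be sketched as a config fragment. The names a1, r1, c1, k1 and all numbers are illustrative assumptions, and the sink is assumed to be an HDFS sink because its batch property is spelled hdfs.batchSize. The ordering to keep in mind is batchSize <= transactionCapacity <= capacity.

```properties
# Memory channel: holds at most 100000 events, 10000 per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 10000

# Source and sink batches must fit inside one channel transaction
a1.sources.r1.type = TAILDIR
a1.sources.r1.batchSize = 10000
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.batchSize = 10000
```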

Flume02--double_happy

2017-12-11

Flume agent configuration template

agent_name: the name of the agent being configured
a1: an example agent name (any name will do)

# Name the components on this agent
<agent_name>.sources = <source_name>
<agent_name>.sinks = <sink_name>
<agent_name>.channels = <channel_name>

<agent_name>.sources.<source_name>.type = xx
<agent_name>.sinks.<sink_name>.type = yyy
<agent_name>.channels.<channel_name>.type = zzz

<agent_name>.sources.<source_name>.channels = <channel_name>
<agent_name>.sinks.<sink_name>.channel = <channel_name>

Continuing the example from the end of the previous post:

collect logs from a given network port and print them to the console

2. Which sources, channels, and sinks does Flume support?

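source:
	avro (an RPC framework)
	exec (watches one file): tail -F xx.log, which takes a single file (-F and -f differ: -F follows the file name and reopens it after rotation, -f follows the open file descriptor)
	Spooling Directory: watches one directory (subdirectories are not supported)
	Taildir (watches both files and directories) **
	netcat
sink:
	HDFS
	logger (writes to the console)
	avro: used together with an avro source
	kafka
channel: only stores the data (the note describes it as storage rather than a cache)
	memory
	file
Agent: the various combinations of source, channel, and sink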

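3. Case analysis

e.g. (for learning; the approach is worth borrowing, but in production use Taildir directly)
   Collect content appended to a file into HDFS
	exec - memory - hdfs
    A directory
	spooling - memory - hdfs
    Write file data to Kafka
	exec - memory - kafka

4. Hands-on

Requirement 1: collect the contents of a given file into HDFS
Choice: exec - memory - hdfs

agent: to be continued...

Summary of requirement 1:
batchSize: how many events accumulate before they are flushed to HDFS
fileType: defaults to SequenceFile; DataStream can be thought of as plain (uncompressed) text
writeFormat: defaults to Writable

On HDFS, a 128 MB file and a 1 KB file each occupy one block, and every block has its own metadata.
Metadata lives in NameNode memory, so too many small files can bring the NameNode down (so how large should each file be?)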
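
One way to keep the HDFS sink from producing many small files is to roll by size only. This fragment is a sketch: the agent and sink names and the hdfs.path value are placeholders, and the roll size of 128 MB is an assumption chosen to match the block size mentioned above. The property keys come from the HDFS sink section of the Flume User Guide.

```properties
a1.sinks.k1.type = hdfs
# placeholder path, adjust to the cluster
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
# plain text output instead of the SequenceFile/Writable defaults
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# events written per flush to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# roll on size only: 0 disables rolling by time and by event count
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 134217728
```

Rolling purely by size trades latency for fewer files; a non-zero rollInterval can be added if files must close within a time limit.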

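Requirement 2: collect the contents of a given directory to the console
Choice: spooling - memory - logger

agent: to be continued...

Summary of requirement 2:
When a file arrives in the watched directory and has been processed, the suffix .COMPLETED is appended to its name.
Why does Flume fail when the same file is dropped in twice? Once a file in the spooling directory has been processed,
appending to it or modifying it raises an error, and reusing the name of an already processed file raises an error as well.
In other words, file names must not repeat.
If fileHeader is set to true in the agent, each event gets a header (key file by default) whose value is the absolute path of the source file, including the file name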

Flume01--double_happy

2017-12-10

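Introduction

I use flume1.6.0-cdh5.15.1. Configuring a Flume agent is essentially looking things up in a reference, and each version has its own documentation site.
Documentation for flume1.6.0-cdh5.15.1: [flume1.6.0-cdh5.15.1](http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html)
Download: [CDH download page](http://archive.cloudera.com/cdh5/cdh/5/)

1. Setup
	Three steps: download, unpack, configure environment variables

2. Learning Flume from the official documentation

Flume:
	RDBMS ==> Sqoop ==> Hadoop    (using FS works)
	Logs: scattered across many servers  ??? ===> Hadoop    this is where Flume comes in

Flume is a distributed, reliable, and available service
for efficiently collecting, aggregating, and moving large amounts of log data.
It has a simple and flexible architecture based on streaming data flows.

collecting	   source
aggregating    channel (somewhere to hold the collected data temporarily)
moving         sink

Flume: you write a configuration file that wires source, channel, and sink together
Agent: made up of source, channel, and sink
Writing a Flume configuration file amounts to describing how the agent is composed

Flume is a framework that collects and aggregates log data and carries it from place A to place B
Note:
	1. Flume can be configured to compress the collected files and to choose the storage format
    2. Flume monitoring
When Flume writes to HDFS, make sure it has read and write permission

3. How do you use Flume?
		By configuring an agent
		A Flume agent is a local configuration file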

[hadoop@ruozedata001 flume]$ flume-ng
Error: Unknown or unspecified command ''

Usage: /home/hadoop/app/flume/bin/flume-ng <command> [options]...

commands:
  help                      display this help text
  agent                     run a Flume agent
  avro-client               run an avro Flume client
  version                   show Flume version info

global options:
  --conf,-c <conf>          use configs in <conf> directory
  --classpath,-C <cp>       append to the classpath
  --dryrun,-d               do not actually start Flume, just print the command
  --plugins-path <dirs>     colon-separated list of plugins.d directories. See the
                            plugins.d section in the user guide for more details.
                            Default: $FLUME_HOME/plugins.d
  -Dproperty=value          sets a Java system property value
  -Xproperty=value          sets a Java -X option

agent options:
  --name,-n <name>          the name of this agent (required)
  --conf-file,-f <file>     specify a config file (required if -z missing)
  --zkConnString,-z <str>   specify the ZooKeeper connection to use (required if -f missing)
  --zkBasePath,-p <path>    specify the base path in ZooKeeper for agent configs
  --no-reload-conf          do not reload config file if changed
  --help,-h                 display help text

avro-client options:
  --rpcProps,-P <file>   RPC client properties file with server connection params
  --host,-H <host>       hostname to which events will be sent
  --port,-p <port>       port of the avro source
  --dirname <dir>        directory to stream to avro source
  --filename,-F <file>   text file to stream to avro source (default: std input)
  --headerFile,-R <file> File containing event headers as key/value pairs on each new line
  --help,-h              display help text

  Either --rpcProps or both --host and --port must be specified.

Note that if <conf> directory is specified, then it is always included first
in the classpath.

Values in angle brackets <> must be supplied; anything in square brackets [] is optional.

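4. A simple hands-on exercise (an introduction)
	Don't worry if requirement 1 doesn't make sense yet; the details are explained later.
	
Requirement 1: use Flume to read data from a given port and print it to the console
Analysis: choosing the agent's components
	Source:  netcat
	Channel: memory
	Sink:    logger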

# Define the agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Define the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = ruozedata001
a1.sources.r1.port = 44445

# Define the channel
a1.channels.c1.type = memory

# Define the sink
a1.sinks.k1.type = logger

# Wire them together
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

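What each option means (see the flume-ng usage output above for the full list):

flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/script/flume/nc-mem-logger.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343

--conf-file: the agent configuration file written above
-Dflume.root.logger=INFO,console: print the log to the console
-Dflume.monitoring.type=http, -Dflume.monitoring.port=34343: expose metrics over HTTP on this port of the local machine, in preparation for monitoring later

After startup the log contains "reloading" messages. They mean Flume reloads the agent configuration when the file changes, so you can edit the agent config without killing and restarting Flume.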

The log also shows events.
Event: one record. It is the smallest unit of data transfer, and it travels from the source to the channel to the sink.
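
To make the source -> channel -> sink path concrete, here is a toy Scala model of it. This is not Flume's API: `Event`, `ToyAgent`, and `run` are hypothetical names invented for this sketch, and the "channel" is just an in-memory queue.

```scala
import scala.collection.mutable

// Hypothetical model of a Flume event: headers plus a byte-array body.
final case class Event(headers: Map[String, String], body: Array[Byte])

object ToyAgent {
  // Pushes each input line through source -> channel -> sink and
  // returns what the sink would have printed.
  def run(lines: Seq[String]): List[String] = {
    // "memory channel": a temporary in-memory buffer
    val channel = mutable.Queue.empty[Event]

    // "source": wrap every incoming line in an Event and put it on the channel
    lines.foreach { line =>
      channel.enqueue(Event(Map.empty, line.getBytes("UTF-8")))
    }

    // "sink": drain the channel and turn each body back into text
    val out = mutable.ListBuffer.empty[String]
    while (channel.nonEmpty) {
      out += new String(channel.dequeue().body, "UTF-8")
    }
    out.toList
  }
}
```

Calling `ToyAgent.run(Seq("hello", "flume"))` hands back the same two lines in order, which is all a netcat -> memory -> logger agent does conceptually.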

Order of execution: start Flume first.
1. Run the command:
	flume-ng agent \
	--name a1 \
	--conf-file /home/hadoop/script/flume/nc-mem-logger.conf \
	--conf $FLUME_HOME/conf \
	-Dflume.root.logger=INFO,console \
	-Dflume.monitoring.type=http \
	-Dflume.monitoring.port=34343
2. Open telnet:
telnet ruozedata001 44445
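
If telnet is not installed, any TCP client can feed the netcat source. The sketch below is one way to do it from Scala with plain `java.net.Socket`. `NetcatClient` and `sendLine` are hypothetical names, and the host and port are whatever the agent config binds to (`ruozedata001` and `44445` in this example).

```scala
import java.io.{OutputStreamWriter, PrintWriter}
import java.net.Socket
import java.nio.charset.StandardCharsets

object NetcatClient {
  // Opens a TCP connection, writes one line of text, and closes the connection.
  def sendLine(host: String, port: Int, line: String): Unit = {
    val socket = new Socket(host, port)
    try {
      val writer = new OutputStreamWriter(socket.getOutputStream, StandardCharsets.UTF_8)
      val out = new PrintWriter(writer, true) // autoFlush: println flushes the line
      out.println(line)
    } finally {
      socket.close()
    }
  }
}
```

With the agent running, `NetcatClient.sendLine("ruozedata001", 44445, "ruozedata")` should produce one event in the Flume console log.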

Result, for example:
Event: { headers:{} body: 72 75 6F 7A 65 64 61 74 61 0D                   ruozedata. }
Event = headers + body (a byte array that holds the actual data)
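
The body in the log line above is a hex dump of the bytes that were sent. As a small worked example, the sketch below decodes such a dump back into text. `EventBody` and `decodeHex` are hypothetical helper names, not part of Flume.

```scala
import java.nio.charset.StandardCharsets

object EventBody {
  // Turns a space-separated hex dump such as "72 75 6F" back into text.
  def decodeHex(hexDump: String): String = {
    val bytes = hexDump.trim.split("\\s+").map(h => Integer.parseInt(h, 16).toByte)
    new String(bytes, StandardCharsets.UTF_8)
  }
}
```

`EventBody.decodeHex("72 75 6F 7A 65 64 61 74 61 0D")` gives `ruozedata` followed by a carriage return (byte 0D). That trailing byte is most likely the line ending sent by telnet, which the log line shows as a dot.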

To be continued... The next section covers how to use Flume.

Scala Syntax Details 05--double_happy

2017-12-08

Case Class

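/**
  *
  * case class: must have a parameter list
  * case object: must not have a parameter list
  *
  * Interview question: what is the difference between class and case class?
  *
  * case class overrides toString, equals, and hashCode
  * case class implements Serializable by default
  * case class instances can be created without new
  */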
object CaseClassApp {
  def main(args: Array[String]): Unit = {
    println(Dog("旺财").name)
  }

}

case class Dog(name:String)

Testing the claims above:

class :
	scala> class person(val name :String , val age : Int)
	defined class person
	
	scala> val p1 = new person("sx",24)
	p1: person = person@38f796a5
	
	scala> val p2 = new person("sx",24)
	p2: person = person@1d04ef4f
	
	scala> p1 ==p2
	res37: Boolean = false
case class :
	scala> case class Person(name : String ,age : Int)
	defined class Person

	scala> val p3 = Person("sx",30)
	p3: Person = Person(sx,30)
	
	scala> val p4 = Person("sx",30)
	p4: Person = Person(sx,30)
	
	scala> p3 ==p4
	res38: Boolean = true

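Why are the two case class instances equal?
	Because a case class overrides toString, equals, and hashCode: equals compares the field values, whereas a plain class falls back to reference equality.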
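
The REPL session above can be condensed into a small self-contained sketch. `EqualityDemo`, `PlainPerson`, and `CasePerson` are names made up for this illustration; the behavior shown is standard Scala.

```scala
object EqualityDemo {
  class PlainPerson(val name: String, val age: Int)   // inherits reference equality
  case class CasePerson(name: String, age: Int)       // compiler generates equals/hashCode/toString

  // Two distinct plain instances with the same fields are not equal.
  def plainEqual: Boolean = new PlainPerson("sx", 24) == new PlainPerson("sx", 24)

  // Two case class instances with the same fields are equal.
  def caseEqual: Boolean = CasePerson("sx", 30) == CasePerson("sx", 30)

  // hashCode is consistent with equals, so duplicates collapse in a Set.
  def distinctCount: Int = Set(CasePerson("sx", 30), CasePerson("sx", 30)).size

  // The generated toString prints the class name and the field values.
  def rendered: String = CasePerson("sx", 30).toString
}
```

Here `plainEqual` is false, `caseEqual` is true, `distinctCount` is 1, and `rendered` is `CasePerson(sx,30)`.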

Pattern Matching

import scala.util.Random

/**
  *
  * Pattern matching
  *
  *  value match {
  *     case pattern1 => code
  *     case pattern2 => code
  *     case pattern3 => code
  *     case _        => default code
  *  }
  *
  * Matching on content, on type, on collections, and on case classes
  */
object MatchApp {

  def main(args: Array[String]): Unit = {

    /**
      * Matching on content
      */

    val teachers = Array("Aoi Sola", "YuiHatano", "Akiho Yoshizawa")
    val name = teachers(Random.nextInt(teachers.length))

    name match {
      case "YuiHatano" => println("Teacher Hatano")
      case "Akiho Yoshizawa" => println("Teacher Yoshizawa")
      case _ => println("No idea who this teacher is")
    }

//    println(name)

    /**
      * Matching on type
      */

    def matchType(obj:Any) = obj match {
      case x:Int => println("Int")
      case s:String => println("String")
      case m:Map[_,_] => println("Map")
      case _ => println("Other Type...")
    }

//    matchType(1)
//    matchType("ruoze")
//    matchType(Map("ruoze"->30))
//    matchType(10L)

    /**
      * Matching on collections
      */
    def matchList(list:List[String]): Unit = {
      list match {
        case "ruoze"::Nil => println("Hello: ruoze") // matches only a list whose single element is "ruoze"
        case x::y::Nil => println(s"Hi: $x , $y") // matches any list with exactly two elements
        case "jepson"::tail => println("HI:jepson and others") // matches a list that starts with "jepson"
        case _ => println("......")
      }
    }

//    matchList(List("ruoze"))
//    matchList(List("Cang","Long"))
//    matchList(List("jepson","Cang","Long","Bo"))
//    matchList(List("Cang","Long","Bo","jepson"))


    /**
      * Matching on case classes
      */
    val caseclasses = Array(CheckTimeOutTask,HeartBeat(3000),SubmitTask("100","task100"))

    caseclasses(Random.nextInt(caseclasses.length)) match {
      case CheckTimeOutTask => println("CheckTimeOutTask")
      case HeartBeat(time) => println("HeartBeat")
      case SubmitTask(id,name) => println("SubmitTask")
    }

    // Exception handling uses the same case syntax, e.g. when working with streams
    val file = "xx.txt"
    try{
      //TODO... business logic
      // open file
      1/0
    } catch {
      case e:ArithmeticException => println("Cannot divide by zero...")
      case e:Exception => e.printStackTrace()
    } finally {
      // release resources
      // close
      println("This always runs....")
    }


  }


}

case class SubmitTask(id:String,name:String)
case class HeartBeat(time:Long)
case object CheckTimeOutTask
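
The order of the cases matters in the list example, which is easy to miss when the calls are commented out. The sketch below mirrors `matchList` but returns the text instead of printing it, so each outcome can be checked. `MatchListDemo` and `describe` are names introduced only for this illustration.

```scala
object MatchListDemo {
  // Same patterns as matchList, evaluated top to bottom; the first match wins.
  def describe(list: List[String]): String = list match {
    case "ruoze" :: Nil => "Hello: ruoze"          // exactly one element, "ruoze"
    case x :: y :: Nil  => s"Hi: $x , $y"          // exactly two elements
    case "jepson" :: _  => "HI:jepson and others"  // starts with "jepson"
    case _              => "......"                // everything else
  }
}
```

Note that `describe(List("jepson", "x"))` yields `Hi: jepson , x` rather than the jepson message, because the two-element case comes first. A list that merely ends with "jepson" falls through to the default.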

Currying and Partial Functions

import scala.util.Random

object OtherFunctionApp {

  def main(args: Array[String]): Unit = {
    // currying
    def sum(a:Int,b:Int) = a + b

    //println(sum(3,5))

    // Used in the Spark source code and in Spark SQL UDFs
    def sum2(a:Int)(b:Int) = a + b
    //println(sum2(4)(6))

    /**
      * Partial function: PartialFunction[A, B]
      * A: input type
      * B: output type
      *
      * A group of case clauses wrapped in braces, without match
      */
    val teachers = Array("Aoi Sola", "YuiHatano", "Akiho Yoshizawa")
    val name = teachers(Random.nextInt(teachers.length))


    def say:PartialFunction[String,String] = {
      case "YuiHatano" => "Teacher Hatano"
      case "Akiho Yoshizawa" => "Teacher Yoshizawa"
      case _ => "No idea who this teacher is"
    }

    println(say(name))
  }

}
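
Two things the example above does not show are what a curried method buys you and what makes a partial function "partial". The sketch below is one hedged illustration. `CurryPartialDemo`, `add4`, `greet`, and `greetKnown` are made-up names, while `isDefinedAt` and `collect` are standard library methods.

```scala
object CurryPartialDemo {
  def sum2(a: Int)(b: Int): Int = a + b

  // Supplying only the first argument list yields a function Int => Int.
  val add4: Int => Int = sum2(4)

  // No `case _` here, so this function is undefined for any other name.
  val greet: PartialFunction[String, String] = {
    case "ruoze"  => "Hello: ruoze"
    case "jepson" => "Hello: jepson"
  }

  // collect applies the partial function only where it is defined.
  def greetKnown(names: List[String]): List[String] = names.collect(greet)
}
```

`CurryPartialDemo.add4(6)` is 10, `greet.isDefinedAt("other")` is false, and `greetKnown(List("ruoze", "other", "jepson"))` keeps only the two known names. The `say` function in the original always matches because of its `case _` clause.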

File Operations

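import scala.io.Source

object FileApp {
  def main(args: Array[String]): Unit = {
    // Open the file (the path is specific to the author's machine)
    val content = Source.fromFile("E:\\ruozedata_workspace\\ruozedata-spark\\data\\file.txt")
//    println(content)

    // Print the file line by line
    def read(): Unit ={
      for(line <- content.getLines()){
        println(line)
      }
    }
    // Close the source once reading is done
    try read() finally content.close()
  }

}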
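
Because the path above only exists on one machine, here is a variant that can run anywhere: it writes a temporary file and reads it back. This is only a sketch, `ReadLinesDemo`, `readLines`, and `roundTrip` are hypothetical names, and UTF-8 is an assumed encoding.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source

object ReadLinesDemo {
  // Reads all lines of a file and always closes the underlying source.
  def readLines(path: Path): List[String] = {
    val source = Source.fromFile(path.toFile, "UTF-8")
    try source.getLines().toList // materialize the lines before closing
    finally source.close()
  }

  // Writes text to a temporary file, reads it back, and deletes the file.
  def roundTrip(text: String): List[String] = {
    val tmp = Files.createTempFile("file-app-demo", ".txt")
    try {
      Files.write(tmp, text.getBytes(StandardCharsets.UTF_8))
      readLines(tmp)
    } finally {
      Files.deleteIfExists(tmp)
    }
  }
}
```

`getLines()` is lazy, so the lines are converted to a `List` before `close()` runs. Closing first and iterating afterwards would fail.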
