Last time the results were written out as plain text, and the client asked: what about a compressed format?
In production the output data is very, very large, so it definitely needs to be compressed.
e.g. more than 10 GB of data every 5 minutes
object MulitOutputApp { ... }
Study the underlying implementation: I think an earlier post covered it, but I forget.
1. saveAsHadoopFile
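As a rough sketch of the idea (not the original MulitOutputApp code): saveAsTextFile takes an optional Hadoop compression codec class, and saveAsHadoopFile accepts one the same way. The output path and the choice of BZip2Codec below are assumptions for illustration.

```scala
import org.apache.hadoop.io.compress.BZip2Codec
import org.apache.spark.{SparkConf, SparkContext}

object CompressedOutputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CompressedOutputSketch").setMaster("local[2]"))

    val counted = sc.parallelize(List("a", "c", "c", "a", "b", "b", "d"))
      .map((_, 1))
      .reduceByKey(_ + _)

    // passing a codec class makes the part files come out compressed (.bz2 here),
    // which is what turns 10+ GB of plain text every 5 minutes into something manageable
    counted.map { case (w, c) => s"$w\t$c" }
      .saveAsTextFile("/data_spark/output_compressed", classOf[BZip2Codec])

    sc.stop()
  }
}
```

BZip2 has the advantage of being splittable, so the compressed files can still be read in parallel by a downstream job; gzip compresses faster, but a .gz file is processed by a single task.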
Additional notes
object InterviewApp03ToMySQL { ... }
object MySQLUtils { ... }
Result:
mysql> select * from topn;
Using scalikejdbc to write the data into MySQL would be even better; it was covered in the earlier Scala posts.
A problem to point out:
object InterviewApp03ToMySQL { ... }
Result:
mysql> truncate table topn; // adding a step in the code that deletes the previous run's data fixes this; better not to truncate in general, it is just convenient while learning
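As a rough sketch of that fix (not the exact InterviewApp03ToMySQL / MySQLUtils code): delete the previous run's rows once on the driver, then insert per partition with one connection and one batch. The JDBC URL, credentials, table and column names below are assumptions.

```scala
import java.sql.{Connection, DriverManager}

import org.apache.spark.rdd.RDD

object TopNToMySQLSketch {

  // hypothetical connection details, standing in for whatever MySQLUtils holds
  private def getConnection(): Connection =
    DriverManager.getConnection("jdbc:mysql://hadoop101:3306/spark", "root", "password")

  def save(result: RDD[(String, Int)]): Unit = {
    // clear the previous run's rows once, on the driver, so no manual truncate is needed
    val c = getConnection()
    try c.prepareStatement("delete from topn").executeUpdate() finally c.close()

    result.foreachPartition { part =>
      // one connection and one batch per partition, never one per record
      val conn  = getConnection()
      val pstmt = conn.prepareStatement("insert into topn(word, cnt) values (?, ?)")
      try {
        part.foreach { case (word, cnt) =>
          pstmt.setString(1, word)
          pstmt.setInt(2, cnt)
          pstmt.addBatch()
        }
        pstmt.executeBatch()
      } finally {
        pstmt.close()
        conn.close()
      }
    }
  }
}
```

With scalikejdbc the shape stays the same; only the connection and SQL handling become tidier.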
Submitting Applications
At work we develop in IDEA; in production, applications are submitted with spark-submit (Submitting Applications).
1. spark-shell calls spark-submit under the hood
In IDEA:
[double_happy@hadoop101 bin]$ ./spark-submit --help
spark-submit \ ...
[double_happy@hadoop101 lib]$ hadoop fs -text /data_spark/output/par*
If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies.
Your code depends on Hadoop and Spark, but the cluster already has those, so package a thin jar; if you need third-party packages:
--jars JARS                 Comma-separated list of jars to include on the driver
Loading Configuration from a File
The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it will read options from conf/spark-defaults.conf in the Spark directory.
By default, spark-submit loads conf/spark-defaults.conf in the Spark directory.
The above is the most basic usage.
By default it goes through spark-defaults.conf, so anything not specified there or on the command line falls back to the built-in default parameters.
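A small sketch of how the layering works (per the Spark docs): values set on the SparkConf in code take priority over spark-submit flags, which take priority over spark-defaults.conf, and anything left unset keeps Spark's built-in default. The serializer setting below is only an illustrative assumption.

```scala
import org.apache.spark.SparkConf

// precedence: SparkConf in code > spark-submit flags > conf/spark-defaults.conf > built-in defaults
val conf = new SparkConf()
  .setAppName("PrecedenceDemo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // wins over any file/flag value
// e.g. spark.driver.memory, if not set anywhere, stays at Spark's built-in default
```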
Monitoring
For the program run above:
After it finishes successfully and the SparkContext is torn down, can you still see it in the UI?
scala> sc.parallelize(List("a","c","c","a","b","b","d")).map((_,1)).reduceByKey(_+_).collect
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
1. The page shows how many jobs there are (one job per action in the code); see the sketch after this list.
3. How many tasks does a stage have? Click into the stage to see.
Information about the running executors:
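A small sketch of where those numbers come from, reusing the word count from the shell session above; the partition count is an assumption:

```scala
// 2 partitions -> each stage runs 2 tasks; reduceByKey adds a shuffle -> 2 stages per job
val counted = sc.parallelize(List("a", "c", "c", "a", "b", "b", "d"), 2)
  .map((_, 1))
  .reduceByKey(_ + _)

counted.collect()   // action #1 -> job 0 in the UI
counted.count()     // action #2 -> job 1 (its map stage shows as skipped, the shuffle output is reused)
```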
The reason the run in that screenshot was so slow:
emmm, there are always people after a free ride mining crypto on my cloud server, which is damn annoying, because I opened up all the ports.
If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
This sentence matters: at my previous company I liked using yarn-client mode and submitted quite a few jobs; once the number got very large, newly submitted jobs could not get in at all.
Note that this information is only available for the duration of the application by default. To view the web UI after the fact, set spark.eventLog.enabled to true before starting the application. This configures Spark to log Spark events that encode the information displayed in the UI to persisted storage.
Note that this information is only available for the duration of the application by default
So how do we build a UI that keeps a record of finished applications (the history server)?
1. spark.eventLog.enabled true: turn the switch on
[double_happy@hadoop101 conf]$ cat spark-defaults.conf
This creates a web interface at http://&lt;server-url&gt;:18080 by default, listing incomplete and completed applications and attempts.
1. How does Spark tell completed from incomplete applications? It needs a refresh interval.
Configure it:
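A minimal sketch of the settings involved, written as Scala against a SparkConf; in the post they live in conf/spark-defaults.conf, and the host, port, and path below are assumptions.

```scala
import org.apache.spark.SparkConf

// application side: write event logs so finished apps can be replayed by the history server
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://hadoop101:9000/spark_log") // the directory must already exist on HDFS

// history server side: point spark.history.fs.logDirectory at the same HDFS path;
// it rescans the directory every spark.history.fs.update.interval (10s by default)
```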
Start it:
1. Create the log directory on HDFS
[double_happy@hadoop101 sbin]$ ./start-history-server.sh
That means it started OK.
Test:
Note that in all of these UIs, the tables are sortable by clicking their headers, making it easy to identify slow tasks, data skew, etc.
Note
1.The history server displays both completed and incomplete Spark jobs. If an application makes multiple attempts after failures, the failed attempts will be displayed, as well as any ongoing incomplete attempt or the final successful attempt.
2.Incomplete applications are only updated intermittently. The time between updates is defined by the interval between checks for changed files (spark.history.fs.update.interval). On larger clusters, the update interval may be set to large values. The way to view a running application is actually to view its own web UI.
3.Applications which exited without registering themselves as completed will be listed as incomplete —even though they are no longer running. This can happen if an application crashes.
4.One way to signal the completion of a Spark job is to stop the Spark Context explicitly (sc.stop()), or in Python using the with SparkContext() as sc: construct to handle the Spark Context setup and tear down.
4. In other words, if you call sc.stop() in your code, the application shows up under completed once it finishes; if you do not, it shows up under incomplete.
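A tiny sketch of point 4 (the app name and numbers are arbitrary): the explicit sc.stop() at the end is what lets the history server file the application under completed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StopDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StopDemo").setMaster("local[2]"))
    sc.parallelize(1 to 10).count()
    sc.stop() // without this line the app is listed under "Incomplete applications"
  }
}
```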
1. Jobs run locally all have application IDs starting with local-
And there the page is again, even though the program has already finished running: the test succeeded.
What you Download is a JSON file; you can also look in the log directory I configured on HDFS, and download it from there as well.
start-history-server.sh
[double_happy@hadoop101 sbin]$ cat start-history-server.sh
Find the HistoryServer class in IDEA:
It is written in the source:
Closures
Shared Variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel.
Spark natively supports accumulators of numeric types, and programmers can add support for new types.
accumulators:
scala> val accum = sc.longAccumulator("My Accumulator")
Accumulators nowadays are all the AccumulatorV2 version; the official docs describe how to write a custom accumulator. I have not used a custom one in production.
Example:
object InterviewApp03 { ... }
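A minimal sketch of the usual accumulator pattern (not necessarily what InterviewApp03 does): count dirty records on the executors while filtering, and read the total on the driver only after an action has run. The input path and the notion of a valid record are assumptions.

```scala
// spark-shell style; sc is the usual SparkContext
val dirty = sc.longAccumulator("dirty records")

val cleaned = sc.textFile("/data_spark/input/access.log") // hypothetical input
  .filter { line =>
    val ok = line.split("\t").length >= 3                 // hypothetical "valid record" rule
    if (!ok) dirty.add(1L)                                // counted on the executors
    ok
  }

cleaned.count()                                           // an action has to run first
println(s"dirty records: ${dirty.value}")                 // .value is read on the driver
```

One documented caveat: for accumulator updates made inside transformations, a re-executed task may add its updates more than once; only updates performed inside actions are guaranteed to be applied exactly once.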
broadcast variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
e.g. this code:
object RDDOperationApp02 { ... }
scala> val rdd1 = sc.parallelize(Array(("23","smart"),("9","愤怒的麻雀"))).collectAsMap()
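Following on from that collectAsMap() line, a minimal sketch of replacing a shuffle join with a broadcast (map-side) lookup; the big-side data here is made up for illustration.

```scala
// spark-shell style, continuing from rdd1 above
val smallMap = sc.parallelize(Array(("23", "smart"), ("9", "愤怒的麻雀"))).collectAsMap()
val smallBC  = sc.broadcast(smallMap)                // shipped once per executor, read-only

val big = sc.parallelize(Array(("23", "beijing"), ("9", "shanghai"), ("7", "shenzhen")))

val joined = big.mapPartitions { iter =>
  val lookup = smallBC.value                         // local lookup in each task, no shuffle
  iter.flatMap { case (id, city) =>
    lookup.get(id).map(name => (id, name, city))     // ids missing from the small side are dropped
  }
}
joined.collect().foreach(println)
```

This is the standard way to avoid a shuffle when one side of a join is small enough to fit in executor memory.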
After Spark Core, the important posts on Spark SQL, Spark Streaming, SSS, and Spark tuning are kept private. The purpose of my blog is to take notes and to learn.