1. Advanced commands

Summary of common commands:

I: HA commands

To operate HA, use hdfs haadmin.
[hadoop@ruozedata003 ~]$ hdfs haadmin
Usage: DFSHAAdmin [-ns <nameserviceId>]
    [-transitionToActive <serviceId> [--forceactive]]
    [-transitionToStandby <serviceId>]
    [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
    [-getServiceState <serviceId>]
    [-checkHealth <serviceId>]
    [-help <command>]

1. hdfs haadmin -getServiceState nn1/nn2       Show the state of a serviceId

e.g.:
[hadoop@ruozedata003 ~]$ hdfs haadmin -getServiceState nn1
active
[hadoop@ruozedata003 ~]$ hdfs haadmin -getServiceState nn2
standby
[hadoop@ruozedata003 ~]$
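
Checking each serviceId by hand gets repetitive, so the two queries above can be wrapped in a small helper. This is only a sketch: the function name ha_states is made up for illustration, and it assumes hdfs haadmin exits with a non-zero status when the NameNode cannot be reached.

```shell
# Print the HA state of every NameNode serviceId passed as an argument.
# nn1 and nn2 are the serviceIds used in these notes; pass your own.
ha_states() {
    local id state
    for id in "$@"; do
        # -getServiceState prints "active" or "standby". ASSUMPTION: on a
        # connection failure the command exits non-zero, reported here as
        # "unreachable".
        if state=$(hdfs haadmin -getServiceState "$id" 2>/dev/null); then
            printf '%s\t%s\n' "$id" "$state"
        else
            printf '%s\t%s\n' "$id" "unreachable"
        fi
    done
}
```

Typical use would be: ha_states nn1 nn2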

Small case simulation: suppose the active NN is killed. Can the standby NN become active?

(1) Kill the NameNode process of nn1
			[hadoop@ruozedata001 ~]$ kill -9 4584
			[hadoop@ruozedata001 ~]$ jps
			5184 ResourceManager
			4690 DataNode
			4338 QuorumPeerMain
			4883 JournalNode
			5291 NodeManager
			5836 Jps
			5070 DFSZKFailoverController
			[hadoop@ruozedata001 ~]$ 
(2) Check the state of nn1 and nn2
			[hadoop@ruozedata003 ~]$ hdfs haadmin  -getServiceState nn1
			19/08/24 17:49:08 INFO ipc.Client: Retrying connect to server: ruozedata001/172.17.76.204:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
			Operation failed: Call From ruozedata003/172.17.76.205 to ruozedata001:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
			[hadoop@ruozedata003 ~]$ hdfs haadmin  -getServiceState nn2
			active
This shows that the standby NN has switched to active.
(3) Manually start the NameNode of nn1 on its own
			[hadoop@ruozedata001 ~]$ hadoop-daemon.sh start namenode
			starting namenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.15.1/logs/hadoop-hadoop-namenode-ruozedata001.out
			[hadoop@ruozedata001 ~]$ jps
			5184 ResourceManager
			4690 DataNode
			4338 QuorumPeerMain
			4883 JournalNode
			5989 Jps
			5909 NameNode
			5291 NodeManager
			5070 DFSZKFailoverController
			[hadoop@ruozedata001 ~]$ 
(4) Check the state of each serviceId
			[hadoop@ruozedata001 ~]$ hdfs haadmin  -getServiceState nn1
			standby
			[hadoop@ruozedata001 ~]$ hdfs haadmin  -getServiceState nn2
			active
			[hadoop@ruozedata001 ~]$
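
When repeating the kill test above, it can help to wait for the takeover instead of re-running the query by hand. A minimal sketch, assuming a hypothetical helper named wait_for_active and arbitrary defaults of 30 attempts, one second apart:

```shell
# Poll a NameNode until it reports "active" or the attempts run out.
# Arguments: serviceId, max attempts (default 30), seconds between polls (default 1).
wait_for_active() {
    local id="$1" tries="${2:-30}" delay="${3:-1}" i=0 state
    while [ "$i" -lt "$tries" ]; do
        # An unreachable NameNode yields an empty state and counts as "not active".
        state=$(hdfs haadmin -getServiceState "$id" 2>/dev/null)
        if [ "$state" = "active" ]; then
            echo "$id became active after $i retries"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "$id is still not active after $tries attempts" >&2
    return 1
}
```

For the scenario above, one would run wait_for_active nn2 right after killing the NameNode of nn1.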

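2. Failover: moving the active role to the standby
hdfs haadmin -failover <serviceId1> <serviceId2>: the first argument is the NN giving up the active role and the second is the NN taking it, so active moves from serviceId1 to serviceId2. In the transcript below, the first call (nn2 nn1) targets nn1, which is already active, so the states do not change; the second call (nn1 nn2) makes nn2 active.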

[hadoop@ruozedata001 ~]$ hdfs haadmin -getServiceState nn1
active
[hadoop@ruozedata001 ~]$ hdfs haadmin -getServiceState nn2
standby
[hadoop@ruozedata001 ~]$ hdfs haadmin -failover nn2 nn1   
Failover to NameNode at ruozedata001/172.17.76.204:8020 successful
[hadoop@ruozedata001 ~]$ hdfs haadmin -getServiceState nn1
active
[hadoop@ruozedata001 ~]$ hdfs haadmin -getServiceState nn2
standby
[hadoop@ruozedata001 ~]$ hdfs haadmin -failover nn1 nn2
Failover to NameNode at ruozedata002/172.17.76.206:8020 successful
[hadoop@ruozedata001 ~]$ hdfs haadmin -getServiceState nn1
standby
[hadoop@ruozedata001 ~]$ hdfs haadmin -getServiceState nn2
active
[hadoop@ruozedata001 ~]$
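
Because the argument order is easy to get backwards, the check-then-failover sequence in the transcript above can be sketched as one helper. The name failover_to is hypothetical; it only combines the -getServiceState and -failover options shown in the usage text.

```shell
# Move the active role to the target serviceId.
# Arguments: target serviceId, the other serviceId of the pair.
failover_to() {
    local target="$1" other="$2"
    if [ "$(hdfs haadmin -getServiceState "$target")" = "active" ]; then
        echo "$target is already active; nothing to do"
        return 0
    fi
    # First argument: the NN giving up the active role; second: the NN taking it.
    hdfs haadmin -failover "$other" "$target"
}
```

With the serviceIds from these notes, failover_to nn2 nn1 would run hdfs haadmin -failover nn1 nn2 only when nn2 is not already active.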

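II: HDFS cluster health check commands

Use these when files are damaged, that is, when there are corrupt blocks or missing replicas.

To health-check the HDFS cluster, use hdfs fsck.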
[hadoop@ruozedata001 ~]$ hdfs fsck
Usage: DFSck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]] [-maintenance]
        <path>  start checking from this path
        -move   move corrupted files to /lost+found
        -delete delete corrupted files
        -files  print out files being checked
        -openforwrite   print out files opened for write
        -includeSnapshots       include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
        -list-corruptfileblocks    print out list of missing blocks and files they belong to
        -blocks print out block report
        -locations      print out locations for every block
        -racks  print out network topology for data-node locations

        -maintenance    print out maintenance state node details
        -blockId        print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)

1. Check whether there are corrupt blocks or missing replicas

[hadoop@ruozedata001 ~]$ hdfs fsck /
Connecting to namenode via http://ruozedata002:50070/fsck?ugi=hadoop&path=%2F
FSCK started by hadoop (auth:SIMPLE) from /172.17.76.204 for path / at Sat Aug 24 18:14:07 CST 2019
Status: HEALTHY
 Total size:    0 B
 Total dirs:    7
 Total files:   0
 Total symlinks:                0
 Total blocks (validated):      0
 Minimally replicated blocks:   0
 Over-replicated blocks:        0
 Under-replicated blocks:       0
 Mis-replicated blocks:         0
 Default replication factor:    3
 Average block replication:     0.0
 Corrupt blocks:                0
 Missing replicas:              0
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Sat Aug 24 18:14:07 CST 2019 in 2 milliseconds
The filesystem under path '/' is HEALTHY
[hadoop@ruozedata001 ~]$
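
For a scheduled check, the two interesting numbers and the final verdict can be pulled out of the report shown above. This is a sketch built on the report layout printed in these notes; the helper names fsck_counts and fsck_is_healthy are made up, and other Hadoop versions may format the report differently.

```shell
# Read an fsck report on stdin and print "<corrupt blocks> <missing replicas>".
fsck_counts() {
    awk -F: '
        /^ *Corrupt blocks:/   { gsub(/[ \t]/, "", $2); corrupt = $2 }
        /^ *Missing replicas:/ { gsub(/[ \t]/, "", $2); missing = $2 }
        END { print corrupt + 0, missing + 0 }
    '
}

# Succeed only if the report on stdin contains the HEALTHY verdict line.
fsck_is_healthy() {
    grep -q "is HEALTHY$"
}
```

Possible use: hdfs fsck / | fsck_counts, or hdfs fsck / | fsck_is_healthy || echo "HDFS needs attention".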

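2. If there are corrupt blocks or missing replicas and you do not want to recover them, you can settle it once and for all by deleting the corrupted files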

[hadoop@ruozedata001 ~]$ hdfs fsck / -delete
Connecting to namenode via http://ruozedata002:50070/fsck?ugi=hadoop&delete=1&path=%2F
FSCK started by hadoop (auth:SIMPLE) from /172.17.76.204 for path / at Sat Aug 24 18:16:17 CST 2019
Status: HEALTHY
 Total size:    0 B
 Total dirs:    7
 Total files:   0
 Total symlinks:                0
 Total blocks (validated):      0
 Minimally replicated blocks:   0
 Over-replicated blocks:        0
 Under-replicated blocks:       0
 Mis-replicated blocks:         0
 Default replication factor:    3
 Average block replication:     0.0
 Corrupt blocks:                0
 Missing replicas:              0
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Sat Aug 24 18:16:17 CST 2019 in 0 milliseconds
The filesystem under path '/' is HEALTHY

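Note: what if you do not want to delete the corrupted files? (homework)
3. Which blocks of a corrupted file are damaged (since we do not know which machines hold the file's blocks): -list-corruptfileblocks prints which blocks of the corrupted files are corrupt or missing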

III: HDFS cluster case study

1. Symptom:
A power outage leaves the HDFS service abnormal, or blocks are reported as corrupt.

2. Check the health of the HDFS file system
Run hdfs fsck /, or look at the web page on port 50070 (this command only shows which files are corrupted).

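3. Check which blocks of the corrupted files are damaged (add the -list-corruptfileblocks option)

hdfs fsck / -list-corruptfileblocks prints the corrupt blocks,
showing which blocks of which file are damaged.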
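
One file can have several corrupt blocks, so it is often handier to reduce that listing to one line per affected file. The sketch below is a guess at how to do that: corrupt_files is a hypothetical name, and the input format (one line per block, "blk_<id>", a TAB, then the file path) is an assumption that should be checked against the output of your Hadoop version.

```shell
# Read `hdfs fsck <path> -list-corruptfileblocks` output on stdin and print
# each affected file path once, in order of first appearance.
# ASSUMPTION: corrupt entries look like "blk_<id><TAB><file path>".
corrupt_files() {
    awk -F'\t' '/^blk_/ && !seen[$2]++ { print $2 }'
}
```

Possible use: hdfs fsck / -list-corruptfileblocks | corrupt_files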

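4. Fixes
1. First fix (not recommended, time-consuming):
Reload the damaged data into HDFS. This requires knowing exactly which data to reload, i.e. where the source data for the deleted corrupted file lives.
2. Second fix: the experiment below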

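5. How do you find out which machines hold which blocks of a file?
If I know which blocks of which file sit on which machine, I can delete that block by hand (with Linux commands) and do not need hdfs fsck / -delete.

That matters because hdfs fsck / -delete deletes the corrupted file: it removes the whole file, not just the corrupt block.

But there is a problem: once the corrupt block is deleted, how do you restore the data completely? How do you know how much data was lost with that block? You cannot know.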
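
The fsck usage text earlier in these notes lists the -files, -blocks and -locations options, which together answer the "which block is on which machine" question. A minimal wrapper, with the hypothetical name block_locations:

```shell
# Show every block of an HDFS file and the DataNodes holding its replicas.
# The options come from the fsck usage text: -files, -blocks, -locations.
block_locations() {
    local path="$1"
    if [ -z "$path" ]; then
        echo "usage: block_locations <hdfs-path>" >&2
        return 2
    fi
    hdfs fsck "$path" -files -blocks -locations
}
```

Passing one file path rather than / keeps the output short enough to read.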

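Experiment: upload a file to HDFS (three replicas), locate the corresponding replica file on one machine, and delete that one replica.
Finally, use hdfs debug to restore the damaged file from the two remaining good replicas.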

Important command: hdfs debug
[hadoop@ruozedata001 ~]$ hdfs debug
Usage: hdfs debug <command> [arguments]
These commands are for advanced users only.
Incorrect usages may result in data loss. Use at your own risk.
verifyMeta -meta <metadata-file> [-block <block-file>]
computeMeta -block <block-file> -out <output-metadata-file>
recoverLease -path <path> [-retries <num-retries>]
[hadoop@ruozedata001 ~]$
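
The notes do not show which hdfs debug subcommand the experiment uses. As a sketch, assuming recoverLease is the one intended, a wrapper could look like the following; the name recover_file and the default of 10 retries are arbitrary choices for illustration, and whether this alone repairs a deleted replica should be verified on a test cluster.

```shell
# Run `hdfs debug recoverLease` on one HDFS file.
# Arguments: HDFS path, number of retries (default 10, an arbitrary choice).
# ASSUMPTION: recoverLease is the subcommand the experiment relies on.
recover_file() {
    local path="$1" retries="${2:-10}"
    if [ -z "$path" ]; then
        echo "usage: recover_file <hdfs-path> [retries]" >&2
        return 2
    fi
    hdfs debug recoverLease -path "$path" -retries "$retries"
}
```

As the usage text warns, these commands are for advanced users, so try the wrapper on a throwaway file first.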