最近公司因为断电之前没有关闭Hadoop集群,造成数据丢失,namenode坏了,无法启动,所以我尝试恢复。
方法一:使用hadoop namenode -importCheckpoint
1、删除name目录:
1 [hadoop@node1 hdfs]$ rm -rf name
2、关闭集群,从secondarynamenode拷贝namesecondary目录到dfs.name.dir:
[hadoop@node2 hdfs]$ scp -r namesecondary node1:/app/user/hdfs/fsp_w_picpath 100% 157 0.2KB/s 00:00 fstime 100% 8 0.0KB/s 00:00 fsp_w_picpath 100% 2410 2.4KB/s 00:00 VERSION 100% 101 0.1KB/s 00:00 edits 100% 4 0.0KB/s 00:00 fstime 100% 8 0.0KB/s 00:00 fsp_w_picpath 100% 2410 2.4KB/s 00:00 VERSION 100% 101 0.1KB/s 00:00 edits 100% 4 0.0KB/s 00:00
3、在namenode节点上执行hadoop namenode -importCheckpoint
[hadoop@node1 hdfs]$ hadoop namenode -importCheckpoint13/11/14 07:24:20 INFO namenode.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = node1/192.168.1.151 STARTUP_MSG: args = [-importCheckpoint] STARTUP_MSG: version = 0.20.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010 ************************************************************/13/11/14 07:24:20 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=900013/11/14 07:24:20 INFO namenode.NameNode: Namenode up at: node1.com/192.168.1.151:900013/11/14 07:24:20 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null13/11/14 07:24:20 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext13/11/14 07:24:21 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop13/11/14 07:24:21 INFO namenode.FSNamesystem: supergroup=supergroup13/11/14 07:24:21 INFO namenode.FSNamesystem: isPermissionEnabled=true13/11/14 07:24:21 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext13/11/14 07:24:21 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean13/11/14 07:24:21 INFO common.Storage: Storage directory /app/user/hdfs/name is not formatted.13/11/14 07:24:21 INFO common.Storage: Formatting ...13/11/14 07:24:21 INFO common.Storage: Number of files = 2613/11/14 07:24:21 INFO common.Storage: Number of files under construction = 013/11/14 07:24:21 INFO common.Storage: Image file of size 2410 loaded in 0 seconds.13/11/14 07:24:21 INFO common.Storage: Edits file /app/user/hdfs/namesecondary/current/edits of size 4 edits # 0 loaded in 0 seconds.13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 saved in 0 seconds.13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 saved in 0 seconds.13/11/14 07:24:21 INFO namenode.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 13/11/14 07:24:21 INFO namenode.FSNamesystem: Finished loading FSImage in 252 msecs13/11/14 07:24:21 INFO hdfs.StateChange: STATE* Safe mode ON. The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.13/11/14 07:24:21 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog13/11/14 07:24:21 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 5007013/11/14 07:24:21 INFO http.HttpServer: listener.getLocalPort() returned 50070 webServer.getConnectors()[0].getLocalPort() returned 5007013/11/14 07:24:21 INFO http.HttpServer: Jetty bound to port 5007013/11/14 07:24:21 INFO mortbay.log: jetty-6.1.1413/11/14 07:24:21 INFO mortbay.log: Started SelectChannelConnector@node1.com:5007013/11/14 07:24:21 INFO namenode.NameNode: Web-server up at: node1.com:5007013/11/14 07:24:21 INFO ipc.Server: IPC Server Responder: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server listener on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 0 on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 1 on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 2 on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 3 on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 4 on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 5 on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 6 on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 9 on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 7 on 9000: starting13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 8 on 9000: starting13/11/14 07:37:05 INFO namenode.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.151 ************************************************************/[hadoop@node1 current]$ start-all.sh starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out[hadoop@node1 current]$ jps1027 JobTracker1121 Jps879 NameNode
总结:
注意:恢复的namenode中secondarynamenode的最近一次check到故障发生这段时间的内容将丢失,所以fs.checkpoint.period参数值在实际设定中要尽可能的权衡。并且也时常备份secondarynamenode节点中的内容,因为scondarynamenode也是单点的,以防发生故障。
补充说明:如果是用新的节点来恢复namenode,则要注意
1、新节点的Linux环境,目录结构,环境变量等等配置需要跟原来的namenode一模一样,包括conf目录下的所有文件配置。
2、新namenode的主机名要与原namenode保持一致,如果是重新命名主机名的话,则需要批量替换datanode和secondarynamenode的hosts文件,并且重新配置以下文件的部分core-site.xml文件中的fs.default.name
hdfs-site.xml文件中的dfs.http.address(secondarynamenode节点上)
mapred-site.xml文件中的mapred.job.tracker(如果jobtracker与namenode在同一个机器上,一般都是同一台机器上)。
还有第二种方法:
使用namespaceID
1、关闭集群,格式化namenode:
1 [hadoop@node1 name]$ stop-all.sh 2 stopping jobtracker 3 192.168.1.152: stopping tasktracker 4 192.168.1.153: stopping tasktracker 5 no namenode to stop 6 192.168.1.152: stopping datanode 7 192.168.1.153: stopping datanode 8 192.168.1.152: stopping secondarynamenode 9 [hadoop@node1 name]$ hadoop namenode -format10 13/11/14 06:21:37 INFO namenode.NameNode: STARTUP_MSG: 11 /************************************************************12 STARTUP_MSG: Starting NameNode13 STARTUP_MSG: host = node1/192.168.1.15114 STARTUP_MSG: args = [-format]15 STARTUP_MSG: version = 0.20.216 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 201017 ************************************************************/18 Re-format filesystem in /app/user/hdfs/name ? (Y or N) Y19 13/11/14 06:21:39 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop20 13/11/14 06:21:39 INFO namenode.FSNamesystem: supergroup=supergroup21 13/11/14 06:21:39 INFO namenode.FSNamesystem: isPermissionEnabled=true22 13/11/14 06:21:39 INFO common.Storage: Image file of size 96 saved in 0 seconds.23 13/11/14 06:21:39 INFO common.Storage: Storage directory /app/user/hdfs/name has been successfully formatted.24 13/11/14 06:21:39 INFO namenode.NameNode: SHUTDOWN_MSG: 25 /************************************************************26 SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.15127 ************************************************************/
2、从任意datanode中获取namenode格式化之前namespaceID并修改namenode的namespaceID跟datanode一致:
#Thu Nov :: CST namespaceID storageIDDS. cTime storageType layoutVersion apphdfsdata ----修改namenode的namespaceID---- #Thu Nov :: CST namespaceID cTime storageType layoutVersion
3、删除新的namenode的fsp_w_picpath文件:
1 [hadoop@node1 current]$ ll2 total 163 -rw-rw-r-- 1 hadoop hadoop 4 Nov 14 06:21 edits4 -rw-rw-r-- 1 hadoop hadoop 96 Nov 14 06:21 fsp_w_picpath6 -rw-rw-r-- 1 hadoop hadoop 8 Nov 14 06:21 fstime6 -rw-rw-r-- 1 hadoop hadoop 101 Nov 14 06:22 VERSION7 [hadoop@node1 current]$ rm fsp_w_picpath
4、从Secondarynamenode拷贝fsp_w_picpath到Namenode的current目录下:
[hadoop@node2 current]$ ll total 16-rw-rw-r-- 1 hadoop hadoop 4 Nov 14 05:38 edits-rw-rw-r-- 1 hadoop hadoop 2410 Nov 14 05:38 fsp_w_picpath-rw-rw-r-- 1 hadoop hadoop 8 Nov 14 05:38 fstime-rw-rw-r-- 1 hadoop hadoop 101 Nov 14 05:38 VERSION[hadoop@node2 current]$ scp fsp_w_picpath node1:/app/user/hdfs/name/currentThe authenticity of host 'node1 (192.168.1.151)' can't be established. RSA key fingerprint is ca:9a:7e:19:ee:a1:35:44:7e:9d:d4:09:5c:fc:c5:0a. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'node1,192.168.1.151' (RSA) to the list of known hosts. fsp_w_picpath 100% 2410 2.4KB/s 00:00
5、重启集群:
[hadoop@node1 current]$ start-all.sh starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out[hadoop@node1 current]$ jps32486 Jps32419 JobTracker32271 NameNode
在第二种方法中,其中1,2步骤是格式化namenode,后面3,4,5是用备份恢复之前的数据。
我恢复的时候,3,4,5也做了,是备份数据竟然是去年9月份的,而且也是坏的,无奈,所有数据都没了。。。所以说一定要定时去手动备份namenode和secondrynamenode,因为本身系统也是单点备份,很不可靠,折腾了几天,还是没有恢复数据。。算是吃一堑长一智吧。
原创文章,作者:奋斗,如若转载,请注明出处:https://blog.ytso.com/193926.html