Hadoop 2.7.6_02: Common HDFS Operations

 

1. Common HDFS Operations

1.1. Querying

1.1.1. Browser Query

(Screenshot: browsing HDFS in the NameNode web UI at http://10.0.0.11:50070/explorer.html)

 

1.1.2. Command-Line Query

[[email protected] bin]$ hadoop fs -ls / 

  

1.2. Uploading Files

[[email protected] zhangliang]$ cat test.info
111
222
333
444
555

[[email protected] zhangliang]$ hadoop fs -put test.info /   # upload the file
[[email protected] software]$ ll -h
total 641M
-rw-r--r--  1 yun    yun  190M Jun  8 16:36 CentOS-7.4_hadoop-2.7.6.tar.gz
[[email protected] software]$ hadoop fs -put CentOS-7.4_hadoop-2.7.6.tar.gz /  # upload the file
[[email protected] software]$ hadoop fs -ls /
Found 2 items
-rw-r--r--   3 yun supergroup  198811365 2018-06-10 09:04 /CentOS-7.4_hadoop-2.7.6.tar.gz
-rw-r--r--   3 yun supergroup         21 2018-06-10 09:02 /test.info

 

1.2.1. Where the Blocks Are Stored

[[email protected] subdir0]$ pwd   # per the earlier configuration, each block is stored as 3 identical replicas
/app/hadoop/tmp/dfs/data/current/BP-925531343-10.0.0.11-1528537498201/current/finalized/subdir0/subdir0
[[email protected] subdir0]$ ll -h  # files are split into 128 MB blocks by default
total 192M
-rw-rw-r-- 1 yun yun   21 Jun 10 09:02 blk_1073741825
-rw-rw-r-- 1 yun yun   11 Jun 10 09:02 blk_1073741825_1001.meta
-rw-rw-r-- 1 yun yun 128M Jun 10 09:04 blk_1073741826
-rw-rw-r-- 1 yun yun 1.1M Jun 10 09:04 blk_1073741826_1002.meta
-rw-rw-r-- 1 yun yun  62M Jun 10 09:04 blk_1073741827
-rw-rw-r-- 1 yun yun 493K Jun 10 09:04 blk_1073741827_1003.meta
[[email protected] subdir0]$ cat blk_1073741825
111
222
333
444
555

[[email protected] subdir0]$
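The block sizes in the listing above can be sanity-checked with a little arithmetic, assuming the default `dfs.blocksize` of 128 MiB and HDFS's default checksum of 4 bytes per 512-byte chunk (the `.meta` files also carry a small fixed header, which is why 1.1M is listed rather than exactly 1.0M):

```shell
FILE_SIZE=198811365                     # CentOS-7.4_hadoop-2.7.6.tar.gz, from the -ls output
BLOCK_SIZE=$((128 * 1024 * 1024))       # default dfs.blocksize: 134217728 bytes

# number of blocks = file size divided by block size, rounded up
NUM_BLOCKS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))
# size of the final, partial block
LAST_BLOCK=$(( FILE_SIZE - (NUM_BLOCKS - 1) * BLOCK_SIZE ))
# checksum bytes for one full block: 4 bytes per 512-byte chunk
META_FULL=$(( BLOCK_SIZE / 512 * 4 ))

echo "blocks: $NUM_BLOCKS"              # 2, matching blk_...826 and blk_...827
echo "last block: $LAST_BLOCK bytes"    # ~62 MiB, matching the 62M listing
echo "meta for a full block: $META_FULL bytes"  # ~1.0 MiB, matching the 1.1M .meta file
```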

 

1.2.2. Browser View

(Screenshot: the uploaded files shown in the NameNode web UI)

 

1.3. Downloading Files

[[email protected] zhangliang]$ pwd
/app/software/zhangliang
[[email protected] zhangliang]$ ll
total 0
[[email protected] zhangliang]$ hadoop fs -ls /
Found 2 items
-rw-r--r--   3 yun supergroup  198811365 2018-06-10 09:04 /CentOS-7.4_hadoop-2.7.6.tar.gz
-rw-r--r--   3 yun supergroup         21 2018-06-10 09:02 /test.info
[[email protected] zhangliang]$ hadoop fs -get /test.info
[[email protected] zhangliang]$ hadoop fs -get /CentOS-7.4_hadoop-2.7.6.tar.gz
[[email protected] zhangliang]$ ll
total 194156
-rw-r--r-- 1 yun yun 198811365 Jun 10 09:10 CentOS-7.4_hadoop-2.7.6.tar.gz
-rw-r--r-- 1 yun yun        21 Jun 10 09:09 test.info
[[email protected] zhangliang]$ cat test.info
111
222
333
444
555

[[email protected] zhangliang]$

 

2. A Simple Example

2.1. Preparing the Data

[[email protected] zhangliang]$ pwd
/app/software/zhangliang
[[email protected] zhangliang]$ ll
total 8
-rw-rw-r-- 1 yun yun 33 Jun 10 09:43 test.info
-rw-rw-r-- 1 yun yun 55 Jun 10 09:43 zhang.info
[[email protected] zhangliang]$ cat test.info
111
222
333
444
555
333
222
222
222

[[email protected] zhangliang]$ cat zhang.info
zxcvbnm
asdfghjkl
qwertyuiop
qwertyuiop
111
qwertyuiop
[[email protected] zhangliang]$ hadoop fs -mkdir -p /wordcount/input
[[email protected] zhangliang]$ hadoop fs -put test.info zhang.info /wordcount/input
[[email protected] zhangliang]$ hadoop fs -ls /wordcount/input
Found 2 items
-rw-r--r--   3 yun supergroup         37 2018-06-10 09:45 /wordcount/input/test.info
-rw-r--r--   3 yun supergroup         55 2018-06-10 09:45 /wordcount/input/zhang.info

 

2.2. Running the Job

[[email protected] mapreduce]$ pwd
/app/hadoop/share/hadoop/mapreduce
[[email protected] mapreduce]$ ll
total 5000
-rw-r--r-- 1 yun yun  545657 Jun  8 16:36 hadoop-mapreduce-client-app-2.7.6.jar
-rw-r--r-- 1 yun yun  777070 Jun  8 16:36 hadoop-mapreduce-client-common-2.7.6.jar
-rw-r--r-- 1 yun yun 1558767 Jun  8 16:36 hadoop-mapreduce-client-core-2.7.6.jar
-rw-r--r-- 1 yun yun  191881 Jun  8 16:36 hadoop-mapreduce-client-hs-2.7.6.jar
-rw-r--r-- 1 yun yun   27830 Jun  8 16:36 hadoop-mapreduce-client-hs-plugins-2.7.6.jar
-rw-r--r-- 1 yun yun   62954 Jun  8 16:36 hadoop-mapreduce-client-jobclient-2.7.6.jar
-rw-r--r-- 1 yun yun 1561708 Jun  8 16:36 hadoop-mapreduce-client-jobclient-2.7.6-tests.jar
-rw-r--r-- 1 yun yun   72054 Jun  8 16:36 hadoop-mapreduce-client-shuffle-2.7.6.jar
-rw-r--r-- 1 yun yun  295697 Jun  8 16:36 hadoop-mapreduce-examples-2.7.6.jar
drwxr-xr-x 2 yun yun    4096 Jun  8 16:36 lib
drwxr-xr-x 2 yun yun      30 Jun  8 16:36 lib-examples
drwxr-xr-x 2 yun yun    4096 Jun  8 16:36 sources
[[email protected] mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.7.6.jar wordcount /wordcount/input /wordcount/output
18/06/10 09:46:09 INFO client.RMProxy: Connecting to ResourceManager at mini02/10.0.0.12:8032
18/06/10 09:46:10 INFO input.FileInputFormat: Total input paths to process : 2
18/06/10 09:46:10 INFO mapreduce.JobSubmitter: number of splits:2
18/06/10 09:46:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528589937101_0002
18/06/10 09:46:10 INFO impl.YarnClientImpl: Submitted application application_1528589937101_0002
18/06/10 09:46:10 INFO mapreduce.Job: The url to track the job: http://mini02:8088/proxy/application_1528589937101_0002/
18/06/10 09:46:10 INFO mapreduce.Job: Running job: job_1528589937101_0002
18/06/10 09:46:21 INFO mapreduce.Job: Job job_1528589937101_0002 running in uber mode : false
18/06/10 09:46:21 INFO mapreduce.Job:  map 0% reduce 0%
18/06/10 09:46:30 INFO mapreduce.Job:  map 100% reduce 0%
18/06/10 09:46:36 INFO mapreduce.Job:  map 100% reduce 100%
18/06/10 09:46:37 INFO mapreduce.Job: Job job_1528589937101_0002 completed successfully
18/06/10 09:46:37 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=113
        FILE: Number of bytes written=368235
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=311
        HDFS: Number of bytes written=65
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=12234
        Total time spent by all reduces in occupied slots (ms)=3651
        Total time spent by all map tasks (ms)=12234
        Total time spent by all reduce tasks (ms)=3651
        Total vcore-milliseconds taken by all map tasks=12234
        Total vcore-milliseconds taken by all reduce tasks=3651
        Total megabyte-milliseconds taken by all map tasks=12527616
        Total megabyte-milliseconds taken by all reduce tasks=3738624
    Map-Reduce Framework
        Map input records=16
        Map output records=15
        Map output bytes=151
        Map output materialized bytes=119
        Input split bytes=219
        Combine input records=15
        Combine output records=9
        Reduce input groups=8
        Reduce shuffle bytes=119
        Reduce input records=9
        Reduce output records=8
        Spilled Records=18
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=308
        CPU time spent (ms)=2700
        Physical memory (bytes) snapshot=709656576
        Virtual memory (bytes) snapshot=6343041024
        Total committed heap usage (bytes)=473956352
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=92
    File Output Format Counters
        Bytes Written=65
[[email protected] zhangliang]$ hadoop fs -ls /wordcount/output
Found 2 items
-rw-r--r--   3 yun supergroup          0 2018-06-10 09:46 /wordcount/output/_SUCCESS
-rw-r--r--   3 yun supergroup         65 2018-06-10 09:46 /wordcount/output/part-r-00000
[[email protected] zhangliang]$ hadoop fs -cat /wordcount/output/part-r-00000
111    2
222    4
333    2
444    1
555    1
asdfghjkl    1
qwertyuiop    3
zxcvbnm    1
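The wordcount result can be cross-checked locally with standard Unix tools. This is just a verification trick, not part of the Hadoop workflow; the two files are recreated from the contents shown in section 2.1:

```shell
cd "$(mktemp -d)"
# recreate the two input files from section 2.1
printf '111\n222\n333\n444\n555\n333\n222\n222\n222\n' > test.info
printf 'zxcvbnm\nasdfghjkl\nqwertyuiop\nqwertyuiop\n111\nqwertyuiop\n' > zhang.info

# split on whitespace, sort, and count occurrences: the shell equivalent of wordcount
cat test.info zhang.info | tr -s ' \t' '\n' | sort | uniq -c | awk '{print $2, $1}'
```

The output (111 2, 222 4, 333 2, 444 1, 555 1, asdfghjkl 1, qwertyuiop 3, zxcvbnm 1) matches part-r-00000 above.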

 

3. Example: Developing a Shell Log-Collection Script

3.1. Requirements

Clickstream logs amount to roughly 10 TB per day on the business application servers and need to be uploaded to the data warehouse (Hadoop HDFS) in near real time.

 

3.2. Analysis

Bulk uploads are usually run around midnight, since many kinds of business data are transferred at night; this avoids peak hours and reduces load on the servers.

If near-real-time upload is required, use a scheduled upload instead, for example once per hour.

 

3.3. Simulating Web Logs

Run the jar to generate simulated web logs:

# Required directories:
#   simulated web server directory:  /app/webservice
#   simulated web log directory:     /app/webservice/logs
#   staging directory for files awaiting upload: /app/webservice/logs/up2hdfs  # files here are uploaded to HDFS
[[email protected] webservice]$ pwd
/app/webservice
[[email protected] webservice]$ ll
total 360
drwxrwxr-x 3 yun yun     39 Jun 14 17:40 logs
-rw-r--r-- 1 yun yun 367872 Jun 14 11:32 testlog.jar
# run the jar to generate logs
[[email protected] webservice]$ java -jar testlog.jar &
……………………
# inspect the generated web logs
[[email protected] logs]$ pwd
/app/webservice/logs
[[email protected] logs]$ ll -hrt
total 460K
-rw-rw-r-- 1 yun yun  11K Jun 14 17:28 access.log.7
-rw-rw-r-- 1 yun yun  11K Jun 14 17:28 access.log.6
-rw-rw-r-- 1 yun yun  11K Jun 14 17:28 access.log.5
-rw-rw-r-- 1 yun yun  11K Jun 14 17:28 access.log.4
-rw-rw-r-- 1 yun yun  11K Jun 14 17:29 access.log.3
-rw-rw-r-- 1 yun yun  11K Jun 14 17:29 access.log.2
-rw-rw-r-- 1 yun yun  11K Jun 14 17:29 access.log.1
-rw-rw-r-- 1 yun yun 9.6K Jun 14 17:29 access.log

 

3.4. Running the Shell Script

# test the script
[[email protected] hadoop]$ pwd
/app/yunwei/hadoop
[[email protected] hadoop]$ ll
-rwxrwxr-x 1 yun yun 2657 Jun 14 17:27 uploadFile2Hdfs.sh
[[email protected] hadoop]$ ./uploadFile2Hdfs.sh
………………
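The script itself is not shown here (it lives in the GitHub repo linked in section 3.7), but its behavior can be inferred from the listings that follow: rotated logs are moved into `up2hdfs` with a timestamp suffix, uploaded to HDFS, then renamed with a `_DONE` suffix and recorded in a log. The sketch below illustrates that flow; paths are demo paths, and `put_to_hdfs` is a stub standing in for `hadoop fs -put` so the staging logic can be dry-run without a cluster:

```shell
#!/bin/bash
# Illustrative sketch of an uploadFile2Hdfs.sh-style collector (not the real script).
BASE=$(mktemp -d)
LOG_DIR=$BASE/logs                    # real path: /app/webservice/logs
STAGE_DIR=$LOG_DIR/up2hdfs            # files waiting to be uploaded
HDFS_DIR=/data/webservice/$(date +%Y%m%d)
TS=$(date +%Y%m%d%H%M%S)              # e.g. 20180614174001

put_to_hdfs() {                       # stub; the real script would run: hadoop fs -put "$1" "$2"
    : "dry-run: would upload $1 to $2"
}

mkdir -p "$STAGE_DIR"
# fabricate some rotated logs plus a live log for the dry run
for i in 1 2 3; do echo "line $i" > "$LOG_DIR/access.log.$i"; done
echo "live" > "$LOG_DIR/access.log"

# 1. stage rotated logs (access.log.N), leaving the live access.log untouched
for f in "$LOG_DIR"/access.log.[0-9]*; do
    [ -e "$f" ] || continue
    mv "$f" "$STAGE_DIR/$(basename "$f")_${TS}"
done

# 2. upload each staged file, mark it _DONE, and log the result
for f in "$STAGE_DIR"/*_"$TS"; do
    [ -e "$f" ] || continue
    put_to_hdfs "$f" "$HDFS_DIR" && mv "$f" "${f}_DONE" \
        && echo "$(date '+%F %T') uploadFile2Hdfs.sh $(basename "$f")        ok"
done
```

The `_DONE` rename and the log line mirror the output shown in section 3.5.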

 

3.5. Checking the Results

3.5.1. Script Output

Check the log rotation and upload:

# confirm the web logs were rotated away
[[email protected] logs]$ pwd
/app/webservice/logs
[[email protected] logs]$ ll -hrt
total 28K
-rw-rw-r-- 1 yun yun 7.3K Jun 14 17:36 access.log    ### only the live log remains; the rotated logs have been moved
drwxrwxr-x 2 yun yun  16K Jun 14 17:40 up2hdfs

# staging directory for files awaiting upload
[[email protected] up2hdfs]$ pwd
/app/webservice/logs/up2hdfs
[[email protected] up2hdfs]$ ll -hrt
-rw-rw-r-- 1 yun yun 11K Jun 14 17:35 access.log.7_20180614174001_DONE
-rw-rw-r-- 1 yun yun 11K Jun 14 17:35 access.log.6_20180614174001_DONE
-rw-rw-r-- 1 yun yun 11K Jun 14 17:35 access.log.5_20180614174001_DONE
-rw-rw-r-- 1 yun yun 11K Jun 14 17:35 access.log.4_20180614174001_DONE
-rw-rw-r-- 1 yun yun 11K Jun 14 17:35 access.log.3_20180614174001_DONE
-rw-rw-r-- 1 yun yun 11K Jun 14 17:36 access.log.2_20180614174001_DONE
-rw-rw-r-- 1 yun yun 11K Jun 14 17:36 access.log.1_20180614174001_DONE
## the _DONE suffix indicates the file has been uploaded to HDFS

 

Check the script's log:

# view the shell script's log
[[email protected] log]$ pwd 
/app/yunwei/hadoop/log 
[[email protected] log]$ ll 
total 28 
-rw-rw-r-- 1 yun yun 24938 Jun 14 17:40 uploadFile2Hdfs.sh.log 
[[email protected] log]$ cat uploadFile2Hdfs.sh.log  
2018-06-14 17:40:08 uploadFile2Hdfs.sh access.log.1_20180614174001        ok 
2018-06-14 17:40:10 uploadFile2Hdfs.sh access.log.2_20180614174001        ok 
2018-06-14 17:40:12 uploadFile2Hdfs.sh access.log.3_20180614174001        ok 
2018-06-14 17:40:14 uploadFile2Hdfs.sh access.log.4_20180614174001        ok 
2018-06-14 17:40:16 uploadFile2Hdfs.sh access.log.5_20180614174001        ok 
2018-06-14 17:40:18 uploadFile2Hdfs.sh access.log.6_20180614174001        ok 
2018-06-14 17:40:20 uploadFile2Hdfs.sh access.log.7_20180614174001        ok

 

3.5.2. Verifying the Upload with the HDFS Command Line

[[email protected] ~]$ hadoop fs -ls /data/webservice/20180614
Found 7 items
-rw-r--r--   2 yun supergroup      10268 2018-06-14 17:08 /data/webservice/20180614/access.log.1_20180614174001
-rw-r--r--   2 yun supergroup      10268 2018-06-14 17:08 /data/webservice/20180614/access.log.2_20180614174001
-rw-r--r--   2 yun supergroup      10268 2018-06-14 17:08 /data/webservice/20180614/access.log.3_20180614174001
-rw-r--r--   2 yun supergroup      10268 2018-06-14 17:08 /data/webservice/20180614/access.log.4_20180614174001
-rw-r--r--   2 yun supergroup      10268 2018-06-14 17:08 /data/webservice/20180614/access.log.5_20180614174001
-rw-r--r--   2 yun supergroup      10268 2018-06-14 17:08 /data/webservice/20180614/access.log.6_20180614174001
-rw-r--r--   2 yun supergroup      10268 2018-06-14 17:08 /data/webservice/20180614/access.log.7_20180614174001

 

3.5.3. Browser View

http://10.0.0.11:50070/explorer.html#/data/webservice/20180614

Path: /data/webservice/20180614

(Screenshot: the uploaded log files under /data/webservice/20180614 in the NameNode web UI)

 

3.6. Scheduling the Script with cron

[[email protected] hadoop]$ crontab -l

# upload web logs to HDFS, once per hour
00 */1 * * * /app/yunwei/hadoop/uploadFile2Hdfs.sh    >/dev/null 2>&1
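One practical caveat (an addition here, not from the original article): cron runs jobs with a minimal environment, so `hadoop` may resolve in an interactive shell but not under cron. A common safeguard is to set the environment explicitly at the top of the script; the install prefix `/app/hadoop` below is inferred from the directory listings earlier in this article:

```shell
# At the top of uploadFile2Hdfs.sh, make the hadoop binary findable under cron:
export HADOOP_HOME=/app/hadoop          # install prefix, per the listings above
export PATH=$PATH:$HADOOP_HOME/bin
# alternatively, source the login profile:  . /etc/profile
```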

 

3.7. Related Scripts and Jar

They are available on GitHub at:
https://github.com/zhanglianghhh/bigdata/tree/master/hadoop/hdfs

  


 

Original article by Maggie-Hunter. If reposting, please credit the source: https://blog.ytso.com/tech/bigdata/9102.html
