1. Download the software:
wget http://apache.fayea.com/pig/pig-0.15.0/pig-0.15.0.tar.gz
2. Unpack it:
tar -zxvf pig-0.15.0.tar.gz
mv pig-0.15.0 /usr/local/
ln -s /usr/local/pig-0.15.0 /usr/local/pig
3. Configure environment variables:
export PATH=$HOME/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/pig/bin:$PATH
export PIG_CLASSPATH=/usr/local/hadoop/etc/hadoop
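To make the variables take effect in the current session (assuming the two exports above were appended to ~/.bashrc), reload the profile and sanity-check the installation:
source ~/.bashrc     # reload the environment
which pig            # expect /usr/local/pig/bin/pig
pig -version         # prints the Apache Pig version banner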
4. Enter the grunt shell:
Start pig in local mode: in this mode all files and execution stay on the local machine; it is generally used for testing programs.
[hadoop@host61 ~]$ pig -x local
15/10/03 01:14:09 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/03 01:14:09 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2015-10-03 01:14:09,756 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-03 01:14:09,758 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1443860049744.log
2015-10-03 01:14:10,133 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2015-10-03 01:14:12,648 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-03 01:14:12,656 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-03 01:14:12,685 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2015-10-03 01:14:13,573 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
grunt>
Start pig in MapReduce mode (the mode used for real work):
[hadoop@host63 ~]$ pig
15/10/03 02:11:54 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/03 02:11:54 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/03 02:11:54 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-10-03 02:11:55,086 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-03 02:11:55,087 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1443863515062.log
2015-10-03 02:11:55,271 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2015-10-03 02:11:59,735 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-03 02:11:59,740 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-03 02:11:59,742 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://host61:9000/
2015-10-03 02:12:06,256 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-03 02:12:06,257 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: host61:9001
2015-10-03 02:12:06,265 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>
5. Pig can be run in three ways:
1. script (a sketch of this mode follows this list)
2. grunt (the interactive shell)
3. embedded (calling Pig from a host program)
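Script mode executes a Pig Latin file non-interactively (myscript.pig below is a hypothetical file name), grunt is the interactive shell used throughout this article, and embedded mode drives Pig from a host program, e.g. a Java application using the PigServer class. A minimal sketch of script mode:
pig myscript.pig            # run the script on the cluster (MapReduce mode)
pig -x local myscript.pig   # run the script locally for testing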
6. Log in to pig and try the common commands:
[hadoop@host63 ~]$ pig
15/10/03 06:01:01 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/03 06:01:01 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/03 06:01:01 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-10-03 06:01:01,412 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-03 06:01:01,413 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1443877261408.log
2015-10-03 06:01:01,502 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2015-10-03 06:01:03,657 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-03 06:01:03,657 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-03 06:01:03,662 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://host61:9000/
2015-10-03 06:01:05,968 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-03 06:01:05,968 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: host61:9001
2015-10-03 06:01:05,979 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> help
Commands:
<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig
File system commands:
fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html
Diagnostic commands:
describe <alias>[::<alias] - Show the schema for the alias. Inner aliases can be described as A::B.
explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<param_value>]
[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.
-script - Explain the entire script.
-out - Store the output into directory rather than print to stdout.
-brief - Don't expand nested plans (presenting a smaller graph for overview).
-dot - Generate the output in .dot format. Default is text format.
-xml - Generate the output in .xml format. Default is text format.
-param <param_name - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
alias - Alias to explain.
dump <alias> - Compute the alias and writes the results to stdout.
Utility Commands:
exec [-param <param_name>=param_value] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment including aliases.
-param <param_name - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
run [-param <param_name>=param_value] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment.
-param <param_name - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
sh <shell command> - Invoke a shell command.
kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
The following keys are supported:
default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.
debug - Set debug on or off. Default is off.
job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal
stream.skippath - String that contains the path. This is used by streaming.
any hadoop property.
help - Display this message.
history [-n] - Display the list statements in cache.
-n Hide line numbers.
quit - Quit the grunt shell.
grunt> help sh
(the same help text as above is printed again, followed by a parse error:)
2015-10-03 06:02:22,264 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <EOL> "\n "" at line 2, column 8.
Was expecting one of:
"cat" ...
"clear" ...
"cd" ...
"cp" ...
"copyFromLocal" ...
"copyToLocal" ...
"dump" ...
"\d" ...
"describe" ...
"\de" ...
"aliases" ...
"explain" ...
"\e" ...
"help" ...
"history" ...
"kill" ...
"ls" ...
"mv" ...
"mkdir" ...
"pwd" ...
"quit" ...
"\q" ...
"register" ...
"using" ...
"as" ...
"rm" ...
"set" ...
"illustrate" ...
"\i" ...
"run" ...
"exec" ...
"scriptDone" ...
<IDENTIFIER> ...
<PATH> ...
<QUOTEDSTRING> ...
Details at logfile: /home/hadoop/pig_1443877261408.log
View the current directory:
grunt> ls
hdfs://host61:9000/user/hadoop/.Trash <dir>
View the root directory:
grunt> ls /
hdfs://host61:9000/in <dir>
hdfs://host61:9000/out <dir>
hdfs://host61:9000/user <dir>
Change directory:
grunt> cd /
List the current directory:
grunt> ls
hdfs://host61:9000/in <dir>
hdfs://host61:9000/out <dir>
hdfs://host61:9000/user <dir>
grunt> cd /in
grunt> ls
hdfs://host61:9000/in/jdk-8u60-linux-x64.tar.gz<r 3> 181238643
hdfs://host61:9000/in/mytest1.txt<r 3> 23
hdfs://host61:9000/in/mytest2.txt<r 3> 24
hdfs://host61:9000/in/mytest3.txt<r 3> 4
View a file's contents:
grunt> cat mytest1.txt
this is the first file
Copy a file from HDFS to the local operating system:
grunt> copyToLocal /in/mytest5.txt /home/hadoop/mytest.txt
[hadoop@host63 ~]$ ls -l mytest.txt
-rw-r--r--. 1 hadoop hadoop 102 Oct 3 06:23 mytest.txt
Prefixing an operating system command with sh lets you run it from inside grunt:
grunt> sh ls -l /home/hadoop/mytest.txt
-rw-r--r--. 1 hadoop hadoop 102 Oct 3 06:23 /home/hadoop/mytest.txt
7. Pig's data model:
bag: a table (an unordered collection of tuples)
tuple: a row / record
field: a column / attribute
Pig does not require the tuples in a bag to have the same number of fields or the same field types.
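One way to see this flexibility is to load a file without declaring a schema: each input line becomes a tuple with however many fields it happens to contain. A minimal sketch, reusing the test file created in section 9 below (the DESCRIBE output is what Pig prints when no schema is known):
grunt> raw = LOAD '/tmp/mytest.txt' USING PigStorage('#'); -- no AS clause, so no declared schema
grunt> DESCRIBE raw;
Schema for raw unknown.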
8. Common Pig Latin statements (a combined sketch follows this list):
LOAD: specifies how to load data;
FOREACH: scans row by row and applies some processing;
FILTER: filters rows;
DUMP: prints the result to the screen;
STORE: saves the result to a file;
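Taken together, these five statements form a complete script. The sketch below mirrors the walkthrough in section 9 (alias names and the output path are illustrative):
records = LOAD '/tmp/mytest.txt' USING PigStorage('#') AS (filename:chararray, size:int);
big = FILTER records BY size > 4096;       -- keep rows with size above 4096
names = FOREACH big GENERATE filename;     -- project just the filename field
DUMP names;                                -- print the result to the screen
STORE names INTO '/tmp/mytest_big';        -- or persist it to an output directory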
9. A data-processing example:
Generate a test file:
[hadoop@host63 tmp]$ ls -l / | awk '{if(NR != 1)print $NF"#"$5}' > /tmp/mytest.txt
[hadoop@host63 tmp]$ cat /tmp/mytest.txt
bin#4096
boot#1024
dev#3680
etc#12288
home#4096
lib#4096
lib64#12288
lost+found#16384
media#4096
mnt#4096
opt#4096
proc#0
root#4096
sbin#12288
selinux#0
srv#4096
sys#0
tmp#4096
usr#4096
var#4096
Load the file:
grunt> records = LOAD '/tmp/mytest.txt' USING PigStorage('#') AS (filename:chararray,size:int);
(In MapReduce mode this path is resolved on HDFS, so the file must first be uploaded there, e.g. with copyFromLocal.)
2015-10-03 07:35:48,479 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-03 07:35:48,480 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-03 07:35:48,497 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2015-10-03 07:35:48,716 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-03 07:35:48,723 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2015-10-03 07:35:48,723 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
Display the records:
grunt> DUMP records;
(bin,4096)
(boot,1024)
(dev,3680)
(etc,12288)
(home,4096)
(lib,4096)
(lib64,12288)
(lost+found,16384)
(media,4096)
(mnt,4096)
(opt,4096)
(proc,0)
(root,4096)
(sbin,12288)
(selinux,0)
(srv,4096)
(sys,0)
(tmp,4096)
(usr,4096)
(var,4096)
Show the structure of records:
grunt> DESCRIBE records;
records: {filename: chararray,size: int}
Filter the records:
grunt> filter_records = FILTER records BY size > 4096;
grunt> DUMP filter_records;
(etc,12288)
(lib64,12288)
(lost+found,16384)
(sbin,12288)
grunt> DESCRIBE filter_records;
filter_records: {filename: chararray,size: int}
Group the records:
grunt> group_records = GROUP records BY size;
grunt> DUMP group_records;
(0,{(sys,0),(proc,0),(selinux,0)})
(1024,{(boot,1024)})
(3680,{(dev,3680)})
(4096,{(var,4096),(usr,4096),(tmp,4096),(srv,4096),(root,4096),(opt,4096),(mnt,4096),(media,4096),(lib,4096),(home,4096),(bin,4096)})
(12288,{(etc,12288),(lib64,12288),(sbin,12288)})
(16384,{(lost+found,16384)})
grunt> DESCRIBE group_records;
group_records: {group: int,records: {(filename: chararray,size: int)}}
Format (flatten the groups back into plain tuples):
grunt> format_records = FOREACH group_records GENERATE group, FLATTEN(records);
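FLATTEN un-nests the inner bag, so format_records holds one plain tuple per original record, e.g. (0,sys,0) and (0,proc,0) for the size-0 group shown above. DESCRIBE should report a schema along these lines (a sketch, not captured from the original session):
grunt> DESCRIBE format_records;
format_records: {group: int,records::filename: chararray,records::size: int}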
Remove duplicates:
grunt> dis_records = DISTINCT records;
Sort:
grunt> ord_records = ORDER dis_records BY size DESC;
Take the first 3 rows:
grunt> top_records = LIMIT ord_records 3;
Find the maximum per group:
grunt> max_records = FOREACH group_records GENERATE group, MAX(records.size);
grunt> DUMP max_records;
(0,0)
(1024,1024)
(3680,3680)
(4096,4096)
(12288,12288)
(16384,16384)
View the execution plan:
grunt> EXPLAIN max_records;
Save the result sets:
grunt> STORE group_records INTO '/tmp/mytest_group';
grunt> STORE filter_records INTO '/tmp/mytest_filter';
grunt> STORE max_records INTO '/tmp/mytest_max';
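STORE writes a directory of part files rather than a single file. The results can be inspected from grunt with fs commands (part-r-00000 below is the usual reducer output name and may differ per job):
grunt> fs -ls /tmp/mytest_max
grunt> fs -cat /tmp/mytest_max/part-r-00000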
10. UDFs
Pig supports writing UDFs (user-defined functions) in Java, Python, and JavaScript; a minimal Java sketch follows.
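As a sketch (the class name UpperUdf and the jar path are hypothetical, not from the original article): extend org.apache.pig.EvalFunc, implement exec, package the class into a jar, and register it from grunt.
// UpperUdf.java - hypothetical UDF that upper-cases a chararray argument
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperUdf extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // return null for missing input instead of failing the whole job
        if (input == null || input.size() == 0 || input.get(0) == null) return null;
        return ((String) input.get(0)).toUpperCase();
    }
}
Used from grunt (assuming the class was packed into /home/hadoop/myudfs.jar):
grunt> REGISTER /home/hadoop/myudfs.jar;
grunt> upper_names = FOREACH records GENERATE UpperUdf(filename);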