Installing and Using Pig

1. Download the software:

wget http://apache.fayea.com/pig/pig-0.15.0/pig-0.15.0.tar.gz

2. Extract and install:

tar -zxvf pig-0.15.0.tar.gz

mv pig-0.15.0 /usr/local/

ln -s /usr/local/pig-0.15.0 /usr/local/pig

3. Configure environment variables:

export PATH=$HOME/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/pig/bin:$PATH;

export PIG_CLASSPATH=/usr/local/hadoop/etc/hadoop;
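Equivalently, the variables can be appended to ~/.bashrc so they persist across logins. A sketch (PIG_HOME is introduced here only for readability; all paths follow the /usr/local layout used above, so adjust them to your installation):

```shell
# Environment for Pig; paths follow the /usr/local layout used above.
export PIG_HOME=/usr/local/pig
export PATH=$HOME/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:$PIG_HOME/bin:$PATH
# Point Pig at the Hadoop configuration so MapReduce mode can find the cluster.
export PIG_CLASSPATH=/usr/local/hadoop/etc/hadoop
```

Run `source ~/.bashrc` (or log in again) for the settings to take effect.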

4. Enter the grunt shell:

Start Pig in local mode: all files and all execution stay on the local machine; this mode is generally used for testing:

[hadoop@host61 ~]$ pig -x local

15/10/03 01:14:09 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/10/03 01:14:09 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType

2015-10-03 01:14:09,756 [main] INFO  org.apache.pig.Main – Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-03 01:14:09,758 [main] INFO  org.apache.pig.Main – Logging error messages to: /home/hadoop/pig_1443860049744.log

2015-10-03 01:14:10,133 [main] INFO  org.apache.pig.impl.util.Utils – Default bootup file /home/hadoop/.pigbootup not found

2015-10-03 01:14:12,648 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 01:14:12,656 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 01:14:12,685 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: file:///

2015-10-03 01:14:13,573 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum

grunt> 

Start in MapReduce mode (the normal working mode):

[hadoop@host63 ~]$ pig

15/10/03 02:11:54 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/10/03 02:11:54 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

15/10/03 02:11:54 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-03 02:11:55,086 [main] INFO  org.apache.pig.Main – Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-03 02:11:55,087 [main] INFO  org.apache.pig.Main – Logging error messages to: /home/hadoop/pig_1443863515062.log

2015-10-03 02:11:55,271 [main] INFO  org.apache.pig.impl.util.Utils – Default bootup file /home/hadoop/.pigbootup not found

2015-10-03 02:11:59,735 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 02:11:59,740 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 02:11:59,742 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://host61:9000/

2015-10-03 02:12:06,256 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 02:12:06,257 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to map-reduce job tracker at: host61:9001

2015-10-03 02:12:06,265 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS

grunt> 

5. Pig can be run in three ways:

1. Script

2. Grunt (interactive shell)

3. Embedded (from a host language)
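For script mode, the Pig Latin statements go into a .pig file that is submitted in one shot; a minimal sketch (the file name and its contents are illustrative, not from the session below):

```shell
# Write a small Pig Latin script, then submit it with the pig launcher.
cat > /tmp/demo.pig <<'EOF'
records = LOAD '/tmp/mytest.txt' USING PigStorage('#') AS (filename:chararray,size:int);
big = FILTER records BY size > 4096;
DUMP big;
EOF
# Local mode:      pig -x local /tmp/demo.pig
# MapReduce mode:  pig /tmp/demo.pig
```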

6. Start Pig and try the common commands:

[hadoop@host63 ~]$ pig

15/10/03 06:01:01 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/10/03 06:01:01 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

15/10/03 06:01:01 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-03 06:01:01,412 [main] INFO  org.apache.pig.Main – Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-03 06:01:01,413 [main] INFO  org.apache.pig.Main – Logging error messages to: /home/hadoop/pig_1443877261408.log

2015-10-03 06:01:01,502 [main] INFO  org.apache.pig.impl.util.Utils – Default bootup file /home/hadoop/.pigbootup not found

2015-10-03 06:01:03,657 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 06:01:03,657 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 06:01:03,662 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://host61:9000/

2015-10-03 06:01:05,968 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 06:01:05,968 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to map-reduce job tracker at: host61:9001

2015-10-03 06:01:05,979 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS

grunt> help

Commands:

<pig latin statement>; – See the PigLatin manual for details: http://hadoop.apache.org/pig

File system commands:

fs <fs arguments> – Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic commands:

describe <alias>[::<alias>] – Show the schema for the alias. Inner aliases can be described as A::B.

explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<param_value>]

[-param_file <file_name>] [<alias>] – Show the execution plan to compute the alias or for entire script.

-script – Explain the entire script.

-out – Store the output into directory rather than print to stdout.

-brief – Don't expand nested plans (presenting a smaller graph for overview).

-dot – Generate the output in .dot format. Default is text format.

-xml – Generate the output in .xml format. Default is text format.

-param <param_name>=<param_value> – See parameter substitution for details.

-param_file <file_name> – See parameter substitution for details.

alias – Alias to explain.

dump <alias> – Compute the alias and writes the results to stdout.

Utility Commands:

exec [-param <param_name>=<param_value>] [-param_file <file_name>] <script> –

Execute the script with access to grunt environment including aliases.

-param <param_name>=<param_value> – See parameter substitution for details.

-param_file <file_name> – See parameter substitution for details.

script – Script to be executed.

run [-param <param_name>=<param_value>] [-param_file <file_name>] <script> –

Execute the script with access to grunt environment.

-param <param_name>=<param_value> – See parameter substitution for details.

-param_file <file_name> – See parameter substitution for details.

script – Script to be executed.

sh <shell command> – Invoke a shell command.

kill <job_id> – Kill the hadoop job specified by the hadoop job id.

set <key> <value> – Provide execution parameters to Pig. Keys and values are case sensitive.

The following keys are supported:

default_parallel – Script-level reduce parallelism. Basic input size heuristics used by default.

debug – Set debug on or off. Default is off.

job.name – Single-quoted name for jobs. Default is PigLatin:<script name>

job.priority – Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal

stream.skippath – String that contains the path. This is used by streaming.

any hadoop property.

help – Display this message.

history [-n] – Display the list statements in cache.

-n Hide line numbers.

quit – Quit the grunt shell.

grunt> help sh

(the same help text as shown above is printed again, followed by a parse error:)

2015-10-03 06:02:22,264 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 1000: Error during parsing. Encountered " <EOL> "\n "" at line 2, column 8.

Was expecting one of:

"cat" ...

"clear" ...

"cd" ...

"cp" ...

"copyFromLocal" ...

"copyToLocal" ...

"dump" ...

"\d" ...

"describe" ...

"\de" ...

"aliases" ...

"explain" ...

"\e" ...

"help" ...

"history" ...

"kill" ...

"ls" ...

"mv" ...

"mkdir" ...

"pwd" ...

"quit" ...

"\q" ...

"register" ...

"using" ...

"as" ...

"rm" ...

"set" ...

"illustrate" ...

"\i" ...

"run" ...

"exec" ...

"scriptDone" ...

<IDENTIFIER> ...

<PATH> ...

<QUOTEDSTRING> ...

Details at logfile: /home/hadoop/pig_1443877261408.log

List the current directory:

grunt> ls

hdfs://host61:9000/user/hadoop/.Trash <dir>

List the root directory:

grunt> ls /

hdfs://host61:9000/in <dir>

hdfs://host61:9000/out <dir>

hdfs://host61:9000/user <dir>

Change directory:

grunt> cd /

Show the current directory:

grunt> ls

hdfs://host61:9000/in <dir>

hdfs://host61:9000/out <dir>

hdfs://host61:9000/user <dir>

grunt> cd /in

grunt> ls

hdfs://host61:9000/in/jdk-8u60-linux-x64.tar.gz<r 3> 181238643

hdfs://host61:9000/in/mytest1.txt<r 3> 23

hdfs://host61:9000/in/mytest2.txt<r 3> 24

hdfs://host61:9000/in/mytest3.txt<r 3> 4

View a file's contents:

grunt> cat mytest1.txt

this is the first file

Copy a file from HDFS to the local operating system:

grunt> copyToLocal /in/mytest5.txt /home/hadoop/mytest.txt

[hadoop@host63 ~]$ ls -l mytest.txt

-rw-r--r--. 1 hadoop hadoop 102 Oct  3 06:23 mytest.txt

Prefixing an operating-system command with sh runs it from inside the grunt shell:

grunt> sh ls -l /home/hadoop/mytest.txt

-rw-r--r--. 1 hadoop hadoop 102 Oct  3 06:23 /home/hadoop/mytest.txt

7. Pig's data model:

bag: a table

tuple: a row (a record)

field: a column (an attribute)

Pig does not require the tuples within a bag to have the same number of fields, nor fields of the same types.
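That flexibility shows up when data is loaded without a declared schema; a sketch (the path is illustrative):

```pig
-- With no AS clause, fields have no declared names or types and are
-- referenced positionally; rows with different field counts still load.
A = LOAD '/tmp/ragged.txt' USING PigStorage(',');
B = FOREACH A GENERATE $0;   -- first field of every tuple, whatever its arity
```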

8. Common Pig Latin statements:

LOAD: specifies how to load data;

FOREACH: scans row by row and applies some processing;

FILTER: filters rows;

DUMP: prints the result to the screen;

STORE: saves the result to a file.
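Chained together, the five statements form a typical pipeline. A sketch (paths are illustrative; the same '#'-delimited data file is built in the next section):

```pig
raw     = LOAD '/tmp/mytest.txt' USING PigStorage('#') AS (filename:chararray,size:int);
nonzero = FILTER raw BY size > 0;
names   = FOREACH nonzero GENERATE filename;
DUMP names;                      -- print to the screen
STORE names INTO '/tmp/names';   -- or persist to a directory
```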

9. A data-processing example:

Generate a test file:

[hadoop@host63 tmp]$ ls -l / | awk '{if(NR != 1) print $NF"#"$5}' > /tmp/mytest.txt

[hadoop@host63 tmp]$ cat /tmp/mytest.txt

bin#4096

boot#1024

dev#3680

etc#12288

home#4096

lib#4096

lib64#12288

lost+found#16384

media#4096

mnt#4096

opt#4096

proc#0

root#4096

sbin#12288

selinux#0

srv#4096

sys#0

tmp#4096

usr#4096

var#4096

Load the file:

grunt> records = LOAD '/tmp/mytest.txt' USING PigStorage('#') AS (filename:chararray,size:int);

2015-10-03 07:35:48,479 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 07:35:48,480 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 07:35:48,497 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum

2015-10-03 07:35:48,716 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 07:35:48,723 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum

2015-10-03 07:35:48,723 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS

Display the data:

grunt> DUMP records;

(bin,4096)

(boot,1024)

(dev,3680)

(etc,12288)

(home,4096)

(lib,4096)

(lib64,12288)

(lost+found,16384)

(media,4096)

(mnt,4096)

(opt,4096)

(proc,0)

(root,4096)

(sbin,12288)

(selinux,0)

(srv,4096)

(sys,0)

(tmp,4096)

(usr,4096)

(var,4096)

Show the structure of records:

grunt> DESCRIBE records;

records: {filename: chararray,size: int}

Filter records:

grunt> filter_records = FILTER records BY size > 4096;

grunt> DUMP filter_records;

(etc,12288)

(lib64,12288)

(lost+found,16384)

(sbin,12288)

grunt> DESCRIBE filter_records;

filter_records: {filename: chararray,size: int}

Group records:

grunt> group_records =GROUP records BY size;

grunt> DUMP group_records;

(0,{(sys,0),(proc,0),(selinux,0)})

(1024,{(boot,1024)})

(3680,{(dev,3680)})

(4096,{(var,4096),(usr,4096),(tmp,4096),(srv,4096),(root,4096),(opt,4096),(mnt,4096),(media,4096),(lib,4096),(home,4096),(bin,4096)})

(12288,{(etc,12288),(lib64,12288),(sbin,12288)})

(16384,{(lost+found,16384)})

grunt> DESCRIBE group_records;

group_records: {group: int,records: {(filename: chararray,size: int)}}

Format (FLATTEN unnests each bag, yielding one flat row per original tuple):

grunt> format_records = FOREACH group_records GENERATE group, FLATTEN(records);

Remove duplicates:

grunt> dis_records =DISTINCT records;

Sort:

grunt> ord_records =ORDER dis_records BY size desc;

Take the first three rows:

grunt> top_records=LIMIT ord_records 3;

Find the maximum per group:

grunt> max_records =FOREACH group_records GENERATE group,MAX(records.size);

grunt> DUMP max_records;

(0,0)

(1024,1024)

(3680,3680)

(4096,4096)

(12288,12288)

(16384,16384)
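Because the grouping key here is size itself, MAX(records.size) can only ever equal the group key, which is why each output pair repeats its value. Grouping by a derived key shows the pattern better; a hypothetical variant using the built-in SUBSTRING function:

```pig
-- Group by the first letter of the filename, then take the largest size per letter.
by_letter     = GROUP records BY SUBSTRING(filename, 0, 1);
max_by_letter = FOREACH by_letter GENERATE group, MAX(records.size);
DUMP max_by_letter;
```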

View the execution plan:

grunt> EXPLAIN max_records;

Save result sets:

grunt> STORE group_records INTO '/tmp/mytest_group';

grunt> STORE filter_records INTO '/tmp/mytest_filter';

grunt> STORE max_records INTO '/tmp/mytest_max';

10. UDFs

Pig supports writing UDFs (user-defined functions) in Java, Python, and JavaScript.
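A Java UDF is a class that extends org.apache.pig.EvalFunc<T>; once compiled into a jar it is wired in from the grunt shell with REGISTER. A sketch (the jar path and function name below are hypothetical):

```pig
-- Register the jar containing the UDF, then call it like a built-in.
REGISTER /home/hadoop/myudfs.jar;
upper_names = FOREACH records GENERATE myudfs.UPPER(filename);
```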
