Hive Example
此实例主要学习Hive的基本操作
准备数据
- 下载数据
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
- 解压数据
unzip ml-100k.zip
- 数据说明(具体请查看解压目录中README)
u.data -- The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The time stamps are unix seconds since 1/1/1970 UTC u.info -- The number of users, items, and ratings in the u data set. u.item -- Information about the items (movies); this is a tab separated list of movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once. The movie ids are the ones used in the u.data data set. u.genre -- A list of the genres. u.user -- Demographic information about the users; this is a tab separated list of user id | age | gender | occupation | zip code The user ids are the ones used in the u.data data set. u.occupation -- A list of the occupations.
创建表
接下来使用上述数据中的其中三个数据做接下来的练习,创建文件 hiveExample.sql,hiveExample.sql内容如下:
CREATE TABLE IF NOT EXISTS u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '/t'
STORED AS TEXTFILE;
CREATE TABLE IF NOT EXISTS u_user(
userid INT,
age INT,
gender STRING,
occupation STRING ,
zipcode INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
CREATE TABLE IF NOT EXISTS u_item(
movieid INT,
movietitle STRING,
releasedate STRING,
videoreleasedate STRING,
IMDbURL STRING,
unknown INT,
Action INT,
Adventure INT,
Animation INT,
Childrens INT,
Comedy INT,
Crime INT,
Documentary INT,
Drama INT,
Fantasy INT,
FilmNoir INT,
Horror INT,
Musical INT,
Mystery INT,
Romance INT,
SciFi INT,
Thriller INT,
War INT,
Western INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
执行如下命令,创建表:
hive -f hiveExample.sql
导入数据:
创建文件 importData.sql,importData.sql内容如下:
LOAD DATA LOCAL INPATH '/home/dev/storeFile/ml-100k/u.data' OVERWRITE INTO TABLE u_data;
LOAD DATA LOCAL INPATH '/home/dev/storeFile/ml-100k/u.user' OVERWRITE INTO TABLE u_user;
LOAD DATA LOCAL INPATH '/home/dev/storeFile/ml-100k/u.item' OVERWRITE INTO TABLE u_item;
其中/home/dev/storeFile/ml-100k/u.data为本地文件路径。
执行如下命令,导入数据:
hive -f importData.sql
导入成功后打印如下信息:
Loading data to table default.u_data
OK
Time taken: 3.852 seconds
Loading data to table default.u_user
OK
Time taken: 0.563 seconds
Loading data to table default.u_item
OK
Time taken: 0.706 seconds
HiveQL查询实例
select / order by /limit
将 u_user表中用户按照age降序排列,并查询出前5位用户信息:
hive> select * from u_user u order by u.age desc limit 5 ;
结果:
481 73 M retired 37771
860 70 F retired 48322
767 70 M engineer 0
803 70 M administrator 78212
585 69 M librarian 98501
group by
- group by 按照一个或者多个列对结果进行分组。
- count / distinct 一同使用 DISTINCT 和 COUNT 关键词,来计算非重复结果的数目
统计u_user表中不同性别的人数。
select u.gender, count(distinct u.userid) from u_user u group by u.gender;
结果:
OK
F 273
M 670
Time taken: 19.786 seconds, Fetched: 2 row(s)
参考资料
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
原创文章,作者:ItWorker,如若转载,请注明出处:https://blog.ytso.com/9573.html