Hive Example详解大数据

Hive Example

此实例主要学习Hive的基本操作

准备数据

  • 下载数据

    wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

  • 解压数据

    unzip ml-100k.zip

  • 数据说明(具体请查看解压目录中README)
u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.               Each user has rated at least 20 movies.  Users and items are               numbered consecutively from 1.  The data is randomly               ordered. This is a tab separated list of                  user id | item id | rating | timestamp.               The time stamps are unix seconds since 1/1/1970 UTC  u.info     -- The number of users, items, and ratings in the u data set.  u.item     -- Information about the items (movies); this is a tab separated               list of               movie id | movie title | release date | video release date |               IMDb URL | unknown | Action | Adventure | Animation |               Children's | Comedy | Crime | Documentary | Drama | Fantasy |               Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |               Thriller | War | Western |               The last 19 fields are the genres, a 1 indicates the movie               is of that genre, a 0 indicates it is not; movies can be in               several genres at once.               The movie ids are the ones used in the u.data data set.  u.genre    -- A list of the genres.  u.user     -- Demographic information about the users; this is a tab               separated list of               user id | age | gender | occupation | zip code               The user ids are the ones used in the u.data data set.  u.occupation -- A list of the occupations. 

创建表

接下来使用上述数据中的其中三个数据做接下来的练习,创建文件 hiveExample.sql,hiveExample.sql内容如下:

CREATE TABLE IF NOT EXISTS u_data ( 
        userid INT, 
        movieid INT, 
        rating INT, 
        unixtime STRING) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '/t' 
STORED AS TEXTFILE; 
 
CREATE TABLE IF NOT EXISTS u_user( 
        userid INT, 
        age  INT, 
        gender STRING, 
        occupation STRING , 
        zipcode INT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '|' 
STORED AS TEXTFILE; 
 
CREATE TABLE IF NOT EXISTS u_item( 
        movieid INT, 
        movietitle STRING, 
        releasedate  STRING, 
        videoreleasedate STRING, 
        IMDbURL STRING, 
        unknown  INT, 
        Action INT, 
        Adventure  INT, 
        Animation INT, 
        Childrens INT, 
        Comedy INT, 
        Crime INT, 
        Documentary INT, 
        Drama INT, 
        Fantasy INT, 
        FilmNoir INT, 
        Horror INT, 
        Musical INT, 
        Mystery INT, 
        Romance INT, 
        SciFi INT, 
        Thriller INT, 
        War INT, 
        Western INT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '|' 
STORED AS TEXTFILE;

执行如下命令,创建表:
hive -f hiveExample.sql

导入数据:

创建文件 importData.sql,importData.sql内容如下:

LOAD DATA LOCAL INPATH '/home/dev/storeFile/ml-100k/u.data' OVERWRITE INTO TABLE u_data; 
 
LOAD DATA LOCAL INPATH '/home/dev/storeFile/ml-100k/u.user' OVERWRITE INTO TABLE u_user; 
 
LOAD DATA LOCAL INPATH '/home/dev/storeFile/ml-100k/u.item' OVERWRITE INTO TABLE u_item;

其中/home/dev/storeFile/ml-100k/u.data为本地文件路径。
执行如下命令,导入数据:
hive -f importData.sql
导入成功后打印如下信息:

Loading data to table default.u_data 
OK 
Time taken: 3.852 seconds 
Loading data to table default.u_user 
OK 
Time taken: 0.563 seconds 
Loading data to table default.u_item 
OK 
Time taken: 0.706 seconds

HiveQL查询实例

select / order by /limit

将 u_user表中用户按照age降序排列,并查询出前5位用户信息:

hive> select * from u_user u order by u.age desc limit 5 ;

结果:

481 73  M   retired 37771 
860 70  F   retired 48322 
767 70  M   engineer    0 
803 70  M   administrator   78212 
585 69  M   librarian   98501

group by

  • group by 按照一个或者多个列对结果进行分组。
  • count / distinct 一同使用 DISTINCT 和 COUNT 关键词,来计算非重复结果的数目
    统计u_user表中不同性别的人数。
select u.gender, count(distinct u.userid) from u_user u group by u.gender;

结果:

OK 
F   273 
M   670 
Time taken: 19.786 seconds, Fetched: 2 row(s)

参考资料

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

原创文章,作者:ItWorker,如若转载,请注明出处:https://blog.ytso.com/9573.html

(0)
上一篇 2021年7月19日
下一篇 2021年7月19日

相关推荐

发表回复

登录后才能评论