1.Hive 分区partition
必须在表定义时指定对应的partition字段
a、单分区建表语句:
create table day_table (id int, content string) partitioned by (dt string);
单分区表,按天分区,在表结构中存在id,content,dt三列。
以dt为文件夹区分
b、 双分区建表语句:
create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
双分区表,按天和小时分区,在表结构中新增加了dt和hour两列。
先以dt为文件夹,再以hour子文件夹区分
2.创建2个表psn2 psn3
create table psn2 (id int,name string,hobby array,address map )partitioned by (age int)ROW FORMAT DELIMITEDFIELDS TERMINATED BY ','COLLECTION ITEMS TERMINATED BY '-'MAP KEYS TERMINATED BY ':'LINES TERMINATED BY '\n';
3.插入数据
load data local inpath '/root/data' into table psn2 partition (age=10);
4.查询psn2
hive> select * from psn2;OK1 小明1 ["lol","book","movie"] {"beijing":"changping","shanghai":"pudong"} 102 小明2 ["lol","book","movie"] {"beijing":"changping","shanghai":"pudong"} 103 小明3 ["lol","book","movie"] {"beijing":"changping","shanghai":"pudong"} 104 小明4 ["lol","book","movie"] {"beijing":"changping","shanghai":"pudong"} 105 小明5 ["lol","movie"] {"beijing":"changping","shanghai":"pudong"} 106 小明6 ["lol","book","movie"] {"beijing":"changping","shanghai":"pudong"} 107 小明7 ["lol","book"] {"beijing":"changping","shanghai":"pudong"} 108 小明8 ["lol","book"] {"beijing":"changping","shanghai":"pudong"} 109 小明9 ["lol","book","movie"] {"beijing":"changping","shanghai":"pudong"} 10Time taken: 7.93 seconds, Fetched: 9 row(s)
理论上分区可以无限分,但是实际需要根据需求来分区。
如:历史数据按天分区
5.错误实例psn3
create table psn3 (id int,name string,age int,hobby array,address map )partitioned by (age int)ROW FORMAT DELIMITEDFIELDS TERMINATED BY ','COLLECTION ITEMS TERMINATED BY '-'MAP KEYS TERMINATED BY ':'LINES TERMINATED BY '\n';
报错提示:FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
原因:分区字段不能再表的列中
6.同时创建两个分区
create table psn3 (id int,name string,hobby array,address map )partitioned by (age int,sex string)ROW FORMAT DELIMITEDFIELDS TERMINATED BY ','COLLECTION ITEMS TERMINATED BY '-'MAP KEYS TERMINATED BY ':'LINES TERMINATED BY '\n';
注意:添加两个字段相应的插入数据时要指定两个字段,一个字段会报错。
hive> create table psn3 ( > id int, > name string, > hobby array, > address map > ) > partitioned by (age int,sex string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > COLLECTION ITEMS TERMINATED BY '-' > MAP KEYS TERMINATED BY ':' > LINES TERMINATED BY '\n';OKTime taken: 1.167 seconds
7.向双分区加载数据
hive> load data local inpath '/root/data' into table psn3 partition (age=10);FAILED: SemanticException [Error 10006]: Line 1:63 Partition not found '10'hive> load data local inpath '/root/data' into table psn3 partition (age=10,sex='boy');Loading data to table default.psn3 partition (age=10, sex=boy)OKTime taken: 3.115 secondshive>
8.删除分区
alter table psn2 drop partition (sex='boy');hive> alter table psn3 drop partition (sex='boy');Dropped the partition age=10/sex=boyOKTime taken: 0.195 seconds
9.结论:
添加分区的时候,必须在现有分区的基础之上
删除分区的时候,会将所有存在的分区都删除
10.添加分区
添加时必须指定age=10 还是 age=20的分区删除,不然会报错
hive> alter table psn3 add partition(sex='man');FAILED: ValidationFailureSemanticException default.psn3: partition spec {sex=man} doesn't contain all (2) partition columnshive> alter table psn3 add partition(age=10,sex='man');OKTime taken: 0.418 seconds
删除前age=10 和age=20下分别有boy和man两个目录
11.执行删除
hive> alter table psn3 drop partition(sex='man');Dropped the partition age=10/sex=manOKTime taken: 0.389 seconds
load 加载数据的过程其实是在上传文件 partition 是对应hdfs的目录
12.通过一个表的查询结果的数据插入到另一个表中
create table psn4 (id int,name string,hobby array)ROW FORMAT DELIMITEDFIELDS TERMINATED BY ','COLLECTION ITEMS TERMINATED BY '-'LINES TERMINATED BY '\n';
这种插入数据会转换成mapreduce任务
from psn3 insert overwrite table psn4 select id,name,hobby;
13.这种操作的作用:
1.复制表
2.可以作为中间表存在