
Partitioning and Bucketing

news  Source: original  2024/9/25 17:16:57

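Partitions

Case sensitivity of partition columns:

In Hive, partition column names are case-insensitive, but partition values are case-sensitive. We can test this: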

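Load the data

-- the column name is written as DAy, yet the data lands in the day partition column
load data local inpath '/home/hivedata/user1.txt' into table part4 partition(year='2018',month='03',DAy='21');
-- the value 'AA' is kept exactly as written
load data local inpath '/home/hivedata/user3.txt' into table part4 partition(year='2018',month='03',day='AA');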
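The rule can be pictured with a small Python sketch. It is only a model of the behaviour described here, not Hive's code, and the function name normalize_partition_spec is made up for the illustration: column names are folded to lower case, values are left alone.

```python
def normalize_partition_spec(spec):
    """Lower-case the partition column names; leave the values untouched."""
    return {name.lower(): value for name, value in spec.items()}


# The two partition specs from the load statements above.
first = normalize_partition_spec({"year": "2018", "month": "03", "DAy": "21"})
second = normalize_partition_spec({"year": "2018", "month": "03", "day": "AA"})

print(first)                   # the key DAy has become day
print(second["day"] == "aa")   # False: 'AA' and 'aa' are different values
```

So DAy and day name the same column, while day='AA' and day='aa' would be two different partitions.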

List the partitions of a table

show partitions tableName;

Adding and dropping partitions

Adding partitions:

When adding several partitions in one statement, the partition clauses are separated only by whitespace, with no comma!

-- a single partition
alter table part3 add partition(year='2023',month='05',day='02');
-- several partitions at once: no comma between the partition clauses
alter table part3 add partition(year='2023',month='05',day='03') partition(year='2023',month='05',day='04');
-- a partition that points at existing data
alter table part3 add partition(year='2023',month='05',day='05') location '/user/hive/warehouse/yhdb.db/part1/dt=2023-08-25';
-- several partitions, each pointing at existing data
alter table part3 add partition(year='2020',month='05',day='06') location '/user/hive/warehouse/yhdb.db/part1/dt=2023-08-25' partition(year='2020',month='05',day='07') location '/user/hive/warehouse/yhdb.db/part1/dt=2023-08-25';

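Dropping partitions:

When dropping several partitions in one statement, the partition clauses are separated by commas!

-- drop a single partition
alter table part3 drop partition(year='2023',month='05',day='05');
-- drop several partitions: comma between the partition clauses
alter table part3 drop partition(year='2023',month='05',day='02'),partition(year='2023',month='05',day='03');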
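The difference in separators is easy to get wrong when statements are generated by a script. Below is a minimal Python sketch of such a generator. The helper names are invented for this example, and it does no quoting or escaping of values, so it is only suitable for trusted input.

```python
def partition_clause(spec):
    """Render {'year': '2023', ...} as partition(year='2023',...)."""
    pairs = ",".join(f"{col}='{val}'" for col, val in spec.items())
    return f"partition({pairs})"


def add_partitions_sql(table, specs):
    # ADD: the partition clauses are separated by a space, no comma.
    clauses = " ".join(partition_clause(s) for s in specs)
    return f"alter table {table} add {clauses};"


def drop_partitions_sql(table, specs):
    # DROP: the partition clauses are separated by a comma.
    clauses = ",".join(partition_clause(s) for s in specs)
    return f"alter table {table} drop {clauses};"


specs = [
    {"year": "2023", "month": "05", "day": "03"},
    {"year": "2023", "month": "05", "day": "04"},
]
print(add_partitions_sql("part3", specs))
print(drop_partitions_sql("part3", specs))
```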

Viewing the table structure

desc formatted part3;

Compare the output of these three:

desc part4;

desc formatted part4;

desc extended part4;

Three ways to associate data with a partition [key point]

(1) Method 1: upload the data, then repair

create table if not exists part5(
id int,
name string,
age int
)
partitioned by (year string,month string,day string)
row format delimited
fields terminated by ',';

Create the directory on HDFS:
hive (yhdb)> dfs -mkdir -p /user/hive/warehouse/yhdb.db/part5/year=2023/month=08/day=28;
Upload the data:
hive (yhdb)> dfs -put /home/hivedata/user1.txt /user/hive/warehouse/yhdb.db/part5/year=2023/month=08/day=28;

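Querying the table at this point returns no rows, because the metadata for the new partition has not been recorded in the metastore (MySQL in this setup). Repair it: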

msck repair table part5;

The repair log shows that the repair operation simply adds the partition to the metadata of table part5.

Run select * from part5 again and the data is now there.
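What the repair does can be modelled in a few lines of Python. This is a simplified sketch rather than Hive's implementation: the metastore is stood in for by a plain set, the file listing by a list of paths, and both function names are made up.

```python
def partitions_on_disk(table_dir, paths):
    """Collect partition specs such as 'year=2023/month=08/day=28' from file paths."""
    prefix = table_dir.rstrip("/") + "/"
    found = set()
    for path in paths:
        if not path.startswith(prefix):
            continue
        parts = path[len(prefix):].split("/")
        levels = []
        for part in parts[:-1]:      # the last element is the file name
            if "=" not in part:
                break
            levels.append(part)
        if levels:
            found.add("/".join(levels))
    return found


def repair(registered, table_dir, paths):
    """Register every partition that exists on disk but not in the metadata."""
    missing = partitions_on_disk(table_dir, paths) - registered
    registered |= missing            # updates the caller's set in place
    return sorted(missing)


metastore = set()                    # no partitions registered yet
files = ["/user/hive/warehouse/yhdb.db/part5/year=2023/month=08/day=28/user1.txt"]
print(repair(metastore, "/user/hive/warehouse/yhdb.db/part5", files))
print(repair(metastore, "/user/hive/warehouse/yhdb.db/part5", files))
```

The first call reports the partition it added; the second finds nothing left to repair.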

(2) Method 2: upload the data, then add the partition

Create the directory on HDFS:
hive (yhdb)> dfs -mkdir -p /user/hive/warehouse/yhdb.db/part5/year=2023/month=08/day=27;
Upload the data:
hive (yhdb)> dfs -put /home/hivedata/user1.txt /user/hive/warehouse/yhdb.db/part5/year=2023/month=08/day=27;
Add the partition:
alter table part5 add partition(year='2023',month='08',day='27');

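Does adding a partition create a directory? Yes! Does creating a partitioned table create any partition directories? No! So you can also add the partition first and then upload data into its directory.

alter table part5 add partition(year='2023',month='08',day='26');
Once the partition is added, its directory exists: /user/hive/warehouse/yhdb.db/part5/year=2023/month=08/day=26
Upload the data into this directory:
dfs -put /home/hivedata/user1.txt /user/hive/warehouse/yhdb.db/part5/year=2023/month=08/day=26;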
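The directory name follows directly from the partition spec. As a rough Python sketch (the function name partition_dir is hypothetical, and it assumes the default layout with no explicit location and no special characters in the values):

```python
def partition_dir(table_dir, partition_columns, spec):
    """Build the default directory of a partition: one col=value level per partition column."""
    levels = [f"{col}={spec[col]}" for col in partition_columns]
    return "/".join([table_dir.rstrip("/")] + levels)


path = partition_dir(
    "/user/hive/warehouse/yhdb.db/part5",
    ["year", "month", "day"],                      # order of the partitioned by clause
    {"year": "2023", "month": "08", "day": "26"},
)
print(path)
```

The levels follow the order of the partitioned by clause, not the order in which the spec happens to be written.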

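(3) Method 3: load the data into the partition

-- load the data (load is what we use most often to bring data in)
load data local inpath '/home/hivedata/user1.txt' into table part5 partition(year='2023',month='08',day='25');
With this method there is no need to create the directory beforehand: load creates the partition and its directory automatically if they do not exist.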

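Types of partitioning:

Static partitioning: the partition is created (named explicitly) first, and the data is then loaded into it

Dynamic partitioning: the data is loaded directly, and partitions are created dynamically from the values in the data

Mixed partitioning: some partition columns are static and others are dynamic.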
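The three kinds differ only in which partition columns are given a fixed value. A small Python sketch makes this concrete. The representation is an assumption made for the example: a dict in which None marks a column whose value comes from the data.

```python
def classify_partition_spec(spec):
    """spec maps partition column -> fixed value, or None if the value comes from the data."""
    dynamic = [col for col, value in spec.items() if value is None]
    if not dynamic:
        return "static"
    if len(dynamic) == len(spec):
        return "dynamic"
    return "mixed"


print(classify_partition_spec({"year": "2023", "month": "08", "day": "25"}))
print(classify_partition_spec({"type": None, "time": None}))
print(classify_partition_spec({"year": "2023", "month": None, "day": None}))
```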

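Dynamic partitioning:

Partitions are generated dynamically from the results of a query.

load cannot be used here; first load the data into an ordinary table, then query that table and insert the result into the partitioned table.

How to use dynamic partitioning: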

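(1) Enable dynamic partitioning (default: true, i.e. enabled)
set hive.exec.dynamic.partition=true;
(2) Switch to non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition column to be static; nonstrict allows every partition column to be dynamic)
set hive.exec.dynamic.partition.mode=nonstrict;
(3) The maximum total number of dynamic partitions that may be created across all nodes running the MR job; the default is 1000
set hive.exec.max.dynamic.partitions=1000;
(4) The maximum number of dynamic partitions that may be created on each node running the MR job. Set this according to the actual data. For example, if the source data covers one year, so that the day column has 365 distinct values, the parameter must be set to at least 365; with the default of 100 the job fails.
set hive.exec.max.dynamic.partitions.pernode=100;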
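Before running a large insert it can help to estimate how many partitions the data will produce. The Python sketch below does this for made-up rows (one order per day of 2023). The function names are invented, and it models the worst case in which a single node sees every partition value.

```python
from datetime import date, timedelta


def count_dynamic_partitions(rows, partition_indexes):
    """Count the distinct partition-value combinations in the rows."""
    return len({tuple(row[i] for i in partition_indexes) for row in rows})


def check_limit(rows, partition_indexes, limit):
    """Return (needed, ok): partitions needed, and whether they fit under the limit."""
    needed = count_dynamic_partitions(rows, partition_indexes)
    return needed, needed <= limit


# Made-up data: one order per day of 2023, with the day as the last column.
rows = [(f"order{i}", str(date(2023, 1, 1) + timedelta(days=i))) for i in range(365)]

print(check_limit(rows, [1], 100))   # the default per-node limit is too small
print(check_limit(rows, [1], 365))   # a limit of 365 is enough
```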

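Create an ordinary table and load the data into it:
-- create the table
create table order_partition
(
order_no string,
type string,
order_time string
)
row format delimited fields terminated by '\t';
-- load the data
load data local inpath "/home/hivedata/dongtai.txt" into table order_partition;

Next, create the dynamically partitioned table as required:
create table order_dynamic_partition
(
order_no string
)
partitioned by(type String, `time` String)
row format delimited fields terminated by '\t';

Import the data. The effect is that partitions are created dynamically from the values of the type and time columns. Do not use load here; query the ordinary table and insert the result into the dynamically partitioned table:
insert overwrite table order_dynamic_partition partition (type, `time`)
select order_no, type, order_time from order_partition;

hive (yhdb)> show partitions order_dynamic_partition;
OK
partition
type=china/time=2014-05-01
type=china/time=2014-05-02
type=usa/time=2014-05-01
Time taken: 0.254 seconds, Fetched: 3 row(s)

Question: can order_no, type, order_time be replaced with *?
insert overwrite table order_dynamic_partition partition (type, `time`)
select * from order_partition;
This runs without error, but it is not recommended, because dynamic partitioning follows a rule: the dynamic partition values must be the last columns of the query result. The earlier insert with the explicit column list order_no, type, order_time makes that order visible.
Dynamic partitioning here depends on the values of two columns. Those two must be the last two columns of the select, and they must line up in order with the partition columns. In other words, however many columns the select has, the last two must correspond to (type, `time`), otherwise the data ends up in the wrong partitions!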
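The positional rule can be checked with a short Python sketch. It is a simplified model, not Hive's code. The sample rows are invented to match the three partitions shown in the listing, since the contents of dongtai.txt are not given, and route_rows is a made-up name.

```python
from collections import defaultdict


def route_rows(rows, partition_columns):
    """Send each row to a partition named after its LAST len(partition_columns) values."""
    n = len(partition_columns)
    partitions = defaultdict(list)
    for row in rows:
        data, values = row[:-n], row[-n:]
        name = "/".join(f"{col}={val}" for col, val in zip(partition_columns, values))
        partitions[name].append(data)
    return dict(partitions)


# Invented rows in the order (order_no, type, order_time).
rows = [
    ("10001", "china", "2014-05-01"),
    ("10002", "china", "2014-05-02"),
    ("10003", "usa", "2014-05-01"),
    ("10004", "china", "2014-05-01"),
]
result = route_rows(rows, ["type", "time"])
for name in sorted(result):
    print(name, result[name])

# Swapping the last two columns silently produces wrongly named partitions.
print(sorted(route_rows([("10001", "2014-05-01", "china")], ["type", "time"])))
```

The matching is by position only, which is why a select whose last columns are in the wrong order still runs but writes to partitions such as type=2014-05-01/time=china.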

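Bucketing

1. Why bucketing

Partitioning can leave some partitions holding a great deal of data and others very little. Bucketing is another technique for breaking a data set into parts (data files).

Partitioning and bucketing are both ways of managing data at a finer granularity. When the data in a single partition or table keeps growing and partitioning cannot divide it finely enough, bucketing is used to divide and manage the data at a finer granularity.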

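Bucketing is usually applied on top of partitioning, although a table can also be bucketed without being partitioned (the stu_bucket table created later in this article is not partitioned)

Both partitioning and bucketing are part of Hive optimization

Bucketing has less effect than partitioning

It improves query efficiency

The underlying principle is the same as partitioning in MR: HashPartitioner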

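2. How bucketing works

The principle is the same as that of HashPartitioner in MapReduce.
MapReduce: the hash of the key is taken modulo the number of reducers.
Hive: the hash of the bucketing column is taken modulo the number of buckets. Bucketing is done on one column: for each record, the hash of the bucketing column's value modulo the number of buckets decides which bucket the record goes into.

MapReduce example: the key is a word and there are 3 reducers, so 3 output files are produced.
hello --> hash the key --> take the hash modulo 3 (0, 1 or 2)
If no partitioner is specified in MapReduce, is the data still partitioned? Yes: the default partitioner, HashPartitioner, is used.

Hive example: if the bucketing column is id and the number of buckets is 3, the bucket is hash(id) % 3, i.e. 0, 1 or 2.
A bucket is a file; a partition is a directory.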
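The hash-then-modulo step can be sketched in Python. Two assumptions are made and should be read as such: the hash of an integer is taken to be the integer itself (as with Java's Integer.hashCode), and the word is hashed with a Java String.hashCode style function as a stand-in. Hadoop's Text type and newer Hive versions may use different hash functions, so actual bucket numbers can differ.

```python
INT_MAX = 0x7FFFFFFF


def java_string_hash(text):
    """Java String.hashCode for simple (BMP) text, as a signed 32-bit integer."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF      # wrap like 32-bit arithmetic
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_for(hash_code, num_buckets):
    """Clear the sign bit, then take the remainder, as HashPartitioner does."""
    return (hash_code & INT_MAX) % num_buckets


# MapReduce-style: which of 3 reducers would receive the key "hello"?
print(bucket_for(java_string_hash("hello"), 3))

# Hive-style: which of 3 buckets each id falls into, assuming hash(id) == id.
print({i: bucket_for(i, 3) for i in range(1, 7)})
```

Clearing the sign bit before the modulo keeps the result non-negative even when the hash code is negative.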

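3. Benefits of bucketing

Purpose of partitioning: faster queries.
Purpose of bucketing: split the data of each partition into smaller files, which makes sampling queries possible (picking some of the data out of a large set for analysis). It can also speed up multi-table joins (covered later under Hive optimization).

Bucketing subdivides the data inside a partition; the result is files, not directories

Bucketed tables support sampling queries (for example by percentage)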

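Loading data into a bucketed table (the standard way)

Create an ordinary staging table and load the data into it

Then insert from that table into the bucketed table (with insert ... select, not load)

cluster by (buckets and sorts the rows; the bucketing column and the sort column are necessarily the same)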

1) Create the table

-- create the bucketed table:
create table stu_bucket(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by ' ';

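2) Set the number of reducers:

To write the table as 4 buckets, the Hive parameter mapreduce.job.reduces must be set to >= 4 or to -1.
Running set mapreduce.job.reduces; without a value shows the current setting:
hive (yhdb)> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
hive (yhdb)> set mapreduce.job.reduces=-1;
reduces = -1 lets the system decide the number of reducers itself.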

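3) Load the data

Recommendation: do not load directly into the bucketed table with load! Instead, create an ordinary table, load the data into it, and insert from it into the bucketed table. Loading directly is not recommended, but you can try it to see what happens:
load data local inpath '/home/hivedata/student.txt' into table stu_bucket;

Next, the standard approach:

-- create an ordinary (non-bucketed) table
create table temp_stu
(
id int,
name string
)
row format delimited fields terminated by ' ';
load data local inpath '/home/hivedata/student.txt' into table temp_stu;
-- insert the data into the bucketed table
insert into stu_bucket select * from temp_stu cluster by (id);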

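Querying buckets

The rows returned were not quite what I expected, and I do not know why, but the syntax itself is correct.
Syntax: tablesample(bucket x out of y on sno), where sno stands for the bucketing column (id for stu_bucket)
Note: x must not be greater than y, and on is followed by the bucketing column.
select * from stu_bucket;
select * from stu_bucket tablesample(bucket 1 out of 1);
-- the first bucket
select * from stu_bucket tablesample(bucket 1 out of 4 on id);
-- the first and third buckets
select * from stu_bucket tablesample(bucket 1 out of 2 on id);
-- the second and fourth buckets
select * from stu_bucket tablesample(bucket 2 out of 2 on id);
-- the first part when taking the hash modulo 8:
-- even though the table is defined with only 4 buckets, it can still be sampled as 8 (the data is split into 8 parts and the first is taken)
select * from stu_bucket tablesample(bucket 1 out of 8 on id);

-- sampling queries
select * from cq01 tablesample(bucket 1 out of 4 on id);
select * from cq01 tablesample(bucket 1 out of 4 on id) where name = 'xx';
select * from cq01 tablesample(3 rows); -- number of rows
select * from cq01 tablesample(30 percent); -- percentage of the data
select * from cq01 tablesample(6B); -- B K M G T P, an explicit size with unit (K, KB, MB, GB ...)
select * from cq01 tablesample(bucket 1 out of 4 on rand()); -- random sampling with rand()
select * from cq01 order by rand() limit 3; -- 3 random rows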
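Which rows a bucket x out of y sample selects can be modelled in Python. The sketch assumes, as a simplification, that the hash of an integer id is the id itself; a Hive installation that hashes differently will place rows differently, so real output need not match. The function names and the ids 1 to 16 are made up for the example.

```python
INT_MAX = 0x7FFFFFFF


def bucket_index(value, num_buckets):
    """0-based bucket of an integer value, assuming hash(value) == value."""
    return (value & INT_MAX) % num_buckets


def tablesample(ids, x, y):
    """Rows selected by tablesample(bucket x out of y on id)."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    return [i for i in ids if bucket_index(i, y) == x - 1]


def buckets_hit(ids, x, y, table_buckets=4):
    """1-based buckets of a table_buckets-bucket table that the sample draws from."""
    return sorted({bucket_index(i, table_buckets) + 1 for i in tablesample(ids, x, y)})


ids = list(range(1, 17))             # made-up ids 1..16
print(buckets_hit(ids, 1, 4))        # bucket 1 out of 4
print(buckets_hit(ids, 1, 2))        # bucket 1 out of 2
print(buckets_hit(ids, 2, 2))        # bucket 2 out of 2
print(tablesample(ids, 1, 8))        # bucket 1 out of 8: part of one bucket
```

Under this model, y = 2 on a 4-bucket table selects two whole buckets, and y = 8 selects half of one bucket, which matches the comments on the queries above.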