Hive（统计分析）

Hive类型篇

Hive查询篇

报错1：Error: Error while compiling statement: FAILED: ParseException line 2:0 missing EOF at 'select' near ')' (state=42000,code=40000)

解决办法：把单引号换成双引号

报错2：0: jdbc:hive2://localhost:10000 (closed)>

解决办法：start-all.sh (hdfs没起来)，或者mysql连接不上服务器（连接本地（修改hive-site.xml文件localhost换成本地IP地址）或者使用docker），netstat -ntlp | grep 3306，查看3306端口占用情况。

1.内部连接：

select 
a.name as aname,
a.num as anum,
b.name as bname,
b.nick as bnick
from t_a a
inner join t_b b
on a.name=b.name
复制代码

2.注意：多个字段必须用逗号隔开，一般语句不用逗号。

一、为什么使用Hive

直接使用hadoop所面临的问题人员学习成本太高项目周期要求太短 MapReduce实现复杂查询逻辑开发难度太大
为什么要使用Hive 操作接口采用类SQL语法，提供快速开发的能力。避免了去写MapReduce，减少开发人员的学习成本。功能扩展很方便。、

二、Hive的特点

1.可扩展 Hive可以自由的扩展集群的规模，一般情况下不需要重启服务。

2.延展性 Hive支持用户自定义函数，用户可以根据自己的需求来实现自己的函数。

3.容错良好的容错性，节点出现问题SQL仍可完成执行

三、hive安装

1.最简安装：用内嵌derby作为元数据库

准备工作：安装hive的机器上应该有HADOOP环境（安装目录，HADOOP_HOME环境变量）

安装：直接解压一个hive安装包即可

此时，安装的这个hive实例使用其内嵌的derby数据库作为记录元数据的数据库，此模式不便于让团队成员之间共享协作

2.标准安装：将mysql作为元数据库。

3.mysql安装
1.上传mysql安装包
2.解压：
[root@mylove ~]# tar -xvf MySQL-5.6.26-1.linux_glibc2.5.x86_64.rpm-bundle.tar 

3.安装mysql的server包
[root@mylove ~]# rpm -ivh MySQL-server-5.6.26-1.linux_glibc2.5.x86_64.rpm 

依赖报错：
缺perl

yum install perl

安装完perl后 ，继续重新安装mysql-server

（可以配置一个本地yum源进行安装：

1、先在vmware中给这台虚拟机连接一个光盘镜像

2、挂在光驱到一个指定目录：mount -t iso9660 -o loop /dev/cdrom /mnt/cdrom

3、将yum的配置文件中baseURL指向/mnt/cdrom

[root@mylove ~]# rpm -ivh MySQL-server-5.6.26-1.linux_glibc2.5.x86_64.rpm 

又出错：包冲突conflict with

移除老版本的冲突包：mysql-libs-5.1.73-3.el6_5.x86_64

[root@mylove ~]# rpm -e mysql-libs-5.1.73-3.el6_5.x86_64 --nodeps

继续重新安装mysql-server

[root@mylove ~]# rpm -ivh MySQL-server-5.6.26-1.linux_glibc2.5.x86_64.rpm 

成功后，注意提示：里面有初始密码及如何改密码的信息

初始密码：/root/.mysql_secret  

改密码脚本：/usr/bin/mysql_secure_installation

4.安装mysql的客户端包：

[root@mylove ~]# rpm -ivh MySQL-client-5.6.26-1.linux_glibc2.5.x86_64.rpm

5.启动mysql的服务端：

[root@mylove ~]# service mysql start

Starting MySQL. SUCCESS!


6.修改root的初始密码：

[root@mylove ~]# /usr/bin/mysql_secure_installation  按提示

7.测试：

用mysql命令行客户端登陆mysql服务器看能否成功

[root@mylove ~]# mysql -uroot -proot

mysql> show databases;

8.给root用户授予从任何机器上登陆mysql服务器的权限：

mysql> grant all privileges on *.* to 'root'@'%' identified by '你的密码' with grant option;
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

注意点：要让mysql可以远程登录访问
最直接测试方法：从windows上用Navicat去连接，能连，则可以，不能连，则要去mysql的机器上用命令行客户端进行授权：
在mysql的机器上,启动命令行客户端： 

mysql -uroot -proot
mysql>grant all privileges on *.* to 'root'@'%' identified by 'root的密码' with grant option;
mysql>flush privileges;


3.2.2.hive的元数据库配置

vi conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
<description>password to use against metastore database</description>
</property>
</configuration>

2、上传一个mysql的驱动jar包到hive的安装目录的lib中

3、配置HADOOP_HOME 和HIVE_HOME到系统环境变量中：/etc/profile
4、source /etc/profile
5、hive启动测试
然后用命令启动hive交互界面：   
[root@hdp20-04 ~]# hive

复制代码

4.hive使用方式
4.1.最基本使用方式
启动一个hive交互shell
bin/hive
hive>

设置一些基本参数，让hive使用起来更便捷，比如：
1、让提示符显示当前库：
hive>set hive.cli.print.current.db=true;
2、显示查询结果时显示字段名称：
hive>set hive.cli.print.header=true; 

但是这样设置只对当前会话有效，重启hive会话后就失效，解决办法：
在linux的当前用户目录中，编辑一个.hiverc文件，将参数写入其中：
vi .hiverc
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;


4.2.启动hive服务使用
启动hive的服务：
[root@hdp20-04 hive-1.2.1]# bin/hiveserver2 -hiveconf hive.root.logger=DEBUG,console

上述启动，会将这个服务启动在前台，如果要启动在后台，则命令如下：
nohup bin/hiveserver2 1>/dev/null 2>&1 &


启动成功后，可以在别的节点上用beeline去连接
方式（1）
[root@hdp20-04 hive-1.2.1]# bin/beeline  回车，进入beeline的命令界面
输入命令连接hiveserver2
beeline> !connect jdbc:hive2//mini1:10000
（hadoop01是hiveserver2所启动的那台主机名，端口默认是10000）
方式（2）
启动时直接连接：
bin/beeline -u jdbc:hive2://mini1:10000 -n root

接下来就可以做正常sql查询了


4.3.脚本化运行
大量的hive查询任务，如果用交互式shell来进行输入的话，显然效率及其低下，因此，生产中更多的是使用脚本化运行机制：
该机制的核心点是：hive可以用一次性命令的方式来执行给定的hql语句

[root@hdp20-04 ~]#  hive -e "insert into table t_dest select * from t_src;"

然后，进一步，可以将上述命令写入shell脚本中，以便于脚本化运行	hive任务，并控制、调度众多hive任务，示例如下：
vi t_order_etl.sh
#!/bin/bash
hive -e "select * from db_order.t_order"
hive -e "select * from default.t_user"
hql="create table  default.t_bash as select * from db_order.t_order"
hive -e "$hql"

如果要执行的hql语句特别复杂，那么，可以把hql语句写入一个文件：
vi x.hql
select * from db_order.t_order;
select count(1) from db_order.t_user;

然后，用hive -f /root/x.hql 来执行


5.hive建库建表与数据导入
5.1.建库
hive中有一个默认的库：
库名： default
库目录：hdfs://hdp20-01:9000/user/hive/warehouse

新建库：
create database db_order;
库建好后，在hdfs中会生成一个库目录：
hdfs://hdp20-01:9000/user/hive/warehouse/db_order.db

5.2.建表
5.2.1.基本建表语句
use db_order;
create table t_order(id string,create_time string,amount float,uid string);
表建好后，会在所属的库目录中生成一个表目录
/user/hive/warehouse/db_order.db/t_order
只是，这样建表的话，hive会认为表数据文件中的字段分隔符为 ^A

正确的建表语句为：
create table t_order(id string,create_time string,amount float,uid string)
row format delimited
fields terminated by ',';

这样就指定了，我们的表数据文件中的字段分隔符为 ","
5.2.2.删除表
drop table t_order;
删除表的效果是：
hive会从元数据库中清除关于这个表的信息；
hive还会从hdfs中删除这个表的表目录；

5.2.3.内部表与外部表
内部表(MANAGED_TABLE)：表目录按照hive的规范来部署，位于hive的仓库目录/user/hive/warehouse中

外部表(EXTERNAL_TABLE)：表目录由建表用户自己指定
create external table t_access(ip string,url string,access_time string)
row format delimited
fields terminated by ','location '/access/log';
导入数据：在Linux下opt目录创建一个数据文件，然后用load data local inpath'路径'into table 表名；

创建外部表：
create external table t_logdata(cid_sn string,os_ver string,resolution string,commit_time string,sdk_ver string,device_id_type string,device_model string,promotion_channel string,app_ver_name string,app_ver_code string,pid string,net_type string,device_id string,app_device_id string,release_channel string,country string,time_zone string,os_name string,manufacture string,commit_id string,app_token string,app_id string,language string,User_id string)
partitioned by(day string)
row format delimited
fields terminated by ','
location '/log-data-clean-data/';（HDFS的目录）

向外部表分区导入数据：
 load data local(本地) inpath '/opt/log/data/*' into table t_logdata partition(day='2017-08-15');
  load data inpath'/log-data-clean/android/*' into table t_logdata partition(day='2017-08-15');
 select * from t_logdata where day='’2017-08-15';

外部表和内部表的特性差别：
1、内部表的目录在hive的仓库目录中 VS 外部表的目录由用户指定
2、drop一个内部表时：hive会清除相关元数据，并删除表数据目录
3、drop一个外部表时：hive只会清除相关元数据；

一个hive的数据仓库，最底层的表，一定是来自于外部系统，为了不影响外部系统的工作逻辑，在hive中可建external表来映射这些外部系统产生的数据目录；
然后，后续的etl操作，产生的各种表建议用managed_table


5.2.4.分区表
分区表的实质是：在表目录中为数据文件创建分区子目录，以便于在查询时，MR程序可以针对分区子目录中的数据进行处理，缩减读取数据的范围。

比如，网站每天产生的浏览记录，浏览记录应该建一个表来存放，但是，有时候，我们可能只需要对某一天的浏览记录进行分析
这时，就可以将这个表建为分区表，每天的数据导入其中的一个分区；
当然，每日的分区目录，应该有一个目录名（分区字段）

一个分区字段的实例：
示例如下：
1、创建带分区的表
create table t_access(ip string,url string,access_time string)
partitioned by(dt string)
row format delimited
fields terminated by ',';
注意：分区字段不能是表定义中的已存在字段

show partitions t_logdata;
添加分区：
alter table t_logdata add partition (day='2017-08-16');
删除分区：
alter table t_logdata drop partition (day='2017-08-15');

2、向分区中导入数据
load data local inpath '/root/access.log.2017-08-04.log' into table t_access partition(dt='20170804');
load data local inpath '/root/access.log.2017-08-05.log' into table t_access partition(dt='20170805');

3、针对分区数据进行查询
 select count(*) as number  from fengze where dt="20170806";（as后面加别名）
a、统计8月4号的总PV：
select count(*) from t_access where dt='20170804';
实质：就是将分区字段当成表字段来用，就可以使用where子句指定分区了

b、统计表中所有数据总的PV：
select count(*) from t_access;
实质：不指定分区条件即可

多个分区字段示例
建表：
create table t_partition(id int,name string,age int)
partitioned by(department string,sex string,howold int)
row format delimited fields terminated by ',';

导数据：
load data local inpath '/root/p1.dat' into table t_partition partition(department='xiangsheng',sex='male',howold=20);

5.2.5.CTAS建表语法
可以通过已存在表来建表：
1、create table t_user_2 like t_user;
新建的t_user_2表结构定义与源表t_user一致，但是没有数据

2、在建表的同时插入数据
create table t_access_user 
as
select ip,url from t_access;
t_access_user会根据select查询的字段来建表，同时将查询的结果插入新表中
！！！此语句是倒着分析！！！1

5.2.4.桶(bucket)
把表组织成桶有两个理由：
1.第一个理由是获得更高的查询处理效率。桶为表加上了额外的结构。Hive在处理有些查询时能够利用这个结构。连接两个在（包含连接列的）相同列上划分了桶的表，可以使用map端连接（map-side join）高效地实现。
2.把表划分成桶的 第二个理由是使“取样”或者说“采样”更高效。
使用clustered by 子句来指定划分桶所用的列 和要划分的桶的个数：
eg：create table bucketed_users(id int，name string)
    clustered by(id) into 4 buckets;
有个没有划分桶的用户表，要向分桶后的表中填充成员，需要将hive.enforce.bucketing属性设置为true。然后使用insert：
eg：insert overwrite table bucketed_users
    select * from users；
用tablesample子句对表进行取样：
eg：> select * from bucketed_users
    > tablesample(bucket 1 out of 4 on id);


5.3.数据导入导出
5.3.1.将数据文件导入hive的表
方式1：导入数据的一种方式：
手动用hdfs命令，将文件放入表目录；
Hadoop fs -put w.text /input
方式2：在hive的交互式shell中用hive命令来导入本地数据到表目录
hive>load data local inpath '/root/order.data.2' into table t_order;

方式3：用hive命令导入hdfs中的数据文件到表目录
hive>load data inpath '/access.log.2017-08-06.log' into table t_access partition(dt='20170806');

注意：导本地文件和导HDFS文件的区别：
本地文件导入表：复制
hdfs文件导入表：移动

5.3.2.将hive表中的数据导出到指定路径的文件
1、将hive表中的数据导入HDFS的文件
insert overwrite directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access;


2、将hive表中的数据导入本地磁盘文件
insert overwrite local directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access limit 100000;


5.3.3.hive文件格式

HIVE支持很多种文件格式： SEQUENCE FILE | TEXT FILE | PARQUET FILE | RC FILE

create table t_pq(movie string,rate int)  stored as textfile;
create table t_pq(movie string,rate int)  stored as sequencefile;
create table t_pq(movie string,rate int)  stored as parquetfile;


演示：
1、先建一个存储文本文件的表
create table t_access_text(ip string,url string,access_time string)
row format delimited fields terminated by ','
stored as textfile;

导入文本数据到表中：
load data local inpath '/root/access-data/000000_0' into table t_access_text;

2、建一个存储sequence file文件的表：
create table t_access_seq(ip string,url string,access_time string)
stored as sequencefile;

从文本表中查询数据插入sequencefile表中，生成数据文件就是sequencefile格式的了：

insert into t_access_seq
select * from t_access_text;

3、建一个存储parquet file文件的表：

create table t_access_parq(ip string,url string,access_time string)
stored as parquetfile;
复制代码

select * from t_students distribute by sex sort by age;（以男女年龄排序查询）

方法：
1、运行机制和原理
2、动手：安装配置
3、写程序

Hive: 基于HDFS之上的数据仓库
=======================================
一、什么是数据仓库?
	1、其实就是一个数据库（Oracle、MySQL等等）
	2、比较大，数据多
	3、一般select

二、什么是Hive？（蜂巢）
	1、是基于HDFS之上的数据仓库
		Hive        HDFS
		表  ---->   目录
		数据 --->   文件
		分区 --->   目录
		
	2、是一个翻译器:  SQL ---> Hive ---> MapReduce
	    举例：  select ---> hive  ---> MapReduce程序

三、Hive的体系结构和安装配置
	1、Hive的体系结构
	2、Hive的元信息：表名、列名、列的类型 *****
	3、安装配置（简单一点）
	（*）安装模式：嵌入模式、本地模式、远程模式
	（*）嵌入模式：
		（1）不需要MySQL
		（2）自带Derby数据存储元信息
		（3）特点：只支持一个连接
		
	（*）本地模式、远程模式
		（1）MySQL数据库
		（2）生产：远程模式
		
	核心的配置文件：/root/training/apache-hive-2.3.0-bin/conf/hive-site.xml
	<?xml version="1.0" encoding="UTF-8" standalone="no"?>
	<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
	<configuration>
	<property>
	  <name>javax.jdo.option.ConnectionURL</name>
	  <value>jdbc:mysql://localhost:3306/hive?useSSL=false</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionDriverName</name>
	  <value>com.mysql.jdbc.Driver</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionUserName</name>
	  <value>hiveowner</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionPassword</name>
	  <value>Welcome_1</value>
	</property>
	</configuration>

	注意：
	Hive-on-MR： 过期了
	Hive-on-Spark: 不成熟，还在开发阶段
	Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.	
	
四、Hive的数据模型：表（重要）
	注意：Hive中，数据默认的分隔符是 tab键

	1、内部表：类似MySQL中表
	hive> create table newemp
		> (empno int,
		>  ename string,
		>  job string,
		>  mgr int,
		>  hiredate string,
		>  sal int,
		>  comm int,
		>  deptno int
		> )row format delimited fields terminated by ',';
	
	导入数据: load 
	本地数据：load data local inpath '/root/temp/emp.csv' into table newemp;
	HDFS：    load data inpath '/HDFS的目录/emp.csv' into table newemp;
	
	2、分区表:
	hive> create table emp_part
		> (empno int,
		>  ename string,
		>  job string,
		>  mgr int,
		>  hiredate string,
		>  sal int,
		>  comm int
		> )partitioned by (deptno int)
		> row format delimited fields terminated by ',';	
	
	3、桶表
	4、外部表
	
	5、视图:
	（*）视图是一个虚表
	（*）简化复杂的查询
		hive> create view newview10
			> as
			> select * from newemp where deptno=10;
		hive> select * from newview10;
		7782    CLARK   MANAGER 7839    1981/6/9        2450    NULL    10
		7839    KING    PRESIDENT       NULL    1981/11/17      5000    NULL    10
		7934    MILLER  CLERK   7782    1982/1/23       1300    NULL    10

五、使用JDBC操作Hive
	1、JDBC：操作RDBMS的标准接口
复制代码

1.从本地文件系统中导入数据到hive表

（1）数据准备（/home/sopdm/test.dat）：

1,wyp,25,13188888888

2,test,30,13899999999

3,zs,34,89931412

（2）首先创建表

use sopdm;

drop table if exists sopdm.wyp;

create table if not exists sopdm.wyp(id int,name string,age int,tel string)

row format delimited

fields terminated by ','

stored as textfile;

（3）从本地文件系统中导入数据到Hive表

load data local inpath ‘/home/sopdm/test.dat’ into table sopdm.wyp;

（4）可以到wyp表的数据目录下查看，如下命令

dfs -ls /user/sopdm/hive/warehouse/sopdm.db/wyp;

2.从HDFS上导入数据到Hive表

（1）现在hdfs中创建一个input目录存放HDFS文件

hadoop fs -mkdir input; 或 hadoop fs -mkdir /user/sopdm/input;

（2）把本地文件上传到HDFS中,并重命名为test_hdfs.dat

hadoop fs -put /home/sopdm/test.dat /user/sopdm/input/test_hdfs.dat;

（3）查看文件

dfs -cat /user/sopdm/input/test_hdfs.dat;

（4）将内容导入hive表中

--拷贝“本地数据”到“hive”使用：load data local…

--转移“HDFS”到“hive”（必须同一个集群）使用：load data…

load data inpath ‘/user/sopdm/input/test_hdfs.dat’ into table sopdm.wyp;

3.从别的Hive表中导入数据到Hive表中

create table if not exists sopdm.wyp2(id int,name string,tel string)

row format delimited

fields terminated by ','

stored as textfile;

--overwrite是覆盖，into是追加

insert into table sopdm.wyp2

select id,name,tel from sopdm.wyp;

--多表插入

--高效方式-查询语句插入多个分区

from sopdm.wyp w

insert overwrite table sopdm.wyp2

select w.id,w.name,w.tel where w.age=25
复制代码