Testing Hive's Index Feature

Author: Syn良子 | Published: 2017/1/12 10:02:35 | Tags: hive, functional testing

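　　According to the official Hive wiki, Hive 0.7 and later support building an index on a table. I wanted to see whether it brings a significant performance gain, so I consulted some references, worked through it by hand, and recorded the process and my takeaways here.
　　I. Preparing the test data
　　1. Create a script named gen-data.sh with the following content: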
#!/bin/bash
# Generate about 1.7G of raw data: 5,000,000 rows of "<id><TAB><text>".
text="A decade ago, many were predicting that Cooke, a New York City prodigy, would become a basketball shoe pitchman and would flaunt his wares and skills at All-Star weekends like the recent aerial show in Orlando, Fla. There was a time, however fleeting, when he was more heralded, or perhaps merely hyped, than any other high school player in America."
i=0
while [ "$i" -ne 5000000 ]
do
# printf writes a real tab between id and text, matching the table's field delimiter
printf '%s\t%s\n' "$i" "$text"
i=$((i+1))
done
　　2. Generate the file
　　Run the script: sh gen-data.sh > dual.txt. Generation takes roughly a few minutes.
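　　Before loading the file into Hive, it can be worth confirming that it has the expected shape. The following is a minimal sketch, assuming every row should hold exactly one tab between the id and the text. check_rows is a hypothetical helper name, and the three-row sample file merely stands in for dual.txt.

```shell
#!/bin/sh
# check_rows FILE EXPECTED_ROWS  (hypothetical helper, not part of Hive)
# Succeeds only if FILE has EXPECTED_ROWS lines and every line
# consists of exactly two tab-separated fields.
check_rows() {
    awk -F '\t' -v want="$2" '
        NF != 2 { bad++ }
        END { exit !(NR == want && bad == 0) }
    ' "$1"
}

# Small stand-in for dual.txt, written the same way as gen-data.sh.
sample=$(mktemp)
i=0
while [ "$i" -ne 3 ]; do
    printf '%s\t%s\n' "$i" "some text" >> "$sample"
    i=$((i+1))
done

if check_rows "$sample" 3; then
    echo "sample looks fine"
else
    echo "sample is malformed"
fi
```

　　For the real file the call would be check_rows dual.txt 5000000.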
　　II. Creating the Hive table and index
　　1. Create the table. Its layout must match the generated data: id and name are separated by a tab character.
　　create table table01(id int, name string) row format delimited fields terminated by '\t';
　　2. Load the data into the table
　　load data local inpath '~/testData/hive/dataScripts/dual.txt' overwrite into table table01; (Time taken: 160.787 seconds)
　　3. Create table02, populated from table01
　　create table table02 as select id, name as text from table01; (Time taken: 154.463 seconds)
　　4. Query test
　　select * from table02 where id=500000; (Time taken: 30.463 seconds, Fetched: 1 row(s))
　　At this point, dfs -ls /user/hive/warehouse/ shows that data directories have been created for both table01 and table02.
　　5. Use Hive's CompactIndexHandler to create an index on the id column
　　create index table02_index on table table02(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;
　　alter index table02_index on table02 rebuild; (Time taken: 112.451 seconds)
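　　Note that the alter index statement above is required: because the index was created with deferred rebuild, it starts out empty, and alter index ... rebuild is what actually builds the index structure.
　　6. The index table now exists. Inspect its contents: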
　　hive> select * from default__table02_table02_index__ limit 3;
　　OK
　　9    hdfs://littleNameservice/user/hive/warehouse/table02/000000_0    [3168]
　　36    hdfs://littleNameservice/user/hive/warehouse/table02/000000_0    [12698]
　　63    hdfs://littleNameservice/user/hive/warehouse/table02/000000_0    [22229]
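　　The index table has three columns: the distinct values of the indexed column, the data file in which each value occurs, and the offsets within that file. This reduces the amount of data a query has to read (the offsets tell the reader where to start looking, so only the relevant blocks need to be located), which lowers resource consumption.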
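　　To make the three-column layout concrete, the sketch below builds a toy version of such an index for a small tab-separated file: one line per distinct value, holding the value, the file name, and the byte offsets of the rows that carry that value. It only illustrates the idea and is not how Hive implements it. build_index is a hypothetical name, and the code assumes single-byte characters so that awk's length() equals the byte count.

```shell
#!/bin/sh
# build_index FILE  (hypothetical helper, a toy model of a compact index)
# Prints "value<TAB>file<TAB>[offset,offset,...]" for the first column of a
# tab-separated FILE. Offsets are 0-based byte positions of row starts.
build_index() {
    awk -F '\t' -v file="$1" '
        BEGIN { pos = 0 }
        {
            if (!($1 in offs)) { order[++n] = $1; offs[$1] = pos }
            else               { offs[$1] = offs[$1] "," pos }
            pos += length($0) + 1      # +1 for the trailing newline
        }
        END {
            for (k = 1; k <= n; k++)
                printf "%s\t%s\t[%s]\n", order[k], file, offs[order[k]]
        }
    ' "$1"
}

# Tiny data file: the value 7 appears twice, the value 9 once.
data=$(mktemp)
printf '7\talpha\n9\tbeta\n7\tgamma\n' > "$data"

build_index "$data"
```

　　For this sample the value 7 ends up with offsets [0,15] and the value 9 with [8], the same kind of rows as in the listing above.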
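　　7. Query test again
　　select * from table02 where id=500000; (Time taken: 29.226 seconds, Fetched: 1 row(s))
　　Compared with the initial 30.463 seconds, this is essentially unchanged, so I kept digging.
　　8. The index has to be pruned manually, as follows: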
　　SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
　　Insert overwrite directory "/tmp/table02_index_data" select `_bucketname`, `_offsets` from default__table02_table02_index__ where id =500000;
　　Set hive.index.compact.file=/tmp/table02_index_data;
　　Set hive.optimize.index.filter=false;
　　Set hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
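　　In short, for the query that should use the index (here id = 500000), these commands manually extract the matching entries from the existing index table default__table02_table02_index__ into a temporary directory under /tmp, point hive.index.compact.file at that directory, and turn off automatic index use.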
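　　Why pruning helps can be illustrated outside Hive: once the file and offsets for one value are known, a reader can jump straight to those byte positions instead of scanning the whole file. The following is a toy sketch under that assumption, using a hand-written index file in the same three-column shape. rows_for is a hypothetical helper, and none of this is Hive code.

```shell
#!/bin/sh
# rows_for INDEX VALUE  (hypothetical helper)
# Keeps only the index entries for VALUE (the "pruning" step), then prints
# the matching data rows by seeking to each stored byte offset.
rows_for() {
    awk -F '\t' -v v="$2" '
        $1 == v { gsub(/\[/, "", $3); gsub(/\]/, "", $3); print $2 "\t" $3 }
    ' "$1" |
    while IFS="$(printf '\t')" read -r file offsets; do
        for off in $(printf '%s' "$offsets" | tr ',' ' '); do
            # tail -c +N starts at byte N (1-based), hence offset + 1
            tail -c "+$((off + 1))" "$file" | head -n 1
        done
    done
}

# Toy data file plus a matching toy index (value, file, [offsets]).
data=$(mktemp)
index=$(mktemp)
printf '7\talpha\n9\tbeta\n7\tgamma\n' > "$data"
printf '7\t%s\t[0,15]\n9\t%s\t[8]\n' "$data" "$data" > "$index"

rows_for "$index" 7
```

　　The call rows_for "$index" 7 prints the two rows whose first field is 7 without reading the row for 9.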
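　　9. Final query test
　　select * from table02 where id =500000; (Time taken: 17.259 seconds, Fetched: 1 row(s))
　　Good, this time the query took 17 seconds, which shows the index took effect. Still, the improvement is underwhelming.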
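　　To put the measured numbers in perspective, here is some rough arithmetic on the timings reported above. The function names are made up for this sketch, and the break-even figure is only a lower bound, since the time of the manual pruning step (insert overwrite directory) was not measured in this test.

```shell
#!/bin/sh
# Timings in seconds, copied from the measurements in this post.
before=30.463     # query before the index was usable
after=17.259      # query after manual pruning
rebuild=112.451   # alter index ... rebuild

# How many times faster the pruned query is.
speedup() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.2f\n", b / a }'
}

# Share of the original query time that was saved, in percent.
saved_pct() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (b - a) / b * 100 }'
}

# Smallest number of queries whose total saving covers the rebuild time.
break_even() {
    awk -v b="$1" -v a="$2" -v r="$3" 'BEGIN {
        q = r / (b - a); n = int(q); if (q > n) n++; print n
    }'
}

echo "speedup:    $(speedup "$before" "$after")x"
echo "time saved: $(saved_pct "$before" "$after")%"
echo "break-even: $(break_even "$before" "$after" "$rebuild") queries"
```

　　With these inputs the query is about 1.77 times faster (43.3% less time), and it takes at least 9 such queries before the saving outweighs the index rebuild.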
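　　Personal summary: judging from the official wiki, JIRA, and my own test, Hive's index is hard to use. It is not a traditional B-tree index. Instead, it keeps a redundant lookup index table that simply records value ranges and offsets for the indexed table, and queries consult that table. The index also cannot be used directly: it has to be pruned for the specific predicate before it really takes effect. To me this feels like a half-finished feature, and the project itself says that this area needs more work.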