There are multiple ways to merge small files. One way that Hive provides is -
alter table <Table Name> PARTITION <Partition Name> CONCATENATE;
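For example, for a hypothetical ORC table web_logs partitioned by dt (the table name, partition column, and value below are placeholders), the statement might look like -
hive> alter table web_logs partition (dt='2016-05-01') concatenate;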
However, the above statement does not solve the problem by itself, because it triggers an MR job with only map tasks and no reduce task, so the number of output files ends up equal to the number of mappers that ran.
One can therefore reduce the number of mappers, which in turn reduces the number of output files. Set the properties below (sizes are in bytes; 1073741824 = 1 GB) and the CONCATENATE job will launch fewer mappers -
hive> set hive.merge.mapfiles=true;
hive> set hive.merge.mapredfiles=true;
hive> set hive.merge.size.per.task=1073741824;
hive> set hive.merge.smallfiles.avgsize=1073741824;
hive> set mapreduce.input.fileinputformat.split.maxsize=1073741824;
hive> set mapred.job.reuse.jvm.num.tasks=5;
Also, note that CONCATENATE can cause data loss in case of improper ORC file statistics. Refer - https://issues.apache.org/jira/browse/HIVE-13285
Another way to merge files is to execute a query like the one below -
insert overwrite table … select * from (…) order by …
1. Set the number of reducers to 1.
2. Include some operation in the query (such as ORDER BY) that forces a reduce phase, because a plain SELECT * results in a map-only job. A rough sketch follows below.
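As a sketch only, assuming a hypothetical table web_logs partitioned by dt with columns ip and url, the compaction query could look like this -
hive> set mapred.reduce.tasks=1;
hive> insert overwrite table web_logs partition (dt='2016-05-01')
      select ip, url from web_logs where dt='2016-05-01' order by ip;
The ORDER BY forces a reduce phase, and with a single reducer the partition is rewritten as one file.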