Posts

Showing posts from April, 2018

Hive - Merge large number of small files

There are multiple ways to merge small files in Hive. One way suggested by Hive is -

alter table <Table Name> PARTITION (<Partition Spec>) CONCATENATE;

But the above does not reduce the file count by itself, because it triggers an MR job with only map tasks and no reduce tasks. The number of output files is therefore equal to the number of mappers that run, so reducing the number of mappers reduces the number of output files. Set the properties below to make the "CONCATENATE" job run fewer mappers -

hive> set hive.merge.mapfiles=true;
hive> set hive.merge.mapredfiles=true;
hive> set hive.merge.size.per.task=1073741824;
hive> set hive.merge.smallfiles.avgsize=1073741824;
hive> set mapreduce.input.fileinputformat.split.maxsize=1073741824;
hive> set mapred.job.reuse.jvm.num.tasks=5;

Also note that CONCATENATE can cause data loss in case of improper ORC file statistics. Refer - https://issues.a
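To get a feel for what a 1 GiB maximum split size (the value 1073741824 used in the settings above) does to the file count, here is a rough back-of-the-envelope sketch in Python. It assumes that small files are packed into splits up to the maximum size and that each mapper writes exactly one output file. The real number of mappers also depends on the input format, block locality and other settings, so treat the result as an estimate only. The function name `estimate_output_files` is made up for this illustration and is not part of Hive.

```python
GIB = 1024 ** 3  # 1073741824 bytes, same value as split.maxsize above


def estimate_output_files(total_bytes: int, max_split_bytes: int = GIB) -> int:
    """Rough estimate of mappers (and so output files) for a map-only merge.

    Assumption: input files are combined into splits of at most
    max_split_bytes, and each mapper produces one output file.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must not be negative")
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")
    # Integer ceiling division, avoids floating point rounding on large sizes.
    return (total_bytes + max_split_bytes - 1) // max_split_bytes


if __name__ == "__main__":
    # Hypothetical partition: 10,000 files of 1 MiB each.
    partition_bytes = 10_000 * 1024 ** 2
    print(estimate_output_files(partition_bytes))                    # 1 GiB splits
    print(estimate_output_files(partition_bytes, 256 * 1024 ** 2))   # 256 MiB splits
```

Under these assumptions a larger split size means fewer mappers and so fewer merged files, which is why the split size is raised before running CONCATENATE.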

Spark & Hive over Spark - Performance Problems Hortonworks

I had been using Spark & Hive to insert data into a table. I have the following table in Hive -

CREATE TABLE `ds_test`(
  `name` string)
PARTITIONED BY (
  `company` string,
  `market` string,
  `eventdate` string,
  `processdate` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hdpprod/apps/hive/warehouse/ds_test'
TBLPROPERTIES (
  'transient_lastDdlTime'='1524769102')

I was inserting data into the table using Hive over SQL, like below -

sqlContext.sql("INSERT OVERWRITE TABLE ds_test PARTITION(COMPANY = 'MCOM', MARKET, EVENTDATE, PROCESSDATE) Select name, MARKET, EVENTDATE, PROCESSDATE from Table1")

The above method was working fine, but we were facing performance problems -

1) We saw that application was running t