

Too Large Frame error

Description: When the size of a shuffle data block exceeds the 2 GB limit that Spark can handle, the following error occurs:

org.apache.spark.shuffle.FetchFailedException: Too large frame: XXXXXXXXXX
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
Caused by: java.lang.IllegalArgumentException: Too large frame: XXXXXXXXXX
    at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
    at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:133)

Solutions that may work:
- Increase spark.sql.shuffle.partitions so that each shuffle block becomes smaller.
- Identify the DataFrame that is causing the issue and, once identified, repartition it using df.repartition().

A possible reason for this problem is data skew, where a few partitions receive most of the shuffle data. A sketch of both fixes is shown below.
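A minimal sketch of both remedies, assuming an existing SparkSession named spark and a DataFrame df shuffled on a hypothetical column key (the partition count, column name, and output path are illustrative assumptions, not values from the original error):

// Raise the shuffle partition count so each shuffle block stays well below the 2 GB frame limit
spark.conf.set("spark.sql.shuffle.partitions", "2000")      // illustrative value; tune to your data volume

// Or repartition the offending DataFrame explicitly before the wide operation
val repartitioned = df.repartition(2000, df("key"))         // "key" is a hypothetical join/group column
repartitioned.write.mode("overwrite").parquet("/tmp/out")   // hypothetical output path

If the root cause is skew, raising the partition count alone may not help, because the oversized key still lands in a single partition; salting or isolating the hot key is the usual follow-up.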

Business Value from Machine Learning Methods

Linear Regression - Make predictions for sales forecasting, price optimization, marketing optimization, and financial risk assessment.
Logistic Regression - Predict customer churn, predict response versus advertisement spending, predict customer lifetime value, and monitor how business decisions affect predicted churn rates (see the sketch after this list).
Naive Bayes - Build a spam detector, analyze customer sentiment, or automatically categorize products, customers, or competitors.
K-means clustering - Useful for cost modeling and customer segmentation.
Hierarchical clustering - Model business processes, or segment customers based on survey responses.
K-nearest neighbor classification - A type of instance-based learning; use it for text document classification, financial distress prediction modeling, and competitor analysis and classification.
Principal component analysis - A dimensionality reduction method that you can use for detecting fraud, for s…
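As a rough illustration of how one of these methods is typically wired up, here is a minimal Spark MLlib logistic regression sketch for churn prediction. The input path and the churned, tenure, and monthly_spend columns are hypothetical, and spark is assumed to be an existing SparkSession:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Load a hypothetical churn dataset with a binary label column "churned"
val data = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/data/churn.csv")                            // hypothetical path

// Assemble numeric columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("tenure", "monthly_spend"))    // hypothetical feature columns
  .setOutputCol("features")
val prepared = assembler.transform(data).withColumnRenamed("churned", "label")

// Fit a simple logistic regression model and inspect churn probabilities
val model = new LogisticRegression().setMaxIter(10).setRegParam(0.01).fit(prepared)
model.transform(prepared).select("label", "probability", "prediction").show(5)

The same assembler-plus-estimator pattern carries over to most of the other methods listed above, for example KMeans or PCA in the org.apache.spark.ml packages.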

Types of Data Analytics

In order of increasing complexity, data analytics can be classified into four types:
Descriptive Analytics - Based on historical and current data, answers the question "What happened?"
Diagnostic Analytics - For deducing and inferring success or failure: "Why did that happen?", "Why did it go wrong?", or "Why did we see this growth or success?"
Predictive Analytics - Based on what happened or what is happening, answers "What will happen?" This involves complex model building and analysis in order to predict a future event or trend.
Prescriptive Analytics - Optimizes processes, structures, and systems through informed action based on predictive analytics: what you should do based on what will happen.

Hive - Merge large number of small files

There can be multiple ways to merge files. One way suggested by Hive is:

alter table <Table Name> PARTITION <Partition Name> CONCATENATE;

But the above solution does not always work directly, because it triggers an MR job with only map tasks and no reduce tasks, so the number of output files equals the number of mappers that ran. Reducing the number of mappers therefore reduces the number of output files. Set the properties below so the CONCATENATE job launches fewer mappers:

hive> set hive.merge.mapfiles=true;
hive> set hive.merge.mapredfiles=true;
hive> set hive.merge.size.per.task=1073741824;
hive> set hive.merge.smallfiles.avgsize=1073741824;
hive> set mapreduce.input.fileinputformat.split.maxsize=1073741824;
hive> set mapred.job.reuse.jvm.num.tasks=5;

Also note that CONCATENATE can cause data loss in case of improper ORC file statistics. Refer - https://issues.a

Spark & Hive over Spark - Performance Problems Hortonworks

I had been using Spark and Hive to insert data into a table. I have the following table in Hive:

CREATE TABLE `ds_test`(
  `name` string)
PARTITIONED BY (
  `company` string,
  `market` string,
  `eventdate` string,
  `processdate` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hdpprod/apps/hive/warehouse/ds_test'
TBLPROPERTIES (
  'transient_lastDdlTime'='1524769102')

I was inserting data into the table using Hive over Spark SQL like below:

sqlContext.sql("INSERT OVERWRITE TABLE ds_test PARTITION(COMPANY = 'MCOM', MARKET, EVENTDATE, PROCESSDATE) Select name, MARKET, EVENTDATE, PROCESSDATE from Table1")

The above method was working fine, but we were facing performance problems:
1) We saw that the application was running t…
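For reference, a minimal sketch of how such a dynamic-partition insert is usually driven from Spark. The hive.exec.dynamic.partition settings are standard Hive properties for statements like this, but treat the snippet as an assumption rather than the exact job from the post:

// Enable dynamic partitioning for the MARKET, EVENTDATE, PROCESSDATE columns
sqlContext.sql("SET hive.exec.dynamic.partition=true")
// nonstrict is only required when no static partition value is given; harmless here since COMPANY is static
sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// Insert with one static partition (COMPANY) and three dynamic partitions
sqlContext.sql(
  """INSERT OVERWRITE TABLE ds_test PARTITION(COMPANY = 'MCOM', MARKET, EVENTDATE, PROCESSDATE)
    |SELECT name, MARKET, EVENTDATE, PROCESSDATE FROM Table1""".stripMargin)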