

Spark Duplicate Data on Hive overwrite

With Spark 2.2, when doing an insert into a Hive table with SaveMode.Overwrite, data was getting duplicated. We analyzed this behavior and found that the existing data was not being deleted from the Hive partition due to a permission issue:

    Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=bd-prod, access=EXECUTE, inode="/user/....":hive:hdfs:drwx------

The Spark job does not fail in this case, which leads to duplicate data in the Hive table.
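Below is a minimal sketch of the overwrite pattern involved, followed by a row-count sanity check that can surface this silent duplication. The database, table, and partition names are assumptions for illustration, not the actual job.

```scala
// Hedged sketch: Hive overwrite followed by a row-count check, since Spark 2.2
// does not fail the job when the old partition files cannot be deleted.
// All table, column, and partition names below are hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hive-overwrite-check")
  .enableHiveSupport()
  .getOrCreate()

val source = spark.table("staging_db.daily_events")   // hypothetical source table

source.write
  .mode(SaveMode.Overwrite)
  .insertInto("prod_db.daily_events")                  // hypothetical target table

// If the old files survived a denied delete, the target partition holds more
// rows than the source for the same partition key.
val expected = source.where("event_date = '2024-01-01'").count()
val written  = spark.table("prod_db.daily_events").where("event_date = '2024-01-01'").count()
require(written == expected, s"possible duplicates: expected $expected rows, found $written")
```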

FileAlreadyExistsException in Spark jobs

Description: The FileAlreadyExistsException error occurs in the following scenarios:

- A failure of a previous task may leave behind files that trigger the FileAlreadyExistsException.
- When an executor runs out of memory, its tasks are rescheduled on another executor; the files already written by the failed attempt then cause the FileAlreadyExistsException.
- When any Spark executor fails, Spark retries the task, which can end in a FileAlreadyExistsException after the maximum number of retries.
- The DataFrame being saved has a different schema than the Hive table. In general, the columns should be in the same order as the table schema, with the partition columns last. For example, say your table is partitioned on column A, and you union two data frames and try to save the result into the table: if data frame 1 has a record with column A="valueA" and data frame 2 also has a record with column A="valueA", the write hits the error above. Aligning the column order before the insert helps here, as in the sketch below.
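For the last scenario, here is a minimal sketch, assuming hypothetical table names, of selecting the columns of both source DataFrames in the order the target Hive table expects (partition columns last) before the union and the insert.

```scala
// Hedged sketch: align column order with the target table before insertInto.
// Hive reports partition columns at the end of the table schema, so the
// table's own column order is the order insertInto expects.
// All table names below are hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val targetCols = spark.table("db.target_table").columns.map(col)

// Force both sources into the target's column order before the union.
val dfA = spark.table("db.source_a").select(targetCols: _*)
val dfB = spark.table("db.source_b").select(targetCols: _*)

dfA.union(dfB)
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("db.target_table")
```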

Too Large Frame error

Description: When the size of a shuffle data block exceeds the 2 GB limit that Spark can handle, the following error occurs:

    org.apache.spark.shuffle.FetchFailedException: Too large frame: XXXXXXXXXX
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
    Caused by: java.lang.IllegalArgumentException: Too large frame: XXXXXXXXXX
        at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
        at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:133)

Solutions that may work:

- Set spark.sql.shuffle.partitions to a higher value so that each shuffle block stays smaller.
- Identify the DataFrame that is causing the issue. Once it is identified, repartition it using df.repartition(), as in the sketch below.

A possible root cause of this problem is data skew.
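The two mitigations above look roughly like the following sketch; the partition count, table name, and key column are assumptions for illustration.

```scala
// Hedged sketch: raise the shuffle partition count and explicitly repartition
// the offending DataFrame so that no single shuffle block approaches the 2 GB
// frame limit. The number 800, the table, and the key column are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("too-large-frame").getOrCreate()

// More shuffle partitions means smaller shuffle blocks per partition.
spark.conf.set("spark.sql.shuffle.partitions", "800")

val events = spark.table("db.events")                  // hypothetical skewed DataFrame

// Spread the rows over more partitions (here by a key) before the wide
// operation that was failing with FetchFailedException.
val repartitioned = events.repartition(800, events("user_id"))

val counts = repartitioned.groupBy("user_id").count()
counts.show()
```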

Business Value from Machine Learning Methods

- Linear Regression - Make predictions for sales forecasting, price optimization, marketing optimization, and financial risk assessment.
- Logistic Regression - Predict customer churn, predict response versus advertisement spending, predict the lifetime value of a customer, and monitor how business decisions affect predicted churn rates.
- Naive Bayes - Build a spam detector, analyze customer sentiment, or automatically categorize products, customers, or competitors.
- K-means clustering - Useful for cost modeling and customer segmentation (see the sketch after this list).
- Hierarchical clustering - To model business processes or to segment customers based on survey responses, hierarchical clustering will probably come in handy.
- K-nearest neighbor classification - A type of instance-based learning. Use it for text document classification, financial distress prediction modeling, and competitor analysis and classification.
- Principal component analysis - A dimensionality reduction method that you can use for detecting fraud, for s
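As a single concrete example from the list above, the sketch below clusters customers with k-means in Spark MLlib. The table name, feature columns, and k = 4 are assumptions for illustration.

```scala
// Hedged sketch: customer segmentation with k-means in Spark MLlib.
// The table, the feature columns, and k = 4 are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val spark = SparkSession.builder().appName("customer-segments").getOrCreate()

val customers = spark.table("db.customer_metrics")     // hypothetical table

// Assemble numeric behaviour metrics into a single feature vector.
val features = new VectorAssembler()
  .setInputCols(Array("annual_spend", "visit_frequency", "avg_basket_size"))
  .setOutputCol("features")
  .transform(customers)

// Fit 4 segments and attach a "prediction" column with the segment id.
val model = new KMeans().setK(4).setSeed(42L).setFeaturesCol("features").fit(features)
val segmented = model.transform(features)

segmented.groupBy("prediction").count().show()
```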

Types of Data Analytics

In order of complexity, analytics can be classified into four types:

- Descriptive Analytics - Uses historical and current data to answer the question "What happened?"
- Diagnostic Analytics - Deduces and infers the reasons for success or failure: "Why did that happen?", "Why did it go wrong?", or "Why did we see this growth or success?"
- Predictive Analytics - Based on what happened or what is happening, answers the question "What will happen?". This involves complex model building and analysis in order to predict a future event or trend.
- Prescriptive Analytics - Optimizes processes, structures, and systems through informed action based on predictive analytics: what you should do based on what will happen.