
Posts

Showing posts from May, 2019

Spark Duplicate Data on Hive overwrite

With Spark 2.2, while doing an insert into a Hive table with SaveMode.Overwrite, data gets duplicated. We analyzed this behavior and found that the existing data was not being deleted from the Hive partition due to a permission issue:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=bd-prod, access=EXECUTE, inode="/user/....":hive:hdfs:drwx------

The Spark job does not fail in this case, which leads to duplicate data in the Hive table.
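To make the scenario concrete, here is a minimal sketch of the kind of write that hit this issue; the database and table names are hypothetical. With SaveMode.Overwrite on a partitioned Hive table, Spark removes the existing partition data before writing the new files, so if that delete is blocked by HDFS permissions the old and new files can end up side by side.

import org.apache.spark.sql.{SaveMode, SparkSession}

// Hive-enabled session; Spark 2.2 as in the report above.
val spark = SparkSession.builder()
  .appName("hive-overwrite-example")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical source data frame.
val df = spark.table("staging.events_update")

// Overwrite the target Hive table; if the delete of the existing
// partition files silently fails on permissions, duplicates appear.
df.write
  .mode(SaveMode.Overwrite)
  .insertInto("prod.events")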

FileAlreadyExistsException in Spark jobs

Description: The FileAlreadyExistsException error occurs in the following scenarios:

Failure of a previous task might leave behind files that trigger the FileAlreadyExistsException error.

When an executor runs out of memory, the individual tasks of that executor are rescheduled on another executor. As a result, the FileAlreadyExistsException error occurs.

When any Spark executor fails, Spark retries the task, which might result in a FileAlreadyExistsException error after the maximum number of retries.

In Spark, if the data frame to be saved has a different schema than the Hive table, the columns should generally be in the same sequence, with the partition columns last. For example, say your table is partitioned on column A, and you union two data frames and try to save the result into the table. If data frame 1 has a record with column A="valueA" and data frame 2 also has a record with column A="valueA", both writes target the same partition, which can produce the error above. A sketch of this scenario follows below.
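A hedged sketch of the union scenario just described; all database, table, and column names here are hypothetical. Both data frames carry a row for the same partition value, so their tasks write into the same partition directory, and a failed-and-retried task can find its output file already present.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("union-partition-example")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Partition column A is placed last, matching the Hive table layout.
val df1 = Seq((1, "valueA")).toDF("value", "A")
val df2 = Seq((2, "valueA")).toDF("value", "A")

// Both frames target partition A="valueA"; a retried write task
// can then hit FileAlreadyExistsException on the partition files.
df1.union(df2)
  .write
  .insertInto("db.partitioned_table")   // hypothetical table partitioned on A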

Too Large Frame error

Description: When the size of a shuffle data block exceeds the 2 GB limit that Spark can handle, the following error occurs:

org.apache.spark.shuffle.FetchFailedException: Too large frame: XXXXXXXXXX
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
Caused by: java.lang.IllegalArgumentException: Too large frame: XXXXXXXXXX
at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:133)

Solutions that may work:
Set spark.sql.shuffle.partitions to a higher value so each shuffle block carries less data.
Identify the DataFrame that is causing the issue and repartition it using df.repartition().
A possible root cause of this problem is data skew. A sketch of these mitigations follows below.
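A minimal sketch of the two mitigations listed above; the table name, column name, and the partition count of 2000 are illustrative assumptions, not prescriptions for any particular job.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("too-large-frame-mitigation")
  .getOrCreate()

// More shuffle partitions means less data per shuffle block,
// keeping each block under the 2 GB frame limit.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// Hypothetical large DataFrame identified as the source of the oversized blocks.
val bigDf = spark.table("db.large_fact")

// Spread the rows across more partitions so no single shuffle block grows too large.
val repartitioned = bigDf.repartition(2000)

// Any subsequent wide operation (join, groupBy, ...) now shuffles smaller blocks.
val counts = repartitioned.groupBy("some_column").count()   // hypothetical column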