Search This Blog

Wednesday, July 17, 2019

Read from a hive table and write back to it using spark sql

In context to Spark 2.2 -

if we read from an hive table and write to same, we get following exception-

scala> dy.write.mode("overwrite").insertInto("incremental.test2")
             org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;

 1. This error means that our process is reading from same table and writing to same table.
 2. Normally, this should work as process writes to directory .hiveStaging... 
 3. This error occurs in case of saveAsTable method, as it overwrites entire table instead of individual partitions.
 4. This error should not occur with insertInto method, as it overwrites partitions not the table.
 5. A reason why this happening is because Hive table has following Spark TBLProperties in its definition. This problem will solve for insertInto method if you remove following Spark TBLProperties - 

> 'spark.sql.partitionProvider' 'spark.sql.sources.provider'
> 'spark.sql.sources.schema.numPartCols
> 'spark.sql.sources.schema.numParts' 'spark.sql.sources.schema.part.0'
> 'spark.sql.sources.schema.part.1' 'spark.sql.sources.schema.part.2'
> 'spark.sql.sources.schema.partCol.0'
> 'spark.sql.sources.schema.partCol.1'


Note that - If used with Parquet output format then it may still throw an exception (, even after removing above TBL Properties). It is possible only when our flow tries to read from and write into same table & format is Parquet.

No comments:

Post a Comment