Read from a hive table and write back to it using spark sql

In context to Spark 2.2 -

if we read from an hive table and write to same, we get following exception-

scala> dy.write.mode("overwrite").insertInto("incremental.test2")
             org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;

org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;


 1. This error means that our process is reading from same table and writing to same table.
 2. Normally, this should work as process writes to directory .hiveStaging... 
 3. This error occurs in case of saveAsTable method, as it overwrites entire table instead of individual partitions.
 4. This error should not occur with insertInto method, as it overwrites partitions not the table.
 5. A reason why this happening is because Hive table has following Spark TBLProperties in its definition. This problem will solve for insertInto method if you remove following Spark TBLProperties - 

> 'spark.sql.partitionProvider' 'spark.sql.sources.provider'
> 'spark.sql.sources.schema.numPartCols
> 'spark.sql.sources.schema.numParts' 'spark.sql.sources.schema.part.0'
> 'spark.sql.sources.schema.part.1' 'spark.sql.sources.schema.part.2'
> 'spark.sql.sources.schema.partCol.0'
> 'spark.sql.sources.schema.partCol.1'



Note that - If used with Parquet output format then it may still throw an exception (, even after removing above TBL Properties). It is possible only when our flow tries to read from and write into same table & format is Parquet.

QueryDB

Search This Blog

Read from a hive table and write back to it using spark sql

Comments

Post a Comment

Popular posts

Spark MongoDB Connector Not leading to correct count or data while reading

Scala Spark building Jar leads java.lang.StackOverflowError

MongoDB Chunk size many times bigger than configure chunksize (128 MB)

AWS EMR Spark – Much Larger Executors are Created than Requested

Hive Count Query not working