
Read from a Hive table and write back to it using Spark SQL

In the context of Spark 2.2 -

If we read from a Hive table and write back to the same table, we get the following exception -

scala> dy.write.mode("overwrite").insertInto("incremental.test2")
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;
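The post does not show how dy was built; as a minimal sketch, assuming dy comes straight from the very table being overwritten, the full failing sequence in spark-shell looks like this:

scala> // read from the target table; any transformation of this DataFrame still references incremental.test2
scala> val dy = spark.table("incremental.test2")
scala> // writing the result back into the same table is then rejected at analysis time
scala> dy.write.mode("overwrite").insertInto("incremental.test2")
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;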

 1. This error means that our process is reading from and writing to the same table.
 2. Normally this should work, because the process first writes to a staging directory (.hive-staging...) and only then moves the data into the table.
 3. The error occurs with the saveAsTable method, because it overwrites the entire table instead of individual partitions.
 4. The error should not occur with the insertInto method, because it overwrites partitions rather than the whole table.
 5. One reason it still happens is that the Hive table has the following Spark TBLPROPERTIES in its definition. For insertInto, the problem goes away if you remove these properties (see the sketch after the list) -

> 'spark.sql.partitionProvider'
> 'spark.sql.sources.provider'
> 'spark.sql.sources.schema.numPartCols'
> 'spark.sql.sources.schema.numParts'
> 'spark.sql.sources.schema.part.0'
> 'spark.sql.sources.schema.part.1'
> 'spark.sql.sources.schema.part.2'
> 'spark.sql.sources.schema.partCol.0'
> 'spark.sql.sources.schema.partCol.1'
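
A rough sketch of inspecting and removing those properties from spark-shell (assuming the same incremental.test2 table; depending on your Hive/Spark versions the ALTER TABLE may need to be run from the Hive shell instead of through spark.sql):

scala> // list the current table properties to confirm the Spark-specific keys are present
scala> spark.sql("SHOW TBLPROPERTIES incremental.test2").show(100, false)
scala> // drop the Spark-specific properties so the table behaves like a plain Hive table for insertInto
scala> spark.sql("""ALTER TABLE incremental.test2 UNSET TBLPROPERTIES (
     |   'spark.sql.partitionProvider', 'spark.sql.sources.provider',
     |   'spark.sql.sources.schema.numPartCols', 'spark.sql.sources.schema.numParts',
     |   'spark.sql.sources.schema.part.0', 'spark.sql.sources.schema.part.1', 'spark.sql.sources.schema.part.2',
     |   'spark.sql.sources.schema.partCol.0', 'spark.sql.sources.schema.partCol.1')""")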


Note that if the output format is Parquet, the exception may still be thrown even after removing the above TBLPROPERTIES. This happens only when our flow tries to read from and write into the same table and the format is Parquet.
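
For that Parquet case, one commonly used workaround (not covered above, just a sketch - the staging table name incremental.test2_stg is made up for illustration) is to materialize the result into a separate staging table first, so the final write no longer reads from its own target:

scala> // stage the result in a separate table; this write does not target the table being read
scala> dy.write.mode("overwrite").saveAsTable("incremental.test2_stg")
scala> // the final insert now reads only from the staging table, so overwriting incremental.test2 is allowed
scala> spark.table("incremental.test2_stg").write.mode("overwrite").insertInto("incremental.test2")
scala> // optionally drop the staging table afterwards
scala> spark.sql("DROP TABLE IF EXISTS incremental.test2_stg")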
