
Posts

Snappy error using Spark/Hive

We received the following errors using Spark:

1) java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
        at org.apache.parquet.hadoop.codec.SnappyDecompressor.decompress(SnappyDecompressor.java:62)
        at org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)

2) Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.2-d5273c94-b734-4a61-b631-b68a9e859151-libsnappyjava.so: /tmp/snappy-1.1.2-d5273c94-b734-4a61-b631-b68a9e859151-libsnappyjava.so: failed to map segment from shared object: Operation not permitted
        at java.lang.ClassLoader$NativeLibrary.load(Native Method)
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
        at java.l...
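The UnsatisfiedLinkError in 2) usually means snappy-java extracted its native library into a temp directory mounted with noexec, so the .so cannot be mapped as executable. A hedged sketch of a common workaround is to point the JVM and snappy temp directories at an exec-allowed path when submitting the job (the /var/tmp paths and jar name below are assumptions, not from the original post):

    spark-submit \
      --driver-java-options "-Djava.io.tmpdir=/var/tmp -Dorg.xerial.snappy.tempdir=/var/tmp" \
      --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=/var/tmp -Dorg.xerial.snappy.tempdir=/var/tmp" \
      your-application.jar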

Spark Hive Table on top of data consisting of subdirectories

We can create a Hive table on an HDFS path whose data sits in subdirectories, like:

Table A
|--Dir1
|   |--datafile
|--Dir2
|   |--datafile
|--Dir3
    |--datafile

When we read this Hive table using Spark, it fails with an error that the respective path "is a directory, not a file".

Solution - the data can be read recursively by setting the following property:

set mapreduce.input.fileinputformat.input.dir.recursive=true;
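For reference, a minimal sketch of the same read from a Hive-enabled SparkSession, setting the property on the Hadoop configuration before the scan (the table name db.table_a is an assumption):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ReadNestedHiveTable")
      .enableHiveSupport()
      .getOrCreate()

    // Let the underlying input format descend into the Dir1/Dir2/Dir3 subdirectories
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

    // The scan now picks up the datafiles nested under each subdirectory
    val df = spark.sql("SELECT * FROM db.table_a")
    df.show()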

Serialization exception running a GenericUDF in Hive with Spark

I get an exception running a job that uses a GenericUDF in Hive with Spark. The exception trace is below:

                at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
                at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:186)
                ... 40 more
Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: java.time.temporal.TemporalQueries$$Lambda$27/1278539211
Serialization trace:
query (java.time.format.DateTimeFormatterBuilder$ZoneTextPrinterParser)
printerParsers (java.time.format.DateTimeFormatterBuilder$CompositePrinterParser)
printerParser (java.time.format.DateTimeFormatter)
frmt (com.ds.common.udfs.TZToOffset)
...
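The serialization trace shows Kryo choking on a JDK lambda held inside the DateTimeFormatter stored in the UDF's frmt field. A hedged sketch of the usual remedy, shown here as a hypothetical UDF rather than the actual com.ds.common.udfs.TZToOffset: mark the formatter @transient and rebuild it lazily on the executor so it is never serialized with the plan.

    import java.time.{ZoneId, ZonedDateTime}
    import java.time.format.DateTimeFormatter
    import org.apache.hadoop.hive.ql.exec.UDFArgumentException
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

    // Hypothetical UDF: takes a time-zone id and returns a formatted "now" in that zone.
    class TimeZoneFormatUDF extends GenericUDF {

      // @transient keeps the non-serializable formatter (and the JDK lambdas it holds)
      // out of the Kryo-serialized state; it is rebuilt after deserialization instead.
      @transient private var frmt: DateTimeFormatter = _

      private def formatter: DateTimeFormatter = {
        if (frmt == null) frmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss zzzz")
        frmt
      }

      override def initialize(args: Array[ObjectInspector]): ObjectInspector = {
        if (args.length != 1) throw new UDFArgumentException("expects a single time-zone string")
        PrimitiveObjectInspectorFactory.javaStringObjectInspector
      }

      override def evaluate(args: Array[DeferredObject]): AnyRef = {
        val arg = args(0).get()
        if (arg == null) null
        else formatter.format(ZonedDateTime.now(ZoneId.of(arg.toString)))
      }

      override def getDisplayString(children: Array[String]): String =
        s"tz_format(${children.mkString(", ")})"
    }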

Spark duplicates data on Hive overwrite

With Spark 2.2, while doing an insert into a table with SaveMode.Overwrite, data is getting duplicated. We analyzed this behavior and found that the existing data is not getting deleted from the Hive partition due to a permission issue:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=bd-prod, access=EXECUTE, inode="/user/....":hive:hdfs:drwx------

The Spark job does not fail in this case, leading to duplicate data in the Hive table.
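For context, a minimal sketch of the write path described above, assuming a Hive-enabled SparkSession and hypothetical table names:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("OverwriteHivePartition")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.table("db.staging_table")   // hypothetical source

    // SaveMode.Overwrite is expected to remove the existing partition files first;
    // when that delete is blocked by the AccessControlException above, the new
    // files are still written, so old and new rows coexist in the partition.
    df.write
      .mode(SaveMode.Overwrite)
      .insertInto("db.target_table")           // hypothetical target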

FileAlreadyExistsException in Spark jobs

Description: The FileAlreadyExistsException error occurs in the following scenarios:

1) Failure of a previous task might leave behind files that trigger the FileAlreadyExistsException errors shown below.
2) When an executor runs out of memory, the individual tasks of that executor are rescheduled on another executor; as a result, the FileAlreadyExistsException error occurs.
3) When any Spark executor fails, Spark retries the task, which might result in a FileAlreadyExistsException error after the maximum number of retries.
4) In Spark, if the data frame to be saved has a different schema than the Hive table schema: generally, the columns should be in the same sequence as the table, with the partition columns last (see the sketch after this list). For example, say your table is partitioned on column A, you have two data frames, you union them and try to save the result into the table; it will result in the error above because, say, data frame 1 has a record with column A="valueA" and data frame 2 also has a record with column A="valueA" A...
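A minimal sketch of the column alignment mentioned in scenario 4), assuming hypothetical table names and a partition column A: insertInto matches columns by position, so the unioned data frames are reordered to the target table's layout, with the partition column last, before writing.

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("UnionAndInsert")
      .enableHiveSupport()
      .getOrCreate()

    val df1 = spark.table("db.source_1")   // hypothetical sources
    val df2 = spark.table("db.source_2")

    // Reorder the unioned columns to the target table's column order,
    // which puts the partition column A last.
    val target  = spark.table("db.target_partitioned_on_a")
    val aligned = df1.union(df2).select(target.columns.map(col): _*)

    aligned.write
      .mode(SaveMode.Append)
      .insertInto("db.target_partitioned_on_a")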