Posts

org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow

  We were running an application that failed with the error below:

Job aborted due to stage failure: Task 137 in stage 5.0 failed 4 times, most recent failure: Lost task 137.3 in stage 5.0 (TID 2090, ncABC.hadoop.com, executor 1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 59606960. To avoid this, increase spark.kryoserializer.buffer.max value.
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:330)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:456)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 59606960
at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
at com.esotericsoftware.kryo.io.Output.wr...
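As the message suggests, the fix is to raise spark.kryoserializer.buffer.max above the required size (roughly 60 MB here; the default maximum is 64m). A minimal sketch, assuming the job is launched with spark-submit — the class name and jar are placeholders:

```shell
# Raise the Kryo buffer limit well above the ~60 MB the task needed.
# The upper bound for this setting is 2047m.
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=512m \
  --class com.example.MyApp \
  my-app.jar
```

The same property can also be set in spark-defaults.conf or on the SparkConf object before the SparkContext is created.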

Spark MongoDB Connection - Fetched BSON document does not have all the fields solution

  The Spark MongoDB connector may not fetch all the fields present in the stored BSON documents of a collection. This is because a Mongo collection can hold documents with different schemas. Typically, all documents in a collection serve a similar or related purpose. A document is a set of key-value pairs, and documents have dynamic schemas: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data. So, when we read a MongoDB collection using the Spark connector, it infers the schema from the documents it samples first, which might not contain fields that are present in subsequent rows. Suppose we have a MongoDB collection, default.fruits, with the following documents:

{ "_id" : 1, "type" : "apple" }
{ "_id" : 2, "type" : "orange", "qty" : 10 }
{ "_id" : 3, "type" : "ban...
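One way to reduce the chance of losing fields such as "qty" is to have the connector sample more documents during schema inference. A sketch under stated assumptions — the URI, package version, and sample size are illustrative, and the exact option name varies across connector versions:

```shell
# Sample more documents when inferring the schema, so optional
# fields present only in later documents are picked up.
spark-submit \
  --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 \
  --conf spark.mongodb.input.uri="mongodb://localhost:27017/default.fruits" \
  --conf spark.mongodb.input.sampleSize=10000 \
  my_job.py
```

The more robust alternative is to declare an explicit schema in the reading application instead of relying on inference at all.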

Use Encrypted Password in Linux / Unix Shell Script

  The easiest available way is using OpenSSL.

Encrypt

Say your actual password is "Password12345". You can encrypt it using the command below:

echo "Password12345" | openssl enc -aes-256-cbc -md sha512 -a -pbkdf2 -iter 100000 -salt -pass pass:Secret@1234# > secret.txt

Resultant output in secret.txt:

U2FsdGVkX1/2NoQ6i1uZKK4yk+5gm5cA13EJ2TiPbcw=

Save this in a file and use it with your applications.

Decrypt

You can then read the file holding the encrypted password, decrypt it using the command below, and pass the result to subsequent operations/commands. Note that the passphrase must match the one used for encryption:

cat "secret.txt" | openssl enc -aes-256-cbc -md sha512 -a -d -pbkdf2 -iter 100000 -salt -pass pass:Secret@1234#

Resultant output:

Password12345
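Putting the two commands together, a minimal sketch of a script that stores the encrypted password and recovers it into a variable at runtime. It assumes OpenSSL 1.1.1+ (for -pbkdf2); the key and password values are the illustrative ones from above, and in practice the passphrase should be kept outside the script:

```shell
#!/bin/sh
# Illustrative passphrase; keep the real one out of the script
# (e.g. in a file readable only by the service account).
KEY='Secret@1234#'

# One-time step: encrypt the password into secret.txt
printf '%s\n' 'Password12345' | openssl enc -aes-256-cbc -md sha512 -a \
    -pbkdf2 -iter 100000 -salt -pass pass:"$KEY" > secret.txt

# At runtime: decrypt and capture the clear-text password in a variable
PASSWORD=$(openssl enc -aes-256-cbc -md sha512 -a -d \
    -pbkdf2 -iter 100000 -pass pass:"$KEY" < secret.txt)

# $PASSWORD can now be passed to subsequent commands
```

Note that command substitution strips the trailing newline, so $PASSWORD holds exactly the original value.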

Spark Performance Tuning Example 1

  We had a Spark job that was taking over 3 hours to complete.

First, we found the stage that was taking the time. It simply reads a file and maps the data to its final save location, so there are not many joins or calculations involved.

Second, we checked the event timeline for any delay due to serialization, shuffling, or scheduling. There was nothing.

Third, we looked at the total executors and the tasks processed by them. There were 5 executors, each running for over 3 hours and each executing approximately 500 tasks, which means a single task was taking almost 2.8-3 minutes.

Fourth, to confirm there was no data skew, we sorted the tasks by maximum duration, maximum input processed, and maximum shuffle. Again, we found nothing.

Conclusively, there was no problem with the job itself. The job is slow because it has too few executors, and hence too few vcores, for task processing. Thus, we bumped the number of executors up from 5 to 30. And, eac...
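The change above can be applied at submit time. A sketch, assuming a YARN deployment; the memory and core values are illustrative, not from the original job:

```shell
# Bump parallelism: 30 executors instead of 5.
# Total concurrent task slots = num-executors * executor-cores.
spark-submit \
  --master yarn \
  --num-executors 30 \
  --executor-cores 4 \
  --executor-memory 8g \
  my-app.jar
```

With roughly 6x the task slots, the ~2500 uniform tasks that previously queued behind 5 executors drain proportionally faster, since the analysis above showed no skew or shuffle bottleneck.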

Hadoop Distcp Error Duplicate files in input path

  One may face the following error while copying data from one cluster to another using DistCp.

Command: hadoop distcp -i {src} {tgt}

Error: org.apache.hadoop.tools.CopyListing$DuplicateFileException: File would cause duplicates.

Ideally, two files with the same name cannot land in the same directory. So, what is likely happening in your case is that you are trying to copy a partitioned table from one cluster to another, and two differently named partitions contain files with the same name. The solution is to correct the source path {src} in your command so that it points to the directory above the partition subdirectories, not to the files themselves. For example, given:

/a/partcol=1/file1.txt
/a/partcol=2/file1.txt

If you use {src} as "/a/*/*", you will get the error "File would cause duplicates." But if you use {src} as "/a", the copy will succeed.
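With the example layout above, the two invocations look like this (the cluster addresses and target paths are placeholders):

```shell
# Fails: the wildcard flattens both partitions into one copy listing,
# so file1.txt appears twice and DistCp raises DuplicateFileException
hadoop distcp -i "hdfs://src-cluster/a/*/*" "hdfs://tgt-cluster/a"

# Works: copying the parent directory preserves partcol=1/ and
# partcol=2/ as separate subdirectories on the target
hadoop distcp -i "hdfs://src-cluster/a" "hdfs://tgt-cluster/"
```

The -i flag (ignore failures) is carried over from the original command; it does not affect the duplicate check, which happens while building the copy listing.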