We were running an application that kept failing with the error below -
Job aborted due to stage failure: Task 137 in stage 5.0 failed 4 times, most recent failure: Lost task 137.3 in stage 5.0 (TID 2090, ncABC.hadoop.com, executor 1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 59606960. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:330)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:456)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 59606960
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)
	at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
	at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
Solution -
On further analysis, we found that this error was originating from a BroadcastNestedLoopJoin, as these frames in the stack trace show -
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:165)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:162)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:150)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doExecute(BroadcastNestedLoopJoinExec.scala:343)
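For reference, the join strategy Spark picked can be confirmed by printing the physical plan. Below is a minimal sketch; the SparkSession, query, and table names (table_a, table_b) are illustrative placeholders, not our actual job -

import org.apache.spark.sql.SparkSession

// Sketch only: print the physical plan and look for "BroadcastNestedLoopJoin" in it.
// The query and table names below are illustrative, not taken from our actual job.
val spark = SparkSession.builder().appName("plan-inspection").getOrCreate()
val df = spark.sql("SELECT a.* FROM table_a a WHERE a.id NOT IN (SELECT b.id FROM table_b b)")
df.explain()  // the printed operator tree shows the join node that was chosen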
So, we removed the BroadcastNestedLoopJoin by rewriting the NOT IN clause in the SQL WHERE condition as NOT EXISTS. Refer to the details here - http://querydb.blogspot.com/2021/06/spark-disable-broadcast-join-not.html
This solved our problem.
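For illustration, here is a minimal sketch of the kind of rewrite we applied. It assumes an existing SparkSession named "spark", and table_a/table_b are hypothetical placeholders for our actual tables -

// Before: NOT IN with a subquery, which Spark plans as a null-aware anti join
// and executes with BroadcastNestedLoopJoin.
val before = spark.sql(
  """SELECT a.*
    |FROM table_a a
    |WHERE a.id NOT IN (SELECT b.id FROM table_b b)""".stripMargin)

// After: correlated NOT EXISTS, which Spark plans as a regular left anti join
// and can execute with a broadcast hash join or sort merge join instead.
val after = spark.sql(
  """SELECT a.*
    |FROM table_a a
    |WHERE NOT EXISTS (SELECT 1 FROM table_b b WHERE b.id = a.id)""".stripMargin)

Note that the two forms are not strictly equivalent when the subquery column contains NULLs (NOT IN then returns no rows at all), so this rewrite is only safe when the join key is non-nullable.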
Alternative Solution -
We read about this solution in multiple places, but we did not try setting the properties below -
--conf spark.kryoserializer.buffer.max=1024m --conf spark.kryoserializer.buffer=512m
And we don't recommend setting these to anything other than their default values, because the documentation notes that there is one buffer per core on each worker, and each buffer will grow up to spark.kryoserializer.buffer.max if needed.
That said, if a worker has 4 cores, then 4 * 512m = 2 GB is taken up by Kryo buffers alone, which is a sizable chunk of memory.
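If you did want to experiment with these settings anyway, they can also be set programmatically instead of via spark-submit. Below is a minimal sketch; we did not try this ourselves, and the values shown are only examples -

import org.apache.spark.sql.SparkSession

// Sketch only: raising the Kryo buffer sizes programmatically. The values are examples;
// spark.kryoserializer.buffer must not exceed spark.kryoserializer.buffer.max, and each
// core on a worker gets its own buffer that can grow to the configured maximum.
val spark = SparkSession.builder()
  .appName("kryo-buffer-tuning")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryoserializer.buffer", "512m")
  .config("spark.kryoserializer.buffer.max", "1024m")
  .getOrCreate()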