Spark: Disabling Broadcast Join Does Not Work in the Case of BroadcastNestedLoopJoin


We were running an application that failed with the error below - 

Caused by: org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:150)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:154)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:150)

Even after setting the property below, which is supposed to disable broadcast joins, we kept getting the above error:

set spark.sql.autoBroadcastJoinThreshold=-1
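
For reference, a minimal sketch of how this property can be set, assuming a SparkSession named spark (both forms are equivalent):

// Disable the automatic broadcast threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// or via SQL:
spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")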


On further analysis, we found that this is not a bug in Spark. You would expect broadcasting to stop once you disable the broadcast threshold by setting spark.sql.autoBroadcastJoinThreshold to -1, but for a BroadcastNestedLoopJoin Apache Spark still tries to broadcast the bigger table and fails with the broadcast error above.

Note that you can see "BroadcastNestedLoopJoin" in the Spark Web UI, or by invoking .explain(true) on the DataFrame to visualize the physical plan.
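
For example, a minimal sketch of spotting it programmatically (df here is an illustrative DataFrame):

// Print parsed, analyzed, optimized and physical plans
df.explain(true)

// Or check the executed plan text directly
val usesBnlj = df.queryExecution.executedPlan.toString.contains("BroadcastNestedLoopJoin")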


Solution -

To fix the above problem, we replaced the "not in" clause in our SQL with a "not exists" clause. For example - 

select * from table1 where id not in (select id from table2)

rewritten as - 

select * from table1 where not exists (select 1 from table2 where table1.id = table2.id)
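
To see the difference, here is a minimal, self-contained sketch on a local SparkSession (the table and column names are illustrative, not our production schema):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Option[Int] keeps the columns nullable, matching the general not-in case
Seq(Some(1), Some(2), Some(3)).toDF("id").createOrReplaceTempView("table1")
Seq(Some(1), Some(2)).toDF("id").createOrReplaceTempView("table2")

// not in: the physical plan shows BroadcastNestedLoopJoin despite the -1 threshold
spark.sql("select * from table1 where id not in (select id from table2)").explain()

// not exists: planned as a left anti equi-join, which honors the threshold
spark.sql("select * from table1 where not exists (select 1 from table2 where table1.id = table2.id)").explain()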


Explanation - 

In SQL, not in means that if the subquery returns even a single null value, the whole result is empty. This is why it can only be executed with a BroadcastNestedLoopJoin: all the not in values must be known in order to ensure there is no null value in the set.
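
A quick illustration of that null behaviour, with hypothetical data and continuing the sketch above:

// 2 not in (1, null) evaluates to null rather than true, so no row survives
Seq(Some(1), Some(2)).toDF("id").createOrReplaceTempView("t1")
Seq(Some(1), None).toDF("id").createOrReplaceTempView("t2")
spark.sql("select * from t1 where id not in (select id from t2)").show()  // empty result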


Similarly, the example below will result in a BroadcastNestedLoopJoin, because the or in the join condition leaves Spark no single equality key to plan an equi-join on - 

df1.join(df2, $"id1" === $"id2" || $"id2" === $"id3", "left")

rewritten as - 

df1.join(df2, $"id1" === $"id2", "left").join(df2, $"id2" === $"id3", "left")
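
Putting both variants side by side, a minimal sketch continuing the same session (aliases are added only to keep the second join's column references unambiguous; the data is illustrative):

val df1 = Seq((1, 10), (2, 20)).toDF("id1", "id3").as("a")
val df2 = Seq((1, 100), (20, 200)).toDF("id2", "x").as("b")
val df3 = df2.as("c")

// The or condition has no single equality key, so Spark plans a BroadcastNestedLoopJoin
df1.join(df2, $"a.id1" === $"b.id2" || $"b.id2" === $"a.id3", "left").explain()

// Each split join has one equality key and can use a hash- or sort-merge-based join
df1.join(df2, $"a.id1" === $"b.id2", "left")
  .join(df3, $"c.id2" === $"a.id3", "left")
  .explain()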


Reference -

You can read more details about the same here - https://kb.databricks.com/sql/disable-broadcast-when-broadcastnestedloopjoin.html

