Spark Disable BroadCast Join not working in case of BroadcastNestedLoopJoin

We were running an application that failed with the following error -

Caused by: org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:150)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:154)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:150)

Even after setting the property below, which is meant to disable broadcast joins, we kept getting the same error.

Set spark.sql.autoBroadcastJoinThreshold=-1

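On further analysis, we found that this is not a bug in Spark. You would expect broadcasting to stop once spark.sql.autoBroadcastJoinThreshold is set to -1, but that setting only stops Spark from automatically choosing a broadcast join based on table size. When the physical plan uses BroadcastNestedLoopJoin, Apache Spark still tries to broadcast one side of the join, even the bigger table, and fails with the broadcast error.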

You can confirm that the query uses "BroadcastNestedLoopJoin" by checking the query plan in the Spark Web UI, or by calling ".explain(true)" on the DataFrame to print the physical plan.
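The check can also be done in code. The following is a minimal sketch, assuming a local SparkSession and two small made-up DataFrames. The object name BnljCheck, the helper usesBroadcastNestedLoopJoin and the column names are hypothetical and only serve as illustration. The join condition is deliberately not a plain equality, so the broadcast threshold has no effect on it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object BnljCheck {
  // Hypothetical helper: true if the physical plan chosen for `df`
  // contains a BroadcastNestedLoopJoin node.
  def usesBroadcastNestedLoopJoin(df: DataFrame): Boolean =
    df.queryExecution.executedPlan.toString.contains("BroadcastNestedLoopJoin")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bnlj-check")
      // Same setting as in the post: disable size-based automatic broadcast.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()

    // Made-up sample data, only to have something to plan.
    val left  = spark.range(3).toDF("id1")
    val right = spark.range(2).selectExpr("id as id2", "id * 10 as id3")

    // An OR between two equalities leaves Spark without an equi-join key.
    val joined = left.join(
      right,
      col("id1") === col("id2") || col("id1") === col("id3"),
      "left")

    joined.explain(true) // prints the logical and physical plans
    println(s"uses BroadcastNestedLoopJoin: ${usesBroadcastNestedLoopJoin(joined)}")

    spark.stop()
  }
}
```

Running the same helper against your own DataFrame shows whether setting the threshold to -1 can help at all for that query.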

Solution -

To fix the above problem, we replaced the "not in" clause in our SQL with a "not exists" clause. For example -

select * from Table1 where id not in (select id from Table2)

rewritten as -

select * from Table1 where not exists (select 1 from Table2 where Table1.id = Table2.id)

Explanation -

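In SQL, "not in" follows NULL semantics: if the list of "not in" values contains even one NULL, the predicate is never true and the result is empty. All "not in" values must therefore be known before any row can be accepted, to make sure there is no NULL in the set. This is why the query can only be executed with BroadcastNestedLoopJoin. "not exists" has no such rule, so it can be planned as an ordinary join on id. Note that the two forms return the same rows only when id is never NULL in either table.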
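The NULL rule is easier to see with a small model. The sketch below is plain Scala, not Spark code: it represents a nullable integer as Option[Int] and the three SQL truth values as Option[Boolean], with None standing for UNKNOWN. The names NotInSemantics, notIn and filterNotIn are hypothetical.

```scala
object NotInSemantics {
  // Result of `x NOT IN (values)` under SQL three-valued logic:
  // Some(true) = TRUE, Some(false) = FALSE, None = UNKNOWN.
  def notIn(x: Option[Int], values: Seq[Option[Int]]): Option[Boolean] = x match {
    case None if values.nonEmpty => None        // NULL compared with anything is UNKNOWN
    case None                    => Some(true)  // NOT IN over an empty set is TRUE
    case Some(v) =>
      if (values.contains(Some(v))) Some(false) // a definite match
      else if (values.contains(None)) None      // no match, but a NULL might have matched
      else Some(true)                           // no match and no NULL in the set
  }

  // A WHERE clause keeps a row only when the predicate is TRUE.
  def filterNotIn(rows: Seq[Option[Int]], values: Seq[Option[Int]]): Seq[Option[Int]] =
    rows.filter(r => notIn(r, values).contains(true))

  def main(args: Array[String]): Unit = {
    val rows = Seq(Some(1), Some(2), Some(3))
    println(filterNotIn(rows, Seq(Some(2))))       // rows 1 and 3 survive
    println(filterNotIn(rows, Seq(Some(2), None))) // a single NULL empties the result
  }
}
```

Because one NULL anywhere in the set changes the outcome for every row, the whole set has to be available to each task, which matches the behaviour described above.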

Similarly, a join whose condition combines equality checks with OR also results in a BroadcastNestedLoopJoin. For example -

df1.join(df2, $"id1" === $"id2" || $"id2" === $"id3", "left")

rewritten as -

df1.join(df2, $"id1" === $"id2", "left").join(df2, $"id2" === $"id3", "left")
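When trying this rewrite, keep in mind that joining df2 twice makes its column names ambiguous, so each copy needs an alias. The sketch below shows one way this might look. It assumes, purely for illustration, that df1 has columns id1 and id3 and that df2 has columns id2 and value; the names SplitOrJoin and splitOrJoin are hypothetical. The output is also not identical to the OR join: each of the two joins contributes its own set of df2 columns, so downstream code has to pick or merge them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SplitOrJoin {
  // Splits an OR of two equalities into two equi-joins.
  // Assumed (hypothetical) schemas: df1(id1, id3), df2(id2, value).
  def splitOrJoin(df1: DataFrame, df2: DataFrame): DataFrame =
    df1
      .join(df2.as("a"), col("id1") === col("a.id2"), "left") // matches on id1
      .join(df2.as("b"), col("id3") === col("b.id2"), "left") // matches on id3

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("split-or-join")
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val df1 = Seq((1, 7), (2, 8)).toDF("id1", "id3")
    val df2 = Seq((1, "a"), (8, "b")).toDF("id2", "value")

    val rewritten = splitOrJoin(df1, df2)
    rewritten.explain() // each join now has an equality key
    rewritten.show()

    spark.stop()
  }
}
```

It is worth comparing row counts between the original and the rewritten query on real data before relying on the rewrite.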

Reference -

You can read more details here - https://kb.databricks.com/sql/disable-broadcast-when-broadcastnestedloopjoin.html
