We were running an application that kept failing with the following error -
Even after setting the property below, which disables broadcast joins, we kept getting the same error.
Set spark.sql.autoBroadcastJoinThreshold=-1
On further analysis, we found that this is not a bug in Spark. You would expect broadcasting to stop once you disable the broadcast threshold by setting spark.sql.autoBroadcastJoinThreshold to -1, but for some plans, such as a BroadcastNestedLoopJoin, Apache Spark ignores this threshold, still tries to broadcast the bigger table, and fails with a broadcast error.
Note that you can spot "BroadcastNestedLoopJoin" in the Spark Web UI, or by invoking ".explain(true)" on the DataFrame to inspect the physical plan.
Solution -
To fix the problem above, we replaced the "not in" clause in our SQL with a "not exists" clause. For example -
select * from table1 where id not in ( select id from table2 )
rewritten as -
select * from table1 where not exists ( select 1 from table2 where table1.id = table2.id )
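The rewrite can be checked on any SQL engine. Here is a small sketch using Python's built-in sqlite3 module (not Spark; the table names and values are made up for illustration) showing that, when the subquery returns no nulls, both forms produce the same rows:

```python
import sqlite3

# Hypothetical tables mirroring the article's table1/table2 example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (id INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (2), (3);
""")

# Original form: NOT IN with an uncorrelated subquery.
not_in = conn.execute(
    "SELECT * FROM table1 WHERE id NOT IN (SELECT id FROM table2)"
).fetchall()

# Rewritten form: NOT EXISTS with a correlated subquery.
not_exists = conn.execute(
    "SELECT * FROM table1 WHERE NOT EXISTS "
    "(SELECT 1 FROM table2 WHERE table1.id = table2.id)"
).fetchall()

# With no NULLs in table2, both forms return the same rows.
print(not_in)      # [(1,)]
print(not_exists)  # [(1,)]
```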
Explanation -
In SQL, "not in" means that if the subquery returns any null value, no comparison can ever evaluate to true and the result is empty. All of the "not in" values must therefore be known, to ensure there is no null value in the set, before any row can be emitted, and this is why the query can only be executed with a BroadcastNestedLoopJoin.
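This null behavior, which is what makes "not in" and "not exists" differ, can be demonstrated with the same sqlite3 sketch (again illustrative, not Spark), this time with a null in the subquery's values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (id INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (2), (3), (NULL);  -- note the NULL
""")

# NOT IN: "id NOT IN (2, 3, NULL)" evaluates to UNKNOWN for every row,
# because "id <> NULL" is UNKNOWN, so no row is returned at all.
not_in = conn.execute(
    "SELECT * FROM table1 WHERE id NOT IN (SELECT id FROM table2)"
).fetchall()

# NOT EXISTS: the NULL row never satisfies table1.id = table2.id,
# so it is simply ignored and id = 1 is still returned.
not_exists = conn.execute(
    "SELECT * FROM table1 WHERE NOT EXISTS "
    "(SELECT 1 FROM table2 WHERE table1.id = table2.id)"
).fetchall()

print(not_in)      # []
print(not_exists)  # [(1,)]
```

Note that this also means the rewrite is only a drop-in replacement when the subquery's column contains no nulls; with nulls present, the two queries intentionally give different answers.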
Similarly, the join below, whose condition is an OR of two equalities, will result in a BroadcastNestedLoopJoin -
df1.join(df2, $"id1" === $"id2" || $"id2" === $"id3", "left")
rewritten as -
df1.join(df2, $"id1" === $"id2", "left").join(df2, $"id2" === $"id3", "left")
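To see why the OR condition blocks hash-based joins, here is a toy Python sketch (plain Python, not Spark; all names and data are made up): an equi-join can be answered by probing a hash table built on one join key, while an OR of equalities on different keys has no single key to hash on and forces a row-by-row nested loop, which is the access pattern behind BroadcastNestedLoopJoin. Each branch of the OR, taken alone, is an equi-join again:

```python
from collections import defaultdict

def hash_left_join(left, right, left_key, right_key):
    """Equi-join via a hash table on the right side's key: O(n + m)."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    out = []
    for l in left:
        for r in index.get(l[left_key], [None]):  # None = no match (left join)
            out.append((l, r))
    return out

def nested_loop_left_join(left, right, predicate):
    """Arbitrary predicate: every left row tested against every right row, O(n * m)."""
    out = []
    for l in left:
        for r in ([r for r in right if predicate(l, r)] or [None]):
            out.append((l, r))
    return out

df1 = [{"id1": 1, "id2": 10}, {"id1": 2, "id2": 20}]
df2 = [{"id2": 1, "id3": 20}, {"id2": 9, "id3": 99}]

# The OR condition can only be evaluated pairwise: a nested loop is needed.
or_join = nested_loop_left_join(
    df1, df2, lambda l, r: l["id1"] == r["id2"] or l["id2"] == r["id3"]
)

# Each branch of the OR, on its own, is an equi-join and can be hashed.
eq_join_1 = hash_left_join(df1, df2, "id1", "id2")
eq_join_2 = hash_left_join(df1, df2, "id2", "id3")
```

The Scala rewrite above applies the same idea inside Spark: chaining two left equi-joins lets the planner use hash-based join strategies for each step, at the cost of a slightly different output shape (df2's columns appear once per join).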
Reference -
You can read more details here - https://kb.databricks.com/sql/disable-broadcast-when-broadcastnestedloopjoin.html