A non-equi left join is a join between 2 tables in which join condition is given by operators other than equal to "=".
For example -
Equi Join - "Table1 join Table2 on (T1.c1=T2.c1)"
Non-Equi Join - "Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1<T1.c2)"
Hive does not support Non-Equi join.
Spark supports Non-Equi join. But, we have seen that it only works when there is a combination of Equi & Non-Equi columns. For example -
Table1 join Table2 on (T1.c3=T2.c3 and T2.c1 > T1.c1 and T2.c1<T1.c2)
But, if you just have Non-Equi condition then Spark job fails after sometime with "Future Timeout" exception
Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1<T1.c2)
In that scenario, we have to identify additional column on which Equi condition can be specified . It can be a hash value column or something else that matches in Table 1 and Table 2.
For example - We had 2 tables and wanted to check if IP is between start range and end range. On analysis, we found that first 2 colon separated values can be compared. So, we modified our join condition like below and it worked -
regexp_extract(B.CLIENT_IPV6_ADDRESS, '[0-9a-fA-F]+:[0-9a-fA-F]+:', 0) = regexp_extract(C.ip_address, '[0-9a-fA-F]+:[0-9a-fA-F]+:', 0) and C.ip_range_start <= B.CLIENT_IPV6_ADDRESS and C.ip_range_end >= B.CLIENT_IPV6_ADDRESS
For example -
Equi Join - "Table1 join Table2 on (T1.c1=T2.c1)"
Non-Equi Join - "Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1<T1.c2)"
Hive does not support Non-Equi join.
Spark supports Non-Equi join. But, we have seen that it only works when there is a combination of Equi & Non-Equi columns. For example -
Table1 join Table2 on (T1.c3=T2.c3 and T2.c1 > T1.c1 and T2.c1<T1.c2)
But, if you just have Non-Equi condition then Spark job fails after sometime with "Future Timeout" exception
Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1<T1.c2)
In that scenario, we have to identify additional column on which Equi condition can be specified . It can be a hash value column or something else that matches in Table 1 and Table 2.
For example - We had 2 tables and wanted to check if IP is between start range and end range. On analysis, we found that first 2 colon separated values can be compared. So, we modified our join condition like below and it worked -
regexp_extract(B.CLIENT_IPV6_ADDRESS, '[0-9a-fA-F]+:[0-9a-fA-F]+:', 0) = regexp_extract(C.ip_address, '[0-9a-fA-F]+:[0-9a-fA-F]+:', 0) and C.ip_range_start <= B.CLIENT_IPV6_ADDRESS and C.ip_range_end >= B.CLIENT_IPV6_ADDRESS
Comments
Post a Comment