Sometimes data is unevenly distributed, leading to data skew. This means one partition holds far more data than the others because many rows share the same or related keys. For joins and aggregations, all data for the same key must be co-located, so it is processed by a single container/executor. This can slow the application down.

Solution -

If one of the data sets is small, it can be broadcast to every executor, which avoids the shuffle and makes the join more efficient. This is governed by the property spark.sql.autoBroadcastJoinThreshold: Spark broadcasts a table automatically when its estimated size is below this threshold (10 MB by default).

Check whether the join key has too many NULL values. If it does, filter them out before joining, process the records with NULL keys separately, and then union the result with the remaining data set.

Salting - To understand salting, let's first understand the problem with an example.

Table 1
Key
1
1
1

Table 2
Key
1
1

On joining Table 1 with Table 2, since every row has the same key, all the data is shuffled to the same container (one JVM), which will return 3*2 ro...
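The two tables above can be used for a minimal sketch of the salting idea. It is a plain-Python simulation, not Spark code. It assumes the usual form of salting: a random salt is appended to each key on the skewed side, and every row on the other side is replicated once per salt value. The names NUM_SALTS, plain_join and salted_join are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 3  # hypothetical number of salt values, chosen for illustration

table1 = [1, 1, 1]  # keys of Table 1 (the skewed side)
table2 = [1, 1]     # keys of Table 2


def plain_join(left, right):
    """Inner join on the raw key: every matching pair shares one join key."""
    return [(l, r) for l in left for r in right if l == r]


def salted_join(left, right, num_salts, seed=0):
    """Inner join on (key, salt) instead of key alone."""
    rng = random.Random(seed)
    # Skewed side: attach one random salt to each row.
    salted_left = [(key, rng.randrange(num_salts)) for key in left]
    # Other side: replicate each row once per possible salt value,
    # so every salted left row still finds all of its matches.
    salted_right = [(key, salt) for key in right for salt in range(num_salts)]
    joined = [
        (lkey, rkey)
        for (lkey, lsalt) in salted_left
        for (rkey, rsalt) in salted_right
        if (lkey, lsalt) == (rkey, rsalt)
    ]
    # Rows per (key, salt): these are the groups a shuffle could now spread out.
    load = Counter(salted_left)
    return joined, load


plain = plain_join(table1, table2)
salted, load = salted_join(table1, table2, NUM_SALTS)

print(len(plain), len(salted))  # both joins return the same number of rows
print(dict(load))               # left rows per (key, salt) group
```

Both joins produce the same rows, but the salted version joins on (key, salt), so the rows for key 1 can fall into up to NUM_SALTS groups instead of one. The cost is that the non-skewed side grows by a factor of NUM_SALTS, so the salt count is a trade-off rather than a value to maximise.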