Description: The FileAlreadyExistsException error occurs in the following scenarios:
- Failure of a previous task might leave behind leftover output files that trigger the FileAlreadyExistsException error on retry.
- When an executor runs out of memory, its tasks are rescheduled on another executor. The new attempt finds the files written by the failed attempt, and the FileAlreadyExistsException error occurs.
- When any Spark executor fails, Spark retries the task, which might result in a FileAlreadyExistsException error once the maximum number of retries is reached.
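The retry limit mentioned above is controlled by Spark's `spark.task.maxFailures` setting (4 by default). A minimal sketch of setting it explicitly when building the session; the application name is hypothetical, and raising the limit does not remove the leftover files, so retried attempts can still hit the same error:

```scala
import org.apache.spark.sql.SparkSession

// spark.task.maxFailures is the number of task attempts before the job
// is failed (default 4). It bounds the retries described above.
val spark = SparkSession.builder()
  .appName("retry-config-example") // hypothetical app name
  .config("spark.task.maxFailures", "4")
  .getOrCreate()
```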
- In Spark, if the data frame being saved has a different schema than the Hive table, the columns should generally be selected in the table's column order, with the partition columns last (see the sketch at the end of this section).
- In Spark, say your table is partitioned on column A.
- Say you have 2 data frames and you union them.
- Trying to save the result into the table raises the error above. The reason:
  - Say data frame 1 has a record with column A = "valueA".
  - Say data frame 2 also has a record with column A = "valueA".
  - After the union, these records are still in different Spark partitions. So, 2 executors write data to the same table partition, leading to the error above.
- Thus, after the union, perform df.repartition("A"). This makes all rows with the same value of column A land in the same partition, so each table partition is written by a single task (a sketch follows this list).
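A minimal sketch of the fix, assuming a Hive table `my_db.events` partitioned on column A and two hypothetical staging tables; all table and column names here are illustrative, not from the original:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("union-repartition-example") // hypothetical app name
  .enableHiveSupport()
  .getOrCreate()

// Two data frames that both contain rows with A = "valueA".
val df1 = spark.table("my_db.staging_events_1")
val df2 = spark.table("my_db.staging_events_2")

// After the union, rows sharing the same value of A can still sit in
// different Spark partitions, so different executors would write into
// the same Hive partition directory.
val unioned = df1.union(df2)

// Shuffle so that all rows with the same value of A land in the same
// partition; each Hive partition is then written by a single task.
val repartitioned = unioned.repartition(col("A"))

// insertInto matches columns by position, so select them in the
// table's order with the partition column A last (hypothetical columns).
repartitioned
  .select(col("id"), col("payload"), col("A"))
  .write
  .mode("append")
  .insertInto("my_db.events")
```

Note that repartition(col("A")) triggers a shuffle, so this trades one extra exchange for having all rows of a given table partition written from a single place.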