Exception
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://mylake/day_id=20231026/part-00009-94b5fdf9-bb52-4774-8d88-82e9c529f77f-c000.snappy.parquet. Column: [ACCOUNT_ID], Expected: string, Found: FIXED_LEN_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
....
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:536)
....
Cause
The table schema declares ACCOUNT_ID as string, but the Parquet file stores it as a decimal, whose physical type is FIXED_LEN_BYTE_ARRAY. The vectorized Parquet reader decodes that decimal column as a binary value and cannot convert it to the expected string type, which raises SchemaColumnConvertNotSupportedException.
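A quick way to confirm the mismatch is to compare the schema Spark infers from the Parquet file itself with the schema the Hive table declares. This is only a sketch for spark-shell, where spark is the active SparkSession; the table name my_hive_table is a placeholder, and the file path is taken from the error message above.

    // Schema as stored in the Parquet file (ACCOUNT_ID typically shows as decimal(p,s))
    spark.read
      .parquet("hdfs://mylake/day_id=20231026/part-00009-94b5fdf9-bb52-4774-8d88-82e9c529f77f-c000.snappy.parquet")
      .printSchema()

    // Schema as declared by the Hive table (ACCOUNT_ID shows as string)
    spark.table("my_hive_table").printSchema()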
Solution
Either of the solutions below can be tried; a short sketch of both follows the list.
- Read the Parquet files directly from HDFS instead of through the Hive table.
- If your source data contains decimal columns, disable the vectorized Parquet reader by setting spark.sql.parquet.enableVectorizedReader to false in the Spark configuration.
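Both workarounds, sketched in Scala; the path and table name are placeholders based on the error above.

    // 1) Read the Parquet files directly from HDFS, bypassing the Hive table schema.
    val fromHdfs = spark.read.parquet("hdfs://mylake/day_id=20231026")

    // 2) Disable the vectorized Parquet reader, then query the Hive table as usual.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    val fromTable = spark.sql("SELECT ACCOUNT_ID FROM my_hive_table WHERE day_id = 20231026")

The same setting can also be passed at submit time with --conf spark.sql.parquet.enableVectorizedReader=false. Keep in mind that the vectorized reader is a performance optimization, so disabling it can slow down Parquet scans; apply it only to the affected jobs.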