
Install AWS CLI in a Virtual Environment

Create a virtual environment for your project:

    mkdir $HOME/py36venv
    python3 -m venv $HOME/py36venv

Activate the Python 3.6 virtual environment:

    source $HOME/py36venv/bin/activate

Install the AWS command line tools and verify the install:

    pip install awscli
    chmod 755 $HOME/py36venv/bin/aws
    aws --version

Configure your credentials:

    aws configure
    AWS Access Key ID [None]: ----------------------
    AWS Secret Access Key [None]: ----+----+---------------
    Default region name [None]: us-east-2
    Default output format [None]:

Try a few S3 commands:

    aws s3 ls
    aws s3 sync local_dir/ s3://my-s3-bucket
    aws s3 sync s3://my-s3-bucket local_dir/
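If you also want to reach S3 from Python inside the same virtual environment, here is a minimal sanity-check sketch. It assumes an extra pip install boto3 in the venv; boto3 simply reads the credentials and region that aws configure wrote above.

    # Minimal sanity check, assuming `pip install boto3` in the same venv.
    # boto3 picks up the credentials/region written by `aws configure`.
    import boto3

    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])

If this prints the same buckets as aws s3 ls, the environment is set up correctly.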

spark.sql.utils.AnalysisException: cannot resolve 'INPUT__FILE__NAME'

I have a Hive SQL query:

    select regexp_extract(`unenriched`.`input__file__name`, '[^/]*$', 0) `SRC_FILE_NM` from dl.table1;

This query fails when run with Spark:

    spark.sql.utils.AnalysisException: u"cannot resolve 'INPUT__FILE__NAME' given input columns: ..."

Analysis - INPUT__FILE__NAME is a Hive-specific virtual column, and it is not supported in Spark.

Solution - Spark provides the input_file_name() function, which works in a similar way:

    SELECT input_file_name() FROM df

It requires Spark 2.0 or later.
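Equivalently, from the DataFrame API, a minimal PySpark sketch of the same extraction (Spark 2.0+; the table name is taken from the query above, and the SparkSession setup is assumed):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name, regexp_extract

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # input_file_name() is Spark's replacement for Hive's INPUT__FILE__NAME
    # virtual column; regexp_extract keeps everything after the last '/'
    # (i.e. the bare file name).
    spark.table("dl.table1") \
        .select(regexp_extract(input_file_name(), "[^/]*$", 0).alias("SRC_FILE_NM")) \
        .show(truncate=False)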

Spark / Hive - org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text

Exception -

    java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
        at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:547)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)

Analysis - This exception may occur because the underlying ORC file has a column of type DOUBLE, whereas the Hive table declares that column as STRING.

Solution - Correct the data type so the Hive table definition matches the data.
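One way to locate the offending column is to compare the metastore schema with the schema Spark infers from the ORC files directly. A rough PySpark sketch; the table name and warehouse path here are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Column types according to the Hive metastore (hypothetical table name):
    spark.table("dl.orc_table").printSchema()

    # Column types actually stored in the ORC files backing the table
    # (hypothetical warehouse path):
    spark.read.orc("/warehouse/dl.db/orc_table").printSchema()

Any column that shows up as double in the files but string in the table definition is a candidate; fix the Hive DDL for that column accordingly.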

Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary

Exception -

    Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
        at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
        at org.apache.spark.sql.execution.vectorized.ColumnVector.getUTF8String(ColumnVector.java:645)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)

Analysis - This can occur because of a data type mismatch between the Hive table definition and the Parquet files it reads: here the table declares a string column while the Parquet files encode integers.

Solution - Correct the data type so the Hive table and the Parquet schema match.
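As in the ORC case above, comparing the two schemas usually pinpoints the column. A rough PySpark sketch with hypothetical table and path names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # What the Hive metastore declares (hypothetical table name):
    spark.sql("DESCRIBE dl.parquet_table").show(truncate=False)

    # What the Parquet files actually contain (hypothetical path):
    spark.read.parquet("/warehouse/dl.db/parquet_table").printSchema()

    # Until the DDL is fixed, disabling the vectorized Parquet reader is
    # sometimes used as a temporary workaround for this particular error:
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")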