Posts

Copy a Phoenix (HBase) Table with a Different Name

This can be done in two steps. First, create a new Phoenix table with a different name but the same schema as the existing table. Then run the HBase command below, which launches a MapReduce job to copy the data:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name="<Name of new HBase Table>" "<Name of existing HBase Table>"
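As a sketch, assuming an existing Phoenix table MY_TABLE and a new name MY_TABLE_COPY (the table names, ZooKeeper quorum, and column list are placeholders, not from the original post), the two steps might look like:

```shell
# Step 1: create the new Phoenix table with the same schema as the existing one
# (MY_TABLE_COPY and its columns are placeholders -- match your real schema)
sqlline.py zkhost:2181 <<'EOF'
CREATE TABLE MY_TABLE_COPY (ID VARCHAR PRIMARY KEY, VAL VARCHAR);
EOF

# Step 2: copy the data with the HBase CopyTable MapReduce job
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name="MY_TABLE_COPY" "MY_TABLE"
```

Both commands assume a running HBase/Phoenix cluster and must be executed from a node with the HBase client configured.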

Access AWS S3 or HCP HS3 (Hitachi) using Hadoop or HDFS or Distcp

Create a credentials file for the S3 keys:

hadoop credential create fs.s3a.access.key -value <Access_KEY> -provider localjceks://file/$HOME/aws-dev-keys.jceks
hadoop credential create fs.s3a.secret.key -value <Secret_KEY> -provider localjceks://file/$HOME/aws-dev-keys.jceks

Where:
<Access_KEY> - S3 access key
<Secret_KEY> - S3 secret key

Note - this creates a file named aws-dev-keys.jceks in the home directory on the local file system. Put this file in HDFS for distributed access.

To list the stored credentials, execute the command below:

hadoop credential list -provider localjceks://file/$HOME/aws-dev-keys.jceks

List files in an S3 bucket with the Hadoop shell:

hdfs dfs -Dhadoop.security.credential.provider.path=jceks://hdfs/myfilelocation/aws-dev-keys.jceks -ls s3a://s3bucketname/

Or pass the keys directly on the command line:

hdfs dfs -Dfs.s3a.access.key=<Access_KEY> -Dfs.s3a.secret.key=<Secret_KEY> -ls s3a://aa-daas-ookla/

Note - Similarly, other hadoop/ ...
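The post title also mentions Distcp; a sketch of copying an HDFS directory into S3 with the same jceks credential provider (the source path, destination bucket, and jceks location are placeholders):

```shell
# Copy an HDFS directory into S3 using the credentials stored in the jceks file
# (paths and bucket name are placeholders)
hadoop distcp \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/myfilelocation/aws-dev-keys.jceks \
  hdfs:///data/mydir \
  s3a://s3bucketname/mydir
```

Like the hdfs dfs examples above, this requires a Hadoop client with the s3a connector on its classpath.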

Install AWS Cli in a Virtual Environment

Create a virtual environment for your project:

mkdir $HOME/py36venv
python3 -m venv $HOME/py36venv

Activate the 3.6 virtual environment:

source $HOME/py36venv/bin/activate

Install the AWS command line:

pip install awscli
chmod 755 $HOME/py36venv/bin/aws
aws --version

Configure it:

aws configure
AWS Access Key ID [None]: ----------------------
AWS Secret Access Key [None]: ----+----+---------------
Default region name [None]: us-east-2
Default output format [None]:

Example usage:

aws s3 ls
aws s3 sync local_dir/ s3://my-s3-bucket
aws s3 sync s3://my-s3-bucket local_dir/

spark.sql.utils.AnalysisException: cannot resolve 'INPUT__FILE__NAME'

I have a Hive SQL query:

select regexp_extract(`unenriched`.`input__file__name`,'[^/]*$',0) `SRC_FILE_NM` from dl.table1;

This query fails when run with Spark:

spark.sql.utils.AnalysisException: u"cannot resolve 'INPUT__FILE__NAME' given input columns:

Analysis - INPUT__FILE__NAME is a Hive-specific virtual column and is not supported in Spark.

Solution - Spark provides the input_file_name() function, which works in a similar way:

SELECT input_file_name() FROM df

It requires Spark 2.0 or later to work correctly.
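The pattern '[^/]*$' simply grabs everything after the last slash, i.e. the file name. A quick local check of the same regex with grep (the sample path is made up):

```shell
# Extract the trailing path component, as regexp_extract(..., '[^/]*$', 0) does
echo 'hdfs://nn/dl/table1/part-00000.orc' | grep -o '[^/]*$'
# prints: part-00000.orc
```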

Spark / Hive - org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text

Exception:

java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
    at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:547)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)

This exception occurs when the underlying ORC file has a column of type double while the Hive table declares that column as string. Correcting the data type in the table definition rectifies the error.
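A sketch of the fix, assuming a hypothetical table mydb.mytable whose column amount was declared string while the ORC data is actually double (table and column names are placeholders, not from the original post):

```shell
# Align the Hive column type with the ORC file's actual type
# (mydb.mytable and amount are placeholders)
hive -e "ALTER TABLE mydb.mytable CHANGE COLUMN amount amount DOUBLE;"
```

ALTER TABLE ... CHANGE only updates the table metadata; the ORC data itself is untouched, which is exactly what is needed here since the data is already double.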