Posts

Spark 2 - DataFrame.withColumn() takes time - Performance Optimization

We wrote a program that iterates over a list of column names and adds each one to a Spark DataFrame. We already had a table with 364 columns and wanted a final DataFrame with 864 columns, so we wrote something like this:

    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.StringType

    existingColumns.filter(x => !colList.contains(x)).foreach { x =>
      stagingDF = stagingDF.withColumn(x, lit(null).cast(StringType))
    }

We realized this made the program take almost 20 minutes to add 500 columns to the DataFrame. On analysis, we found that Spark updates the Projection after every withColumn() invocation. So, instead of updating the DataFrame one column at a time, we appended the new columns to a StringBuilder, built a single SELECT statement out of it, and executed that against the DataFrame:

    val comma = ","
    val select = "select "
    val tmptable = "tmptable"
    val from = " from "
    val strbuild = new StringBuilder()
    val col = " cast(null as string) as "
    val colStr = colList.mkString(comma)
    strbui...
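The excerpt cuts off above; a minimal sketch of where that approach leads, assuming stagingDF is the staging DataFrame, colList holds its current columns, and a SparkSession named spark is in scope (this is a hypothetical completion, not the author's exact code):

    // Register the DataFrame once, build a single SELECT that keeps the
    // existing columns and adds every missing one as a null string, then run
    // it in one pass so the projection is rebuilt only once.
    stagingDF.createOrReplaceTempView(tmptable)
    strbuild.append(select).append(colStr)
    existingColumns.filter(x => !colList.contains(x)).foreach { x =>
      strbuild.append(comma).append(col).append(x)  // ", cast(null as string) as <col>"
    }
    strbuild.append(from).append(tmptable)
    val finalDF = spark.sql(strbuild.toString)

With this shape, the analyzer resolves one logical plan instead of one plan per added column.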

Hive SQL: Multi column explode, Creating Map from Array & Rows to Column

Say you have a table like below:

    id   attributes           value
    A    [K1,K2,K3]           [V1,V2,V3]
    B    [K2]                 [V2]
    C    [K1,K2,K3,K4,K5]     [V1,V2,V3,V4,V5]

And you want a final table like below:

    id   K1   K2   K3   K4   K5
    A    V1   V2   V3
    B         V2
    C    V1   V2   V3   V4   V5

It can be done with SQL:

    select id,
           arrMap['K1'] K1,
           arrMap['K2'] K2,
           arrMap['K3'] K3,
           arrMap['K4'] K4,
           arrMap['K5'] K5
    from (select id,
                 str_to_map(concat_ws(',', collect_list(kv))) arrMap
          from (select id, n.pos,
                       concat_ws(':', n.attribute, value[n.pos]) kv
                from INPUT_TABLE
                lateral view posexplode(attributes) n as pos, attribute) T
          group by id) T1;

Explanation: 1) The inner query explodes multiple ...
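For illustration (not part of the original post), this is what the sample row with id = A works out to as it moves through the nested queries above: the inner query pairs each exploded attribute with the value at the same position, and the middle query collects those pairs and converts them into a map.

    id=A, pos=0, kv='K1:V1'
    id=A, pos=1, kv='K2:V2'
    id=A, pos=2, kv='K3:V3'

    collect_list + concat_ws(',')  ->  'K1:V1,K2:V2,K3:V3'
    str_to_map(...)                ->  {'K1':'V1', 'K2':'V2', 'K3':'V3'}

The outer query then reads each key out of that map, so missing keys (such as K4 and K5 for id A) simply come back as NULL columns.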

Spark: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text

Reason - This occurs because the underlying Parquet or ORC file has a column of double type, whereas Hive reads it as a string.

Exception -

    Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
        at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:547)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)

Solution - One way is to correct the schema: either update the Hiv...
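The solution is truncated above; as a hedged sketch of the "update the Hive schema" route (table and column names here are hypothetical, and this may not be the author's exact fix), the Hive-side change would be DDL along these lines, so the declared column type matches the doubles actually stored in the files:

    ALTER TABLE mydb.mytable CHANGE COLUMN amount amount DOUBLE;

This only rewrites metastore metadata, not the data files, so once the declared type agrees with the Parquet/ORC data the reader no longer tries to unwrap a DoubleWritable as Text.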

Error in Installing Zeppelin Interpreter

Error -

    Caused by: org.sonatype.aether.transfer.ArtifactNotFoundException: Could not find artifact org.apache.zeppelin:zeppelin-shell:jar:0.8.0

Solution -

    export JAVA_TOOL_OPTIONS="-Dzeppelin.interpreter.dep.mvnRepo=http://insecure.repo1.maven.org/maven2/"

Then install the interpreter -

    /usr/hdp/current/zeppelin-server/bin/install-interpreter.sh --name shell