
Posts

Showing posts from August, 2020

Kylin - Building Cube using MR Job throws java.lang.ArrayIndexOutOfBoundsException

Exception -

    Caused by: java.lang.ArrayIndexOutOfBoundsException
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349)
        at java.io.DataOutputStream.writeInt(DataOutputStream.java:197)
        at org.apache.hadoop.io.BytesWritable.write(BytesWritable.java:188)

Solution - Set kylin.engine.mr.config-override.mapreduce.task.io.sort.mb to 1024.
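If the override is placed in kylin.properties, the entry would look roughly like the sketch below; the 1024 (MB) value comes from the solution above, and Kylin forwards kylin.engine.mr.config-override.* properties to the cube-build MapReduce job:

    # kylin.properties - enlarge the MapReduce map-output sort buffer for cube builds
    kylin.engine.mr.config-override.mapreduce.task.io.sort.mb=1024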

Spark 2 Application Errors & Solutions

Exception - Exception in thread "broadcast-exchange-0" java.lang.OutOfMemoryError: Not enough memory to build and broadcast
This is a driver-side exception. It can be solved by setting spark.sql.autoBroadcastJoinThreshold to -1, or by increasing --driver-memory.

Exception - Container is running beyond physical memory limits. Current usage: X GB of Y GB physical memory used; X GB of Y GB virtual memory used. Killing container
YARN killed the container because it exceeded its memory limits. Increase --driver-memory and/or --executor-memory.

Exception - ERROR Executor: Exception in task 600 in stage X.X (TID 12345) java.lang.OutOfMemoryError: GC overhead limit exceeded
This means the executor JVM was spending more time in garbage collection than in actual execution. Options:
- Disable this JVM check by adding -XX:-UseGCOverheadLimit
- Increase executor memory with --executor-memory
- Make the data more evenly distributed so it is not skewed onto one executor
- Try the parallel collector, -XX:+UseParallelGC, or -XX…

A spark-submit sketch pulling these settings together is shown below.
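For reference, a minimal spark-submit sketch with the settings mentioned above; the memory sizes, class name and jar name are placeholders to adapt to your job:

    # Placeholders: com.example.MyApp, myapp.jar and the 8g sizes are examples only.
    spark-submit \
      --class com.example.MyApp \
      --driver-memory 8g \
      --executor-memory 8g \
      --conf spark.sql.autoBroadcastJoinThreshold=-1 \
      --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" \
      myapp.jar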

Spark 2 - DataFrame.withColumn() takes time - Performance Optimization

We wrote a program that iterates and adds columns to a Spark dataframe. We already had a table with 364 columns and wanted a final dataframe with 864 columns. So we wrote a loop like the one below -

    existingColumns.filter(x => !colList.contains(x)).foreach { x =>
      stagingDF = stagingDF.withColumn(x, lit(null).cast(StringType))
    }

We realized this caused the program to take almost 20 minutes to add 500 columns to the dataframe. On analysis, we found that Spark updates the projection (and rebuilds the query plan) after every withColumn() invocation. So, instead of updating the dataframe column by column, we created a StringBuilder, appended the columns to it, and finally built a single Select SQL out of it and executed that on the dataframe:

    val comma    = ","
    val select   = "select "
    val tmptable = "tmptable"
    val from     = " from "
    val strbuild = new StringBuilder()
    val col      = " cast(null as string) as "
    val colStr   = colList.mkString(comma)
    ...
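Putting the pieces above together, a minimal sketch of the single-select approach (stagingDF, colList and the tmptable view name are taken from the snippet above; the helper name addNullColumns is ours):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Build one SELECT that keeps every existing column and adds each missing
    // column as "cast(null as string)", instead of calling withColumn() per column.
    def addNullColumns(spark: SparkSession, stagingDF: DataFrame, colList: Seq[String]): DataFrame = {
      stagingDF.createOrReplaceTempView("tmptable")
      val existing = stagingDF.columns.toSeq
      val missing  = colList.filterNot(existing.contains)
      val selectExprs =
        existing.map(c => s"`$c`") ++ missing.map(c => s"cast(null as string) as `$c`")
      // Single projection, so the query plan is built only once
      spark.sql(selectExprs.mkString("select ", ", ", " from tmptable"))
    }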

Hive SQL: Multi-column explode, Creating a Map from Arrays & Rows to Columns

Say you have a table like below -

    id   attributes               value
    A    [K1, K2, K3]             [V1, V2, V3]
    B    [K2]                     [V2]
    C    [K1, K2, K3, K4, K5]     [V1, V2, V3, V4, V5]

And you want a final table like below:

    id   K1   K2   K3   K4   K5
    A    V1   V2   V3
    B         V2
    C    V1   V2   V3   V4   V5

It can be done with this SQL -

    select id,
           arrMap['K1'] K1,
           arrMap['K2'] K2,
           arrMap['K3'] K3,
           arrMap['K4'] K4,
           arrMap['K5'] K5
    from (select id,
                 str_to_map(concat_ws(',', collect_list(kv))) arrMap
          from (select id,
                       n.pos,
                       concat_ws(':', n.attribute, value[n.pos]) kv
                from INPUT_TABLE
                lateral view posexplode(attributes) n as pos, attribute) T
          group by id) T1;

Explanation:
1) The inner query explodes multiple columns together: posexplode() turns the attributes array into (pos, attribute) rows, and value[n.pos] picks the value at the same position, producing one "key:value" string per element (illustrated below).
2) The middle query groups by id and builds a map per row using collect_list(), concat_ws() and str_to_map().
3) The outer query turns rows into columns by looking up each key (K1..K5) in that map.
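To make step 1 concrete, here is the inner query on its own, with (assumed) sample output for id 'A' shown as comments; table and column names are the same as above:

    -- Inner query only
    select id, n.pos, concat_ws(':', n.attribute, value[n.pos]) kv
    from INPUT_TABLE
    lateral view posexplode(attributes) n as pos, attribute;

    -- For id = 'A' with attributes = [K1, K2, K3] and value = [V1, V2, V3]:
    --   A   0   K1:V1
    --   A   1   K2:V2
    --   A   2   K3:V3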

Spark: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text

Reason - This occurs because the underlying Parquet or ORC file has a column of double type, whereas the Hive table schema declares it as string.

Exception -

    Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
        at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:547)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)

Solution - One way is to correct the schema: either update the Hive table definition so the column type matches the file (double), or rewrite the data so it matches the declared Hive type.
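A quick way to confirm the mismatch is to compare the file schema with the metastore schema; this assumes a SparkSession named spark (as in spark-shell), and the path, database, table and column names are placeholders:

    // Read the data files directly - printSchema() shows the physical type (double)
    spark.read.parquet("/path/to/the/table/files").printSchema()

    // Read through the Hive metastore - the same column shows up as string
    spark.table("mydb.mytable").printSchema()

    // The fix is then applied on the Hive side, for example (placeholder column name):
    //   ALTER TABLE mydb.mytable CHANGE COLUMN amount amount DOUBLE;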

Error Installing Zeppelin Interpreter

Error -

    Caused by: org.sonatype.aether.transfer.ArtifactNotFoundException: Could not find artifact org.apache.zeppelin:zeppelin-shell:jar:0.8.0

Solution - Set the interpreter dependency repository via JAVA_TOOL_OPTIONS:

    export JAVA_TOOL_OPTIONS="-Dzeppelin.interpreter.dep.mvnRepo=http://insecure.repo1.maven.org/maven2/"

Then run the interpreter installer -

    /usr/hdp/current/zeppelin-server/bin/install-interpreter.sh --name shell