
Posts

Kylin - Building Cube using MR Job throws java.lang.ArrayIndexOutOfBoundsException

Caused by: java.lang.ArrayIndexOutOfBoundsException
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349)
    at java.io.DataOutputStream.writeInt(DataOutputStream.java:197)
    at org.apache.hadoop.io.BytesWritable.write(BytesWritable.java:188)

Solution - Set kylin.engine.mr.config-override.mapreduce.task.io.sort.mb to 1024.
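A minimal sketch of where that override lives, assuming it is set globally in kylin.properties (it can equally be added per cube under the cube's Configuration Overrides); the value is the one suggested above:

    kylin.engine.mr.config-override.mapreduce.task.io.sort.mb=1024

Kylin strips the kylin.engine.mr.config-override. prefix and passes the remainder to the MapReduce job, enlarging the map-side sort buffer that overflows in the trace above.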

Spark 2 Application Errors & Solutions

Exception - Exception in thread "broadcast-exchange-0" java.lang.OutOfMemoryError: Not enough memory to build and broadcast
This is a driver-side exception and can be solved by setting spark.sql.autoBroadcastJoinThreshold to -1, or by increasing --driver-memory.

Exception - Container is running beyond physical memory limits. Current usage: X GB of Y GB physical memory used; X GB of Y GB virtual memory used. Killing container
YARN killed the container because it exceeded its memory limits. Increase --driver-memory / --executor-memory.

Exception - ERROR Executor: Exception in task 600 in stage X.X (TID 12345) java.lang.OutOfMemoryError: GC overhead limit exceeded
This means the executor JVM was spending more time in garbage collection than in actual execution. This JVM check can be disabled by adding -XX:-UseGCOverheadLimit. Increasing executor memory may also help (--executor-memory), as may distributing the data more evenly so it is not skewed onto one executor. Might use parallel GC -...
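A rough sketch (not from the post) of wiring the session-level settings above into a Spark 2 job, assuming the job builds its own SparkSession; the application name and memory values are placeholders, and --driver-memory still has to go on the spark-submit command line because the driver JVM is already running by the time this code executes:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("example-app")                                               // placeholder name
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")                 // disable automatic broadcast joins
      .config("spark.executor.memory", "8g")                                // same effect as --executor-memory 8g
      .config("spark.executor.extraJavaOptions", "-XX:-UseGCOverheadLimit") // turn off the GC overhead limit check
      .getOrCreate()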

Spark 2 - DataFrame.withColumn() takes time - Performance Optimization

We wrote a program that iterates and adds columns to a Spark dataframe. We already had a table with 364 columns and wanted a final dataframe with 864 columns. Thus, we wrote a program like below -

    existingColumns.filter(x => !colList.contains(x)).foreach { x =>
      {
        stagingDF = stagingDF.withColumn(x, lit(null).cast(StringType))
      }
    }

We realized that this caused the program to take almost 20 minutes to add 500 columns to the dataframe. On analysis, we found that Spark updates the projection after every withColumn() invocation. So, instead of updating the dataframe column by column, we created a StringBuilder, appended the columns to it, and finally built a select SQL out of it and executed that on the dataframe (a condensed sketch follows the excerpt):

    val comma = ","
    val select = "select "
    val tmptable = "tmptable"
    val from = " from "
    val strbuild = new StringBuilder()
    val col = ...
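A minimal sketch of the single-projection idea (not the original program; the dataframe and the column lists existingColumns and colList are assumptions carried over from above), using selectExpr instead of registering a temp table, which builds the same one-shot select:

    import org.apache.spark.sql.DataFrame

    // Add every column from colList that the dataframe does not already have,
    // as a typed null, in a single projection instead of one withColumn per column.
    def addMissingColumns(df: DataFrame, existingColumns: Seq[String], colList: Seq[String]): DataFrame = {
      val missing = colList.filterNot(existingColumns.contains)
      val projection = existingColumns ++ missing.map(c => s"cast(null as string) as `$c`")
      df.selectExpr(projection: _*)
    }

With a single select, the projection is resolved once instead of once per added column.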

Hive SQL: Multi-column explode, Creating Map from Array & Rows to Columns

Say you have a table like below (an id, an array of keys, and an array of values):

    id    attributes              value
    A     [K1, K2, K3]            [V1, V2, V3]
    B     [K2]                    [V2]
    C     [K1, K2, K3, K4, K5]    [V1, V2, V3, V4, V5]

And you want a final table like below:

    id    K1    K2    K3    K4    K5
    A     V1    V2    V3
    B           V2
    C     V1    V2    V3    V4    V5

It can be done with SQL -

    select id,
           arrMap['K1'] K1,
           arrMap['K2'] K2,
           arrMap['K3'] K3,
           arrMap['K4'] K4,
           arrMap['K5'] K5
    from (select id,
                 str_to_map(concat_ws(',', collect_list(kv))) arrMap
          from (select id,
                       n.pos,
                       concat_ws(':', n.attribute, value[n.pos]) kv
                from INPUT_TABLE
                lateral view posexplode(attributes) n as pos, attribute) T
          group by id) T1;

Explanation: 1) Inner query explodes mul...
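For reference (derived from the sample rows above, not part of the original post), the innermost query's posexplode produces one kv row per key, which the middle query then folds back into one map per id via collect_list and str_to_map:

    id    pos    kv
    A     0      K1:V1
    A     1      K2:V2
    A     2      K3:V3
    B     0      K2:V2
    C     0      K1:V1
    ...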