
Spark - java.lang.AssertionError: assertion failed

Spark SQL fails to read data from an ORC Hive table after a new column has been added to it, throwing:

    java.lang.AssertionError: assertion failed
        at scala.Predef$.assert(Predef.scala:165)
        at org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
        at org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
        at scala.Option.map(Option.scala:145)

This happens when the following property is set:

    spark.sql.hive.convertMetastoreOrc=true

Solution - Comment out the property if it is being set explicitly, or set it to false. Refer to https://issues.apache.org/jira/browse/SPARK-18355. A session-level sketch follows below.
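As a minimal sketch (assuming the property is settable at runtime in your Spark version; the app name and table name db.orc_table are hypothetical placeholders), the conversion can be disabled on the current SparkSession before reading the table:

    // Disable Hive-to-data-source ORC conversion so Spark reads the table
    // through the Hive SerDe path, avoiding the LogicalRelation assertion.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-schema-evolution-workaround") // hypothetical app name
      .enableHiveSupport()
      .getOrCreate()

    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

    // "db.orc_table" is a placeholder for the ORC table with the added column.
    spark.sql("SELECT * FROM db.orc_table").show()

Equivalently, pass --conf spark.sql.hive.convertMetastoreOrc=false to spark-submit so the setting applies from application start.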

org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;

Caused by: org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
    at org.apache.spark.sql.execution.command.DDLUtils$.verifyNotReadPath(ddl.scala:906)
    at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:134)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
    at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:134)
    at org.apache.spark.sql.execution.datasource
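Spark raises this check at analysis time when a write in overwrite mode targets the same path (or table location) the plan is reading from. A minimal sketch of the failing pattern and one common workaround, staging the output elsewhere and then swapping directories (all paths here are hypothetical):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Failing pattern: the overwrite targets the path the query reads from.
    val df = spark.read.parquet("/data/events")            // hypothetical input path
    // df.write.mode("overwrite").parquet("/data/events")  // -> this AnalysisException

    // Workaround sketch: write to a staging path, then swap directories.
    val src     = new Path("/data/events")
    val staging = new Path("/data/events_staging")         // hypothetical staging path

    df.write.mode("overwrite").parquet(staging.toString)

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.delete(src, true)     // recursively remove the old data
    fs.rename(staging, src)  // promote the staged output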

Kylin - Building a Cube with an MR Job throws java.lang.ArrayIndexOutOfBoundsException

Caused by: java.lang.ArrayIndexOutOfBoundsException
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349)
    at java.io.DataOutputStream.writeInt(DataOutputStream.java:197)
    at org.apache.hadoop.io.BytesWritable.write(BytesWritable.java:188)

Solution - Set kylin.engine.mr.config-override.mapreduce.task.io.sort.mb to 1024.
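As a sketch, the override goes into kylin.properties; Kylin passes keys under the kylin.engine.mr.config-override. prefix through to the cube-building MapReduce jobs (1024 is the value suggested above):

    # kylin.properties
    # Enlarge the map-side sort buffer for cube-build MR jobs; the
    # ArrayIndexOutOfBoundsException above comes from the map output buffer.
    kylin.engine.mr.config-override.mapreduce.task.io.sort.mb=1024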

Spark 2 Application Errors & Solutions

Exception - Exception in thread "broadcast-exchange-0" java.lang.OutOfMemoryError: Not enough memory to build and broadcast

This is a driver-side exception. It can be solved by setting spark.sql.autoBroadcastJoinThreshold to -1 (which disables broadcast joins) or by increasing --driver-memory.

Exception - Container is running beyond physical memory limits. Current usage: X GB of Y GB physical memory used; X GB of Y GB virtual memory used. Killing container

YARN killed the container because it exceeded its memory limits. Increase --driver-memory and/or --executor-memory.

Exception - ERROR Executor: Exception in task 600 in stage X.X (TID 12345) java.lang.OutOfMemoryError: GC overhead limit exceeded

This means the executor JVM was spending more time on garbage collection than on actual execution. Possible remedies, combined in the sketch after this list:

- Disable this JVM check by adding -XX:-UseGCOverheadLimit.
- Increase executor memory with --executor-memory.
- Make the data more evenly distributed so it is not skewed onto one executor.
- Switch to the parallel collector with -XX:+UseParallelGC, or try another garbage collector.
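For illustration only, a spark-submit sketch that combines the settings above; the main class, jar, and memory sizes are hypothetical placeholders to be tuned per cluster:

    # Hypothetical spark-submit combining the fixes above:
    # larger driver/executor memory, broadcast joins disabled,
    # and the parallel collector on executors.
    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --driver-memory 8g \
      --executor-memory 8g \
      --conf spark.sql.autoBroadcastJoinThreshold=-1 \
      --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" \
      my-app.jar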