Posts

PrestoDB (Trino) SQL Error - ORC ACID file should have 6 columns, found Y

We faced this error while querying a Hive table using Trino:

    SQL Error [16777223]: Query failed (#20230505_155701_00194_n2vdp):
    ORC ACID file should have 6 columns, found 17

This was happening because the table being queried was a Hive Managed (Internal) Table, which by default in the CDP (Cloudera) distribution is ACID compliant.

Now, in order for a Hive table to be ACID compliant, the underlying file format should be ORC, and there were a few changes to the ORC file structure: the root column should be a struct with 6 nested columns (which enclose the data and the type of operation). Something like below:

    struct<
        operation: int,
        originalTransaction: bigInt,
        bucket: int,
        rowId: bigInt,
        currentTransaction: bigInt,
        row: struct<...>
    >

For more ORC ACID internals, please take a look here: https://orc.apache.org/docs/acid.html

Now, the problem in our case was that though the Hive table was declared Internal…
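The excerpt cuts off here, but two standard Hive commands are generally useful in this situation (a sketch on my part, not from the original post; the JDBC URL and table name are hypothetical): DESCRIBE FORMATTED shows whether the table really is transactional, and a major compaction rewrites existing files into the ACID layout shown above.

    # Sketch, not from the original post; URL and table name are placeholders.
    # Check the table properties for transactional=true:
    beeline -u "jdbc:hive2://hiveserver:10000" \
        -e "DESCRIBE FORMATTED mydb.mytable;"

    # Rewrite the table's files into the 6-column ACID ORC layout:
    beeline -u "jdbc:hive2://hiveserver:10000" \
        -e "ALTER TABLE mydb.mytable COMPACT 'major';"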

Spark Exception - java.lang.NullPointerException

java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnVector.getUTF8String(OrcColumnVector.java:167)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage46.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at …
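The top frame sits in Spark's vectorized ORC reader, so a common workaround (an assumption here, since the excerpt is truncated before any fix) is to fall back to the non-vectorized Hive ORC reader and see whether the NullPointerException disappears. The job file name is a placeholder.

    # Sketch: disable the vectorized ORC reader that appears in the
    # stack trace above. my_job.py is a placeholder.
    spark-submit \
        --conf spark.sql.orc.impl=hive \
        --conf spark.sql.orc.enableVectorizedReader=false \
        my_job.py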

Fixing hbck Inconsistencies

Execute 'hbck_chore_run' in the hbase shell to generate a new sub-report.

Hole issue:
- Verify whether the region exists in both HDFS and meta.
- If not in HDFS, it is data loss, or it was already cleared by cleaner_chore.
- If not in meta, we can use the hbck2 jar's reportMissingInMeta option to find the missing records in meta (see the sketch after this list).
- Then use the addFsRegionsInMeta option to add the missing records back to meta.
- Then restart the Active Master, which then assigns those regions.

Orphan Regions: Refer https://community.cloudera.com/t5/Support-Questions/Hbase-Orphan-Regions-on-Filesystem-shows-967-regions-in-set/td-p/307959
- Do an "ls" on the region directory: seeing "recovered.edits" but no HFile means that the region was splitting and the split failed.
- Replay the edits using WALPlayer:

    hbase org.apache.hadoop.hbase.mapreduce.WALPlayer hdfs://bdsnameservice/hbase/data/Namespace/Table/57ed0b774aef9158cfda87c945a0afae/recovered.edits/0000000000001738473 Namespace:Table

- Move the orphan region to some temporary location and clean up
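For reference, a sketch of the hbck2 invocations those bullets describe. The jar path and Namespace:Table are placeholders, and in current HBCK2 releases the options are spelled reportMissingRegionsInMeta and addFsRegionsMissingInMeta, so verify the names against the version you run.

    # Sketch; jar path and table name are placeholders.
    hbase hbck -j /path/to/hbase-hbck2.jar reportMissingRegionsInMeta Namespace:Table
    hbase hbck -j /path/to/hbase-hbck2.jar addFsRegionsMissingInMeta Namespace:Table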

HBase Utility - Merging Regions in HBase

A growing HBase cluster, and the difficulty of getting physical hardware, is something every enterprise deals with... and sometimes the amount of ingest starts putting pressure on components before new hosts can be added. As I write this post, our cluster was running with 46 nodes, each node hosting 600 regions. This is bad for performance. You can use the following formula to estimate the number of regions a RegionServer can support (a worked example follows below):

    (regionserver_memory_size) * (memstore_fraction) / ((memstore_size) * (num_column_families))

Reference: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.0/hbase-data-access/content/memstore-size-regionservers.html

We noticed that HBase doesn't have an automatic process to reduce and merge regions, so over time many small or empty regions form on the cluster, which degrades performance. While researching how to cope with this problem, we came across the following scripts:

https://appsintheopen.com/posts/51-merge-empty-hbase-regions
https…
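To make the formula concrete: with a 16 GB RegionServer heap, the default 0.4 memstore fraction, a 128 MB memstore flush size, and one column family, 16384 * 0.4 / (128 * 1) ≈ 51 regions per server, far below the 600 we were running. The merges such scripts perform ultimately come down to the hbase shell's merge_region command; the encoded region names below are hypothetical.

    # Sketch of the underlying operation; the encoded region names are
    # placeholders (find real ones in the HBase UI or hbase:meta).
    echo "merge_region 'd1f0dd838aa35b0c02a6e0d762c29be9', '94b3d1e18afbf2d12b8dba6d487d8e45'" \
        | hbase shell -n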

Refresh Multiple lines or Single line in Linux Shell

The code below gives you an example of refreshing multiple lines using tput:

    while :; do
        echo "$RANDOM"
        echo "$RANDOM"
        echo "$RANDOM"
        sleep 0.2
        tput cuu1  # move cursor up by one line
        tput el    # clear the line
        tput cuu1
        tput el
        tput cuu1
        tput el
    done

The code below gives you an example of refreshing (reprinting) the same line on STDOUT:

    while true; do echo -ne "`date`\r"; done
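A variant not in the original post (an assumption on my part): tput can also save and restore the cursor position, which avoids pairing one cuu1/el per printed line when the number of refreshed lines changes.

    # Sketch: 'tput sc' saves the cursor, 'tput rc' restores it, and
    # 'tput ed' clears from the cursor to the end of the screen.
    tput sc
    while :; do
        tput rc
        tput ed
        echo "$RANDOM"
        echo "$RANDOM"
        echo "$RANDOM"
        sleep 0.2
    done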