
Posts

Spark Exception - java.lang.NullPointerException

  java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.orc.OrcColumnVector.getUTF8String(OrcColumnVector.java:167)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage46.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.s...

Fixing hbck Inconsistencies

  Execute 'hbck_chore_run' in the hbase shell to generate a new sub-report.

Hole issue:
- Verify whether the region exists in both HDFS and meta.
- If it is not in HDFS, it is data loss or was already cleared by cleaner_chore.
- If it is not in meta, we can use the hbck2 jar's reportMissingInMeta option to find the missing records in meta.
- Then use the addFsRegionsInMeta option to add the missing records back to meta.
- Then restart the Active Master and assign those regions (see the sketch after this excerpt).

Orphan Regions:
Refer https://community.cloudera.com/t5/Support-Questions/Hbase-Orphan-Regions-on-Filesystem-shows-967-regions-in-set/td-p/307959
- Do "ls" to check for "recovered.edits"; if there is no HFile, it means the region was splitting and the split failed.
- Replay using WALPlayer:
  hbase org.apache.hadoop.hbase.mapreduce.WALPlayer hdfs://bdsnameservice/hbase/data/Namespace/Table/57ed0b774aef9158cfda87c945a0afae/recovered.edits/0000000000001738473 Namespace:Table
- Move the Orphan region to some temporary location and clean up...
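A minimal sketch of how the hbck2 steps above might be invoked; the jar path, Namespace:Table, and the encoded region name are placeholders, and option names (reportMissingRegionsInMeta, addFsRegionsMissingInMeta, assigns) can vary between HBCK2 releases, so check the HBCK2 help output for your version.

# find regions present on HDFS but missing from meta
hbase hbck -j /path/to/hbase-hbck2.jar reportMissingRegionsInMeta Namespace:Table
# write the missing records back to meta
hbase hbck -j /path/to/hbase-hbck2.jar addFsRegionsMissingInMeta Namespace:Table
# after restarting the Active Master, assign the regions reported by the previous step
hbase hbck -j /path/to/hbase-hbck2.jar assigns ENCODED_REGION_NAME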

HBase Utility - Merging Regions in HBase

  A growing HBase cluster, and the difficulty of getting physical hardware, is something that every enterprise deals with... And sometimes the amount of ingest starts putting pressure on components before new hosts can be added. As I write this post, our cluster was running with 46 nodes, each node hosting 600 regions. This is bad for performance. You can use the following formula to estimate the number of regions for a RegionServer: (regionserver_memory_size) * (memstore_fraction) / ((memstore_size) * (num_column_families)) Reference https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.0/hbase-data-access/content/memstore-size-regionservers.html We noticed that HBase doesn't have an automatic process to reduce and merge regions. Over time, many small or empty regions form on the cluster, which degrades performance. While researching how to cope with this problem, we came across the following scripts - https://appsintheopen.com/posts/51-merge-empty-hbas...
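To make the formula concrete, here is a rough calculation with assumed values (a 16 GB RegionServer heap, the default memstore fraction of 0.4, a 128 MB memstore flush size, and one column family), followed by the hbase shell command used to merge two regions; the encoded region names are placeholders, not values from this cluster.

# Worked example of the formula with assumed values:
#   regionserver_memory_size = 16384 MB, memstore_fraction = 0.4,
#   memstore_size = 128 MB, num_column_families = 1
echo "16384 * 0.4 / (128 * 1)" | bc -l    # ~51 regions per RegionServer

# Merging two adjacent regions from the hbase shell; ENCODED_REGION_A/B
# are placeholder encoded region names.
echo "merge_region 'ENCODED_REGION_A', 'ENCODED_REGION_B'" | hbase shell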

Refresh Multiple lines or Single line in Linux Shell

  Below code gives you an example to refresh multiple lines using tput:

while :; do
    echo "$RANDOM"
    echo "$RANDOM"
    echo "$RANDOM"
    sleep 0.2
    tput cuu1  # move cursor up by one line
    tput el    # clear the line
    tput cuu1
    tput el
    tput cuu1
    tput el
done

Below code gives you an example to refresh or reprint the same line on STDOUT:

while true; do echo -ne "`date`\r"; done

Which one should I use - PrestoDB or Trino ?

  The first thing to understand is why to use Presto or Trino. We had been running two clusters, a Hortonworks (HDP) variant and a Cloudera (CDP) variant. Hive tables built on HDP were mostly ORC, whereas the tables we had on CDP were mostly Parquet. We wanted to add ad-hoc querying capability to our cluster, and we came across Apache Impala as an excellent tool for this purpose. Only CDP supported Apache Impala. Impala was limited to working with Parquet, Kudu, and HBase; before CDP 6.* there was no support for the ORC file format with Impala. Thus we came to know about PrestoDB, which was built at Facebook and is an excellent distributed SQL engine for ad-hoc querying. It not only supports ORC but has connectors for multiple data sources. A bit of Presto history - Developed at Facebook (2012). Supported by the Presto Foundation, established by the Linux Foundation (2019). The original developers and the Linux Foundation got into a conflict over naming...