
Posts

Showing posts from October, 2023

Spark job fails with Parquet column cannot be converted error

  Exception -

  Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://mylake/day_id=20231026/part-00009-94b5fdf9-bb52-4774-8d88-82e9c529f77f-c000.snappy.parquet. Column: [ACCOUNT_ID], Expected: string, Found: FIXED_LEN_BYTE_ARRAY
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
    ....
  Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:536)
    ....

  Cause

  The vectorized Parquet reader is decoding the decimal-typed column to a binary format, which cannot be converted to the string type expected by the table schema.

  Solution

  One can try either of the below solutions -

  Read the Parquet files directly from HDFS instead of through the Hive table. Or, If
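  A minimal sketch of these workarounds in Spark/Scala. The HDFS path comes from the error message above; the Hive table name is a placeholder, and the non-vectorized-reader fallback (spark.sql.parquet.enableVectorizedReader=false) is a commonly suggested workaround rather than necessarily the post's exact second step.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-decimal-workaround").getOrCreate()

    // Option 1: read the Parquet files directly from HDFS instead of via the Hive table,
    // so Spark uses the physical file schema (decimal stored as FIXED_LEN_BYTE_ARRAY)
    // rather than the mismatched Hive column type.
    val fromHdfs = spark.read.parquet("hdfs://mylake/day_id=20231026")

    // Option 2 (assumed, commonly cited workaround): fall back to the non-vectorized
    // Parquet reader, trading some scan speed for a more lenient type conversion path.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    val fromHive = spark.table("my_db.my_table")   // hypothetical table name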

HBase Performance Optimization- Page2

  Refer to Page 1 of this article @ https://querydb.blogspot.com/2023/10/hbase-performance-optimization.html

  Normally, we run multiple workloads on the cluster. This includes Analytical as well as API calls, and involves both read and write traffic. HBase provides the following mechanisms for managing the performance of a cluster handling multiple workloads:

  - Quotas
  - Request Queues
  - Multiple-Typed Queues

  Quotas

  HBASE-11598 introduces RPC quotas, which allow you to throttle requests based on the following limits:

  - Limit overall network throughput and number of RPC requests
  - Limit amount of storage used for tables or namespaces
  - Limit number of tables for each namespace or user
  - Limit number of regions for each namespace

  For this to work -

  - Set the hbase.quota.enabled property in the hbase-site.xml file to true.
  - Enter the command to set the limit of the quota, the type of quota, and the entity to which the quota applies (a hedged sketch follows below). The command and its syntax are: $hbase_shell> set_quota
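  As an illustration of the same throttling idea, here is a minimal sketch using the HBase 2.x Java client's quota API from Scala, an alternative to the shell's set_quota command; the user name, table name and limits are made-up examples, and hbase.quota.enabled must already be true.

    import java.util.concurrent.TimeUnit
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.quotas.{QuotaSettingsFactory, ThrottleType}

    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val admin = connection.getAdmin

    // Throttle a (hypothetical) user to 100 RPC requests per second across the cluster.
    admin.setQuota(
      QuotaSettingsFactory.throttleUser("etl_user", ThrottleType.REQUEST_NUMBER, 100, TimeUnit.SECONDS))

    // Throttle read throughput on a (hypothetical) table to 10 MB per second.
    admin.setQuota(
      QuotaSettingsFactory.throttleTable(TableName.valueOf("my_table"),
        ThrottleType.READ_SIZE, 10L * 1024 * 1024, TimeUnit.SECONDS))

    admin.close()
    connection.close()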

Spark Custom Kafka Partitioner

  A custom partitioner can be created by implementing the org.apache.kafka.clients.producer.Partitioner interface. It can be used with the Spark-SQL Kafka data source by setting the property "kafka.partitioner.class". For example -

  df.write.format("kafka").option("kafka.partitioner.class", "com.mycustom.ipartitioner")

  We implemented one such custom partitioner extending org.apache.kafka.clients.producer.RoundRobinPartitioner. The complete source code is available @ https://github.com/dinesh028/engineering/blob/master/Kafka/com/aquaifer/producer/KeyPartitioner.scala

  This partitioner -

  - Reads a configuration file which maps a Kafka key to a primary key name. The value in Kafka is a JSON message which has a primary key with a unique value. The idea is to partition messages based on this unique value, such that messages with the same primary key value go into the same partition (a simplified sketch follows below).
  - Once the configurations are loaded, then for each byte array message - convert it to a String JSON, parse the JSON, get the uniqu
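  For illustration only - this is not the author's KeyPartitioner (see the GitHub link above), just a simplified, hedged sketch of a custom Partitioner that routes records with identical value bytes to the same partition, whereas the post's version parses the JSON value and partitions on a configured primary key. Package and class names are hypothetical.

    package com.example.kafka   // hypothetical package, not the post's com.aquaifer code

    import java.util.{Map => JMap}
    import org.apache.kafka.clients.producer.Partitioner
    import org.apache.kafka.common.Cluster
    import org.apache.kafka.common.utils.Utils

    class ValueHashPartitioner extends Partitioner {

      // The post loads its Kafka-key -> primary-key-name mapping file at this point.
      override def configure(configs: JMap[String, _]): Unit = ()

      override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                             value: Any, valueBytes: Array[Byte], cluster: Cluster): Int = {
        val numPartitions = cluster.partitionsForTopic(topic).size()
        if (valueBytes == null) 0
        // Records with the same value bytes land on the same partition; the post instead
        // hashes the unique primary-key value extracted from the JSON message.
        else Utils.toPositive(Utils.murmur2(valueBytes)) % numPartitions
      }

      override def close(): Unit = ()
    }

  It would then be wired in exactly as shown above, e.g. .option("kafka.partitioner.class", "com.example.kafka.ValueHashPartitioner").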

Fix - HBase Master UI - hbck.jsp - NULLPointerException

  At times, the HBase Master UI hbck report (https://hmaster:16010/hbck.jsp) shows a NullPointerException.

  This page displays two reports: the HBCK Chore Report and the CatalogJanitor Consistency Issues report. Only the report titles show if there are no problems to list. Note some conditions are transitory as regions migrate. See below for how to run reports. ServerNames will be links if the server is live, italic if dead, and plain if unknown.

  Solution -

  If this page displays a NullPointerException, then execute -

  echo "hbck_chore_run" | hbase shell

  If the page still displays a NullPointerException and not the report, then execute -

  echo "catalogjanitor_run" | hbase shell

Hue Oracle Database- Delete non standard user entries from Parent and Child Tables.

  We recently implemented authentication for Hue (https://gethue.com/); before that, folks were allowed to authenticate with any kind of username and use the Hue WebUI. After implementing authentication, it was found that the Hue database contained entries for old, unwanted garbage users which were still able to use the service and bypass authentication. Thus, it was required to delete such user entries from the table HUE.AUTH_USER. But there were many child constraints associated with the table, which made deleting an entry violate those constraints. Thus, it was required to find out all child tables associated with HUE.AUTH_USER, first delete entries from the child tables, and then delete entries from the parent HUE.AUTH_USER (a hedged sketch of this order follows below). We used the following query to find all child tables, constraints and associated columns -

  select a.table_name parent_table_name, b.r_constraint_name parent_constraint, c.column_name parent_column, b.table_name child_table, b.constraint_name as child_constraint, d.column_name child_colu
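  A minimal sketch of the child-first, parent-last delete order from Scala over JDBC (the Oracle JDBC driver must be on the classpath). The connection string, user id and child table names are placeholders; the post derives the real child tables from the data-dictionary query above.

    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/HUEDB", "HUE", "secret")
    conn.setAutoCommit(false)
    try {
      val userId = 12345L   // ID of the unwanted entry in HUE.AUTH_USER
      val childTables = Seq("AUTH_USER_GROUPS", "AUTH_USER_USER_PERMISSIONS")   // example child tables
      childTables.foreach { t =>
        val stmt = conn.prepareStatement(s"DELETE FROM HUE.$t WHERE USER_ID = ?")
        stmt.setLong(1, userId)
        stmt.executeUpdate()
        stmt.close()
      }
      // Only after the child rows are gone can the parent row be removed without
      // violating the foreign-key constraints.
      val parent = conn.prepareStatement("DELETE FROM HUE.AUTH_USER WHERE ID = ?")
      parent.setLong(1, userId)
      parent.executeUpdate()
      parent.close()
      conn.commit()
    } catch {
      case e: Exception => conn.rollback(); throw e
    } finally {
      conn.close()
    }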

HBase Performance Optimization

  Please refer to -

  - The first blog in this series, on reducing Regions per Region Server - https://querydb.blogspot.com/2023/03/hbase-utility-merging-regions-in-hbase.html
  - The second, on deleting columns in HBase - https://querydb.blogspot.com/2019/11/hbase-bulk-delete-column-qualifiers.html

  In this article, we discuss options to further optimize HBase:

  - Use COMPRESSION=>'SNAPPY' for column families, and invoke a major compaction right after setting the property (a hedged sketch follows below). This will reduce the size of the tables by 70% while giving the same read & write performance.
  - Once the size of regions & tables is compressed, re-invoke the Merge Region utility to reduce the number of regions per server.
  - Set the region split policy as - SPLIT_POLICY=>'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'
  - Enable the request throttle by setting hbase.quota.enabled to true.

  Our HBase cluster is used by real-time APIs as well as analytical Spark & MR jobs. Analytical workloads crea
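  A minimal sketch of the first step (enabling SNAPPY on a column family and then forcing a major compaction) using the HBase 2.x Admin API from Scala, as an alternative to the equivalent hbase shell alter/major_compact commands; the table and column family names are made up.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory}
    import org.apache.hadoop.hbase.io.compress.Compression
    import org.apache.hadoop.hbase.util.Bytes

    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val admin = connection.getAdmin
    val table = TableName.valueOf("my_table")   // hypothetical table

    // Rebuild the existing column family descriptor with SNAPPY compression enabled.
    val currentCf = admin.getDescriptor(table).getColumnFamily(Bytes.toBytes("cf"))
    val compressedCf = ColumnFamilyDescriptorBuilder.newBuilder(currentCf)
      .setCompressionType(Compression.Algorithm.SNAPPY)
      .build()
    admin.modifyColumnFamily(table, compressedCf)

    // Rewrite the existing HFiles with the new compression by forcing a major compaction.
    admin.majorCompact(table)

    admin.close()
    connection.close()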