Posts

Spark 3 (Scala 2.12) integration with HBase or Phoenix

Clone the Git repo -
git clone https://github.com/dinesh028/hbase-connectors.git

As of December 2022, the hbase-connectors releases in Maven Central are only available for Scala 2.11 and cannot be used with Spark 3.x. The connector has to be compiled from source for Spark 3.x; see also HBASE-25326 (Allow hbase-connector to be used with Apache Spark 3.0).

Build as in this example (customize the HBase, Spark, and Hadoop versions as needed):
mvn -Dspark.version=3.3.1 -Dscala.version=2.12.15 -Dscala.binary.version=2.12 -Dhbase.version=2.4.15 -Dhadoop-three.version=3.3.2 -DskipTests clean package

Use the jar with Spark -
spark-shell --jars ~/hbase-connectors/spark/hbase-spark/target/hbase-spark*.jar

References -
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_HBase_Connector.md
https://kontext.tech/article/628/spark-connect-to-hbase

Similarly, build the Phoenix connector from source, or use the Cloudera repo to download the Spark 3 jar @ https://repository.cloudera.com/service/rest/repository/bro
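Once the connector jar is on the spark-shell classpath as above, it is exposed as a Spark DataSource. A minimal read sketch, following the examples in the referenced Spark_HBase_Connector.md notes (the table name and column mapping here are illustrative):

val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "testTable")                           // hypothetical HBase table
  .option("hbase.columns.mapping",
          "rowKey STRING :key, firstName STRING cf:firstName")  // row key plus one column of family cf
  .option("hbase.spark.use.hbasecontext", false)                // let the source create its own connection
  .load()

df.show()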

Set the following properties to access S3, S3a, and S3n

  "fs.s3.awsAccessKeyId", access_key "fs.s3n.awsAccessKeyId", access_key "fs.s3a.access.key", access_key "fs.s3.awsSecretAccessKey", secret_key "fs.s3n.awsSecretAccessKey", secret_key "fs.s3a.secret.key", secret_key "fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem" "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem" "fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem" If one needs to copy data from one S3 Bucket to other with different credential keys. Then -  If you are on Hadoop cluster with version 2.7, and using s3a:// then -  use URI as following - s3a://DataAccountKey:DataAccountSecretKey/DataAccount/path If you are on EMR or Hadoop 2.8+ then one can add properties per-bucket, as following -  fs.s3a.bucket.DataAccount.access.key DataAccountKey fs.s3a.bucket.DataAccount.secret.key DataAccountSecretKey fs.s3.bucket.DataAccount.awsAccessKeyId Da

Debugging Kafka connectivity from a remote application, including Spring Boot, Spark, Console Consumer, and OpenSSL

Our downstream partners wanted to consume data from a Kafka topic. They opened the network and firewall ports to the respective ZooKeeper and broker servers, but the Spring Boot application and the console consumer still failed to consume messages from the topic. Refer to the log stack trace below -

[2024-01-10 13:33:34,759] DEBUG [Consumer clientId=consumer-o2_prism_group-1, groupId=o2_prism_group] Node -1 disconnected. (org.apache.kafka.clients.NetworkClient)
[2024-01-10 13:33:34,762] WARN [Consumer clientId=consumer-o2_prism_group-1, groupId=o2_prism_group] Bootstrap broker ncxxx001.h.c.com:9093 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2024-01-10 13:33:34,860] DEBUG [Consumer clientId=consumer-o2_prism_group-1, groupId=o2_prism_group] Initialize connection to node ncxxx001.h.c.com:9093 (id: -1 rack: null) for sending metadata request (org.apache.kafka.clients.NetworkClient)
[2024-01-10 13:33:34,861] DEBUG [Consumer clientId=consumer-o2_prism
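An immediate disconnect on port 9093 with open firewall ports commonly points at a TLS mismatch rather than a network block; running openssl s_client -connect ncxxx001.h.c.com:9093 shows whether the broker completes an SSL handshake and which certificate it presents. Below is a minimal sketch of a Spark batch read that supplies the SSL client settings such a consumer needs (the topic name, truststore path, and password are assumptions; requires the spark-sql-kafka-0-10 package):

val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "ncxxx001.h.c.com:9093")
  .option("subscribe", "o2_prism_topic")                               // hypothetical topic
  .option("kafka.security.protocol", "SSL")                            // 9093 is typically an SSL listener
  .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")  // assumed truststore path
  .option("kafka.ssl.truststore.password", "changeit")                 // placeholder
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(5, false)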

Improve API Performance

 

Spark: Decimal columns are shown in scientific notation instead of numbers

This issue relates to SPARK-25177: when a DataFrame decimal-type column has a scale higher than 6, zero values are shown in scientific notation.

Solution -
One can use the built-in format_number function to convert the scientific notation into a String, as shown below -
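A minimal sketch of the fix in spark-shell (the column name and scale are illustrative; note that format_number also inserts thousands separators, e.g. "1,234.50", which may or may not be wanted):

import org.apache.spark.sql.functions.format_number
import spark.implicits._

// Scala BigDecimal maps to DecimalType(38,18), so a zero prints as 0E-18
val df = Seq(BigDecimal(0), BigDecimal(1234.5)).toDF("amount")

df.select($"amount", format_number($"amount", 8).as("amount_str"))
  .show(truncate = false)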