
Load Balancing or Dynamic Discovery of HiveServer2 Connections from Beeline or Hive Shell


To provide high availability or load balancing for HiveServer2, Hive offers a feature called dynamic service discovery, in which multiple HiveServer2 instances register themselves with ZooKeeper. Instead of connecting to a specific HiveServer2 directly, clients connect to ZooKeeper, which returns a randomly selected registered HiveServer2 instance.


For example, the command below connects directly to HiveServer2 on machineA:

  • beeline -u "jdbc:hive2://machineA:10000"

The command below connects to the ZooKeeper ensemble instead, which picks one of the available HiveServer2 instances for the connection:

  • beeline -u "jdbc:hive2://machineA:2181,machineB:2181,machineC:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-mob-batch?tez.queue.name=myyarnqueue"
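The discovery URL above follows a fixed pattern, so it can be assembled from its parts. Below is a minimal sketch of that assembly. The function name build_discovery_url is our own hypothetical helper, not part of beeline or Hive, and the arguments are the example values used in this post.

```shell
# Build a ZooKeeper service-discovery JDBC URL for beeline.
# build_discovery_url is a hypothetical helper, not a Hive or beeline command.
#   $1  comma-separated ZooKeeper quorum, e.g. machineA:2181,machineB:2181
#   $2  ZooKeeper namespace the HiveServer2 instances are registered under
#   $3  optional Hive configuration list, appended after '?'
build_discovery_url() {
  local quorum="$1"
  local namespace="$2"
  local hive_conf="${3:-}"
  local url="jdbc:hive2://${quorum}/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=${namespace}"
  if [ -n "$hive_conf" ]; then
    url="${url}?${hive_conf}"
  fi
  printf '%s\n' "$url"
}

# Example (not executed here, needs a running cluster):
#   beeline -u "$(build_discovery_url machineA:2181,machineB:2181,machineC:2181 hiveserver2-mob-batch tez.queue.name=myyarnqueue)"
```

Keeping the quorum and namespace as parameters means the same helper can be reused for any other namespace without editing the long URL by hand.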

We can create the znode in ZooKeeper as follows:

  1. Open Zookeeper command line interface
    • zookeeper-client
  2. Connect to Zookeeper Server
    • connect machineA:2181,machineB:2181,machineC:2181
  3. Create ZNode
    • create /hiveserver2-mob-batch
  4. Manually register each HS2 instance with ZooKeeper under the namespace
    • create /hiveserver2-mob-batch/serverUri=machineA:10000;version=3.1.3000.7.1.2.0-96;sequence=0000000082
    • create /hiveserver2-mob-batch/serverUri=machineB:10000;version=3.1.3000.7.1.2.0-96;sequence=0000000081
    • create /hiveserver2-mob-batch/serverUri=machineC:10000;version=3.1.3000.7.1.2.0-96;sequence=0000000051
  5. Verify the namespace by listing its children
    • ls /hiveserver2-mob-batch
    • Output: [serverUri=machineC:10000;version=3.1.3000.7.1.2.0-96;sequence=0000000051, serverUri=machineA:10000;version=3.1.3000.7.1.2.0-96;sequence=0000000082, serverUri=machineB:10000;version=3.1.3000.7.1.2.0-96;sequence=0000000081]
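The verification step can also be scripted instead of being read by eye. Below is a minimal sketch that checks whether a given host:port appears as a serverUri in the ls output. The function name is_hs2_registered is hypothetical, and the sketch assumes the output keeps the serverUri=...;version=...;sequence=... layout shown above.

```shell
# Succeed (exit status 0) if the endpoint given as $1, e.g. machineA:10000,
# appears as a serverUri in the `ls` output supplied on stdin.
# is_hs2_registered is a hypothetical helper name.
is_hs2_registered() {
  # 1. keep only the "serverUri=host:port" fragments, one per line
  # 2. strip the "serverUri=" prefix
  # 3. look for an exact, whole-line match of the requested endpoint
  grep -o 'serverUri=[^;]*' | cut -d= -f2 | grep -qxF "$1"
}

# Example (not executed here, needs zookeeper-client and a running ensemble):
#   ( echo "connect machineA:2181"; echo "ls /hiveserver2-mob-batch" ) \
#     | zookeeper-client | is_hs2_registered machineA:10000 && echo "registered"
```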

To deregister a particular HiveServer2, run the following command in the ZooKeeper command line interface:
  • delete /hiveserver2-mob-batch/serverUri=machineC:10000;version=3.1.3000.7.1.2.0-96;sequence=0000000051
After you deregister a HiveServer2 instance, ZooKeeper no longer returns it for new client connections. Active client sessions on that instance are not affected by the deregistration.

To deregister all HiveServer2 instances of a particular version, run the following command from the command line:
  • hive --service hiveserver2 --deregister <version_number>

Even after the manual configuration in ZooKeeper above, we might still get an error like the one below when invoking beeline or hive:

22/04/11 19:43:31 [main-EventThread]: ERROR imps.EnsembleTracker: Invalid config event received: {server.1=machineA:3181:4181:participant, version=0, server.3=machineB:3181:4181:participant, server.2=machineC:3181:4181:participant}
Error: org.apache.hive.jdbc.ZooKeeperHiveClientException: Unable to read HiveServer2 configs from ZooKeeper (state=,code=0)

This happens because the admin team must also complete the following steps for ZooKeeper discovery of HS2 to work:

Configuration Requirements

1. Set hive.zookeeper.quorum to the ZooKeeper ensemble (a comma-separated list of the host:port pairs of the ZooKeeper servers running in the cluster)

2. Customize hive.zookeeper.session.timeout so that it closes the connection between the HiveServer2’s client and ZooKeeper if a heartbeat is not received within the timeout period.

3. Set hive.server2.support.dynamic.service.discovery to true

4. Set hive.server2.zookeeper.namespace to the value that you want to use as the root namespace on ZooKeeper. The default value is hiveserver2.

5. The administrator should ensure that the ZooKeeper service is running on the cluster, and that each HiveServer2 instance gets a unique host:port combination to bind to upon startup.
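For reference, requirements 1 to 4 usually end up as property entries in hive-site.xml. Below is a minimal sketch that prints those entries so they can be reviewed before an admin applies them. The helper names hive_site_property and hs2_discovery_properties are hypothetical, the timeout value in the example is illustrative rather than a recommendation, and the sketch does no XML escaping, so it only suits simple values like the ones used here.

```shell
# Print one hive-site.xml <property> element for the given name and value.
# Hypothetical helper; values are written as-is, without XML escaping.
hive_site_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Print the four discovery-related properties from the list above.
#   $1  ZooKeeper quorum
#   $2  root namespace on ZooKeeper
#   $3  session timeout
hs2_discovery_properties() {
  hive_site_property hive.zookeeper.quorum "$1"
  hive_site_property hive.zookeeper.session.timeout "$3"
  hive_site_property hive.server2.support.dynamic.service.discovery true
  hive_site_property hive.server2.zookeeper.namespace "$2"
}

# Example with the values from this post (the timeout is an assumed value):
#   hs2_discovery_properties machineA:2181,machineB:2181,machineC:2181 hiveserver2-mob-batch 1200000ms
```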



As developers, we applied the following hack to get a random HiveServer2 from ZooKeeper, thus distributing load across the HS2 instances:

beeline -u "jdbc:hive2://$( ( echo "connect machineA:2181,machineB:2181,machineC:2181"; echo "ls /hiveserver2-mob-batch") | zookeeper-client | grep -oP '(?<=serverUri=).*?(?=;)'| shuf | head -1)/default;principal=hive/_HOST@MYDOMAIN"

What the above command does:
  1. Opens zookeeper-client and runs
    1. connect machineA:2181,machineB:2181,machineC:2181
    2. ls /hiveserver2-mob-batch
  2. Parses the HS2 host:port values, i.e. the text between serverUri= and ;
  3. Randomly shuffles all of them with shuf
  4. Picks the first shuffled value with head -1
  5. Concatenates the strings to form the JDBC URL: jdbc:hive2:// ...
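The steps above can also be wrapped in a small function that fails loudly when ZooKeeper returns no instance, instead of handing beeline an empty host. This is a minimal sketch, not a replacement for proper service discovery. The function name pick_random_hs2 is hypothetical, it uses plain grep -o and cut instead of grep -oP so that PCRE support is not needed, and it assumes GNU shuf is available.

```shell
# Read `ls` output on stdin and print one randomly chosen host:port.
# Returns a non-zero status if no serverUri entry is found.
# pick_random_hs2 is a hypothetical helper name.
pick_random_hs2() {
  local endpoint
  # Extract every host:port after "serverUri=" and keep one at random.
  endpoint="$(grep -o 'serverUri=[^;]*' | cut -d= -f2 | shuf -n 1)"
  if [ -z "$endpoint" ]; then
    echo "no HiveServer2 instance found in ZooKeeper output" >&2
    return 1
  fi
  printf '%s\n' "$endpoint"
}

# Example (not executed here, needs zookeeper-client, beeline and a cluster):
#   hs2="$( ( echo "connect machineA:2181,machineB:2181,machineC:2181"; \
#             echo "ls /hiveserver2-mob-batch" ) | zookeeper-client | pick_random_hs2 )" \
#     && beeline -u "jdbc:hive2://${hs2}/default;principal=hive/_HOST@MYDOMAIN"
```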

