When you create a Spark DataFrame, one or more columns can have nullable = false in the schema. This means those columns cannot hold null values. When a null value is assigned to such a column, Spark throws an exception like the following:

    2/7/2023 3:16:00 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 6)
    java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException:
    The 0th field 'colA' of input row cannot be null.

To avoid this error, we need to update the DataFrame's schema to set nullable = true. One way to do that is a when/otherwise clause:

    .withColumn("col_name", when(col("col_name").isNotNull, col("col_name")).otherwise(lit(null)))

Because the otherwise branch can produce null, Spark marks the resulting column as nullable. Another way is to create a custom method, called on a DataFrame, that returns a new DataFrame with a modified schema:

    import org.apache.spark
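One possible shape for such a custom method is sketched below. This is my own illustration, not code from the original post: the object and method names (`NullableHelper`, `setAllNullable`) are hypothetical, and the sketch assumes Spark 2.x or later, where `DataFrame.sparkSession` and the `createDataFrame(rdd, schema)` overload are available. It rebuilds the schema with every top-level field marked nullable, then re-creates the DataFrame from the same rows.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper object; names are illustrative, not from the original post.
object NullableHelper {

  // Returns a new DataFrame whose schema marks every top-level column
  // nullable (or non-nullable, if nullable = false is passed).
  // The row data is unchanged; only the schema metadata differs.
  // Note: nested StructType fields are not rewritten by this sketch.
  def setAllNullable(df: DataFrame, nullable: Boolean = true): DataFrame = {
    val newSchema = StructType(df.schema.map {
      case StructField(name, dataType, _, metadata) =>
        StructField(name, dataType, nullable, metadata)
    })
    // Rebuild the DataFrame from the same rows under the relaxed schema.
    df.sparkSession.createDataFrame(df.rdd, newSchema)
  }
}
```

Usage would look like `val relaxed = NullableHelper.setAllNullable(df)`; afterwards `relaxed.schema.fields.forall(_.nullable)` holds, so assigning null to any column no longer trips the encoder check. Unlike the when/otherwise trick, this changes all columns in one call without touching the data.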