Posts

Spark HBase Connector - Doesn't Support IN Clause

We came across a scenario while using "shc-core-1.1.0.3.1.5.0-152.jar". A Spark DataFrame was created on top of one of the HBase tables. We queried this DataFrame like "select * from df where col in ('A', 'B', 'C')" and found that the filter on col was not applied. But if the same SQL is rewritten as "select * from df where col = 'A' or col = 'B' or col = 'C'", it works.
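A minimal Scala sketch of the workaround, assuming "spark" is the active SparkSession and "df" is a DataFrame already defined over the HBase table through the SHC catalog (the view name and column name here are illustrative):

    // Register the DataFrame so it can be queried with Spark SQL.
    df.createOrReplaceTempView("df")

    // With shc-core, the IN predicate was not applied correctly in our tests.
    val viaIn = spark.sql("select * from df where col in ('A', 'B', 'C')")

    // Rewriting IN as OR-ed equality predicates returns the expected rows.
    val viaOr = spark.sql(
      "select * from df where col = 'A' or col = 'B' or col = 'C'")

    // The same rewrite through the DataFrame API.
    import org.apache.spark.sql.functions.col
    val viaApi = df.filter(col("col") === "A" || col("col") === "B" || col("col") === "C")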

Copy Code of a Git Repo into a Different Git Repo with Commit History

1) Clone the source repo into a temporary directory and enter it:
   git clone <url to Source repo> temp-dir
   cd temp-dir
2) Check the different branches:
   git branch -a
3) Check out all the branches that you want to copy:
   git checkout branch-name
4) Fetch all the tags:
   git fetch --tags
5) Remove the link to the source repo:
   git remote rm origin
6) Link your local repository to your newly created NEW repository:
   git remote add origin <url to NEW repo>
7) Push all your branches and tags with these commands:
   git push origin --all
   git push --tags
8) The above steps complete the copy from the source repo to the new repo.

Spark - Teradata Connection Issues

Exception:
Caused by: java.lang.NullPointerException
        at com.teradata.tdgss.jtdgss.TdgssConfigApi.GetMechanisms(Unknown Source)
        at com.teradata.tdgss.jtdgss.TdgssManager.<init>(Unknown Source)
        at com.teradata.tdgss.jtdgss.TdgssManager.<clinit>(Unknown Source)
Brief: tdgssconfig.jar cannot be found on the classpath. Add it to the classpath.

Exception:
java.sql.SQLException: [Teradata Database] [TeraJDBC 15.10.00.33] [Error 3707] [SQLState 42000] Syntax error, expected something like a name or a Unicode delimited identifier or an 'UDFCALLNAME' keyword or '(' between the 'FROM' keyword and the 'SELECT' keyword.
Brief: Normally Spark JDBC expects the dbtable property to be a table name, so internally it prepends "select * from" to it, producing: select * from <Table Name>. But if we specify a SQL query instead of a table name, the generated statement becomes something like: select * from select ... ; which is invalid syntax and produces the error above.
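A hedged Scala sketch of the workaround for the second error (host, credentials, database, and query are placeholders; it assumes the Teradata JDBC driver and tdgssconfig.jar are on the classpath): wrap the SQL in parentheses with an alias so the statement Spark generates stays valid.

    val jdbcUrl = "jdbc:teradata://<host>/DATABASE=<db>"

    // Plain table name: Spark internally issues "select * from my_db.my_table".
    val byTable = spark.read
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "my_db.my_table")
      .option("user", "<username>")
      .option("password", "<password>")
      .load()

    // SQL query: wrap it in parentheses with an alias, so the generated statement
    // becomes "select * from (select ...) q" instead of the invalid
    // "select * from select ...".
    val byQuery = spark.read
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "(select col1, col2 from my_db.my_table) q")
      .option("user", "<username>")
      .option("password", "<password>")
      .load()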

Splunk Data to Hadoop Ingestion

One approach to getting data from Splunk into Hadoop is to use the REST API provided by Splunk, so that data is periodically ingested into the Hadoop Data Lake. A simple command like the one below can help in such a scenario:

curl -u '<username>:<password>' \
   -k https://splunkhost:8089/services/search/jobs/export \
   -d search="search index=myindex | head 10" \
   -d output_mode=raw \
   | hdfs dfs -put -f - <HDFS_DIR>

The above command will get the top 10 rows from the Splunk index "myindex" and ingest them into the Hadoop Data Lake.

Sqoop Import: New Line Character in One of the Column Values

Sometimes data produced by a Sqoop import may contain newline characters in column values, which can cause the data to be read back incorrectly. To resolve this, follow either of the solutions below.

Specify the following options with Sqoop:
--map-column-java <Column name that contains New Line>=String
--hive-drop-import-delims

Or, update the Sqoop SQL and select the column with a regex replacement, like:
regexp_replace(<Column name that contains New Line>, '[[:space:]]+', ' ')