Posts

Validate Emails using Python

This post uses the email-validator package: https://pypi.org/project/email-validator/

Installation:

pip install email-validator

Usage: the script below reads a list of email addresses from the file "test.emails" and loops over them, validating each one.

#!/usr/bin/python
from email_validator import validate_email, EmailNotValidError

filename = "/home/dinesh/setuptools-7.0/test.emails"
total_count = 0
valid_count = 0
invalid_count = 0

with open(filename, "r") as a_file:
    for line in a_file:
        stripped_line = line.strip()
        print(stripped_line)
        total_count = total_count + 1
        try:
            # Validate.
            valid = validate_email(stripped_line)
            valid_count = valid_count + 1
            # Update with the normalized form.
            # email = valid.email
        except EmailNotValidError as e:
            # Email is not valid; the exception message is human-readable.
            print(str(e))
            invalid_count = invalid_count + 1

print("Total Count: " + str(total_count))
print("Valid Count: " + str(valid_count))
print("Invalid Count: " + str(invalid_count))
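For reference, a minimal sketch of validating a single address and using the normalized form (the valid.email attribute that is commented out in the script above); the sample address is made up and assumes an email-validator version that exposes valid.email:

from email_validator import validate_email, EmailNotValidError

try:
    valid = validate_email("Someone@Example.COM")
    # The result object carries the normalized address (e.g. lower-cased domain).
    print("Normalized: " + valid.email)
except EmailNotValidError as e:
    print("Invalid: " + str(e))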

Spark HBase Connector - Doesn't Support IN Clause

We came across a scenario while using "shc-core-1.1.0.3.1.5.0-152.jar": a Spark data frame was created on one of the HBase tables. When we queried this data frame with "select * from df where col in ('A', 'B', 'C')", we found that the filter on col did not work. But if we rewrite the same SQL as "select * from df where col = 'A' or col = 'B' or col = 'C'", it works.
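As a rough illustration (assuming a SparkSession named spark and the data frame registered as a temp view named df, as in the query above), the IN list can be rewritten into the chained OR form programmatically:

values = ["A", "B", "C"]
# Build "col = 'A' or col = 'B' or col = 'C'" instead of "col in ('A', 'B', 'C')".
or_clause = " or ".join("col = '{}'".format(v) for v in values)
result = spark.sql("select * from df where " + or_clause)
result.show()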

Copy Code of a Git Repo into a Different Git Repo with Commit History

1) Clone the source repo into a temporary directory:
   git clone <url to Source repo> temp-dir
2) Check the different branches:
   git branch -a
3) Check out all the branches that you want to copy:
   git checkout branch-name
4) Fetch all the tags:
   git fetch --tags
5) Clear the link to the source repo:
   git remote rm origin
6) Link your local repository to your newly created NEW repository:
   git remote add origin <url to NEW repo>
7) Push all your branches and tags:
   git push origin --all
   git push --tags
8) The above steps complete the copy from the source repo to the new repo.

Spark - Teradata Connection Issues

Exception:

Caused by: java.lang.NullPointerException
        at com.teradata.tdgss.jtdgss.TdgssConfigApi.GetMechanisms(Unknown Source)
        at com.teradata.tdgss.jtdgss.TdgssManager.<init>(Unknown Source)
        at com.teradata.tdgss.jtdgss.TdgssManager.<clinit>(Unknown Source)

Brief: tdgssconfig.jar cannot be found on the classpath. Add it to the classpath.

Exception:

java.sql.SQLException: [Teradata Database] [TeraJDBC 15.10.00.33] [Error 3707] [SQLState 42000] Syntax error, expected something like a name or a Unicode delimited identifier or an 'UDFCALLNAME' keyword or '(' between the 'FROM' keyword and the 'SELECT' keyword.

Brief: Normally Spark JDBC expects the dbtable property to be a table name, so internally it prepends "select * from" to it, producing "select * from <Table Name>". But if we specify a SQL statement instead of a table name, the generated query becomes something like "select * from select ...", which triggers the syntax error above.
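One way around the second issue is to pass the SQL as a parenthesized subquery with an alias in the dbtable option, so the "select * from" that Spark generates stays valid. A minimal PySpark sketch, where the host, database, table, columns and credentials are placeholders rather than values from the original post:

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:teradata://<host>/DATABASE=<db>")      # placeholder URL
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "(select col1, col2 from my_table) t")   # subquery with alias
      .option("user", "<username>")
      .option("password", "<password>")
      .load())
df.show()

The first issue is typically handled by shipping tdgssconfig.jar (along with the Teradata JDBC driver jar) with the application, for example via spark-submit --jars.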

Splunk Data to Hadoop Ingestion

One approach to getting data from Splunk into Hadoop is to use the REST API provided by Splunk, so that data is periodically ingested into the Hadoop data lake. A simple command like the one below can help in such a scenario:

curl -u '<username>:<password>' \
  -k https://splunkhost:8089/services/search/jobs/export \
  -d search="search index=myindex | head 10" \
  -d output_mode=raw \
  | hdfs dfs -put -f - <HDFS_DIR>

The above command fetches the top 10 rows from the Splunk index "myindex" and ingests them into the Hadoop data lake.
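The same export endpoint can also be called from Python; below is a hedged sketch using the requests library, where the host, credentials and local file name are placeholders and verify=False mirrors curl's -k flag:

import requests

resp = requests.post(
    "https://splunkhost:8089/services/search/jobs/export",
    auth=("<username>", "<password>"),
    data={"search": "search index=myindex | head 10", "output_mode": "raw"},
    verify=False,
    stream=True,
)
resp.raise_for_status()

# Stream the raw results to a local file, which could then be pushed to HDFS
# with something like: hdfs dfs -put -f splunk_export.raw <HDFS_DIR>
with open("splunk_export.raw", "wb") as out:
    for chunk in resp.iter_content(chunk_size=8192):
        if chunk:
            out.write(chunk)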