
Posts

Machine Learning Part 1

Machine learning uses algorithms to find patterns in data, and then uses a model that recognizes those patterns to make predictions on new data. Machine learning may be broken down into:

- Supervised learning - algorithms use labeled data (Classification, Regression).
- Unsupervised learning - algorithms find patterns in unlabeled data (Clustering, Collaborative Filtering, Frequent Pattern Mining).
- Semi-supervised learning - uses a mixture of labeled and unlabeled data.
- Reinforcement learning - trains algorithms to maximize rewards based on feedback.

Classification - mail servers like Gmail use ML to classify whether an email is spam or not based on the data of an email: the sender, recipients, subject, and message body. Classification takes a set of data with known labels and learns how to label new records based on that information. For example, whether an item is important or not, or whether a transaction is fraudulent or not, based upon known labeled examples of transactions which were classified fraud or
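The spam example can be sketched in a few lines of plain Python. This is only a toy illustration of learning labels from labeled examples, not Gmail's actual method; the labeled messages and the word-overlap scoring are made up for this sketch.

```python
# Toy classifier: label a new message by word overlap with labeled examples.
# The training messages below are invented for illustration.

def tokens(text):
    """Split a message into a set of lowercase words."""
    return set(text.lower().split())

# Known labeled examples: (label, message body).
labeled = [
    ("spam", "win a free prize claim your reward now"),
    ("spam", "free offer click now to claim cash"),
    ("ham",  "meeting moved to tuesday please confirm"),
    ("ham",  "please review the attached project report"),
]

def classify(message):
    """Score each class by total word overlap and pick the highest."""
    scores = {"spam": 0, "ham": 0}
    words = tokens(message)
    for label, example in labeled:
        scores[label] += len(words & tokens(example))
    return max(scores, key=scores.get)

print(classify("claim your free prize now"))   # -> spam
print(classify("please confirm the meeting"))  # -> ham
```

A real system would use the same idea (learn from labeled records, predict labels for new ones) but with far richer features and a proper statistical model.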

Buzzwords - Deep learning, machine learning, artificial intelligence

Deep learning, machine learning, artificial intelligence - all buzzwords, and all representative of the future of analytics. The basic point of all these buzzwords is to provoke a review of your own data to identify new opportunities, for example:

Retail | Marketing | Healthcare | Telecommunication | Finance
Demand forecasting | Recommendation engines and targeting | Predicting patient disease risk | Customer churn | Risk analytics
Supply chain optimization | Customer 360 | Diagnostics and alerts | System log analysis | Customer 360
Pricing optimization | Click-stream analysis | Fraud | Anomaly detection | Fraud
Market segmentation and targeting | Social media analysis | Preventive maintenance | Credit scoring | Recommendations
Ad optimization | Smart meter analysis | | |

While writing this blog, I realized that I have worked on the highlighted use cases, but that work didn't involve any of these buzzwords. The basic philosophy behind all of them is Knowing the Unknown. Once you know the business

Spark - Ways to Cache data

There are two groups of ways to cache data in Spark:

SQL CACHE TABLE, DataFrame.cache(), and spark.catalog.cacheTable() - these persist in both on-heap RAM and on local SSDs with the MEMORY_AND_DISK strategy. You can inspect where the RDD partitions are stored (in memory or on disk) using the Spark UI. The in-memory portion is stored in a columnar format optimized for fast columnar aggregations and is automatically compressed to minimize memory and GC pressure. This cache should be considered scratch/temporary space, as the data will not survive a worker failure.

dbutils.fs.cacheTable() and Table view -> Cache Table (Databricks) - these persist only to the local SSDs mounted at /local_disk. This cache will survive cluster restarts.

Spark Datasets vs Dataframe vs SQL

Datasets are composed of typed objects, which means that transformation syntax errors (like a typo in a method name) and analysis errors (like an incorrect input variable type) can be caught at compile time. DataFrames are composed of untyped Row objects, which means that only syntax errors can be caught at compile time. Spark SQL queries are composed of strings, which means that both syntax errors and analysis errors are only caught at runtime.

Error type | SQL | DataFrames | Datasets
Syntax | Runtime | Compile time | Compile time
Analysis | Runtime | Runtime | Compile time

Also, note that Spark has encoders for all predefined basic data types like Int, String, etc. But where required, we have to write a custom encoder to form a Dataset of a typed custom object.

HBase Bulk Delete Column Qualifiers

Refer to the below MapReduce program, which can be used to delete a column qualifier from an HBase table -

--------
# Set HBase classpath
export HADOOP_CLASSPATH=`hbase classpath`
# Execute the MapReduce job
hadoop jar Test-0.0.1-SNAPSHOT.jar com.test.mymr.DeleteHBaseColumns
--------

Refer - https://github.com/dinesh028/engineering/blob/master/src/com/test/mymr/DeleteHBaseColumns.java