
Posts

Showing posts from December, 2019

Machine Learning Part 5

In this blog, we will describe another example that utilizes K-Means and Spark to determine locations. Before that, we suggest you go through the previous blogs - https://querydb.blogspot.com/2019/12/machine-learning-part-4.html https://querydb.blogspot.com/2019/12/machine-learning-part-3.html https://querydb.blogspot.com/2019/12/machine-learning-part-2.html https://querydb.blogspot.com/2019/12/machine-learning-part-1.html In this blog, we will analyze and try to make predictions on fire detection GIS data - https://fsapps.nwcg.gov/gisdata.php We will analyze historical data of wildfires, which can eventually help reduce response time in case of fire, reduce cost, reduce damage due to fire, etc. A fire can grow exponentially based on various factors like wildlife, wind velocity, terrain surface, etc. Incident tackle time is limited by various factors, one of which is moving firefighting equipment. If we plan in advance where to pl…

Machine Learning Part 4

In the previous blog, we learned about creating a K-Means clustering model. In this blog we will use that model in a streaming use case for analysis in real time. For the previous blog, refer to https://querydb.blogspot.com/2019/12/machine-learning-part-3.html 1) Load the model created in the previous blog. 2) Create a dataframe with cluster id and centroid location (centroid longitude, centroid latitude). 3) Create a Kafka streaming dataframe. 4) Parse the message into a typed object. 5) Use VectorAssembler to put all features into a vector. 6) Transform the dataframe using the model to get predictions. 7) Join with the dataframe created in #2. 8) Print the results to the console or save them to HBase. Note that this example also demonstrates Spark Structured Streaming, where-in we created a streaming Kafka source and a custom Foreach sink to write data to HBase. Refer to the code at https://github.com/dinesh028/SparkDS/tree/master/src/indore/dinesh/sachdev/uber/streaming
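The steps above can be sketched roughly as follows. This is a minimal outline assuming Spark 2.x Structured Streaming with the spark-sql-kafka connector; the model path, broker address, topic name, and message layout are illustrative placeholders, not the values used in the linked repository.

```scala
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("UberStreaming").getOrCreate()
import spark.implicits._

// 1) Load the model created in the previous blog (path is an assumption)
val model = KMeansModel.load("/models/uber-kmeans")

// 2) Dataframe of cluster id -> centroid location
val centroids = model.clusterCenters.zipWithIndex
  .map { case (c, id) => (id, c(0), c(1)) }
  .toSeq.toDF("cid", "centroidLat", "centroidLon")

// 3) Kafka streaming dataframe (broker and topic are assumptions)
val raw = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "uber")
  .load()

// 4) Parse the message value, assumed here to be "dt,lat,lon,base"
val parsed = raw.selectExpr("CAST(value AS STRING) AS line")
  .select(split($"line", ",").as("p"))
  .select($"p".getItem(1).cast("double").as("lat"),
          $"p".getItem(2).cast("double").as("lon"))

// 5) Assemble features, 6) predict, 7) join with the centroid dataframe
val features = new VectorAssembler()
  .setInputCols(Array("lat", "lon")).setOutputCol("features")
  .transform(parsed)
val predictions = model.transform(features)
  .join(centroids, $"prediction" === $"cid")

// 8) Print results to the console (a custom HBase ForeachWriter could replace this sink)
predictions.writeStream.format("console").outputMode("append").start().awaitTermination()
```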

Machine Learning Part 3

Refer to the code at https://github.com/dinesh028/SparkDS/blob/master/src/indore/dinesh/sachdev/uber/UberClusteringDriver.scala Nowadays, machine learning is helping to improve cities. The analysis of location and behavior patterns within cities allows optimization of traffic, better planning decisions, and smarter advertising. For example, analysis of GPS data can optimize traffic flow, and many companies use it for field technician optimization. It can also be used for recommendations, anomaly detection, and fraud detection. Uber uses the same approach to optimize customer experience - https://www.datanami.com/2015/10/05/how-uber-uses-spark-and-hadoop-to-optimize-customer-experience/ In this blog, we will look at clustering and the k-means algorithm, and its usage to analyze public Uber data. Clustering is a family of unsupervised machine learning algorithms that discover groupings that occur in collections of data by analyzing similarities between input examples. Some examples of clustering uses inc…
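A minimal k-means sketch in the spirit of the linked UberClusteringDriver, assuming Spark 2.x ML. The CSV path, column names, and the choice of k = 20 are illustrative assumptions, not values taken from the driver itself.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("UberClustering").getOrCreate()

// Public Uber trip data: date/time, latitude, longitude, base
val df = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/data/uber.csv")

// Put the pickup coordinates into a single feature vector
val features = new VectorAssembler()
  .setInputCols(Array("Lat", "Lon")).setOutputCol("features")
  .transform(df)

// Discover, say, 20 pickup-location clusters
val model = new KMeans().setK(20).setSeed(1L).fit(features)
model.clusterCenters.foreach(println)

// Persist the model for reuse in the streaming job (Part 4)
model.save("/models/uber-kmeans")
```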

Machine Learning Part 2

Refer to the code at https://github.com/dinesh028/SparkDS/blob/master/src/indore/dinesh/sachdev/FlightDelayDriver.scala This post talks about predicting flight delays, as there is growing interest in predicting flight delays beforehand in order to optimize operations and improve customer satisfaction. We will use historic flight status data with the Random Forest Classifier algorithm to find common patterns in late departures, in order to predict flight delays and share the reasons for those delays. We will be using Apache Spark for the same. As mentioned in Part 1, classification is a kind of supervised learning which requires features ("if questions") and labels (outcomes) in advance to build the model. What are we trying to predict (label)? Whether a flight will be delayed or not (TRUE or FALSE). What are the "if questions" or properties that you can use to make predictions? What is the originating airport? What is the destination airport? What is the scheduled time…
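The label and "if questions" above can be sketched as a small Spark ML pipeline, assuming Spark 2.x. The column names and the delay threshold used to derive the label are illustrative assumptions; see the linked FlightDelayDriver for the actual feature set.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("FlightDelay").getOrCreate()

val flights = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/data/flights.csv")
  // Label: 1.0 if the departure delay exceeds 40 minutes (threshold is an assumption)
  .withColumn("label", when(col("depdelay") > 40, 1.0).otherwise(0.0))

// Encode the categorical "if questions" (origin, destination) as numeric indices
val originIdx = new StringIndexer().setInputCol("origin").setOutputCol("originIdx")
val destIdx   = new StringIndexer().setInputCol("dest").setOutputCol("destIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("originIdx", "destIdx", "crsdeptime"))
  .setOutputCol("features")
val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")

// Train on known labels, then test on held-out data
val Array(train, test) = flights.randomSplit(Array(0.8, 0.2), seed = 1L)
val model = new Pipeline().setStages(Array(originIdx, destIdx, assembler, rf)).fit(train)
model.transform(test).select("label", "prediction").show(5)
```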

How to determine Spark JOB Resources over YARN

This is a question which I believe cannot be answered in a direct way, because it depends upon various run-time factors - data distribution, skewed data, the operations we are performing (join, windowing, etc.), shuffle time, GC overhead, etc. A run-time analysis of DAG & job execution helps us tune optimal resources for a Spark job. But, we will try to provide a very basic answer to this question in this blog. Note that this is run-time behavior, so the answer may not fit all use cases. Suppose you have a multi-tenant cluster and you determine how much hardware is available for your YARN queue or user. Say you have 6 nodes, and each node has 16 cores and 64 GB RAM. Also, note the configuration of the edge node from where you will trigger the Spark job; as multiple jobs will be spawned from the same edge node, its resources can be a bottleneck too. 1 core and 1 GB are needed for the OS and Hadoop daemons. So, you have 6 machines with 15 cores and 63 GB…
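The arithmetic can be carried one step further. A common rule of thumb (an assumption, not something fixed by Spark itself) is about 5 cores per executor for good HDFS throughput, reserving one executor's worth of resources for the driver / Application Master:

```scala
// Cluster resources after reserving 1 core and 1 GB per node for OS / Hadoop daemons
val nodes        = 6
val coresPerNode = 15   // 16 - 1
val memPerNodeGB = 63   // 64 - 1

val coresPerExecutor = 5                                   // rule-of-thumb assumption
val executorsPerNode = coresPerNode / coresPerExecutor     // 3
val totalExecutors   = nodes * executorsPerNode - 1        // 17 (1 reserved for the driver/AM)
val memPerExecutorGB = memPerNodeGB / executorsPerNode     // 21
// Leave roughly 7% for YARN overhead (spark.yarn.executor.memoryOverhead)
val executorMemoryGB = (memPerExecutorGB * 0.93).toInt     // 19

println(s"--num-executors $totalExecutors --executor-cores $coresPerExecutor " +
        s"--executor-memory ${executorMemoryGB}G")
```

Treat these numbers as a starting point only; as noted above, skew, shuffle, and GC behavior observed at run time should drive the final tuning.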

Machine Learning Part 1

Machine learning uses algorithms to find patterns in data, and then uses a model that recognizes those patterns to make predictions on new data. Machine learning may be broken down into - Supervised learning algorithms use labeled data - classification, regression. Unsupervised learning algorithms find patterns in unlabeled data - clustering, collaborative filtering, frequent pattern mining. Semi-supervised learning uses a mixture of labeled and unlabeled data. Reinforcement learning trains algorithms to maximize rewards based on feedback. Classification - mail servers like Gmail use ML to classify whether an email is spam or not based on the data of an email: the sender, recipients, subject, and message body. Classification takes a set of data with known labels and learns how to label new records based on that information. For example - an item is important or not; a transaction is fraud or not, based upon known labeled examples of transactions which were classified fraud or…
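The "known labels" idea can be made concrete: each training record pairs features with an outcome. The fields below are purely illustrative, not a real fraud schema.

```scala
// A labeled training record: features plus a known outcome (the label)
case class Transaction(amount: Double, hourOfDay: Int, foreign: Boolean,
                       isFraud: Boolean) // isFraud is the label

val history = Seq(
  Transaction(12.50,  14, foreign = false, isFraud = false),
  Transaction(9800.0,  3, foreign = true,  isFraud = true)
)

// A classifier learns a mapping (amount, hourOfDay, foreign) -> isFraud from
// records like these, then assigns labels to new transactions where isFraud
// is unknown.
println(history.count(_.isFraud))
```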

Buzzwords - Deep learning, machine learning, artificial intelligence

Deep learning, machine learning, artificial intelligence – all buzzwords and representative of the future of analytics. The basic idea behind all these buzzwords is to provoke a review of your own data to identify new opportunities, in domains like retail, marketing, healthcare, telecommunication, and finance: demand forecasting, recommendation engines and targeting, predicting patient disease risk, customer churn, risk analytics, supply chain optimization, customer 360, diagnostics and alerts, system log analysis, pricing optimization, click-stream analysis, fraud and anomaly detection, market segmentation and targeting, social media analysis, preventive maintenance, credit scoring, recommendations, ad optimization, and smart meter analysis. While writing this blog, I realized that I have worked upon the highlighted use cases, but that work didn't involve all these buzzwords. The basic philosophy behind these things is Knowing the Unknown. Once you know the business…

Spark - Ways to Cache data

SQL CACHE TABLE, Dataframe.cache, and spark.catalog.cacheTable persist in both on-heap RAM and local SSDs when the MEMORY_AND_DISK strategy is used. You can inspect where the RDD partitions are stored (in memory or on disk) using the Spark UI. The in-memory portion is stored in a columnar format optimized for fast columnar aggregations and is automatically compressed to minimize memory and GC pressure. This cache should be considered scratch/temporary space, as the data will not survive a worker failure. dbutils.fs.cacheTable() and Table view -> Cache Table only persist to the local SSDs mounted at /local_disk. This cache will survive cluster restarts.
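The Spark-native options above can be sketched as follows, assuming Spark 2.x; the table and path names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
val df = spark.read.parquet("/data/events")

df.cache()      // lazy; Datasets default to MEMORY_AND_DISK
df.count()      // first action materializes the cache
df.unpersist()  // release it again

// An explicit storage level can be chosen instead of the default
df.persist(StorageLevel.MEMORY_AND_DISK_SER)

// The same columnar in-memory cache, addressed by table name
df.createOrReplaceTempView("events")
spark.catalog.cacheTable("events")
spark.sql("CACHE TABLE events")        // SQL equivalent (eager by default)

spark.catalog.uncacheTable("events")   // release when done
```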

Spark Datasets vs Dataframe vs SQL

Datasets are composed of typed objects, which means that transformation syntax errors (like a typo in a method name) and analysis errors (like an incorrect input variable type) can be caught at compile time. DataFrames are composed of untyped Row objects, which means that only syntax errors can be caught at compile time. Spark SQL is composed of a string, which means that syntax errors and analysis errors are only caught at runtime.

Error    | SQL      | DataFrames   | Datasets
Syntax   | Run time | Compile time | Compile time
Analysis | Run time | Run time     | Compile time

Also, note that Spark has encoders for all predefined basic data types like Int, String, etc. But, if required, we have to write a custom encoder to form a dataset of a typed custom object.
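The distinction can be illustrated with a small sketch, assuming Spark 2.x; the Person case class is a made-up example. Importing spark.implicits._ derives an encoder for the case class, which is what makes the typed Dataset possible.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TypedDemo").getOrCreate()
import spark.implicits._ // encoders for Int, String, case classes, ...

case class Person(name: String, age: Int)

val ds = Seq(Person("Asha", 34), Person("Ravi", 28)).toDS() // Dataset[Person]
// ds.filter(_.agee > 30)      // would NOT compile: typo caught at compile time
val adults = ds.filter(_.age > 30)

val df = ds.toDF()             // DataFrame = Dataset[Row], untyped
// df.select("agee")           // compiles fine; fails only at runtime (analysis error)

// spark.sql("SELEC * FROM p") // both syntax and analysis errors surface at runtime
```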