
Posts

Machine Learning Part 5

In this blog, we will describe another example that utilizes KMeans and Spark to determine locations. Before that, we suggest you go through the previous blogs:

https://querydb.blogspot.com/2019/12/machine-learning-part-4.html
https://querydb.blogspot.com/2019/12/machine-learning-part-3.html
https://querydb.blogspot.com/2019/12/machine-learning-part-2.html
https://querydb.blogspot.com/2019/12/machine-learning-part-1.html

In this blog, we will analyze and try to make predictions on Fire Detection GIS data - https://fsapps.nwcg.gov/gisdata.php. We will take historical data of wildfires and try to analyze it. That analysis can eventually help reduce response time in case of fire, reduce cost, reduce damage due to fire, etc. Fire can grow exponentially based on various factors like wildlife, wind velocity, terrain surface, etc. Incident tackle time is limited by various factors, one of which is moving firefighting equipment. If we plan in advance where to pl…

Machine Learning Part 4

In the previous blog, we learned about creating a K-Means clustering model. In this blog we will use the created model in a streaming use case for real-time analysis. For the previous blog, refer @ https://querydb.blogspot.com/2019/12/machine-learning-part-3.html

1) Load the model created in the previous blog.
2) Create a dataframe with cluster id and centroid location (centroid longitude, centroid latitude).
3) Create a Kafka streaming dataframe.
4) Parse the message into a typed object.
5) Use VectorAssembler to put all features into a vector.
6) Transform the dataframe using the model to get predictions.
7) Join with the dataframe created in #2.
8) Print the results to console or save them to HBase.

Note that this example also describes Spark Structured Streaming, wherein we created a streaming Kafka source and a custom foreach sink to write data to HBase. Refer code @ https://github.com/dinesh028/SparkDS/tree/master/src/indore/dinesh/sachdev/uber/streaming
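To make steps 4 and 7 of the pipeline above concrete, here is a minimal, self-contained sketch in plain Python (the actual repo code is Scala/Spark; the message layout, field names, and centroid values below are assumptions for illustration only):

```python
from dataclasses import dataclass

# Hypothetical message layout: "dt,lat,lon,base" (an assumption, not the repo's schema)
@dataclass
class Uber:
    dt: str
    lat: float
    lon: float
    base: str

def parse(value: str) -> Uber:
    """Step 4: parse a Kafka message value into a typed object."""
    dt, lat, lon, base = value.split(",")
    return Uber(dt, float(lat), float(lon), base)

# Step 2's dataframe, modeled here as a simple lookup:
# cluster id -> (centroid longitude, centroid latitude); illustrative values only
CENTROIDS = {0: (-73.98, 40.75), 1: (-73.78, 40.64)}

def join_with_centroid(record: Uber, predicted_cluster: int):
    """Step 7: join a model prediction with the centroid table."""
    clon, clat = CENTROIDS[predicted_cluster]
    return record, clon, clat
```

In the real pipeline these steps run on streaming dataframes and the cluster id comes from the loaded model's transform; this sketch only shows the data flow from raw message to enriched record.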

Machine Learning Part 3

Refer code @ https://github.com/dinesh028/SparkDS/blob/master/src/indore/dinesh/sachdev/uber/UberClusteringDriver.scala

Nowadays, machine learning is helping to improve cities. The analysis of location and behavior patterns within cities allows optimization of traffic, better planning decisions, and smarter advertising. For example, GPS data can be analyzed to optimize traffic flow, and many companies use it for field-technician optimization. It can also be used for recommendations, anomaly detection, and fraud detection. Uber is using the same to optimize customer experience - https://www.datanami.com/2015/10/05/how-uber-uses-spark-and-hadoop-to-optimize-customer-experience/

In this blog, we will look at clustering and the k-means algorithm, and their usage to analyze public Uber data. Clustering is a family of unsupervised machine learning algorithms that discover groupings occurring in collections of data by analyzing similarities between input examples. Some examples of clustering uses inc…
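To illustrate how k-means actually works, here is a minimal, self-contained sketch of Lloyd's algorithm on 2D points in plain Python. This is for illustration only; the blog's repo uses Spark MLlib's KMeans, not this code:

```python
def dist2(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(points):
    """Mean (component-wise average) of a list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster; repeat."""
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: dist2(p, c))
            clusters[nearest].append(p)
        # Keep a centroid unchanged if its cluster is empty
        centroids = [mean(clusters[c]) if clusters[c] else c for c in centroids]
    return centroids
```

For example, clustering the points (0,0), (0,1), (10,10), (10,11) with two starting centroids converges to one centroid per obvious group. Spark MLlib does essentially this, but distributed over partitions of a dataframe.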

Machine Learning Part 2

Refer code @ https://github.com/dinesh028/SparkDS/blob/master/src/indore/dinesh/sachdev/FlightDelayDriver.scala

This post talks about predicting flight delays. There is growing interest in predicting flight delays beforehand in order to optimize operations and improve customer satisfaction. We will use historic flight status data with the Random Forest classifier algorithm to find common patterns in late departures, in order to predict flight delays and share the reasons for those delays. And we will be using Apache Spark for the same.

As mentioned in Part 1, classification is a kind of supervised learning which requires features ("if questions") and labels (outcomes) in advance to build the model.

What are we trying to predict (label)? Whether a flight will be delayed or not (TRUE or FALSE).

What are the "if questions" or properties that you can use to make predictions? What is the originating airport? What is the destination airport? What is the scheduled time…
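The feature/label framing above can be sketched as follows. This is a hypothetical, plain-Python illustration (the field names and the 40-minute delay threshold are assumptions, not the repo's exact Scala code) of turning a flight record into a feature vector plus a boolean label:

```python
from dataclasses import dataclass

# Hypothetical flight record; field names are illustrative, not the repo's schema
@dataclass
class Flight:
    origin: str
    dest: str
    scheduled_dep_hour: int
    carrier: str
    dep_delay_minutes: float

# Assumption for illustration: a flight counts as "delayed" beyond 40 minutes
DELAY_THRESHOLD_MINUTES = 40.0

def label(f: Flight) -> bool:
    """Label: True if the flight was delayed, False otherwise."""
    return f.dep_delay_minutes > DELAY_THRESHOLD_MINUTES

def features(f: Flight, airports, carriers):
    """Features ("if questions"): encode categorical fields as indices and
    keep numeric ones as-is. A real Spark ML pipeline would use
    StringIndexer and VectorAssembler for this encoding."""
    return [float(airports.index(f.origin)),
            float(airports.index(f.dest)),
            float(f.scheduled_dep_hour),
            float(carriers.index(f.carrier))]
```

The Random Forest classifier is then trained on many such (features, label) pairs and learns which combinations of "if questions" are associated with delays.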

How to determine Spark JOB Resources over YARN

This is a question which I believe is not answerable in a direct way, because it depends upon various run-time factors:

- Data distribution
- Skewed data
- Operations that we are performing (join, windowing, etc.)
- Shuffle time, GC overhead, etc.

A run-time analysis of DAG & job execution helps us tune optimal resources for a Spark job. But we will try to provide a very basic answer to this question in this blog. Note that it is a run-time behavior, so the answer may not fit all use cases.

Suppose you have a multi-tenant cluster and you determine how much hardware resource is available for your YARN queue or user. Say you have 6 nodes, and each node has 16 cores and 64 GB RAM. Also, note the configuration of your edge node from where you will trigger the Spark job. As multiple jobs will be spawned from the same edge node, resources of the edge node can be a bottleneck too.

1 core and 1 GB are needed for the OS and Hadoop daemons. So, you have 6 machines with 15 cores and 63 GB…
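Continuing the arithmetic above, a common rule-of-thumb sizing can be computed like this. This is a sketch under stated assumptions (roughly 5 cores per executor for good task/HDFS throughput, one executor's worth of resources reserved for the YARN Application Master, and a few percent set aside for YARN memory overhead), not a universal answer:

```python
# Cluster from the example above, after reserving 1 core + 1 GB per node
# for the OS and Hadoop daemons.
NODES = 6
USABLE_CORES_PER_NODE = 15    # 16 - 1
USABLE_MEM_PER_NODE_GB = 63   # 64 - 1

# Assumption: ~5 cores per executor is a common rule of thumb,
# not a hard Spark limit.
CORES_PER_EXECUTOR = 5

executors_per_node = USABLE_CORES_PER_NODE // CORES_PER_EXECUTOR    # 3
total_executors = NODES * executors_per_node - 1                    # reserve 1 for the YARN AM
mem_per_executor_gb = USABLE_MEM_PER_NODE_GB // executors_per_node  # 21
# Assumption: leave ~7% for YARN memory overhead
# (spark.executor.memoryOverhead)
executor_memory_gb = int(mem_per_executor_gb * 0.93)
```

With these assumptions the cluster yields 17 executors with about 19 GB each (`--num-executors 17 --executor-cores 5 --executor-memory 19G`), which you would then refine by observing the actual DAG, shuffle, and GC behavior at run time.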