Refer to the code at https://github.com/dinesh028/SparkDS/blob/master/src/indore/dinesh/sachdev/uber/UberClusteringDriver.scala
Nowadays, machine learning is helping to improve cities. Analyzing location and behavior patterns within cities enables traffic optimization, better planning decisions, and smarter advertising. For example, GPS data can be analyzed to optimize traffic flow, and many companies use it to optimize field-technician routing. It can also be used for recommendations, anomaly detection, and fraud detection.
Uber uses the same approach to optimize the customer experience - https://www.datanami.com/2015/10/05/how-uber-uses-spark-and-hadoop-to-optimize-customer-experience/
In this blog, we will look at clustering and the k-means algorithm, and use them to analyze public Uber data.
Clustering is a family of unsupervised machine learning algorithms that discover groupings that occur in collections of data by analyzing similarities between input examples. Some examples of clustering uses include customer segmentation and text categorization.
It is a process that can be understood as -
- Analyze the input data to find patterns.
- Train the algorithm.
- Build a model that recognizes the patterns and segments the data.
- Use the model to identify similar segments in new data.
K-means is one of the most commonly used clustering algorithms; it clusters data points into a predefined number of clusters (k). The algorithm begins by choosing k initial centroids.
With every pass of the algorithm, each point is assigned to its nearest centroid, based on some distance metric, usually Euclidean distance. Each centroid is then updated to be the center (mean) of all the points assigned to it in that pass.
This repeats until the change in the centers falls below a threshold, i.e., until the clusters converge.
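The assign/update loop described above can be sketched in a few lines of plain Scala. This is an illustrative one-dimensional toy, not the Spark code used later; the points and the choice of k = 2 initial centroids are made up for the example.

```scala
object KMeansSketch {
  // Squared Euclidean distance in one dimension.
  def dist(a: Double, b: Double): Double = (a - b) * (a - b)

  // Run k-means until no centroid moves more than `tol`.
  def kmeans(points: Seq[Double], init: Seq[Double], tol: Double = 1e-9): Seq[Double] = {
    var centroids = init
    var moved = Double.MaxValue
    while (moved > tol) {
      // Assignment step: group each point with its nearest centroid.
      val assigned = points.groupBy(p => centroids.minBy(c => dist(p, c)))
      // Update step: move each centroid to the mean of its assigned points
      // (a centroid with no points keeps its position).
      val updated = centroids.map(c =>
        assigned.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
      moved = centroids.zip(updated).map { case (a, b) => dist(a, b) }.max
      centroids = updated
    }
    centroids.sorted
  }
}
```

For the points 1, 2, 10, 11 and initial centroids 0 and 5, the loop converges to the two cluster centers 1.5 and 10.5.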
First, we read the sample data.
Second, we use a VectorAssembler to combine the features into a single feature vector, a vector of numbers representing the value of each feature.
Third, we transform the DataFrame we read, adding a column for the feature vector.
Fourth, we create a k-means estimator and set its parameters: the number of clusters and the column name for the cluster IDs. Then we call the estimator's fit method on the VectorAssembler-transformed DataFrame to train and return a k-means model.
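Steps one to four can be sketched as follows with the Spark ML API. The file path, the column names ("lat", "lon"), and k = 8 are assumptions for illustration; adjust them to your copy of the Uber data.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("UberClustering").getOrCreate()

// First: read the sample data (path is illustrative).
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/uber.csv")

// Second and third: assemble the lat/lon columns into a single
// "features" vector column on a new DataFrame.
val assembler = new VectorAssembler()
  .setInputCols(Array("lat", "lon"))
  .setOutputCol("features")
val featureDf = assembler.transform(df)

// Fourth: create the k-means estimator, setting the number of clusters
// and the output column for cluster IDs, then fit to get a KMeansModel.
val kmeans = new KMeans()
  .setK(8)
  .setFeaturesCol("features")
  .setPredictionCol("cid")
val model = kmeans.fit(featureDf)
```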
Fifth, we use the model to get a DataFrame with cluster IDs. This DataFrame can be queried with SQL to compute meaningful statistics. For example -
- Which clusters had the highest number of pickups?
- Which hours of the day had the highest number of pickups?
- Which hours of the day and which cluster had the highest number of pickups?
- Which clusters had the highest number of pickups during morning rush hour?
- Which clusters had the highest number of pickups during evening rush hour?
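With the model's predictions registered as a temp view, the questions above become simple SQL aggregations. This sketch assumes the fitted model `model` and assembled DataFrame `featureDf` from the fourth step, a cluster-ID column named "cid", and a timestamp column named "dt" in the pickup data.

```scala
// Fifth: add the cluster-ID column and expose the result to SQL.
val clusteredDf = model.transform(featureDf)
clusteredDf.createOrReplaceTempView("uber")

// Which clusters had the highest number of pickups?
spark.sql(
  "SELECT cid, COUNT(*) AS pickups FROM uber GROUP BY cid ORDER BY pickups DESC").show()

// Which hours of the day and which cluster had the highest number of pickups?
spark.sql(
  """SELECT hour(dt) AS hr, cid, COUNT(*) AS pickups
     FROM uber GROUP BY hour(dt), cid ORDER BY pickups DESC""").show()

// Which clusters had the highest number of pickups during morning rush hour
// (here assumed to be 6-9 AM)?
spark.sql(
  """SELECT cid, COUNT(*) AS pickups FROM uber
     WHERE hour(dt) BETWEEN 6 AND 9 GROUP BY cid ORDER BY pickups DESC""").show()
```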
Sixth, we can save the model.
Seventh, we can load the saved model and use it in production.
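Steps six and seven can be sketched as below: persist the fitted model, then reload it later (for example, in a production scoring job). The save path is illustrative, and `model` and `featureDf` are assumed from the earlier steps.

```scala
import org.apache.spark.ml.clustering.KMeansModel

// Sixth: save the fitted model (overwriting any previous version).
model.write.overwrite().save("/models/uber-kmeans")

// Seventh: load the saved model and score data with it.
val loaded = KMeansModel.load("/models/uber-kmeans")
val scored = loaded.transform(featureDf) // adds the "cid" cluster-ID column
```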