
Posts

Spark integration with Apache Kudu

The following code depicts the integration of Kudu and Spark. Refer to https://github.com/dinesh028/engineering/blob/master/resources/samples/Spark-Kudu-integration-code.txt

//spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.10.0
import org.apache.kudu.client._
import org.apache.kudu.spark.kudu.KuduContext
import collection.JavaConverters._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

val arr = new java.util.ArrayList[Row]()
arr.add(Row("jai", "ganesh"))
val arraySchema = new StructType().add("id", StringType, false).add("name", StringType, true)
val df = spark.createDataFrame(arr, arraySchema)
df.printSchema

val kuduContext = new KuduContext("mymaster.devhadoop.wm.com:7051", spark.sparkContext)
// This will create the table but will not insert any data
kuduContext.createTable("ds.my_test_table"
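The excerpt cuts the createTable call off mid-statement. As a hedged completion sketch (the primary-key column, hash partitioning, and replica count below are illustrative assumptions, not taken from the original post), the full call typically looks like this:

import org.apache.kudu.client.CreateTableOptions
import scala.collection.JavaConverters._

// Assumed completion: "id" as the key column; partitioning and replication are illustrative choices
if (!kuduContext.tableExists("ds.my_test_table")) {
  kuduContext.createTable(
    "ds.my_test_table",
    arraySchema,                                 // reuse the StructType defined above
    Seq("id"),                                   // Kudu requires at least one primary-key column
    new CreateTableOptions()
      .addHashPartitions(List("id").asJava, 2)   // Kudu tables must declare a partitioning scheme
      .setNumReplicas(3))
}

// Insert the DataFrame rows into the newly created table
kuduContext.insertRows(df, "ds.my_test_table")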

Error integrating Impala with Kudu

Create Table Failed

W0106 18:18:54.640544 368440 negotiation.cc:307] Unauthorized connection attempt: Server connection negotiation failed: server connection from 172.136.38.157:35678: unauthenticated connections from publicly routable IPs are prohibited. See --trusted_subnets flag for more information.: 172.136.38.157:35678

After setting up Kudu, we can enable it to work with Impala. We can check the cluster status with:

kudu cluster ksck <master>

The cluster doesn't have any matching tables
==================
Errors:
==================
error fetching info from tablet servers: Not found: No tablet servers found
FAILED

Also, the tablet server UI does not open.

Solution: This error occurs because the Kudu service has to be told which networks are trusted, which we can set via "Kudu Service Advanced Configuration Snippet (Safety Valve) for gflagfile" at the Kudu (Service-Wide) scope:

--trusted_subnets=127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,169.254.0
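After adding the flag and restarting the Kudu service, cluster health can be re-checked with ksck. The master host below is borrowed from the Spark-Kudu post above and is only an assumption for illustration:

kudu cluster ksck mymaster.devhadoop.wm.com:7051

Once the tablet servers register, the tablet server web UI should open as well.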

Machine Learning Part 6

In this blog we will see:
1. Spark GraphFrames
2. Use of GraphFrames in Machine Learning to execute algorithms like PageRank

A graph is made up of vertices and edges that connect them.
1. Vertices are objects.
2. Edges are relationships.

A regular graph is a graph where each vertex has the same number of edges. A directed graph is a graph where the edges have a direction associated with them. Examples:
1. Facebook friends: A is a friend of B, and B is a friend of A.
2. Instagram followers: A, B, and C are followers of D, but D may not be a follower of A, B, or C.
3. Websites: every page is a node and every linking page is an edge. The PageRank algorithm measures the importance of a page by the number of links to a page and the number of links to each linking page.
4. Recommendation engines: recommendation algorithms can use graphs where the nodes are the users and products, with their respective attributes, and the edges ar
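A minimal GraphFrames PageRank sketch in Spark (Scala) is shown below. The graphframes package coordinates and the sample vertices and edges are assumptions for illustration, not taken from the original post:

// spark-shell --packages graphframes:graphframes:0.8.1-spark2.4-s_2.11  (pick the build matching your Spark/Scala version)
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

// Illustrative vertices (objects) and directed edges (relationships)
val vertices = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol"), ("d", "Dave")
)).toDF("id", "name")

val edges = spark.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "c", "follows"),
  ("c", "a", "follows"), ("d", "a", "follows")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

// PageRank: a vertex is important if many (important) vertices link to it
val results = g.pageRank.resetProbability(0.15).maxIter(10).run()
results.vertices.select("id", "pagerank").orderBy(desc("pagerank")).show()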

Machine Learning Part 5

In this blog, we will describe another example that uses KMeans and Spark to determine locations. Before that, we suggest you go through the previous blogs:
https://querydb.blogspot.com/2019/12/machine-learning-part-4.html
https://querydb.blogspot.com/2019/12/machine-learning-part-3.html
https://querydb.blogspot.com/2019/12/machine-learning-part-2.html
https://querydb.blogspot.com/2019/12/machine-learning-part-1.html

In this blog, we will analyze and try to make predictions on fire detection GIS data: https://fsapps.nwcg.gov/gisdata.php

We will have historical data of wildfires, and we will try to analyze it. That is eventually helpful to reduce response time in case of fire, reduce cost, reduce damages due to fire, etc. Fire can grow exponentially based on various factors like wildlife, wind velocity, terrain surface, etc. Incident tackle time is limited by various factors, one of which is moving firefighting equipment. If we plan in advance where to pl
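As a hedged sketch of the approach (the CSV path, column names, and the value of k are assumptions, not taken from the original post), KMeans clustering over the historical fire coordinates might look like this in Spark:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assumed: a CSV export of the GIS data with "latitude" and "longitude" columns
val fires = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/data/fire_detection_gis.csv")
  .select("latitude", "longitude")
  .na.drop()

// Put the coordinate columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("latitude", "longitude"))
  .setOutputCol("features")
val features = assembler.transform(fires)

// Cluster historical fire locations; k = 50 is an illustrative tuning choice
val kmeans = new KMeans().setK(50).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(features)

// Cluster centres suggest candidate locations for pre-positioning firefighting equipment
model.clusterCenters.foreach(println)

// Persist the model for reuse (e.g., in the streaming example of Part 4)
model.write.overwrite().save("/models/fire_kmeans")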

Machine Learning Part 4

In the previous blog, we learned about creating a K-Means clustering model. In this blog we will use the created model in a streaming use case for analysis in real time. For the previous blog, refer to https://querydb.blogspot.com/2019/12/machine-learning-part-3.html

1) Load the model created in the previous blog.
2) Create a dataframe with cluster id and centroid location (centroid longitude, centroid latitude).
3) Create a Kafka streaming dataframe.
4) Parse the message into a typed object.
5) Use VectorAssembler to put all features into a vector.
6) Transform the dataframe using the model to get predictions.
7) Join with the dataframe created in #2.
8) Print the results to the console or save them to HBase.

Note that this example also describes Spark Structured Streaming, where-in we created a streaming Kafka source and a custom foreach sink to write data to HBase. Refer to the code at https://github.com/dinesh028/SparkDS/tree/master/src/indore/dinesh/sachdev/uber/streaming
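Below is a hedged sketch of these steps. The Kafka broker, topic name, message format, model path, and column names are assumptions for illustration, and the sink here is the console rather than the custom HBase foreach sink used in the post:

import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._
import spark.implicits._

// 1) Load the model created in the previous blog (path is an assumption)
val model = KMeansModel.load("/models/fire_kmeans")

// 2) Dataframe of cluster id and centroid location (assuming features were [latitude, longitude])
val centroids = spark.createDataFrame(
  model.clusterCenters.zipWithIndex.toSeq.map { case (c, i) => (i, c(1), c(0)) }
).toDF("prediction", "centroid_longitude", "centroid_latitude")

// 3) Kafka streaming dataframe
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "fire_events")
  .load()

// 4) Parse the message; an assumed CSV payload of "id,latitude,longitude"
val parsed = kafkaDf.selectExpr("CAST(value AS STRING) AS value")
  .select(split($"value", ",").as("parts"))
  .select($"parts"(0).as("id"),
          $"parts"(1).cast("double").as("latitude"),
          $"parts"(2).cast("double").as("longitude"))

// 5) Assemble features, 6) transform with the model, 7) join with the centroid dataframe
val assembler = new VectorAssembler()
  .setInputCols(Array("latitude", "longitude"))
  .setOutputCol("features")
val predictions = model.transform(assembler.transform(parsed)).join(centroids, "prediction")

// 8) Print results to the console (the original code writes to HBase via a custom foreach sink)
val query = predictions.writeStream.outputMode("append").format("console").start()
query.awaitTermination()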