Posts

Showing posts from 2017

Hive Analytics Functions - rank() vs dense_rank() vs percent_rank() vs row_number() vs cume_dist()

RANK - Rank of each row within the partition of the result set.
DENSE_RANK - Similar to RANK, but with no gaps in the ranking.
PERCENT_RANK - Relative rank of a row within its group of rows.
ROW_NUMBER - Sequential number of a row within the partition of a result set.
CUME_DIST - For a row r, the number of rows with a value lower than or equal to the value of r, divided by the number of rows evaluated in the partition.
Practice -
hive> create table test (v string) row format delimited fields terminated by ',';
hive> alter table test add columns (t string);
hive> load data local inpath '/root/test' overwrite into table test;
The test data in the local file looks like below -
a,1
a,2
a,3
a,1
a,2
b,1
c,1
c,2
d,1
e,1
Execute the query below and analyze the result -
hive> select v, t, rank() over (partition by v), dense_rank() over (partition by v), row_number() over (partition by v), percent_rank() over (partition by v), cume_dist() over (partition by v ...
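The same window functions exist in Spark SQL, so here is a minimal Scala sketch reproducing the example above. The object name WindowDemo is made up, and I have assumed an ORDER BY on t inside each window, since ranking is only meaningful with an ordering -

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, dense_rank, row_number, percent_rank, cume_dist}

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WindowDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    // the same sample data as the hive> example above
    val df = Seq(("a","1"), ("a","2"), ("a","3"), ("a","1"), ("a","2"),
                 ("b","1"), ("c","1"), ("c","2"), ("d","1"), ("e","1")).toDF("v", "t")

    // ranking functions need an ordering to be meaningful; assuming ORDER BY t here
    val w = Window.partitionBy("v").orderBy("t")

    df.select($"v", $"t",
      rank().over(w).as("rank"),
      dense_rank().over(w).as("dense_rank"),
      row_number().over(w).as("row_number"),
      percent_rank().over(w).as("percent_rank"),
      cume_dist().over(w).as("cume_dist")
    ).show()

    spark.stop()
  }
}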

SPARK running explained - 3

SPARK running explained - 2
SPARK running explained - 1
1. YARN Cluster Manager - The basic YARN architecture is described below; it is similar to the Spark Standalone cluster manager. Main components -
a. Resource Manager - (just like the Spark master process)
b. Node Manager - (similar to Spark's worker processes)
Unlike on Spark's standalone cluster, applications on YARN run in containers (JVM processes to which CPU and memory resources are granted). There is an "Application Master" for each application, running in its own container; it is responsible for requesting application resources from the Resource Manager. Node Managers track the resources used by containers and report them to the Resource Manager. Below depicts a Spark application (cluster-deploy mode) running on a YARN cluster with 2 nodes -
1. Client submits the applic...
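As a rough sketch of how an application asks YARN for containers, the resource settings can be supplied when building the session (paste into spark-shell to try it). The app name and values below are illustrative assumptions, not recommendations -

import org.apache.spark.sql.SparkSession

// each executor runs in its own YARN container with the memory and cores requested here
val spark = SparkSession.builder()
  .appName("YarnDemo")
  .config("spark.executor.instances", "2")  // number of executor containers to request
  .config("spark.executor.memory", "2g")    // memory granted to each container
  .config("spark.executor.cores", "2")      // CPU cores granted to each container
  .getOrCreate()

Note that in cluster-deploy mode the master and deploy mode are normally given on the spark-submit command line (--master yarn --deploy-mode cluster) rather than in code.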

SPARK running explained - 2

SPARK running explained - 1
Set up speculative execution by setting -
1. "spark.speculation" to true; the default is false.
2. "spark.speculation.interval" - the interval at which Spark checks whether any task needs to be restarted.
3. "spark.speculation.quantile" - the percentage of tasks that must complete before speculation is started for a stage.
4. "spark.speculation.multiplier" - how many times slower than the median a task must be running before it is restarted speculatively.
A sketch of these four settings in code follows the locality list below. Data locality means Spark tries to run tasks as close to the data location as possible. Five levels of data locality -
1. PROCESS_LOCAL - Execute a task on the executor that cached the partition.
2. NODE_LOCAL - Execute a task on the node where the partition is available.
3. RACK...
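Here is a minimal sketch of setting the four speculation properties from the list above via SparkConf. The values are illustrative only; check the defaults for your Spark version -

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // turn speculative execution on
  .set("spark.speculation.interval", "100ms")  // how often Spark checks for slow tasks
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish first
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median a task must be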

SPARK running explained - 1

Spark runtime components
The main Spark components running in a cluster are the client, the driver, and the executors. The client process starts the driver program; it can be spark-submit, spark-shell, spark-sql, or a custom application. The client process:
1. Prepares the classpath and all configuration options for the Spark application
2. Passes application arguments to the application running in the driver
There is always one driver per Spark application. The driver orchestrates and monitors the execution of an application. Subcomponents of the driver -
1. Spark context
2. Scheduler
These subcomponents are responsible for -
1. Requesting memory and CPU resources from cluster managers
2. Breaking the application logic into stages and tasks
3. Se...
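As a minimal sketch of this flow (the object name and file path are placeholders): the client, e.g. spark-submit, launches the main method below; the SparkSession/SparkContext it creates lives in the driver; and the shuffle at reduceByKey is what splits the job into two stages of tasks for the executors -

import org.apache.spark.sql.SparkSession

object DriverDemo {
  def main(args: Array[String]): Unit = {
    // this code runs in the driver; the Spark context is one of its subcomponents
    val spark = SparkSession.builder().appName("DriverDemo").getOrCreate()
    val sc = spark.sparkContext

    // the shuffle at reduceByKey forms a stage boundary; the driver's scheduler
    // breaks each stage into tasks and ships them to the executors
    val counts = sc.textFile("hdfs:///tmp/input.txt")  // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}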

Scala - Scalable Language

Scala, short for Scalable Language -
•             Created by Martin Odersky
•             Is an object-oriented & functional programming language
•             Runs on the JVM
Installations -
•             Install Java
•             Set your Java environment, e.g. JAVA_HOME, PATH, etc.
•             Install Scala
•             After installation, verify the versions by typing the following at a command prompt or shell:
>scala -version
>java -version
If you have a good understanding of Java, it will be very easy for you to learn Scala. But we would again des...
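As a tiny sketch of the "object-oriented & functional" point (all names below are made up for illustration) -

// a case class shows the object-oriented side; filter/map show the functional side
case class Person(name: String, age: Int)

object Hello {
  def main(args: Array[String]): Unit = {
    val people = List(Person("Ada", 36), Person("Alan", 41))
    // filter and map take functions as arguments - Scala's functional side
    val names = people.filter(_.age > 40).map(_.name)
    println(names)  // prints List(Alan)
  }
}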