Posts

Showing posts from June, 2017

SPARK running explained - 3

Previous parts: SPARK running explained - 2, SPARK running explained - 1

1. YARN Cluster Manager

The basic YARN architecture is described below; it is similar to the Spark Standalone cluster manager. Main components:

a. Resource Manager (analogous to the Spark master process)
b. Node Manager (analogous to Spark's worker processes)

Unlike applications running on a Spark standalone cluster, applications on YARN run in containers (JVM processes to which CPU and memory resources are granted). Each application has an "Application Master" running in its own container; it is responsible for requesting the application's resources from the Resource Manager. Node Managers track the resources used by containers and report them to the Resource Manager.

The following depicts a Spark application (cluster-deploy mode) running on a YARN cluster with 2 nodes:

1. The client submits the application to the Resource Manager.
2. The Resource Manager asks one Node Manager to allocate a container for the Application Master.
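As a rough illustration of cluster-deploy mode on YARN, here is a minimal sketch that submits an application programmatically with Spark's SparkLauncher API (the jar path, main class, and resource values are placeholders, not settings from this post; the usual alternative is the spark-submit script):

```scala
import org.apache.spark.launcher.SparkLauncher

object SubmitToYarn {
  def main(args: Array[String]): Unit = {
    // Launch on YARN in cluster-deploy mode: the driver runs inside the
    // Application Master container allocated by the Resource Manager.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/my-spark-app.jar") // placeholder jar
      .setMainClass("com.example.MySparkApp")      // placeholder main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setConf("spark.executor.instances", "2")    // e.g. one executor container per node
      .setConf("spark.executor.memory", "1g")
      .startApplication()

    // Poll the application state reported back through the launcher handle.
    while (!handle.getState.isFinal) {
      Thread.sleep(1000)
    }
    println(s"Final state: ${handle.getState}")
  }
}
```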

SPARK running explained - 2

Previous part: SPARK running explained - 1

Speculative execution is enabled by setting:

1. "spark.speculation" to true (the default is false).
2. "spark.speculation.interval" - the interval at which Spark checks whether any tasks need to be restarted.
3. "spark.speculation.quantile" - the percentage of a stage's tasks that must complete before speculation is started for that stage.
4. "spark.speculation.multiplier" - how many times slower than the median a task must be before it is speculatively restarted.

Data locality means Spark tries to run tasks as close to the data as possible. There are five levels of data locality:

1. PROCESS_LOCAL - execute the task on the executor that cached the partition.
2. NODE_LOCAL - execute the task on the node where the partition is available.
3. RACK_LOCAL - execute the task on the same rack as the partition; rack information is available from the YARN cluster.
4. NO_PREF - no preferred locations are associated with the task.
5. ANY - the default fallback; the task can run anywhere in the cluster.
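As a configuration sketch tying these together, the speculation properties can be set on a SparkConf before the application starts. The values below are illustrative, not recommendations, and spark.locality.wait (how long Spark waits for a better locality level before falling back) is an addition not mentioned in the post:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; tune for your own cluster.
val conf = new SparkConf()
  .setAppName("speculation-demo")
  // Restart (speculate) slow tasks on other executors.
  .set("spark.speculation", "true")
  // Check for tasks to speculate every 100 ms.
  .set("spark.speculation.interval", "100ms")
  // Start speculating only after 75% of a stage's tasks have finished.
  .set("spark.speculation.quantile", "0.75")
  // A task must be 1.5x slower than the median task to be restarted.
  .set("spark.speculation.multiplier", "1.5")
  // Not covered in the post: wait time per locality level
  // (PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY) before falling back.
  .set("spark.locality.wait", "3s")

val sc = new SparkContext(conf)
```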

SPARK running explained - 1

Spark runtime components

The main Spark components running in a cluster are the client, the driver, and the executors.

The client process starts the driver program. It can be spark-submit, spark-shell, spark-sql, or a custom application. The client process:

1. Prepares the classpath and all configuration options for the Spark application.
2. Passes application arguments to the application running in the driver.

There is always one driver per Spark application. The driver orchestrates and monitors the execution of the application. Its subcomponents are:

1. The Spark context
2. The scheduler

These subcomponents are responsible for:

1. Requesting memory and CPU resources from the cluster manager
2. Breaking the application logic into stages and tasks
3. Sending tasks to executors
4. Collecting the results

The driver program can run in two ways:

1. Cluster-deploy mode - the driver runs as a separate JVM process inside the cluster, and the cluster manager manages its resources.
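To make the driver's role concrete, below is a minimal driver program sketch (the data and partition counts are made up for illustration). Creating the SparkContext is what connects to the cluster manager and requests executor resources; calling an action is what makes the scheduler break the job into stages and tasks, ship them to executors, and collect the results back in the driver:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverDemo {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext requests CPU and memory for executors
    // from the cluster manager on the application's behalf.
    val conf = new SparkConf().setAppName("driver-demo")
    val sc = new SparkContext(conf)

    // Transformations only build a lineage; nothing runs yet.
    val counts = sc
      .parallelize(1 to 1000000, numSlices = 8) // 8 partitions -> 8 tasks per stage
      .map(_ % 10)
      .map(k => (k, 1))
      .reduceByKey(_ + _)                       // shuffle boundary -> a second stage

    // The action triggers the scheduler: stages and tasks are created,
    // sent to executors, and the results are collected back to the driver.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```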