Spark runtime components
The main Spark components running in a cluster are the client, the driver, and the executors.
The client process starts the driver program. It can be spark-submit, spark-shell, spark-sql, or a custom application. The client process:
1. Prepares the classpath and all configuration options for the Spark application
2. Passes application arguments to the application running in the driver
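For illustration, here is a minimal sketch of a client process launching an application programmatically through Spark's SparkLauncher API; the jar path, main class, and argument are placeholders, not real artifacts.

import org.apache.spark.launcher.SparkLauncher

object ClientProcess {
  def main(args: Array[String]): Unit = {
    // The client prepares configuration options and application arguments,
    // then starts the driver through spark-submit behind the scenes.
    val app = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")    // placeholder application jar
      .setMainClass("com.example.MyApp")        // placeholder main class
      .setMaster("yarn")
      .setDeployMode("client")
      .setConf("spark.executor.memory", "2g")   // configuration prepared by the client
      .addAppArgs("input-path")                 // argument passed on to the driver program
      .launch()                                 // returns the spark-submit child process
    app.waitFor()                               // wait for the application to finish
  }
}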
There is always one driver per Spark application. The driver orchestrates and monitors the execution of an application. Subcomponents of the driver:
1. Spark context
2. Scheduler
These subcomponents are responsible for:
1. Requesting memory and CPU resources from the cluster manager
2. Breaking the application logic into stages and tasks (illustrated in the sketch below)
3. Sending tasks to the executors
4. Collecting the results
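As a rough illustration (the data and names are made up), the small word-count job below is split by the driver into two stages at the shuffle introduced by reduceByKey, and each stage into one task per partition; the collect() action triggers the whole thing and brings the results back to the driver.

import org.apache.spark.sql.SparkSession

object StagesAndTasks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stages-demo").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    // Stage 1: the read, flatMap, and map run as one pipeline of tasks, one per partition.
    val counts = sc.parallelize(Seq("a b", "b c", "a a"), numSlices = 2)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      // reduceByKey forces a shuffle, so a second stage (with new tasks) starts here.
      .reduceByKey(_ + _)

    // collect() is an action: the scheduler builds the stages, sends tasks to
    // the executors, and the driver collects the results.
    println(counts.collect().toSeq)
    spark.stop()
  }
}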
The driver program can run in two ways:
1. Cluster-deploy mode – the driver runs as a separate JVM process in the cluster, and the cluster manager manages its resources.
2. Client-deploy mode – the driver runs in the client's JVM process and communicates with the executors managed by the cluster.
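If it helps to see which mode an application ended up in, the running driver can read it back from its configuration. This sketch assumes the application was started via spark-submit, which records the chosen mode under spark.submit.deployMode.

import org.apache.spark.sql.SparkSession

object DeployModeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("deploy-mode-check").getOrCreate()
    // "client" is used here only as a fallback if the key is absent (e.g. in local runs).
    val mode = spark.sparkContext.getConf.get("spark.submit.deployMode", "client")
    println(s"Driver is running in $mode mode")
    spark.stop()
  }
}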
The executors are the JVM processes that:
1. Accept tasks from the driver
2. Execute those tasks
3. Return the results to the driver
Each executor has several task slots for running tasks in parallel. Although these task slots are often referred to as CPU cores in Spark, they are implemented as threads and don't have to correspond to the number of physical CPU cores on the machine (see the configuration sketch below).
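The number of task slots per executor is set by configuration rather than by the hardware. A minimal sketch follows, assuming submission to a YARN cluster; the sizes are examples only.

import org.apache.spark.sql.SparkSession

object ExecutorSlots {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("executor-slots-demo")
      .master("yarn")                            // assumes a reachable YARN cluster
      .config("spark.executor.instances", "3")   // number of executor JVMs (YARN option)
      .config("spark.executor.cores", "4")       // task slots (threads) per executor
      .config("spark.executor.memory", "2g")     // heap allocated to each executor JVM
      .getOrCreate()

    // defaultParallelism typically reflects the total number of task slots available.
    println(spark.sparkContext.defaultParallelism)
    spark.stop()
  }
}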
Once the driver is started, it starts and configures an instance of SparkContext. There can be only one Spark context per JVM. Although Spark can run in local mode, in production it runs with one of the supported cluster managers, i.e. YARN, Mesos, or Spark Standalone.
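A minimal sketch of how the driver obtains its single SparkContext: the master URL decides which cluster manager is used (local[*] here, but yarn, mesos://host:port, or spark://host:7077 in production), and a second getOrCreate call returns the same instance because only one context may exist per JVM.

import org.apache.spark.{SparkConf, SparkContext}

object SingleContext {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("single-context-demo")
      .setMaster("local[*]")  // or "yarn", "mesos://host:5050", "spark://host:7077"

    // Only one SparkContext is allowed per JVM; getOrCreate reuses it if it already exists.
    val sc = SparkContext.getOrCreate(conf)
    val again = SparkContext.getOrCreate()
    println(sc eq again)  // true: the very same context instance
    sc.stop()
  }
}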
The Spark standalone cluster is a Spark-specific cluster. Comparing Spark standalone with YARN:
1. The standalone cluster is built specifically for Spark applications, so it doesn't support communication with HDFS secured with the Kerberos authentication protocol; for that, use YARN.
2. The standalone cluster provides faster job startup; YARN has slower job startup than the standalone cluster.
YARN is Hadoop's resource manager and execution system, with these pros:
1. Many organizations already have Hadoop clusters with YARN as the resource manager.
2. YARN allows running all kinds of applications, not just Spark.
3. It provides methods for isolating and prioritizing applications (see the queue sketch below).
4. It supports Kerberos-secured HDFS.
5. You don't have to install Spark on all nodes in the cluster.
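On YARN, isolation and prioritization are usually handled through YARN queues. A hedged sketch follows, assuming a YARN cluster is reachable (HADOOP_CONF_DIR set) and that a queue named "analytics" exists; both are assumptions, not part of the original setup.

import org.apache.spark.sql.SparkSession

object YarnQueueDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("yarn-queue-demo")
      .master("yarn")
      // Submit into a specific YARN scheduler queue; "analytics" is a placeholder name.
      .config("spark.yarn.queue", "analytics")
      .getOrCreate()

    println(spark.sparkContext.applicationId)  // the YARN application id
    spark.stop()
  }
}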
Mesos is a scalable and fault-tolerant distributed systems kernel. Unlike the other cluster managers, which only schedule memory, Mesos provides scheduling of other types of resources as well (CPU, disk, ports). It has fine-grained job scheduling. Mesos is a "scheduler of scheduler frameworks" because of its two-level scheduling architecture; for example, with the Myriad project you can run YARN on top of Mesos.
Job and resource scheduling
Resources for Spark applications are scheduled as executors (JVM processes) and CPU (task slots), and then memory is allocated to them. The cluster manager:
1. Starts the executor processes requested by the driver
2. Also starts the driver process in the case of cluster-deploy mode
3. Can restart and stop processes
4. Can set the maximum number of CPUs that executors can use (see the sketch below)
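For example, on the standalone and Mesos cluster managers the total number of CPU cores (task slots) an application may claim can be capped with spark.cores.max; the master URL and sizes below are placeholders.

import org.apache.spark.sql.SparkSession

object ResourceCaps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("resource-caps-demo")
      .master("spark://master-host:7077")       // placeholder standalone master URL
      // Cap the total CPU (task slots) this application can take from the cluster
      // (honored by the standalone and Mesos cluster managers).
      .config("spark.cores.max", "8")
      .config("spark.executor.memory", "1g")    // memory allocated per executor JVM
      .getOrCreate()

    spark.stop()
  }
}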
The Spark scheduler communicates with the driver and executors and decides which executors will run which tasks. This is called job scheduling, and it affects resource usage in the cluster. There are two types of scheduling:
1. Cluster resource scheduling
2. Spark resource scheduling – set spark.scheduler.mode to FAIR or FIFO (example below)
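A minimal sketch of switching the Spark scheduler to FAIR mode and routing jobs into a scheduler pool; the pool name "reports" is a placeholder that would normally be defined in a fair-scheduler allocation file.

import org.apache.spark.sql.SparkSession

object FairScheduling {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fair-scheduling-demo")
      .master("local[4]")
      .config("spark.scheduler.mode", "FAIR")   // the default is FIFO
      .getOrCreate()
    val sc = spark.sparkContext

    // Jobs submitted after this call go into the named pool (placeholder name).
    sc.setLocalProperty("spark.scheduler.pool", "reports")
    sc.parallelize(1 to 1000).count()

    // Clearing the property sends subsequent jobs back to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", null)
    spark.stop()
  }
}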
Note - SparkContext is thread-safe
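Because SparkContext is thread-safe, several threads in the driver can submit jobs concurrently through the same context, which is exactly the situation where FAIR scheduling pays off. A small sketch using Scala futures:

import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ConcurrentJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("concurrent-jobs-demo")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two jobs submitted from different threads share the single SparkContext.
    val evenCount = Future { sc.parallelize(1 to 100000).filter(_ % 2 == 0).count() }
    val oddCount  = Future { sc.parallelize(1 to 100000).filter(_ % 2 != 0).count() }

    println(Await.result(evenCount, 1.minute))
    println(Await.result(oddCount, 1.minute))
    spark.stop()
  }
}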