SPARK running explained - 1
Enable speculative execution by setting –
1. “spark.speculation” to true; the default is false.
2. “spark.speculation.interval” – the interval at which Spark checks whether any tasks need to be restarted.
3. “spark.speculation.quantile” – the percentage of tasks that must complete before speculation is started for a stage.
4. “spark.speculation.multiplier” – how many times slower than the median a task must be running before it is considered for restarting.
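A minimal sketch of these settings in a Scala application, assuming you build your own SparkConf; the application name and the numeric values are only illustrative, not prescribed by the text above.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only – tune them for your workload.
val conf = new SparkConf()
  .setAppName("speculation-example")            // placeholder application name
  .set("spark.speculation", "true")             // enable speculative execution
  .set("spark.speculation.interval", "100ms")   // how often Spark checks for tasks to restart
  .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before speculation starts
  .set("spark.speculation.multiplier", "1.5")   // how many times slower than the median a task must be
val sc = new SparkContext(conf)

The same properties can also be passed to spark-submit with --conf key=value.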
Data locality means Spark tries to run tasks as close to the data location as possible. There are five levels of data locality –
1. PROCESS_LOCAL – execute the task on the executor that cached the partition.
2. NODE_LOCAL – execute the task on the node where the partition is available.
3. RACK_LOCAL – execute the task on the same rack as the partition; rack information is available from the YARN cluster.
4. NO_PREF – no preferred locations are associated with the task.
5. ANY – the default if everything else fails.
Note –
1. “spark.locality.wait” – determines how long the scheduler waits at each locality level before moving to the next. The default is 3 seconds.
2. The wait time can also be set per locality level, as shown in the sketch after this list –
a. “spark.locality.wait.process”
b. “spark.locality.wait.node”
c. “spark.locality.wait.rack”
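A minimal sketch of the locality settings, again as a Scala SparkConf; the wait times shown are examples only.

import org.apache.spark.{SparkConf, SparkContext}

// Example wait times only; the global default is 3 seconds.
val conf = new SparkConf()
  .setAppName("locality-example")            // placeholder application name
  .set("spark.locality.wait", "3s")          // wait applied to every locality level unless overridden
  .set("spark.locality.wait.process", "2s")  // wait for PROCESS_LOCAL before falling back
  .set("spark.locality.wait.node", "3s")     // wait for NODE_LOCAL
  .set("spark.locality.wait.rack", "5s")     // wait for RACK_LOCAL
val sc = new SparkContext(conf)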
Spark memory scheduling
Spark manages the JVM heap memory allocated by the cluster manager by separating it into several segments –
1. Set “spark.executor.memory” to the amount of memory you want allocated for your executors. This memory is managed by the cluster manager.
2. Spark reserves parts of that memory for –
a. Cached data storage – set “spark.storage.memoryFraction”, default 0.6.
b. Temporary shuffle data – set “spark.shuffle.memoryFraction”, default 0.2.
c. Because these two parts of the heap can grow before Spark can measure and limit them, there are two safety parameters –
i. “spark.storage.safetyFraction”, default 0.9
ii. “spark.shuffle.safetyFraction”, default 0.8
d. Each safety parameter lowers its memory fraction by the safety fraction, so the defaults work out to –
i. 0.6 * 0.9 = 54% of the heap for cached data
ii. 0.2 * 0.8 = 16% of the heap for shuffle data
e. The rest of the heap is left for other Java objects and resources needed to run tasks.
3. To set driver memory –
a. Set “spark.driver.memory”.
b. If you start a Spark application programmatically, that application contains your driver. Therefore, to increase the memory available to your driver, use the -Xmx Java option to set the maximum size of the Java heap of the containing process.
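A minimal sketch of the memory settings as a Scala SparkConf; the sizes are placeholders, and the fraction properties are the legacy (pre-unified-memory) settings described above.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder sizes; adjust to what the cluster manager can actually grant.
val conf = new SparkConf()
  .setAppName("memory-example")                // placeholder application name
  .set("spark.executor.memory", "4g")          // heap requested for each executor
  .set("spark.storage.memoryFraction", "0.6")  // cached data: 0.6 * 0.9 safety = ~54% of the heap
  .set("spark.shuffle.memoryFraction", "0.2")  // shuffle data: 0.2 * 0.8 safety = ~16% of the heap
// spark.driver.memory must be set before the driver JVM starts (for example with
// spark-submit --driver-memory); for a programmatically started driver, use -Xmx as noted above.
val sc = new SparkContext(conf)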
Running Spark on the local machine
1. Local mode – this mode runs the entire cluster in a single JVM and is useful for testing purposes. To run Spark in local mode, set the master parameter to one of the following values (master URL examples for all of these modes appear in the sketch after this list) –
a. local[<n>] – run a single executor using <n> threads.
b. local – run a single executor using one thread.
c. local[*] – run a single executor using a number of threads equal to the number of CPU cores available on the local machine.
d. local[<n>,<f>] – run a single executor using <n> threads and allow a maximum of <f> failures per task.
Note – if you use --master local, there is only one thread, and you may notice that log lines are missing from the driver’s output. In Spark Streaming that single thread is occupied by the receiver, leaving the driver no threads to print out results. Therefore, specify at least two threads.
2. Local cluster mode – the difference between local cluster mode and a full standalone cluster is that the master isn’t a separate process but runs in the client JVM. To run Spark in this mode, set the master to –
a. local-cluster[<n>,<c>,<m>] – <n> executors, each using <c> threads and <m> megabytes of memory. Each executor in local cluster mode runs in a separate JVM.
3. Spark standalone cluster mode – this cluster is built specifically for Spark and can’t execute any other type of application. The standalone cluster consists of –
a. Master process
i. Acts as the cluster manager
ii. Accepts applications to run
iii. Schedules worker resources (CPU cores)
b. Worker (also called slave) processes
i. Launch application executors (and, in cluster-deploy mode, the driver) for task execution
Note – Spark has to be installed on all nodes in the cluster in order for them to be usable as slaves.
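A minimal sketch of how the master parameter might be set for each of the modes above, written as a Scala application; the application name, host name, port, thread counts, and memory size are placeholders, not values from the text.

import org.apache.spark.{SparkConf, SparkContext}

// Choose exactly one master URL; the alternatives are left commented out for comparison.
val conf = new SparkConf()
  .setAppName("master-url-example")           // placeholder application name
  .setMaster("local[2]")                      // local mode: one executor, two threads
  // .setMaster("local[4,2]")                 // local mode: four threads, at most two failures per task
  // .setMaster("local-cluster[2,1,1024]")    // local cluster mode: two executors, one thread and 1024 MB each
  // .setMaster("spark://master-host:7077")   // standalone cluster (placeholder host and port)
val sc = new SparkContext(conf)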
Spark standalone cluster running on two nodes with two workers (cluster-deploy mode) –
1. The client process submits the application to the master.
2. The master instructs one of the workers to start the driver.
3. That worker spawns the driver JVM.
4. The master instructs the workers to launch executor JVMs.
5. The workers spawn the executor JVMs.
6. The driver and executors then communicate with each other independently.
Note –
1. In a Spark standalone cluster, an application can have only one executor per worker process. If you need more executors per machine, you can start multiple worker processes.
2. If there are multiple applications in the cluster, each has its own set of executors and a separate driver.
Viewing Spark processes – you can use the JVM Process Status Tool (the jps command) to view them.
1. Master and worker processes appear as “Master” and “Worker”.
2. A driver running in the cluster appears as “DriverWrapper”.
3. A driver running in client mode appears as “SparkSubmit”.
4. Executor processes appear as “CoarseGrainedExecutorBackend”.
Specify the number of executors –
1. To control how many executors are allocated for your application, set “spark.cores.max” to the total number of cores you wish to use.
2. Set “spark.executor.cores” to the number of cores per executor.
3. The equivalent spark-submit command-line options for the above are --executor-cores and --total-executor-cores.
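A minimal sketch with illustrative numbers; the application name and master URL are placeholders. With 12 cores in total and 4 cores per executor, the standalone master can give this application at most 3 executors.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("executor-count-example")    // placeholder application name
  .setMaster("spark://master-host:7077")   // placeholder standalone master URL
  .set("spark.cores.max", "12")            // total cores across the whole application
  .set("spark.executor.cores", "4")        // cores per executor
val sc = new SparkContext(conf)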
Killing applications –
spark-class org.apache.spark.deploy.Client kill
<master_URL> <driver_ID>
Application automatic restart –
When submitting an
application in cluster-deploy mode, a special command-line option (--supervise)
tells Spark to restart the driver process if it fails.