Data locality / placement

Spark relies on data locality (also called data placement or proximity to the data source), which makes Spark jobs sensitive to where the data is located. It is therefore important to run Spark on a Hadoop YARN cluster if the data comes from HDFS.

With Spark on YARN, Spark tries to place tasks alongside HDFS blocks.

With HDFS, the Spark driver asks the NameNode which DataNodes (ideally local) hold the blocks of a file or directory and where those blocks are located (represented as InputSplits), and then schedules the work on the Spark workers.
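The lookup described above can be pictured with a small, self-contained model. This is only a sketch under stated assumptions, not Spark's or HDFS's actual API: the `block_locations` mapping, the `executor_hosts` set, and the `place_tasks` helper are all hypothetical names invented for illustration.

```python
# Hypothetical block report: split name -> hosts of the DataNodes holding a replica.
# In a real cluster this information comes from the NameNode.
block_locations = {
    "part-0": ["node1", "node2"],
    "part-1": ["node2", "node3"],
    "part-2": ["node4", "node5"],
}

# Hypothetical set of hosts that currently run a Spark executor.
executor_hosts = {"node1", "node2", "node3"}


def place_tasks(block_locations, executor_hosts):
    """Map each split to the hosts where a task could read it locally.

    An empty list means no executor is co-located with any replica,
    so the data for that split has to travel over the network.
    """
    placement = {}
    for split, hosts in block_locations.items():
        placement[split] = [h for h in hosts if h in executor_hosts]
    return placement


placement = place_tasks(block_locations, executor_hosts)
for split, hosts in placement.items():
    print(split, "->", hosts if hosts else "no local executor")
```

In this toy setup `part-2` has no co-located executor, which illustrates why running workers on the storage nodes matters.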

Spark’s compute nodes (workers) should therefore run on the storage nodes.

This is the concept of locality-aware scheduling: Spark tries to execute tasks as close to the data as possible to minimize data transfer over the wire.

Figure 1. Locality Level in the Spark UI

Spark defines the following task localities (see the org.apache.spark.scheduler.TaskLocality object):

PROCESS_LOCAL
NODE_LOCAL
NO_PREF
RACK_LOCAL
ANY

A task location can be either a host or a pair of a host and an executor.
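To make the relationship between task locations and locality levels concrete, here is a simplified model in plain Python. It is not Spark's scheduler code: `Locality`, `TaskLocation`, `locality_for`, and the `rack_of` mapping are hypothetical names, and the matching rules are an assumption about how the levels are commonly understood (same executor, same host, same rack, anywhere).

```python
from enum import IntEnum
from typing import Dict, List, NamedTuple, Optional


class Locality(IntEnum):
    # Same names and order as the list above; a lower value means closer to the data.
    PROCESS_LOCAL = 0
    NODE_LOCAL = 1
    NO_PREF = 2
    RACK_LOCAL = 3
    ANY = 4


class TaskLocation(NamedTuple):
    # A location is either a host alone or a host plus an executor.
    host: str
    executor_id: Optional[str] = None


def locality_for(preferred: List[TaskLocation], exec_id: str, host: str,
                 rack_of: Dict[str, str]) -> Locality:
    """Classify how local a task would be if run on executor `exec_id` at `host`."""
    if not preferred:
        # The task has no preferred location at all.
        return Locality.NO_PREF
    if any(loc.executor_id == exec_id and loc.host == host for loc in preferred):
        return Locality.PROCESS_LOCAL
    if any(loc.host == host for loc in preferred):
        return Locality.NODE_LOCAL
    rack = rack_of.get(host)
    if rack is not None and any(rack_of.get(loc.host) == rack for loc in preferred):
        return Locality.RACK_LOCAL
    return Locality.ANY


# Hypothetical cluster layout used only for this example.
rack_of = {"node1": "rack1", "node2": "rack1", "node3": "rack2"}
preferred = [TaskLocation("node1", "exec-1")]

print(locality_for(preferred, "exec-1", "node1", rack_of).name)
print(locality_for(preferred, "exec-2", "node1", rack_of).name)
print(locality_for(preferred, "exec-3", "node2", rack_of).name)
print(locality_for(preferred, "exec-4", "node3", rack_of).name)
```

The four calls show the same task being offered progressively more distant executors, ending at `ANY` when not even the rack matches.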
