Spark RDD Characteristics

RDD

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. The RDD class contains the basic operations available on all RDDs, such as map, filter, and persist.
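
A minimal sketch of those basic operations, assuming a local SparkSession; the application name, master setting, and object name below are placeholders:

    import org.apache.spark.sql.SparkSession

    object RddBasicsSketch {
      def main(args: Array[String]): Unit = {
        // Local session; "rdd-basics" and local[*] are placeholder settings for this sketch.
        val spark = SparkSession.builder().appName("rdd-basics").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val numbers = sc.parallelize(1 to 10)      // an immutable, partitioned collection of Ints
        val evens   = numbers.filter(_ % 2 == 0)   // filter: builds a new RDD, nothing runs yet
        val doubled = evens.map(_ * 2).persist()   // map + persist: mark the result for reuse

        println(doubled.collect().mkString(", "))  // collect is an action; it triggers execution

        spark.stop()
      }
    }

The later sketches in this section reuse the SparkContext `sc` defined here.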

Internally, each RDD is characterized by five main properties (a sketch of how they surface in the API follows the list):

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
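
A sketch of how these five properties appear when defining a custom RDD; the class name, partition type, and numbers below are hypothetical and chosen only for illustration:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical partition type that only carries its index.
    case class SketchPartition(index: Int) extends Partition

    // Hypothetical RDD that yields 100 integers per partition.
    class RangeSketchRDD(sc: SparkContext, numSlices: Int) extends RDD[Int](sc, Nil) {

      // 1. A list of partitions.
      override protected def getPartitions: Array[Partition] =
        Array.tabulate[Partition](numSlices)(i => SketchPartition(i))

      // 2. A function for computing each split.
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator.range(split.index * 100, (split.index + 1) * 100)

      // 3. Dependencies on other RDDs: the Nil passed to the superclass means no parents here.

      // 4. Optional Partitioner: the inherited default (None), since this is not a key-value RDD.

      // 5. Optional preferred locations: empty, so the scheduler gets no locality hints.
      override protected def getPreferredLocations(split: Partition): Seq[String] = Seq.empty
    }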

Operations

Creation Operation
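
Creation operations build an RDD either from a driver-side collection or from external storage. A sketch, reusing `sc` from the first example (the HDFS path is a placeholder):

    val fromCollection = sc.parallelize(Seq("a", "b", "c"), numSlices = 3)   // from an in-memory collection
    val fromFile       = sc.textFile("hdfs:///path/to/input.txt")            // from external storage such as HDFS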

Transformation Operation
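
Transformations such as map, filter, and flatMap are lazy: they only describe how to derive a new RDD from an existing one. A sketch, again reusing `sc`:

    val lines  = sc.parallelize(Seq("to be", "or not", "to be"))
    val words  = lines.flatMap(_.split(" "))   // RDD[String] -> RDD[String], one element per word
    val pairs  = words.map(w => (w, 1))        // to a key-value RDD
    val counts = pairs.reduceByKey(_ + _)      // will require a shuffle once an action runs
    // Nothing has executed yet; an action (see the Action Operation subsection) triggers the work.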

Storage Operation

LRU (Least Recently Used): cached partitions are dropped in least-recently-used order when memory runs low; a sketch of the storage calls follows this list.

  • Cache

  • Persist (unpersist / destroy)

  • Checkpoint
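
A sketch of the three storage calls, reusing `sc`; the checkpoint directory is a placeholder and must be writable:

    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder directory

    val cached = sc.parallelize(1 to 1000).map(_ * 2)
    cached.cache()                                  // shorthand for persist(StorageLevel.MEMORY_ONLY)
    cached.count()                                  // action: materializes the cached blocks
    cached.unpersist()                              // drop the cached blocks when no longer needed

    val durable = sc.parallelize(1 to 1000).map(_ + 1)
    durable.persist(StorageLevel.MEMORY_AND_DISK)   // pick an explicit storage level
    durable.checkpoint()                            // truncate lineage; written out on the next action
    durable.count()                                 // action: materializes both the persist and the checkpoint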

Action Operation
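
Actions return a value to the driver or write data out, and they are what actually trigger execution. A sketch, reusing the `counts` RDD from the transformation sketch (the output path is a placeholder):

    println(counts.count())                     // number of distinct words
    counts.collect().foreach(println)           // bring every (word, count) pair to the driver (small data only)
    counts.take(2).foreach(println)             // just the first two pairs
    counts.saveAsTextFile("/tmp/word-counts")   // write out as text; the path is a placeholder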

Dependencies

Narrow Dependencies

Shuffle/Wide Dependencies
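
In a narrow dependency each parent partition feeds at most one child partition (e.g. map, filter), while in a shuffle/wide dependency a child partition may read from many parent partitions (e.g. reduceByKey, groupByKey). A sketch, reusing `sc`:

    val base   = sc.parallelize(1 to 100, 4)
    val narrow = base.map(x => (x % 10, x))      // narrow: each input partition maps to one output partition
    val wide   = narrow.reduceByKey(_ + _)       // wide: values for a key must be shuffled together

    println(narrow.dependencies)                 // e.g. a OneToOneDependency
    println(wide.dependencies)                   // e.g. a ShuffleDependency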

Characteristics

Partitions

PreferredLocations

Dependencies

Iterator

Partitioner
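
Each of these characteristics can be inspected on a live RDD. The sketch below reuses the `wide` RDD from the dependency example; the Iterator/compute path itself is normally driven by tasks rather than user code:

    println(wide.partitions.length)              // Partitions: how many splits the RDD has
    println(wide.getNumPartitions)               // the same count via the convenience accessor
    println(wide.dependencies)                   // Dependencies: links back to the parent RDD(s)
    println(wide.partitioner)                    // Partitioner: e.g. Some(HashPartitioner) after reduceByKey
    wide.partitions.foreach { p =>
      println(wide.preferredLocations(p))        // PreferredLocations: locality hints per split (often empty locally)
    }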

Stage

ResultStage

ShuffleMapStage
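
Stages are internal to the DAGScheduler: a job is cut at shuffle boundaries into one or more ShuffleMapStages followed by a final ResultStage. As a sketch, the word count below runs as two stages, and toDebugString prints the lineage with the shuffle boundary visible (reusing `sc`; the input path is a placeholder):

    val wordCounts = sc.textFile("hdfs:///path/to/input.txt")   // placeholder path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)             // shuffle boundary: everything above runs in a ShuffleMapStage

    println(wordCounts.toDebugString) // prints the lineage, indented at the shuffle boundary
    wordCounts.count()                // the count itself runs in the final ResultStage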

Others

  • DAG

  • Lineage

  • Shared Variables (a usage sketch follows this list)

    Broadcast Variables
    Accumulators
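
A sketch of both kinds of shared variable, reusing `sc`; the lookup map, accumulator name, and -1 sentinel are arbitrary choices for illustration:

    // Broadcast: ship a read-only lookup table to every executor once.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    // Accumulator: tasks only add to it; the aggregated value is read back on the driver.
    val misses = sc.longAccumulator("lookup-misses")

    val resolved = sc.parallelize(Seq("a", "b", "c")).map { key =>
      lookup.value.get(key) match {
        case Some(v) => v
        case None    => misses.add(1); -1        // -1 is an arbitrary sentinel for a missing key
      }
    }

    resolved.count()                             // action: tasks run and accumulator updates reach the driver
    println(misses.value)                        // e.g. 1 here, since "c" is not in the broadcast map
    lookup.destroy()                             // release the broadcast data on the driver and executors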