RDD
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. The RDD class contains the basic operations available on all RDDs, such as `map`, `filter`, and `persist`.
Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
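The five properties above can be illustrated with a toy, plain-Python object (hypothetical names, not Spark's actual implementation) — each derived RDD carries its own partitions, a compute function, and a reference back to its parent:

```python
# Toy sketch of the five RDD properties (illustrative names, not Spark's API).
class ToyRDD:
    def __init__(self, partitions, compute, dependencies=(),
                 partitioner=None, preferred_locations=None):
        self.partitions = partitions            # 1. a list of partitions
        self.compute = compute                  # 2. a function computing each split
        self.dependencies = dependencies        # 3. dependencies on other RDDs (lineage)
        self.partitioner = partitioner          # 4. optional, for key-value RDDs
        self.preferred_locations = preferred_locations or {}  # 5. optional locality hints

    def iterator(self, split):
        """Materialize one partition by running its compute function."""
        return self.compute(split)

# Two partitions of raw data; compute simply yields the partition's elements.
data = [[1, 2, 3], [4, 5, 6]]
base = ToyRDD(partitions=[0, 1], compute=lambda i: iter(data[i]))

# A derived RDD: same partitions, compute maps over the parent partition.
doubled = ToyRDD(partitions=base.partitions,
                 compute=lambda i: (x * 2 for x in base.iterator(i)),
                 dependencies=(base,))

print(list(doubled.iterator(0)))  # [2, 4, 6]
```

Note how `doubled` stores no data of its own: it only records how to recompute each split from its parent, which is exactly what makes lineage-based fault recovery possible.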
Operations
Creation Operation
Transformation Operation
Storage Operation
LRU (Least Recently Used)
Cache
Persist (unpersist / destroy)
Checkpoint
Action Operation
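The key split here is that transformations are lazy (they only record a new RDD) while actions are eager (they trigger actual computation). This can be sketched in plain Python — the class and method names below are illustrative, not Spark code:

```python
# Conceptual sketch: transformations are recorded lazily; only an action
# (here, collect) runs the pipeline.
class LazySeq:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops                      # pending transformations, not yet run

    def map(self, f):                       # transformation: returns a new LazySeq
        return LazySeq(self.data, self.ops + (("map", f),))

    def filter(self, p):                    # transformation: also lazy
        return LazySeq(self.data, self.ops + (("filter", p),))

    def collect(self):                      # action: actually evaluates the pipeline
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

nums = LazySeq(range(1, 7))
pipeline = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has run yet; pipeline.ops merely records the two steps.
print(pipeline.collect())  # [4, 16, 36]
```

Laziness is what lets a scheduler see the whole chain of transformations before running anything, so it can pipeline narrow operations within a stage.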
Dependencies
Narrow Dependencies
Shuffle/Wide Dependencies
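In a narrow dependency each child partition reads from at most one parent partition (e.g. `map`), while in a wide/shuffle dependency each child partition may read from every parent partition (e.g. grouping by key). A plain-Python sketch of the re-bucketing a shuffle performs (the deterministic bucketing function here is purely illustrative):

```python
# Parent RDD as two partitions of (key, value) pairs.
parent = [
    [("a", 1), ("b", 2), ("a", 3)],   # partition 0
    [("b", 4), ("a", 5)],             # partition 1
]

# Narrow dependency: map each partition independently; no data movement.
mapped = [[(k, v * 10) for k, v in part] for part in parent]

# Wide dependency: re-bucket by key; every child partition may pull records
# from every parent partition -- this cross-partition movement is the shuffle.
NUM_PARTS = 2
shuffled = [[] for _ in range(NUM_PARTS)]
for part in mapped:
    for k, v in part:
        shuffled[ord(k[0]) % NUM_PARTS].append((k, v))   # toy deterministic bucketing

for i, part in enumerate(shuffled):
    print(i, part)
```

After the shuffle, all records for a given key sit in the same child partition, which is what operations like `reduceByKey` require — and why a wide dependency forces a stage boundary.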
Characteristics
Partitions
PreferredLocations
Dependencies
Iterator
Partitioner
Stage
ResultStage
ShuffleMapStage
Others
DAG
Lineage
Shared Variables
Broadcast Variables
Accumulators
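The two kinds of shared variables have opposite data-flow directions: a broadcast variable is shipped once to the workers and is read-only there, while an accumulator is add-only from the workers and read on the driver. A conceptual plain-Python sketch (the task function and names are hypothetical, not Spark's API):

```python
# Toy accumulator: tasks may only add; the driver reads the total.
class Accumulator:
    def __init__(self, value=0):
        self._value = value

    def add(self, amount):
        self._value += amount

    @property
    def value(self):
        return self._value

# Broadcast side: a lookup table shipped once to every task (read-only there).
lookup = {"a": 1, "b": 2, "c": 3}

bad_records = Accumulator()

def process_partition(records):
    """Simulated task: reads the broadcast lookup, counts bad rows via the accumulator."""
    out = []
    for key in records:
        if key in lookup:
            out.append(lookup[key])
        else:
            bad_records.add(1)          # add-only side channel back to the driver
    return out

results = [process_partition(p) for p in [["a", "x", "b"], ["c", "y"]]]
print(results, bad_records.value)  # [[1, 2], [3]] 2
```

In real Spark the broadcast value would be created with `sc.broadcast(...)` so each executor fetches it once instead of shipping it with every task.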
Links
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- Spark Documentation
- Author: HyperJ
- Source: HyperJ’s Blog
- Link: Spark RDD Characteristics