Spark RDD Characteristics

RDD

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. The RDD class contains the basic operations available on all RDDs, such as map, filter, and persist.
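
A minimal sketch of those basic operations, assuming a local SparkSession; the application name, master setting, and object name below are placeholders:

    import org.apache.spark.sql.SparkSession

    object RddBasicsSketch {
      def main(args: Array[String]): Unit = {
        // Local session; "rdd-basics" and local[*] are placeholder settings for this sketch.
        val spark = SparkSession.builder().appName("rdd-basics").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val numbers = sc.parallelize(1 to 10)      // an immutable, partitioned collection of Ints
        val evens   = numbers.filter(_ % 2 == 0)   // filter: builds a new RDD, nothing runs yet
        val doubled = evens.map(_ * 2).persist()   // map + persist: mark the result for reuse

        println(doubled.collect().mkString(", "))  // collect is an action; it triggers execution

        spark.stop()
      }
    }

The later sketches in this section reuse the SparkContext `sc` defined here.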

Internally, each RDD is characterized by five main properties (a sketch of how they surface in the API follows the list):

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
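
A sketch of how these five properties appear when defining a custom RDD; the class name, partition type, and numbers below are hypothetical and chosen only for illustration:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical partition type that only carries its index.
    case class SketchPartition(index: Int) extends Partition

    // Hypothetical RDD that yields 100 integers per partition.
    class RangeSketchRDD(sc: SparkContext, numSlices: Int) extends RDD[Int](sc, Nil) {

      // 1. A list of partitions.
      override protected def getPartitions: Array[Partition] =
        Array.tabulate[Partition](numSlices)(i => SketchPartition(i))

      // 2. A function for computing each split.
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator.range(split.index * 100, (split.index + 1) * 100)

      // 3. Dependencies on other RDDs: the Nil passed to the superclass means no parents here.

      // 4. Optional Partitioner: the inherited default (None), since this is not a key-value RDD.

      // 5. Optional preferred locations: empty, so the scheduler gets no locality hints.
      override protected def getPreferredLocations(split: Partition): Seq[String] = Seq.empty
    }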

Operations

Creation Operation
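
Creation operations build an RDD either from a driver-side collection or from external storage. A sketch, reusing `sc` from the first example (the HDFS path is a placeholder):

    val fromCollection = sc.parallelize(Seq("a", "b", "c"), numSlices = 3)   // from an in-memory collection
    val fromFile       = sc.textFile("hdfs:///path/to/input.txt")            // from external storage such as HDFS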

Transformation Operation
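
Transformations such as map, filter, and flatMap are lazy: they only describe how to derive a new RDD from an existing one. A sketch, again reusing `sc`:

    val lines  = sc.parallelize(Seq("to be", "or not", "to be"))
    val words  = lines.flatMap(_.split(" "))   // RDD[String] -> RDD[String], one element per word
    val pairs  = words.map(w => (w, 1))        // to a key-value RDD
    val counts = pairs.reduceByKey(_ + _)      // will require a shuffle once an action runs
    // Nothing has executed yet; an action (see the Action Operation subsection) triggers the work.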

Storage Operation

LRU (Least Recently Used): cached partitions are dropped in least-recently-used order when memory runs low; a sketch of the storage calls follows this list.

  • Cache

  • Persist (unpersist / destroy)

  • Checkpoint
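
A sketch of the three storage calls, reusing `sc`; the checkpoint directory is a placeholder and must be writable:

    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder directory

    val cached = sc.parallelize(1 to 1000).map(_ * 2)
    cached.cache()                                  // shorthand for persist(StorageLevel.MEMORY_ONLY)
    cached.count()                                  // action: materializes the cached blocks
    cached.unpersist()                              // drop the cached blocks when no longer needed

    val durable = sc.parallelize(1 to 1000).map(_ + 1)
    durable.persist(StorageLevel.MEMORY_AND_DISK)   // pick an explicit storage level
    durable.checkpoint()                            // truncate lineage; written out on the next action
    durable.count()                                 // action: materializes both the persist and the checkpoint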

Action Operation
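
Actions return a value to the driver or write data out, and they are what actually trigger execution. A sketch, reusing the `counts` RDD from the transformation sketch (the output path is a placeholder):

    println(counts.count())                     // number of distinct words
    counts.collect().foreach(println)           // bring every (word, count) pair to the driver (small data only)
    counts.take(2).foreach(println)             // just the first two pairs
    counts.saveAsTextFile("/tmp/word-counts")   // write out as text; the path is a placeholder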

Dependencies

Narrow Dependencies

Shuffle/Wide Dependencies
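
In a narrow dependency each parent partition feeds at most one child partition (e.g. map, filter), while in a shuffle/wide dependency a child partition may read from many parent partitions (e.g. reduceByKey, groupByKey). A sketch, reusing `sc`:

    val base   = sc.parallelize(1 to 100, 4)
    val narrow = base.map(x => (x % 10, x))      // narrow: each input partition maps to one output partition
    val wide   = narrow.reduceByKey(_ + _)       // wide: values for a key must be shuffled together

    println(narrow.dependencies)                 // e.g. a OneToOneDependency
    println(wide.dependencies)                   // e.g. a ShuffleDependency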

Characteristics

Partitions

PreferredLocations

Dependencies

Iterator

Partitioner
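
Each of these characteristics can be inspected on a live RDD. The sketch below reuses the `wide` RDD from the dependency example; the Iterator/compute path itself is normally driven by tasks rather than user code:

    println(wide.partitions.length)              // Partitions: how many splits the RDD has
    println(wide.getNumPartitions)               // the same count via the convenience accessor
    println(wide.dependencies)                   // Dependencies: links back to the parent RDD(s)
    println(wide.partitioner)                    // Partitioner: e.g. Some(HashPartitioner) after reduceByKey
    wide.partitions.foreach { p =>
      println(wide.preferredLocations(p))        // PreferredLocations: locality hints per split (often empty locally)
    }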

Stage

ResultStage

ShuffleMapStage
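
Stages are internal to the DAGScheduler: a job is cut at shuffle boundaries into one or more ShuffleMapStages followed by a final ResultStage. As a sketch, the word count below runs as two stages, and toDebugString prints the lineage with the shuffle boundary visible (reusing `sc`; the input path is a placeholder):

    val wordCounts = sc.textFile("hdfs:///path/to/input.txt")   // placeholder path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)             // shuffle boundary: everything above runs in a ShuffleMapStage

    println(wordCounts.toDebugString) // prints the lineage, indented at the shuffle boundary
    wordCounts.count()                // the count itself runs in the final ResultStage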

Others

  • DAG

  • Lineage

  • Shared Variables (a usage sketch follows this list)

    Broadcast Variables
    Accumulators
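
A sketch of both kinds of shared variable, reusing `sc`; the lookup map, accumulator name, and -1 sentinel are arbitrary choices for illustration:

    // Broadcast: ship a read-only lookup table to every executor once.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    // Accumulator: tasks only add to it; the aggregated value is read back on the driver.
    val misses = sc.longAccumulator("lookup-misses")

    val resolved = sc.parallelize(Seq("a", "b", "c")).map { key =>
      lookup.value.get(key) match {
        case Some(v) => v
        case None    => misses.add(1); -1        // -1 is an arbitrary sentinel for a missing key
      }
    }

    resolved.count()                             // action: tasks run and accumulator updates reach the driver
    println(misses.value)                        // e.g. 1 here, since "c" is not in the broadcast map
    lookup.destroy()                             // release the broadcast data on the driver and executors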