Spark SQL Optimization

Configuration

Application Properties
spark.driver.cores(1)

Number of cores to use for the driver process, only in cluster mode.

spark.driver.maxResultSize(1g)

Limit of total size of serialized results of all partitions for each Spark action (e.g. collect). Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors.

spark.driver.memory(1g)

Amount of memory to use for the driver process, i.e. where SparkContext is initialized. (e.g. 1g, 2g).
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the –driver-memory command line option or in your default properties file.

spark.executor.memory(1g)

A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. Note that it is illegal to set maximum heap size (-Xmx) settings with this option. Maximum heap size settings can be set with spark.driver.memory in the cluster mode and through the –driver-memory command line option in the client mode.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the –driver-java-options command line option or in your default properties file.

Runtime Environment
spark.driver.extraJavaOptions(none)

A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. Note that it is illegal to set maximum heap size (-Xmx) settings with this option. Maximum heap size settings can be set with spark.driver.memory in the cluster mode and through the –driver-memory command line option in the client mode.

Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the –driver-java-options command line option or in your default properties file.

spark.executor.extraJavaOptions(none)

A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. Note that it is illegal to set maximum heap size (-Xmx) settings with this option. Maximum heap size settings can be set with spark.driver.memory in the cluster mode and through the –driver-memory command line option in the client mode.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the –driver-java-options command line option or in your default properties file.

Execution Behavior
spark.executor.cores(1 in YARN mode, all the available cores on the worker in standalone and Mesos coarse-grained modes.)

The number of cores to use on each executor. In standalone and Mesos coarse-grained modes, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.

Spark Yarn Properties
spark.yarn.executor.memoryOverhead(executorMemory * 0.10, with minimum of 384)

The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).

spark.yarn.driver.memoryOverhead(driverMemory * 0.10, with minimum of 384)

The amount of off-heap memory (in megabytes) to be allocated per driver in cluster mode. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size (typically 6-10%).

spark.yarn.am.memoryOverhead(AM memory * 0.10, with minimum of 384)

Same as spark.yarn.driver.memoryOverhead, but for the YARN Application Master in client mode.