Statistics¶
Estimates of various statistics. The default estimation logic simply lazily multiplies the corresponding statistic produced by the children.
Statistics -> CatalogStatistics¶
- sizeInBytes: Physical size in bytes. For leaf operators this defaults to 1, otherwise it defaults to the product of children's
sizeInBytes
- rowCount: Estimated number of rows
- attributeStats: Statistics for Attributes
- hints: Query hints
ColumnStat -> CatalogColumnStat¶
- distinctCount: number of distinct values
- min: minimum value
- max: maximum value
- nullCount: number of nulls
- avgLen: average length of the values
- maxLen: maximum length of the values
- histogram: histogram of the values
Histogram[HistogramBin]¶
- height: number of rows in each bin
- bins: equi-height histogram bins
- lo: lower bound of the value range in this bin
- hi: higher bound of the value range in this bin
- ndv: approximate number of distinct values in this bin
HintInfo¶
- broadcast
- join/shuffle
DataFrameStatFunctions¶
Statistic functions for DataFrames.(Since: 1.4.0)
- approxQuantile: Calculates the approximate quantiles of numerical columns of a DataFrame
- bloomFilter: Builds a Bloom filter over a specified column
- corr: Calculates the Pearson Correlation Coefficient of two columns of a DataFrame
- countMinSketch: Builds a Count-min Sketch over a specified column
- cov: Calculate the sample covariance of two numerical columns of a DataFrame
- crosstab: Computes a pair-wise frequency table of the given columns
- freqItems: (Scala-specific) Finding frequent items for columns, possibly with false positives
- sampleBy: Returns a stratified sample without replacement based on the fraction given on each stratum
Other¶
Dataset#describe¶
- count, mean, stddev, min, max
- StatFunctions.summary(ds, Seq("count", "mean", "stddev", "min", "25%", "50%", "75%", "max"))
Statistics¶
API for statistical functions in MLlib
- colStats[MultivariateOnlineSummarizer]: column-wise summary statistics
- corr: Pearson correlation matrix
- chiSqTest: chi-squared test
- kolmogorovSmirnovTest: Kolmogorov-Smirnov test