Feature Extraction

The sklearn.feature_extraction module extracts features, in a format supported by machine learning algorithms, from datasets made up of formats such as text and images.

Note: Feature extraction is very different from feature selection. The former transforms arbitrary data, such as text or images, into numerical features usable for machine learning; the latter is a machine learning technique applied to those features.

DictVectorizer

DictVectorizer implements what is called one-of-K or “one-hot” coding for categorical (aka nominal, discrete) features.
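A minimal sketch of the one-hot behaviour described above, assuming scikit-learn is installed. The `measurements` records are made-up illustrative data, not taken from any dataset in these notes. The categorical `city` field should expand into one binary column per distinct value, while the numeric `temperature` field passes through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical sample records: one categorical field, one numeric field.
measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

# sparse=False returns a dense NumPy array, which is easier to inspect;
# the default (sparse=True) returns a SciPy sparse matrix instead.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)

# feature_names_ lists the generated columns in sorted order.
print(vec.feature_names_)
print(X)
```

Each string-valued key becomes a set of `key=value` columns, so the array has three `city=...` columns plus one `temperature` column.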

FeatureHasher

Text feature extraction
CountVectorizer, HashingVectorizer
Bag of Words (tokenization, counting and normalization)
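As a rough illustration of the tokenization and counting steps, the sketch below runs CountVectorizer with its default settings on a small made-up corpus (the sentences are illustrative only). Normalization is not shown here; it is usually applied afterwards, for example through tf-idf weighting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, one string per document.
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Defaults: lowercase the text and keep tokens of two or more
# alphanumeric characters, treating punctuation as a separator.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

vocab = vectorizer.vocabulary_  # maps each token to its column index
counts = X.toarray()            # dense copy, for inspection only

print(sorted(vocab))
print(counts)
```

Each row of `counts` is one document and each column one token, so the entry for "second" in the second document is 2. Word order is discarded, which is what makes this a bag-of-words representation.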

Sparsity
Tf–idf term weighting
Decoding text files

Image feature extraction