The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.
Note
Feature extraction is very different from feature selection: the former consists of transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied to these features.
DictVectorizer
DictVectorizer implements what is called one-of-K or “one-hot” coding for categorical (aka nominal, discrete) features.
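As a minimal sketch of the one-hot behaviour described above, the example below feeds a few dictionaries to DictVectorizer. The sample records and the variable names (measurements, vec, X) are made up for illustration; the string-valued "city" key is expanded into one column per category, while the numeric "temperature" key is passed through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical sample records: one categorical key, one numeric key.
measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

# sparse=False returns a dense NumPy array, which is easier to inspect.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)

# Column names learned during fit, e.g. "city=Dubai", "temperature".
print(vec.feature_names_)
print(X)
```

With the default sparse=True the result is a SciPy sparse matrix instead, which is usually preferable when there are many categories.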
FeatureHasher
Text feature extraction
CountVectorizer, HashingVectorizer
Bag of Words (tokenization, counting and normalization)
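The tokenization and counting steps can be sketched with CountVectorizer as follows. The four-sentence corpus is an illustrative assumption, not data from this post; with default settings the vectorizer lowercases the text, splits it into word tokens of two or more characters, and counts occurrences per document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # SciPy sparse matrix of token counts

# vocabulary_ maps each token to its column index in X.
vocab = vectorizer.vocabulary_
second_count = X[1, vocab["second"]]  # count of "second" in document 1

print(sorted(vocab))
print(X.toarray())
```

Most entries of the resulting matrix are zero, which is why the output is stored in a sparse format. HashingVectorizer takes a similar input but, as far as I understand it, hashes tokens to column indices instead of storing a vocabulary.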
Sparsity
Tf–idf term weighting
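A small sketch of tf–idf weighting, assuming the same kind of toy corpus as a stand-in for real data: TfidfVectorizer combines counting with idf re-weighting and, by default, scales each document vector to unit Euclidean length. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

tfidf = TfidfVectorizer()  # defaults: norm="l2", smooth_idf=True
X = tfidf.fit_transform(corpus)

# Each row is L2-normalized, so every document vector has length 1.
row_norms = np.linalg.norm(X.toarray(), axis=1)

# A term that occurs in every document gets the smallest idf weight.
idf_the = tfidf.idf_[tfidf.vocabulary_["the"]]
idf_third = tfidf.idf_[tfidf.vocabulary_["third"]]

print(row_norms)
print(idf_the, idf_third)
```

Terms that appear in every document (here "the") are down-weighted relative to rarer terms (here "third"), which is the point of the idf factor.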
Decode
Image feature extraction
Links
- Author: HyperJ
- Source: HyperJ’s Blog
- Link: Feature Extraction