Quick Start with Vectorizers
Vectorizers provides a number of tools for working with various kinds of unstructured data with a focus on sequence data. The library is built to be compatible with scikit-learn and can be used in scikit-learn pipelines.
Installing
Vectorizers can be installed via pip (coming soon) and via conda-forge (coming later).
(Coming soon)

.. code:: bash

    pip install vectorizers

(Currently available)

.. code:: bash

    pip install git+https://github.com/TutteInstitute/vectorizers.git

To manually install this package:

.. code:: bash

    wget https://github.com/TutteInstitute/vectorizers/archive/master.zip
    unzip master.zip
    rm master.zip
    cd vectorizers-master
    python setup.py install

Basic Usage
The vectorizers package provides a number of tools for vectorizing different kinds of input data. All of them are available as classes that follow scikit-learn's basic API for transformers, converting input data into vectors in one form or another. For example, to convert sequences of categorical data into ngram vector representations one might use:

.. code:: python

    import vectorizers

    # input_sequences: an iterable of sequences of categorical items
    ngrammer = vectorizers.NgramVectorizer(ngram_size=2)
    ngram_vectors = ngrammer.fit_transform(input_sequences)

These classes fit easily into scikit-learn pipelines, passing vector representations on to other scikit-learn (or scikit-learn compatible) classes. See the `Vectorizers API`_ documentation for more details on the available classes.
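To make the idea of an ngram vector representation concrete, here is a minimal pure-Python sketch of a bigram counter that follows the same ``fit`` / ``transform`` convention. ``ToyBigramCounter`` and its attribute names are hypothetical, invented for this example; this is not how ``NgramVectorizer`` is implemented, and it returns plain lists rather than whatever matrix type the library produces.

```python
from collections import Counter


class ToyBigramCounter:
    """Hypothetical stand-in for illustration; not part of vectorizers."""

    def fit(self, sequences, y=None):
        # Learn a column index for every distinct bigram seen in training.
        bigrams = sorted({bg for seq in sequences for bg in zip(seq, seq[1:])})
        self.column_index_ = {bg: i for i, bg in enumerate(bigrams)}
        return self

    def transform(self, sequences):
        # One row per sequence, one column per learned bigram.
        rows = []
        for seq in sequences:
            counts = Counter(zip(seq, seq[1:]))
            row = [0] * len(self.column_index_)
            for bigram, count in counts.items():
                if bigram in self.column_index_:  # ignore unseen bigrams
                    row[self.column_index_[bigram]] = count
            rows.append(row)
        return rows

    def fit_transform(self, sequences, y=None):
        return self.fit(sequences, y).transform(sequences)


# Made-up example data: two short sequences of categorical items.
input_sequences = [["a", "b", "a", "b"], ["b", "c"]]
counter = ToyBigramCounter()
ngram_vectors = counter.fit_transform(input_sequences)
```

Each row of ``ngram_vectors`` counts how often each learned bigram occurs in the corresponding input sequence, which is the general shape of output that downstream estimators consume.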
Vectorizers also provides a number of utility transformers in the ``vectorizers.transformers``
namespace. These provide convenience transformations of data: either transforms on vectorized
data, including feature weighting tools, or transformations of structured and unstructured data
into sequences more amenable to other vectorizers classes.
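
Turning structured data into sequences can be pictured with a small pure-Python sketch. The function ``records_to_sequences`` and the field names below are assumptions made for this example and are not part of the ``vectorizers.transformers`` API; the idea is only that rows of a table are grouped by one column, and the values of another column become one sequence per group.

```python
from collections import defaultdict


def records_to_sequences(records, group_key, value_key):
    """Hypothetical helper: group rows and collect one sequence per group."""
    sequences = defaultdict(list)
    for record in records:
        # Rows are appended in input order, so each sequence keeps row order.
        sequences[record[group_key]].append(record[value_key])
    return dict(sequences)


# Made-up purchase log: one row per (customer, item) event.
records = [
    {"customer": "ann", "item": "tea"},
    {"customer": "bob", "item": "milk"},
    {"customer": "ann", "item": "cake"},
]
sequences = records_to_sequences(records, "customer", "item")
```

Here ``list(sequences.values())`` is a list of categorical sequences, the same general shape as the ``input_sequences`` passed to a sequence vectorizer.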