Vectorizers: Transform unstructured data into vectors
There are many machine learning tools for exploring and working with data that is given as vectors (ideally with a defined notion of distance as well). There is also a large volume of data that does not come neatly packaged as vectors: text data, variable-length sequence data (either numeric or categorical), dataframes of mixed data types, sets of point clouds, and more. Usually, one way or another, such data can be wrangled into vectors in a way that preserves some relevant properties of the original data. This library seeks to provide a wide variety of general-purpose techniques for such wrangling, making it easier and faster for users to get various kinds of unstructured sequence data into vector formats for exploration and machine learning.
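To make the idea of "wrangling into vectors" concrete, here is a minimal sketch in plain Python. It does not use this library's API; the helper `bag_of_tokens` is a hypothetical name introduced only for illustration. It turns variable-length categorical sequences into fixed-length count vectors, which preserves how often each token occurs while discarding token order.

```python
from collections import Counter


def bag_of_tokens(sequences):
    """Map variable-length token sequences to fixed-length count vectors.

    Hypothetical helper for illustration only; not part of this library.
    """
    # Vocabulary: every distinct token, sorted so the column layout is stable.
    vocabulary = sorted({token for seq in sequences for token in seq})
    column = {token: i for i, token in enumerate(vocabulary)}

    vectors = []
    for seq in sequences:
        row = [0] * len(vocabulary)
        # Count each token in the sequence and write it to the token's column.
        for token, count in Counter(seq).items():
            row[column[token]] = count
        vectors.append(row)
    return vocabulary, vectors


# Three sequences of different lengths become three vectors of equal length.
sequences = [
    ["a", "b", "a"],
    ["b", "c"],
    ["c", "c", "c", "a"],
]
vocabulary, vectors = bag_of_tokens(sequences)
print(vocabulary)
print(vectors)
```

Once every sequence has the same length, standard vector-based tools (distance computations, clustering, dimension reduction) can be applied. Which properties of the original data are worth preserving depends on the task, and counts are only one simple choice.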
- Rich Document Vectors
  - Step 0: Tokenization
  - Step 1: Generate Word Vectors
  - Step 2: Generate Simple Document Vectors Using the Learned Vocabulary
  - Step 3a: Combine Word and Document Vectors with WassersteinVectorizer
  - Step 3b: Combine Word and Document Vectors with SinkhornVectorizer
  - Step 3c: Combine Word and Document Vectors with ApproximateWassersteinVectorizer
  - Compare the Different Document Vectors
- CategoricalColumnTransformer