Variable Length Sequences

Variable length sequence data can present a significant challenge to machine learning algorithms and data science analysis.

Part of this problem is driven by the wide varieties of variable length sequence data that are encountered in the wild. To that end we present a taxonomy of the kinds of variable length sequences that we typically encounter and our suggestions for how to think about them.

We generally find it useful when describing variable length sequence data to describe what it is a sequence of. The basic types that we commonly encounter are: categorical values, scalar values and vector values. Certainly scalar data could be thought of a simple one dimensional vector data but given the different techniques that can, and often are, applied to such data we feel that treating it as a seperate data type is warranted.

Next we describe it as either ordered or unordered sequences. Yes, an unordered sequence is an odd turn of phrase but we be find it to be a useful simplifying notion. An unordered sequence is often referred to as a bag in data science literature. For a example a bag of words is the phrase used to describe an unordered collection of word tokens. We would describe such a collection as an unordered categorical sequence.

Lasty, given an ordered sequence we require one extra piece of information: is the ordered regular or irregular. Regular sequences are often described as heartbeat data and generally assume equal spacing between all our values. Irregular sequences are often referred to as event data and each value is associated with a particular position allowing variable spacing amongst our values.

Variable length sequence data comes in a vast variety of forms. Different forms of variable length sequence data are amenable to different techniques. To deal with this variety of data we propose this simple taxonomy of variable length sequence data and provide links and suggestions for techniques conducive to each type.

  • Type of values: categorical, scalar, vector

  • Order of values: Ordered or Unordered

  • Regularity of values: Regular or Irregular

Sequence Types

Categorical

regularity

order

type

sequence

example

unordered

categorical

sequence

->

bag of words

regular

ordered

categorical

sequence

->

text document

irregular

ordered

categorical

sequence

->

time stamped labelled events

Scalar

regularity

order

type

sequence

example

unordered

Scalar

sequence

->

random variable

regular

ordered

Scalar

sequence

->

heartbeat time-series

irregular

ordered

Scalar

sequence

->

time stamped values or event sequence

Vector

regularity

order

type

sequence

example

unordered

Vector

sequence

->

point cloud

regular

ordered

Vector

sequence

->

spatial-trajectory data

irregular

ordered

Vector

sequence

->

time stamped locations

Vectorizer Functions

This library adheres the sklearn transformer paradigm. With most functions having a fit, fit_transform and transform functions. As such they can be easily arranged in sklearn pipelines to ensure that all of your data transformation steps are encapsulated cleanly.

For the most part our vectorizers take in a sequence of variable length sequences and learn a fixed width representation of these sequences. Another way of thinking of this is transforming a jagged array of vectors into a fixed width array of vectors. Fixed width representations are significantly more conducive to traditional machine learning algorithms.

Transformers on the other hand are more generic utility functions that massage data in various useful ways.

Due to the variety of vectorization techniques in this library a user might find it easier to determine the type of variable length sequences they are dealing with and use the following index to find the relevant functions.

regularity

order

type

sequence

example

functions

unordered

categorical

sequence

->

bag of words

NgramVector izer, EdgeListVectorizer

regular

ordered

categorical

sequence

->

text document

NgramVectorizer, LZCompressionVectorizer, BPEVectorizer

irregular

ordered

categorical

sequence

->

time stamped labelled events

HistogramVectorizer

All of these vectorizers take data in the form of a sequence of variable length sequences of categorical values (such as strings). All of these methods presume that a user has already decomposed their data into something of this form.

The most common sources of variable length categorical data are text documents or data frames with categorical columns. In both cases some pre-processing will be necessary to convert such data into sequences of variable length sequences.

In the case of text documents this often involves tokenization and lemmatization steps. An example of applying such transformations on text data before vectorization can be found in document vectorizer.

Good tokenization and lemmatization libraries include: HuggingFace, SentencePiece, spaCy, and nltk.

In the case of a data frame with multiple categorical columns one might make use of our libraries CategoricalColumnTransformer for transforming a data frame with one or more columns into a variable length sequence of categorical sequences. This is typically done by specifying one categorical column to represent ones objects and another set of categorical columns to be used to describe said objects. For an examples of how one might use this see an introduction to CategoricalColumnTransformer or the more complicated CategoricalColumnTransformer vignette.

regularity

order

type

sequence

example

functions

unordered

Scalar

sequence

->

random variable

HistogramVectorizer, DistributionVectorizer

regular

ordered

Scalar

sequence

->

heartbeat time-series

SlidingWindowTransformer

irregular

ordered

Scalar

sequence

->

time stamped values or event sequence

KDEV ectorizer

One should note that regular ordered scalar sequences references a Transformer function instead of a Vectorizer. That is because our current recommendation for dealing with such sequences is to use the SlidingWindowTransformer to encode the sequence information into an unordered scalar sequence and then apply the appropriate techniques.

regularity

order

type

sequence

example

functions

unordered

Vector

sequence

->

point cloud

WassersteinVectorizer, DistributionVectorizer

regular

ordered

Vector

sequence

->

spatial-trajectory data

SlidingWindowTransformer

irregular

ordered

Vector

sequence

->

time stamped locations

we accept pull requests

One should note that regular ordered vector sequences references a Transformer function instead of a Vectorizer. That is because our current recommendation for dealing with such sequences is to use the SlidingWindowTransformer to encode the sequence information into an unordered vector sequence and then apply the appropriate techniques.

WassersteinVectorizer should be considered the gold standard for vectorizing point clouds of data. It makes use linear optimal transport to linearize and thus provide a reasonably scalable vectorization of a point cloud so that Euclidean or Cosine distance on this space will be a reasonable approximation of Wasserstein distance between the point cloud distrubitons. SinkhornVectorizer can handle much larger distributions of data and is generally more efficient but this efficiency may come with some loss of quality. Lastly, we include an ApproximateWassersteinVectorizer which is a heuristic linear algebra based solution which poorly approximates our WassersteinVectorizer but is very, very fast.