Variable Length Sequences¶

Variable length sequence data can present a significant challenge to machine learning algorithms and data science analysis.

Part of this problem is driven by the wide varieties of variable length sequence data that are encountered in the wild. To that end we present a taxonomy of the kinds of variable length sequences that we typically encounter and our suggestions for how to think about them.

We generally find it useful when describing variable length sequence data to describe what it is a sequence of. The basic types that we commonly encounter are: categorical values, scalar values and vector values. Certainly scalar data could be thought of a simple one dimensional vector data but given the different techniques that can, and often are, applied to such data we feel that treating it as a seperate data type is warranted.

Next we describe it as either ordered or unordered sequences. Yes, an unordered sequence is an odd turn of phrase but we be find it to be a useful simplifying notion. An unordered sequence is often referred to as a bag in data science literature. For a example a bag of words is the phrase used to describe an unordered collection of word tokens. We would describe such a collection as an unordered categorical sequence.

Lasty, given an ordered sequence we require one extra piece of information: is the ordered regular or irregular. Regular sequences are often described as heartbeat data and generally assume equal spacing between all our values. Irregular sequences are often referred to as event data and each value is associated with a particular position allowing variable spacing amongst our values.

Variable length sequence data comes in a vast variety of forms. Different forms of variable length sequence data are amenable to different techniques. To deal with this variety of data we propose this simple taxonomy of variable length sequence data and provide links and suggestions for techniques conducive to each type.

Type of values: categorical, scalar, vector
Order of values: Ordered or Unordered
Regularity of values: Regular or Irregular

Sequence Types¶

Categorical¶

regularity	order	type	sequence		example
	unordered	categorical	sequence	->	bag of words
regular	ordered	categorical	sequence	->	text document
irregular	ordered	categorical	sequence	->	time stamped labelled events

Scalar¶

regularity	order	type	sequence		example
	unordered	Scalar	sequence	->	random variable
regular	ordered	Scalar	sequence	->	heartbeat time-series
irregular	ordered	Scalar	sequence	->	time stamped values or event sequence

Vector¶

regularity	order	type	sequence		example
	unordered	Vector	sequence	->	point cloud
regular	ordered	Vector	sequence	->	spatial-trajectory data
irregular	ordered	Vector	sequence	->	time stamped locations

Vectorizer Functions¶

This library adheres the sklearn transformer paradigm. With most functions having a fit, fit_transform and transform functions. As such they can be easily arranged in sklearn pipelines to ensure that all of your data transformation steps are encapsulated cleanly.

For the most part our vectorizers take in a sequence of variable length sequences and learn a fixed width representation of these sequences. Another way of thinking of this is transforming a jagged array of vectors into a fixed width array of vectors. Fixed width representations are significantly more conducive to traditional machine learning algorithms.

Transformers on the other hand are more generic utility functions that massage data in various useful ways.

Due to the variety of vectorization techniques in this library a user might find it easier to determine the type of variable length sequences they are dealing with and use the following index to find the relevant functions.

regularity	order	type	sequence		example	functions
	unordered	categorical	sequence	->	bag of words	NgramVector izer, EdgeListVectorizer
regular	ordered	categorical	sequence	->	text document	NgramVectorizer, LZCompressionVectorizer, BPEVectorizer
irregular	ordered	categorical	sequence	->	time stamped labelled events	HistogramVectorizer

All of these vectorizers take data in the form of a sequence of variable length sequences of categorical values (such as strings). All of these methods presume that a user has already decomposed their data into something of this form.

The most common sources of variable length categorical data are text documents or data frames with categorical columns. In both cases some pre-processing will be necessary to convert such data into sequences of variable length sequences.

In the case of text documents this often involves tokenization and lemmatization steps. An example of applying such transformations on text data before vectorization can be found in document vectorizer.

Good tokenization and lemmatization libraries include: HuggingFace, SentencePiece, spaCy, and nltk.

In the case of a data frame with multiple categorical columns one might make use of our libraries CategoricalColumnTransformer for transforming a data frame with one or more columns into a variable length sequence of categorical sequences. This is typically done by specifying one categorical column to represent ones objects and another set of categorical columns to be used to describe said objects. For an examples of how one might use this see an introduction to CategoricalColumnTransformer or the more complicated CategoricalColumnTransformer vignette.

regularity	order	type	sequence		example	functions
	unordered	Scalar	sequence	->	random variable	HistogramVectorizer, DistributionVectorizer
regular	ordered	Scalar	sequence	->	heartbeat time-series	SlidingWindowTransformer
irregular	ordered	Scalar	sequence	->	time stamped values or event sequence	KDEV ectorizer

One should note that regular ordered scalar sequences references a Transformer function instead of a Vectorizer. That is because our current recommendation for dealing with such sequences is to use the SlidingWindowTransformer to encode the sequence information into an unordered scalar sequence and then apply the appropriate techniques.

regularity	order	type	sequence		example	functions
	unordered	Vector	sequence	->	point cloud	WassersteinVectorizer, DistributionVectorizer
regular	ordered	Vector	sequence	->	spatial-trajectory data	SlidingWindowTransformer
irregular	ordered	Vector	sequence	->	time stamped locations	`we accept pull requests`

One should note that regular ordered vector sequences references a Transformer function instead of a Vectorizer. That is because our current recommendation for dealing with such sequences is to use the SlidingWindowTransformer to encode the sequence information into an unordered vector sequence and then apply the appropriate techniques.

WassersteinVectorizer should be considered the gold standard for vectorizing point clouds of data. It makes use linear optimal transport to linearize and thus provide a reasonably scalable vectorization of a point cloud so that Euclidean or Cosine distance on this space will be a reasonable approximation of Wasserstein distance between the point cloud distrubitons. SinkhornVectorizer can handle much larger distributions of data and is generally more efficient but this efficiency may come with some loss of quality. Lastly, we include an ApproximateWassersteinVectorizer which is a heuristic linear algebra based solution which poorly approximates our WassersteinVectorizer but is very, very fast.