What is Factorized Self-Attention for time series models?
It's self-attention with extra steps to capture time and space relationships
Ever look at a stock chart and wonder how today's price might relate to both last week's movement AND what other stocks are doing right now?
That's the challenge time series models face - capturing both the patterns over time and the relationships between variables.
The extra complexity of time series
Time series data isn't just a flat sequence like text.
It has two dimensions:
Temporal patterns (how values change over time)
Relationships between variables (how different time series correlate)
Standard attention mechanisms - the ones that revolutionized language models - treat everything as a one-dimensional sequence.
This works fine for text, but it's like trying to understand a chess game by only looking at the sequence of moves without seeing the board position. You miss the spatial relationships between pieces that are just as important as the sequence of turns.
For financial data, this distinction is huge. A stock's movement relates to both its own history (time) and other market indicators (variables).
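To make the two dimensions concrete, here's a minimal sketch in Python. The array shapes and ticker choices are purely illustrative assumptions, not tied to any real dataset:

```python
import numpy as np

# A multivariate time series is a 2-D grid, not a flat sequence.
# Rows are time steps, columns are variables (e.g. hypothetical daily closes).
T, V = 30, 3                       # 30 days, 3 series: say NVDA, QQQ, SPY (illustrative)
prices = np.random.rand(T, V)      # shape (time, variables)

temporal_slice = prices[:, 0]      # one variable's full history -> the "time" dimension
cross_section  = prices[-1, :]     # all variables on one day    -> the "variable" dimension
print(temporal_slice.shape, cross_section.shape)  # (30,) (3,)
```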
The basics of attention
Before diving into factorized attention, let's quickly review how attention works.
Attention allows a given token to gain context from other tokens by learning which other tokens matter to its meaning. This weighting is learned during model training.
In transformer models, each piece of data (a patch of the time series) produces three components:
Query (Q): What information am I looking for?
Key (K): What information do I offer?
Value (V): What content do I contain?
The process works like this:
For a given token (patch), compute the Q vector. This shows what information this token is looking for.
Then compute the K vector for every other token. This shows what information these tokens have to share.
We then use the dot product to find the most similar matches between Q and K - how closely what we're looking for matches what each token offers. These similarity scores, normalized with a softmax, form the attention weights.
The weights are then used to compute a new context vector by taking the weighted sum of the V vectors, using the attention weights.
Our context vector is the new representation of the token, now with context from surrounding tokens.
In language models, words attend to other words. In time series, patches of data attend to other patches.
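Here's a minimal single-head sketch of that process in PyTorch. The weight matrices and dimensions are arbitrary placeholders for illustration, not a specific model's parameters:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Single-head attention over token embeddings x of shape (seq_len, d_model)."""
    q = x @ w_q                                # what each token is looking for
    k = x @ w_k                                # what each token offers
    v = x @ w_v                                # what each token contains
    scores = q @ k.T / k.shape[-1] ** 0.5      # similarity between queries and keys
    weights = F.softmax(scores, dim=-1)        # attention weights
    return weights @ v                         # context vectors, one per token

# Toy usage: 5 patches with 8-dimensional embeddings (sizes chosen arbitrarily)
d_model = 8
x = torch.randn(5, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
context = scaled_dot_product_attention(x, w_q, w_k, w_v)   # shape (5, 8)
```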
Factorized attention explained
For time series, we split the attention process into two separate parts: time-wise and space-wise.
Time-wise Attention
This looks at relationships across time for the same variable. It's like asking "How does today's NVIDIA price relate to its price last week or last month?"
Space-wise Attention
This looks at relationships between different variables at the same time point. It's like asking "How does NVIDIA's movement relate to the QQQ index right now?"
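A minimal PyTorch sketch of this two-step pattern, assuming the input is already embedded into a (batch, time, variables, d_model) tensor. The module name and dimensions are illustrative assumptions, not a reference implementation of any particular paper:

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """Sketch: attention applied separately along the time axis and the variable axis."""

    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, variables, d_model)
        b, t, v, d = x.shape

        # Time-wise: each variable attends over its own history.
        xt = x.permute(0, 2, 1, 3).reshape(b * v, t, d)    # (batch*variables, time, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, v, t, d).permute(0, 2, 1, 3)

        # Space-wise: variables attend to each other within the same time step.
        xs = x.reshape(b * t, v, d)                        # (batch*time, variables, d)
        xs, _ = self.space_attn(xs, xs, xs)
        return xs.reshape(b, t, v, d)

# Toy usage: 2 samples, 16 time patches, 3 variables, 32-dim embeddings
x = torch.randn(2, 16, 3, 32)
out = FactorizedAttention(d_model=32)(x)   # same shape as the input
```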
A trading example: time-wise vs space-wise attention
Imagine you're modeling NVDA alongside the QQQ tech index:
Time-wise attention helps NVIDIA's price today look at its own history:
Is this pattern similar to what happened last Friday?
Am I following my typical end-of-month behavior?
What happened the last time I formed this chart pattern?
Each symbol is looking at the sequence of time steps in its own series.
Space-wise attention helps NVIDIA look at QQQ and other stocks:
Is the broader tech sector moving in the same direction?
Am I outperforming or underperforming the market today?
Could QQQ's movement predict my next move?
Each symbol is looking at what other symbols did in the same time step.
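To tie this example back to the sketch above, here's how a hypothetical NVDA/QQQ input would flow through it. The shapes are illustrative assumptions:

```python
import torch
# Reuses the FactorizedAttention sketch defined earlier.

# 1 sample, 60 daily patches, 2 symbols (NVDA, QQQ), 32-dim patch embeddings.
x = torch.randn(1, 60, 2, 32)
out = FactorizedAttention(d_model=32)(x)

# Inside the model, time-wise attention scores a (60 x 60) map per symbol
# ("how does today relate to each prior day of my own series?"), while
# space-wise attention scores a (2 x 2) map per day
# ("how does NVDA relate to QQQ at this time step?").
print(out.shape)  # torch.Size([1, 60, 2, 32])
```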
Factorized attention for time series
Factorized attention gives time series models what they've been missing - the ability to separately model time and variable relationships.
By respecting the inherent structure of the data, these models can make more accurate predictions with more efficient computation.
In tomorrow's post, we'll dive deeper into the mathematics that make this possible.
How to Implement Factorized Attention for Time Series
Yesterday I explained the concept of factorized attention – splitting normal attention into time-wise and space-wise components.
View all related code here on my GitHub.