What is Factorized Self-Attention for time series models?
It's self-attention with extra steps to capture time and space relationships
Ever look at a stock chart and wonder how today's price might relate to both last week's movement AND what other stocks are doing right now?
That's the challenge time series models face - capturing both the patterns over time and the relationships between variables.
The extra complexity of time series
Time series data isn't just a flat sequence like text.
It has two dimensions:
Temporal patterns (how values change over time)
Relationships between variables (how different time series correlate)
Standard attention mechanisms - the ones that revolutionized language models - treat everything as a one-dimensional sequence.
This works fine for text, but it's like trying to understand a chess game by only looking at the sequence of moves without seeing the board position. You miss the spatial relationships between pieces that are just as important as the sequence of turns.
For financial data, this distinction is huge. A stock's movement relates to both its own history (time) and other market indicators (variables).
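To make the two dimensions concrete, here's a minimal sketch in Python. The array shapes and ticker choices are purely illustrative assumptions, not tied to any real dataset:

```python
import numpy as np

# A multivariate time series is a 2-D grid, not a flat sequence.
# Rows are time steps, columns are variables (e.g. hypothetical daily closes).
T, V = 30, 3                       # 30 days, 3 series: say NVDA, QQQ, SPY (illustrative)
prices = np.random.rand(T, V)      # shape (time, variables)

temporal_slice = prices[:, 0]      # one variable's full history -> the "time" dimension
cross_section  = prices[-1, :]     # all variables on one day    -> the "variable" dimension
print(temporal_slice.shape, cross_section.shape)  # (30,) (3,)
```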
The basics of attention
Before diving into factorized attention, let's quickly review how attention works.
Attention allows a given token to gain context from other tokens by learning which other tokens matter to its meaning. This weighting is learned during model training.
In transformer models, each piece of data (a patch of the time series) produces three components:
Query (Q): What information am I looking for?
Key (K): What information do I offer?
Value (V): What content do I contain?
The process works like this:
For a given token (patch), compute the Q vector. This shows what information this token is looking for.
Then compute the K vector for every other token. This shows what information these tokens have to share.
We then use the dot product to find the most similar matches between Q and K - how closely what we're looking for matches what each token offers. These similarity scores, normalized with a softmax, form the attention weights.
The weights are then used to compute a new context vector by taking the weighted sum of the V vectors, using the attention weights.
Our context vector is the new representation of the token, now with context from surrounding tokens.
In language models, words attend to other words. In time series, patches of data attend to other patches.
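Here's a minimal single-head sketch of that process in PyTorch. The weight matrices and dimensions are arbitrary placeholders for illustration, not a specific model's parameters:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Single-head attention over token embeddings x of shape (seq_len, d_model)."""
    q = x @ w_q                                # what each token is looking for
    k = x @ w_k                                # what each token offers
    v = x @ w_v                                # what each token contains
    scores = q @ k.T / k.shape[-1] ** 0.5      # similarity between queries and keys
    weights = F.softmax(scores, dim=-1)        # attention weights
    return weights @ v                         # context vectors, one per token

# Toy usage: 5 patches with 8-dimensional embeddings (sizes chosen arbitrarily)
d_model = 8
x = torch.randn(5, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
context = scaled_dot_product_attention(x, w_q, w_k, w_v)   # shape (5, 8)
```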
Factorized attention explained
For time series, we split the attention process into two separate parts: time-wise and space-wise.
Time-wise Attention
This looks at relationships across time for the same variable. It's like asking "How does today's NVIDIA price relate to its price last week or last month?"
Space-wise Attention
This looks at relationships between different variables at the same time point. It's like asking "How does NVIDIA's movement relate to the QQQ index right now?"
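A minimal PyTorch sketch of this two-step pattern, assuming the input is already embedded into a (batch, time, variables, d_model) tensor. The module name and dimensions are illustrative assumptions, not a reference implementation of any particular paper:

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """Sketch: attention applied separately along the time axis and the variable axis."""

    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, variables, d_model)
        b, t, v, d = x.shape

        # Time-wise: each variable attends over its own history.
        xt = x.permute(0, 2, 1, 3).reshape(b * v, t, d)    # (batch*variables, time, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, v, t, d).permute(0, 2, 1, 3)

        # Space-wise: variables attend to each other within the same time step.
        xs = x.reshape(b * t, v, d)                        # (batch*time, variables, d)
        xs, _ = self.space_attn(xs, xs, xs)
        return xs.reshape(b, t, v, d)

# Toy usage: 2 samples, 16 time patches, 3 variables, 32-dim embeddings
x = torch.randn(2, 16, 3, 32)
out = FactorizedAttention(d_model=32)(x)   # same shape as the input
```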
A trading example: time-wise vs space-wise attention
Imagine you're modeling NVDA alongside the QQQ tech index:
Time-wise attention helps NVIDIA's price today look at its own history:
Is this pattern similar to what happened last Friday?
Am I following my typical end-of-month behavior?
What happened the last time I formed this chart pattern?
Each symbol is looking at the sequence of time steps in its own series.
Space-wise attention helps NVIDIA look at QQQ and other stocks:
Is the broader tech sector moving in the same direction?
Am I outperforming or underperforming the market today?
Could QQQ's movement predict my next move?
Each symbol is looking at what other symbols did in the same time step.
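To tie this example back to the sketch above, here's how a hypothetical NVDA/QQQ input would flow through it. The shapes are illustrative assumptions:

```python
import torch
# Reuses the FactorizedAttention sketch defined earlier.

# 1 sample, 60 daily patches, 2 symbols (NVDA, QQQ), 32-dim patch embeddings.
x = torch.randn(1, 60, 2, 32)
out = FactorizedAttention(d_model=32)(x)

# Inside the model, time-wise attention scores a (60 x 60) map per symbol
# ("how does today relate to each prior day of my own series?"), while
# space-wise attention scores a (2 x 2) map per day
# ("how does NVDA relate to QQQ at this time step?").
print(out.shape)  # torch.Size([1, 60, 2, 32])
```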
Factorized attention for time series
Factorized attention gives time series models what they've been missing - the ability to separately model time and variable relationships.
By respecting the inherent structure of the data, these models can make more accurate predictions with more efficient computation.
In tomorrow's post, we'll dive deeper into the mathematics that make this possible.
How to Implement Factorized Attention for Time Series
Yesterday I explained the concept of factorized attention – splitting normal attention into time-wise and space-wise components.
View all related code here on my GitHub.