Time series forecasting underpins critical applications across diverse fields, including climate science, energy management, healthcare and finance. Olaf Yunus Laitinen Imanov (DTU Compute, Technical University of Denmark), Derya Umut Kulali (Eskisehir Technical University), Taner Yilmaz (Afyon Kocatepe University) and colleagues present PatchFormer, a novel approach designed to overcome the limitations of current methods, which often demand extensive task-specific data and feature engineering. This research introduces a patch-based foundation model that leverages hierarchical masked reconstruction for self-supervised pretraining and efficient transfer learning via lightweight adapters. Demonstrating state-of-the-art zero-shot multi-horizon forecasting on 24 benchmark datasets, PatchFormer achieves a remarkable 27.3 percent reduction in mean squared error compared to strong baselines while requiring 94 percent less labelled data.
The research, published in IEEE Transactions on Neural Networks and Learning Systems, addresses limitations in existing time series forecasting methods that often require extensive domain-specific feature engineering and large amounts of labelled data. The team achieved state-of-the-art zero-shot multi-horizon forecasting by employing a two-stage approach: large-scale self-supervised pretraining and efficient fine-tuning via adapter modules. This methodology allows PatchFormer to learn robust temporal representations from unlabelled data, significantly reducing the need for task-specific training.
The core of PatchFormer lies in its hierarchical patch tokenization, which segments time series into semantically meaningful patches of varying scales, ranging from 16 to 128 timesteps, and learns multiscale temporal representations. This approach not only captures both short-term fluctuations and long-term trends but also reduces computational complexity, moving from O(L²) to O(L²/P²). Researchers then implemented a contrastive masked reconstruction technique, combining masked patch reconstruction with dynamic masking strategies that adapt to the characteristics of each sequence. This self-supervised pretraining was conducted on an impressive 87 billion data points, encouraging both local accuracy and global consistency in the learned representations.
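To make the patching idea concrete, here is a minimal sketch of non-overlapping patch tokenization. The function name and the handling of trailing remainders are illustrative assumptions, not the authors' implementation; only the patch-size range (16 to 128 timesteps) and the length-512 sequences come from the article.

```python
import numpy as np

def tokenize_patches(series: np.ndarray, patch_size: int) -> np.ndarray:
    """Segment a 1-D time series of length L into non-overlapping patches
    of `patch_size` timesteps; any trailing remainder is dropped."""
    n_patches = len(series) // patch_size
    return series[: n_patches * patch_size].reshape(n_patches, patch_size)

x = np.arange(512, dtype=float)          # a length-512 sequence
for P in (16, 32, 64, 128):              # the paper's range of patch sizes
    patches = tokenize_patches(x, P)
    # self-attention now runs over L/P patch tokens instead of L timesteps
    print(P, patches.shape)
```

With P = 16, a 512-step sequence becomes 32 tokens, so quadratic attention cost falls with the square of the patch size.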
Experiments conducted on 24 benchmark datasets, spanning weather, energy, traffic, finance, and healthcare, demonstrated PatchFormer's exceptional performance. The study reveals a remarkable 27.3 percent reduction in mean squared error (MSE) compared to strong baseline models, while simultaneously requiring 94 percent less task-specific training data. Furthermore, the model exhibits near log-linear scaling with pretraining data, up to 100 billion points, indicating its potential for even greater performance gains with increased data exposure. Notably, PatchFormer processes length-512 sequences 3.8 times faster than traditional full-sequence transformers, highlighting its computational efficiency. The research establishes patch-based foundation models as a practical paradigm for universal time series forecasting, offering a significant advancement over existing methods. This breakthrough opens exciting possibilities for applications requiring accurate and efficient time series predictions, from improved climate modelling and energy grid management to more reliable financial forecasting and healthcare monitoring.
The Scientists' Method
Scientists developed PatchFormer, a novel patch-based time series forecasting model, to overcome limitations in existing approaches requiring extensive domain-specific feature engineering and labelled data. The study pioneers a self-supervised pretraining strategy coupled with lightweight adapters for efficient transfer learning across diverse time series datasets. Researchers segmented input time series into patches, enabling the learning of multiscale temporal representations through learnable aggregation across varying temporal scales; specifically, patch sizes ranged from 16 to 128 timesteps. This hierarchical patch tokenization reduces computational complexity from O(L²) to O(L²/P²), where L represents sequence length and P the patch size.
The core of the work involved a contrastive masked reconstruction technique for self-supervised pretraining, combining masked patch reconstruction with contrastive learning. Dynamic masking strategies were implemented, adapting to the unique characteristics of each sequence to enhance learning efficiency; this ensured both local accuracy and global consistency within the reconstructed time series. Experiments employed a two-stage adaptation process, beginning with knowledge distillation and culminating in adapter-based fine-tuning, allowing for rapid deployment with only 2-5% parameter updates. The team engineered a system capable of processing length-512 sequences 3.8 times faster than conventional full-sequence transformers.
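The masked-reconstruction objective can be sketched as follows. This is a simplified illustration with a fixed masking ratio; the paper's dynamic, sequence-adaptive masking and the contrastive term are omitted, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches: np.ndarray, mask_ratio: float = 0.4):
    """Randomly hide a fraction of patches; the model must reconstruct
    them from the visible ones (the self-supervised objective)."""
    n = len(patches)
    n_mask = int(round(n * mask_ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_mask]] = True
    corrupted = patches.copy()
    corrupted[mask] = 0.0                 # zero out the masked patches
    return corrupted, mask

def reconstruction_loss(pred, target, mask):
    """MSE computed only on the masked patches (the 'local accuracy' term)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

patches = rng.standard_normal((32, 16))   # 32 patches of 16 timesteps
corrupted, mask = mask_patches(patches)
# a perfect reconstruction drives the masked-patch MSE to zero
assert reconstruction_loss(patches, patches, mask) == 0.0
```

In the full method the masking ratio would vary per sequence, and a contrastive loss over patch embeddings would supply the global-consistency signal described in the article.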
Researchers defined the input as a multivariate time series, segmented into patches. Each patch underwent a learnable transformation projecting it into an embedding space: eᵢ = W_e · Flatten(X_p⁽ⁱ⁾) + b_e ∈ ℝ^(d_model), where W_e represents the weight matrix and b_e the bias term. Multi-scale extraction generated K sequences with patch sizes {P₁, …, P_K}, where P_k = 2^(k−1) P₁, and hierarchical aggregation combined these scales using learned attention weights α_k: E_agg = Σₖ₌₁ᴷ α_k E⁽ᵏ⁾↑, where E⁽ᵏ⁾↑ denotes the k-th scale's embeddings upsampled to the finest resolution. Adding positional encodings then injects temporal order, giving E_final = E_agg + P_pos. The transformer encoder processed these patch embeddings through L_enc layers, utilising multi-head self-attention (MHSA) and feed-forward networks (FFN), with attention complexity reduced to O((L/P)² · d) = O(L² · d/P²). This methodology achieved a 27.3 percent reduction in mean squared error compared to strong baselines, while requiring 94 percent less task-specific training data.
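The embedding and aggregation steps above can be sketched numerically. This is a toy version under stated assumptions: random (untrained) projection weights, uniform scale weights α_k, nearest-neighbour upsampling via repetition, and an illustrative d_model of 8; none of these choices are from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d_model = 512, 8
P_sizes = [16, 32, 64]                   # K = 3 scales with P_k = 2^(k-1) * P_1

def embed(series, P, W, b):
    """e_i = W_e @ flatten(patch_i) + b_e for every patch of size P."""
    tokens = series.reshape(-1, P)       # (L/P, P)
    return tokens @ W.T + b              # (L/P, d_model)

series = rng.standard_normal(L)
E = []
for P in P_sizes:
    W = rng.standard_normal((d_model, P)) / np.sqrt(P)   # stand-in for learned W_e
    E.append(embed(series, P, W, np.zeros(d_model)))

alpha = np.full(len(P_sizes), 1.0 / len(P_sizes))        # learned α_k; uniform here

# upsample coarser scales to the finest token rate, then take the weighted sum
n_fine = L // P_sizes[0]
E_agg = sum(a * np.repeat(e, n_fine // len(e), axis=0)
            for a, e in zip(alpha, E))
print(E_agg.shape)                       # one aggregated vector per finest patch
```

A positional encoding of the same shape would then be added to E_agg to form E_final, as in the equation above.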
PatchFormer achieves superior zero-shot forecasting accuracy across diverse domains
Scientists have developed PatchFormer, a novel patch-based time series forecasting model demonstrating state-of-the-art zero-shot multi-horizon performance. The research team achieved a remarkable 27.3 percent reduction in mean squared error (MSE) compared to strong baseline models across 24 benchmark datasets, spanning weather, energy, traffic, finance, and healthcare, while simultaneously requiring 94 percent less task-specific training data. Experiments revealed that PatchFormer effectively segments time series into patches and learns multiscale temporal representations through learnable aggregation across temporal scales, significantly improving forecasting accuracy. The breakthrough delivers a hierarchical patch tokenization strategy, extracting patches of varying lengths from 16 to 128 timesteps, which captures both short-term fluctuations and long-term trends while reducing computational complexity.
Measurements confirm that this approach reduces the complexity from O(L²) to O(L²/P²), where L represents the sequence length and P the patch size. Data shows that the model exhibits near log-linear scaling with pretraining data, successfully processing up to 100 billion data points, and processes length-512 sequences 3.8 times faster than traditional full-sequence transformers. This enhanced processing speed is crucial for real-time applications requiring rapid predictions. Researchers employed a contrastive masked reconstruction technique for self-supervised pretraining, combining masked patch reconstruction with dynamic masking strategies tailored to the characteristics of each time series.
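The complexity claim is easy to verify with arithmetic on the attention term alone (a back-of-the-envelope check, not a benchmark; real wall-clock speedups such as the reported 3.8x also depend on embedding, FFN, and memory costs).

```python
# Attention cost scales with the square of the token count: O((L/P)^2 * d).
L = 512
for P in (1, 16, 128):                  # P = 1 is the full-sequence baseline
    tokens = L // P
    print(f"P={P}: {tokens} tokens, {tokens ** 2} pairwise attention scores")

# P = 16 cuts 512^2 = 262144 score computations to 32^2 = 1024 per layer,
# a P^2 = 256-fold reduction in the quadratic attention term.
assert L ** 2 // (L // 16) ** 2 == 16 ** 2
```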
The dynamic masking adapts to sequence characteristics, encouraging both local accuracy and global consistency during the pretraining phase. Tests prove that cross-domain knowledge distillation, involving a two-stage adaptation process, enables rapid deployment with only 2-5 percent parameter updates, making the model highly adaptable to new datasets. Furthermore, the study meticulously measured performance across diverse datasets, consistently demonstrating superior forecasting capabilities. The team recorded substantial improvements in forecasting accuracy, particularly in scenarios with limited labelled data, highlighting the effectiveness of the self-supervised pretraining approach. These results establish patch-based foundation models as a practical paradigm for universal time series forecasting, opening avenues for more accurate and efficient predictions in critical domains like climate modelling and financial analysis.
PatchFormer achieves zero-shot forecasting with less data than conventional methods
Scientists have developed PatchFormer, a novel patch-based time series foundation model designed for accurate zero-shot forecasting across diverse domains. This model employs hierarchical tokenization, contrastive pretraining, and efficient adaptation techniques to achieve state-of-the-art performance with significantly reduced reliance on task-specific data, requiring 94% less than conventional methods. Key innovations include a hierarchical multi-scale architecture that combines patch representations at multiple resolutions, resulting in a 14-15% performance improvement, and effective pretraining on a substantial 87 billion data points to learn transferable representations, leading to a 27.3% reduction in mean squared error. Researchers demonstrated strong zero-shot generalization capabilities, notably improving traffic forecasting by 31% through weather pretraining. Furthermore, PatchFormer exhibits sample efficiency, achieving baseline performance with only 500 samples compared to the 10,000 typically needed when training from scratch, and computational efficiency, processing sequences 3.8 times faster than full-sequence transformers. The authors acknowledge limitations in uncertainty calibration and extreme event prediction as areas for future work, alongside potential advancements in multimodal pretraining, causal representation learning, continual learning for evolving data distributions, and federated pretraining for privacy-preserving collaboration.
👉 More information
🗞 PatchFormer: A Patch-Based Time Series Foundation Model with Hierarchical Masked Reconstruction and Cross-Domain Transfer Learning for Zero-Shot Multi-Horizon Forecasting
🧠 ArXiv: https://arxiv.org/abs/2601.20845
