Adapting large artificial intelligence models for specific tasks typically demands significant computational resources and access to the original training data, creating a barrier to wider adoption. Bhoomit Vasani from Amazon AGI, Jack FitzGerald from EdgeRunner AI, and Anjie Fang and Sushmit Vaish, also from Amazon AGI, now present a method called PHLoRA, which extracts low-rank adapters from existing, fully fine-tuned models without needing any additional data or retraining. The approach analyses the changes made to the model’s weights during fine-tuning and reconstructs compact adapter modules, offering substantial cost savings and reduced latency for deploying AI across various applications. The team demonstrates that PHLoRA captures the essential information from the original fine-tuning process, allowing existing models to benefit from efficient, scalable inference without performance loss and democratising access to advanced AI capabilities.
By analysing the difference in weights between a base model and its fine-tuned counterpart, scientists developed a method to reconstruct adapter modules that can be merged or dynamically routed during inference, or served in scalable settings. This approach reduces latency and yields substantial cost savings, decoupling fine-tuning from adapter generation and allowing adapters to be extracted from existing models or third-party checkpoints.
Post-hoc Low-Rank Adaptation for LLMs
This research details PHLoRA (Post-Hoc Low-Rank Adaptation), a method for efficiently adapting large language models (LLMs) after fine-tuning. The core challenge is that fine-tuning LLMs requires significant computational resources and storage, and while adapter methods reduce the number of trainable parameters, they still require storing and serving multiple adapters for different tasks. This research aims to compress and optimize already fine-tuned LLMs without retraining. PHLoRA is a post-hoc method that extracts low-rank approximations from the weight changes produced by fine-tuning: the model is first fine-tuned in the usual way, updating its full-rank weight matrices, and PHLoRA then operates directly on the resulting checkpoint.
After fine-tuning, the difference between the fine-tuned weights and the original weights is calculated. Singular Value Decomposition (SVD) is then applied to these weight differences to identify the most important singular values and vectors. A low-rank approximation of each weight difference is reconstructed from these leading components and added to the original weights, recreating the adapted model. This compresses the adaptation information into far fewer parameters, reducing storage and potentially improving inference speed. By the Eckart-Young theorem, the truncated SVD gives the best possible low-rank approximation of the weight differences, minimizing reconstruction error.
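The extraction step can be sketched in a few lines of PyTorch. The snippet below assumes plain weight matrices from a base and a fine-tuned checkpoint; the function name, the fixed rank, and the absence of any LoRA scaling factor are illustrative choices rather than details taken from the paper.

```python
import torch

def extract_low_rank_adapter(w_base: torch.Tensor, w_ft: torch.Tensor, rank: int):
    """Factor the fine-tuning update of one weight matrix into LoRA-style A/B matrices."""
    delta = w_ft - w_base                        # weight change introduced by fine-tuning
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    B = U[:, :rank] * S[:rank]                   # (out_features, rank), columns scaled by singular values
    A = Vh[:rank, :]                             # (rank, in_features)
    return A, B

# Re-merging the extracted adapter approximately recovers the fine-tuned layer:
#   w_approx = w_base + B @ A
```

Keeping only the top singular components is what makes the adapter compact: for a 4096-by-4096 projection at rank 16, the A and B factors hold roughly 131,000 parameters instead of nearly 17 million.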
PHLoRA significantly compresses the adaptation information without substantial performance degradation across a range of tasks. The compressed models require less storage and can be served more efficiently. Energy analysis, which tracks how much of each weight update is retained at a given rank, shows that the method preserves the important information at different compression levels and makes the trade-off between compression and accuracy explicit.
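A common way to quantify this "energy" is the fraction of the update's squared singular values captured at a given rank; the sketch below uses that convention, though the paper's exact metric may differ.

```python
import torch

def preserved_energy(delta: torch.Tensor, rank: int) -> float:
    """Fraction of the weight update's squared singular values kept by a rank-`rank` adapter."""
    s = torch.linalg.svdvals(delta)   # singular values, returned in descending order
    return float((s[:rank] ** 2).sum() / (s ** 2).sum())
```

Sweeping the rank and comparing preserved energy against downstream accuracy is one way to read off the compression-accuracy trade-off described above.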
Adapters Extracted From Fine-Tuned Models Significantly Reduce Latency
Scientists developed PHLoRA, a novel method for extracting low-rank adaptation modules from fully fine-tuned models without requiring training data or gradients. This work demonstrates the ability to reconstruct adapters by computing the low-rank decomposition of weight differences between a base model and its fine-tuned counterpart, enabling either merging or dynamic routing during inference. The team achieved over a tenfold reduction in model-load latency compared to full-rank checkpoints by utilizing these compact adapters. Experiments across text, image, and video benchmarks demonstrate that PHLoRA preserves substantial energy from the full weight changes while allowing for safe pruning of adapters.
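A rough illustration of why the adapters load so much faster: once the base model is resident in memory, switching tasks only requires applying a small pair of factors to each targeted layer rather than loading a second full-rank checkpoint. The merge and unmerge helpers below are a minimal sketch under that assumption, not the paper's implementation, and use a simple scaling factor of 1.0.

```python
import torch

def merge_adapter_(layer: torch.nn.Linear, A: torch.Tensor, B: torch.Tensor, scale: float = 1.0) -> None:
    """Fold a LoRA-style adapter into a linear layer's weight in place."""
    with torch.no_grad():
        layer.weight += scale * (B @ A)

def unmerge_adapter_(layer: torch.nn.Linear, A: torch.Tensor, B: torch.Tensor, scale: float = 1.0) -> None:
    """Remove the adapter again, e.g. before routing a request to a different task."""
    with torch.no_grad():
        layer.weight -= scale * (B @ A)
```

In a dynamic-routing setup of this kind, the base weights stay loaded while requests for different tasks merge and unmerge their respective adapters, which is one way the reported load-latency gains could be realised.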
Measurements confirm that the extracted adapters yield negligible degradation in downstream task performance when re-merged with the base model. Specifically, the team reports preserved-energy levels that quantify the fraction of the important weight updates retained by the adapters. The research shows that PHLoRA is compatible with existing tools such as the HuggingFace PEFT library and multi-adapter serving frameworks, simplifying deployment. Energy-based analysis reveals that the method captures the essential information within the weight updates, with the measured preserved energy correlating with adapter performance. The team also evaluated performance on text benchmarks including TAT-QA, MKFE, and MedMCQA, as well as multimodal benchmarks such as VQA-RAD and CaptionGen, demonstrating consistent performance across modalities. The method delivers up to a fourfold reduction in inference cost while maintaining accuracy, showcasing its potential for efficient and scalable deployment.
PHLoRA Enables Efficient Adapter Reconstruction and Pruning
The team presents PHLoRA, a method for deriving low-rank adapters from fully fine-tuned models without requiring access to original training data or gradients. This approach reconstructs adapter modules from existing full-rank models, enabling efficient and scalable inference through dynamic routing or standard adapter-serving platforms. Experiments across text, image, and video tasks demonstrate that PHLoRA preserves substantial energy from the original weight updates, allowing adapters to be pruned safely with minimal performance degradation when they are re-integrated. Notably, PHLoRA achieves competitive accuracy while significantly reducing inference costs, with savings of up to fourfold over standard adapter inference and full-rank model inference in dynamic-routing scenarios.
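One simple way to operationalise this kind of safe pruning is to pick, per layer, the smallest rank that retains a target share of the update's energy. The heuristic below is illustrative, and the 0.95 threshold is an assumption rather than a number from the paper.

```python
import torch

def minimal_rank_for_energy(delta: torch.Tensor, threshold: float = 0.95) -> int:
    """Smallest rank whose truncated SVD keeps at least `threshold` of the update's energy."""
    s = torch.linalg.svdvals(delta)
    cumulative = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
    return int((cumulative < threshold).sum().item()) + 1
```

Layers whose updates concentrate nearly all of their energy in a few singular directions can then be given very small adapters, or skipped entirely when their update is negligible.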
This improvement stems from increased inference throughput, processing more requests per unit of time. By enabling the creation of adapter-ready models from existing checkpoints, PHLoRA democratizes scalable inference for a wider range of applications and legacy models. The authors acknowledge that current experiments focused on moderate-sized datasets and supervised fine-tuning. Future research will explore scaling PHLoRA to larger, more challenging benchmarks and diverse modalities, as well as extending its compatibility with advanced fine-tuning techniques.
👉 More information
🗞 PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint
🧠 ArXiv: https://arxiv.org/abs/2509.10971
