Hubble Suite of 1B- and 8B-Parameter Models Trained on 100B-500B Tokens Advances the Study of LLM Memorization

The potential for large language models to inadvertently memorize and reveal sensitive training data presents a significant challenge to their widespread adoption. Johnny Tian-Zheng Wei from the University of Southern California, together with Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, and James Flemings, has introduced Hubble, a comprehensive and openly available suite of models designed to advance the study of this critical issue. Hubble consists of both standard and carefully perturbed language models, allowing researchers to systematically investigate how the frequency and timing of sensitive data within the training process affect memorization risk. Their work establishes that the likelihood of a model recalling specific information depends on how often that information appears relative to the overall size of the training data, and shows that continued exposure is necessary for memorized content to be retained. These insights suggest two effective mitigation strategies, increasing the size of the training data and presenting sensitive information early in training, and the suite also offers a valuable testbed for developing and benchmarking techniques in membership inference and machine unlearning.

LLM Memorization, Privacy, and Unlearning Techniques

This research details a comprehensive investigation into the memorization behavior of large language models (LLMs) and into methods for understanding and controlling that memorization. The researchers trained models with either 1 billion or 8 billion parameters on datasets designed to test memorization of copyrighted material, private information, and standard benchmark content. They then evaluated these models to measure the extent to which each type of data was memorized, and explored techniques for unlearning or mitigating memorization. The core goal is to develop models that perform well on their tasks without excessively memorizing sensitive or copyrighted data.

The study reveals that larger models memorize more data than smaller models, even when trained on the same dataset. A key contribution is the perturbed training approach, in which controlled text, such as copyrighted passages or private information, is inserted into the pretraining corpus, producing models whose memorization patterns resemble those of models exposed to such data in a real training set. This allows researchers to study memorization across several domains without training a separate model for each. However, the team found it challenging to unlearn the targeted data while preserving the model's behavior on the data it should retain, highlighting the difficulty of selectively erasing memorization. The research also demonstrates that memorization is not uniform; it is strongly influenced by the type of data the model is exposed to.
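To make the perturbed-training idea concrete, here is a minimal sketch of how controlled text could be interleaved into a pretraining stream at chosen duplication counts; the function, toy corpus, and inserted passages are illustrative assumptions, not the authors' actual pipeline.

```python
import random

def interleave_perturbations(corpus_docs, perturbation_docs, duplication_counts, seed=0):
    """Scatter copies of 'perturbation' documents (book passages, biographies,
    test sets, ...) through an ordinary pretraining corpus at chosen duplication
    counts, so memorization can later be measured as a function of how often
    each insert appeared."""
    rng = random.Random(seed)
    mixed = list(corpus_docs)
    for doc, count in zip(perturbation_docs, duplication_counts):
        for _ in range(count):
            mixed.insert(rng.randrange(len(mixed) + 1), doc)
    return mixed

# Toy usage: one biography inserted once, one book passage inserted eight times.
corpus = [f"ordinary web document {i}" for i in range(10_000)]
inserts = [
    "Jane Doe was born on 4 March 1975 in Lisbon ...",
    "Chapter 1. The rain fell steadily over the harbour ...",
]
training_stream = interleave_perturbations(corpus, inserts, duplication_counts=[1, 8])
print(len(training_stream))  # 10_009 documents
```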

Memorization Risks in Large Language Models

The research team engineered the Hubble suite, a collection of fully open-source large language models, specifically designed to investigate memorization risks within these systems. The core of this work involves eight models, both standard and perturbed, with parameter sizes of 1 billion or 8 billion, pretrained on datasets containing either 100 billion or 500 billion tokens. This configuration allows researchers to determine how the frequency of sensitive data within the training corpus impacts memorization, establishing that less frequent data in larger corpora is memorized less effectively. Additionally, six perturbed models were trained with sensitive data inserted at different stages of pretraining, revealing that data not continually reinforced during training can be effectively forgotten.
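The configuration grid described above can be summarized in a few lines; the run descriptors below are illustrative, and the stage labels for the six timing-perturbed models are assumptions rather than the paper's own names.

```python
from itertools import product

# Core Hubble grid: {standard, perturbed} x {1B, 8B params} x {100B, 500B tokens} = 8 models.
core_runs = [
    {"variant": v, "params": p, "tokens": t}
    for v, p, t in product(["standard", "perturbed"], ["1B", "8B"], ["100B", "500B"])
]
assert len(core_runs) == 8

# Six further perturbed runs vary *when* the inserted text appears during
# pretraining; the stage labels here are placeholders, not the paper's terms.
timing_runs = [
    {"variant": "perturbed", "params": p, "insertion_stage": s}
    for p, s in product(["1B", "8B"], ["early", "middle", "late"])
]
assert len(timing_runs) == 6
```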

To simulate realistic privacy-leakage scenarios, the team incorporated biographies, constructed from templated text sourced from the YAGO knowledge base and from annotated court cases of the European Court of Human Rights, to study the memorization of personally identifiable information. They also included dialogues from the PersonaChat dataset with randomly assigned usernames to simulate indirect leakage of sensitive attributes, a setting motivated by the observation that even small models can infer personal characteristics from public text. The researchers further introduced deliberate contamination by inserting standard benchmarks alongside newly created test sets, enabling detailed analysis of test-set contamination and the development of methods for detecting memorization. The entire Hubble suite, encompassing models, training code, configurations, datasets, and evaluation tools, is publicly available in line with the principles of open science and is designed to be accessible to researchers with limited computational resources. This commitment to transparency and reproducibility supports a broad range of memorization research and provides a robust platform for advancing language model security and privacy.
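As a rough illustration of the templated-biography inserts, the sketch below renders a biography from a structured fact record; the template, field names, and example values are invented for this example and are not drawn from YAGO or from the paper.

```python
# Illustrative only: a placeholder template for the kind of knowledge-base-derived
# biography inserts described above, not the authors' exact text.
BIO_TEMPLATE = (
    "{name} was born on {birth_date} in {birth_place}. "
    "{pronoun} works as a {occupation} and lives in {residence}."
)

def render_biography(record: dict) -> str:
    """Fill the biography template from a structured fact record."""
    return BIO_TEMPLATE.format(**record)

example_record = {
    "name": "Jane Doe",
    "birth_date": "4 March 1975",
    "birth_place": "Lisbon",
    "pronoun": "She",
    "occupation": "civil engineer",
    "residence": "Porto",
}
print(render_biography(example_record))
```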

Memorization Risk Scales With Dataset Size

The research team presents Hubble, a suite of large language models specifically designed to investigate how these models memorize information. The study comprises eight models, both standard and perturbed, with either 1 billion or 8 billion parameters, trained on datasets of 100 billion or 500 billion tokens. The perturbed models were trained with deliberately inserted text, including book passages, biographies, and test sets, to simulate scenarios where sensitive data might be memorized. Results demonstrate that the risk of memorization is directly linked to the frequency of sensitive data relative to the overall size of the training dataset; a password appearing once in a smaller dataset is memorized more readily than the same password within a larger one.

Further experiments involved six additional perturbed models in which the inserted text was introduced at different stages of the pretraining process. These tests revealed that sensitive data, if not continually reinforced during training, can be effectively forgotten by the model. The team also conducted a comprehensive set of evaluations to measure memorization, including loss-based probes and generative tests, to determine how well the models recalled the inserted information. These evaluations established lower bounds on memorization, providing insight into the extent of the information retained.
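The two evaluation styles mentioned above, loss-based and generative, could be probed along the following lines against a released checkpoint; the model identifier and the inserted passage are placeholders, and this is a sketch of the general technique rather than the paper's evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hubble-1b"  # placeholder; substitute the actual released checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

passage = "The password for the archive is correct-horse-battery-staple."

# 1) Loss-based probe: unusually low loss on an inserted passage, relative to
#    comparable held-out text, suggests the passage was memorized.
inputs = tokenizer(passage, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss.item()
print(f"mean per-token loss on inserted passage: {loss:.3f}")

# 2) Generative probe: prompt with a prefix and check whether greedy decoding
#    reproduces the rest of the passage verbatim (a lower bound on memorization).
target_ids = inputs["input_ids"]
prefix_ids = target_ids[:, :8]
with torch.no_grad():
    generated = model.generate(
        prefix_ids, max_new_tokens=target_ids.shape[1] - 8, do_sample=False
    )
exact_match = torch.equal(generated[0][: target_ids.shape[1]], target_ids[0])
print("verbatim continuation recovered:", exact_match)
```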

The study’s findings yield two key best practices for mitigating memorization risks: diluting sensitive data by increasing the size of the training corpus, and ordering sensitive data so that it appears earlier in the training process. Measurements confirm that models trained on 500 billion tokens exhibit weaker memorization of the inserted data than those trained on 100 billion tokens, even when the content is duplicated the same number of times. These findings establish Hubble as a valuable resource for ongoing research into language model memorization, membership inference, and machine unlearning.
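The dilution recommendation comes down to how large a fraction of the corpus the sensitive text occupies, as the back-of-the-envelope comparison below illustrates (the passage length and duplication count are made-up values).

```python
# Same sensitive passage, duplicated the same number of times, occupies a 5x
# smaller fraction of a 500B-token corpus than of a 100B-token one; this is
# the sense in which a larger corpus "dilutes" the sensitive data.
passage_tokens = 200
duplicates = 8

for corpus_tokens in (100e9, 500e9):
    exposure = duplicates * passage_tokens / corpus_tokens
    print(f"{corpus_tokens / 1e9:.0f}B-token corpus: relative exposure = {exposure:.2e}")
```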

Memorization Risk Scales With Data Frequency

Hubble, a suite of open-source language models, represents a significant advance in the study of memorization risks within these powerful systems. Researchers developed both standard and perturbed models, systematically inserting controlled text, including book passages, biographies, and test sets, to investigate how easily sensitive information is retained. The core finding establishes that the frequency of sensitive data relative to the overall size of the training corpus determines the extent of memorization; a password appearing once in a smaller dataset is more readily memorized than the same password within a larger one. Furthermore, the team demonstrated that sensitive data, if not continually reinforced during training, can be effectively forgotten, suggesting practical strategies for mitigating memorization risks. Based on these findings, the researchers propose two best practices: increasing the size of the training corpus to dilute sensitive data, and ordering sensitive data to appear earlier in the training process.

👉 More information
🗞 Hubble: a Model Suite to Advance the Study of LLM Memorization
🧠 ArXiv: https://arxiv.org/abs/2510.19811

Rohail T.
