Researchers at The Hong Kong Polytechnic University (PolyU) have achieved data storage and retrieval using proteins engineered to act as digital data carriers, demonstrating the potential for a sustainable and high-capacity alternative to conventional storage methods. Inspired by the sequence pattern of collagen, Professor Zhongping Yao and his interdisciplinary team designed a protein “backbone” to embed encoded files, successfully expressing data-bearing proteins via E. coli. This approach outperforms both DNA and peptides in storage capacity and stability, with proteins preservable in powder or solution form, addressing limitations of hard drives and cloud storage strained by the growth of data from artificial intelligence. The team’s findings have been published in Nature Communications.
Engineered Proteins Overcome DNA & Peptide Storage Limitations
Departing from conventional molecular storage strategies, the PolyU team, including Zhongping Yao, Associate Head and Professor of the Department of Applied Biology and Chemical Technology, drew inspiration from the structural arrangement of collagen to design a protein “backbone” capable of embedding digital information. This approach addresses limitations inherent in both DNA- and peptide-based systems, offering increased storage capacity and improved stability. Previous attempts at molecular data storage often relied on DNA, which is limited by its four nucleotide types, or on peptides, which are constrained by short molecular sequences and costly chemical synthesis. The team circumvented these issues by using proteins, whose amino acid sequences are significantly longer than those of peptides, enabling higher storage efficiency.
Crucially, these data-bearing proteins can be produced biologically, using systems such as bacteria to generate large quantities at reduced cost. The protein samples in this research achieved 30 times the storage density at only 10% of the cost of the peptide-based method. A key achievement was the development of algorithm-driven software that reconstructs complete protein sequences and converts them back into readable bit strings, completing the data cycle. The team also demonstrated that the proteins can be preserved in both powder and solution form, enhancing their practicality and longevity. Beyond simple storage, the researchers explored the potential of these proteins to enable random data access and even cryptographic protection, opening possibilities for secure data archiving and, potentially, storage within living organisms.
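To make the "data cycle" concrete, the following is a minimal sketch of how a bit string could be mapped onto amino acid letters and back. The eight-letter alphabet, the 3-bits-per-residue mapping, and the function names are assumptions chosen for illustration; the article does not describe the encoding actually used by the PolyU team.

```python
# Hypothetical alphabet: 8 amino acid letters, so each residue carries 3 bits.
# The real alphabet and mapping used in the research are not given in the article.
ALPHABET = "AGSTVLEK"
BITS_PER_RESIDUE = 3


def bytes_to_residues(data: bytes) -> str:
    """Encode raw bytes as a string of amino acid letters."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Pad with zeros so the bit string splits evenly into 3-bit groups.
    bits += "0" * ((-len(bits)) % BITS_PER_RESIDUE)
    return "".join(
        ALPHABET[int(bits[i:i + BITS_PER_RESIDUE], 2)]
        for i in range(0, len(bits), BITS_PER_RESIDUE)
    )


def residues_to_bytes(sequence: str, n_bytes: int) -> bytes:
    """Decode amino acid letters back into the original n_bytes of data."""
    bits = "".join(f"{ALPHABET.index(residue):03b}" for residue in sequence)
    bits = bits[: n_bytes * 8]  # discard the padding bits added during encoding
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    message = b"PolyU"
    encoded = bytes_to_residues(message)
    print(encoded, residues_to_bytes(encoded, len(message)))
```

In this sketch the decoder needs to know the original byte count in order to strip the padding; a real scheme would presumably carry that information, along with error-correction data, inside the sequence itself.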
Collagen-Inspired Template Enables Stable Protein Data Encoding
Building on earlier work with peptides, the team led by Professor Zhongping Yao of the Department of Applied Biology and Chemical Technology turned to proteins as data carriers because of their potential for higher storage efficiency and capacity: proteins possess substantially longer amino acid sequences than peptides, addressing a key limitation of previous approaches. The team’s innovation centers on a protein template inspired by the structural pattern of collagen, a naturally stable protein, to enhance resistance to degradation and maintain solubility. By embedding data-bearing amino acid sequences into this collagen-like template, the researchers successfully expressed the proteins in E. coli bacteria, enabling large-scale, low-cost production. Retrieving the encoded data required digesting the proteins, analyzing the resulting peptide fragments by liquid chromatography-tandem mass spectrometry, and then reconstructing the full protein sequences with algorithm-driven software developed in-house.
This software reconstructs full protein sequences and converts them back into bit strings for complete data retrieval, and it incorporates an error-correction scheme. Notably, the proteins exhibited superior stability compared to DNA, remaining readable for extended periods even in challenging conditions, and can be preserved in both powder and solution form. The PolyU team also demonstrated random access and data encryption by attaching affinity tags to the proteins and using corresponding antibodies to capture them, opening possibilities for secure and targeted data retrieval.
“Moving forward, we aim to achieve mass storage capabilities, faster data writing and reading speeds, and further reductions in protein production costs, while designing diverse protein templates to bring new functionalities to protein-based data storage,” Yao said.
Liquid Chromatography-Mass Spectrometry Reconstructs Protein Bit Strings
The approach addresses a significant challenge: protein sequencing techniques have previously been used mainly for identification, which requires only partial sequence matches against existing databases. Full data recovery demands accurate reconstruction of the entire sequence, a task now accomplished through the team’s algorithm-driven software and an error-correction scheme. Yao explains that the protein samples achieved 30 times the storage density at only 10% of the cost of the peptide-based method. The team’s earlier work on peptide storage demonstrated suitability for space exploration in China’s next-generation manned spacecraft, and this latest development further establishes proteins as a promising medium for meeting the escalating data demands of artificial intelligence.
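The difference between partial matching and full reconstruction can be illustrated with a small sketch that stitches overlapping peptide fragments back into one sequence. This is a generic greedy overlap merge, not the team's software: it assumes the fragments are already error-free letter strings that overlap by at least three residues, whereas real reconstruction from mass spectra is considerably more involved. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(fragments: list[str]) -> list[str]:
    """Remove duplicates and fragments fully contained in another fragment."""
    unique = list(dict.fromkeys(fragments))
    return [f for f in unique if not any(f != g and f in g for g in unique)]


def assemble(fragments: list[str], min_len: int = 3) -> str:
    """Greedily merge the pair with the largest overlap until one sequence remains."""
    frags = drop_contained(fragments)
    if not frags:
        raise ValueError("no fragments to assemble")
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            raise ValueError("fragments do not overlap enough to assemble")
        merged = frags[best_i] + frags[best_j][best_k:]
        rest = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags = drop_contained(rest + [merged])
    return frags[0]


if __name__ == "__main__":
    protein = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence for demonstration only
    pieces = [protein[11:], protein[0:9], protein[5:15]]
    print(assemble(pieces) == protein)
```

Greedy merging can fail on sequences with repeated motifs, which is one reason a practical pipeline would pair reconstruction with an error-correction scheme, as the researchers report doing.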
The inherent stability, ease of preservation and high storage capacity of proteins make them excellent carriers for the long-term storage of large volumes of data.
Functionalized Proteins Achieve Random Access & Data Encryption
Functionalization addresses a key limitation of non-functionalized proteins, where retrieving specific data segments requires decoding the entire dataset. The team attached affinity tags to proteins carrying targeted data, then used corresponding antibodies to selectively “capture” those proteins during purification. This targeted retrieval represents a significant step toward practical data management within a protein-based storage framework. Using these functionalized proteins, the researchers also encoded secret messages that could only be retrieved using a known affinity compound, demonstrating the potential for secure data storage at the molecular level. “The protein samples in our research achieved 30 times the storage density at only 10% of the cost of the peptide-based method,” Yao explained.
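The logic of tag-based random access can be modeled in a few lines. The sketch below treats a sample as a mixed pool of proteins, each carrying a hypothetical tag label and a data payload, and "captures" only those whose tag matches the chosen binder. The class, function, and tag names are illustrative assumptions and do not correspond to the reagents used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedProtein:
    tag: str      # hypothetical affinity tag label
    payload: str  # data-bearing amino acid sequence


def capture(pool: list[TaggedProtein], binder_for: str) -> list[str]:
    """Return only the payloads whose tag matches the binder, leaving the rest undecoded."""
    return [protein.payload for protein in pool if protein.tag == binder_for]


if __name__ == "__main__":
    pool = [
        TaggedProtein(tag="TAG_A", payload="AGSTVL"),
        TaggedProtein(tag="TAG_B", payload="KELVTS"),
        TaggedProtein(tag="TAG_A", payload="GGSTEK"),
    ]
    print(capture(pool, "TAG_B"))
```

The same model hints at the encryption idea: without knowing which binder corresponds to the secret message, a reader has no way to pull that protein out of the pool.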
The work builds on the team’s earlier peptide-based data storage, which was shown to be suitable for China’s next-generation manned spacecraft. Retrieval relies on digestion, analysis by liquid chromatography-tandem mass spectrometry, and the team’s algorithm-driven software, which reconstructs full protein sequences and converts them back into bit strings. Yao anticipates that future developments will focus on increasing storage capacity, accelerating data writing and reading speeds, and reducing production costs, while also exploring diverse protein templates to unlock further functionalities.
