Protein-based data storage offers high capacity and stability, and it allows established protein techniques to be used for storing and retrieving data. However, expressing unnatural proteins with random sequences and sequencing them for accurate data retrieval remain challenging. In this study, we encode digital data into amino acid sequences and incorporate them into collagen-like protein templates, which allows the proteins to be expressed successfully in E. coli for data storage. The data-bearing proteins, which contain selected amino acids and arginine intervals, can be sequenced through tryptic digestion followed by LC-MS/MS analysis to achieve complete data recovery, even for protein mixtures encoding multiple datasets. We further demonstrate that the data-bearing protein is much more stable than DNA, and we show random access and cryptographic data protection using affinity-tagged proteins. This work establishes a robust framework for protein-based data storage, opening up avenues for data storage and retrieval, protein engineering and chemistry, synthetic biology, proteomics, and beyond.
Zhou et al. studied this question.
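The encode, digest, and recover idea described in the abstract can be illustrated with a toy model. The sketch below is not the authors' encoding scheme: the 8-letter alphabet `ALPHABET`, the block length `BLOCK`, and the function names are all hypothetical, chosen only so that each residue carries 3 bits and so that lysine, arginine, and proline never appear in the payload. It also omits the collagen-like template and assumes the peptides are returned in their original order, which a real LC-MS/MS workflow would have to establish separately. The digest function uses the commonly stated trypsin rule (cleavage C-terminal to K or R, except before P).

```python
# Toy model of data -> amino acids -> tryptic peptides -> data.
# ALPHABET and BLOCK are hypothetical choices, not taken from the paper.
ALPHABET = "ADEGLSTV"  # 8 residues = 3 bits each; excludes K, R and P
BLOCK = 8              # data residues placed between arginine (R) separators


def encode(data: bytes) -> str:
    """Map bytes to residues (3 bits each) and append R after every block."""
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % 3)  # zero-pad to a multiple of 3 bits
    residues = [ALPHABET[int(bits[i:i + 3], 2)] for i in range(0, len(bits), 3)]
    blocks = ["".join(residues[i:i + BLOCK]) for i in range(0, len(residues), BLOCK)]
    return "".join(block + "R" for block in blocks)


def tryptic_digest(protein: str) -> list[str]:
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # trailing fragment without K/R
    return peptides


def decode(peptides: list[str], n_bytes: int) -> bytes:
    """Rebuild the bytes from peptides, assumed to be in original order."""
    residues = "".join(peptide.rstrip("R") for peptide in peptides)
    bits = "".join(f"{ALPHABET.index(residue):03b}" for residue in residues)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, n_bytes * 8, 8))


if __name__ == "__main__":
    message = b"protein storage"
    protein = encode(message)
    peptides = tryptic_digest(protein)
    print(protein)
    print(peptides)
    print(decode(peptides, len(message)))
```

Because the payload alphabet contains no K, R, or P, the only cleavage sites are the inserted arginines, so every peptide in this model has a bounded length and ends in R. That is the role the arginine intervals play in this sketch; the paper's actual residue choice and spacing may differ.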