Compression of genomic sequencing data

Methods of compressing data tailored specifically for genomic data

High-throughput sequencing technologies have led to a dramatic decline in genome sequencing costs and to a rapid accumulation of genomic data. These technologies enable ambitious genome sequencing endeavours, such as the 1000 Genomes Project and the 1001 (Arabidopsis thaliana) Genomes Project. Storing and transferring this volume of genomic data has become a mainstream problem, motivating the development of high-performance compression tools designed specifically for genomic data. A recent surge of interest in novel algorithms and tools for storing and managing genomic re-sequencing data underlines the growing demand for efficient genomic data compression methods.

General concepts

While standard data compression tools (e.g., zip and rar) are used to compress sequence data (e.g., the GenBank flat file database), this approach has been criticized as extravagant, because genomic sequences often contain repetitive content (e.g., microsatellite sequences) and many sequences exhibit high levels of similarity (e.g., multiple genome sequences from the same species). Additionally, the statistical and information-theoretic properties of genomic sequences can potentially be exploited for compressing sequencing data.^[1]^[2]^[3]
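As a rough illustration of the statistical properties mentioned above, the following Python sketch compares three views of a toy sequence: its order-0 Shannon entropy, a fixed 2-bit-per-base packing, and the output of a general-purpose deflate compressor. The sequences, function names and the A/C/G/T-only alphabet are assumptions made for this example and are not taken from any tool listed in this article.

```python
import math
import zlib
from collections import Counter


def order0_entropy_bits(seq: str) -> float:
    """Shannon entropy of the symbol frequencies, in bits per symbol."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | code[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


# Hypothetical toy inputs: a microsatellite-like repeat and a sequence
# in which all four bases are equally frequent.
microsatellite = "CA" * 500
uniform = "ACGT" * 250

entropy_bits = order0_entropy_bits(uniform)
packed = pack_2bit(microsatellite)
deflated = zlib.compress(microsatellite.encode("ascii"), 9)

print(f"order-0 entropy (uniform bases): {entropy_bits:.2f} bits/base")
print(f"2-bit packing of the repeat:     {len(packed)} bytes")
print(f"deflate output for the repeat:   {len(deflated)} bytes")
```

A fixed 2-bit code ignores repetition entirely, whereas a dictionary-based compressor shrinks the repeat much further. Real genomes are far less regular than this toy input, so the gap seen here should not be read as a typical result.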

Figure 1: The principal steps of a workflow for compressing genomic re-sequencing data: (1) processing the original sequencing data (e.g., reducing the original dataset to only the variations relative to a specified reference sequence); (2) encoding the processed data into binary form; and (3) decoding the data back to text form.
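The three steps in the figure caption can be sketched in Python as follows. This is a minimal sketch that assumes the sample and the reference have equal length and differ only by single-base substitutions. The record layout (a 4-byte position followed by a 1-byte base) and all function names are hypothetical choices for illustration, not the format of any listed tool.

```python
import struct


def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Step 1: keep only (position, base) pairs where the sample differs."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]


def encode(variants: list[tuple[int, str]]) -> bytes:
    """Step 2: store each variant as a big-endian uint32 position plus one byte."""
    return b"".join(struct.pack(">IB", pos, ord(base)) for pos, base in variants)


def decode(blob: bytes, reference: str) -> str:
    """Step 3: apply the stored variants to the reference to recover the text."""
    seq = list(reference)
    for pos, base_code in struct.iter_unpack(">IB", blob):
        seq[pos] = chr(base_code)
    return "".join(seq)


# Hypothetical toy data: a 20-base reference and a sample with two substitutions.
reference = "ACGTACGTACGTACGTACGT"
sample_bases = list(reference)
sample_bases[3] = "A"
sample_bases[12] = "G"
sample = "".join(sample_bases)

variants = diff_against_reference(reference, sample)
blob = encode(variants)
restored = decode(blob, reference)
```

Only the differences are stored, so the encoded size grows with the number of variants rather than with the genome length. The decoder needs access to the same reference sequence, and a realistic tool would also have to represent insertions and deletions.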

Algorithm design choices

A universal approach to compressing genomic data may not be optimal, as a particular method may be better suited to specific purposes and aims. Several design choices that potentially impact compression performance are therefore worth considering.

List of genomic re-sequencing data compression tools

The compression ratio of currently available genomic data compression tools ranges from 65-fold to 1,200-fold for human genomes.^[4]^[5]^[6]^[7]^[8]^[9]^[10]^[13] Very close variants or revisions of the same genome can be compressed very efficiently (for example, an 18,133-fold compression ratio was reported^[6] for two revisions of the same A. thaliana genome, which are 99.999% identical). However, such compression is not indicative of the typical compression ratio for different genomes (individuals) of the same organism. The most common encoding scheme amongst these tools is Huffman coding, which is used for lossless data compression.
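To make the Huffman coding mentioned above concrete, here is a minimal Python sketch that builds a Huffman code from symbol frequencies, encodes a toy string and computes a fold ratio. It assumes the ratio is defined as original size divided by compressed size, with the original stored as 8-bit text. The input string and function names are hypothetical, and the sketch is not the implementation used by any tool in the tables.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(text: str) -> dict[str, str]:
    """Build a prefix-free code in which frequent symbols get shorter codewords."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(n, next(tie), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def huffman_decode(bits: str, code: dict[str, str]) -> str:
    """Decode by reading bits until they match a codeword."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


# Hypothetical toy input with a skewed base composition.
text = "A" * 8 + "C" * 4 + "G" * 2 + "T" * 2
code = huffman_code(text)
bits = "".join(code[sym] for sym in text)

# Fold ratio relative to 8-bit text, ignoring the cost of storing the code table.
ratio = (8 * len(text)) / len(bits)
```

The ratio computed here ignores the space needed to store the code table and depends entirely on how skewed the symbol frequencies are, so it is not comparable to the figures reported for the tools.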


Genomic sequencing data compression tools compatible with standard genome sequencing file formats (BAM & FASTQ)
Software	Description	Compression Ratio	Data Used for Evaluation	Approach/Encoding Scheme	Link	Use License	Reference
PetaSuite	Lossless compression tool for BAM and FASTQ.gz files; transparent on-the-fly readback through BAM and FASTQ.gz virtual files	60% to 90%	Human genome sequences from the 1000 Genomes Project		https://petagene.com	Commercial	^[14]
Genozip	A universal compressor for genomic files – compresses FASTQ, SAM/BAM/CRAM, VCF/BCF, FASTA, GFF/GTF/GVF, PHYLIP, BED and 23andMe files	^[15] ^[16]	Human genome sequences from the 1000 Genomes Project	Genozip extensible framework	http://genozip.com	Commercial, but free for non-commercial use	^[17]
Genomic Squeeze (G-SQZ)	Lossless compression tool designed for storing and analyzing sequencing read data	65% to 76%	Human genome sequences from the 1000 Genomes Project	Huffman coding	http://public.tgen.org/sqz	-Undeclared-	^[8]
CRAM (part of SAMtools)	Highly efficient and tunable reference-based compression of sequence data	^[18]	European Nucleotide Archive	deflate and rANS	http://www.ebi.ac.uk/ena/software/cram-toolkit	Apache-2.0	^[19]
Genome Compressor (GeCo)	A tool using a mixture of multiple Markov models for compressing reference and reference-free sequences		Human nuclear genome sequence	Arithmetic coding	http://bioinformatics.ua.pt/software/geco/ or https://pratas.github.io/geco/	GPLv3	^[13]
GenomSys codecs	Lossless compression of BAM and FASTQ files into the standard format ISO/IEC 23092^[20] (MPEG-G)	60% to 90%	Human genome sequences from the 1000 Genomes Project	Context-adaptive binary arithmetic coding (CABAC)	https://www.genomsys.com	Commercial	^[21]
fastafs	Compression of FASTA / UCSC2Bit files into random-access compressed archives. Toolkit to mount FASTA files, indices and dictionary files virtually. This allows neat file system (API-like) integration without the need to fully decompress archives for random / partial access.		FASTA files	Huffman coding as implemented by Zstd	https://github.com/yhoogstrate/fastafs	GPL-v2.0	^[22]


This article uses material from the Wikipedia article Compression_of_genomic_sequencing_data, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

References

[1]
Giancarlo, R.; Scaturro, D.; Utro, F. (2009). "Textual data compression in computational biology: A synopsis". Bioinformatics. 25 (13): 1575–1586. doi:10.1093/bioinformatics/btp117. PMID 19251772.

[2]
Nalbantoğlu, O. U.; Russell, D. J.; Sayood, K. (2010). "Data Compression Concepts and Algorithms and their Applications to Bioinformatics". Entropy. 12 (1): 34. doi:10.3390/e12010034. PMC 2821113. PMID 20157640.

[3]
Hosseini, Morteza; Pratas, Diogo; Pinho, Armando (2016). "A Survey on Data Compression Methods for Biological Sequences". Information. 7 (4): 56. doi:10.3390/info7040056.

[4]
Brandon, M. C.; Wallace, D. C.; Baldi, P. (2009). "Data structures and compression algorithms for genomic sequence data". Bioinformatics. 25 (14): 1731–1738. doi:10.1093/bioinformatics/btp319. PMC 2705231. PMID 19447783.

[5]
Deorowicz, S.; Grabowski, S. (2011). "Robust relative compression of genomes with random access". Bioinformatics. 27 (21): 2979–2986. doi:10.1093/bioinformatics/btr505. PMID 21896510.

[6]
Wang, C.; Zhang, D. (2011). "A novel compression tool for efficient storage of genome resequencing data". Nucleic Acids Research. 39 (7): e45. doi:10.1093/nar/gkr009. PMC 3074166. PMID 21266471.

[7]
Pinho, A. J.; Pratas, D.; Garcia, S. P. (2012). "GReEn: A tool for efficient compression of genome resequencing data". Nucleic Acids Research. 40 (4): e27. doi:10.1093/nar/gkr1124. PMC 3287168. PMID 22139935.

[8]
Tembe, W.; Lowey, J.; Suh, E. (2010). "G-SQZ: Compact encoding of genomic sequence and quality data". Bioinformatics. 26 (17): 2192–2194. doi:10.1093/bioinformatics/btq346. PMID 20605925.

[9]
Christley, S.; Lu, Y.; Li, C.; Xie, X. (2009). "Human genomes as email attachments". Bioinformatics. 25 (2): 274–275. doi:10.1093/bioinformatics/btn582. PMID 18996942.

[10]
Pavlichin, D. S.; Weissman, T.; Yona, G. (2013). "The human genome contracts again". Bioinformatics. 29 (17): 2199–2202. doi:10.1093/bioinformatics/btt362. PMID 23793748.

[11]
Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin (2011). "Reference Sequence Construction for Relative Compression of Genomes". String Processing and Information Retrieval. Lecture Notes in Computer Science. Vol. 7024. pp. 420–425. doi:10.1007/978-3-642-24583-1_41. ISBN 978-3-642-24582-4. S2CID 16007637.

[12]
Grabowski, Szymon; Deorowicz, Sebastian (2011). "Engineering Relative Compression of Genomes". arXiv:1103.2351 [cs.CE].

[13]
Pratas, D.; Pinho, A. J.; Ferreira, P. J. S. G. (2016). "Efficient compression of genomic sequences". Data Compression Conference. Snowbird, Utah.

[14]
"The Importance of Data Compression in the Field of Genomics". IEEE Pulse. 2019-04-26. Retrieved 2024-02-22.

[15]
Lan, Divon; Llamas, Bastien (14 September 2022). "Genozip 14 - advances in compression of BAM and CRAM files". bioRxiv. doi:10.1101/2022.09.12.507582. S2CID 252357508.

[16]
Lan, Divon; Hughes, Daniel S T; Llamas, Bastien (7 July 2023). "Deep FASTQ and BAM co-compression in Genozip 15". bioRxiv. doi:10.1101/2023.07.07.548069. S2CID 259764998.

[17]
Lan, Divon; Tobler, Ray; Souilmi, Yassine; Llamas, Bastien (25 August 2021). "Genozip: a universal extensible genomic data compressor". Bioinformatics. 37 (16): 2225–2230. doi:10.1093/bioinformatics/btab102. PMC 8388020. PMID 33585897.

[18]
CRAM benchmarking

[19]
CRAM format specification (version 3.0)

[20]
"ISO/IEC 23092-2:2019 Information technology — Genomic information representation — Part 2: Coding of genomic information". iso.org.

[21]
Alberti, Claudio; Paridaens, Tom; Voges, Jan; Naro, Daniel; Ahmad, Junaid J.; Ravasi, Massimo; Renzi, Daniele; Zoia, Giorgio; Ochoa, Idoia; Mattavelli, Marco; Delgado, Jaime; Hernaez, Mikel (27 September 2018). "An introduction to MPEG-G, the new ISO standard for genomic information representation". bioRxiv 10.1101/426353.

[22]
Hoogstrate, Youri; Jenster, Guido W.; van de Werken, Harmen J. G. (December 2021). "FASTAFS: file system virtualisation of random access compressed FASTA files". BMC Bioinformatics. 22 (1): 535. doi:10.1186/s12859-021-04455-3. PMC 8558547. PMID 34724897.

Software	Description	Compression Ratio	Data Used for Evaluation	Approach/Encoding Scheme	Link	Use License	Reference
Genome Differential Compressor (GDC)	LZ77-style tool for compressing multiple genomes of the same species	180 to 250-fold / 70 to 100-fold	Nuclear genome sequence of human and Saccharomyces cerevisiae	Huffman coding	http://sun.aei.polsl.pl/gdc	GPLv2	^[5]
Genome Re-Sequencing (GRS)	Reference sequence-based tool independent of a reference SNP map or sequence variation information	159-fold / 18,133-fold / 82-fold	Nuclear genome sequence of human, Arabidopsis thaliana (different revisions of the same genome), and Oryza sativa	Huffman coding	https://web.archive.org/web/20121209070434/http://gmdd.shgmo.org/Computational-Biology/GRS/	free of charge for non-commercial use	^[6]
Genome Re-sequencing Encoding (GReEn)	Probabilistic copy model-based tool for compressing re-sequencing data using a reference sequence	~100-fold	Human nuclear genome sequence	Arithmetic coding	http://bioinformatics.ua.pt/software/green/	-Undeclared-	^[7]
DNAzip	A package of compression tools	~750-fold	Human nuclear genome sequence	Huffman coding	http://www.ics.uci.edu/~dnazip/	-Undeclared-	^[9]
GenomeZip	Compression with respect to a reference genome. Optionally uses external databases of genomic variations (e.g. dbSNP)	~1200-fold	Human nuclear genome sequence (Watson) and sequences from the 1000 Genomes Project	Entropy coding for approximations of empirical distributions	https://sourceforge.net/projects/genomezip/	-Undeclared-	^[10]
