Coverage describes the number of sequencing reads that are uniquely mapped to a reference and “cover” a known part of the genome. Ideally, the uniquely aligned reads are distributed uniformly across the reference genome and therefore provide uniform coverage. In reality, coverage is not uniform, and genetic regions of interest may be underrepresented for a variety of reasons (see table below). One is that the genome itself is complex: it contains genes, noncoding DNA, repetitive sequences, and other elements that can make it difficult to align a sequencing read to the proper genomic coordinates.
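To make the idea of per-position coverage and uniformity concrete, here is a minimal sketch in Python. It assumes reads have already been uniquely aligned and reduced to 0-based, half-open (start, end) intervals within a region. The function names and the 0.2 × mean-depth threshold are illustrative assumptions, not definitions taken from this article.

```python
from statistics import mean


def per_base_depth(region_length, alignments):
    """Return the read depth at each position of a region.

    `alignments` is a list of (start, end) intervals, 0-based and half-open,
    assumed to be uniquely mapped reads already clipped to the region.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (region_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1

    depth, running = [], 0
    for delta in diff[:region_length]:
        running += delta
        depth.append(running)
    return depth


def uniformity(depth, fraction_of_mean=0.2):
    """Fraction of positions with depth >= fraction_of_mean * mean depth.

    The 0.2 default is an assumed, illustrative threshold.
    """
    m = mean(depth)
    if m == 0:
        return 0.0
    return sum(d >= fraction_of_mean * m for d in depth) / len(depth)


# Toy example: four reads over a 10-base region; the last two bases are uncovered.
reads = [(0, 5), (2, 7), (3, 8), (3, 8)]
depth = per_base_depth(10, reads)
print(depth)
print(uniformity(depth))
```

In this toy region the mean depth looks reasonable, yet the uniformity value falls below 1 because two positions receive no reads at all, which is the kind of gap that a mean alone hides.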
The number of sequencing reads that map to a known region is also an important part of coverage. A sufficient number of properly mapped reads is required to find and correctly identify genetic mutations. With high sequencing coverage, researchers can find the proverbial ‘needle in the haystack’: they can identify low-frequency mutations or discover mutations in a heterogeneous sample such as a tumor biopsy. Poor coverage, whether due to too few reads or to reads that are mapped incorrectly, makes it impossible to reliably detect the variants of interest.
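The link between read depth and the ability to see a low-frequency mutation can be sketched with a simple binomial model. This is an illustration under stated assumptions, not the article's method: each read covering the site is treated as an independent draw that carries the variant with probability equal to the variant allele fraction, and sequencing error is ignored. The function name and the `min_alt_reads` calling threshold are hypothetical.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_alt_reads):
    """Probability of seeing at least `min_alt_reads` variant-supporting reads.

    Assumes a binomial model: `depth` independent reads, each carrying the
    variant with probability `allele_fraction`. Sequencing error is ignored.
    """
    # Sum the probabilities of observing fewer than min_alt_reads variant reads.
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


# Compare a 1% variant at two depths, requiring at least 5 supporting reads.
for depth in (100, 1000):
    p = detection_probability(depth, allele_fraction=0.01, min_alt_reads=5)
    print(f"depth {depth}: P(detect) = {p:.3f}")
```

Under this model, the chance of collecting enough supporting reads for a rare variant rises steeply with depth, which is why low-frequency mutations in heterogeneous samples call for much higher coverage than germline variants.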
How does throughput relate to sequencing coverage?
Coverage is clearly important for studying the genomic region of interest with high confidence. For regions with little to no coverage, researchers frequently increase the sequencing throughput of their studies; that is, they obtain more sequencing reads and data to increase coverage of a genetic region by brute force. However, this method is inefficient, increases costs, and does not address the underlying reasons for the poor coverage. Increasing throughput means that genomic regions that already had sufficient coverage become over-represented, and those extra reads are, in effect, wasted. Meanwhile, areas that had zero coverage may still have none, because simply sequencing more of the sample does not fix the cause.
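A rough way to see why brute-force throughput is inefficient is the standard estimate of mean coverage, C = N × L / G (number of reads × read length / genome size). The sketch below combines that estimate with a hypothetical per-region "relative efficiency" factor. The region names and efficiency values are invented for illustration and are not measurements.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Expected mean depth, C = N * L / G."""
    return num_reads * read_length / genome_size


def region_depths(mean_depth, relative_efficiency):
    """Scale the mean depth by a per-region efficiency factor (assumed values)."""
    return {name: mean_depth * eff for name, eff in relative_efficiency.items()}


# Hypothetical regions: one easy, one hard (e.g. GC-rich), one unmappable.
efficiency = {"well_covered": 1.5, "hard_region": 0.2, "unmappable": 0.0}

genome_size = 3_000_000_000  # approximate human genome size in bases
read_length = 150

for num_reads in (600_000_000, 1_200_000_000):
    c = mean_coverage(num_reads, read_length, genome_size)
    print(f"{num_reads:,} reads, mean depth {c:.0f}x:", region_depths(c, efficiency))
```

Doubling the reads doubles the depth everywhere in proportion: the already well-covered region gains the most reads, the hard region improves only slightly, and the unmappable region stays at zero.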
A more efficient way to address coverage is a targeted sequencing approach. With targeted sequencing, researchers can focus on just their regions of interest instead of sequencing the entire genome. This helps ensure sufficient coverage, including in parts of the genome that may not have been accessible previously, at lower sequencing cost.
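The cost argument for targeted sequencing can be illustrated with simple arithmetic: the number of reads needed scales with the size of the region being sequenced. In the sketch below, the panel size, depths, read length, and on-target fraction are all assumed example values, and `reads_required` is a hypothetical helper name.

```python
from math import ceil


def reads_required(target_depth, target_size, read_length, on_target_fraction=1.0):
    """Reads needed to reach `target_depth` over `target_size` bases.

    `on_target_fraction` is the assumed share of reads that land in the
    target; 1.0 is used for whole-genome sequencing in this sketch.
    """
    return ceil(target_depth * target_size / (read_length * on_target_fraction))


# Assumed example values, not figures from any specific product or study.
whole_genome = reads_required(30, 3_000_000_000, 150)
panel = reads_required(500, 50_000, 150, on_target_fraction=0.9)

print(f"whole genome at 30x: {whole_genome:,} reads")
print(f"50 kb panel at 500x: {panel:,} reads")
```

Even at a much higher depth, the small panel in this example needs far fewer reads than the whole genome, which leaves room to sequence the regions of interest deeply.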
Potential reasons for poor sequencing coverage and uniformity
Reason for poor coverage | Why it can affect coverage |
Sample quality | Degraded samples are more difficult to prepare and yield shorter sequencing reads. Shorter reads are harder to map to the correct region because they are less likely to be unique. |
Sample input | With too little starting material, there may not be enough sample to sequence, and the DNA recovered may not be representative of the entire genome. |
Homologous regions | Homologous regions have similar sequences, which makes it more difficult to map a read to the correct portion of the reference genome. |
Regions of low complexity | Low-complexity sequence reads may be mapped to the wrong part of the genome, resulting in coverage bias. |
Hypervariable regions | Because of the high number of variants, a sequencing read can look very different from the reference genome and may not be mapped correctly. |
GC content | Regions with a very high or very low percentage of guanine-cytosine (GC) nucleotides can be subject to sequencing bias. |
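As an illustration of the GC content row above, the GC percentage of a sequence can be computed per window to flag stretches that may be prone to bias. This is a minimal sketch: the function names and the window size are assumptions, and the choice of which GC values count as "extreme" is left to the reader.

```python
def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C.

    Ambiguous bases such as N are ignored; returns 0.0 if nothing is called.
    """
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(sequence, window):
    """GC fraction for consecutive, non-overlapping windows of `window` bases."""
    return [
        gc_fraction(sequence[i : i + window])
        for i in range(0, len(sequence), window)
    ]


# Toy sequence: a GC-only stretch followed by an AT-only stretch.
print(windowed_gc("GGCCGGCCAATTAATT", window=8))
```

Windows whose GC fraction sits near 0 or 1 are candidates for closer inspection when coverage in a region is unexpectedly low.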
For Research Use Only. Not for use in diagnostic procedures.