The importance of aligning against the GRCh38 reference assembly for diagnostic accuracy


A few weeks ago, we became one of the first diagnostic companies to begin aligning against the GRCh38 (hg38) reference genome for exome and genome analysis. It was a substantial R&D effort, which we undertook to improve the accuracy of our analysis and which we expect to further increase diagnostic yield for patients. But why does it matter so much?

The GRCh38 assembly is noted for providing four significant improvements over GRCh37 and other earlier versions:

  • Inclusion of the mitochondrial genome
  • Sequence coverage of centromeres
  • General assembly updates
  • Better representation of variation

In a nutshell, GRCh38 fills in previous gaps in the reference sequence, fixes earlier errors and provides better coverage of alternative loci: regions of the genome where sequence varies so much across the population that more than one reference sequence is needed for reads carrying the different variants to align correctly.

Filling in gaps
The addition of millions of previously unmapped base pairs in GRCh38 provides sequence information for regions of the genome that were previously poorly covered. Most notably, this has improved coverage of centromeres and their surrounding regions. But filling in smaller gaps has also resulted in the addition of previously unrepresented genes and transcripts, improvement of gene models and removal of deprecated gene records.

Fixing earlier errors
A few misassembled areas are significantly restructured in GRCh38. But just as important are the many thousands of single nucleotide changes that have been made across the genome. These fixes help avoid the problem of variant callers reporting SNVs where they don't really exist, simply because the reference base itself was wrong, and so improve the annotation and analysis of clinically relevant genes.

Better coverage of alternative loci
GRCh38 better represents human diversity, placing a greater emphasis on alternative loci. When GRCh37 was released, it contained only 9 alternative loci covering just 3 regions of the genome. At its release, GRCh38 contained 261 alternative loci across 178 regions, numbers that have continued to increase in subsequent patch releases. The improved representation of diversity is important because it helps in assigning reads to their correct chromosomal location.
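To make this concrete: in UCSC-style hg38 files, alternative-locus scaffolds are conventionally named with an `_alt` suffix (for example `chr6_GL000250v2_alt`), which makes them easy to tell apart from the primary chromosomes. The Python sketch below groups contig names by that naming convention. It is an illustration only, not a description of any production pipeline; the function names `classify_contig` and `count_by_category` are hypothetical, and the suffix rules are an assumption that holds for UCSC-style names but not necessarily for other naming schemes.

```python
import re

# Primary chromosomes in UCSC-style naming: chr1..chr22, chrX, chrY, chrM.
_PRIMARY = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def classify_contig(name: str) -> str:
    """Return a coarse category for a UCSC-style hg38 contig name.

    Assumption: names follow the UCSC suffix conventions
    (_alt, _fix, _random, chrUn_). Other schemes need other rules.
    """
    if _PRIMARY.match(name):
        return "primary"
    if name.endswith("_alt"):
        return "alt"          # alternative-locus scaffold
    if name.endswith("_fix"):
        return "fix_patch"    # patch that corrects the primary assembly
    if name.endswith("_random"):
        return "unlocalized"  # chromosome known, exact position unknown
    if name.startswith("chrUn_"):
        return "unplaced"     # chromosome unknown
    return "other"


def count_by_category(names):
    """Count how many contig names fall into each category."""
    counts = {}
    for name in names:
        category = classify_contig(name)
        counts[category] = counts.get(category, 0) + 1
    return counts
```

Running `count_by_category` over the sequence names listed in a reference index or alignment header gives a quick view of how many alternative loci a given reference build actually carries.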

The cumulative impact of all of these changes is that GRCh38 provides better coverage of many clinically relevant genes. The changes also improve downstream annotation: hundreds of variants that are curated to GRCh38 coordinates in clinical annotation databases, but that cannot be translated to older GRCh37 (hg19) coordinates, are now properly represented. The result is greater analysis accuracy, which in turn produces an increase in diagnostic yield, the ultimate goal of all diagnostic test providers.
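Why can some coordinates not be translated between assemblies? Coordinate conversion tools such as UCSC liftOver work from chains of aligned blocks between two assemblies, and a position that falls in sequence present in only one assembly has no block to map through. The Python sketch below is a deliberately simplified model of that idea: the `Block` layout, the `lift_position` function and the `toy_blocks` values are all hypothetical, and real chain files also handle strand, multiple chromosomes and scoring.

```python
from typing import List, Optional, Tuple

# One ungapped aligned block: (source_start, source_end, target_start).
# The source interval is half-open: source_start <= pos < source_end.
Block = Tuple[int, int, int]


def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a source position to the target assembly, or return None.

    None means the position lies in a gap between aligned blocks,
    i.e. the sequence has no counterpart in the target assembly.
    """
    for src_start, src_end, tgt_start in blocks:
        if src_start <= pos < src_end:
            return tgt_start + (pos - src_start)
    return None


# Hypothetical blocks for illustration only; not real genome coordinates.
# Source positions 200..259 exist only in the source assembly.
toy_blocks: List[Block] = [(100, 200, 150), (260, 300, 250)]
```

In this toy model, a variant at source position 120 maps cleanly, while one at position 230 returns `None`: it sits in sequence that the other assembly simply does not contain, so any annotation attached to it is lost unless analysis is done in the newer coordinates.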

 
