by Rose Miyatsu
June 21, 2022
Introduction:
Read more here: https://news.ucsc.edu/2022/06/10-mill ... nces.html
(UC Santa Cruz) Ten million sequences of COVID-19’s genomic code have now been organized into a phylogenetic tree in the UC Santa Cruz SARS-CoV-2 Browser, making it the largest tree of genomic sequences of a single species ever assembled. This accomplishment is impressive for both the computer engineering feat of processing such a massive amount of data and the incredible dedication and coordination of the researchers involved.
“It is an astounding thing that has happened there,” said Clay Fischer, Project Manager for the UCSC Genome Browser.
All of these sequences are assembled by the researchers into a phylogenetic tree that shows the evolutionary history of the virus, with different branches representing the lineages that have mutated throughout the pandemic. This tree is powered by a software tool called UShER that was developed at the UC Santa Cruz Genomics Institute and is hosted on the UCSC Genome Browser website.
Many hands from around the world have brought the Genomics Institute these 10 million sequences that live on the UShER tree. Clinicians worldwide have administered tests and sent them to local labs, which then sent the samples on for sequencing. Once sequenced, the samples become digital files that are uploaded to databases for genomic information such as GISAID, GenBank, or the COG-UK database.
Angie Hinrichs, a senior software architect at the UCSC Genome Browser and self-described “data wrangler,” built a pipeline to pull these sequences into the UShER tree automatically. The process was complicated, however, because some databases, like GISAID, had restrictions that necessitated the manual download of sequences.
