Modeling the daily fraction of a SARS-CoV-2 variant as a function of time in local regions using isotonic regression
Here we extract all regional data from GISAID that have a minimum of 10 sequences representing a variant in the virus, with at least 14 days of sampling. The sampling days do not have to be contiguous. The tables show all political/geographical regions that meet these criteria, whether or not the change is significant. The daily fraction of a variant as a function of time is modeled using isotonic regression; the null hypothesis is that the fraction does not change over time. We then test the null against the alternative hypotheses that the fraction of the new variant is either increasing or decreasing. We randomize the data in each geographic region 400 times, and refit the isotonic logistic regression to the randomized data, to evaluate whether changes in the frequency of a new mutation could be occurring by chance alone, or whether the frequency is significantly increasing (as shown in the first 3 tables and sets of plots) or decreasing (as shown in the last 3 tables and sets of plots). Because we perform 400 randomizations, the lowest p-value we can obtain is 0.0025 (1/400). If a mutation is increasing over one time period and decreasing over another, both trends can be significant. The "# days" column is the number of days with samples available, and the time window is the number of days spanned by the sampling.
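The randomization test described above can be illustrated with a minimal Python sketch. This is not the code used to produce the tables. It assumes that the randomization shuffles variant labels across sequences while keeping each day's sample count fixed, that the test statistic is the log-likelihood gain of the isotonic fit over a constant fraction, and that the p-value is floored at 1/400 to match the minimum stated above. The function names are hypothetical.

```python
import math
import random

def pava_increasing(fractions, weights):
    """Weighted pool-adjacent-violators: non-decreasing fit to daily fractions."""
    blocks = []  # each block: [weighted mean, total weight, number of days pooled]
    for f, w in zip(fractions, weights):
        blocks.append([f, w, 1])
        # Pool neighbouring blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit

def binom_loglik(variant, total, probs):
    """Binomial log-likelihood (up to a constant) of daily variant counts."""
    ll = 0.0
    for k, n, p in zip(variant, total, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        ll += k * math.log(p) + (n - k) * math.log(1 - p)
    return ll

def lr_statistic(variant, total):
    """Assumed statistic: isotonic log-likelihood minus constant-fraction log-likelihood."""
    fractions = [k / n for k, n in zip(variant, total)]
    fit = pava_increasing(fractions, total)
    pooled = sum(variant) / sum(total)
    return (binom_loglik(variant, total, fit)
            - binom_loglik(variant, total, [pooled] * len(total)))

def permutation_p_value(variant, total, n_perm=400, seed=0):
    """Test for an increasing fraction; days must be in time order, each with total > 0."""
    rng = random.Random(seed)
    observed = lr_statistic(variant, total)
    n_variant = sum(variant)
    labels = [1] * n_variant + [0] * (sum(total) - n_variant)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # assumed scheme: shuffle labels, keep daily totals
        shuffled, start = [], 0
        for n in total:
            shuffled.append(sum(labels[start:start + n]))
            start += n
        if lr_statistic(shuffled, total) >= observed - 1e-9:
            hits += 1
    return max(hits, 1) / n_perm  # assumed floor of 1/n_perm
```

To test for a decreasing fraction with the same sketch, pass the daily counts in reverse time order, for example `permutation_p_value(variant[::-1], total[::-1])`.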
The accompanying plots show the increase in the new variant over time. The dot size is proportional to the number of sequences sampled that day, and the staircase line is the maximum likelihood estimate under the constraint that the logarithm of the odds ratio is non-decreasing.
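The constrained maximum likelihood estimate behind the staircase line can be written out as follows. The notation is introduced here for illustration and is not taken from the paper: on sampling day i (i = 1, ..., m, in time order), n_i sequences were sampled, k_i of them carried the variant, and p_i is the modeled variant fraction.

```latex
\hat{p} \;=\; \arg\max_{p_1,\dots,p_m}\;
  \sum_{i=1}^{m} \Bigl[\, k_i \log p_i + (n_i - k_i)\log(1 - p_i) \Bigr]
\quad \text{subject to} \quad
\log\frac{p_1}{1-p_1} \;\le\; \log\frac{p_2}{1-p_2} \;\le\; \dots \;\le\; \log\frac{p_m}{1-p_m}
```

Because the log-odds is a strictly increasing function of p_i, this constraint is equivalent to p_1 <= p_2 <= ... <= p_m. The maximizer is constant over runs of consecutive days, which is why the fitted line looks like a staircase.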
This code is by Nick Hengartner; further descriptions of these analyses and plots accompany Fig. 3 in:
Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus.
Korber B, Fischer WM, Gnanakaran S, Yoon H, Theiler J, Abfalterer W, Hengartner N, Giorgi EE, Bhattacharya T, Foley B, Hastie KM, Parker MD, Partridge DG, Evans CM, Freeman TM, de Silva TI*, McDanal C, Perez LG, Tang H, Moon-Walker A, Whelan SP, LaBranche CC, Saphire EO, and Montefiori DC.
*on behalf of the Sheffield COVID-19 Genomics Group
Cell, June 2020
The "correlated variant" feature can be used to enable tracking mutations that are part of a subclade.
The G clade is the dominant form in the SARS-CoV-2 pandemic as of the summer of 2020. It carries 4 nucleotide changes relative to the Wuhan form: C241T, C3037T, C14408T, A23403G.
Note: GISAID formally refers to an ancestral state of the G clade with just 3 base changes, as their definition of the G clade: C241T, C3037T, A23403G.
The mutation at C14408 was part of the set of 4 mutations that expanded together and now define the dominant G clade. A23403G encodes the D614G mutation.
The GR clade carries the G clade four base changes, plus 3 contiguous base changes: G28881A, G28882A and G28883C. The GR clade includes the S D614G mutation and the N G204R mutation.
The GH clade carries the G clade four base changes, plus the G25563T mutation. The GH clade includes the S D614G mutation and the NS3 (ORF3a) Q57H mutation.
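The clade definitions above can be expressed as sets of defining base changes, which is one way a set of correlated mutations could be tracked together. The following Python sketch is illustrative only and is not part of this website's code; the function name `clades_carried` and the input format (a collection of strings such as "A23403G") are assumptions.

```python
# Defining base changes relative to the Wuhan form, as listed above.
# GISAID's own G clade definition uses only C241T, C3037T and A23403G;
# the four-base definition from this page is used here.
G_CLADE = frozenset({"C241T", "C3037T", "C14408T", "A23403G"})

CLADE_DEFINITIONS = {
    "G": G_CLADE,
    "GR": G_CLADE | {"G28881A", "G28882A", "G28883C"},
    "GH": G_CLADE | {"G25563T"},
}

def clades_carried(mutations):
    """Return the sorted names of clades whose defining changes are all present."""
    observed = set(mutations)
    return sorted(name for name, required in CLADE_DEFINITIONS.items()
                  if required <= observed)
```

For example, a sequence carrying the four G clade changes plus G25563T matches both "G" and its subclade "GH", while a sequence lacking C14408T matches none of the three definitions as written here.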
GISAID data provided on this website is subject to GISAID's Terms and Conditions