Modeling the daily fraction of a SARS-CoV-2 variant as a function of time in local regions using isotonic regression
Here we extract all regional data from GISAID that have a minimum of 10 sequences representing a variant of the virus, with at least 14 days of sampling. The sampling days do not have to be contiguous. The tables show all political/geographical regions that meet these criteria, whether or not the trend is significant. The daily fraction of a variant as a function of time is modeled using isotonic regression; the null hypothesis is that the fraction does not change over time. We then test the null against the alternative hypotheses that the fraction of the new variant is either increasing or decreasing. We randomize the data in each geographic region 400 times and refit the isotonic logistic regression to the randomized data, to evaluate whether a change in the frequency of a new mutation could be occurring by chance alone, or whether the frequency is significantly increasing (as shown in the first 3 tables and sets of plots) or decreasing (as shown in the last 3 tables and sets of plots). Because we perform 400 randomizations, the lowest p-value we can obtain is 0.0025 (1/400). If a mutation is increasing over one time period and decreasing over another, both trends can be significant. The "# days" column is the number of days with samples available, and the time window is the number of days spanned by the sampling.
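The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the tool: the function names are hypothetical, the daily counts are assumed to be binomial, the randomization is assumed to shuffle the daily (variant count, total count) pairs across sampling days, and the p-value is floored at one over the number of randomizations to match the 0.0025 limit stated above. The monotone fit uses the pool-adjacent-violators algorithm; because the log-odds is a monotone function of the fraction, a non-decreasing fit to the fractions is also non-decreasing on the log-odds scale.

```python
import math
import random


def pava_increasing(values, weights):
    """Weighted pool-adjacent-violators: best non-decreasing fit to values."""
    blocks = []  # each block holds [weighted mean, total weight, point count]
    for value, weight in zip(values, weights):
        blocks.append([value, weight, 1])
        # Merge backwards while neighbouring blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


def binom_loglik(k, n, p):
    """Binomial log-likelihood, dropping the constant binomial coefficients."""
    total = 0.0
    for ki, ni, pi in zip(k, n, p):
        if ki > 0:
            total += ki * math.log(pi)
        if ni - ki > 0:
            total += (ni - ki) * math.log(1.0 - pi)
    return total


def trend_statistic(k, n, direction="increasing"):
    """Log-likelihood gain of the monotone fit over the constant (null) fit."""
    fractions = [ki / ni for ki, ni in zip(k, n)]
    if direction == "decreasing":
        # A non-increasing fit is a non-decreasing fit of the reversed series.
        fit = pava_increasing(fractions[::-1], n[::-1])[::-1]
    else:
        fit = pava_increasing(fractions, n)
    pooled = sum(k) / sum(n)
    return binom_loglik(k, n, fit) - binom_loglik(k, n, [pooled] * len(k))


def permutation_pvalue(k, n, direction="increasing", n_perm=400, seed=0):
    """Assumed scheme: shuffle the daily (k, n) pairs across sampling days."""
    rng = random.Random(seed)
    observed = trend_statistic(k, n, direction)
    days = list(zip(k, n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(days)
        shuffled_k = [d[0] for d in days]
        shuffled_n = [d[1] for d in days]
        if trend_statistic(shuffled_k, shuffled_n, direction) >= observed - 1e-9:
            hits += 1
    # Floor at 1/n_perm so that 400 randomizations give at least 0.0025.
    return max(hits, 1) / n_perm


# Made-up example: 14 sampling days, 20 sequences per day, rising variant count.
variant_counts = [0, 1, 1, 3, 4, 6, 8, 9, 12, 14, 15, 18, 19, 20]
total_counts = [20] * 14
p_increasing = permutation_pvalue(variant_counts, total_counts, "increasing")
p_decreasing = permutation_pvalue(variant_counts, total_counts, "decreasing")
```

Running both directions on the same series mirrors the two sets of tables: a region can be tested for an increasing trend and for a decreasing trend independently.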
The accompanying plots show the increase in the new variant over time. The dot size is proportional to the number of sequences sampled that day, and the staircase line is the maximum likelihood estimate under the constraint that the logarithm of the odds ratio is non-decreasing. The dotted line is the overall fraction of the variant over the considered time window; it provides a baseline for "no change" in the fraction of the variant.
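The staircase and the dotted baseline can be written out explicitly. The following is a sketch that assumes a binomial likelihood for the daily counts and assumes the baseline pools all sequences in the time window; the symbols k_t (variant sequences on sampled day t), n_t (all sequences on that day), p_t (modeled variant fraction) and T (number of sampled days) are introduced here for illustration only.

```latex
\hat{p} = \arg\max_{p_1,\dots,p_T} \sum_{t=1}^{T}
  \left[ k_t \log p_t + (n_t - k_t)\log(1 - p_t) \right]
\quad \text{subject to} \quad
\log\frac{p_1}{1-p_1} \le \log\frac{p_2}{1-p_2} \le \dots \le \log\frac{p_T}{1-p_T},
\qquad
\bar{p} = \frac{\sum_{t=1}^{T} k_t}{\sum_{t=1}^{T} n_t}
```

Under these assumptions the staircase line corresponds to \hat{p} and the dotted line to \bar{p}. Because the log-odds is strictly increasing in p_t, the constraint is equivalent to p_1 \le \dots \le p_T, and the maximizer is piecewise constant, which gives the staircase shape.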
This code is by Nick Hengartner. Further descriptions of these analyses and plots accompany Fig. 3 in:
Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus.
Korber B, Fischer WM, Gnanakaran S, Yoon H, Theiler J, Abfalterer W, Hengartner N, Giorgi EE, Bhattacharya T, Foley B, Hastie KM, Parker MD, Partridge DG, Evans CM, Freeman TM, de Silva TI*, McDanal C, Perez LG, Tang H, Moon-Walker A, Whelan SP, LaBranche CC, Saphire EO, and Montefiori DC.
*on behalf of the Sheffield COVID-19 Genomics Group
Cell, June 2020
Lineage definitions
This tool lists CoV-2 lineages as defined by Pangolin (cov-lineages.org). The WHO Greek letter designations are in parentheses.
The "Correlated variant" feature can be used to track mutations that are part of a variant lineage.
As an example, one can use this tool to explore how often the E484K mutation is increasing or decreasing in the world at any geographic level based on all Spike backbones using just the top part of the tool, and with the default “Correlated variant” setting of “Do not consider”.
But one of the contexts in which the E484K mutation can be found is the B.1.1.7 variant Spike backbone; B.1.1.7 tends to increase in frequency once it has entered a population, and one can explore how this compares to E484K+B.1.1.7. This tool will identify all geographic locations in GISAID that have more than 10 examples of E484K+B.1.1.7, and will determine whether the fraction of E484K+B.1.1.7 is increasing or decreasing relative to other forms of B.1.1.7 over time in those populations.
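The "relative to other forms of B.1.1.7" comparison amounts to restricting the denominator to sequences on the chosen backbone before computing daily fractions. The sketch below illustrates that idea under stated assumptions: the record layout (dictionaries with "region", "date", "lineage" and "mutations" keys), the function name, and the parameter names are all hypothetical and do not reflect the GISAID format or the tool's code. The thresholds follow the text: more than 10 examples of the mutation on the backbone, and at least 14 sampling days.

```python
from collections import defaultdict


def daily_counts_within_backbone(records, mutation, backbone,
                                 min_variant=10, min_days=14):
    """Per-region daily counts of `mutation` among sequences of `backbone`.

    Returns {region: [(date, with_mutation, backbone_total), ...]} sorted by
    date, keeping only regions that pass both thresholds.
    """
    per_region = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for rec in records:
        if rec["lineage"] != backbone:
            continue  # only backbone sequences enter the denominator
        day = per_region[rec["region"]][rec["date"]]
        day[1] += 1
        if mutation in rec["mutations"]:
            day[0] += 1

    result = {}
    for region, days in per_region.items():
        total_with_mutation = sum(counts[0] for counts in days.values())
        if total_with_mutation > min_variant and len(days) >= min_days:
            # ISO-formatted date strings sort chronologically.
            result[region] = [(d, days[d][0], days[d][1]) for d in sorted(days)]
    return result


# Made-up records: region "A" has 15 sampling days, region "B" only 5.
records = []
for d in range(1, 16):
    date = f"2021-01-{d:02d}"
    records.append({"region": "A", "date": date, "lineage": "B.1.1.7",
                    "mutations": {"N501Y", "E484K"}})
    records.append({"region": "A", "date": date, "lineage": "B.1.1.7",
                    "mutations": {"N501Y"}})
    records.append({"region": "A", "date": date, "lineage": "B.1.351",
                    "mutations": {"E484K"}})  # other backbone, ignored
for d in range(1, 6):
    records.append({"region": "B", "date": f"2021-01-{d:02d}",
                    "lineage": "B.1.1.7", "mutations": {"E484K"}})

counts = daily_counts_within_backbone(records, "E484K", "B.1.1.7")
```

With the default "Do not consider" setting, the analogous computation would skip the lineage filter so that all Spike backbones contribute to the daily totals.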