COVID-19 Viral Genome Analysis Pipeline
Enabled by data from GISAID

Isotonic Regression

Last data update: Sep 18, 2020

See analysis on:
   Spike 614
   Spike 477
   Spike 477, excluding the original form with D614
   Spike 936
   Spike 936, excluding the original form with D614

Modeling the daily fraction of a SARS-CoV-2 variant as a function of time in local regions using isotonic regression

Here we extract all regional data from GISAID that have a minimum of 10 sequences representing a variant in the virus, with at least 14 days of sampling. The sampling days do not have to be contiguous. The tables show all political/geographical regions that meet these criteria, whether they are significant or not. The daily fraction of a variant as a function of time is modeled using isotonic regression; the null hypothesis is that the fraction does not change over time. We then test the null against the hypotheses that the fraction of the new variant is either increasing or decreasing. We randomize the data in each geographic region 400 times and refit the isotonic logistic regression to the randomized data, to evaluate whether changes in the frequency of a new mutation could be occurring by chance alone, or whether the frequency is significantly increasing (as shown in the first 3 tables and sets of plots) or decreasing (as shown in the last 3 tables and sets of plots). Because we perform 400 randomizations, the lowest p-value we can obtain is 1/400 = 0.0025. If a mutation is increasing over one time period and decreasing over another, both trends can be significant. The "# days" column is the number of days with samples available, and the time window is the number of days spanned by the sampling.
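The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline's own code: the function names are hypothetical, the daily counts are treated as binomial, the randomization is assumed to shuffle the daily (variant count, total) pairs across sampling days, and the p-value is floored at 1/400 to match the minimum quoted above. The monotone fit uses the pool-adjacent-violators algorithm, which gives the maximum likelihood estimate for binomial data under a non-decreasing constraint.

```python
import math
import random

def meets_criteria(variant_counts, totals, min_variant=10, min_days=14):
    """One reading of the inclusion rule: >= 10 variant sequences, >= 14 sampled days."""
    sampled_days = sum(1 for n in totals if n > 0)
    return sum(variant_counts) >= min_variant and sampled_days >= min_days

def pava_nondecreasing(variant_counts, totals):
    """Pool-adjacent-violators: non-decreasing fitted fraction for each sampled day."""
    blocks = []  # each block holds [variant count, total, number of days pooled]
    for k, n in zip(variant_counts, totals):
        blocks.append([k, n, 1])
        # Pool while the previous block's fraction exceeds the last block's fraction.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            k2, n2, d2 = blocks.pop()
            blocks[-1][0] += k2
            blocks[-1][1] += n2
            blocks[-1][2] += d2
    fitted = []
    for k, n, d in blocks:
        fitted.extend([k / n] * d)
    return fitted

def binom_loglik(variant_counts, totals, probs):
    """Binomial log-likelihood, dropping the constant binomial coefficients."""
    ll = 0.0
    for k, n, p in zip(variant_counts, totals, probs):
        if k > 0:
            ll += k * math.log(p)
        if n - k > 0:
            ll += (n - k) * math.log(1.0 - p)
    return ll

def monotone_stat(variant_counts, totals):
    """Likelihood-ratio statistic: monotone fit versus a constant fraction (the null)."""
    fitted = pava_nondecreasing(variant_counts, totals)
    pooled = sum(variant_counts) / sum(totals)
    null = [pooled] * len(totals)
    return 2.0 * (binom_loglik(variant_counts, totals, fitted)
                  - binom_loglik(variant_counts, totals, null))

def permutation_pvalue(variant_counts, totals, increasing=True, n_perm=400, seed=0):
    """Randomization p-value; assumes shuffling day order is the randomization scheme."""
    days = list(zip(variant_counts, totals))
    if not increasing:
        days.reverse()  # a decreasing trend is an increasing trend in reversed time
    observed = monotone_stat(*zip(*days))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = days[:]
        rng.shuffle(shuffled)
        if monotone_stat(*zip(*shuffled)) >= observed:
            hits += 1
    return max(hits, 1) / n_perm  # floor at 1/n_perm (0.0025 for 400 randomizations)

# Made-up example: 14 sampling days, 10 sequences per day, variant fraction rising.
counts = [0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10]
totals = [10] * 14
p_increasing = permutation_pvalue(counts, totals, increasing=True)
p_decreasing = permutation_pvalue(counts, totals, increasing=False)
```

Testing both directions on the same region is what allows a mutation to be reported as significantly increasing over one period and significantly decreasing over another.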

The accompanying plots show the increase in the new variant over time. The dot size is proportional to the number of sequences sampled that day, and the staircase line is the maximum likelihood estimate under the constraint that the logarithm of the odds ratio is non-decreasing.
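Written out, the staircase fit for the increasing case can be sketched as the following constrained maximization. The notation is an assumption introduced here for illustration: $k_t$ is the number of variant sequences on sampling day $t$, $n_t$ the total number of sequences that day, and $p_t$ the modeled variant fraction, with the daily counts treated as binomial.

```latex
\hat{p} = \arg\max_{p_1,\dots,p_T} \sum_{t=1}^{T}
  \left[ k_t \log p_t + (n_t - k_t) \log(1 - p_t) \right]
\quad \text{subject to} \quad
\log\frac{p_1}{1-p_1} \le \log\frac{p_2}{1-p_2} \le \dots \le \log\frac{p_T}{1-p_T}
```

Because the log odds are a strictly increasing function of $p_t$, this constraint is equivalent to $p_1 \le p_2 \le \dots \le p_T$, which is why the fitted line is a non-decreasing staircase. The decreasing case reverses the inequalities.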

This code is by Nick Hengartner. Further descriptions of these analyses and plots accompany Fig. 3 in:

Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus.
Korber B, Fischer WM, Gnanakaran S, Yoon H, Theiler J, Abfalterer W, Hengartner N, Giorgi EE, Bhattacharya T, Foley B, Hastie KM, Parker MD, Partridge DG, Evans CM, Freeman TM, de Silva TI*, McDanal C, Perez LG, Tang H, Moon-Walker A, Whelan SP, LaBranche CC, Saphire EO, and Montefiori DC.
*on behalf of the Sheffield COVID-19 Genomics Group
Cell, June 2020

Correlated variant

The "correlated variant" feature can be used to track mutations that are part of a subclade.

For example, the GR and GH clades are sub-lineages of the G clade (the G clade carries 4 mutations and includes the D614G mutation). To track changes in GR or GH frequencies, using the subset of sequences that carries D614G enables an exploration of how the GR and GH clades are changing within the context of the G614 clade.
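As a rough illustration of this filter, the sketch below restricts the sequences to those carrying a given amino acid at a correlated site before counting the variant of interest per day. The record layout, the site keys such as "S:614", and the function name are hypothetical and chosen only for this example; they are not the pipeline's data format.

```python
from collections import defaultdict

def daily_counts(records, site, mutant_aa, include=None, exclude=None):
    """Return {date: (mutant_count, total)} after the correlated-variant filter.

    records: iterable of dicts such as
        {"date": "2020-03-01", "aa": {"S:614": "G", "N:204": "R"}}
    include / exclude: optional (site, amino_acid) pairs.
    """
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        aa = rec["aa"]
        # "Include only sequences with": drop sequences lacking the correlated form.
        if include is not None and aa.get(include[0]) != include[1]:
            continue
        # "Exclude all sequences with": drop sequences carrying the correlated form.
        if exclude is not None and aa.get(exclude[0]) == exclude[1]:
            continue
        day = counts[rec["date"]]
        day[1] += 1
        if aa.get(site) == mutant_aa:
            day[0] += 1
    return {date: tuple(c) for date, c in sorted(counts.items())}

# Made-up records: track N G204R (a GR marker) within the G614 background.
records = [
    {"date": "2020-03-01", "aa": {"S:614": "D", "N:204": "G"}},
    {"date": "2020-03-01", "aa": {"S:614": "G", "N:204": "G"}},
    {"date": "2020-03-01", "aa": {"S:614": "G", "N:204": "R"}},
    {"date": "2020-03-02", "aa": {"S:614": "G", "N:204": "R"}},
]
all_sequences = daily_counts(records, "N:204", "R")
within_g614 = daily_counts(records, "N:204", "R", include=("S:614", "G"))
```

In this toy data, filtering to G614 removes the one D614 sequence from the first day's denominator, so the GR fraction is measured against the G clade rather than against all sequences.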

The G clade is the dominant form of SARS-CoV-2 in the pandemic as of summer 2020. It carries 4 nucleotide changes relative to the Wuhan form: C241T, C3037T, C14408T, A23403G.
Note: GISAID formally defines the G clade by an ancestral state with just 3 base changes: C241T, C3037T, A23403G.
The C14408T mutation was part of the set of 4 mutations that expanded together and now form the dominant G clade. A23403G encodes the D614G mutation.

The GR clade carries the G clade four base changes, plus 3 contiguous base changes: G28881A, G28882A and G28883C. The GR clade includes the S D614G mutation and the N G204R mutation.

The GH clade carries the G clade four base changes, plus the G25563T mutation. The GH clade includes the S D614G mutation and the NS3 (ORF3a) Q57H mutation.
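The clade definitions above can be expressed as set-membership checks on a sequence's nucleotide changes. The sketch below is illustrative only: the function name and the input format (a collection of change strings such as "A23403G") are assumptions, and it uses the four-mutation G clade described on this page rather than GISAID's three-mutation definition.

```python
# Nucleotide changes relative to the Wuhan form, as listed above.
G_CLADE = {"C241T", "C3037T", "C14408T", "A23403G"}
GR_EXTRA = {"G28881A", "G28882A", "G28883C"}
GH_EXTRA = {"G25563T"}

def clade_labels(mutations):
    """Return the clade labels (G, GR, GH) whose defining changes are all present."""
    muts = set(mutations)
    labels = []
    if G_CLADE <= muts:  # all four G clade changes are required
        labels.append("G")
        if GR_EXTRA <= muts:
            labels.append("GR")
        if GH_EXTRA <= muts:
            labels.append("GH")
    return labels

# Made-up example sequences.
gr_example = clade_labels(G_CLADE | GR_EXTRA)
gh_example = clade_labels(G_CLADE | GH_EXTRA | {"C1059T"})
partial_example = clade_labels({"C241T", "C3037T", "A23403G"})
```

Under this page's four-mutation definition, a sequence with only the three changes in GISAID's definition is not labeled G by this sketch.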

last modified: Thu Sep 17 08:33 2020

GISAID data provided on this website is subject to GISAID's Terms and Conditions

Questions or comments? Contact us at

Operated by Triad National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration
© Copyright Triad National Security, LLC. All Rights Reserved | Disclaimer/Privacy
