COVID-19 Viral Genome Analysis Pipeline
Enabled by data from GISAID


XSPIKE Explanation

XSPIKE: eXplore the SPIKE protein sequence in the SARS-CoV-2 virus

As with SHIVER and Ember, xspike strips out of the input alignment every column in which a "-" appears in the reference (Wuhan) sequence. This means the n'th character of a string representing a sequence will correspond to site number n in the sequence. It also means that if a new variant involves insertions relative to the reference, those insertions will be lost in the analysis.
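The column-stripping step can be illustrated with a minimal sketch. This is not the xspike source; the function name strip_reference_gaps and the toy sequences are made up for illustration.

```python
def strip_reference_gaps(reference, sequences):
    """Drop every alignment column where the reference has a '-'."""
    keep = [i for i, ch in enumerate(reference) if ch != "-"]

    def squeeze(seq):
        return "".join(seq[i] for i in keep)

    return squeeze(reference), [squeeze(s) for s in sequences]


# Toy alignment: the second sequence carries an insertion ('A') at a
# position where the reference has a gap.
ref = "MF-VFL"
seqs = ["MF-VFL", "MFAVFL", "MF-IFL"]
new_ref, new_seqs = strip_reference_gaps(ref, seqs)

# After stripping, character n (1-based) of each string is site n,
# and the inserted 'A' is no longer visible.
print(new_ref)   # MFVFL
print(new_seqs)  # ['MFVFL', 'MFVFL', 'MFIFL']
```

Note how the second sequence becomes indistinguishable from the reference, which is exactly the loss of insertion information described above.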

The analysis has three parts:

1. Highest entropy sites

It identifies the top 50 sites that have the highest entropy. A table will be output with two columns: site number, and entropy.
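As a rough sketch of this step, the per-site Shannon entropy can be computed from the amino-acid frequencies in each column. The details here are assumptions: the function names are hypothetical, natural logarithms are used, and every character (including gaps or ambiguity codes) is counted, which may differ from what xspike actually does.

```python
from collections import Counter
from math import log


def site_entropy(column):
    """Shannon entropy (natural log) of the characters in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(-(c / total) * log(c / total) for c in counts.values())


def top_entropy_sites(sequences, n_top=50):
    """Return (site number, entropy) for the n_top highest-entropy sites,
    listed in order of site number (1-based)."""
    length = len(sequences[0])
    entropies = [
        (site + 1, site_entropy([s[site] for s in sequences]))
        for site in range(length)
    ]
    ranked = sorted(entropies, key=lambda t: t[1], reverse=True)[:n_top]
    return sorted(ranked)


toy = ["MFVFL", "MFVFL", "MFIFL", "MYIFL"]
top = top_entropy_sites(toy, n_top=2)
for site, h in top:
    print(site, round(h, 3))
```

In the toy data only sites 2 and 3 vary, so they are the two sites selected.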

The primary analysis is based on the amino acids in these high-entropy sites, which will filter out the rare variants to enable a focus on the most common variant forms of a lineage.

If the "Make plot" option is specified, then a plot of entropy vs site number is also created. The plot is in blue, with red vertical lines corresponding to the sites that were chosen as top-entropy sites.

2. Pairwise correlations (optional)

It computes pairwise correlations among those sites. This second step is optional, and (especially if the number of sites is large) can be time-consuming. To invoke this, use the "Analyze pairwise correlations" option. It looks at every pair of high-entropy sites, and computes two different measures of correlation: mutual information, and Cramer's V.

(Note that the Cramer's V statistic is sometimes over-sensitive to small counts; there is a default threshold=3 to avoid this problem.)

xspike writes out the site pairs that correspond to highest correlation, according to those two measures.
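The two measures can be sketched for a single pair of sites as follows. These are the textbook definitions of mutual information and Cramer's V, not necessarily the exact formulation in xspike. The function names are hypothetical, and the small-count threshold mentioned above is not implemented because the text does not say how it is applied.

```python
from collections import Counter
from math import log, sqrt


def mutual_information(xs, ys):
    """Mutual information (natural log) between two aligned columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def cramers_v(xs, ys):
    """Cramer's V from the chi-squared statistic of the contingency table."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    k = min(len(px), len(py)) - 1
    if k == 0:
        return 0.0  # one of the sites does not vary
    chi2 = 0.0
    for a in px:
        for b in py:
            expected = px[a] * py[b] / n
            chi2 += (pxy[(a, b)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Two perfectly linked sites, and two independent ones.
linked = ("DDGGGG", "AAVVVV")
independent = ("DDGG", "AVAV")
print(mutual_information(*linked), cramers_v(*linked))
print(mutual_information(*independent), cramers_v(*independent))
```

For perfectly linked sites Cramer's V is 1 and the mutual information equals the entropy of either site; for independent sites both measures are 0.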

If "Make plot" is specified, then circle plots and heat maps are made to show these correlations.

3. Variants and continent-wise counts

It creates a digest of variants and a continent-wise count for each of those variants. A "variant" in the context of xspike is a pattern of M characters where M is the number of high-entropy sites obtained in the first step. The most commonly occurring patterns are listed first.
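A minimal sketch of how such patterns might be extracted and ranked is shown below. The function name pattern_counts and the toy data are illustrative, and the site numbers are assumed to be 1-based positions in the gap-stripped sequences.

```python
from collections import Counter


def pattern_counts(sequences, sites):
    """Reduce each sequence to its characters at the given 1-based sites
    and count the resulting patterns, most common first."""
    patterns = ("".join(seq[s - 1] for s in sites) for seq in sequences)
    return Counter(patterns).most_common()


toy = ["MFVFL", "MFVFL", "MFIFL", "MYIFL"]
for pattern, count in pattern_counts(toy, sites=[2, 3]):
    print(pattern, count)
```

With two high-entropy sites (M = 2), each sequence collapses to a two-character pattern, and "FV" is listed first because it occurs twice.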

Note that the first few lines in the header, before the patterns are listed, will be digits, such as:
   12456667
67841801181
90045414516
HVDYDENDVPT
These digits specify the site number for each column, and the numbers are read vertically downward; so in the above example, the first column corresponds to site 69, and the last column to site 716.
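Reading the header vertically can be sketched in a few lines. The function name decode_site_header is hypothetical; the input is the three digit rows from the example above, and the letter row is simply paired with the decoded site numbers.

```python
def decode_site_header(header_lines):
    """Read site numbers vertically from the digit rows of the header."""
    width = max(len(line) for line in header_lines)
    padded = [line.ljust(width) for line in header_lines]
    sites = []
    for column in zip(*padded):
        digits = "".join(column).strip()  # leading blanks pad short numbers
        sites.append(int(digits))
    return sites


header = [
    "   12456667",
    "67841801181",
    "90045414516",
]
letters = "HVDYDENDVPT"

sites = decode_site_header(header)
labels = [f"{aa}{site}" for aa, site in zip(letters, sites)]
print(sites)
print(labels)
```

The first decoded site is 69 and the last is 716, matching the description above.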

The table of counts that accompanies these patterns shows how many times each pattern has appeared in the dataset, and these counts are furthermore broken out by continent.

Note that if xspike is run with a geographical region (for example, with "USA.California"), then the "Local" column contains the totals for California. In this case only California sequences will be used to define high-entropy sites, and the order of appearance for the patterns will be based on the counts in the Local column. However, regardless of specified geographic regions, the totals for the continents will be the full totals based on the full dataset.
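The relationship between the continent columns and the Local column can be sketched as below. Everything about the data layout is an assumption made for illustration: records are taken to be (location, pattern) pairs, locations are hypothetical dotted names whose first field is the continent (such as "NAmer.USA.California"), and the region filter is a plain substring match.

```python
from collections import Counter, defaultdict


def tally(records, local_filter=None):
    """Count patterns globally, per continent, and (optionally) locally.

    Rows are ordered by the Local count when a region is given,
    otherwise by the Global count.
    """
    table = defaultdict(Counter)
    for location, pattern in records:
        continent = location.split(".")[0]
        table[pattern]["Global"] += 1
        table[pattern][continent] += 1
        if local_filter and local_filter in location:
            table[pattern]["Local"] += 1
    sort_key = "Local" if local_filter else "Global"
    return sorted(table.items(), key=lambda kv: kv[1][sort_key], reverse=True)


records = [
    ("NAmer.USA.California", "FV"),
    ("NAmer.USA.Texas", "FI"),
    ("Asia.India.Delhi", "FI"),
    ("NAmer.USA.California", "FV"),
    ("Asia.Japan.Tokyo", "FI"),
]

for pattern, counts in tally(records, local_filter="USA.California"):
    print(pattern, counts["Global"], counts["NAmer"], counts["Asia"], counts["Local"])
```

With the California filter, "FV" is listed first even though "FI" is more common globally, while the Global and continent counts still reflect the full dataset.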

The context column identifies the most common /full/ spike sequence associated with the given pattern, and expresses that sequence as a mutation string listing the sites at which it differs from the full reference sequence. This includes sites that may not be among the high-entropy sites used to define the xspike patterns. The result reconstructs the most common form of the natural sequence that carries the pattern, which may be useful for reagent design.
Global     UK  Eu-UK  NAmer   Asia Africa  SAmer  Ocean  Local  Exact  Pct [Context]
The number of times the exact form of the /full/ Spike is observed in the local geographic region selected for the run is noted in the "Exact" column, and the percentage of sequences carrying the pattern of interest that have this exact most common /full/ Spike is noted in the "Pct" column.
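One way to read the Context, Exact, and Pct columns is sketched below. It assumes that Pct is the share of pattern-carrying sequences that match the most common full sequence exactly, and it ignores the restriction to a local region. The function name and the mutation-string format (reference residue, site, variant residue) are illustrative, and the sketch expects at least one sequence to carry the pattern.

```python
from collections import Counter


def context_for_pattern(reference, sequences, sites, pattern):
    """Return (mutation string, exact count, percent) for one pattern."""
    carriers = [
        s for s in sequences if "".join(s[i - 1] for i in sites) == pattern
    ]
    # Most common full sequence among those that carry the pattern.
    full_seq, exact = Counter(carriers).most_common(1)[0]
    mutations = [
        f"{r}{n}{v}"
        for n, (r, v) in enumerate(zip(reference, full_seq), start=1)
        if r != v
    ]
    pct = 100.0 * exact / len(carriers)
    return ",".join(mutations), exact, pct


reference = "MFVFL"
sequences = ["MFIFL", "MFIFL", "MFIFV", "MFVFL"]
context, exact, pct = context_for_pattern(reference, sequences, [3], "I")
print(context, exact, round(pct, 1))
```

Here three sequences carry "I" at site 3, two of them share the same full sequence, and that sequence differs from the reference only by V3I.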

The output of Step 3 can be quite large. A variant is included as long as it appears at least two times in the dataset.


NOTE: We are NOT tracking insertions in Spike sequences in this output; insertions are still very rare, but are found on occasion.

In particular, we have found them associated with a few rare Pango lineages including:
B.1.621 T95I, insert144T, Y144S, Y145N, R346K, E484K, N501Y, D614G, P681H, D950N
A.2.5.2 del141-143, insert215AGG, D215Y, L452R, D614G
AT.1 P9L, del136-144, D215G, H245P, E484K, D614G, N679K, insert679GIAL, E780K
B.1.214.2 insert214TDR, Q414K, N450K, D614G, T716I

Here "insert" indicates an insertion at the given position, followed by the list of amino acids added, and "del" indicates a deletion.
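Mutation strings in this notation are easy to parse mechanically. The following sketch is not part of xspike; the function name, the tuple layout of the result, and the acceptance of a single-site form such as "del144" are assumptions made for illustration.

```python
import re

TOKEN = re.compile(
    r"^(?:insert(?P<ins_site>\d+)(?P<ins_aas>[A-Z]+)"
    r"|del(?P<del_start>\d+)(?:-(?P<del_end>\d+))?"
    r"|(?P<ref>[A-Z])(?P<site>\d+)(?P<alt>[A-Z]))$"
)


def parse_mutations(text):
    """Split a comma-separated mutation string into typed events."""
    events = []
    for token in (t.strip() for t in text.split(",")):
        m = TOKEN.match(token)
        if m is None:
            raise ValueError(f"unrecognized mutation token: {token!r}")
        if m["ins_site"]:
            events.append(("insert", int(m["ins_site"]), m["ins_aas"]))
        elif m["del_start"]:
            start = int(m["del_start"])
            end = int(m["del_end"] or start)
            events.append(("del", start, end))
        else:
            events.append(("sub", int(m["site"]), m["ref"] + m["alt"]))
    return events


at1 = "P9L, del136-144, D215G, H245P, E484K, D614G, N679K, insert679GIAL, E780K"
events = parse_mutations(at1)
for event in events:
    print(event)
```

For the AT.1 string above, this yields seven substitutions, one deletion spanning sites 136 to 144, and one insertion of GIAL at position 679.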

last modified: Wed Jul 14 06:32 2021



GISAID data provided on this website is subject to GISAID's Terms and Conditions

Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Triad National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration
© Copyright Triad National Security, LLC. All Rights Reserved | Disclaimer/Privacy
