XSPIKE: eXplore the SPIKE protein sequence in the SARS CoV-2 virus
As with SHIVER and Ember, xspike strips out the columns of the input
alignment in which a "-" appears in the reference (Wuhan) sequence.
This means the n'th character of a string representing a sequence will
correspond to site number n in the sequence. It also means that if a
new variant involves insertions, those insertions will be lost in the
analysis.
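The column stripping described above can be sketched as follows. This
is a minimal illustration, not xspike's actual code: the function name
strip_reference_gaps and the toy sequences are hypothetical.

```python
def strip_reference_gaps(ref, seqs):
    """Drop every alignment column in which the reference has a '-'."""
    # Indices of the columns to keep (reference is not a gap there)
    keep = [i for i, c in enumerate(ref) if c != "-"]

    def squeeze(s):
        return "".join(s[i] for i in keep)

    return squeeze(ref), [squeeze(s) for s in seqs]


# Toy alignment: the first sequence carries an insertion ("A") that
# lines up with a gap in the reference.
ref = "MF-VFL"
seqs = ["MFAVFL", "MF-VSL"]
new_ref, new_seqs = strip_reference_gaps(ref, seqs)

# After stripping, site n is simply character n of each string
# (index n-1 in Python), and the inserted "A" is gone.
print(new_ref)   # MFVFL
print(new_seqs)  # ['MFVFL', 'MFVSL']
```

Note how the insertion in the first toy sequence disappears, which is
the loss of information mentioned above.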
The analysis has three parts:
1. Highest entropy sites
It identifies the top 50 sites that have the highest entropy.
A table will be output with two columns: site number, and entropy.
The primary analysis is based on the amino acids at these high-entropy
sites; restricting attention to them filters out the rare variants and
enables a focus on the most common variant forms of a lineage.
If the "Make plot" option is specified, then a plot of entropy vs site
number is also created. The plot is in blue, with red vertical lines
corresponding to the sites that were chosen as top-entropy sites.
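As a rough sketch of this first step, the following computes a
per-site Shannon entropy and keeps the highest-scoring sites. The
function names, the use of bits (log base 2), and the ordering of the
output table by site number are assumptions made for illustration;
xspike's own conventions may differ.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the characters in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values())


def top_entropy_sites(seqs, n_top=50):
    """Return [(site, entropy)] for the n_top highest-entropy sites.

    Sites are numbered from 1; the result is ordered by site number.
    """
    scored = [(site, site_entropy(col))
              for site, col in enumerate(zip(*seqs), start=1)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return sorted(scored[:n_top])


# Toy data: only sites 2 and 4 vary.
seqs = ["MFVFL", "MFVSL", "MFVSL", "MYVFL"]
table = top_entropy_sites(seqs, n_top=2)
for site, entropy in table:
    print(f"{site:5d} {entropy:.4f}")
```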
2. Pairwise correlations (optional)
It computes pairwise correlations among those sites.
This second step is optional, and (especially if the number of sites
is large) can be time-consuming. To invoke it, use the "Analyze
pairwise correlations" option. It looks at every pair of high-entropy
sites, and computes two different measures of correlation: mutual
information, and Cramer's V.
(Note that the Cramer's V statistic is sometimes over-sensitive to small
counts; there is a default threshold=3 to avoid this problem.)
xspike writes out the site pairs that correspond to the highest
correlation, according to those two measures.
If "Make plot" is specified, then circle plots and heat maps are made to
show these correlations.
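For readers unfamiliar with the two measures, here is a minimal
textbook-style sketch of mutual information and Cramer's V for a
single pair of sites. The function names are hypothetical, the
small-count threshold mentioned above is not modeled, and nothing here
is claimed to match xspike's implementation.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length columns."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def cramers_v(xs, ys):
    """Cramer's V, from the chi-squared statistic of the pair table."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    k = min(len(px), len(py))
    if k < 2:
        return 0.0  # a constant column shows no association
    chi2 = 0.0
    for x in px:
        for y in py:
            expected = px[x] * py[y] / n
            chi2 += (pxy.get((x, y), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Two perfectly linked toy columns, then two independent ones.
linked = ("AAAATTTT", "GGGGCCCC")
independent = ("AATT", "GCGC")
print(mutual_information(*linked), cramers_v(*linked))
print(mutual_information(*independent), cramers_v(*independent))
```

Both measures are zero for independent columns; for the perfectly
linked two-letter columns above, mutual information is 1 bit and
Cramer's V is 1.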
3. Variants and continent-wise counts
It creates a digest of variants and a continent-wise count
for each of those variants.
A "variant" in the context of xspike is a pattern of M characters
where M is the number of high-entropy sites obtained in the first
step. The most commonly occurring patterns are listed first.
Note that the first few lines in the header, before the patterns are
listed, will be digits, such as:
These digits are to specify the site number for each column, and the
numbers are read vertically downward; so in the above example, the
first column corresponds to site 69, and the last column to site 716.
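The vertical reading of the header digits can be illustrated with a
short sketch. The site list is illustrative only (69 and 716 come from
the text above; 144 is made up), and the right-justified padding with
spaces is an assumption about the layout.

```python
def vertical_header(sites):
    """Render site numbers as rows of digits, one column per site."""
    width = max(len(str(s)) for s in sites)
    padded = [str(s).rjust(width) for s in sites]
    return ["".join(p[row] for p in padded) for row in range(width)]


def read_vertical_header(rows):
    """Recover the site numbers by reading each column downward."""
    return [int("".join(col).strip()) for col in zip(*rows)]


rows = vertical_header([69, 144, 716])
for row in rows:
    print(row)
#  17
# 641
# 946

print(read_vertical_header(rows))  # [69, 144, 716]
```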
The table of counts that accompanies these patterns shows how many
times each pattern has appeared in the dataset, and these counts are
furthermore broken out by continent.
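A minimal sketch of how such a pattern table could be tallied is shown
below. The record layout (continent, sequence), the function name, and
the toy data are all assumptions for illustration, not xspike's
internal representation.

```python
from collections import Counter, defaultdict


def pattern_counts(records, sites):
    """Count patterns overall and per continent.

    records: iterable of (continent, sequence) pairs
    sites:   1-based site numbers that define the pattern
    """
    total = Counter()
    by_continent = defaultdict(Counter)
    for continent, seq in records:
        pattern = "".join(seq[s - 1] for s in sites)
        total[pattern] += 1
        by_continent[pattern][continent] += 1
    return total, by_continent


records = [
    ("Asia", "MFVFL"),
    ("Asia", "MFVSL"),
    ("Africa", "MFVSL"),
    ("Ocean", "MYVFL"),
    ("Asia", "MFVSL"),
]
total, by_continent = pattern_counts(records, sites=[2, 4])

# Most common patterns first, with the per-continent breakdown
for pattern, n in total.most_common():
    print(pattern, n, dict(by_continent[pattern]))
```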
Note that if xspike is run with a geographical region (for example,
"USA.California"), then the "Local" column contains the totals for
that region, here California. In this case only California sequences
will be used to define high-entropy sites, and the order of appearance
for the patterns will be based on the counts in the Local column.
However, regardless of the specified geographic region, the totals for
the continents will be the full totals based on the full dataset.
The context column identifies the most common /full/ spike sequence
associated with the given pattern, and then expresses that sequence as
a mutation string that identifies the sites at which that full
sequence differs from the full reference sequence. This includes sites
that may not be among the high-entropy sites used to define the xspike
patterns. The mutation string thus reconstructs the most common form
of the natural sequence that carries the pattern, which may be useful
for reagent design.
Global UK Eu-UK NAmer Asia Africa SAmer Ocean Local Exact Pct [Context]
The number of times the exact form of the /full/ Spike is observed in
the local geographic region selected for the run is noted in the
"Exact" column, and the percentage of sequences carrying the pattern
of interest that match this exact most common /full/ Spike is noted in
the "Pct" column.
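The relationship between the Context, Exact, and Pct columns can be
sketched as follows. This is an illustration under stated assumptions:
the function names are hypothetical, the mutation-string format
(reference residue, site number, observed residue, comma-separated) is
a guess at the layout, and the input list is assumed to be already
restricted to the local region.

```python
from collections import Counter


def mutation_string(ref, seq):
    """List the sites where seq differs from ref, e.g. 'F4S'."""
    return ",".join(f"{r}{i}{s}"
                    for i, (r, s) in enumerate(zip(ref, seq), start=1)
                    if r != s)


def context_for_pattern(ref, seqs, sites, pattern):
    """Return (context, exact, pct) for one pattern.

    context: mutation string of the most common full sequence
    exact:   how many times that exact full sequence occurs
    pct:     exact as a percentage of sequences carrying the pattern
    """
    matching = [s for s in seqs
                if "".join(s[i - 1] for i in sites) == pattern]
    if not matching:
        return "", 0, 0.0
    full_seq, exact = Counter(matching).most_common(1)[0]
    pct = 100.0 * exact / len(matching)
    return mutation_string(ref, full_seq), exact, pct


ref = "MFVFL"
seqs = ["MFVSL", "MFVSL", "MFVSI", "MFVFL"]
context, exact, pct = context_for_pattern(ref, seqs, sites=[4],
                                          pattern="S")
print(context, exact, f"{pct:.1f}")
```

In the toy data, three sequences carry the pattern "S" at site 4, and
two of them share the same full sequence, so Exact is 2 and Pct is
about 66.7.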
The output of Step 3 can be quite large. A variant is included
as long as it appears at least two times in the dataset.
last modified: Wed Sep 1 06:33 2021