SymTyper’s Concepts¶
Definitions¶
HIT¶
This is a clade-relevant definition. To be a HIT against a clade reference sequence, a query needs to unambiguously align with a defined similarity over a defined percentage of its length. Furthermore, the e-value of the first hit needs to be at least K orders of magnitude larger than that of an alternative clade.
NOHIT¶
This is a clade-relevant definition. A Sequence is considered a NOHIT if it does not have any satisfactory alignments against a clade.
AMBIGUOUS¶
This is a clade-relevant definition. An ambiguous sequence is one that has more than one satisfctory clade hit.
Perfect¶
This is a subtype-relevant definition. Perfect refers to a query sequence that aligns unambiguously to one sequence in the reference database (e.g., 100% similarity to 100% of the length of the target) for which the best hit’s raw bit score is at least 3 orders of magnitude larger than the raw bit score for the second hit.
Unique¶
This is a subtype-relevant definition. Unique refers to a query sequence that aligns to a single reference in the database with a user-defined (e.g., \(>=\) user defined % similarity to 100% target length) for which the best hit’s raw bit score is at least 3 orders of magnitude larger than the raw bit score for the second hit.
New¶
This is a subtype-relevant definition. A New subtype applies to a sequnence with no significant hit to any of the subtype database sequences.
ShortNew¶
This is a subtype-relevant definition. ShortNew refers to a query sequence that aligns with high similarity to a unique reference sequence according to the dynamic similarity threshold (Equation 1: Dynamic Similarity) below.
Multiples¶
This is a subtype-relevant definition. A query sequence of type multiple is a sequence that aligns with equal similarity to multiple subtypes sequences.
Short¶
This is a subtype-relevant definition. A query of type short, is one that does not meet the minimum similarity and length requirements (e.g., \(<\) 90% similarity to \(<\) 90% of the length of the target).
Dynamic Similarity¶
The dynamic similarity threshold is computed to allow query sequences that are shorter than the database references to be considered as potential hits. However, the shorter the sequnces, the higher the required stringency. The dynamic similarity threshold is computed as:
\(required\_similarity = 100 - \frac{C - min_c}{1-min_c} * (100 - min_s)\)
where:
| C | is the coverage fraction of the query over the hit sequences |
| \(min_c\) | is the minimum accepted coverage fraction of the query and the hit sequences |
| \(min_s\) | is the minimum similarity threshold between the query and the hit sequences |
Ambiguous Hit Correction¶
An ambiguous hit occurs when a sequences aligns with multiple subtypes. To try to infer the correct subtype of the sequence, we employ a strategy similar to the wisdom of the crowd, and allow similar sequences to help contribute information about the closest subtype of the sequence. To do so, ambiguous sequences are clustered using high stringency and a subtype distribution (or spectrum) is computed for each cluster.
Suppose a cluster has a distribution: 88 C1.1, 45 C1.18, 6 C1.21 and 2 C1.28. This means that at least 88 sequences in the cluster were subtyped as C1.1. and only 1 was subtyped as C1.28.
Clusters’ distributions are usually highly skewed with few high frequency subtypes and a greater number of low frequency types. Since there distributions are subsequently used to infer the Lowest Common Ancestor (LCA) sequence as a proxy, it is very improtant to rid the data of unlikely subtype that can bias the computation of the LCA. For the previous distribution, the wisdom of the crowd tells us that this cluster of sequences is closest to C1.1. and unlikely to be C1.28 and therfore drops it for the C1.28. The same can be said about C1.21 since only 6 sequences have been aligned to it. The corrected distribution is thus likely 88 C1.1, 45 C1.18. This distribution will be subsequently used to map the reads to the common ancestor in the phylogeny.
The algoirthm used to correct the subtypes distribution uses a similar approach by formalizing which subtypes to drop for the distribution using a strigency parameter p. To do so, we iteratively drop the the subtypes that have counts within the \(p^{th}\) percentile of the distribution and stop when no subtypes can be dropped.
Resolved¶
An ambiguous read is said to be resolved if its filtered distribution after the Ambiguous Hit Correction contains a single subtype.
Lowest Common Ancestor¶
In a phylogenetic tree, an internal node, \(N\), is the lowest common ancestor (or most recent common ancestor) of a set of leaves \(L\), if \(N\) is the first common parent of all the leaves of in \(L\)
Placement Tree¶
A phylogeny of the subtypes in each clade where an internal node can be labeled using the number of seqeuencing reads for which is considered to be the most recent ancestor
TSV Format¶
A file with tab delimited columns
Samples File¶
A file cotaining the samples – one per line – in the dataset.
Input File Formats¶
Fasta Input Format¶
Sequence ids in the fasta file are required to have the following format.
Sample_ID::Seq_Number
- Sample_ID: refers to the sample to which the sequence belongs. The sampleID should be present in the Samples File
- Seq_Number: is a unique identifier for a the sequence.
Note that the two colons (::) are used to separate the Sample_ID and the Seq_Number.
Clade Output Format¶
HITS OUTPUT¶
- Query sequence id
- Hit start in query
- Hit end in query
- First hit id
- Second hit id
- First hit e-value
- Second hit e-value
NOHITS OUTPUT¶
- Query sequence id
AMBIGUOUS OUTPUT¶
- Query sequence id
- First hit id
- Second hit id
- First hit e-value
- Second hit e-value
LOWOUT¶
- Query sequence id
- First hit id
- Hit e-value
MULTIPLE OUTPUT¶
- Query sequence id
- List of hits ids
Subtype Output Formats¶
NEWOUT¶
- Query sequence id
PERFECT OUTPUT¶
- Query sequence id
- Best hit id
- Query length / Hit length
- Percent identity
SHORT OUTPUT¶
- Query sequence id
- Query length
- Best hit id
- Best hit lenght
SHORTNEW OUTPUT¶
- Query sequence id
- Best hit id
- Query length / Hit length
- Percent identity
UNIQUE OUTPUT¶
- Query sequence id
- Best hit id
ResolveMultipleHits Output Formats¶
Corrected Output All Clade¶
Tab separated fields and colon separated values. Ex.
Cluster: CL_415 numSeq: 6 clade: C breakDown:180:4 175M:2 subtypes: C3.24_HE579012: 6, C3k_AY589737: 6, C3.23_HE579011: 6
The previous line tell us that CL_145 representes 6 Sequences, 2 form sample 175M and 4 from sample 180. These sequences are in Clade C and have the subtype distribution listed in subtype list.
Resolved Output All Clades¶
- Cluster ID
- Number of sequences in the cluster
- Clade
- Subtype of sequences in the cluster
Corrected Output Per Clade¶
This file format is similar to that in Corrected Output All Clade except that the subtype list represents the corrected (or effective), rather than initial, subtypes.
Newick NHX Format¶
NHX is based on the New Hampshire (NH) standard (also called “Newick tree format”). Files in this format can be view using any application that supports it, such as the online treeview program (http://etetoolkit.org/treeview/).
For more details on the NHX format, see: http://www.genetics.wustl.edu/eddy/forester/NHX.html