Alphabets and Sequences

Alphabetic sequences and associated tools and data.

Seq is a subclass of a Python string with additional annotation and an alphabet. The characters in the string must be contained in the alphabet. Various standard alphabets are provided.

Classes

Alphabet    -- A subset of non-null ASCII characters
Seq         -- An alphabetic string
SeqList     -- A collection of Seqs

Alphabets

o generic_alphabet -- A generic alphabet. Any printable ASCII character.
o protein_alphabet -- IUPAC/IUB Amino Acid one letter codes.
o nucleic_alphabet -- IUPAC/IUB Nucleic Acid codes 'ACGTURYSWKMBDHVN-'
o dna_alphabet -- Same as nucleic_alphabet, with 'U' (Uracil) as an
    alternative for 'T' (Thymidine).
o rna_alphabet -- Same as nucleic_alphabet, with 'T' (Thymidine) as an
    alternative for 'U' (Uracil).
o reduced_nucleic_alphabet -- All ambiguous codes in 'nucleic_alphabet' are
    alternatives for 'N' (aNy).
o reduced_protein_alphabet -- All ambiguous ('BZJ') and non-canonical amino
    acid codes ('U', Selenocysteine, and 'O', Pyrrolysine) in
    'protein_alphabet' are alternatives for 'X'.
o unambiguous_dna_alphabet -- 'ACGT'
o unambiguous_rna_alphabet -- 'ACGU'
o unambiguous_protein_alphabet -- The twenty canonical amino acid one letter
    codes, in alphabetic order, 'ACDEFGHIKLMNPQRSTVWY'

Amino Acid Codes:

Code  Alt.  Meaning
-----------------
A           Alanine
B           Aspartic acid or Asparagine
C           Cysteine
D           Aspartate
E           Glutamate
F           Phenylalanine
G           Glycine
H           Histidine
I           Isoleucine
J           Leucine or Isoleucine
K           Lysine
L           Leucine
M           Methionine
N           Asparagine
O           Pyrrolysine
P           Proline
Q           Glutamine
R           Arginine
S           Serine
T           Threonine
U           Selenocysteine
V           Valine
W           Tryptophan
Y           Tyrosine
Z           Glutamate or Glutamine
X    ?      any
*           translation stop
-    .~     gap

Nucleotide Codes:

Code  Alt.  Meaning
------------------------------
A           Adenosine
C           Cytidine
G           Guanine
T           Thymidine
U           Uracil
R           G A (puRine)
Y           T C (pYrimidine)
K           G T (Ketone)
M           A C (aMino group)
S           G C (Strong interaction)
W           A T (Weak interaction)
B           G T C (not A) (B comes after A)
D           G A T (not C) (D comes after C)
H           A C T (not G) (H comes after G)
V           G C A (not T, not U) (V comes after U)
N   X?      A G C T (aNy)
-   .~      A gap
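
The nucleotide table above can be read as a mapping from each code to the set of bases it stands for. The following is a minimal stdlib-only sketch of that reading; the names IUPAC_NUCLEOTIDES, matches and expand are hypothetical helpers for illustration and are not part of weblogo.

```python
from itertools import product

# Each code from the nucleotide table, mapped to the concrete bases it allows.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def matches(code, base):
    """True if the concrete base is one of those allowed by the code."""
    return base in IUPAC_NUCLEOTIDES[code]


def expand(seq):
    """List every unambiguous sequence that a coded string can stand for."""
    choices = (IUPAC_NUCLEOTIDES[c] for c in seq)
    return ["".join(p) for p in product(*choices)]


print(matches("B", "A"))   # B means "not A"
print(expand("AR"))        # R is G or A
```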

Refs:
  http://www.chem.qmw.ac.uk/iupac/AminoAcid/A2021.html
  http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html
Authors:
  GEC 2004,2005

class weblogo.seq.Alphabet

An ordered subset of printable ASCII characters.

Status:
  Beta
Authors:
  GEC 2005

alphabetic(string)

True if all characters of the string are in this alphabet.

chr(n)

The n’th character in the alphabet (zero indexed), or the null character '\0'.

chrs(sequence_of_ints)

Convert a sequence of ordinals into an alphabetic string.

letters()

Letters of the alphabet as a string.

normalize(string)

Normalize an alphabetic string by converting all alternative symbols to the canonical equivalent in ‘letters’.

ord(c)

The ordinal position of the character c in this alphabet, or 255 if no such character.

ords(string)

Convert an alphabetic string into a byte array of ordinals.
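
The ordinal methods above (ord, ords, chr, chrs, alphabetic, normalize) fit together through a lookup table. The sketch below is a hypothetical stand-in class, MiniAlphabet, written with the standard library only; it is not weblogo's implementation. It assumes that alternative symbols share the ordinal of their canonical letter, and that chr returns the null character for an out-of-range ordinal.

```python
class MiniAlphabet:
    """Hypothetical stand-in for Alphabet, for illustration only."""

    def __init__(self, letters, alternatives=None):
        self._letters = letters
        # Characters outside the alphabet map to 255.
        self._ord_table = [255] * 256
        for i, c in enumerate(letters):
            self._ord_table[ord(c)] = i
        # Assumption: an alternative symbol gets its canonical letter's ordinal.
        for alt, canon in (alternatives or {}).items():
            self._ord_table[ord(alt)] = letters.index(canon)

    def letters(self):
        return self._letters

    def ord(self, c):
        code = ord(c)
        return self._ord_table[code] if code < 256 else 255

    def ords(self, string):
        return bytes(self.ord(c) for c in string)

    def chr(self, n):
        return self._letters[n] if 0 <= n < len(self._letters) else "\0"

    def chrs(self, ints):
        return "".join(self.chr(n) for n in ints)

    def alphabetic(self, string):
        return all(self.ord(c) != 255 for c in string)

    def normalize(self, string):
        # A round trip through ordinals replaces alternatives with canonical letters.
        return self.chrs(self.ords(string))


dna = MiniAlphabet("ACGT", {"U": "T"})
print(list(dna.ords("GAU")))   # ordinals of G, A and the alternative U
print(dna.normalize("GAU"))    # U is rewritten as T
```

Storing 255 for unknown characters keeps every ordinal within a single byte, which is why ords can return a byte array.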

static which(seqs, alphabets=None)

Returns the most appropriate unambiguous protein, RNA or DNA alphabet for a Seq or SeqList. If a list of alphabets is supplied, then the best alphabet is selected from that list.

The heuristic is to count the occurrences of letters for each alphabet and downweight longer alphabets by the log of the alphabet length. Ties go to the first alphabet in the list.
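
One way to read the heuristic described above is sketched here with plain strings standing in for alphabets. The scoring formula (letter count divided by the natural log of the alphabet length) is an assumption based on that description, and the function is a hypothetical illustration rather than the library's code.

```python
from math import log

DNA = "ACGT"
RNA = "ACGU"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"


def which(seq, alphabets):
    """Pick the alphabet with the highest length-penalised letter count."""
    best, best_score = None, float("-inf")
    for alpha in alphabets:
        count = sum(1 for c in seq if c in alpha)
        score = count / log(len(alpha))   # longer alphabets are downweighted
        if score > best_score:            # strict '>' keeps the first on ties
            best, best_score = alpha, score
    return best


print(which("ACGTACGT", [PROTEIN, RNA, DNA]))
print(which("MKVLAAGIV", [PROTEIN, RNA, DNA]))
```

Without the penalty, the protein alphabet would win for almost any nucleotide sequence, since 'A', 'C', 'G' and 'T' are all valid amino acid codes.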

class weblogo.seq.Seq

An alphabetic string. A subclass of “str” consisting solely of letters from the same alphabet.

alphabet -- A string or Alphabet of allowed characters.
name -- A short string used to identify the sequence.
description -- A string describing the sequence.

Authors:
  GEC 2005

back_translate()

Translate a protein sequence back into coding DNA, using the standard genetic code. See weblogo.transform.GeneticCode for details and more options.

complement()

Returns the complementary nucleic acid sequence.

join(str_list)

Concatenate any number of strings.

The string whose method is called is inserted in between each given string. The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'

lower()

Return a lower case copy of the sequence.

mask(letters='abcdefghijklmnopqrstuvwxyz', mask='X')

Replace all occurrences of letters with the mask character. The default is to replace all lower case letters with ‘X’.
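
The masking behaviour described above can be sketched on a plain string with a translation table. This is a minimal illustration under the assumption that masking is a one-for-one character substitution; the parameter name mask_char is hypothetical.

```python
import string


def mask(seq, letters=string.ascii_lowercase, mask_char="X"):
    """Replace every character found in 'letters' with 'mask_char'."""
    table = str.maketrans(letters, mask_char * len(letters))
    return seq.translate(table)


print(mask("ACGTacgtAC"))                         # lower case region masked
print(mask("ACGT", letters="GC", mask_char="N"))  # custom letters and mask
```

Lower case is a common convention for marking low-complexity or repeat regions, which is why it is the default set of letters to mask.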

ords()

Convert sequence to an array of integers in the range [0, len(alphabet)).

remove(delchars)

Return a new alphabetic sequence with all characters in ‘delchars’ removed.

reverse()

Return the reversed sequence.

Note that this method returns a new object, in contrast to the in-place reverse() method of list objects.

reverse_complement()

Returns the reversed complementary nucleic acid sequence (i.e. the other strand of a DNA sequence).
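
As an illustration of complement() and reverse_complement(), here is a stdlib-only sketch on plain upper case DNA strings. The complement pairs for the ambiguity codes follow from the nucleotide code table (for example R, meaning G or A, pairs with Y, meaning C or T); the functions are hypothetical stand-ins, not the Seq methods themselves.

```python
# Each code maps to the code for the complementary set of bases.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN-", "TGCAYRSWMKVHDBN-")


def complement(seq):
    """Complement each base, keeping the original order."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq):
    """Complement each base and reverse the result (the other strand)."""
    return complement(seq)[::-1]


print(complement("ARN-"))
print(reverse_complement("AACG"))
```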

tally(alphabet=None)

Counts the occurrences of alphabetic characters.

Arguments:
  alphabet -- an optional alternative alphabet

Returns:
  A list of character counts in alphabetic order.

tostring()

Converts Seq to a raw string.

translate()

Translate a nucleotide sequence to a polypeptide using full IUPAC ambiguities in DNA/RNA and amino acid codes, using the standard genetic code. See weblogo.transform.GeneticCode for details and more options.
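
A much reduced sketch of translation with the standard genetic code follows. Unlike the method described above, it does not resolve IUPAC ambiguities: as a simplifying assumption, any codon containing a non-ACGT letter becomes 'X', and a trailing partial codon is dropped. The names here are hypothetical and independent of weblogo.transform.GeneticCode.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate DNA or RNA, three letters at a time, into a polypeptide."""
    dna = seq.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    # Assumption: codons not in the table (e.g. ambiguous ones) become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate("ATGGCCTAA"))   # Met, Ala, stop
```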

upper()

Return an upper case copy of the sequence.

word_count(k, alphabet=None)

Return a count of all subwords of length k in the sequence.

>>> from weblogo.seq import *
>>> Seq("abcabc").word_count(3)
[('abc', 2), ('bca', 1), ('cab', 1)]

words(k, alphabet=None)

Return an iteration over all subwords of length k in the sequence. If an optional alphabet is provided, only words from that alphabet are returned.

>>> list(Seq("abcabc").words(3))
['abc', 'bca', 'cab', 'abc']
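
The two subword methods can be pictured as a sliding window of width k over the sequence. The sketch below works on plain strings and ignores the optional alphabet filter; it is an illustrative assumption about the behaviour, chosen to reproduce the doctest outputs shown for "abcabc".

```python
from collections import Counter


def words(seq, k):
    """Yield every overlapping subword of length k, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def word_count(seq, k):
    """Count each distinct subword and return (word, count) pairs, sorted."""
    return sorted(Counter(words(seq, k)).items())


print(list(words("abcabc", 3)))
print(word_count("abcabc", 3))
```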

weblogo.seq.rna(string)

Create an alphabetic sequence representing a stretch of RNA.

weblogo.seq.dna(string)

Create an alphabetic sequence representing a stretch of DNA.

weblogo.seq.protein(string)

Create an alphabetic sequence representing a stretch of polypeptide.

class weblogo.seq.SeqList(alist=[], alphabet=None, name=None, description=None)

A list of sequences.

isaligned()

Are all sequences of the same length and alphabet?

ords(alphabet=None)

Convert sequence list into a 2D array of ordinals.

profile(alphabet=None)

Counts the occurrences of characters in each column.

Returns: Motif(counts, alphabet)
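
Column-wise counting can be sketched as follows for a list of equal-length plain strings. The result here is a simple list of per-column count rows rather than a Motif object, and the function is a hypothetical illustration that assumes the sequences are already aligned.

```python
def profile(seqs, alphabet):
    """Count, for each column, how often each alphabet letter occurs."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must all have the same length")
    index = {c: i for i, c in enumerate(alphabet)}
    counts = [[0] * len(alphabet) for _ in range(length)]
    for s in seqs:
        for col, c in enumerate(s):
            i = index.get(c)
            if i is not None:          # letters outside the alphabet are skipped
                counts[col][i] += 1
    return counts


for row in profile(["ACG", "ACT", "AGT"], "ACGT"):
    print(row)
```

Each row corresponds to one alignment column, with counts in the order of the alphabet's letters.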

tally(alphabet=None)

Counts the occurrences of alphabetic characters.

Parameters:
  alphabet -- an optional alternative alphabet

Returns:
  A list of character counts in alphabetic order.
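
A tally over a collection of sequences can be sketched like this, using plain strings in place of Seq objects and a string in place of an Alphabet. Skipping characters that are not in the alphabet is an assumption of this sketch.

```python
def tally(seqs, alphabet):
    """Return letter counts over all sequences, in alphabet order."""
    index = {c: i for i, c in enumerate(alphabet)}
    counts = [0] * len(alphabet)
    for seq in seqs:
        for c in seq:
            i = index.get(c)
            if i is not None:   # assumption: unknown characters are ignored
                counts[i] += 1
    return counts


print(tally(["GATTACA", "ACGT"], "ACGT"))   # counts for A, C, G, T
```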