Welcome to pymer’s documentation!
This package provides several classes and utilities for counting k-mers in DNA sequences.
Examples
Note
The API demonstrated below applies to all Counters, though Counter initialisation varies.
>>> from pymer import ExactKmerCounter
>>> ksize = 4
>>> kc = ExactKmerCounter(ksize)
DNA sequences are counted using the consume method:
>>> kc.consume('ACGTACGTACGTAC')
>>> kc['ACGT']
3
Sequences can be subtracted using the unconsume method:
>>> kc.unconsume('ACGTA')
>>> kc['ACGT']
2
>>> kc['CGTA']
2
>>> kc['GTAC']
3
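The counts shown above can be reproduced with a sliding window over the sequence. The following is a minimal pure-Python sketch, not pymer's implementation: the helper `kmers` is a hypothetical name, and a `collections.Counter` stands in for the counter object.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq, left to right (hypothetical helper)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


counts = Counter()
# Analogous to consume: add one count per k-mer in the sequence.
counts.update(kmers('ACGTACGTACGTAC', 4))
# Analogous to unconsume: remove one count per k-mer in the sequence.
counts.subtract(kmers('ACGTA', 4))

print(counts['ACGT'], counts['CGTA'], counts['GTAC'])
```

A sequence of length n contributes n - k + 1 overlapping k-mers, so 'ACGTA' with k = 4 removes exactly one count each from 'ACGT' and 'CGTA' and leaves 'GTAC' untouched.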
Counters can be added and subtracted:
>>> kc += kc
>>> kc['GTAC']
6
>>> kc -= kc
>>> kc['GTAC']
0
Counters may be read from and written to a file, using HDF5.
>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> tmpdir = mkdtemp()
>>> filename = tmpdir + '/kc.h5'
(Above we simply create a temporary directory to hold the saved counts.)
>>> kc.write(filename)
>>> new_kc = ExactKmerCounter.read(filename, ksize)
>>> (kc.array == new_kc.array).all()
True
>>> rmtree(tmpdir)
Data Structures
Summary
ExactKmerCounter(k[, alphabet, array]) | Count k-mers in DNA sequences exactly using an array.
Exact K-mer Counting
class pymer.ExactKmerCounter(k, alphabet='ACGT', array=None)
Count k-mers in DNA sequences exactly using an array.
Parameters:
k : int
K-mer length
alphabet : list-like (str, bytes, list, set, tuple) of letters
Alphabet over which values are defined, defaults to “ACGT”
Methods
consume(seq) | Counts all k-mers in sequence.
consume_file(filename) | Counts all k-mers in all sequences in a FASTA/FASTQ file.
print_table([sparse, file, sep]) |
read(filename, kmersize) |
readall(filename) |
to_dict([sparse]) |
unconsume(seq) | Subtracts all k-mers in sequence.
write(filename) |
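To illustrate what counting "exactly using an array" can look like, here is a small sketch under a stated assumption: each k-mer is read as a number in base len(alphabet), and that number is used as the array index. The function names are hypothetical, and pymer's actual array layout may differ.

```python
def kmer_to_index(kmer, alphabet='ACGT'):
    """Read a k-mer as a base-len(alphabet) number (assumed encoding)."""
    index = 0
    for base in kmer:
        index = index * len(alphabet) + alphabet.index(base)
    return index


def count_into_array(seq, k, alphabet='ACGT'):
    """Count every overlapping k-mer of seq into a flat list of 4**k slots."""
    counts = [0] * (len(alphabet) ** k)
    for i in range(len(seq) - k + 1):
        counts[kmer_to_index(seq[i:i + k], alphabet)] += 1
    return counts


counts = count_into_array('ACGTACGTACGTAC', 4)
print(len(counts), counts[kmer_to_index('ACGT')])
```

With this layout the array holds one slot per possible k-mer, so memory grows as 4^k regardless of how many distinct k-mers the input actually contains.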
Markovian K-mer Counting
class pymer.TransitionKmerCounter(k, alphabet='ACGT', array=None)
Counts Markovian state transitions in DNA sequences.
This class counts transitions between (k-1)-mers (or stems) and their following bases. This represents the (k-1)th-order Markov process that may have generated the underlying DNA sequences.
A normalised, condensed transition matrix of shape (4^(k-1), 4) or a sparse, complete transition matrix of shape (4^(k-1), 4^(k-1)) can be returned. In addition, the steady-state vector is calculated from the complete transition matrix via eigendecomposition.
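As an illustration of the condensed matrix described above, the sketch below counts, for each stem, how often each base follows it and then normalises every row. It is plain Python written for this page, not pymer's code; `transition_matrix` is a hypothetical name, and leaving rows for unobserved stems as zeros is an assumption.

```python
from itertools import product


def transition_matrix(seq, k, alphabet='ACGT'):
    """Map each (k-1)-mer stem to a row of next-base frequencies."""
    stems = [''.join(p) for p in product(alphabet, repeat=k - 1)]
    counts = {stem: [0] * len(alphabet) for stem in stems}
    for i in range(len(seq) - k + 1):
        stem = seq[i:i + k - 1]          # the first k-1 bases of the k-mer
        following = seq[i + k - 1]       # the base that follows the stem
        counts[stem][alphabet.index(following)] += 1
    matrix = {}
    for stem, row in counts.items():
        total = sum(row)
        # Assumption: stems never seen in the input keep an all-zero row.
        matrix[stem] = [c / total if total else 0.0 for c in row]
    return matrix


matrix = transition_matrix('ACGTACGTACGTAC', 3)
print(len(matrix), matrix['AC'])
```

With k = 3 there are 4^2 = 16 stems, each with a row of 4 entries, matching the (4^(k-1), 4) shape. In this input 'AC' is always followed by 'G', so that row puts all of its weight on 'G'.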
Parameters:
k : int
K-mer length
alphabet : str
Alphabet over which values are defined, defaults to “ACGT”
Attributes
P | Sparse transition frequency matrix of shape (4^(k-1), 4^(k-1))
steady_state | Steady-state frequencies of each (k-1)-mer
stem_frequencies | Frequencies of each stem ((k-1)-mer)
transitions | Dense transition frequency matrix of shape (4^(k-1), 4)
Methods
consume(seq) | Counts all k-mers in sequence.
consume_file(filename) | Counts all k-mers in all sequences in a FASTA/FASTQ file.
read(filename, kmersize) |
readall(filename) |
unconsume(seq) | Subtracts all k-mers in sequence.
write(filename) |
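The relationship between the complete stem-to-stem matrix and the steady-state vector can be sketched in plain Python. Note the substitution: the class description mentions eigendecomposition, whereas this sketch uses power iteration, which converges to the same stationary vector when the chain is irreducible and aperiodic. All names below are hypothetical, and the sketch assumes every stem occurs in the input with at least one following base.

```python
from itertools import product


def complete_transition_matrix(seq, k, alphabet='ACGT'):
    """Row-normalised matrix of transitions between consecutive (k-1)-mers."""
    stems = [''.join(p) for p in product(alphabet, repeat=k - 1)]
    index = {stem: i for i, stem in enumerate(stems)}
    n = len(stems)
    counts = [[0] * n for _ in range(n)]
    for i in range(len(seq) - k + 1):
        src = seq[i:i + k - 1]       # stem at position i
        dst = seq[i + 1:i + k]       # stem shifted one base to the right
        counts[index[src]][index[dst]] += 1
    rows = []
    for row in counts:
        total = sum(row)
        rows.append([c / total if total else 0.0 for c in row])
    return stems, rows


def steady_state(P, iterations=200):
    """Approximate the stationary vector pi (pi = pi P) by power iteration."""
    n = len(P)
    pi = [1.0] + [0.0] * (n - 1)     # start with all mass on the first stem
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


stems, P = complete_transition_matrix('AACCGGTTAGCTA', 2)
pi = steady_state(P)
# Residual of the fixed-point equation pi = pi P; close to zero at convergence.
residual = max(
    abs(pi[j] - sum(pi[i] * P[i][j] for i in range(len(P))))
    for j in range(len(P))
)
print(dict(zip(stems, pi)), residual)
```

Checking the residual is a cheap way to confirm that the iteration has settled, whichever method is used to obtain the vector.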