Welcome to pymer’s documentation!

This package provides several classes and utilities for counting k-mers in DNA sequences.

Examples

Note

The API demonstrated below applies to all Counters, though Counter initialisation varies.

>>> from pymer import ExactKmerCounter
>>> ksize = 4
>>> kc = ExactKmerCounter(ksize)

DNA sequences are counted using the consume method:

>>> kc.consume('ACGTACGTACGTAC')
>>> kc['ACGT']
3

Sequences can be subtracted using the unconsume method:

>>> kc.unconsume('ACGTA')
>>> kc['ACGT']
2
>>> kc['CGTA']
2
>>> kc['GTAC']
3
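
For readers who want to see why these numbers come out as they do, the following is a minimal plain-Python sketch of sliding-window k-mer counting. It does not use pymer: the helper names `iter_kmers`, `consume` and `unconsume` are hypothetical stand-ins written for illustration, and a `collections.Counter` replaces whatever storage pymer uses internally.

```python
from collections import Counter


def iter_kmers(seq, k):
    """Yield every overlapping k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def consume(counts, seq, k):
    """Add one count for each k-mer in seq (illustrative, not pymer's code)."""
    for kmer in iter_kmers(seq, k):
        counts[kmer] += 1


def unconsume(counts, seq, k):
    """Remove one count for each k-mer in seq (illustrative, not pymer's code)."""
    for kmer in iter_kmers(seq, k):
        counts[kmer] -= 1


counts = Counter()
consume(counts, 'ACGTACGTACGTAC', 4)   # 11 overlapping 4-mers
unconsume(counts, 'ACGTA', 4)          # removes one ACGT and one CGTA
print(counts['ACGT'], counts['CGTA'], counts['GTAC'])  # 2 2 3
```

A sequence of length n contains n - k + 1 overlapping k-mers, which is why the 5-base sequence passed to unconsume only affects two 4-mers.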

Counters can be added and subtracted:

>>> kc += kc
>>> kc['GTAC']
6
>>> kc -= kc
>>> kc['GTAC']
0

Counters may be written to and read from a file using bcolz:

>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> tmpdir = mkdtemp()
>>> filename = tmpdir + '/kc.bcz'

(Above we simply create a temporary directory to hold the saved counts.)

>>> kc.write(filename)
>>> new_kc = ExactKmerCounter.read(filename, ksize)
>>> (kc.array == new_kc.array).all()
True
>>> rmtree(tmpdir)

Data Structures

Summary

ExactKmerCounter(k[, alphabet, array]) Count k-mers in DNA sequences exactly using an array.

Exact K-mer Counting

class pymer.ExactKmerCounter(k, alphabet='ACGT', array=None)

Count k-mers in DNA sequences exactly using an array.

Parameters:

k : int

K-mer length

alphabet : list-like (str, bytes, list, set, tuple) of letters

Alphabet over which values are defined, defaults to “ACGT”
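
Counting "using an array" implies that each k-mer corresponds to one array slot. One common way to achieve this, sketched below, is to read the k-mer as a number in base len(alphabet). This is an assumption made for illustration: the encoding and ordering that pymer actually uses are not described here, and `kmer_to_index` and `index_to_kmer` are hypothetical helpers, not part of the pymer API.

```python
def kmer_to_index(kmer, alphabet='ACGT'):
    """Encode a k-mer as an integer by reading it as a base-len(alphabet) number."""
    base = len(alphabet)
    index = 0
    for letter in kmer:
        index = index * base + alphabet.index(letter)
    return index


def index_to_kmer(index, k, alphabet='ACGT'):
    """Invert kmer_to_index for a k-mer of known length k."""
    base = len(alphabet)
    letters = []
    for _ in range(k):
        index, digit = divmod(index, base)
        letters.append(alphabet[digit])
    return ''.join(reversed(letters))


k = 4
seq = 'ACGTACGTACGTAC'
table = [0] * 4 ** k                      # one slot per possible 4-mer
for i in range(len(seq) - k + 1):
    table[kmer_to_index(seq[i:i + k])] += 1

print(kmer_to_index('ACGT'))              # 27
print(index_to_kmer(27, k))               # ACGT
print(table[kmer_to_index('ACGT')])       # 3
```

Under this scheme the array needs len(alphabet)^k entries, so memory grows exponentially with k.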

Methods

consume(seq) Counts all k-mers in sequence.
consume_file(filename) Counts all k-mers in all sequences in a FASTA/FASTQ file.
print_table([sparse, file, sep])
read(filename, kmersize)
readall(filename)
to_dict([sparse])
unconsume(seq) Subtracts all k-mers in sequence.
write(filename)

Markovian K-mer Counting

class pymer.TransitionKmerCounter(k, alphabet='ACGT', array=None)

Counts Markovian state transitions in DNA sequences.

This class counts transitions between (k-1)-mers (or stems) and their following bases. This represents the (k-1)th-order Markov process that may have generated the underlying DNA sequences.

A normalised, condensed transition matrix of shape (4^(k-1), 4) or a sparse, complete transition matrix of shape (4^(k-1), 4^(k-1)) can be returned. In addition, the steady-state vector is calculated from the complete transition matrix via eigendecomposition.
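
To make the condensed matrix and the steady-state vector concrete, here is a small plain-Python sketch. It is not pymer's implementation: the function names are hypothetical, the matrix is stored as a dict of rows rather than an array, and the steady state is approximated by power iteration instead of the eigendecomposition mentioned above (power iteration is only reliable when the chain is irreducible and aperiodic, as it is for the example sequence).

```python
from itertools import product


def transition_counts(seq, k, alphabet='ACGT'):
    """Count how often each (k-1)-mer stem is followed by each base."""
    stems = [''.join(p) for p in product(alphabet, repeat=k - 1)]
    counts = {stem: [0] * len(alphabet) for stem in stems}
    for i in range(len(seq) - k + 1):
        stem, base = seq[i:i + k - 1], seq[i + k - 1]
        counts[stem][alphabet.index(base)] += 1
    return counts


def normalise_rows(counts):
    """Turn each row of counts into probabilities; empty rows stay all zero."""
    matrix = {}
    for stem, row in counts.items():
        total = sum(row)
        matrix[stem] = [c / total if total else 0.0 for c in row]
    return matrix


def steady_state(condensed, alphabet='ACGT', iterations=1000):
    """Approximate the stationary distribution over stems by power iteration."""
    stems = list(condensed)
    pi = {stem: 1.0 / len(stems) for stem in stems}
    for _ in range(iterations):
        nxt = {stem: 0.0 for stem in stems}
        for stem, row in condensed.items():
            for base, prob in zip(alphabet, row):
                # Appending `base` to `stem` and dropping the first letter
                # gives the next stem, i.e. one entry of the complete matrix.
                nxt[(stem + base)[1:]] += pi[stem] * prob
        pi = nxt
    return pi


seq = 'AACGTTGCATCGGATACCA'
P = normalise_rows(transition_counts(seq, k=2))   # shape (4^(k-1), 4) as a dict
pi = steady_state(P)
print(P['A'])   # [0.2, 0.4, 0.0, 0.4]
```

Each row of the condensed matrix holds the probabilities of the four bases following one stem. The complete matrix is never built explicitly here; the inner loop applies it implicitly, since each stem can only move to the (at most four) stems obtained by shifting in one new base.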

Parameters:

k : int

K-mer length

alphabet : str

Alphabet over which values are defined, defaults to “ACGT”

Attributes

P
steady_state
stem_frequencies Compute the frequencies of each stem, i.e.
transitions

Methods

consume(seq) Counts all k-mers in sequence.
consume_file(filename) Counts all k-mers in all sequences in a FASTA/FASTQ file.
read(filename, kmersize)
readall(filename)
unconsume(seq) Subtracts all k-mers in sequence.
write(filename)