Database API

The “database” object handles the indexes of the sequence dataset in fasta format, and other useful information on the input dataset.

MacSyFinder needs several indexes to run and to speed up the analyses:

  • index for hmmsearch (Hmmer program)
  • index for MacSyFinder

hmmsearch needs to index the sequences to speed up the analyses. The indexes are built by the external tools formatdb (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ncbi.tar.gz) or makeblastdb. MacSyFinder tries to find formatdb indexes in the same directory as the sequence file. If the indexes are present, MacSyFinder uses them; otherwise it builds them using formatdb or makeblastdb.

MacSyFinder also needs the length of each sequence and its position in the database to compute some statistics on Hmmer hits. Thus it also builds an index (with the .idx suffix) that is stored in the same directory as the sequence dataset. If this file is found in the same folder as the input dataset, MacSyFinder uses it; otherwise, it builds it.

The user can force MacSyFinder to rebuild these indexes with the “--idx” option on the command line.

Additionally, for ordered datasets (db_type = ‘gembase’ or ‘ordered_replicon’), MacSyFinder builds an internal “database” from these indexes to store information about replicons: their begin and end positions, and their topology. The begin and end positions of each replicon are computed from the sequence file, and the topology is parsed from the topology file (--topology-file, see Topology files).
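The following is a minimal sketch of how such a replicon table could be derived from index entries; it is not the macsypy implementation. It assumes that entries are (sequence id, length, rank) tuples in file order, that sequence ids look like `<replicon>_<gene number>` (a hypothetical convention used only for this illustration), that the `genes` field holds (sequence id, length) pairs, and that "circular" is the default topology. The helper names `replicon_name_of` and `build_replicon_table` are invented here.

```python
from collections import namedtuple
from itertools import groupby

# Same field order as the documented RepliconInfo(topology, min, max, genes)
RepliconInfo = namedtuple("RepliconInfo", "topology, min, max, genes")


def replicon_name_of(seq_id):
    # Assumption: ids look like '<replicon>_<gene number>'
    return seq_id.rsplit("_", 1)[0]


def build_replicon_table(entries, topology, default_topology="circular"):
    """Group consecutive entries by replicon and record their rank range."""
    table = {}
    for name, group in groupby(entries, key=lambda e: replicon_name_of(e[0])):
        group = list(group)
        ranks = [rank for _, _, rank in group]
        # Assumption: 'genes' is stored as (sequence id, length) pairs
        genes = [(seq_id, length) for seq_id, length, _ in group]
        table[name] = RepliconInfo(
            topology.get(name, default_topology), min(ranks), max(ranks), genes
        )
    return table


entries = [("repA_0001", 120, 1), ("repA_0002", 87, 2), ("repB_0001", 300, 3)]
# Only repB has an entry in the (hypothetical) parsed topology file
table = build_replicon_table(entries, topology={"repB": "linear"})
```

Here `table["repA"]` falls back to the default topology, while `table["repB"]` takes the one given in the topology mapping.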

database API reference

class macsypy.database.Indexes(cfg)[source]

Handle the indexes for macsyfinder:

  • find the indexes for hmmer, or build them using formatdb or makeblastdb external tools
  • find the indexes required by macsyfinder to compute some scores, or build them.
__init__(cfg)[source]

The constructor retrieves the index files. If they are not present, or if the user asked to rebuild the indexes (--idx), it launches the index building.

Parameters:cfg (macsypy.config.Config object) – the configuration
__weakref__

list of weak references to the object (if defined)

_build_hmmer_indexes()[source]

Build the index files for hmmer using the formatdb or makeblastdb tool.

_build_my_indexes()[source]

Build macsyfinder indexes. These indexes are stored in a file.

The file format is the following:
  • one entry per line, with each line having this format:
  • sequence id;sequence length;sequence rank
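As a rough illustration of the format above, the sketch below writes such an index and reads it back. Only the `id;length;rank` line layout comes from the description; the helper names `write_index` and `read_index` are hypothetical and not part of macsypy, and ranks starting at 1 is an assumption.

```python
import io


def write_index(entries, handle):
    """Write (seq_id, length, rank) tuples, one 'id;length;rank' line each."""
    for seq_id, length, rank in entries:
        handle.write("{};{:d};{:d}\n".format(seq_id, length, rank))


def read_index(handle):
    """Yield (seq_id, length, rank) tuples parsed from an index file."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # rsplit keeps the id intact even if it contains a ';'
        seq_id, length, rank = line.rsplit(";", 2)
        yield seq_id, int(length), int(rank)


entries = [("seq_A", 120, 1), ("seq_B", 87, 2)]
buffer = io.StringIO()  # stands in for the .idx file
write_index(entries, buffer)
buffer.seek(0)
parsed = list(read_index(buffer))
```

Splitting from the right is a design choice of this sketch: the two numeric fields are always last, so a sequence id containing a semicolon would still be recovered whole.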
build(force=False)[source]

Build the indexes from the sequence dataset in fasta format

Parameters:force (boolean) – If True, force the index building even if the index files are present in the sequence dataset folder
find_hmmer_indexes()[source]
Returns:The hmmer index files. If the indexes are inconsistent (some files are missing), a RuntimeError is raised
Return type:list of string
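The "all or nothing" check described for find_hmmer_indexes can be sketched as follows. This is an illustration only, not the macsypy code: the function name `find_hmmer_indexes_sketch` is hypothetical, the suffix list (`.phr`, `.pin`, `.psq`, the protein index files written by formatdb) is an assumption about which files are checked, and returning an empty list when no index exists is also an assumption.

```python
import os
import tempfile

# Assumed suffixes of the protein index files next to the fasta file
REQUIRED_SUFFIXES = (".phr", ".pin", ".psq")


def find_hmmer_indexes_sketch(fasta_path, suffixes=REQUIRED_SUFFIXES):
    """Return the index files found next to fasta_path.

    Returns [] when no index exists, and raises RuntimeError when only
    some of the expected files are present.
    """
    found = [fasta_path + s for s in suffixes if os.path.exists(fasta_path + s)]
    if found and len(found) != len(suffixes):
        raise RuntimeError(
            "incomplete index set for {}: found {}".format(fasta_path, found)
        )
    return found


with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "proteins.fasta")
    open(fasta, "w").close()
    none_found = find_hmmer_indexes_sketch(fasta)  # no index files yet
    for suffix in REQUIRED_SUFFIXES:
        open(fasta + suffix, "w").close()
    all_found = find_hmmer_indexes_sketch(fasta)  # complete set
```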
find_my_indexes()[source]
Returns:the file of macsyfinder indexes if it exists in the dataset folder, None otherwise.
Return type:string
class macsypy.database.RepliconDB(cfg)[source]

Stores information (topology, min, max, [genes]) for all replicons in the sequence_db. The RepliconDB object must be instantiated only for a sequence_db of type ‘gembase’ or ‘ordered_replicon’.

__contains__(replicon_name)[source]
Parameters:replicon_name (string) – the name of the replicon
Returns:True if replicon_name is in the repliconDB, False otherwise.
Return type:boolean
__getitem__(replicon_name)[source]
Parameters:replicon_name (string) – the name of the replicon to get information on
Returns:the RepliconInfo for the provided replicon_name
Return type:RepliconInfo object
Raise:KeyError if replicon_name is not in repliconDB
__init__(cfg)[source]
Parameters:cfg (macsypy.config.Config object) – The configuration object

Note

This class can be instantiated only if the db_type is ‘gembase’ or ‘ordered_replicon’

__weakref__

list of weak references to the object (if defined)

_fill_gembase_min_max(topology, default_topology)[source]

For each replicon_name of a gembase dataset, it fills the internal dictionary with a namedtuple RepliconInfo

Parameters:
  • topology (dict) – the topologies for each replicon (parsed from the file specified with the option --topology-file)
  • default_topology (string) – the topology provided by config.replicon_topology
_fill_ordered_min_max(default_topology=None)[source]

For the replicon_name of the ordered_replicon sequence base, fill the internal dict with RepliconInfo

Parameters:default_topology (string) – the topology provided by config.replicon_topology
_fill_topology()[source]

Fill the internal dictionary with min and max positions for each replicon_name of the sequence_db

get(replicon_name, default=None)[source]
Parameters:
  • replicon_name (string) – the name of the replicon to get information on
  • default (any) – the value to return if the replicon_name is not in the RepliconDB
Returns:

the RepliconInfo for replicon_name if replicon_name is in the repliconDB, else default.

If default is not given, it is set to None, so that this method never raises a KeyError.
Return type:RepliconInfo object
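The difference between `__getitem__`, `__contains__` and `get` mirrors that of a Python dict. The sketch below uses a hypothetical stand-in class, `ToyRepliconDB`, backed by a plain dict, because a real RepliconDB needs a macsypy Config object. The replicon name and the tuple values are made up for the illustration.

```python
class ToyRepliconDB:
    """Hypothetical stand-in showing the mapping-style lookups only."""

    def __init__(self, infos):
        self._db = dict(infos)

    def __contains__(self, replicon_name):
        return replicon_name in self._db

    def __getitem__(self, replicon_name):
        # Raises KeyError when replicon_name is absent
        return self._db[replicon_name]

    def get(self, replicon_name, default=None):
        # Never raises: falls back to default (None if not given)
        return self._db.get(replicon_name, default)


db = ToyRepliconDB({"repliconA": ("circular", 1, 10, [])})

known = db["repliconA"]
missing = db.get("repliconZ")  # no KeyError, default is None
fallback = db.get("repliconZ", "no such replicon")
```

Use `get` when a missing replicon is an expected case, and the bracket form when its absence should be treated as an error.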

items()[source]
Returns:a copy of the RepliconDB as a list of (replicon_name, RepliconInfo) pairs
iteritems()[source]
Returns:an iterator over the (replicon_name, RepliconInfo) pairs of the RepliconDB
replicon_infos()[source]
Returns:a copy of the RepliconDB as a list of RepliconInfo objects
Return type:list of RepliconInfo instances
replicon_names()[source]
Returns:a copy of the RepliconDB as a list of replicon_names
class macsypy.database.RepliconInfo(topology, min, max, genes)
__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

__getstate__()

Exclude the OrderedDict from pickling

static __new__(_cls, topology, min, max, genes)

Create new instance of RepliconInfo(topology, min, max, genes)

__repr__()

Return a nicely formatted representation string

_asdict()

Return a new OrderedDict which maps field names to their values

classmethod _make(iterable, new=<built-in method __new__ of type object>, len=<built-in function len>)

Make a new RepliconInfo object from a sequence or iterable

_replace(_self, **kwds)

Return a new RepliconInfo object replacing specified fields with new values

genes

Alias for field number 3

max

Alias for field number 2

min

Alias for field number 1

topology

Alias for field number 0
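Since RepliconInfo is a namedtuple, its fields can be read by name or by position. The sketch below recreates a stand-in with the same field order using `collections.namedtuple`, so that it runs without macsypy. The topology strings and the contents of `genes` are illustrative values, not taken from a real dataset.

```python
from collections import namedtuple

# Stand-in with the same field order as macsypy.database.RepliconInfo
RepliconInfo = namedtuple("RepliconInfo", "topology, min, max, genes")

info = RepliconInfo(topology="circular", min=1, max=3,
                    genes=["gene_1", "gene_2", "gene_3"])  # illustrative values

# Access by name or by field number (topology=0, min=1, max=2, genes=3)
first_position = info.min
same_value = info[1]

# Tuples are immutable: _replace returns a modified copy
shorter = info._replace(max=2)

# _make builds an instance from any iterable in field order
from_list = RepliconInfo._make(["linear", 4, 9, []])

# _asdict maps field names to values, in field order
as_dict = info._asdict()
```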

macsypy.database.fasta_iter(fasta_file)[source]
Parameters:fasta_file (file object) – the file containing all input sequences in fasta format.
Author:http://biostar.stackexchange.com/users/36/brentp
Returns:an iterator over the given fasta file, which yields tuples (string id, string comment, int sequence length)
Return type:iterator
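A minimal sketch of such an iterator, based on `itertools.groupby`, is shown below. It is not the macsypy source: the name `fasta_iter_sketch` is hypothetical, and it assumes that the id is the first whitespace-delimited word of the header, that the comment is the rest of the header line, and that the file starts with a header and every header is followed by at least one sequence line.

```python
import io
from itertools import groupby


def fasta_iter_sketch(fasta_file):
    """Yield (id, comment, sequence length) for each record of an open fasta file."""
    # groupby splits the lines into alternating runs: header lines, sequence lines
    runs = (run for _, run in groupby(fasta_file, lambda line: line.startswith(">")))
    for header_run in runs:
        header = next(header_run)[1:].strip()
        seq_id, _, comment = header.partition(" ")
        # The run that follows a header holds the sequence lines of that record
        length = sum(len(line.strip()) for line in next(runs, ()))
        yield seq_id, comment, length


fasta = io.StringIO(">seq1 first protein\nMKT\nLLV\n>seq2\nMA\n")
records = list(fasta_iter_sketch(fasta))
# records == [("seq1", "first protein", 6), ("seq2", "", 2)]
```

Only the lengths are accumulated, so the sequences themselves are never held in memory as a whole. Two consecutive header lines (a record with no sequence) are not handled by this sketch.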