database

The “database” object handles the indexes of the sequence dataset in fasta format, and other useful information on the input dataset.

MacSyFinder needs the length of each sequence and its position in the database to compute some statistics on Hmmer hits. Additionally, for ordered datasets (db_type = ‘gembase’ or ‘ordered_replicon’), MacSyFinder builds an internal “database” from these indexes to store information about replicons: their begin and end positions, and their topology.

The begin and end positions of each replicon are computed from the sequence file, and the topology is parsed from the topology file (--topology-file, see Topology files).

MacSyFinder therefore builds an index file (with the .idx suffix) and stores it in the same directory as the sequence dataset. If this file is already present in the same folder as the input dataset, MacSyFinder uses it; otherwise, it builds it.

The user can force MacSyFinder to rebuild these indexes with the “--idx” option on the command line.

database API reference

Indexes

class macsypy.database.Indexes(cfg)[source]

Handle the indexes for macsyfinder:

  • find the indexes required by macsyfinder to compute some scores, or build them.

__init__(cfg)[source]

The constructor retrieves the file of indexes. If the indexes are not present, or if the user asked to build them (--idx), it launches the index building.

Parameters:

cfg (macsypy.config.Config object) – the configuration

__iter__()[source]
Raises:

MacsypyError – if the indexes are not built

Returns:

an iterator on the indexes

To use it, the index must be built.

__weakref__

list of weak references to the object (if defined)

_build_my_indexes(index_dir)[source]

Build macsyfinder indexes. These indexes are stored in a file.

The file format is the following:

  • the first line is the path of the indexed sequence-db

  • then one entry per line, each line having this format: sequence id;sequence length;sequence rank
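The layout described above can be read back with a few lines of Python. This is a minimal sketch, not the macsypy implementation: `parse_index` is a hypothetical helper, and the path, ids, lengths and ranks in the example are made up for illustration.

```python
import io


def parse_index(handle):
    """Parse an index stream laid out as described above (hypothetical helper).

    The first line is the path of the indexed sequence-db; every following
    line is 'sequence id;sequence length;sequence rank'.
    """
    seq_db_path = handle.readline().strip()
    entries = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        seq_id, length, rank = line.split(";")
        entries.append((seq_id, int(length), int(rank)))
    return seq_db_path, entries


# Made-up index content, used only to exercise the parser.
example = io.StringIO(
    "/path/to/sequences.fasta\n"
    "seq_A;483;1\n"
    "seq_B;1104;2\n"
)
path, entries = parse_index(example)
```

The sketch assumes that sequence ids never contain a semicolon, since the semicolon is the field separator.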

_index_dir(build=False)[source]

Search where to store (build=True) or read the indexes.

Parameters:

build (bool) – if True, check that the index directory is writable

Returns:

The directory where the indexes are read or written

Return type:

str

Raises:

ValueError – if the directory specified by the --index-dir option does not exist, or if build = True and the index directory is not writable

build(force=False)[source]

Build the indexes from the sequence data set in fasta format.

Parameters:

force (boolean) – If True, force the index building even if the index files are present in the sequence data set folder

Returns:

the path to the index

Return type:

str

find_my_indexes()[source]
Returns:

the file of macsyfinder indexes if it exists in the dataset folder, None otherwise.

Return type:

string

RepliconInfo

Module to handle sequences and their indexes

class macsypy.database.RepliconInfo(topology, min, max, genes)

handle information about a replicon

topology

The type of replicon topology: ‘linear’ or ‘circular’

min

The position of the first gene of the replicon in the sequence dataset.

max

The position of the last gene of the replicon in the sequence dataset.

genes

A list of genes belonging to the replicon. Each gene is represented by a tuple (str seq_id, int length)

genes

Alias for field number 3

max

Alias for field number 2

min

Alias for field number 1

topology

Alias for field number 0
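Since RepliconInfo is a namedtuple with the fields listed above, each field can be read either by name or by position. The sketch below rebuilds an equivalent type with `collections.namedtuple` purely for illustration; the gene ids, lengths and positions are made-up values.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented RepliconInfo.
RepliconInfo = namedtuple("RepliconInfo", ("topology", "min", "max", "genes"))

# Made-up replicon with three genes at positions 1 to 3 in the dataset.
info = RepliconInfo(
    topology="circular",
    min=1,
    max=3,
    genes=[("gene_0001", 515), ("gene_0002", 367), ("gene_0003", 369)],
)

# Access by name and by index are equivalent ("Alias for field number N").
same_topology = info.topology == info[0]
same_genes = info.genes is info[3]
```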

RepliconDB

class macsypy.database.RepliconDB(cfg)[source]

Stores information (topology, min, max, [genes]) for all replicons in the sequence_db. The RepliconDB object must be instantiated only for a sequence_db of type ‘gembase’ or ‘ordered_replicon’.

__contains__(replicon_name)[source]
Parameters:

replicon_name (string) – the name of the replicon

Returns:

True if replicon_name is in the RepliconDB, False otherwise.

Return type:

boolean

__getitem__(replicon_name)[source]
Parameters:

replicon_name (string) – the name of the replicon to get information on

Returns:

the RepliconInfo for the provided replicon_name

Return type:

RepliconInfo object

Raises:

KeyError – if replicon_name is not in the RepliconDB

__init__(cfg)[source]
Parameters:

cfg (macsypy.config.Config object) – The configuration object

Note

This class can be instantiated only if the db_type is ‘gembase’ or ‘ordered_replicon’

__weakref__

list of weak references to the object (if defined)

_fill_gembase_min_max(topology, default_topology)[source]

For each replicon_name of a gembase dataset, it fills the internal dictionary with a namedtuple RepliconInfo

Parameters:
  • topology (dict) – the topologies for each replicon (parsed from the file specified with the option --topology-file)

  • default_topology (string) – the topology provided by the config.replicon_topology

_fill_ordered_min_max(default_topology=None)[source]

For the replicon_name of the ordered_replicon sequence base, fill the internal dict with RepliconInfo

Parameters:

default_topology (string) – the topology provided by config.replicon_topology

_fill_topology()[source]

Fill the internal dictionary with min and max positions for each replicon_name of the sequence_db

get(replicon_name, default=None)[source]
Parameters:
  • replicon_name (string) – the name of the replicon to get information on

  • default (any) – the value to return if the replicon_name is not in the RepliconDB

Returns:

the RepliconInfo for replicon_name if replicon_name is in the RepliconDB, else default. If default is not given, it defaults to None, so this method never raises a KeyError.

Return type:

RepliconInfo object
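The difference between `__getitem__` (raises KeyError) and `get` (returns a default) mirrors the behaviour of a Python dict. The sketch below illustrates it with `ToyRepliconDB`, a hypothetical stand-in backed by a plain dict; it is not the macsypy class, which is built from a Config object, and the replicon names and gene data are made up.

```python
from collections import namedtuple

RepliconInfo = namedtuple("RepliconInfo", ("topology", "min", "max", "genes"))


class ToyRepliconDB:
    """Hypothetical dict-backed stand-in mimicking the documented accessors."""

    def __init__(self, replicons):
        self._db = dict(replicons)

    def __contains__(self, replicon_name):
        return replicon_name in self._db

    def __getitem__(self, replicon_name):
        # Raises KeyError for an unknown replicon, as documented.
        return self._db[replicon_name]

    def get(self, replicon_name, default=None):
        # Never raises: falls back to `default` (None when omitted).
        return self._db.get(replicon_name, default)


db = ToyRepliconDB(
    {"rep_1": RepliconInfo("circular", 1, 2, [("rep_1_0001", 120), ("rep_1_0002", 95)])}
)

found = db["rep_1"]
missing = db.get("rep_2")
try:
    db["rep_2"]
    raised = False
except KeyError:
    raised = True
```

Using `get` is convenient when a missing replicon is an expected case; indexing is preferable when a missing replicon should stop the analysis.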

guess_if_really_gembase()[source]

Count the number of replicons that contain only one sequence. If this number is above a threshold, the dataset is probably not a gembase. For instance, the following sequences have ids compliant with the gembase id syntax, but the dataset is not a gembase: it contains only one replicon (‘ordered replicon’).

>1E10S0A0cP00_0010 D GTG TGA 483 2027 Valid dnaA 1545 _PA0001_NP_064721.1_ PA0001 1 483 2027
MSVELWQQCVDLLRDELPSQQFNTWIRPLQVEAEGDELRVYAPNRFVLDW
>0200S001A0c_0P1E0 D ATG TAA 2056 3159 Valid dnaN 1104 _PA0002_NP_064722.1_ PA0002 1 2056 3159
MHFTIQREALLKPLQLVAGVVERRQTLPVLSNVLLVVEGQQLSLTGTDLE
>0000310E00S0c_1PA D ATG TGA 3169 4278 Valid recF 1110 _PA0003_NP_064723.1_ PA0003 1 3169 4278
MSLTRVSVTAVRNLHPVTLSPSPRINILYGDNGSGKTSVLEAIHLLGLAR
>c_01000A0PS00014E D ATG TGA 4275 6695 Valid gyrB 2421 _PA0004_NP_064724.1_ PA0004 1 4275 6695
MSENNTYDSSSIKVLKGLDAVRKRPGMYIGDTDDGTGLHHMVFEVVDNSI
>07700ES100A0cP01_ C ATG TGA 91521 94826 Valid icmF1 3306 _PA0077_NP_248767.1_ PA0077 1 91521 94826
MQSLAEVSAPDAASVAT
Returns:

False if most of the replicons contain only one sequence, True otherwise

Return type:

bool
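The heuristic can be sketched as follows. This is an illustration under stated assumptions, not the macsypy code: `looks_like_gembase` and its `max_single_ratio` threshold are hypothetical, and the replicon name is assumed to be the part of the sequence id before the last underscore.

```python
from collections import Counter


def looks_like_gembase(seq_ids, max_single_ratio=0.5):
    """Return False when most replicons contain a single sequence.

    Hypothetical helper: the threshold value and the way the replicon
    name is derived from the id are assumptions made for this sketch.
    """
    # Assumed convention: replicon name = id up to the last underscore.
    counts = Counter(seq_id.rsplit("_", 1)[0] for seq_id in seq_ids)
    if not counts:
        return False
    single = sum(1 for n in counts.values() if n == 1)
    return single / len(counts) <= max_single_ratio


# Ids taken from the example above: every derived "replicon" holds one sequence.
misleading_ids = [
    "1E10S0A0cP00_0010",
    "0200S001A0c_0P1E0",
    "0000310E00S0c_1PA",
    "c_01000A0PS00014E",
    "07700ES100A0cP01_",
]
# Made-up ids sharing one replicon name, as a gembase-like dataset would.
grouped_ids = ["REPL001c01_00010", "REPL001c01_00020", "REPL001c01_00030"]

misleading_result = looks_like_gembase(misleading_ids)
grouped_result = looks_like_gembase(grouped_ids)
```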

items()[source]
Returns:

a copy of the RepliconDB as a list of (replicon_name, RepliconInfo) pairs

iteritems()[source]
Returns:

an iterator over the (replicon_name, RepliconInfo) pairs of the RepliconDB

replicon_infos()[source]
Returns:

a copy of the RepliconDB as a list of RepliconInfo

Return type:

list of RepliconInfo instances

replicon_names()[source]
Returns:

a copy of the RepliconDB as a list of replicon_names

fasta_iter

macsypy.database.fasta_iter(fasta_file)[source]
Parameters:

fasta_file (file object) – the file containing all input sequences in fasta format.

Author:

http://biostar.stackexchange.com/users/36/brentp

Returns:

for a given fasta file, it returns an iterator which yields tuples (string id, string comment, int sequence length)

Return type:

iterator
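An iterator with this contract can be sketched with `itertools.groupby`, which alternates between header lines and sequence lines. The function below, `fasta_iter_sketch`, is a hypothetical re-creation for illustration and may differ from the macsypy code. It assumes the file starts with a header line and that every header is followed by at least one sequence line.

```python
import io
from itertools import groupby


def fasta_iter_sketch(fasta_file):
    """Yield (id, comment, sequence length) for each record (illustrative only)."""
    # Groups alternate: header line(s), then the sequence lines that follow.
    groups = (group for _, group in groupby(fasta_file, lambda line: line.startswith(">")))
    for header_group in groups:
        header = next(header_group)[1:].strip()
        # The id is the first word of the header; the rest is the comment.
        seq_id, _, comment = header.partition(" ")
        seq_lines = next(groups, ())
        length = sum(len(line.strip()) for line in seq_lines)
        yield seq_id, comment, length


# Small made-up input: two records, the first spread over two sequence lines.
fasta = io.StringIO(
    ">1E10S0A0cP00_0010 D GTG TGA 483\n"
    "MSVELWQQCV\n"
    "DLLRD\n"
    ">seq_2 some comment\n"
    "MQ\n"
)
records = list(fasta_iter_sketch(fasta))
```

Only the length of each sequence is accumulated, so memory use stays small even for large datasets.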