search_systems API reference

class macsypy.search_systems.Cluster(systems_to_detect)[source]

Stores a set of contiguous hits. A Cluster object can be in one of several states, depending on which systems the genes of its hits belong to:

  • ineligible: not a cluster to analyze
  • clear: a single system is represented in the cluster
  • ambiguous: several systems are represented in the cluster, so it might need disambiguation
__init__(systems_to_detect)[source]
Parameters:systems_to_detect (a list of macsypy.system.System) – the list of systems to be detected in this run
__len__()[source]
Returns:the length of the Cluster, i.e., the number of hits stored in it
Return type:integer
__str__()[source]

Returns a printable representation of the hits stored in the Cluster, in terms of components and the corresponding sequence identifiers and positions.

__weakref__

list of weak references to the object (if defined)

add(hit)[source]

Add a Hit to a Cluster. Hits are always added at the end of the cluster (appended to the list of hits). Thus, ‘begin’ and ‘end’ positions of the Cluster are always the position of the 1st and of the last hit respectively.

Parameters:hit (a macsypy.report.Hit) – the Hit to add
Raise:a macsypy.macsypy_error.SystemDetectionError
compatible_systems
Returns:the list of the names of compatible systems represented by the cluster
Return type:list of string
putative_system
Returns:the name of the putative system represented by the cluster
Return type:string
save(force=False)[source]

Check the status of the cluster regarding systems which have hits in it. Update systems represented, and assign a putative system (self._putative_system), which is the system with most hits in the cluster. The systems represented are stored in a dictionary in the self.systems variable. The execution of this function can be forced, even if it has already run for the cluster with the option force=True.
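
The bookkeeping described above can be sketched as follows. This is a minimal illustration, not macsypy's implementation: the function name `summarize_cluster`, the input shape (one system name per hit), and the rule that a cluster without hits is "ineligible" are all assumptions.

```python
from collections import Counter


def summarize_cluster(hit_system_names):
    """Return (systems, putative_system, state) for a list of system names.

    hit_system_names holds one system name per hit (hypothetical input shape).
    """
    systems = Counter(hit_system_names)  # system name -> number of hits
    if not systems:
        # Assumption: a cluster without usable hits is not analyzed.
        return {}, None, "ineligible"
    # The putative system is the one with the most hits in the cluster.
    putative_system, _ = systems.most_common(1)[0]
    state = "clear" if len(systems) == 1 else "ambiguous"
    return dict(systems), putative_system, state


# Example system names are placeholders.
systems, putative, state = summarize_cluster(["T2SS", "T2SS", "T4P"])
print(systems, putative, state)
```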

state
Returns:the state of the Cluster of hits
Return type:string
class macsypy.search_systems.ClustersHandler[source]

Deals with sets of clusters found in a dataset. Designed to store only clusters from the same replicon.

__init__()[source]
Parameters:cfg (macsypy.config.Config) – The configuration object built from default and user parameters.
__weakref__

list of weak references to the object (if defined)

circularize(rep_info, end_hits, systems_to_detect)[source]

This function takes into account the circularity of the replicon by merging clusters when appropriate (typically at replicon’s ends). It has to be called only if the replicon_topology is set to “circular”.

Parameters:
  • end_hits (a list of macsypy.report.Hit) – hits that might be part of a system overlapping the two “ends” of the replicon
  • systems_to_detect (a list of macsypy.system.System) – the set of systems to detect in this run

class macsypy.search_systems.SystemNameGenerator[source]

Creates and stores the names of detected systems. Ensures the uniqueness of the names.

__weakref__

list of weak references to the object (if defined)

_computeBasename(replicon, system)[source]

Computes the base name to be used for unique name generation

Parameters:
  • replicon (string) – the replicon name
  • system (string) – the system name
Returns:

the base name

Return type:

string

getSystemName(replicon, system)[source]

Generates a unique system name based on the replicon’s name and the system’s name.

Parameters:
  • replicon (string) – the replicon name
  • system (string) – the system name
Returns:

a unique system name

Return type:

string
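
One way to guarantee unique names is to keep a counter per base name. The sketch below is hypothetical: the class name `NameGeneratorSketch`, the "<replicon>_<system>_<n>" format, and the counter-based scheme are assumptions made for illustration, not the format macsypy actually produces.

```python
from collections import defaultdict


class NameGeneratorSketch:
    """Hypothetical stand-in for SystemNameGenerator."""

    def __init__(self):
        # base name -> number of names already issued for it
        self._counts = defaultdict(int)

    def _compute_basename(self, replicon, system):
        # Assumed format: "<replicon>_<system>_".
        return f"{replicon}_{system}_"

    def get_system_name(self, replicon, system):
        basename = self._compute_basename(replicon, system)
        self._counts[basename] += 1
        # Appending the running count keeps names with the same base distinct.
        return f"{basename}{self._counts[basename]}"


gen = NameGeneratorSketch()
names = [gen.get_system_name("repA", "T2SS") for _ in range(2)]
print(names)
```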

class macsypy.search_systems.SystemOccurence(system)[source]

This class is instantiated for a specific system whose detection has been requested. It can be filled step by step with hits. A decision can then be made according to the parameters defined for the system, e.g. the quorum of genes.

The SystemOccurence object has a “state” parameter, with the following possible values:
  • “empty” if the SystemOccurence has not yet been filled with genes of the decision rule of the system
  • “no_decision” if the filling process has started but the decision rule has not yet been applied to this occurrence
  • “single_locus”
  • “multi_loci”
  • “uncomplete”
__init__(system)[source]
Parameters:system (macsypy.system.System) – the system to “fill” with hits.
__str__()[source]
Returns:Information of the component content of the SystemOccurence.
Return type:string
__weakref__

list of weak references to the object (if defined)

compute_missing_genes_list(gene_dict)[source]
Parameters:gene_dict (dict) – a dictionary with gene’s names as keys and number of occurrences as values
Returns:the list of genes with no occurrence in the gene counter.
Return type:list
compute_system_length(rep_info)[source]

Returns the length of the system, all loci gathered, in terms of protein number (even those not matching any system gene)

Parameters:rep_info (a namedTuple “RepliconInfo” macsypy.database.RepliconInfo) – an entry extracted from the macsypy.database.RepliconDB
Return type:integer
count_genes(gene_dict)[source]

Counts the number of genes with at least one occurrence in a dictionary with a counter of genes.

Parameters:gene_dict (dict) – a dictionary with gene’s names as keys and number of occurrences as values
Return type:integer
count_genes_tot(gene_dict)[source]

Counts the number of matches in a dictionary with a counter of genes, independently of the number of genes matched.

Parameters:gene_dict (dict) – a dictionary with gene’s names as keys and number of occurrences as values
Return type:integer
count_missing_genes(gene_dict)[source]

Counts the number of genes with no occurrence in the gene counter.

Parameters:gene_dict (dict) – a dictionary with gene’s names as keys and number of occurrences as values
Return type:integer
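
The four counting helpers above all operate on the same kind of dictionary (gene name as key, number of occurrences as value). A minimal standalone sketch of what they compute, written as plain functions rather than methods; the gene names in the example are placeholders:

```python
def count_genes(gene_dict):
    """Number of genes with at least one occurrence."""
    return sum(1 for n in gene_dict.values() if n > 0)


def count_genes_tot(gene_dict):
    """Total number of matches, whatever the number of distinct genes."""
    return sum(gene_dict.values())


def compute_missing_genes_list(gene_dict):
    """Names of the genes with no occurrence."""
    return [gene for gene, n in gene_dict.items() if n == 0]


def count_missing_genes(gene_dict):
    """Number of genes with no occurrence."""
    return len(compute_missing_genes_list(gene_dict))


gene_counter = {"gspD": 2, "gspE": 1, "gspF": 0}
print(count_genes(gene_counter), count_genes_tot(gene_counter),
      compute_missing_genes_list(gene_counter))
```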
decision_rule()[source]
This function applies the decision rules for system assessment in terms of quorum:
  • the absence of forbidden genes is checked
  • the minimal number of mandatory genes is checked (“min_mandatory_genes_required”)
  • the minimal number of genes in the system is checked (“min_genes_required”)

When a decision is made, the status (self.status) of the macsypy.search_systems.SystemOccurence is set either to:

  • “single_locus” when a complete system in the form of a single cluster was found
  • “multi_loci” when a complete system in the form of several clusters was found
  • “uncomplete” when no system was assessed (quorum not reached)
  • “empty” when no gene for this system was found
  • “exclude” when no system was assessed (at least one forbidden gene was found)
Returns:a printable message of the output decision with this SystemOccurrence
Return type:string
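
The quorum checks listed above can be sketched as a standalone function. This is an illustration under stated assumptions: the function name `decide`, its signature, the order in which the checks are applied, and the use of the number of clusters to tell “single_locus” from “multi_loci” are not taken from the macsypy source.

```python
def decide(mandatory, accessory, forbidden,
           min_mandatory_genes_required, min_genes_required, nb_clusters):
    """Apply the quorum checks to three gene counters (name -> occurrences)."""
    nb_mandatory = sum(1 for n in mandatory.values() if n > 0)
    nb_accessory = sum(1 for n in accessory.values() if n > 0)
    nb_forbidden = sum(1 for n in forbidden.values() if n > 0)

    if nb_mandatory + nb_accessory + nb_forbidden == 0:
        return "empty"        # no gene of this system was found
    if nb_forbidden > 0:
        return "exclude"      # at least one forbidden gene was found
    if (nb_mandatory < min_mandatory_genes_required
            or nb_mandatory + nb_accessory < min_genes_required):
        return "uncomplete"   # quorum not reached
    return "single_locus" if nb_clusters == 1 else "multi_loci"


print(decide({"a": 1, "b": 1}, {"c": 0}, {"x": 0}, 2, 2, nb_clusters=1))
```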
fill_with_cluster(cluster)[source]

Adds hits from a cluster to a system occurrence and checks their status according to the system definition. The state of the system occurrence is set to “no_decision” after this function is called.

Parameters:cluster (macsypy.search_systems.Cluster) – the set of contiguous genes to treat for macsypy.search_systems.SystemOccurence inclusion.
fill_with_hits(hits, include_forbidden)[source]

Adds hits to a system occurrence and checks their status according to the system definition. The state of the system occurrence is set to “no_decision” after this function is called.

Note

Forbidden genes will only be included if they do belong to the current system (and not to another specified with “system_ref” in the current system’s definition).

Parameters:hits – a list of Hits to treat for macsypy.search_systems.SystemOccurence inclusion.
fill_with_multi_systems_genes(multi_systems_hits)[source]

This function fills the SystemOccurrence with genes putatively coming from other systems (feature “multi_system”). Those genes are used only if the occurrence of the corresponding gene was not yet filled with a gene from a cluster of the system.

Parameters:multi_systems_hits – a list of hits of genes that are “multi_system” which correspond to mandatory or accessory genes from the current system for which to fill a SystemOccurrence
get_gene_counter_output(forbid_exclude=False)[source]
Parameters:forbid_exclude (boolean) – exclude the forbidden components if set to True. False by default.
Returns:A dictionary ready for printing in the system summary, with the occurrences of genes (mandatory, accessory, and forbidden if specified) in the system occurrence.
get_gene_ref(gene)[source]
Parameters:gene (macsypy.gene.Gene, or macsypy.gene.Homolog or macsypy.gene.Analog object) – the gene whose reference gene is to be retrieved
Returns:object macsypy.gene.Gene or None
Return type:macsypy.gene.Gene object or None
Raise:KeyError if the system does not contain the gene passed as gene.
get_summary(replicon_name, rep_info)[source]

Gives a summary of the system occurrence in terms of gene content and localization.

Parameters:
Returns:

a tabulated summary of the macsypy.search_systems.SystemOccurence

Return type:

string

get_summary_header()[source]

Returns a string with the description of the summary returned by self.get_summary()

Return type:string
get_summary_unordered(replicon_name)[source]

Gives a summary of the system occurrence in terms of gene content only (specific of “unordered” datasets).

Parameters:replicon_name (string) – the name of the replicon
Returns:a tabulated summary of the macsypy.search_systems.SystemOccurence
Return type:string
get_system_name_unordered(suffix='_putative')[source]
Assigns a name to the system occurrence for an “unordered” dataset, generating a generic name based on the system name and the given suffix.
Parameters:suffix (string) – the suffix to be used for generating the systemOccurrence’s name
Returns:a name for the macsypy.search_systems.SystemOccurence, as a system in an “unordered” dataset
Return type:string
get_system_unique_name(replicon_name)[source]

Assigns a unique name to the system occurrence using the class macsypy.search_systems.SystemNameGenerator. The name is generated if it is not already set.

Parameters:replicon_name (string) – the name of the replicon
Returns:the unique name of the macsypy.search_systems.SystemOccurence
Return type:string
is_complete()[source]

Test for SystemOccurrence completeness.

Returns:True if the state of the SystemOccurrence is “single_locus” or “multi_loci”, False otherwise.
Return type:boolean
nb_syst_genes

This value is set after a decision was made on the system in macsypy.search_systems.SystemOccurence:decision_rule()

Returns:the number of mandatory and accessory genes with at least one occurrence (number of different accessory genes)
Return type:integer

state
Returns:the state of the systemOccurrence.
Return type:string
macsypy.search_systems.analyze_clusters_replicon(clusters, systems, multi_systems_genes)[source]

Analyzes sets of contiguous hits (clusters) stored in a ClustersHandler for system detection:

  • split clusters if needed
  • delete them if they are not relevant
  • add possible genes from other systems (“multi_system” genes)
  • check the QUORUM for each system to detect, i.e. mandatory + accessory - forbidden

Only for “ordered” datasets representing a whole replicon. Reports system occurrences.

Parameters:
Returns:

a set of system occurrences filled with hits found in clusters

Return type:

a list of macsypy.search_systems.SystemOccurence

macsypy.search_systems.build_clusters(hits, systems_to_detect, rep_info)[source]

Gets sets of contiguous hits according to the minimal inter_gene_max_space between two genes. Only for “ordered” datasets.

Parameters:
Returns:

a set of clusters and a dictionary with “multi_system” genes stored in a system-wise way for further utilization.

Return type:

macsypy.search_systems.ClustersHandler
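
The idea of grouping contiguous hits can be sketched on bare gene positions. This is a simplified illustration, not the macsypy algorithm: it assumes a single spacing threshold for all genes, measures the gap as the number of genes lying between two consecutive hits, and uses the hypothetical name `group_contiguous`.

```python
def group_contiguous(positions, inter_gene_max_space):
    """Group gene positions into clusters of neighbouring hits."""
    clusters = []
    for pos in sorted(positions):
        # Number of genes lying between this hit and the previous one.
        if clusters and pos - clusters[-1][-1] - 1 <= inter_gene_max_space:
            clusters[-1].append(pos)   # close enough: extend the cluster
        else:
            clusters.append([pos])     # too far: start a new cluster
    return clusters


clusters = group_contiguous([3, 4, 7, 20, 22], inter_gene_max_space=2)
print(clusters)
```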

macsypy.search_systems.disambiguate_cluster(cluster)[source]

This disambiguation step is used on clusters with hits for multiple systems (when cluster.state is set to “ambiguous”). It returns a “cleansed” list of clusters, ready to use for system occurrence detection (and that are “clear” cases). It:

  • splits the cluster in two if it seems that two systems are nearby
  • removes single hits that are not forbidden for the “main” system and that are at one end of the current cluster; in this case, it checks that they are not “loners”, because “loners” can be stored.
Parameters:cluster (macsypy.search_systems.Cluster) – the cluster to “disambiguate”
macsypy.search_systems.get_best_hits(hits, tosort=False, criterion='score')[source]

Returns a list of best-matching hits from a putatively redundant list of hits. Analyzes quorum and co-localization if required for system detection. By default, hits are already sorted by position, and the hit with the best score is kept, then the best i-evalue. Possible criteria are:

  • maximal score (criterion="score")
  • minimal i-evalue (criterion="i_eval")
  • maximal percentage of the profile covered by the alignment with the query sequence (criterion="profile_coverage")
Parameters:
  • tosort (boolean) – tells if the hits have to be sorted
  • criterion (string) – the criterion to base the sorting on
Returns:

the list of best matching hits

Return type:

list of macsypy.report.Hit

Raise:

a macsypy.macsypy_error.MacsypyError
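
A minimal sketch of selecting one hit per position under each criterion. The names `HitSketch` and `best_hits` are hypothetical, the hit is reduced to a small namedtuple, and a plain ValueError stands in for macsypy.macsypy_error.MacsypyError so that the example runs without macsypy installed.

```python
from collections import namedtuple

# Hypothetical, reduced stand-in for macsypy.report.Hit.
HitSketch = namedtuple("HitSketch", "gene position score i_eval profile_coverage")


def best_hits(hits, criterion="score"):
    """Keep one hit per position, chosen according to `criterion`."""
    if criterion == "score":
        key = lambda h: h.score              # higher is better
    elif criterion == "i_eval":
        key = lambda h: -h.i_eval            # lower is better
    elif criterion == "profile_coverage":
        key = lambda h: h.profile_coverage   # higher is better
    else:
        raise ValueError(f"unknown criterion: {criterion!r}")

    best = {}
    for hit in hits:
        current = best.get(hit.position)
        if current is None or key(hit) > key(current):
            best[hit.position] = hit
    # Return the retained hits ordered by position.
    return [best[pos] for pos in sorted(best)]


hits = [
    HitSketch("gspD", 10, 50.0, 1e-10, 0.9),
    HitSketch("pilQ", 10, 80.0, 1e-5, 0.7),
    HitSketch("gspE", 11, 30.0, 1e-3, 0.8),
]
print([h.gene for h in best_hits(hits)])
```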

macsypy.search_systems.get_compatible_systems(systems_list1, systems_list2)[source]

Returns the intersection of the two input systems lists.

Parameters:systems_list1, systems_list2 (lists of macsypy.system.System) – two lists of systems
Returns:a list of systems, or an empty list if no common system
Return type:a list of macsypy.system.System
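
The intersection can be sketched on plain lists. The function name `compatible` is hypothetical, and keeping the order of the first list is an assumption; the entry above does not say which order the result follows.

```python
def compatible(systems_list1, systems_list2):
    """Intersection of two lists, keeping the order of the first list."""
    wanted = set(systems_list2)
    return [s for s in systems_list1 if s in wanted]


# System names are placeholders standing in for System objects.
common = compatible(["T2SS", "T4P", "Tad"], ["Tad", "T2SS"])
print(common)
```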
macsypy.search_systems.search_systems(hits, systems, cfg)[source]

Runs search of systems from a set of hits. Criteria for system assessment will depend on the kind of input dataset provided:

  • analyze quorum and co-localization for “ordered_replicon” and “gembase” datasets.
  • analyze quorum only (and in a limited way) for “unordered_replicon” and “unordered” datasets.
Parameters:
class macsypy.search_systems.systemDetectionReportOrdered(replicon_name, systems_occurrences_list, cfg)[source]
Stores the detected systems to report for each replicon:
  • by system name,
  • by state of the systems (single vs multi loci)
__init__(replicon_name, systems_occurrences_list, cfg)[source]
Parameters:
_match2json(valid_hit, so)[source]
Parameters:
  • valid_hit (macsypy.search_system.ValidHit object) – the valid hit to transform into json.
  • so (macsypy.search_system.SystemOccurence) – the system occurrence the valid hit comes from.
counter_output()[source]

Builds a counter of systems per replicon, with different “states” separated (single-locus vs multi-loci systems)

Returns:the counter of systems
Return type:Counter
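
A counter that separates systems by state can be sketched with collections.Counter. The function name `count_systems`, the (system name, state) input pairs, and the "<system>_<state>" key format are assumptions for illustration only.

```python
from collections import Counter


def count_systems(occurrences):
    """Count systems per state.

    occurrences: iterable of (system_name, state) pairs (hypothetical shape).
    """
    return Counter(f"{name}_{state}" for name, state in occurrences)


counts = count_systems([
    ("T2SS", "single_locus"),
    ("T2SS", "single_locus"),
    ("T4P", "multi_loci"),
])
print(counts)
```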
json_output(json_path, json_data)[source]
report_output(reportfilename, print_header=False)[source]

Writes a report of the sequences forming the detected systems, with information on their status in the system, their localization on replicons, and statistics on the Hits.

Parameters:
  • reportfilename (string) – the output file name
  • print_header (boolean) – True if the header has to be written. False otherwise
summary_output(reportfilename, rep_info, print_header=False)[source]

Writes a report with the summary of systems detected in replicons. For each system, a summary is produced, including:

  • the number of mandatory/accessory genes in the reference system (as defined in XML files)
  • the number of mandatory/accessory genes detected
  • the number and list of missing genes
  • the number of loci encoding the system
Parameters:
system_2_json(rep_db)[source]

Generates the report in json format

Parameters:
  • path (string) – the path to a file where to write the report in json format
  • rep_db (a macsypy.database.RepliconDB object) – the replicon database
tabulated_output(system_occurrence_states, system_names, reportfilename, print_header=False)[source]

Write a tabulated output with number of detected systems for each replicon.

Parameters:
  • system_occurrence_states (list of string) – the different forms of detected systems to consider
  • reportfilename (string) – the output file name
  • print_header (boolean) – True if the header has to be written. False otherwise
Return type:

string

tabulated_output_header(system_occurrence_states, system_names)[source]

Returns a string containing the header of the tabulated output

Parameters:system_occurrence_states (list of string) – the different forms of detected systems to consider
Return type:string
class macsypy.search_systems.systemDetectionReportUnordered(systems_occurrences_list, cfg)[source]
Stores a report for putative detected systems gathering all hits from a search in an unordered dataset:
  • by system.

Only mandatory and accessory genes are reported in the “json” and “report” outputs, but all hits matching a system component are reported in the “summary”.

__init__(systems_occurrences_list, cfg)[source]
Parameters:systems_occurrences_list (list of macsypy.search_systems.SystemOccurence) – the list of system’s occurrences to consider
json_output(json_path)[source]

Generates the report in json format

Parameters:path (string) – the path to a file where to write the report in json format
report_output(reportfilename, print_header=False)[source]

Writes a report of the sequences forming the detected systems, with information on their status in the system, their localization on replicons, and statistics on the Hits.

Parameters:
  • reportfilename (string) – the output file name
  • print_header (boolean) – True if the header has to be written. False otherwise
summary_output(reportfilename, print_header=False)[source]

Writes a report with the summary for putative systems in an unordered dataset. For each system, a summary is produced, including:

  • the number of mandatory/accessory genes in the reference system (as defined in XML files)
  • the number of mandatory/accessory genes detected
Parameters:
  • reportfilename (string) – the output file name
  • print_header (boolean) – True if the header has to be written. False otherwise
class macsypy.search_systems.validSystemHit(hit, detected_system, gene_status)[source]

Encapsulates a macsypy.report.Hit. This class stores a Hit that has been attributed to a detected system. Thus, it also stores:

  • the system,
  • the status of the gene in this system,

It also aims at storing information for results extraction:

  • system extraction (e.g. genomic positions)
  • sequence extraction
__init__(hit, detected_system, gene_status)[source]
Parameters:
  • hit (macsypy.report.Hit) – a hit to base the validSystemHit on
  • detected_system (string) – the name of the predicted System
  • gene_status (string) – the “role” of the gene in the predicted system
__weakref__

list of weak references to the object (if defined)

output_system_header()[source]
Returns:the header for the output file
Return type:string