MacSyFinder relies on models of macromolecular systems. Each model is defined as a set of components to be searched by similarity search, together with a set of rules on their genomic organization and on the requirements for a complete system (mandatory components, accessory components, number of components required).
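A MacSyFinder model (macsy-model for short) is the association of several elements: an XML definition describing the system, and the HMM profiles of the genes this definition refers to.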
The models are grouped by family, which may gather sub-families (multiple levels are allowed), for instance Secretion, Cas-proteins…
A coherent set of models from the same family of systems to detect is hereafter called a macsy-model package.
NEW in V2.
Structure of a macsy-model package
A macsy-model package has the following structure:
    family_name
        |_______ metadata.yml
        |_______ LICENCE
        |_______ README.md
        |_______ definitions
        |            |________ model_1.xml
        |            |________ model_2.xml
        :
        |
        |_______ profiles
                     |________ geneA.hmm
                     |________ geneB.hmm
If the package contains sub-families:
    family_name
        |_______ metadata.yml
        |_______ LICENCE
        |_______ README.md
        |_______ definitions
        |            |________ subfamilyA
        |            |            |________ model_1.xml
        |            |            |________ model_2.xml
        |            |
        |            |________ subfamilyB
        |            |            |________ model_3.xml
        |            |            |________ model_4.xml
        |            |
        :
        |
        |_______ profiles
                     |________ geneA.hmm
                     |________ geneB.hmm
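The layout above can be checked mechanically before sharing a package. The following is a minimal sketch, not part of MacSyFinder: the function name `check_package_layout` and the wording of its messages are illustrative assumptions, and it only tests that the files and directories shown above are present.

```python
from pathlib import Path


def check_package_layout(package_dir):
    """Return a list of problems found in a macsy-model package directory.

    Hypothetical helper: it only checks that the files and directories
    shown in the package structure are present.
    """
    root = Path(package_dir)
    problems = []
    # Files expected at the top level of the package.
    for name in ("metadata.yml", "LICENCE", "README.md"):
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    # Definitions may sit in nested sub-family directories: search recursively.
    definitions = root / "definitions"
    if not definitions.is_dir():
        problems.append("missing directory: definitions")
    elif not any(definitions.rglob("*.xml")):
        problems.append("no XML definition found in definitions")
    # Profiles are expected directly inside the profiles directory.
    profiles = root / "profiles"
    if not profiles.is_dir():
        problems.append("missing directory: profiles")
    elif not any(profiles.glob("*.hmm")):
        problems.append("no HMM profile found in profiles")
    return problems
```

An empty list means the directory has the expected shape; it says nothing about whether the definitions or profiles themselves are valid.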
For examples of macsy-model packages, please visit https://github.com/macsy-models
Principles, and how to write macsy-models definitions
Macsy-models are written as XML files, and should be named with the name of the system to detect as a prefix, and the XML file extension as a suffix. For example, ‘T1SS.xml’ for T1SS (Type I Secretion System).
A macsy-model defines a macromolecular System as:
- A set of components (i.e. proteins, or protein-coding genes given the context) with different attributes that are used for system’s content description.
- Features regarding the genomic architecture of the systems’ components for system detection.
- Rules for quorum specifying how many components are required to infer the presence of a complete system.
Four distinct types of components can be used to model the System’s content. Components correspond to Gene objects in MacSyFinder’s implementation, and point to corresponding HMM protein profiles.
- mandatory components are essential: they must be found to infer the system’s presence.
- accessory components can be found in some occurrences of the system (or are quickly evolving components that are hard to detect with a single HMM profile and can thus be missed during similarity search).
- neutral components are used to build/extend clusters of proximal genes/components on the replicon analysed, but are not part of the quorum (i.e., not taken into account to assess the system’s presence).
NEW in V2
- forbidden components are components whose presence rules out the system.
Specifying a genomic organization
Beyond its list of Components, a MacSyFinder model of a System is defined by the genomic organization of its components. This genomic organization can be defined in several ways:
- the general System’s architecture, whether it is single-locus or multi-loci (encoded at one or several loci)
- the co-localization criteria defined either at the System level or at the Gene (component) level:
  - the inter-gene-max-space parameter (system- or gene-wise)
  - the loner parameter (gene-wise)
See below for more details on how to specify these parameters in a macsy-model.
The XML hierarchy
A System’s model is defined using a specific XML grammar, described here. It consists of a hierarchical view of a Model that has specific features described through parameters, and is made of a set of Genes that have specific features themselves. All these elements and their parameters determine how MacSyFinder searches for matching Systems, in terms of Gene content and genomic architecture criteria.
The root element of a System’s model is “model”.
It has a mandatory attribute: “inter_gene_max_space”, an integer representing the maximal number of components without a match between two components with a match for a component profile in order to consider them contiguous (part of the same Cluster).
It also carries the version of the XML grammar (the current version is “2.0”).
The element “model” may have attributes:
- min_mandatory_genes_required: an integer representing the minimal number of mandatory genes required to infer the system’s presence.
- min_genes_required: an integer representing the minimal number of mandatory or accessory genes (whose corresponding proteins match a profile of the model) required to infer the system’s presence.
- multi_loci: a boolean set to True (“1”, “true” or “True”) to allow the definition of “scattered” systems (i.e., systems encoded at different genomic loci or by different gene clusters). If not specified, default value is false.
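The co-localization rule controlled by “inter_gene_max_space” can be illustrated with a small sketch. This is a simplification written for this page, not MacSyFinder’s implementation: the function name `group_into_clusters` is hypothetical, genes are reduced to their rank on the replicon, and loner genes, gene-wise values and circular replicons are ignored.

```python
def group_into_clusters(hit_positions, inter_gene_max_space):
    """Group the positions of genes with a profile match into clusters.

    Two consecutive hits are considered contiguous when at most
    `inter_gene_max_space` genes without a match lie between them.
    """
    clusters = []
    for position in sorted(hit_positions):
        # Number of genes strictly between the previous hit and this one.
        if clusters and position - clusters[-1][-1] - 1 <= inter_gene_max_space:
            clusters[-1].append(position)
        else:
            clusters.append([position])
    return clusters


# Hits at ranks 5 and 11 are separated by 5 genes without a match,
# which is still accepted with inter_gene_max_space=5.
print(group_into_clusters([3, 5, 11, 12, 30], inter_gene_max_space=5))
# → [[3, 5, 11, 12], [30]]
```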
The model contains one or more element(s) “gene” that correspond(s) to the genetic components of the macromolecular system.
The element “gene” has several mandatory attributes:
- name: a string representing the name of the component/gene which must match that of a profile enclosed in the profile directory of the macsy-model package (see below).
- presence: a string representing the status of the gene’s presence in the system. It takes one of four values: “mandatory”, “accessory”, “neutral” or “forbidden” (see above).
The element “gene” may have other attributes:
- loner: a boolean. A loner gene can be isolated on the genome and does not have to be part of a cluster of genes to be considered for the system’s assessment (default false).
- multi_system: a boolean. If a gene has the feature “multi_system” (value set to “1”, “true” or “True”), it can be used to fill multiple occurrences of a system, and thus be considered part of several systems (default false).
- inter_gene_max_space: an integer that defines gene-wise value of system’s “inter_gene_max_space” parameter (see above). It supersedes the system-wise parameter to give the gene a specific co-localization parameter.
The element “gene” may have one “exchangeables” child element:
- The element “exchangeables” can contain one or more elements “gene”.
When a Gene lists “exchangeables” Genes, it can be replaced in the quorum by any of the listed child Genes.
If not specified by the user, several features will have their values assigned by default:
- the genomic architecture of the System being searched will consist of a single locus. If a System may be made of Genes from multiple loci, consider setting the multi_loci parameter to True.
- the quorum parameters min_mandatory_genes_required and min_genes_required will be set to the number of mandatory Genes listed - the accessory Genes being deemed not required to infer a complete System.
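The quorum rules and their default values can be summarized in a few lines. The sketch below is a simplified reading of the rules described on this page, not MacSyFinder’s code: the function name `quorum_reached` and its arguments are hypothetical, and each found component is represented only by its presence status.

```python
def quorum_reached(found, n_mandatory_listed,
                   min_mandatory_genes_required=None,
                   min_genes_required=None):
    """Tell whether the components found satisfy the quorum of a model.

    `found` lists the presence status ("mandatory", "accessory", "neutral"
    or "forbidden") of each distinct model component that was detected.
    """
    # Defaults described above: both thresholds fall back to the number
    # of mandatory genes listed in the model.
    if min_mandatory_genes_required is None:
        min_mandatory_genes_required = n_mandatory_listed
    if min_genes_required is None:
        min_genes_required = n_mandatory_listed
    # A forbidden component rules the system out.
    if "forbidden" in found:
        return False
    # Neutral components are not part of the quorum, so they are not counted.
    n_mandatory = found.count("mandatory")
    n_accessory = found.count("accessory")
    return (n_mandatory >= min_mandatory_genes_required
            and n_mandatory + n_accessory >= min_genes_required)
```

For a model listing three mandatory genes, `quorum_reached(["mandatory", "mandatory", "accessory"], 3)` is false with the defaults, and becomes true when `min_mandatory_genes_required=2` and `min_genes_required=3` are given.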
Example of a macsy-model definition in XML:
    <model inter_gene_max_space="5" ver="2.0">
        <gene name="gspD" presence="mandatory">
            <exchangeables>
                <gene name="sctC"/>
            </exchangeables>
        </gene>
        <gene name="sctN_FLG" presence="mandatory" loner="1">
            <exchangeables>
                <gene name="gspE"/>
                <gene name="pilT"/>
            </exchangeables>
        </gene>
        <gene name="sctV_FLG" presence="mandatory"/>
        <gene name="flp" presence="accessory"/>
    </model>
In this example, the described System consists of three mandatory and one accessory components:
- Two components, the Genes “gspD” and “sctN_FLG”, can be replaced in the quorum by sctC, and by gspE or pilT, respectively.
- To be considered as part of such a System, the components should be co-localized in loci (Clusters of Genes), which in this case means being separated from each other by at most 5 Genes. The exception is the Gene “sctN_FLG”, which is allowed to be located “alone” in the genome being investigated because its loner parameter is set to True. As the multi_loci parameter is not set, by default the System should be made of a single locus (Cluster of co-localized Genes, except for the ones listed as loners).
- To be considered a complete System, the quorum of Genes should be reached. In this case, the min_genes_required and min_mandatory_genes_required are not specified and therefore assigned to their default values: min_mandatory_genes_required is set to the number of mandatory Genes listed as well as the min_genes_required parameter (see above).
Note:
- a gene is identified by its name.
- this name is case sensitive.
- this name must be unique inside a family of models.
- a HMM profile with a gene-based name must exist in the profiles directory of the macsy-model package (see below).
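To make the hierarchy of the example concrete, the definition can be read with Python’s standard `xml.etree.ElementTree` module. This is only an illustrative sketch: MacSyFinder has its own parser, and the names `DEFINITION` and `summarize_model` are introduced here for the example.

```python
import xml.etree.ElementTree as ET

# Copy of the example definition, with each <exchangeables> element
# nested inside the gene it belongs to.
DEFINITION = """
<model inter_gene_max_space="5" ver="2.0">
    <gene name="gspD" presence="mandatory">
        <exchangeables>
            <gene name="sctC"/>
        </exchangeables>
    </gene>
    <gene name="sctN_FLG" presence="mandatory" loner="1">
        <exchangeables>
            <gene name="gspE"/>
            <gene name="pilT"/>
        </exchangeables>
    </gene>
    <gene name="sctV_FLG" presence="mandatory"/>
    <gene name="flp" presence="accessory"/>
</model>
"""


def summarize_model(xml_text):
    """Return (inter_gene_max_space, genes) for a macsy-model definition."""
    model = ET.fromstring(xml_text.strip())
    genes = []
    # Only direct children of <model> are components of the system;
    # <gene> elements nested in <exchangeables> belong to their parent gene.
    for gene in model.findall("gene"):
        genes.append({
            "name": gene.get("name"),
            "presence": gene.get("presence"),
            # Booleans are written "1", "true" or "True"; the default is false.
            "loner": gene.get("loner", "false") in ("1", "true", "True"),
            "exchangeables": [g.get("name")
                              for g in gene.findall("exchangeables/gene")],
        })
    return int(model.get("inter_gene_max_space")), genes


max_space, genes = summarize_model(DEFINITION)
for gene in genes:
    print(gene["name"], gene["presence"], gene["loner"], gene["exchangeables"])
```

The loop prints one line per component, which makes it easy to check that the three mandatory genes, the accessory gene, the loner flag and the exchangeable genes are read as described above.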
Providing HMM profiles
For each gene mentioned in each model, you have to provide an HMM profile to enable the similarity search of this gene. The HMM profile can either be built by the user from a curated multiple sequence alignment with the hmmbuild program from the HMMER package, or be obtained from HMM profile databases such as TIGRFAM or PFAM.
This profile MUST have the same name as the name of the gene mentioned in the definition. For instance, a component named “GeneA” in the macsy-model would correspond by default to a HMM profile “GeneA.hmm” enclosed in the macsy-model package. The names are case-sensitive. All HMM profiles must be placed in the profiles directory of the macsy-model package.
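Since a missing or misnamed profile is an easy mistake to make, the naming rule can be checked with a short script. The sketch below is an assumption-laden helper, not a MacSyFinder tool: the name `genes_without_profile` is hypothetical, and it assumes the package layout described above (a definitions directory with XML files and a profiles directory with .hmm files).

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def genes_without_profile(package_dir):
    """List the gene names used in definitions that lack a matching profile."""
    root = Path(package_dir)
    # Compare on the file stem ("GeneA" for GeneA.hmm); the comparison
    # is case-sensitive, like the rule stated above.
    available = {path.stem for path in (root / "profiles").glob("*.hmm")}
    missing = set()
    for definition in (root / "definitions").rglob("*.xml"):
        model = ET.parse(definition).getroot()
        # iter() also visits the genes nested in <exchangeables>.
        for gene in model.iter("gene"):
            if gene.get("name") not in available:
                missing.add(gene.get("name"))
    return sorted(missing)
```

An empty list means that every gene, including exchangeable ones, has a profile whose file name matches its name exactly.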
For a detailed tutorial on how to define your macsy-model’s features, parameters and HMM profiles, you can have a look at our cookbook in this book chapter.
Installing and sharing models
How to install new models
MacSyFinder does not provide models. You must install models before using it.
The macsydata utility is shipped with MacSyFinder to deal with macsy-models:
macsydata <subcommand> [options]
The main sub-commands are:
- macsydata available: to get the list of macsy-models available
- macsydata search: to search a model given its name or a pattern in its description
- macsydata install: to install a macsy-model package (the installed version can be set, see --help)
- macsydata cite: to retrieve information on how to cite the model
- macsydata --help: to get the extended list of available subcommands
- macsydata <subcommand> --help: to get help about the specified subcommand
NEW in V2
Where the models are located
MacSyFinder looks at several locations to find macsy-models.
By default macsydata installs models in a shared location (set by the --install-data option) that is /usr/share/macsyfinder/ or /usr/local/share/macsyfinder depending on your Operating System distribution. If you use a virtualenv, the shared resources are located in the <virtualenv>/share/macsyfinder directory.
If you don’t have the rights to install system-wide, you can install models in MacSyFinder’s cache located in your home: $HOME/.macsyfinder/data/. macsydata installs packages in this location when you use the --user option. The packages installed in user land are added to the system-wide packages.
If two packages have the same name, the package in the user land supersedes the system-wide package.
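The precedence between the two locations can be pictured as follows. This is a sketch of the rule only, not of how MacSyFinder actually scans its directories: the function name `resolve_packages` is hypothetical, and it assumes that each location directly contains one sub-directory per package.

```python
from pathlib import Path


def resolve_packages(system_dir, user_dir):
    """Map each package name to the directory that would be used.

    Assumption for this sketch: every package is a sub-directory of the
    location it is installed in.
    """
    packages = {}
    # Later locations override earlier ones, so the user location goes last.
    for location in (Path(system_dir), Path(user_dir)):
        if not location.is_dir():
            continue
        for entry in sorted(location.iterdir()):
            if entry.is_dir():
                packages[entry.name] = entry
    return packages
```

With a package of the same name in both locations, the returned mapping points to the copy in the user location, while packages present in only one location are kept as they are.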
Writing my own macsy-model package
The whole package structure is described above and requires five different types of files described below to be complete:
- a metadata file
- a README.md file
- a LICENCE file
- macsy-models definition(s)
- HMM profiles
The metadata.yml file contains some meta information about the package itself. It is in YAML format and must have the following structure:
    ---
    maintainer:
      name: The name of the person who maintains/to contact for further information. (required)
      email: The email of the maintainer (required)
    short_desc: A one line description of the package (can e.g. be used for *macsydata* searches) (required)
    vers: The package version (required)
    cite: The publication(s) to cite by the user when the package is used (optional, used by `macsydata cite`)
    doc: Where to find extended documentation (optional)
    licence: The licence under which the package is released (optional but highly recommended)
    copyright: The copyright of the package (optional)
For example:
    ---
    maintainer:
      name: first name last name
      email: login@my_domain.com
    short_desc: Models for 15 types of secretion systems or bacterial appendages (T1SS, T2SS, T3SS, T4P, pT4SSt, pT4SSi, T5aSS, T5bSS, T5cSS, T6SSi, T6SSii, T6SSiii, Flagellum, Tad, T9SS).
    vers: 0.0a1
    cite:
      - |
        Abby Sophie S., Cury Jean, Guglielmini Julien, Néron Bertrand, Touchon Marie,
        Rocha Eduardo P. C. (2016).
        Identification of protein secretion systems in bacterial genomes.
        In Scientific Reports, 6, pp. 23080.
        http://dx.doi.org/10.1038/srep23080
    doc: https://github.com/macsy-models/TXSS
    licence: CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    copyright: 2014-2020, Institut Pasteur, CNRS
This metadata.yml file is mandatory. Without this file your archive/repository will not be considered as a macsy-model package.
- a leading “-” specifies an item of a YAML list.
- “|” is used to specify a single item that spans multiple lines.
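The required fields of the metadata file can be verified once the file has been loaded into a dictionary. The sketch below assumes the content of metadata.yml has already been parsed with a YAML parser of your choice; the names `REQUIRED_FIELDS` and `missing_required_fields` are hypothetical and only cover the fields marked as required above.

```python
# Fields marked "(required)" in the structure above, in dotted notation.
REQUIRED_FIELDS = ("maintainer.name", "maintainer.email", "short_desc", "vers")


def missing_required_fields(metadata):
    """Return the required fields that are absent or empty in `metadata`."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = metadata
        # Walk down nested mappings, e.g. metadata["maintainer"]["email"].
        for key in dotted.split("."):
            node = node.get(key) if isinstance(node, dict) else None
        if not node:
            missing.append(dotted)
    return missing


metadata = {
    "maintainer": {"name": "first name last name"},
    "short_desc": "Models for secretion systems",
    "vers": "0.0a1",
}
print(missing_required_fields(metadata))
# → ['maintainer.email']
```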
The README.md file gives a description of the package in Markdown format: what kind of systems the package models, how to use it, etc.