In this project we examine automatic methods for finding similarities between models in SBML and CellML formats. We aim to group similar models and to extract similarity features from those groups. The models we consider are stored in our graph database MaSyMoS.
Annotation-based feature extraction
Annotation-based feature extraction enables the comparison of model sets, as opposed to existing methods for model-to-keyword or model-to-model comparison. We propose three methods to extract characteristic features from arbitrary sets of models. All three are adapted from Information Retrieval and focus on the semantic annotations in models. The features are extracted from three ontologies frequently used in the field, namely Gene Ontology, ChEBI and SBO.
The selected features vary with the underlying model set and are specific to it. We show that the identified features map to concepts that sit higher up in the ontology hierarchy than the concepts used for model annotation. Our analysis also reveals that the information content of concepts in ontologies does not correlate with their usage for model annotation.
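To illustrate why features of a model set can sit higher in the hierarchy than the annotations themselves, here is a minimal Python sketch. It is not one of the three methods from the paper: the toy ontology `PARENTS`, the helper names and the `min_share` threshold are all assumptions made for illustration. Each annotation is propagated to its ancestors, and a concept counts as a feature if it covers a sufficient share of the models in the set.

```python
from collections import Counter

# Hypothetical toy ontology (child term -> parent terms); not real ChEBI data.
PARENTS = {
    "glucose": ["hexose"],
    "fructose": ["hexose"],
    "hexose": ["carbohydrate"],
    "carbohydrate": [],
    "ATP": ["nucleotide"],
    "nucleotide": [],
}


def with_ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def coverage(models, parents):
    """Count how many models each concept covers, directly or via a descendant."""
    counts = Counter()
    for annotations in models:
        covered = set()
        for term in annotations:
            covered |= with_ancestors(term, parents)
        counts.update(covered)  # each concept counted once per model
    return counts


def characteristic_features(models, parents, min_share=0.75):
    """Return concepts covering at least min_share of the models (assumed threshold)."""
    counts = coverage(models, parents)
    return sorted(t for t, c in counts.items() if c / len(models) >= min_share)


# Each model is represented only by its set of annotation terms.
models = [{"glucose", "ATP"}, {"fructose"}, {"glucose"}]
features = characteristic_features(models, PARENTS)
```

In this toy set, no single annotation term occurs in every model, while the parent concepts `hexose` and `carbohydrate` cover all three, so only those higher-level concepts are selected.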
Read on in our 2015 paper in the Journal of Biomedical Semantics.
Structure-based feature extraction
To take the biological background into account, we automatically identify the most frequent patterns within the models’ reaction networks. The occurrences of such patterns can serve as a reasonable similarity measure for grouping models that share many common structures. Detecting common patterns also opens up further use cases. For example, it becomes possible to determine whether a model was created by a theoretical, data-driven, or hybrid approach.
To find frequently occurring patterns within a set of biological models, we use frequent subgraph mining (FSM). Given a set of graphs, FSM finds the subgraphs within these graphs that pass a given frequency threshold. We chose gSpan, an extension-based algorithm that takes a graph set as its input and produces all frequent connected subgraphs according to the given frequency threshold. We apply the gSpan implementation provided by the Java library “Parallel and Sequential Mining Suite” (ParSeMiS).
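The notion of a frequency threshold can be illustrated with the simplest possible patterns, single labeled edges. The Python sketch below is neither gSpan nor the ParSeMiS API; the graph encoding, the labels and the function name `frequent_edges` are assumptions chosen for illustration. It counts in how many graphs each edge pattern occurs and keeps those that reach the threshold. An extension-based miner such as gSpan starts from frequent edges like these and grows them into larger connected subgraphs.

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Return labeled edge patterns that occur in at least min_support graphs."""
    support = Counter()
    for edges in graphs:
        # Count a pattern once per graph, however often it occurs inside it.
        support.update(set(edges))
    return {edge: count for edge, count in support.items() if count >= min_support}


# Hypothetical reaction networks, each given as a set of
# (source label, edge label, target label) triples.
graphs = [
    {("species", "reactant", "reaction"), ("reaction", "product", "species")},
    {("species", "reactant", "reaction"), ("species", "modifier", "reaction")},
    {("species", "reactant", "reaction"), ("reaction", "product", "species")},
]

frequent = frequent_edges(graphs, min_support=2)
```

With a threshold of two graphs, the reactant and product edges are kept and the modifier edge, which occurs in only one graph, is discarded.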