Fusion Rule Technology

Bioinformatics Case Study

The following case study is the result of three months' preliminary investigation. It develops two knowledgebases and two sets of logical fusion rules for merging the functional annotations of protein domains. The first knowledgebase and set of fusion rules merge information from different sources about the same protein domain, in order to obtain "cleaner" data about individual protein domains. The second knowledgebase and set of fusion rules group different protein domains, on the basis of their functional annotations, into meaningful subsets exhibiting various degrees of functional similarity. Thus the merged output of the first set of fusion rules is intended to serve as the input to the second set.

Rules for Merging Information about Individual Protein Domains

Some protein domains will have functional annotations from different sources. If the functional annotations assigned by different sources conflict then we prefer the annotation assigned by one source, EC, to that assigned by the other, MSD, provided that both annotations belong to the Gene Ontology molecular_function hierarchy. If, however, not all of the annotations in the various sources of information are from that hierarchy then we want to find the most specific annotation assigned to the protein domain in each hierarchy. This yields the following set of rules (for an explanation of how to read a fusion rule click here):

Rulecode is 1 (Status = foundational) sameproteindomain(set1//proteindomain/name) IMPLIES Initialize(proteindomain(name,GO_ID_num_molecular_function,source)) AND AddText(1//proteindomain/name, proteindomain/name)

This first rule simply checks that all the input information concerns the same protein domain. If that condition fails, no other rule is executed. If it holds, then the structure of the merged output is constructed and the name of the protein domain is attached.

Rulecode is 2 (Status = optional) sameGO_IDnumbers(set1//proteindomain/GO_ID_num) IMPLIES AddText(1//proteindomain/GO_ID_num, proteindomain/GO_ID_num_molecular_function) AND AddText(Conjunction(set1//proteindomain/source), proteindomain/source)

The second rule checks that all the GO ID numbers are the same; if they are, then there is no conflict and that functional annotation is added to the output, together with a list of all of the sources of that information.

Rulecode is 3 (Status = optional) NOT sameGO_IDnumbers(set1//proteindomain/GO_ID_num) AND molecularfunctions(set1//proteindomain/GO_ID_num) AND usepreferredsource(set1//proteindomain/source, set1//proteindomain/GO_ID_num, X) AND preferredsource(set1//proteindomain/source, Y) IMPLIES AddText(X, proteindomain/GO_ID_num_molecular_function) AND AddText(Y, proteindomain/source)

The third rule deals with the case where the GO ID numbers assigned by different sources are in conflict. The second condition checks that all of the ID numbers are from the molecular function hierarchy; if they are, then the ID number assigned by the preferred source, EC, is added to the merged output, together with the name of the source.
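The three rules above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not the Prolog knowledgebase or the fusion rule engine itself. The function name, the record layout and the placeholder IDs `GO:x` and `GO:y` are assumptions made for the example.

```python
PREFERRED_SOURCE = "EC"  # the case study prefers EC over MSD


def merge_domain_records(records, molecular_function_ids):
    """Merge several annotation records for one protein domain.

    records: list of dicts with keys "name", "go_id" and "source" (assumed layout).
    molecular_function_ids: GO IDs known to lie in the molecular_function hierarchy.
    Returns the merged record, or None if the foundational rule fails.
    """
    names = {r["name"] for r in records}
    if len(names) != 1:
        # Rule 1 fails: the inputs concern different domains, so nothing else runs.
        return None
    merged = {"name": names.pop(), "go_id_molecular_function": None, "source": None}

    go_ids = {r["go_id"] for r in records}
    if len(go_ids) == 1:
        # Rule 2: no conflict; keep the shared ID and list every source.
        merged["go_id_molecular_function"] = go_ids.pop()
        merged["source"] = " and ".join(sorted({r["source"] for r in records}))
    elif go_ids <= set(molecular_function_ids):
        # Rule 3: conflicting molecular_function IDs; trust the preferred source.
        for r in records:
            if r["source"] == PREFERRED_SOURCE:
                merged["go_id_molecular_function"] = r["go_id"]
                merged["source"] = PREFERRED_SOURCE
                break
    return merged


# Placeholder GO IDs, not real annotations of 1a50B2.
conflicting = [
    {"name": "1a50B2", "go_id": "GO:x", "source": "EC"},
    {"name": "1a50B2", "go_id": "GO:y", "source": "MSD"},
]
print(merge_domain_records(conflicting, {"GO:x", "GO:y"}))
```

When neither optional rule applies (for example, conflicting IDs that are not all in the molecular_function hierarchy), the sketch returns the initialized structure with only the name filled in, mirroring the fact that Rules 2 and 3 are optional.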

The Prolog knowledgebase, the fusion rules for merging different sources of information about the same protein domain, and two sample merged output reports can be found at the following links:

Prolog knowledgebase

Fusion rules in logical form

Fusion rules in XML

Merged data for 1a50B2 (Example where Rule 2 executes)

Merged data for 1a50B2 (Example where Rule 3 executes)

The information in the merged report comes from a single CATH Superfamily (3.40.50.1900).

Rules for Merging Information from Different Protein Domains

The information to be merged concerns a single CATH Superfamily (3.30.559.10) and is given in the following table:

Table 1: CATH Superfamily: 3.30.559.10
Functional Annotation	Protein Domains
GO:0004149	1c4tA0	1c4tB0	1c4tC0	1e2o00	-	-	-	-	-
GO:0004517	1nocB0	-	-	-	-	-	-	-	-
GO:0008811	1cia00	1cla00	1qca00	2cla00	3cla00	4cla00	-	-	-
GO:0030523	1dpb00	1dpc00	1dpd00	1eaa00	1eab00	1eac00	1ead00	1eae00	1eaf00

The main objective in applying fusion rules to a set of protein domains is to find subsets of those domains that exhibit varying degrees of functional similarity. This selection is done on the basis of the domains' functional annotations. These annotations belong to a hierarchical classification (a directed acyclic graph) in the Gene Ontology database. If two nodes in the graph are related as parent and child then this means they stand in either of two relationships:

Case 1: GO:0003674 molecular function is the parent of GO:0016209 antioxidant. Here the child is a subclass of the parent; in other words, antioxidant is a molecular function.

Case 2: GO:0003673 gene ontology is the parent of GO:0003674 molecular function. Here, however, the parent-child relationship is that of part of; in other words, molecular function is part of the gene ontology.

On the basis of this hierarchy it is possible to select subsets of the protein domains in a given superfamily such that there is an annotation that is general enough to cover the function of each of the proteins in the subset, but that is no more general than necessary; that is, we can find a least upper bound for each subset. If such a least upper bound is sufficiently specific, this indicates that the subset of proteins in question might usefully be grouped together. The subgroups are chosen so that, for a given least upper bound, the greatest number of domains is included. (If, however, the least upper bound is too general, i.e. at the level of molecular_function and above, this would suggest that the proteins in question should not be grouped together, and the fusion rules are designed to prevent this.)
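To make the notion of a least upper bound concrete, here is a minimal Python sketch over a small, entirely hypothetical hierarchy (the term names `general`, `mid`, `other` and `leaf_*` are invented for illustration and are not GO terms). Because the hierarchy is a directed acyclic graph, a term may have several parents, so the sketch returns a set: in general there can be more than one most specific common ancestor.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def least_upper_bounds(terms, parents):
    """Most specific terms that cover (are ancestors of, or equal to) every input term."""
    ancestor_sets = [ancestors(t, parents) for t in terms]
    if not ancestor_sets:
        return set()
    common = set.intersection(*ancestor_sets)
    # Discard any common ancestor that lies above another common ancestor.
    return {c for c in common
            if not any(d != c and c in ancestors(d, parents) for d in common)}


# Hypothetical hierarchy: child -> list of parents.
PARENTS = {
    "general": [],
    "mid": ["general"],
    "other": ["general"],
    "leaf_a": ["mid"],
    "leaf_b": ["mid", "other"],  # two parents, as a DAG allows
    "leaf_c": ["other"],
}

print(least_upper_bounds(["leaf_a", "leaf_b"], PARENTS))  # {'mid'}
print(least_upper_bounds(["leaf_b", "leaf_c"], PARENTS))  # {'other'}
print(least_upper_bounds(["leaf_a", "leaf_c"], PARENTS))  # {'general'}
```

In the last call the only common ancestor is the most general term, which corresponds to the situation the fusion rules are designed to reject.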

It would be desirable, however, if, in merging information about different protein domains, we were not restricted to sources that give functional annotations for the domains. We can in fact free ourselves of this restriction by making use of sequence similarity scores between protein domains, in the following way. Suppose we do not have a functional annotation for a protein domain, D1, but we know that D1 has a sequence similarity score of greater than 60% to a protein domain, D2, whose GO ID number we do know. Then we infer that these two domains share the same function. The significance of this is that if every protein domain in an input set that lacks a functional annotation has a sequence similarity score of greater than 60% to some other protein domain in the input set that does have a functional annotation, then we can safely merge the whole set, knowing that the least upper bound obtained for those domains that do have annotations will apply to the whole set.
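The inference described above might be sketched as follows. This is an assumption-laden illustration: the function names, the dictionary layout, the extra domain D3 and the scores are all invented, and choosing the highest-scoring annotated partner when several qualify is a design choice of the sketch, not something the case study specifies.

```python
THRESHOLD = 60.0  # percent; the case study requires a score greater than 60%


def pair_score(similarity, a, b):
    """Look up a score regardless of the order in which the pair was stored."""
    return similarity.get((a, b), similarity.get((b, a), 0.0))


def infer_annotations(annotations, similarity):
    """Return a copy of annotations with gaps filled where a close match exists.

    annotations: dict mapping domain name -> GO ID, or None if unannotated.
    similarity: dict mapping (domain, domain) -> sequence similarity in percent.
    """
    annotated = {d: t for d, t in annotations.items() if t is not None}
    completed = dict(annotations)
    for domain, term in annotations.items():
        if term is not None:
            continue
        matches = [(pair_score(similarity, domain, other), other) for other in annotated]
        matches = [m for m in matches if m[0] > THRESHOLD]
        if matches:
            _, best = max(matches)  # highest-scoring annotated partner
            completed[domain] = annotated[best]
    return completed


# Invented scores for illustration only.
annotations = {"D1": None, "D2": "GO:0008811", "D3": None}
similarity = {("D1", "D2"): 72.5, ("D3", "D2"): 41.0}
print(infer_annotations(annotations, similarity))
```

Here D1 inherits the annotation of D2, while D3 stays unannotated because its only score falls below the threshold.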

A further objective of the rules is to provide a list of the key words and phrases (duplicates being removed) that occur in the functional annotations of all those terms in the GO hierarchy above and including the term which is the least upper bound. (Because much of the computation required to achieve this further objective is also required to determine the least upper bound, both objectives have been subsumed into a single rule to avoid unnecessary duplication.) The following rules are designed to fulfil these objectives (for an explanation of how to read a fusion rule click here):

Rulecode is 1 (Status = foundational) selectfunctionalgroups(set1//protein/function, X) AND selectproteingroups(set1//protein/name, set1//protein/function, X, Y) AND expandfunctionalgroupings(set1//protein/name, set1//protein/function, Y, U) AND getleastupperbounds(Y, U, Z) AND getkeywordgroups(Z, W) AND getnumberoffunctionalgroups(Y, V) IMPLIES Initialize(biofusionanalysis) AND RepeatAddNode(functionalgroup, V, biofusionanalysis) AND RepeatAddNode(selectedproteins, 1, biofusionanalysis/functionalgroup) AND RepeatAddText(Y, biofusionanalysis/functionalgroup/selectedproteins) AND RepeatAddNode(commonfunction, 1, biofusionanalysis/functionalgroup) AND RepeatAddText(Z, biofusionanalysis/functionalgroup/commonfunction) AND RepeatAddNode(keywords, 1, biofusionanalysis/functionalgroup) AND RepeatAddAtomicTrees(W, keyword, biofusionanalysis/functionalgroup/keywords)

The first rule groups the input set of protein domains into subgroups such that each subgroup contains the largest number of domains for a given least upper bound. If any protein domain lacks an annotation then, provided it has a sequence similarity score of greater than 60% to another protein domain that does have an annotation, it is included in any subgroups that contain the second protein domain. The rule also finds all of the key words and phrases that occur in the annotations of the least upper bound, and in those entries in the GO hierarchy that are more general (i.e. above it). These are then added to the output.
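The keyword step can be pictured as a walk up the hierarchy from the least upper bound, collecting key words and dropping duplicates. The Python sketch below assumes, purely for illustration, that key words are obtained by splitting each term name into words and that the three terms named in Table 2 form a simple chain of direct parents below molecular function; neither assumption is taken from the knowledgebase.

```python
from collections import deque


def keywords_above(term, parents, keywords):
    """Collect key words of a term and all of its ancestors, without duplicates."""
    ordered, seen_terms = [], set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current in seen_terms:
            continue
        seen_terms.add(current)
        for word in keywords.get(current, ()):
            if word not in ordered:
                ordered.append(word)
        queue.extend(parents.get(current, ()))
    return ordered


# Assumed direct parent links (a simplification for this example).
PARENTS = {
    "GO:0016417": ["GO:0008415"],
    "GO:0008415": ["GO:0003824"],
    "GO:0003824": ["GO:0003674"],
}
# Assumed key words: the term names split into words.
KEYWORDS = {
    "GO:0016417": ["S-acyltransferase", "activity"],
    "GO:0008415": ["acyltransferase", "activity"],
    "GO:0003824": ["catalytic", "activity"],
    "GO:0003674": ["molecular", "function"],
}

print(keywords_above("GO:0016417", PARENTS, KEYWORDS))
# ['S-acyltransferase', 'activity', 'acyltransferase', 'catalytic', 'molecular', 'function']
```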

Rulecode is 2 (Status = optional) cannotfindsemanticgeneralization(set1//protein/name, set1//protein/function, X) IMPLIES AddText(X, biofusionAnalysis/commonfunction)

If it turns out that one or more of the protein domains that lack annotations also lack a sequence similarity score of greater than 60% to any other input protein domain that does have an annotation, then we cannot safely merge these domains. This is because the least upper bound obtained from the annotations that we do possess might not accurately reflect the true least upper bound that we would obtain if we possessed annotations for all of the domains. In this case Rule 2 adds a message to this effect to the output, including the names of those domains which lack both an annotation and the required sequence similarity score.
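The condition tested by Rule 2 could be sketched like this. The function names, the data layout and the wording of the message are hypothetical; the actual predicate cannotfindsemanticgeneralization is defined in the Prolog knowledgebase and may differ in detail.

```python
THRESHOLD = 60.0  # percent; a score must exceed this to license the inference


def unmergeable_domains(annotations, similarity):
    """Domains with no annotation and no sufficiently similar annotated partner."""
    annotated = {d for d, t in annotations.items() if t is not None}
    blocked = []
    for domain, term in annotations.items():
        if term is not None:
            continue
        scores = (similarity.get((domain, o), similarity.get((o, domain), 0.0))
                  for o in annotated)
        if not any(s > THRESHOLD for s in scores):
            blocked.append(domain)
    return blocked


def rule2_message(annotations, similarity):
    """Return a warning naming the blocking domains, or None if merging is safe."""
    blocked = unmergeable_domains(annotations, similarity)
    if not blocked:
        return None
    # Hypothetical wording; the real output text is not given in the case study.
    return "Cannot find a semantic generalization for: " + ", ".join(blocked)


# Invented input for illustration only.
annotations = {"D1": None, "D2": "GO:0008811", "D3": None}
similarity = {("D1", "D2"): 72.5, ("D3", "D2"): 41.0}
print(rule2_message(annotations, similarity))
```

With this input only D3 blocks the merge, since D1 has a score above the threshold to the annotated domain D2.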

The Prolog knowledgebase and the fusion rules for merging the functional annotations can be found at the following links. The fusion rules are given both in XML and in logical form; the logical form is easier to read.

Prolog knowledgebase

Fusion rules in logical form

Fusion rules in XML

Applying these rules to the protein domains in CATH Superfamily 3.30.559.10 revealed three subgroups: the superfamily itself and two smaller groupings. The output report can be found at the following link.

Output report

Table 2 summarizes the result of seeking functionally similar subsets of the protein domains in CATH Superfamily 3.30.559.10.

Table 2: The three subgroups exhibiting various degrees of functional similarity found in CATH Superfamily: 3.30.559.10.
Protein Domains	Least Upper Bound
1dpb00, 1dpc00, 1dpd00, 1eaa00, 1eab00, 1eac00, 1ead00, 1eae00, 1eaf00, 1cia00, 1cla00, 1qca00, 2cla00, 3cla00, 4cla00, 1c4tA0, 1c4tB0, 1c4tC0, 1e2o00.	acyltransferase activity (GO:0008415)
1dpb00, 1dpc00, 1dpd00, 1eaa00, 1eab00, 1eac00, 1ead00, 1eae00, 1eaf00, 1cia00, 1cla00, 1qca00, 2cla00, 3cla00, 4cla00, 1c4tA0, 1c4tB0, 1c4tC0, 1e2o00, 1nocB0.	catalytic activity (GO:0003824)
1dpb00, 1dpc00, 1dpd00, 1eaa00, 1eab00, 1eac00, 1ead00, 1eae00, 1eaf00, 1c4tA0, 1c4tB0, 1c4tC0, 1e2o00.	S-acyltransferase activity (GO:0016417)
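As a cross-check, the grouping in Table 2 can be reproduced from the Table 1 data with a short Python sketch. The parent links below are an assumption: they are inferred from the least upper bounds reported in Table 2 and simplified to direct links, not copied from the Gene Ontology database. The function names are likewise invented, and the sketch ignores the sequence similarity step because every domain in Table 1 is annotated.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def functional_groups(annotation_of, parents, too_general):
    """Map each usable least upper bound to the largest set of domains it covers."""
    covered = {}
    for domain, term in annotation_of.items():
        for anc in ancestors(term, parents):
            covered.setdefault(anc, set()).add(domain)
    groups = {}
    for term, domains in covered.items():
        if term in too_general:
            continue  # molecular_function and above are rejected
        if len({annotation_of[d] for d in domains}) < 2:
            continue  # a single annotation needs no generalization
        key = frozenset(domains)
        # Among terms covering the same domains, keep the more specific one.
        if key not in groups or groups[key] in ancestors(term, parents):
            groups[key] = term
    return {term: sorted(domains) for domains, term in groups.items()}


TABLE_1 = {
    "GO:0004149": ["1c4tA0", "1c4tB0", "1c4tC0", "1e2o00"],
    "GO:0004517": ["1nocB0"],
    "GO:0008811": ["1cia00", "1cla00", "1qca00", "2cla00", "3cla00", "4cla00"],
    "GO:0030523": ["1dpb00", "1dpc00", "1dpd00", "1eaa00", "1eab00",
                   "1eac00", "1ead00", "1eae00", "1eaf00"],
}
annotation_of = {d: term for term, ds in TABLE_1.items() for d in ds}

# Assumed, simplified parent links consistent with Table 2.
PARENTS = {
    "GO:0004149": ["GO:0016417"],
    "GO:0030523": ["GO:0016417"],
    "GO:0016417": ["GO:0008415"],
    "GO:0008811": ["GO:0008415"],
    "GO:0008415": ["GO:0003824"],
    "GO:0004517": ["GO:0003824"],
    "GO:0003824": ["GO:0003674"],
}
TOO_GENERAL = {"GO:0003674", "GO:0003673"}

groups = functional_groups(annotation_of, PARENTS, TOO_GENERAL)
for term in sorted(groups):
    print(term, len(groups[term]))
```

Under these assumptions the sketch yields three groups of 20, 19 and 13 domains for GO:0003824, GO:0008415 and GO:0016417 respectively, matching the rows of Table 2.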
