Fusion Rule Technology




Bioinformatics Case Study

The following case study is the result of a three-month preliminary investigation. It develops two knowledgebases and two sets of logical fusion rules for merging the functional annotations of protein domains. The first knowledgebase and set of fusion rules merge information from different sources about the same protein domain, in order to obtain "cleaner" data about individual protein domains. The second knowledgebase and set of fusion rules group different protein domains, on the basis of their functional annotations, into meaningful subsets exhibiting various degrees of functional similarity. Thus the merged output of the first set of fusion rules is intended to serve as the input to the second set.

Rules for Merging Information about Individual Protein Domains

Some protein domains will have functional annotations from different sources. If the functional annotations assigned by different sources conflict then we prefer the annotation assigned by one source, EC, to that assigned by the other, MSD, provided that both annotations belong to the Gene Ontology molecular_function hierarchy. If, however, not all of the annotations in the various sources of information are from that hierarchy then we want to find the most specific annotation assigned to the protein domain in each hierarchy. This yields the following set of rules (for an explanation of how to read a fusion rule click here):

Rulecode is 1 (Status = foundational)

sameproteindomain(set1//proteindomain/name)

IMPLIES Initialize(proteindomain(name,GO_ID_num_molecular_function,source)) AND AddText(1//proteindomain/name, proteindomain/name)

This first rule simply checks that all the input information concerns the same protein domain. If that condition fails, no other rule is executed. If it holds, then the structure of the merged output is constructed and the name of the protein domain is attached.

Rulecode is 2 (Status = optional)

sameGO_IDnumbers(set1//proteindomain/GO_ID_num)

IMPLIES AddText(1//proteindomain/GO_ID_num, proteindomain/GO_ID_num_molecular_function) AND AddText(Conjunction(set1//proteindomain/source), proteindomain/source)

The second rule checks that all the GO ID numbers are the same; if they are, then there is no conflict and that functional annotation is added to the output, together with a list of all of the sources of that information.

Rulecode is 3 (Status = optional)

NOT sameGO_IDnumbers(set1//proteindomain/GO_ID_num)

AND molecularfunctions(set1//proteindomain/GO_ID_num) AND usepreferredsource(set1//proteindomain/source, set1//proteindomain/GO_ID_num, X) AND preferredsource(set1//proteindomain/source, Y) IMPLIES AddText(X, proteindomain/GO_ID_num_molecular_function) AND AddText(Y, proteindomain/source)

The third rule deals with the case where the GO ID numbers assigned by different sources are in conflict. The second condition checks that all of the ID numbers are from the molecular function hierarchy; if they are, then the ID number assigned by the preferred source, EC, is added to the merged output, together with the name of the source.
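The three rules above can be read as a small decision procedure. The following Python sketch is one possible way to mirror that logic; it is not the Prolog knowledgebase itself. The function name `merge_domain_reports`, the dictionary layout of each report, the placeholder domain name and GO IDs, and the choice to render `Conjunction` by joining source names with " and " are all assumptions made for illustration.

```python
PREFERRED_SOURCE = "EC"  # the text prefers EC over MSD when annotations conflict


def merge_domain_reports(reports, molecular_function_ids):
    """Merge reports about one protein domain (hypothetical helper).

    reports: list of dicts with keys "name", "go_id" and "source".
    molecular_function_ids: set of GO IDs assumed to lie in the
        molecular_function hierarchy.
    Returns the merged record, or None if Rule 1 fails.
    """
    # Rule 1: every report must concern the same protein domain.
    names = {r["name"] for r in reports}
    if len(names) != 1:
        return None
    merged = {"name": names.pop()}

    go_ids = {r["go_id"] for r in reports}
    if len(go_ids) == 1:
        # Rule 2: no conflict, so keep the annotation and list every source.
        merged["GO_ID_num_molecular_function"] = go_ids.pop()
        merged["source"] = " and ".join(r["source"] for r in reports)
    elif go_ids <= molecular_function_ids:
        # Rule 3: conflicting molecular_function annotations, so defer to
        # the preferred source if it is among the reports.
        for r in reports:
            if r["source"] == PREFERRED_SOURCE:
                merged["GO_ID_num_molecular_function"] = r["go_id"]
                merged["source"] = PREFERRED_SOURCE
                break
    return merged


# Placeholder data: the domain name and GO IDs below are invented.
reports = [
    {"name": "domainX", "go_id": "GO:AAAAAAA", "source": "EC"},
    {"name": "domainX", "go_id": "GO:BBBBBBB", "source": "MSD"},
]
print(merge_domain_reports(reports, {"GO:AAAAAAA", "GO:BBBBBBB"}))
```

When neither Rule 2 nor Rule 3 applies (for example, when some annotation lies outside the molecular_function hierarchy), the sketch returns only the domain name, since the rules listed above add nothing further in that case.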

The Prolog knowledgebase, the fusion rules for merging different sources of information about the same protein domain, and two sample merged output reports can be found at the following links:

  • Prolog knowledgebase
  • Fusion rules in logical form
  • Fusion rules in XML
  • Merged data for 1a50B2 (Example where Rule 2 executes)
  • Merged data for 1a50B2 (Example where Rule 3 executes)

    The information in the merged report comes from a single CATH Superfamily (3.40.50.1900). 

    Rules for Merging Information from Different Protein Domains

    The information to be merged concerns a single CATH Superfamily (3.30.559.10) and is given in the following table:

    Table 1: CATH Superfamily: 3.30.559.10

    Functional Annotation   Protein Domains
    GO:0004149              1c4tA0, 1c4tB0, 1c4tC0, 1e2o00
    GO:0004517              1nocB0
    GO:0008811              1cia00, 1cla00, 1qca00, 2cla00, 3cla00, 4cla00
    GO:0030523              1dpb00, 1dpc00, 1dpd00, 1eaa00, 1eab00, 1eac00, 1ead00, 1eae00, 1eaf00

    The main objective in applying fusion rules to a set of protein domains is to find subsets of those domains that exhibit varying degrees of functional similarity. This selection is done on the basis of the domains' functional annotations. These annotations belong to a hierarchical classification (a directed acyclic graph) in the Gene Ontology database. If two nodes in the graph are related as parent and child then this means they stand in either of two relationships:

    Case 1:
    GO:0003674 molecular function GO:0016209 antioxidant
    In case 1, the parent is molecular function and the child is antioxidant, and the relationship between them is that antioxidant is a subclass of molecular function; in other words, antioxidant is a molecular function.

    Case 2:
    GO:0003673 gene ontology GO:0003674 molecular function
    In case 2, however, the parent-child relationship is that of part of; in other words, molecular function is part of the gene ontology.

    On the basis of this hierarchy it is possible to select subsets of the protein domains in a given superfamily such that there is an annotation general enough to cover the function of every protein domain in the subset, but no more general than necessary; that is, we can find a least upper bound for each subset. If such a least upper bound is sufficiently specific, this indicates that the protein domains in question might usefully be grouped together. The subgroups are chosen so that, for a given least upper bound, the greatest number of domains is included. (If, however, the least upper bound is too general, i.e. at the level of molecular_function or above, this suggests that the protein domains in question should not be grouped together, and the fusion rules are designed to prevent this.)
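As a rough illustration of the least-upper-bound idea, the Python sketch below computes the most specific common ancestors of a set of GO terms in a small directed acyclic graph. The `PARENTS` map is a deliberately simplified assumption: its direct edges were chosen only so that the results agree with the least upper bounds this case study reports for superfamily 3.30.559.10, whereas the real Gene Ontology graph contains intermediate terms. The function names are likewise hypothetical and do not come from the Prolog knowledgebase.

```python
# Toy child -> parents map. The direct edges are a simplifying assumption;
# the real Gene Ontology graph has intermediate terms between these nodes.
PARENTS = {
    "GO:0030523": ["GO:0016417"],
    "GO:0004149": ["GO:0016417"],
    "GO:0016417": ["GO:0008415"],  # S-acyltransferase activity
    "GO:0008811": ["GO:0008415"],
    "GO:0008415": ["GO:0003824"],  # acyltransferase activity
    "GO:0004517": ["GO:0003824"],
    "GO:0003824": ["GO:0003674"],  # catalytic activity
    "GO:0003674": ["GO:0003673"],  # molecular function
}
TOO_GENERAL = {"GO:0003674", "GO:0003673"}  # molecular_function and above


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable by parent links."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def least_upper_bounds(terms, parents=PARENTS):
    """Return the common ancestors that have no more specific common ancestor.

    In a DAG there may be more than one such term, hence a set.
    `terms` must be non-empty.
    """
    common = set.intersection(*(ancestors(t, parents) for t in terms))
    return {
        c for c in common
        if not any(d != c and c in ancestors(d, parents) for d in common)
    }


def usable_bounds(terms, parents=PARENTS):
    """Discard bounds at the level of molecular_function or above."""
    return least_upper_bounds(terms, parents) - TOO_GENERAL


print(usable_bounds(["GO:0030523", "GO:0004149"]))
print(usable_bounds(["GO:0030523", "GO:0004149", "GO:0008811"]))
```

Returning a set rather than a single term is a design choice: a term in a directed acyclic graph can have several parents, so two annotations may share more than one most specific common ancestor.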

    It would be desirable, however, if, in merging information about different protein domains, we were not restricted to sources that give functional annotations for the domains. We can in fact free ourselves of this restriction by making use of information about the sequence similarity scores of protein domains in the following way. Suppose we do not have a functional annotation for a protein domain, D1, but we know that D1 has a sequence similarity score of greater than 60% to a protein domain, D2, whose GO ID number we do know. Then we can infer from this that these two domains share the same function. The significance of this fact is that if all of the protein domains in an input set that lack functional annotations have sequence similarity scores of greater than 60% to other protein domains in the input set that do have functional annotations, then we can safely merge the whole set, knowing that the least upper bound obtained for those domains that do have annotations will apply to the whole set.
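The inference step just described can be sketched in a few lines of Python. This is a hedged illustration only: the function name `infer_annotations`, the way similarity scores are stored, and the domain names D1 to D3 are assumptions. Where an unannotated domain has several qualifying neighbours, the sketch borrows the annotation of the highest-scoring one, which is also an assumption rather than something the case study specifies.

```python
THRESHOLD = 60.0  # percent; the text requires a score greater than 60%


def infer_annotations(annotations, similarity, threshold=THRESHOLD):
    """Fill in missing annotations from sufficiently similar domains.

    annotations: dict mapping domain name -> GO ID, or None if unannotated.
    similarity: dict mapping frozenset({d1, d2}) -> similarity score (%).
    Returns (inferred, unresolved), where `unresolved` lists the domains
    that lack both an annotation and a qualifying similarity score.
    """
    inferred = dict(annotations)
    unresolved = []
    for domain, go_id in annotations.items():
        if go_id is not None:
            continue
        # Annotated domains whose similarity to `domain` exceeds the threshold.
        donors = [
            other for other, other_id in annotations.items()
            if other_id is not None
            and similarity.get(frozenset((domain, other)), 0.0) > threshold
        ]
        if donors:
            best = max(donors, key=lambda o: similarity[frozenset((domain, o))])
            inferred[domain] = annotations[best]
        else:
            unresolved.append(domain)
    return inferred, unresolved


# Invented example: D1 and D3 lack annotations; only D1 is similar enough to D2.
annotations = {"D1": None, "D2": "GO:0008811", "D3": None}
similarity = {frozenset(("D1", "D2")): 72.5, frozenset(("D3", "D2")): 41.0}
inferred, unresolved = infer_annotations(annotations, similarity)
print(inferred, unresolved)
```

An empty `unresolved` list corresponds to the case in which the whole input set can safely be merged.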

    A further objective of the rules is to provide a list of the key words and phrases (duplicates being removed) that occur in the functional annotations of all those terms in the GO hierarchy above and including the term which is the least upper bound. (Because much of the computation required to achieve this further objective is also required to determine the least upper bound, both objectives have been subsumed into a single rule to avoid unnecessary duplication.) The following rules are designed to fulfil these objectives (for an explanation of how to read a fusion rule click here):

    Rulecode is 1 (Status = foundational)

    selectfunctionalgroups(set1//protein/function, X)

    AND selectproteingroups(set1//protein/name, set1//protein/function, X, Y) AND expandfunctionalgroupings(set1//protein/name, set1//protein/function, Y, U) AND getleastupperbounds(Y, U, Z) AND getkeywordgroups(Z, W) AND getnumberoffunctionalgroups(Y, V) IMPLIES Initialize(biofusionanalysis) AND RepeatAddNode(functionalgroup, V, biofusionanalysis) AND RepeatAddNode(selectedproteins, 1, biofusionanalysis/functionalgroup) AND RepeatAddText(Y, biofusionanalysis/functionalgroup/selectedproteins) AND RepeatAddNode(commonfunction, 1, biofusionanalysis/functionalgroup) AND RepeatAddText(Z, biofusionanalysis/functionalgroup/commonfunction) AND RepeatAddNode(keywords, 1, biofusionanalysis/functionalgroup) AND RepeatAddAtomicTrees(W, keyword, biofusionanalysis/functionalgroup/keywords)

    The first rule groups the input set of protein domains into subgroups such that those subgroups contain the largest number of domains for a given least upper bound. If any protein domain lacks an annotation then, provided it has a sequence score of greater than 60% to another protein domain that does have an annotation, it is included in any subgroups that contain the second protein domain. The rule also finds all of the key words and phrases that occur in the annotations of the least upper bound, and in those entries in the GO hierarchy that are more general (i.e. above it). These are then added to the output.
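The keyword-gathering part of this rule can be pictured with the following Python sketch, which walks upward from a least upper bound and collects words from the name of each term it meets, dropping duplicates. The parent links are a simplified assumption (the real Gene Ontology graph has intermediate terms), the function name `keywords_above` is hypothetical, and treating each whitespace-separated word of a term name as a keyword is a guess at what "key words and phrases" means in practice.

```python
from collections import deque

# Simplified child -> parents links (an assumption, not the full GO graph).
PARENTS = {
    "GO:0016417": ["GO:0008415"],
    "GO:0008415": ["GO:0003824"],
    "GO:0003824": ["GO:0003674"],
    "GO:0003674": ["GO:0003673"],
}
# Term names as they appear in this case study.
NAMES = {
    "GO:0016417": "S-acyltransferase activity",
    "GO:0008415": "acyltransferase activity",
    "GO:0003824": "catalytic activity",
    "GO:0003674": "molecular function",
    "GO:0003673": "gene ontology",
}


def keywords_above(term, parents=PARENTS, names=NAMES):
    """Collect de-duplicated keywords for `term` and every term above it."""
    seen_terms, keywords = {term}, []
    queue = deque([term])
    while queue:
        current = queue.popleft()
        # Assumption: each lower-cased word of the term name is a keyword.
        for word in names.get(current, "").lower().split():
            if word not in keywords:
                keywords.append(word)
        for parent in parents.get(current, []):
            if parent not in seen_terms:
                seen_terms.add(parent)
                queue.append(parent)
    return keywords


print(keywords_above("GO:0016417"))
```

The breadth-first walk lists keywords from the most specific term first, so the words closest to the least upper bound appear at the head of the list.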

    Rulecode is 2 (Status = optional)

    cannotfindsemanticgeneralization(set1//protein/name, set1//protein/function, X)

    IMPLIES AddText(X, biofusionAnalysis/commonfunction)

    If it turns out that one or more of the protein domains that lack annotations also lack a sequence score of greater than 60% to any other input protein domain that does have an annotation, then we cannot safely merge these domains. This is because the least upper bound obtained from the annotations that we do possess might not accurately reflect the true least upper bound obtained if we possessed annotations for all of the domains. In this case Rule 2 adds a message to this effect to the output, including the names of those domains which lack both an annotation and the required sequence score.

    The Prolog knowledgebase and the fusion rules for merging the functional annotations can be found at the following links (the fusion rules are given both in XML and in logical form, the latter being easier to read):

  • Prolog knowledgebase
  • Fusion rules in logical form
  • Fusion rules in XML

    Applying these rules to the protein domains in CATH Superfamily 3.30.559.10 revealed three subgroups: the superfamily itself and two smaller groupings. The output report can be found at the following link.

  • Output report

    Table 2 summarizes the result of seeking functionally similar subsets of the protein domains in CATH Superfamily 3.30.559.10.

    Table 2: The three subgroups exhibiting various degrees of functional similarity found in CATH Superfamily: 3.30.559.10.

    Protein Domains                                   Least Upper Bound

    1dpb00, 1dpc00, 1dpd00, 1eaa00, 1eab00, 1eac00,   acyltransferase activity (GO:0008415)
    1ead00, 1eae00, 1eaf00, 1cia00, 1cla00, 1qca00,
    2cla00, 3cla00, 4cla00, 1c4tA0, 1c4tB0, 1c4tC0,
    1e2o00

    1dpb00, 1dpc00, 1dpd00, 1eaa00, 1eab00, 1eac00,   catalytic activity (GO:0003824)
    1ead00, 1eae00, 1eaf00, 1cia00, 1cla00, 1qca00,
    2cla00, 3cla00, 4cla00, 1c4tA0, 1c4tB0, 1c4tC0,
    1e2o00, 1nocB0

    1dpb00, 1dpc00, 1dpd00, 1eaa00, 1eab00, 1eac00,   S-acyltransferase activity (GO:0016417)
    1ead00, 1eae00, 1eaf00, 1c4tA0, 1c4tB0, 1c4tC0,
    1e2o00

