Non-orthologous in the COG database

Posted on July 17th, 2006 by Roland Krause in Journals, Publications

The latest issue of Nucleic Acids Research contains work that finds non-orthologous proteins in about one third of the COG database from the NCBI.
The work by Christophe Dessimoz et al. from the ETH Zürich, entitled “Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits”, is measured in tone and avoids showing off. Eugene Koonin, one of the inventors of the resource, acknowledged that it has some flaws in this respect.
The examples (e.g. COG0508) are convincing because the screening for paralogs is followed up with a phylogenetic consensus tree.
However, the large number seems a bit worrying. I always thought that COGs would be rather too stringent and would not contain many paralogs that could in principle be resolved. The finding that the majority of the wrongly included proteins have metabolic functions was likewise surprising.

The finding has major implications only if the majority of the non-orthologous proteins can be shown to be functionally divergent, which I doubt. And can one use the procedure to provide a resource of the same quality as COG?

3 Responses to “Non-orthologous in the COG database”

  1. Raja Jothi Says:

    Earlier work on this topic (COCO-CL: hierarchical clustering of homology relations based on evolutionary correlations, by Jothi et al. from NCBI) has already shown that at least 15% of the COGs contain non-orthologous proteins. In that work, the authors provide a simple and efficient clustering mechanism based on evolutionary correlations to detect COGs containing non-orthologous proteins. The paper can be accessed at http://bioinformatics.oxfordjournals.org/cgi/content/full/22/7/779
    and the list of COGs containing non-orthologous proteins is available at
    http://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/COCOCL/

  2. Roland Krause Says:

    True, although you don’t make much of a point about it in the abstract, and Dessimoz et al. might have overlooked it just like I did.
    Can you comment on the number? Surely, the number of COGs containing non-orthologous proteins depends on many parameters, and experts would come to conflicting conclusions. However, do you think that the number of important cases is in the 5% or the 25% range?

  3. Raja Jothi Says:

    It is 15% or more, as I mentioned in my previous post. I believe that Dessimoz et al.’s 25% range is an overestimate for the following reason: after the automatic generation of COG clusters, the COG curators manually added sequences that had been missed owing to a lack of adequate sequence similarity (caused by faster sequence divergence, e.g. KOG3752). As a result, it may look as if, for example, KOG3752 contains non-orthologous sequences, while in fact KOG3752 is clean. We came across several of these cases during the testing of COCO-CL, which we later verified with R. Tatusov (COG co-author). More than the percentage of COGs that contain non-orthologous proteins, one should be concerned about the fraction of proteins in a COG that is non-orthologous, which is not that much after all.
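
The distinction drawn in this comment, between the share of COGs that contain a non-orthologous protein and the share of proteins that are non-orthologous, can be illustrated with a small calculation. The sketch below is not taken from either paper: the group names, the group sizes and the flags in `toy_groups` are invented purely for illustration, and the two helper functions are hypothetical. It also pools proteins across all groups, which is only one way to read "fraction of proteins".

```python
# Hypothetical membership table: group name -> one flag per member protein.
# True means the member is orthologous to the rest of the group,
# False means it was flagged as non-orthologous. All values are invented.
toy_groups = {
    "group_1": [True] * 9 + [False],
    "group_2": [True] * 10,
    "group_3": [True] * 10,
}


def share_of_groups_affected(groups):
    """Fraction of groups with at least one non-orthologous member."""
    affected = sum(1 for flags in groups.values() if not all(flags))
    return affected / len(groups)


def share_of_proteins_affected(groups):
    """Fraction of all member proteins flagged as non-orthologous."""
    total = sum(len(flags) for flags in groups.values())
    flagged = sum(flags.count(False) for flags in groups.values())
    return flagged / total


group_share = share_of_groups_affected(toy_groups)
protein_share = share_of_proteins_affected(toy_groups)
print(f"groups affected:   {group_share:.1%}")
print(f"proteins affected: {protein_share:.1%}")
```

In this toy table one group in three is affected but only one protein in thirty is, so a group-level percentage alone says little about how many individual assignments are wrong.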
