Dangling on String
Singling out my favorite among the 174 biological information resources in the current database issue of Nucleic Acids Research is easy: String, a protein-protein interaction database primarily developed in the group of Peer Bork at the EMBL. It was updated to version 7, introducing many small and a few major improvements, and should finally be covered here.
Unlike the major repositories of protein-protein interactions such as BioGRID, IntAct or DIP, String does not collect data in the classical sense. It relies on available sources, namely high-throughput experiments, expression studies, literature mining and in silico predictions, and evaluates the different methods to highlight reliable interactions. As a result, you get an evaluated, concise picture of the interactions for a protein, linked to the evidence underlying the assessment.
String provides particularly good coverage for bacteria, based on phylogenetic profiles, protein fusion and gene neighborhood. These predictions beat the available experimental information in prokaryotes in number (and in quality, I would say on a blog). Predicting interactions is one way to study organisms that are inaccessible experimentally; the other option, transferring information from related species, is implemented as well. This ribosomal gene in mouse shows the sources of the data and how they are combined.
The current edition of the database holds information for 373 genomes. String builds on pairwise Smith-Waterman searches of all protein-coding sequences. As the computational cost of the pairwise searches increases quadratically with the number of genomes, the new version treats about half of the genomes as core genomes, which receive full coverage. The other, peripheral genomes are only searched against the core set. This way, the incoming surge of fully sequenced organisms in the next years can be handled elegantly. The NAR paper has more information on the database that I don’t want to regurgitate here.
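To get a feeling for what the core/periphery split buys, here is a rough back-of-the-envelope sketch in Python. It is not how String itself does its bookkeeping: it counts genome pairs rather than individual protein alignments, pretends all genomes are the same size, and assumes that "about half" means a 186/187 split. The function names are made up for illustration.

```python
def pairwise_searches(n_genomes: int) -> int:
    """Unordered genome pairs in a full all-against-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


def core_periphery_searches(n_core: int, n_periphery: int) -> int:
    """Genome pairs when core genomes are compared all-against-all and
    each peripheral genome is searched against the core set only."""
    return pairwise_searches(n_core) + n_periphery * n_core


total = 373              # genomes in the current release (from the post)
core = total // 2        # assumed split: "about half" taken as 186
periphery = total - core

full = pairwise_searches(total)
split = core_periphery_searches(core, periphery)
print(full, split, round(split / full, 2))

# Extra genome pairs caused by adding one more genome under each scheme.
extra_full = pairwise_searches(total + 1) - full
extra_peripheral = core_periphery_searches(core, periphery + 1) - split
print(extra_full, extra_peripheral)
```

Under these assumptions the one-off saving is modest, roughly a quarter of the genome pairs. The more interesting number is the second line of output: every new peripheral genome costs a fixed number of searches set by the size of the core, instead of a number that keeps growing with the whole database.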
The database offers several features that were not primarily intended but that I find myself using. The occurrence view gives a quick overview of whether a protein of interest is conserved in other organisms. In many cases, the gene name suffices to give you a quick glance at the conservation of a protein, without the need to click-wade through slow-loading pages elsewhere, let alone identify the right full-length sequence and run BLAST. Another helpful find, a file holding synonyms of gene names in all organisms, is hidden in the vaults of the download page and will come in handy for bioinformaticians who need a combined source.
And if your blog reader is exhausted and there is no other opportunity for procrastination in the office, you can always hit the random input link conveniently provided. Feels like making real discoveries.
[Disclaimer|Cheap attempt to fake objectivity: I have worked with several of the people behind the database for several years.]