Computational Biology Methods and Their Application to the Comparative Genomics of Endocellular Symbiotic Bacteria of Insects
© Commins et al. 2009
Received: 23 January 2009
Accepted: 17 February 2009
Published: 11 March 2009
Comparative genomics has become a tantalizing challenge in the postgenomic era, a challenge magnified by the plethora of new genomes becoming available on a daily basis. The overwhelming list of new genomes to compare has pushed the fields of bioinformatics and computational biology toward the design and development of methods capable of identifying patterns in a sea of noisy data. Despite the many advances made in this endeavor, persistent exceptions to the general patterns continue to make it difficult to generalize methods for comparative genomics. In this review, we discuss the different tools devised to undertake the challenge of comparative genomics and some of the exceptions that compromise the generality of such methods. We focus on endosymbiotic bacteria of insects because of the peculiarities of their genome dynamics when compared to free-living organisms.
1. Genomes, Genomes, and More Genomes
The emergence of genome information has overwhelmed our efforts to analyze the unexpected amount of data generated during the last two decades. As an example, today (February 2009), there are 438 complete microbial genomes and 17 in draft at the J. Craig Venter Institute Comprehensive Microbial Resource website (URL: http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi). Considering that this is only a single resource, we estimate that the number of completed genomes will be on the order of double that by the end of 2009, with a considerable percentage of these already published in the literature. Already, the Entrez Genome Project website maintained by the National Center for Biotechnology Information (NCBI) reports that, on February 3, 2009, 857 genomes are complete, 815 are in draft assembly, and 989 are in progress (http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html). The number of institutes worldwide with increasing sequencing capacities has been rising at an exponential rate, and the first results of analyzing such data have settled old and long-debated hypotheses and generated breakthrough ideas that have opened new avenues in all fields of genetics and evolutionary biology. However, our ability to cope technically with the amount of raw data generated has become seriously compromised, fueling many initiatives aimed at developing computational tools to analyze genomic and proteomic data. Many of these tools have been developed to perform comparative genomic analyses; each tool has had to face many of the complexities caused by biologically driven genome remodeling phenomena, such as genome duplication, rearrangement, and shrinkage. In this review, we first discuss the different technologies developed to perform genomic and proteomic analyses.
We then focus on the importance of the developed tools to study biologically important phenomena such as genome duplication, the dynamics of genome rearrangement, and genome shrinkage that is associated with the intracellular life of bacteria.
2. Common Methods in Comparative Genomics
Comparative genomic methods are vast in number as well as function. A decision about the best way to do something is often a long and arduous task in this field, a task that has resulted in the design and reengineering of many of the tools that are available. To describe every method in this area of research would be next to impossible, and so, this text will provide a snapshot of what is available for many of the common tasks in comparative genomics. The logical place to start is of course the beginning—genome sequencing, assembly, and closing, then continuing to discuss the intricacies of comparative genomics.
While in the past comparative genomics has concentrated on sequencing single genomes and parts of genomes, current excitement lies with the sequencing of environmental communities. This field of research, termed metagenomics, is growing fast and is the current hot topic. It is most often applied to characterize unculturable organisms (an estimated 99% of microbes cannot be cultivated in a laboratory environment), but it has also made it possible to sequence genomes without the problems that are associated with cultures maintained in laboratories. Metagenomics has broadened the range of organisms that can be studied by allowing the focus to move away from only those that can be cloned in culture. Depending on the source of the environmental sample subjected to environmental shotgun sequencing, the number of identified species can vary enormously. Looking at prokaryotes alone, as few as five species were identified in a community sequencing carried out on an acid mine biofilm (Tyson et al.); in contrast, as many as 3,000 species were sequenced from a soil sample taken in Minnesota, USA, and analyzed by Tringe et al. For a comprehensive review of this subject, see .
As described above, in the whole-genome approach the genome is fragmented into reads of defined length and then assembled using purely bioinformatic techniques. The second approach, which is more appropriate for larger genomes, adds a step to reduce the computational requirement in assembling the final sequence (Figure 1b). First, the genome is broken into larger fragments whose order is known; these fragments are then sequenced using the normal shotgun approach. This method requires less computational intervention in assembling the reads into the correct order: the order of each subset of reads is already known, and thus less error is incurred in the final assembly. Of course, there are disadvantages with each of these approaches. For instance, with the whole-genome approach, there is uncertainty as to whether the assembly is correct because of the total reliance on bioinformatics tools to join and order the reads; in addition, coverage (i.e., overlap between the fragments) may be insufficient. The second approach is time consuming and labor intensive because of the extra step at the beginning of the protocol; this approach is also susceptible to incomplete coverage. Further advances have been made since the advent of shotgun sequencing, but the central concepts remain the same.
Technologies currently used in genome sequencing include high-throughput methods such as 454, SOLiD (Applied Biosystems), and Solexa. These methods differ from older technologies in their throughput: hundreds of thousands of DNA molecules are sequenced at the same time, instead of single DNA clones being processed one by one. The reads returned by each of these technologies are very short; thus, assembly is rather difficult. This disadvantage is offset by the fact that so much DNA is sequenced. The sequencing methodology used by 454 in particular is called pyrosequencing. In essence, it sequences DNA by detecting enzymatic activity to identify the bases as they are incorporated, a process termed "sequencing-by-synthesis". Future developments will of course increase the length of the reads produced by these technologies, as well as the accuracy of the programs with which the fragments are assembled.
Past discussion has provided some insight into the pitfalls of each method and has perhaps aided the decision-making process [14, 16, 17]. One thing is certain: the higher the coverage a method is able to achieve, the higher the likelihood that the assembly tool will produce the correct result, so coverage should be one of the foremost considerations when choosing a method.
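The link between the amount of sequence produced and coverage can be made concrete with a short calculation. The following is a minimal sketch, not part of any sequencing pipeline discussed here: it assumes that reads start at uniformly random positions (an idealization that the text does not state), and the function names and example numbers are hypothetical.

```python
import math

def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced: c = N * L / G."""
    return num_reads * read_length / genome_size

def expected_uncovered_fraction(coverage: float) -> float:
    """Expected fraction of bases hit by no read at all.

    Assumes reads start uniformly at random along the genome, so the
    number of reads covering a base is approximately Poisson(c) and
    P(no read) = exp(-c).
    """
    return math.exp(-coverage)

# Hypothetical project: 4 Mb genome, 100,000 reads of 400 bp each.
c = mean_coverage(100_000, 400, 4_000_000)
uncovered = expected_uncovered_fraction(c)
print(f"coverage = {c:.1f}x, expected uncovered fraction = {uncovered:.2e}")
```

Under these assumptions, even high average coverage leaves a small expected fraction of the genome unsequenced, which is one reason why gaps remain to be closed after assembly.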
4. Base Calling and Genome Assembly
After genome sequencing is complete, it becomes necessary to reconstruct the sequence fragments into a meaningful order that accurately reflects the original orientation and order of the gene and junk (noncoding regions and pseudogenes) content. The most common and popular way of achieving this is through the Phred [18, 19]–PHRAP–CONSED pipeline of tools (all of which originate from the University of Washington).
When assembling sequences from the myriad of reads that encompass a genome, several factors must be accounted for. First, base-calling (the operation of determining the nucleotide base sequence from the chromatogram) must be completed with a minimum of erroneous interpretations of the trace. The nucleotide sequence of each read is determined by the base-caller; the assembler is then used to piece the reads together into their original order, but in doing so it must account for insertions, deletions, rearrangements, inversions, and sequence divergence. These events are particularly important when assembling using a comparative method (i.e., using the scaffold of an existing genome to predict the locations of the fragments in the newly sequenced genome). No assembler (to date) claims to handle all of these complications successfully, but some do claim to be more capable than others under certain circumstances. For example, Pop et al. reported that PHRAP is more adept at creating long contigs (sets of overlapping reads that together form a contiguous stretch of DNA) than other available methods such as TIGR Assembler or Celera Assembler (WGS-Assembler). Contig length can be valuable and has been used in the past as an indication of the success of an assembler. More recently, it has been reported that a reduction in the length of contigs across the assembly is an acceptable outcome if the error rate is reduced. Probably the most widely used base-calling algorithm is the one implemented in Phred [18, 19]. Others include GeneObject and Life-Trace.
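Base-callers such as Phred attach a quality value to every called base, defined as Q = -10 log10(P), where P is the estimated probability that the call is wrong. The following is a minimal sketch of that conversion together with a deliberately simple end-trimming rule; the trimming rule, the function names, and the threshold of 20 are illustrative assumptions, not the procedure used by Phred or by any assembler named above.

```python
import math

def error_prob_to_phred(p_error: float) -> float:
    """Quality value from an error probability: Q = -10 * log10(P)."""
    return -10.0 * math.log10(p_error)

def phred_to_error_prob(q: float) -> float:
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

def trim_low_quality_ends(bases: str, quals: list[int], min_q: int = 20) -> str:
    """Remove leading and trailing bases whose quality is below min_q.

    This is an illustrative rule only; interior low-quality bases are kept.
    """
    start, end = 0, len(bases)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end]

# A base called with Q = 30 has an estimated 1-in-1,000 chance of being wrong.
print(phred_to_error_prob(30))
print(trim_low_quality_ends("ACGTAC", [5, 10, 30, 40, 25, 3]))
```

Passing only the trimmed, higher-quality part of each read to the assembler reduces the number of spurious mismatches it has to reconcile when overlapping reads.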
PHRAP has been widely adopted as an integral component of assembly pipelines such as those implemented by Havlak et al. in the Atlas Genome Assembly System and by Mullikin and Ning in the Phusion Assembler. It is considered the standard way to assemble smaller genomes, whereas larger genomes rely on the more complex algorithms provided by programs such as the WGS-Assembler.
There is no up-to-date objective comparison of genome assemblers available that takes into account the continuous development being carried out on each project. Comparisons carried out by groups such as Huang and Madan and Chen and Skiena are works that seek to validate recently released methods. Chen and Skiena come closest to an objective comparison in their rigorous testing of their own creation, STROLL, against the then-latest versions of PHRAP by Green and the TIGR Assembler by Sutton et al. In their evaluation of the programs, they reported that PHRAP was consistently more accurate in producing the correct assembly and had the lowest error rates of the group. STROLL produced results similar to those of PHRAP, while TIGR Assembler produced a considerably more erroneous assembly. The TIGR Assembler produced significantly more and smaller contigs and left a higher proportion of gaps unclosed. In addition, running the TIGR Assembler on the read data took approximately five times longer than either of the other two programs evaluated.
In the race to publish the human genome in the early 2000s, the Celera Whole Genome Assembler was engineered to accommodate large genomes. Its first use was described by Myers et al. in the paper reporting the completion of the Drosophila genome. It was later enhanced and used in the initial assembly of the human genome and in the publication of the whole human genome assembly, in addition to the mouse, dog, and mosquito genomes. While Celera is a private corporation, it has released the Celera Assembler as open source software for free usage.
In early 2007, a new assembly algorithm, Minimus, was described by Sommer et al. It is a streamlined approach aimed at providing a simple, faster, and more efficient means of assembling fragmented sequences. Minimus performs best on small assembly jobs such as small genomes, genes, and bacterial artificial chromosome clones. It has also been assessed on larger sets of fragmented DNA, such as those found in bacterial genomes, and has been found to produce fewer assembly errors than PHRAP. The cost of this reduction in error rate is that the number of contigs is greater and, consequently, the size of the contigs is smaller, resulting in a more fragmented assembly. In addition, all test assemblies produced by Minimus were completed in approximately half the time that PHRAP used. It remains to be seen whether this new assembler will work its way into common use in assembly systems such as Phusion and Atlas, but it is unlikely to remain at an advantage for long, as the development of new and reworked assemblers is swift and continuous. It has been suggested that it is beneficial to use more than one method, so that the exclusive advantages of each may be exploited. This strategy may well be more time consuming, but if the time is affordable, it should be implemented.
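The trade-off between contig number and contig size can be quantified with a few summary statistics. The following is a minimal sketch under stated assumptions: the N50 statistic used here is a commonly used measure of assembly contiguity that is not mentioned in the comparisons above, and the contig lengths are invented for illustration rather than taken from any of the cited evaluations.

```python
def contig_stats(contig_lengths):
    """Summarize an assembly by contig count, total size, largest contig, and N50.

    N50 is the length L such that contigs of length >= L contain at least
    half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "num_contigs": len(lengths),
        "total_bases": total,
        "largest": lengths[0] if lengths else 0,
        "n50": n50,
    }

# Two hypothetical assemblies of the same 1,000 bases of sequence.
contiguous = contig_stats([500, 300, 200])
fragmented = contig_stats([250, 250, 200, 150, 150])
print(contiguous)
print(fragmented)
```

Statistics of this kind describe contiguity only; as noted above, a more fragmented assembly can still be preferable if it contains fewer errors.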
5. Annotating the Genome
Distilling information from the assembled genome is the next obvious step in the process of building biological understanding of each newly sequenced individual or species. Genome annotation has three main levels: nucleotide-level annotation, protein-level annotation, and process-level annotation. Nucleotide-level annotation itself involves several procedures. The first is mapping, the process of identifying known genes, markers, and landmarks within the genome. This is usually carried out using sequence similarity searching programs such as BLAST. The second, gene finding, involves, as the name indicates, the prediction of gene locations within the genome. Within the genes, the locations of introns and exons are sought in an effort to classify the DNA into coding and junk categories. This is not a trivial process and often results in very poor sensitivity and specificity; in particular, results are poor when the signal-to-noise ratio is low, i.e., when the amount of noncoding DNA is high (for a more elaborate review and comparison of gene prediction algorithms, see ).
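For intron-free (for example, bacterial) sequences, the simplest form of gene finding is a scan for open reading frames. The following is a minimal sketch of such a scan and is not the algorithm of any gene predictor referred to above: it assumes ATG as the only start codon and the standard three stop codons, ignores introns entirely, and uses hypothetical function names and a hypothetical minimum-length threshold. To search the opposite strand, the same function can be applied to the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq: str, min_codons: int = 30):
    """Find ATG...stop stretches in the three forward reading frames.

    Returns a list of (start, end, frame) tuples with 0-based start,
    exclusive end (stop codon included), and frame in {0, 1, 2}.
    Only ORFs with at least min_codons codons before the stop are kept.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the candidate and look for the next start
    return orfs

# Toy sequence: a five-codon ORF embedded in flanking bases.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```

The sensitivity and specificity problems described above appear even in this toy setting: a low length threshold reports many spurious frames in noncoding DNA, while a high threshold misses short genuine genes.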
Because of the extraordinary number of genes and sequences that have already been characterized in one species or another, much of the effort required to identify genes is saved. Also to be identified are noncoding regions including, for example, tRNAs and rRNAs. These are mostly characterized, once again, by means of similarity searches and by using programs such as tRNAscan-SE. Other regions that must be discovered are regulatory regions, such as transcription factor binding sites, a topic covered in detail in a review paper. In brief, methods have been developed to identify these regions by looking for patterns that occur more often than would be expected by chance; this strategy is often carried out in conjunction with similarity searching techniques.
At the protein-level annotation step, the predicted genes are characterized. Genes are named and assigned functions, mostly by means of comparison to already annotated genomes. Often this results in the categorization of many proteins into "unknown function" or "hypothetical protein" categories until experimentation sheds light on the purpose of the gene at hand.
The final level of annotation is process. Here, the biological processes affected by the gene are identified. Process categories usually include cell cycle, cell death, immune response, and metabolism, to name but a few. Once again, the processes affected are usually determined via comparison with the information that is already available. It is useful to note here the existence of a few well-established databases that have devised naming conventions and controlled vocabularies for the description of new genes. Probably the most commonly used of these are Panther [45–48] and GO [49, 50]. Both are freely available via the World Wide Web and are widely accepted and adhered to by the genome analysis community.
Much work has been done to develop quicker and more reliable ways of identifying and handling the protein-coding regions of a genome; noncoding regions, while not completely neglected, have been the less studied of the two. The detection of neither coding nor noncoding regions is easy, nor is the development of reliable and robust methods nearing a plateau. Constant progress is being made in these fields; thus, the literature should be watched closely in order to keep up to date with current best practices in annotation.
6. Closing the Genome
Closing and completing a genome-sequencing project has proved to be an important step in ensuring the accuracy and reliability of the output deposited in public databases. While the release of draft sequences is very useful, they are notoriously erroneous, both in sequence and in assembly. Error rates for draft sequencing have been reported to be 1 in 1,000–2,000 base pairs, in contrast to the rates of 1 in 10,000 reported by Selkov et al. and 1 in 100,000 reported by Fleischmann for whole genome sequencing. The typical errors found in draft sequences are sequencing errors, sequence misassemblies, and the inclusion of contaminant sequences from foreign DNA as bona fide reads. Finding the source of such problems is difficult and time consuming and is often carried out manually. The most important factor to take into account here is the economic tradeoff and whether it is worth the compromise. For example, are there enough financial resources to allow the whole genome sequencing to be brought to a close? It is important to realize that the quality of the sequencing, or lack thereof, will propagate forward into whatever analysis is carried out using the DNA sequence. Negative effects will be evident in all downstream analyses; everything from annotation and gene recognition to subsequent identification of homologs, gene families, and phylogenetic relationships will be affected.
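The practical meaning of these error rates becomes clearer when they are scaled to a whole genome or to a single gene. The following is a minimal sketch that uses the rates quoted above; it assumes that errors are independent and evenly distributed along the sequence, which is a simplification, and the genome size and gene length are hypothetical examples.

```python
def expected_errors(genome_size_bp: int, bases_per_error: int) -> float:
    """Expected number of erroneous bases at a rate of 1 error per bases_per_error."""
    return genome_size_bp / bases_per_error

def prob_at_least_one_error(length_bp: int, bases_per_error: int) -> float:
    """Probability that a stretch of length_bp contains at least one error,
    assuming independent errors with per-base probability 1 / bases_per_error."""
    return 1.0 - (1.0 - 1.0 / bases_per_error) ** length_bp

# Hypothetical 4 Mb genome and a 1 kb gene, at the rates quoted in the text.
for rate in (1_000, 10_000, 100_000):
    total = expected_errors(4_000_000, rate)
    per_gene = prob_at_least_one_error(1_000, rate)
    print(f"1 in {rate}: {total:.0f} expected errors, "
          f"P(error in a 1 kb gene) = {per_gene:.3f}")
```

Under these assumptions, most 1 kb genes in a draft at 1 error per 1,000 bases would carry at least one error, which illustrates how draft quality propagates into gene prediction and homolog identification.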
While the methods of sequence assembly discussed above are thorough and have relatively low error rates, they are not capable of producing a completely reconstructed genome sequence without manual intervention and some potential resequencing. What the methods do produce is a draft sequence that would normally cover approximately 99% of the genome under reconstruction. This draft stage of assembly can be reached within a few days. In contrast, the process of closing the assembly may require months to complete and in some instances may take years. For example, the draft human genome was published in 2001, 4 years ahead of the predicted date of availability (2005). The complete genome was, however, not finished until 2003 and was subsequently published in 2004. The time and, consequently, the monetary cost incurred is a sacrifice that those in the area of comparative genomics are willing to make, as the quality provided by a closed genome is well worth the wait. Moreover, while useful in their own right, draft assemblies are constantly changing and potentially erroneous.
To meet the need for high quality complete genome sequences, several strategies have been developed at facilities such as TIGR, Washington University and Sanger. In some cases, a certain amount of error checking is carried out in conjunction with assembly. Programs such as EULER  and Arachne  are examples of assembly systems that include error correction components. Other approaches include the use of correction algorithms a posteriori to the assembly process. Examples of this type of program are Autofinish (of the wider package—CONSED) , MisEd , and ReDit . Autofinish, one of the most popular computer programs, is used in many genome sequencing centers, such as The Genome Center at the University of Washington, the Berkeley Drosophila Genome Project at Lawrence Berkeley National Laboratory, and the Lita Annenberg Hazen Genome Center at Cold Spring Harbor Laboratory among others . The product of the program must be manually inspected to ensure the quality and accuracy, but the amount of human intervention in this program is significantly reduced. In projects that had sequence coverage as low as four and five times, the human time required to close the project was reduced by more than 51% and 83%, respectively . As the sequence coverage increased up to 14 times, the difference diminished, but consistently less human effort was required when Autofinish was utilized.
The finishing techniques employed in programs such as Autofinish mirror what a human finisher does in identifying problem areas in the assembly that has been produced. They go on to propose possible means of resolving the issues, indicating regions to be resequenced and potential reads to aid in closing any gaps that are present. Because of the nature of the problems found in draft genome sequences, finishing is an iterative process that can require many cycles through a workflow to resolve all issues; frequently, a human finisher has to become involved toward the end to complete the process. This intervention must be as efficient as possible, and many graphical viewers and editors are available for this purpose. Examples of manual finishing software include components of the aforementioned CONSED (a sequence finishing tool) and ReDit (a shotgun assembly finishing aid), as well as BaCCardI (validation and assistance in finishing) and DNPTrapper (analysis of complex regions and finishing). Each of these programs aims to make the editing process as user friendly as possible while offering the best possible combination of editing capabilities.
7. Comparative Genomics: Solving the Puzzle
Comparative genomics is one of the most promising areas that logically follows from the success in improving genome sequencing. More and more comparative genomics programs are in demand to identify protein-coding genome regions, the placement of regulatory elements, and the main evolutionary dynamics affecting the complexity of genome organization. Despite their apparent simplicity, such comparative methods have to face many technical as well as theoretical problems. One of the most important is aligning whole genomes and visualizing such alignments in a comprehensive and comprehensible way. This problem in sequence alignment leads to other genomic problems, such as finding orthologs between genomes. The problem is magnified when the comparison is made between genomes with different population dynamics and hence different mutational rates, as we will explain below.
7.1. The First Hurdle—How to Determine the Homologs (Orthologs and/or Paralogs)?
Identification of homologous genes relies on the appropriate definition of a homolog. The most widely accepted definition is that homologous genes share a common ancestry. This definition, however, is not precise as to the nature of this common ancestry and comprises two types of homologs (as described by Fitch and by Fitch and Margoliash): orthologs (common ancestry arising from a speciation event, in such a way that the homologous genes are in different species) and paralogs (common ancestry arising from a gene duplication event, as a consequence of which the homologous genes are present in the same species).
Irrespective of the nature of the ancestry considered, homologs are usually identified on the basis of sequence similarity: the higher the similarity, the more likely it is that the sequences have derived from a common ancestor. One of the first and most commonly used programs to detect the degree of similarity between sequences is BLAST, along with its newer variant PSI-BLAST. BLAST uses predefined scoring matrices, whereas PSI-BLAST uses position-specific scoring matrices derived from the hits found in the initial search. The two programs yield a score for each comparison and its statistical significance, expressed as the e-value (the number of hits with at least that score expected by chance). Sequences with the highest scores, and therefore with the lowest e-values, are considered to be the closest relatives in the searched database. The assumption underlying this software is that the phylogenetic relationship between any two sequences and their degree of similarity are positively correlated. This, however, leads to another theoretical problem: how to determine whether a sequence is more similar to one particular sequence than it is to another. Unfortunately, setting a statistical cutoff value to decide when two sequences are significantly similar is rather difficult and problematic when determining a set of possible homologs. The lower (stricter) the e-value cutoff, the larger the number of false negatives; the higher (more permissive) the cutoff, the larger the number of false positives. As an additional drawback, the sequences with the highest score and lowest e-value are not always more closely related to each other than those identified as hits with a lower score.
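The relationship between score, search-space size, and e-value follows the Karlin-Altschul statistics on which BLAST is based: for a raw score S, E = K m n exp(-lambda S), or equivalently E = m n 2^(-S') for the normalized bit score S'. The following is a minimal sketch of these formulas; the function names are hypothetical, the lambda and K values in the example are placeholders rather than parameters of any particular scoring matrix, and real BLAST implementations additionally apply length corrections that are omitted here.

```python
import math

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with at least this bit score:
    E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bits)

def passes_cutoff(bits: float, query_len: int, db_len: int, max_e: float) -> bool:
    """True if the hit would be kept under an e-value cutoff of max_e."""
    return expect_value(bits, query_len, db_len) <= max_e

# The same bit score is less significant in a larger database.
print(expect_value(50.0, 300, 1_000_000))
print(expect_value(50.0, 300, 1_000_000_000))
```

Because the e-value scales with database size, a fixed cutoff does not correspond to a fixed level of sequence similarity, which is part of the difficulty in choosing a threshold for homolog detection.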
In BLAST searches for homologs, many types of relationships between the homologs can be investigated, including many-to-many, one-to-many, or very strict one-to-one relationships. The first two are a result of duplication events after speciation. A very effective way to identify one-to-one relationships is the approach generally called reciprocal best BLAST hits [65, 66]. This method is based on the assumption that genes that are each other's best hits in a BLAST search are more likely to be orthologs than ones that are not. The reciprocal check is needed because, although gene B in genome 2 may be the best match for gene A in genome 1, gene B may itself match a different gene C in genome 1 even better. This approach is again limited by the assumption that best hits ensure orthology, which might not hold when a particular gene underwent a recent duplication in a particular lineage. The consequence is that when a gene finds a paralog as its top BLAST hit instead of its ortholog, both the gene and its paralog are excluded from downstream analyses. These limitations of BLAST searches have fuelled the development of other ways to identify putative orthologs over the last few years. One such method uses sequence distances instead of similarities to identify orthologs, by means of the reciprocal smallest distance algorithm. It uses global sequence alignment and maximum likelihood to estimate the evolutionary distances between genes in order to detect orthologous genes. This approach has also been used to determine orthologs in databases such as Roundup. Another simple approach that has contributed significantly to reducing the number of false positive results in BLAST searches is PSI-BLAST.
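The reciprocal best hit criterion is straightforward to express once the search results are available as tables of hits. The following is a minimal sketch, assuming that each search result has already been parsed into (query, subject, score) tuples; the function names and the toy gene identifiers are hypothetical, and ties between equally scoring hits are resolved arbitrarily in favor of the first one seen.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits is an iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(hits_1_vs_2, hits_2_vs_1):
    """Return sorted (gene_in_genome_1, gene_in_genome_2) pairs that are
    each other's best hit in the two search directions."""
    forward = best_hits(hits_1_vs_2)
    backward = best_hits(hits_2_vs_1)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)

# Toy example: A's best hit is B, but B matches C even better.
genome1_vs_genome2 = [("A", "B", 200.0), ("C", "B", 250.0)]
genome2_vs_genome1 = [("B", "C", 250.0), ("B", "A", 200.0)]
print(reciprocal_best_hits(genome1_vs_genome2, genome2_vs_genome1))
```

In the toy example only the pair (C, B) is retained; gene A is dropped because its best hit is not reciprocated, which is the behavior that also causes recent paralogs to be excluded.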
Homology may also be ascertained by means of phylogenetic methods such as BranchClust by Poptsova and Gogarten. This type of method is capable of determining homology and of distinguishing orthology from paralogy. BranchClust uses similarity searching during the execution of its algorithm but does not rely solely on it. Hits within a certain threshold are used, rather than only the best hit, in order to include both paralogs and orthologs. These results are then grouped into what Poptsova and Gogarten have termed superfamilies. The sequences of each superfamily are aligned and phylogenetic trees are constructed. The step of phylogenetic inference is then followed by a complex algorithm that is described fully in the application's article. BranchClust is reported to outperform the more traditional similarity search methods because it has a lower false negative rate than the reciprocal best BLAST hit method.
Irrespective of the method used to identify homologs, visualizing results is a common way to gain first insights into trends and patterns in genome evolutionary dynamics. This has inspired the creation of software for comparative genomics with graphical solutions to assist in the interpretation of the results. These solutions provide user-friendly environments in which navigation along alignments and similar tasks are easy and reliable. The question remains, however, whether visualization tools can solve the puzzle of genome rearrangements. An argument against the use of such techniques is that the process will not be repeatable or statistically sound. Insights will undoubtedly be gained, but all perceived trends should be investigated in a more analytically robust manner.
7.2. Pairwise Genome Comparisons
Aside from syntenic analyses using visualization tools, other programs have been developed to search for other types of information in comparative genomics. For example, GC Comparison Graph from the CMR compares the GC content of two genomes by plotting orthologs along the axes according to their GC content, highlighting GC compositional shifts at the genome level between the two genomes. Although useful, these programs are subject to several practical drawbacks, the most important of which is that they cannot perform multiple genome comparisons and hence cannot establish the ancestry of genome rearrangement dynamics.
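The quantity compared in such a graph is simple to compute. The following is a minimal sketch, not the CMR implementation: it assumes that ortholog pairs have already been identified and that gene sequences are available as dictionaries keyed by gene identifier; all names and the toy sequences are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0

def ortholog_gc_pairs(orthologs, genes_a, genes_b):
    """For each (id_a, id_b) ortholog pair, return (id_a, id_b, gc_a, gc_b).

    genes_a and genes_b map gene identifiers to nucleotide sequences.
    The two GC values are the coordinates of one point in a comparison plot.
    """
    return [
        (id_a, id_b, gc_content(genes_a[id_a]), gc_content(genes_b[id_b]))
        for id_a, id_b in orthologs
    ]

# Toy data: the second genome is AT-enriched relative to the first.
genes_a = {"a1": "GGCCATGC", "a2": "GCGCGCAT"}
genes_b = {"b1": "GACTATAC", "b2": "ATGCATAT"}
for row in ortholog_gc_pairs([("a1", "b1"), ("a2", "b2")], genes_a, genes_b):
    print(row)
```

Points that fall consistently on one side of the diagonal in such a plot indicate a genome-wide compositional shift of the kind the tool is designed to highlight.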
7.3. Multiple Genome Comparisons
As the number of genomes increased over the last decade, the demand for an understanding of the dynamics of genome evolution also increased. Progress in dealing with the complexity of multiple genome comparisons was long held back because the development of appropriate software tools did not keep pace. Several such tools are now available. An example of a multiple genome comparison tool is GenColors from the Jena Prokaryotic Genome Viewer (JPGV). This program allows the user to display a number of features on the genome, such as CDS, RNA genes, tRNA genes, rRNA genes, misc RNA, GC content, GC skew, and keto excess. It also represents genomes in either a circular diagram or a linear plot. Although several genomes can be examined at the same time using this tool, the results are human observations of the genomes rather than real phylogenetic studies of the genome properties. JPGV allows multiple genome comparisons by determining a core gene set of two or more genomes, defined by the set of best-bidirectional hits for all possible pairs of genes. Other methods of the JPGV are implemented to perform pairwise comparisons only.
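One of the displayed features, GC skew, is defined for a stretch of sequence as (G - C) / (G + C) and is usually computed in windows along the genome. The following is a minimal sketch of that calculation and is not taken from GenColors or JPGV; the function names, window size, and step size are hypothetical choices.

```python
def gc_skew(window: str) -> float:
    """GC skew of one window: (G - C) / (G + C), or 0.0 if there is no G or C."""
    window = window.upper()
    g, c = window.count("G"), window.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0

def sliding_gc_skew(seq: str, window: int = 1000, step: int = 1000):
    """Return (window_start, skew) pairs along seq.

    Only full windows are reported, except that a sequence shorter than
    one window is treated as a single window.
    """
    last_start = max(len(seq) - window + 1, 1)
    return [(i, gc_skew(seq[i:i + window])) for i in range(0, last_start, step)]

# Toy sequence whose skew switches sign halfway along.
print(sliding_gc_skew("GGGGCCCC", window=4, step=4))
```

Plotted along a circular or linear genome map, the sign of the skew gives a quick visual summary of strand compositional asymmetry.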
In addition, there are computational tools that compare multicircular prokaryotic genomes and present their similarities in a circular diagram. Some of these tools, such as the CGView server, perform these comparisons in addition to BLAST searches. Others, such as GenomeViz, also display the GC percentage of each genome.
To gain more information about genome rearrangements and inversions, a great effort has been made to develop tools that perform linear comparisons between genomes. These tools compare genomes by performing genome alignments where possible and then conducting multiple genome comparisons. There are many different multiple genome alignment algorithms. The first type is based on defining a reference genome and performing alignments relative to that reference. This type of alignment algorithm is implemented in a program called Vista. In the second approach, an iterative pairwise alignment is performed under the control of a guide tree, which defines the order in which the genomes are added to the alignment. The third type of algorithm determines anchors present in all genomes and then proceeds to align them. Once the anchors are aligned, the last step is to close the gaps between them by aligning the substrings that lie between them. Examples of programs implementing this type of algorithm are MGA, M-GCAT, and Mauve, each of which has its own algorithm for identifying the anchors and for subsequently aligning the interanchor regions.
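The anchor-based strategy can be illustrated with a toy anchor finder. The following is a minimal sketch under strong simplifying assumptions and is not the algorithm of MGA, M-GCAT, or Mauve: it uses fixed-length exact words (k-mers) that occur exactly once in every genome as anchors, considers the forward strand only, and does not extend, chain, or align anything. All names are hypothetical.

```python
from collections import Counter

def unique_kmer_positions(seq: str, k: int) -> dict:
    """Map every k-mer that occurs exactly once in seq to its 0-based position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}

def shared_anchors(genomes, k: int):
    """Find k-mers that are unique in every genome and present in all of them.

    Returns a list of (kmer, positions) tuples, where positions holds one
    coordinate per genome, sorted by position in the first genome.
    """
    per_genome = [unique_kmer_positions(genome, k) for genome in genomes]
    shared = set(per_genome[0])
    for positions in per_genome[1:]:
        shared.intersection_update(positions)  # keep k-mers unique in this genome too
    anchors = [(kmer, tuple(p[kmer] for p in per_genome)) for kmer in shared]
    return sorted(anchors, key=lambda anchor: anchor[1][0])

# Toy genomes in which two blocks have swapped places.
for kmer, positions in shared_anchors(["ACGTTGCA", "TTGCACGT"], k=4):
    print(kmer, positions)
```

In the toy example the anchors appear in a different order in the second genome than in the first, which is the signature of a rearrangement; the regions between consecutive collinear anchors are what an alignment program would then align.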
Other tools offer functions beyond the alignment of genomes. For example, MANTIS is a tool specific to one phylogenetic group (the metazoan phylogeny) that analyzes the patterns of gene gains and losses at specific branches of the phylogeny. The program then infers the gene content of the ancestral genome of a clade and identifies over- or underrepresentation of certain processes among the gained or lost genes.
Despite all these efforts to develop more robust and accurate methods for comparative genomic studies, several biological phenomena make it difficult to identify the real processes of genome dynamics in organisms. For example, genome duplication, genome shrinkage in intracellular symbiotic bacteria, and lateral gene transfer may well obscure the genome rearrangements that particular genomes have undergone. To illustrate the importance of organismal biology for understanding genome dynamics, we focus the rest of the review on intracellular bacterial genomes.
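One generic way to place gene gains and losses on the branches of a phylogeny, and to infer ancestral gene content, is Fitch parsimony applied to gene presence and absence. The sketch below is offered only as an illustration of that general idea and should not be read as the method used by MANTIS; the tree, the gene names, and the rule of resolving ambiguous ancestral states toward absence are all assumptions made for this example.

```python
def annotate(tree, leaf_genes, gene):
    """Bottom-up Fitch pass for one gene (True = present, False = absent).

    tree: a leaf name (str) or a (left, right) tuple of subtrees.
    """
    if isinstance(tree, str):
        return {"leaves": (tree,), "set": {gene in leaf_genes[tree]}, "kids": []}
    left, right = (annotate(sub, leaf_genes, gene) for sub in tree)
    common = left["set"] & right["set"]
    return {
        "leaves": left["leaves"] + right["leaves"],
        "set": common or (left["set"] | right["set"]),
        "kids": [left, right],
    }

def resolve(node, parent_state, events):
    """Top-down pass: keep the parent's state where allowed, record changes.

    Ambiguous states are resolved toward absence (assumption of this sketch).
    A branch is identified by the tuple of leaves below it.
    """
    if parent_state in node["set"]:
        state = parent_state
    else:
        state = min(node["set"])          # False sorts before True
    if parent_state is not None and state != parent_state:
        events.append(("gain" if state else "loss", node["leaves"]))
    for kid in node["kids"]:
        resolve(kid, state, events)
    return state

def gains_and_losses(tree, leaf_genes):
    """Infer the root state and the gain/loss events for every gene."""
    result = {}
    for gene in sorted(set().union(*leaf_genes.values())):
        root = annotate(tree, leaf_genes, gene)
        events = []
        ancestral = resolve(root, None, events)
        result[gene] = {"ancestral": ancestral, "events": events}
    return result

# Hypothetical four-taxon tree and gene content.
tree = ((("A", "B"), "C"), "D")
leaf_genes = {
    "A": {"core", "lostB", "newA"},
    "B": {"core"},
    "C": {"core", "lostB"},
    "D": {"core", "lostB"},
}
history = gains_and_losses(tree, leaf_genes)
for gene, info in history.items():
    print(gene, info)
```

For this toy input the gene labelled lostB is inferred to be ancestral and lost on the branch leading to B, whereas newA is inferred to be a gain on the branch leading to A. In genomes undergoing massive reduction, parsimony of this kind may be misleading when losses are far more frequent than gains, so asymmetric gain and loss costs are often preferable.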
8. Comparative Genomics of Intracellular Bacteria
Intracellular bacteria are a special group of organisms that have been able to adapt to intracellular life, establishing either a symbiotic or a pathogenic relationship with the host. Because many of the genes that were important for the free-living lifestyle are no longer needed by these bacteria, these genes underwent nonfunctionalization followed by disintegration. This process has been enhanced by the fact that the host provides these bacteria with some of the components they need and with a chemically stable, rich environment. Genome shrinkage is therefore observed in most, if not all, strictly intracellular bacteria, and it has mostly been accompanied by genome rearrangements and fast rates of protein evolution. Because of these genomic and evolutionary events associated with intracellular life, comparative genomic analyses, including the identification of orthologs and paralogs and synteny analyses, pose a great challenge when intracellular bacteria are compared with free-living bacteria, and biological information must be included in such analyses to increase the accuracy of the results.
In the case of symbiotic relationships, the difficulty of comparative genomics acquires a further dimension of complexity that is specifically associated with the mutational dynamics of these organisms. There are two main groups of symbiotic bacteria: facultative and obligate. A facultative association implies that each partner can survive without the other under particular environmental conditions. This is seen, for example, between the pea aphid Acyrthosiphon pisum and the facultative endosymbiont Hamiltonella defensa, which protects the aphid against parasitism by the solitary endoparasitoids Aphidius ervi and Aphidius eadyi [90–92]. In the other case, the obligate association, the relationship between the two organisms is so close that the relative biological fitness of the host would be seriously compromised if it were deprived of the symbiont. This is the case for the symbiotic relationship between the bacterium Buchnera aphidicola and its aphid host, an example in which the host has evolved specialized cells, called bacteriocytes, to house its endosymbionts. This relationship is one of the best characterized in the literature, so the remainder of this review focuses on endosymbionts contained in bacteriocytes and on the challenges that their mutational dynamics pose for the comparative genomics of bacteria.
8.1. Genome Evolution of Intracellular Bacteria
The clonal, vertical transmission of small populations of many intracellular symbiotic bacteria and pathogens to the next host generation imposes a strong bottleneck on the effective population size of these bacteria. This results in relaxed selective constraints on the symbiont genomes, which are channeled into a dynamic of neutral fixation of slightly deleterious mutations and an irreversible increase in the mutational load of the endosymbiont genome (a phenomenon known as Muller's ratchet). However, these bacteria are also subject to selection, imposed mostly at the level of their insect hosts. Because of their clonal transmission and their confinement to the interior of bacteriocytes, symbiotic bacteria have little or no opportunity for recombination and hence no alternative means of removing these slightly deleterious mutations.
Is there a minimum set of genes necessary for the maintenance of intracellular life? Numerous scientists have addressed this question, and many have attempted to answer it by studying the smallest endosymbiotic genome. Comparative genomics studies in a large number of organisms have shown that the minimal gene content depends on the environmental conditions under which the organism lives [97, 98].
The process of gene loss in intracellular organisms has an important effect in rewiring the functional relationships among genes. As a result, different organisms may rely on different genes to perform the same essential cellular functions. When examining the gene content of intracellular bacteria, we should therefore consider functional groups of genes rather than individual genes.
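The irreversibility of Muller's ratchet can be illustrated with a toy simulation. The following Python sketch assumes a strictly asexual Wright-Fisher population with no recombination and no back mutation, at most one new deleterious mutation per individual per generation, and multiplicative fitness; the function name and all parameter values are hypothetical and are not taken from any study cited here.

```python
import random

def simulate_ratchet(pop_size, generations, mut_prob=0.3, sel_coeff=0.02, seed=1):
    """Track the mutation count of the least-loaded class over time.

    Each individual is represented only by its number of deleterious
    mutations. Offspring are sampled from parents in proportion to
    fitness (1 - sel_coeff) ** mutations, and each offspring acquires
    one extra mutation with probability mut_prob.
    """
    rng = random.Random(seed)
    loads = [0] * pop_size
    least_loaded = []
    for _ in range(generations):
        weights = [(1.0 - sel_coeff) ** k for k in loads]
        parents = rng.choices(loads, weights=weights, k=pop_size)
        loads = [k + (rng.random() < mut_prob) for k in parents]
        least_loaded.append(min(loads))
    return least_loaded

# Compare a strongly bottlenecked population with a larger one.
small = simulate_ratchet(pop_size=10, generations=200)
large = simulate_ratchet(pop_size=1000, generations=200)
print("least-loaded class after 200 generations:", small[-1], "vs", large[-1])
```

Because offspring can never carry fewer mutations than their parent in this model, the mutation count of the least-loaded class can only stay constant or increase, which is the click of the ratchet. One would generally expect the small population to lose its least-loaded class more often than the large one, although the exact numbers depend on the assumed parameters and on the random seed.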
8.2. Difficulties with Comparative Genomics of Bacteriocyte-Housed Insect Endosymbionts
Comparing bacterial genomes may provide clues about the main genome rearrangement dynamics that underlie different lifestyles; one example is the comparison of intracellular symbiotic bacteria with their closest free-living relatives. Comparing bacteria that are at an intermediate stage between the free-living lifestyle and host-specific symbiosis (the latter represented by primary endosymbionts) with each of these two groups could shed some light on the establishment of symbiosis itself. These intermediate bacteria are the ones we refer to as secondary endosymbionts. They are distinguishable from primary symbionts by their larger genomes and by the fact that they do not live under the protection of the bacteriocytes provided by their hosts.
As a consequence of Muller's ratchet in combination with mutational bias, the genomes of intracellular bacteria have a higher AT content than those of their free-living relatives. As a result, programs such as BLAST have greater difficulty identifying homologs, especially between intracellular bacteria and their free-living relatives.
The difficulty of performing comparative genomics with intracellular bacteria is that few, if any, software programs have been designed to deal with the theoretical problems seen in these organisms. Most software and methods have been directed toward the mainstream task of comparing genomes of similar size that belong to bacteria differing only slightly in lifestyle or environmental conditions. The challenge, however, lies in identifying the important genomic dynamics that occurred during the transition between two lifestyles and hence between potentially different biological systems.
One of the biggest problems in comparing the genomes of endosymbiotic and pathogenic bacteria with those of their closest free-living relatives is the difference in the evolutionary forces under which they evolve. Because populations of endocellular symbiotic bacteria undergo strong bottlenecks during transmission between host generations, many of the stochastically produced amino acid mutations are fixed by genetic drift despite their slightly deleterious effects. This implies that the mean mutational load of endocellular bacteria increases dramatically, which makes it seriously difficult to find their orthologs in free-living bacteria. Comparing endosymbionts with each other can yield valuable information about endosymbiosis, but it is crucial to compare endosymbionts with free-living bacteria in order to investigate the transition from a free-living to an intracellular lifestyle and to predict the shift in evolutionary forces. Novel methods are hence required to account for the biological and population genetic differences between the organisms whose genomes are being compared.
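The compositional shift described here is straightforward to quantify. As a minimal sketch, the Python function below computes the AT fraction of a nucleotide sequence while ignoring ambiguous bases; the function name and the two short sequences are made up for illustration and do not come from real genomes.

```python
def at_content(seq):
    """Return the fraction of A and T among the unambiguous bases of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "AT" for b in bases) / len(bases)

# Hypothetical fragments standing in for a symbiont and a free-living relative.
symbiont_like = "ATTAAATTTATAGCAATTAAT"
free_living_like = "ATGCGCCAGGCTGACCTGAAC"
print(round(at_content(symbiont_like), 2), round(at_content(free_living_like), 2))
```

Reporting such a value alongside a homology search gives a quick indication of when compositional bias, rather than true divergence in function, might be depressing similarity scores.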
8.3. Databases and Methods for the Analysis of Endosymbionts
BuchneraBASE is a database that contains information on Buchnera sp. APS. To our knowledge, it is the only database of its kind devoted entirely to a primary symbiont. It does not offer the user a direct comparative genomics tool, as many other databases do, but it contains some data obtained from comparisons between symbiotic gamma-proteobacteria and an in silico model of Escherichia coli. The database was built so that newly sequenced genomes from symbiotic bacteria could be integrated as they became available. It performs comparisons between different genomes using gene orthology information. The database also has a summary page that shows two user-interactive tables. The first table gives the number of genes in each genome that fall into a given category, for example, the total number of complete genes, the total number of pseudogenes, and the number of genes with an E. coli ortholog that are shared or not shared with the endosymbiont Wigglesworthia glossinidia. The second table can be used to browse the functional classifications for each of the symbionts stored in the database.
To our knowledge, only one program, GRAST, has been developed with the sole purpose of investigating the evolutionary dynamics of endosymbionts. It performs a pairwise comparison between a free-living genome (the reference genome) and an endosymbiont genome and allows the user to choose among different output options, providing valuable insights into how the genome dynamics of endosymbionts have changed relative to their free-living relatives. The outputs range from genome plots in which the sets of orthologous and nonorthologous genes are plotted on the two genomes being compared to plots analyzing the distribution of genome rearrangements or dynamics in one of the genomes (Figure 3). Among other types of information, the program reports conserved regions between the two genomes, the distribution of percentage differences in the number of genes in each functional category between the two genomes, deviations from the expected percentage of orthologs between the genomes, and information about intergenic regions according to their position or rearrangement in the two genomes.
A brief look at the genome sizes of bacteria suffices to appreciate the incredible diversity of the genome dynamic events that have occurred throughout evolution. These events are key to understanding the different evolutionary processes that shape organismal organization. Intracellular organisms account for only a minority of this diversity, but they represent extreme cases in which most genome dynamics are dramatically manifested. New methods should therefore be developed to perform in-depth comparative genomic analyses of these bacteria and to infer important shifts in the evolution of genomes.
The genomic era has exploded and generated new research avenues that go beyond all expectations. A plethora of novel ways of designing experiments and computational tools has been fuelled by the information generated from the first comparative genomics analyses. The remaining challenge is to design new, comprehensive, and accurate bioinformatics tools capable of offsetting our limited ability to analyze the overwhelming amount of genomic data being generated.
- Rappe MS, Giovannoni SJ: The uncultured microbial majority. Annu Rev Microbiol. 2003, 57: 369-394. 10.1146/annurev.micro.57.030502.090759.
- Hugenholtz P: Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002, 3 (2): REVIEWS0003. 10.1186/gb-2002-3-2-reviews0003.
- Chen K, Pachter L: Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput Biol. 2005, 1 (2): 106-112. 10.1371/journal.pcbi.0010024.
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004, 428 (6978): 37-43. 10.1038/nature02340.
- Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative metagenomics of microbial communities. Science. 2005, 308 (5721): 554-557. 10.1126/science.1107851.
- Riesenfeld CS, Schloss PD, Handelsman J: Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004, 38: 525-552. 10.1146/annurev.genet.38.072902.091216.
- Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA. 1977, 74 (12): 5463-5467. 10.1073/pnas.74.12.5463.
- Edwards A, Voss H, Rice P, Civitello A, Stegemann J, Schwager C, Zimmermann J, Erfle H, Caskey CT, Ansorge W: Automated DNA sequencing of the human HPRT locus. Genomics. 1990, 6 (4): 593-608. 10.1016/0888-7543(90)90493-E.
- Green P: Whole-genome disassembly. Proc Natl Acad Sci USA. 2002, 99 (7): 4143-4144. 10.1073/pnas.082095999.
- Kaiser O, Bartels D, Bekel T, Goesmann A, Kespohl S, Puhler A, Meyer F: Whole genome shotgun sequencing guided by bioinformatics pipelines—an optimized approach for an established technique. J Biotechnol. 2003, 106 (2–3): 121-133. 10.1016/j.jbiotec.2003.08.008.
- Tauch A, Homann I, Mormann S, Ruberg S, Billault A, Bathe B, Brand S, Brockmann-Gretza O, Ruckert C, Schischka N, Wrenger C, Hoheisel J, Mockel B, Huthmacher K, Pfefferle W, Puhler A, Kalinowski J: Strategy to sequence the genome of Corynebacterium glutamicum ATCC 13032: use of a cosmid and a bacterial artificial chromosome library. J Biotechnol. 2002, 95 (1): 25-38. 10.1016/S0168-1656(01)00443-6.
- Goldberg SMD, Johnson J, Busam D, Feldblyum T, Ferriera S, Friedman R, Halpern A, Khouri H, Kravitz SA, Lauro FM, Li K, Rogers YH, Strausberg R, Sutton G, Tallon L, Thomas T, Venter E, Frazier M, Venter JC: A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes (vol 103, pg 11240, 2006). Proc Natl Acad Sci USA. 2006, 103 (43): 16057. 10.1073/pnas.0607197103.
- Potera C: New gene sequencer targets productivity—Solexa says its novel system offers better cost-effectiveness via use of short-read sequences. Genet Eng News. 2006, 26 (17): 10–+.
- Graveley BR: Molecular biology—power sequencing. Nature. 2008, 453 (7199): 1197-1198. 10.1038/4531197b.
- Wicker T, Schlagenhauf E, Graner A, Close TJ, Keller B, Stein N: 454 sequencing put to the test using the complex genome of barley. BMC Genomics. 2006, 7: 275. 10.1186/1471-2164-7-275.
- Branscomb E, Predki P: On the high value of low standards. J Bacteriol. 2002, 184 (23): 6406-6409. 10.1128/JB.184.23.6406-6409.2002.
- Fraser CM, Eisen JA, Nelson KE, Paulsen IT, Salzberg SL: The value of complete microbial genome sequencing (you get what you pay for). J Bacteriol. 2002, 184 (23): 6403-6405. 10.1128/JB.184.23.6403-6405.2002. discussion 5.
- Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8 (3): 175-185.
- Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8 (3): 186-194.
- Green P: PHRAP, unpublished. 1994, [http://www.phrap.org/]
- Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res. 1998, 8 (3): 195-202.
- Pop M, Salzberg SL, Shumway M: Genome sequence assembly: algorithms and issues. Computer. 2002, 35 (7): 47-54. 10.1109/MC.2002.1016901.
- Sutton G, White O, Adams MD, Kerlavage AR: TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Technol. 1995, 1: 9-19.
- Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC: A whole-genome assembly of Drosophila. Science. 2000, 287 (5461): 2196-2204. 10.1126/science.287.5461.2196.
- Sommer DD, Delcher AL, Salzberg SL, Pop M: Minimus: a fast, lightweight genome assembler. BMC Bioinformatics. 2007, 8: 64. 10.1186/1471-2105-8-64.
- Gilchrist R, Chi V: Visible Genetics Inc., assignee. GeneObject. 1999, inventors. USA patent 5916747.
- Walther D, Bartha G, Morris M: Basecalling with LifeTrace. Genome Res. 2001, 11 (5): 875-888. 10.1101/gr.177901.
- Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Weinstock GM, Gibbs RA: The Atlas genome assembly system. Genome Res. 2004, 14 (4): 721-732. 10.1101/gr.2264004.
- Mullikin JC, Ning Z: The phusion assembler. Genome Res. 2003, 13 (1): 81-90. 10.1101/gr.731003.
- Peltola H, Soderlund H, Ukkonen E: SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res. 1984, 12 (1 Pt 1): 307-321. 10.1093/nar/12.1Part1.307.
- Pop M, Phillippy A, Delcher AL, Salzberg SL: Comparative genome assembly. Brief Bioinform. 2004, 5 (3): 237-248. 10.1093/bib/5.3.237.
- Pop M, Kosack DS, Salzberg SL: Hierarchical scaffolding with Bambus. Genome Res. 2004, 14 (1): 149-159. 10.1101/gr.1536204.
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-877. 10.1101/gr.9.9.868.
- Chen T, Skiena SS: A case study in genome-level fragment assembly. Bioinformatics. 2000, 16 (6): 494-500. 10.1093/bioinformatics/16.6.494.
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigo R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X: The sequence of the human genome. Science. 2001, 291 (5507): 1304-1351. 10.1126/science.1058040.
- Istrail S, Sutton GG, Florea L, Halpern AL, Mobarry CM, Lippert R, Walenz B, Shatkay H, Dew I, Miller JR, Flanigan MJ, Edwards NJ, Bolanos R, Fasulo D, Halldorsson BV, Hannenhalli S, Turner R, Yooseph S, Lu F, Nusskern DR, Shue BC, Zheng XH, Zhong F, Delcher AL, Huson DH, Kravitz SA, Mouchard L, Reinert K, Remington KA, Clark AG, Waterman MS, Eichler EE, Adams MD, Hunkapiller MW, Myers EW, Venter JC: Whole-genome shotgun assembly and comparison of human genome assemblies. Proc Natl Acad Sci USA. 2004, 101 (7): 1916-1921. 10.1073/pnas.0307971100.
- Mural RJ, Adams MD, Myers EW, Smith HO, Miklos GL, Wides R, Halpern A, Li PW, Sutton GG, Nadeau J, Salzberg SL, Holt RA, Kodira CD, Lu F, Chen L, Deng Z, Evangelista CC, Gan W, Heiman TJ, Li J, Li Z, Merkulov GV, Milshina NV, Naik AK, Qi R, Shue BC, Wang A, Wang J, Wang X, Yan X, Ye J, Yooseph S, Zhao Q, Zheng L, Zhu SC, Biddick K, Bolanos R, Delcher AL, Dew IM, Fasulo D, Flanigan MJ, Huson DH, Kravitz SA, Miller JR, Mobarry CM, Reinert K, Remington KA, Zhang Q, Zheng XH, Nusskern DR, Lai Z, Lei Y, Zhong W, Yao A, Guan P, Ji RR, Gu Z, Wang ZY, Zhong F, Xiao C, Chiang CC, Yandell M, Wortman JR, Amanatides PG, Hladun SL, Pratts EC, Johnson JE, Dodson KL, Woodford KJ, Evans CA, Gropman B, Rusch DB, Venter E, Wang M, Smith TJ, Houck JT, Tompkins DE, Haynes C, Jacob D, Chin SH, Allen DR, Dahlke CE, Sanders R, Li K, Liu X, Levitsky AA, Majoros WH, Chen Q, Xia AC, Lopez JR, Donnelly MT, Newman MH, Glodek A, Kraft CL, Nodell M, Ali F, An HJ, Baldwin-Pitts D, Beeson KY, Cai S, Carnes M, Carver A, Caulk PM, Center A, Chen YH, Cheng ML, Coyne MD, Crowder M, Danaher S, Davenport LB, Desilets R, Dietz SM, Doup L, Dullaghan P, Ferriera S, Fosler CR, Gire HC, Gluecksmann A, Gocayne JD, Gray J, Hart B, Haynes J, Hoover J, Howland T, Ibegwam C, Jalali M, Johns D, Kline L, Ma DS, MacCawley S, Magoon A, Mann F, May D, McIntosh TC, Mehta S, Moy L, Moy MC, Murphy BJ, Murphy SD, Nelson KA, Nuri Z, Parker KA, Prudhomme AC, Puri VN, Qureshi H, Raley JC, Reardon MS, Regier MA, Rogers YH, Romblad DL, Schutz J, Scott JL, Scott R, Sitter CD, Smallwood M, Sprague AC, Stewart E, Strong RV, Suh E, Sylvester K, Thomas R, Tint NN, Tsonis C, Wang G, Wang G, Williams MS, Williams SM, Windsor SM, Wolfe K, Wu MM, Zaveri J, Chaturvedi K, Gabrielian AE, Ke Z, Sun J, Subramanian G, Venter JC, Pfannkoch CM, Barnstead M, Stephenson LD: A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science. 2002, 296 (5573): 1661-1671. 10.1126/science.1069193.
- Kirkness EF, Bafna V, Halpern AL, Levy S, Remington K, Rusch DB, Delcher AL, Pop M, Wang W, Fraser CM, Venter JC: The dog genome: survey sequencing and comparative analysis. Science. 2003, 301 (5641): 1898-1903. 10.1126/science.1086432.
- Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, Nusskern DR, Wincker P, Clark AG, Ribeiro JM, Wides R, Salzberg SL, Loftus B, Yandell M, Majoros WH, Rusch DB, Lai Z, Kraft CL, Abril JF, Anthouard V, Arensburger P, Atkinson PW, Baden H, de Berardinis V, Baldwin D, Benes V, Biedler J, Blass C, Bolanos R, Boscus D, Barnstead M, Cai S, Center A, Chaturverdi K, Christophides GK, Chrystal MA, Clamp M, Cravchik A, Curwen V, Dana A, Delcher A, Dew I, Evans CA, Flanigan M, Grundschober-Freimoser A, Friedli L, Gu Z, Guan P, Guigo R, Hillenmeyer ME, Hladun SL, Hogan JR, Hong YS, Hoover J, Jaillon O, Ke Z, Kodira C, Kokoza E, Koutsos A, Letunic I, Levitsky A, Liang Y, Lin JJ, Lobo NF, Lopez JR, Malek JA, McIntosh TC, Meister S, Miller J, Mobarry C, Mongin E, Murphy SD, O'Brochta DA, Pfannkoch C, Qi R, Regier MA, Remington K, Shao H, Sharakhova MV, Sitter CD, Shetty J, Smith TJ, Strong R, Sun J, Thomasova D, Ton LQ, Topalis P, Tu Z, Unger MF, Walenz B, Wang A, Wang J, Wang M, Wang X, Woodford KJ, Wortman JR, Wu M, Yao A, Zdobnov EM, Zhang H, Zhao Q, Zhao S, Zhu SC, Zhimulev I, Coluzzi M, della Torre A, Roth CW, Louis C, Kalush F, Mural RJ, Myers EW, Adams MD, Smith HO, Broder S, Gardner MJ, Fraser CM, Birney E, Bork P, Brey PT, Venter JC, Weissenbach J, Kafatos FC, Collins FH, Hoffman SL: The genome sequence of the malaria mosquito Anopheles gambiae. Science. 2002, 298 (5591): 129-149. 10.1126/science.1076181.
- Shizuya H, Birren B, Kim UJ, Mancino V, Slepak T, Tachiiri Y, Simon M: Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci USA. 1992, 89 (18): 8794-8797. 10.1073/pnas.89.18.8794.
- Stein L: Genome annotation: from sequence to biology. Nat Rev Genet. 2001, 2 (7): 493-503. 10.1038/35080529.
- Reese MG, Hartzell G, Harris NL, Ohler U, Abril JF, Lewis SE: Genome annotation assessment in Drosophila melanogaster. Genome Res. 2000, 10 (4): 483-501. 10.1101/gr.10.4.483.
- Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997, 25 (5): 955-964. 10.1093/nar/25.5.955.
- Pennacchio LA, Rubin EM: Genomic strategies to identify mammalian regulatory sequences. Nat Rev Genet. 2001, 2 (2): 100-109. 10.1038/35052548.
- Mi H, Guo N, Kejariwal A, Thomas PD: PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res. 2007, 35 (Database issue): D247-D252. 10.1093/nar/gkl869.
- Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD: The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 2005, 33 (Database issue): D284-D288. 10.1093/nar/gki078.
- Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003, 13 (9): 2129-2141. 10.1101/gr.772403.
- Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-Lazareva B, Muruganujan A, Rabkin S, Vandergriff JA, Doremieux O: PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res. 2003, 31 (1): 334-341. 10.1093/nar/gkg115.
- Blake JA, Harris MA: The Gene Ontology (GO) project: structured vocabularies for molecular biology and their application to genome and expression analysis. Curr Protoc Bioinformatics. 2002, Chapter 7 (Unit 7.2).
- Camon E, Barrell D, Brooksbank C, Magrane M, Apweiler R: The Gene Ontology Annotation (GOA) Project—application of GO in SWISS-PROT, TrEMBL and InterPro. Comp Funct Genomics. 2003, 4 (1): 71-74. 10.1002/cfg.235.
- Selkov E, Overbeek R, Kogan Y, Chu L, Vonstein V, Holmes D, Silver S, Haselkorn R, Fonstein M: Functional analysis of gapped microbial genomes: amino acid metabolism of Thiobacillus ferrooxidans. Proc Natl Acad Sci USA. 2000, 97 (7): 3509-3514. 10.1073/pnas.97.7.3509.
- Fleischmann R: Single nucleotide polymorphisms in Mycobacterium tuberculosis structural genes—response to Dr. Musser. Emerg Infect Dis. 2001, 7 (3): 487-488.
- Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001, 98 (17): 9748-9753. 10.1073/pnas.171285098.
- Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES: ARACHNE: a whole-genome shotgun assembler. Genome Res. 2002, 12 (1): 177-189. 10.1101/gr.208902.
- Gordon D, Desmarais C, Green P: Automated finishing with autofinish. Genome Res. 2001, 11 (4): 614-625. 10.1101/gr.171401.
- Tammi MT, Arner E, Kindlund E, Andersson B: Correcting errors in shotgun sequences. Nucleic Acids Res. 2003, 31 (15): 4663-4672. 10.1093/nar/gkg653.
- Tammi MT, Arner E, Kindlund E, Andersson B: ReDiT: Repeat Discrepancy Tagger—a shotgun assembly finishing aid. Bioinformatics. 2004, 20 (5): 803-804. 10.1093/bioinformatics/bth004.
- Bartels D, Kespohl S, Albaum S, Druke T, Goesmann A, Herold J, Kaiser O, Puhler A, Pfeiffer F, Raddatz G, Stoye J, Meyer F, Schuster SC: BACCardI—a tool for the validation of genomic assemblies, assisting genome finishing and intergenome comparison. Bioinformatics. 2005, 21 (7): 853-859. 10.1093/bioinformatics/bti091.
- Arner E, Tammi MT, Tran AN, Kindlund E, Andersson B: DNPTrapper: an assembly editing tool for finishing and analysis of complex repeat regions. BMC Bioinformatics. 2006, 7: 155. 10.1186/1471-2105-7-155.
- Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool. 1970, 19 (2): 99-113. 10.2307/2412448.
- Fitch WM, Margoliash E: The usefulness of amino acid and nucleotide sequences in evolutionary studies. Evol Biol. 1970, 4: 67-109.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Koski LB, Golding GB: The closest BLAST hit is often not the nearest neighbor. J Mol Evol. 2001, 52 (6): 540-542.
- Hirsh AE, Fraser HB: Protein dispensability and rate of evolution. Nature. 2001, 411 (6841): 1046-1049. 10.1038/35082561.
- Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 2002, 12 (6): 962-968.
- Wall DP, Fraser HB, Hirsh AE: Detecting putative orthologs. Bioinformatics. 2003, 19 (13): 1710-1711. 10.1093/bioinformatics/btg213.
- Deluca TF, Wu IH, Pu J, Monaghan T, Peshkin L, Singh S, Wall DP: Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics. 2006, 22 (16): 2044-2046. 10.1093/bioinformatics/btl286.
- Lee MM, Chan MK, Bundschuh R: Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches. Bioinformatics. 2008, 24: 1339-1343. 10.1093/bioinformatics/btn130.
- Poptsova MS, Gogarten JP: BranchClust: a phylogenetic algorithm for selecting gene families. BMC Bioinformatics. 2007, 8: 120. 10.1186/1471-2105-8-120.
- Haas BJ, Delcher AL, Wortman JR, Salzberg SL: DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics. 2004, 20 (18): 3643-3646. 10.1093/bioinformatics/bth397.
- Celamkoti S, Kundeti S, Purkayastha A, Mazumder R, Buck C, Seto D: GeneOrder3.0: software for comparing the order of genes in pairs of small bacterial genomes. BMC Bioinformatics. 2004, 5: 52. 10.1186/1471-2105-5-52.
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008, 36 (Database issue): D13-D21.
- Peterson JD, Umayam LA, Dickinson T, Hickey EK, White O: The Comprehensive Microbial Resource. Nucleic Acids Res. 2001, 29 (1): 123-125. 10.1093/nar/29.1.123.
- Choi K, Ma Y, Choi JH, Kim S: PLATCOM: a Platform for Computational Comparative Genomics. Bioinformatics. 2005, 21 (10): 2514-2516. 10.1093/bioinformatics/bti350.
- Toft C, Fares MA: GRAST: a new way of genome reduction analysis using comparative genomics. Bioinformatics. 2006, 22 (13): 1551-1561. 10.1093/bioinformatics/btl139.
- Xie T, Hood L: ACGT—a comparative genomics tool. Bioinformatics. 2003, 19 (8): 1039-1040. 10.1093/bioinformatics/btg121.PubMedGoogle Scholar
- Chen T, Abbey K, Deng WJ, Cheng MC: The bioinformatics resource for oral pathogens. Nucleic Acids Res. 2005, 33 (Web Server issue): W734-W740. 10.1093/nar/gki361.PubMed CentralPubMedGoogle Scholar
- Leader DP: BugView: a browser for comparing genomes. Bioinformatics. 2004, 20 (1): 129-130. 10.1093/bioinformatics/btg383.PubMedGoogle Scholar
- Yang J, Wang J, Yao ZJ, Jin Q, Shen Y, Chen R: GenomeComp: a visualization tool for microbial genome comparison. J Microbiol Methods. 2003, 54 (3): 423-426. 10.1016/S0167-7012(03)00094-0.PubMedGoogle Scholar
- Romualdi A, Felder M, Rose D, Gausmann U, Schilhabel M, Glockner G, Platzer M, Suhnel J: GenColors: annotation and comparative genomics of prokaryotes made easy. Methods Mol Biol. 2007, 395: 75-96. full_text.PubMedGoogle Scholar
- Grant JR, Stothard P: The CGView Server: a comparative genomics tool for circular genomes. Nucleic Acids Res. 2008, 36: W181-W184. 10.1093/nar/gkn179.PubMed CentralPubMedGoogle Scholar
- Ghai R, Chakraborty T: Comparative microbial genome visualization using GenomeViz. Methods Mol Biol. 2007, 395: 97-108. full_text.PubMedGoogle Scholar
- Dubchak I, Ryaboy DV: VISTA family of computational tools for comparative analysis of DNA sequences and whole genomes. Methods Mol Biol. 2006, 338: 69-89.PubMedGoogle Scholar
- Hohl M, Kurtz S, Ohlebusch E: Efficient multiple genome alignment. Bioinformatics. 2002, 18 (Suppl 1): S312-S320.PubMedGoogle Scholar
- Treangen TJ, Messeguer X: M-GCAT: interactively and efficiently constructing large-scale multiple genome comparison frameworks in closely related species. BMC Bioinformatics. 2006, 7: 433-10.1186/1471-2105-7-433.PubMed CentralPubMedGoogle Scholar
- Darling AC, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004, 14 (7): 1394-1403. 10.1101/gr.2289704.PubMed CentralPubMedGoogle Scholar
- Tzika AC, Helaers R, Van de Peer Y, Milinkovitch MC: MANTIS: a phylogenetic framework for multi-species genome comparisons. Bioinformatics. 2008, 24 (2): 151-157. 10.1093/bioinformatics/btm567.PubMedGoogle Scholar
- Andersson SG, Kurland CG: Reductive evolution of resident genomes. Trends Microbiol. 1998, 6 (7): 263-268. 10.1016/S0966-842X(98)01312-2.PubMedGoogle Scholar
- Oliver KM, Russell JA, Moran NA, Hunter MS: Facultative bacterial symbionts in aphids confer resistance to parasitic wasps. Proc Natl Acad Sci USA. 2003, 100 (4): 1803-1807. 10.1073/pnas.0335320100.PubMed CentralPubMedGoogle Scholar
- Bensadia F, Boudreault S, Guay JF, Michaud D, Cloutier C: Aphid clonal resistance to a parasitoid fails under heat stress. J Insect Physiol. 2006, 52 (2): 146-157. 10.1016/j.jinsphys.2005.09.011.PubMedGoogle Scholar
- Degnan PH, Moran NA: Evolutionary genetics of a defensive facultative symbiont of insects: exchange of toxin-encoding bacteriophage. Mol Ecol. 2008, 17 (3): 916-929. 10.1111/j.1365-294X.2007.03616.x.PubMedGoogle Scholar
- Douglas AE: Reproductive failure and the free amino acid pools in pea aphids (Acyrthosiphon pisum) lacking symbiotic bacteria. J Insect Physiol. 1996, 42 (3): 247-255. 10.1016/0022-1910(95)00105-0.Google Scholar
- Buchner P: Endosymbiosis of animals with plant microorganisms. 1965, Interscience, New York., NYGoogle Scholar
- Muller HJ: The relation of recombination to mutation advance. Mutat Res. 1964, 1: 2-9.Google Scholar
- Perez-Brocal V, Gil R, Ramos S, Lamelas A, Postigo M, Michelena JM, Silva FJ, Moya A, Latorre A: A small microbial genome: the end of a long symbiotic relationship?. Science. 2006, 314 (5797): 312-313. 10.1126/science.1130441.PubMedGoogle Scholar
- Koonin EV: How many genes can make a cell: the minimal-gene-set concept. Annu Rev Genomics Hum Genet. 2000, 1: 99-116. 10.1146/annurev.genom.1.1.99.PubMedGoogle Scholar
- Gerdes SY, Scholle MD, Campbell JW, Balazsi G, Ravasz E, Daugherty MD, Somera AL, Kyrpides NC, Anderson I, Gelfand MS, Bhattacharya A, Kapatral V, D'Souza M, Baev MV, Grechkin Y, Mseeh F, Fonstein MY, Overbeek R, Barabasi AL, Oltvai ZN, Osterman AL: Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. J Bacteriol. 2003, 185 (19): 5673-5684. 10.1128/JB.185.19.5673-5684.2003.PubMed CentralPubMedGoogle Scholar
- Koonin EV: Comparative genomics., minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol. 2003, 1 (2): 127-136. 10.1038/nrmicro751.PubMedGoogle Scholar
- Moran NA: Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc Natl Acad Sci USA. 1996, 93 (7): 2873-2878. 10.1073/pnas.93.7.2873.PubMed CentralPubMedGoogle Scholar
- Prickett MD, Page M, Douglas AE, Thomas GH: BuchneraBASE: a post-genomic resource for Buchnera sp. APS. Bioinformatics. 2006, 22 (5): 641-642. 10.1093/bioinformatics/btk024.PubMedGoogle Scholar
- Tillier ER, Collins RA: Genome rearrangement by replication-directed translocation. Nat Genet. 2000, 26 (2): 195-197. 10.1038/79918.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.