
Computational Biology Methods and Their Application to the Comparative Genomics of Endocellular Symbiotic Bacteria of Insects

Abstract

Comparative genomics has become a tantalizing challenge in the postgenomic era, one magnified by the plethora of new genomes becoming available on a daily basis. The overwhelming list of new genomes to compare has pushed the field of bioinformatics and computational biology toward the design and development of methods capable of identifying patterns in a sea of noisy data. Despite many advances, persistent exceptions to the general patterns continue to make it difficult to generalize methods for comparative genomics. In this review, we discuss the different tools devised to undertake the challenge of comparative genomics and some of the exceptions that compromise the generality of such methods. We focus on endosymbiotic bacteria of insects because of the peculiarities of their genome dynamics when compared to free-living organisms.

1. Genomes, Genomes, and More Genomes

The emergence of genome information has overwhelmed our efforts to analyze the unexpected amount of data generated during the last two decades. As an example, today (February, 2009), there are 438 complete microbial genomes and 17 in draft in the J. Craig Venter Institute Comprehensive Microbial Resource website (URL: http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi). Considering that this is only a single resource, we estimate that the number of completed genomes will be on the order of double that by the end of 2009, with a considerable percentage of these already published in the literature. Already, the Entrez Genome project website maintained by the National Center for Biotechnology Information (NCBI) reports that on February 3, 2009, 857 genomes are complete, 815 are in draft assembly, and 989 are in progress (http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html). The number of institutes worldwide with increasing sequencing capacities has been rising at an exponential rate, and the first results of analyzing such data have resolved old and long-debated hypotheses and have also generated breakthrough ideas that have opened new avenues in all fields of genetics and evolutionary biology. However, our ability to cope technically with the amount of generated raw data has become seriously compromised, fueling many initiatives aimed at developing computational tools to analyze genomic and proteomic data. Many of these tools have been developed to perform comparative genomic analyses; each tool has had to face many of the complexities caused by biologically driven genome remodeling phenomena, such as genome duplication, rearrangement, and shrinkage. In this review, we first discuss the different technologies developed to perform genomic and proteomic analyses. We then focus on the importance of the developed tools for studying biologically important phenomena such as genome duplication, the dynamics of genome rearrangement, and the genome shrinkage that is associated with the intracellular life of bacteria.

2. Common Methods in Comparative Genomics

Comparative genomic methods are vast in number as well as function. A decision about the best way to do something is often a long and arduous task in this field, a task that has resulted in the design and reengineering of many of the tools that are available. To describe every method in this area of research would be next to impossible, and so, this text will provide a snapshot of what is available for many of the common tasks in comparative genomics. The logical place to start is of course the beginning—genome sequencing, assembly, and closing, then continuing to discuss the intricacies of comparative genomics.

While in the past comparative genomics has concentrated on sequencing single genomes and parts of genomes, current excitement lies with the sequencing of environmental communities. This field of research, termed metagenomics, is growing fast and is the current hot topic. It is most often applied to characterize unculturable organisms (an estimated 99% of microbes cannot be cultivated in a laboratory environment [1]), but it has also made it possible to sequence genomes without the problems that are associated with cultures maintained in laboratories [2]. Metagenomics has transformed the uses of such organisms by allowing the focus to move away from those that can be cloned in culture [3]. Depending on the source of the environmental sample subjected to environmental shotgun sequencing, the number of identified species may vary enormously. Looking at prokaryotes alone, as few as five species were identified in a community sequencing carried out on acid mine biofilm by Tyson et al. [4]; in contrast, as many as 3,000 species were sequenced from a soil sample taken in Minnesota, USA, and analyzed by Tringe et al. [5]. For a comprehensive review of this subject, see [6].

3. Sequencing

In the context of γ-proteobacteria, sequencing is commonly carried out using a shotgun approach. This technique is popular and is widely used in the generation of long sequences, such as those found in whole genomes. Briefly, this approach involves the sequencing of random small cloned fragments, known as reads, in both directions from the genome. This fragmented reading of the genome is carried out multiple times to provide good coverage and overlap within the sequencing. Having good quality overlap/coverage allows the reads to be assembled into their original order, thus reconstructing the genome (Figure 1a). Not surprisingly, reconstructing the genome from short overlapping reads is a nontrivial task and requires complex computational techniques to produce a quality result. This technique was first described by Sanger et al. [7] and has been refined and used as the basis of genome sequencing and assembly ever since. The method has been developed in two main directions: (1) a whole genome shotgun approach [7, 8] and (2) a hierarchical shotgun approach [9].
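The link between the number of reads, their length, and the resulting coverage can be illustrated with a small simulation. The following is a minimal sketch, not the protocol of any sequencing center: it assumes error-free reads of fixed length drawn from uniformly random start positions on a linear toy genome, and the function names are hypothetical.

```python
import random

def simulate_shotgun_reads(genome, read_length, n_reads, seed=0):
    """Sample fixed-length reads from uniformly random start positions.

    Assumption: reads are error-free and all come from the forward strand.
    Returns a list of (start_position, read_sequence) pairs.
    """
    rng = random.Random(seed)
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, genome[s:s + read_length]) for s in starts]

def per_base_depth(genome_length, reads):
    """Count how many reads cover each position of the genome."""
    depth = [0] * genome_length
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            depth[i] += 1
    return depth

def mean_coverage(genome_length, read_length, n_reads):
    """Average coverage: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_length

if __name__ == "__main__":
    rng = random.Random(1)
    genome = "".join(rng.choice("ACGT") for _ in range(5000))  # toy genome
    reads = simulate_shotgun_reads(genome, read_length=100, n_reads=400)
    depth = per_base_depth(len(genome), reads)
    print("mean coverage:", mean_coverage(len(genome), 100, 400))
    print("positions with no read:", depth.count(0))
```

Even when the mean coverage is high, some positions may remain uncovered because read placement is random, which is one reason why gaps remain after a first round of sequencing.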

Figure 1

a Whole genome shotgun sequencing: The genome is sheared into small, approximately equal-sized fragments, which are cloned and then sequenced in both directions. The resulting sequences (reads) are then fed to an assembler (illustrated in Figure 2). b To overcome some of the complexity of normal shotgun sequencing of large sequences such as genomes, a hierarchical approach can be taken. The genome is broken into a series of large equal segments of known order, which are then subjected to shotgun sequencing. The assembly process here is simpler and less computationally expensive.

As described above, the whole genome approach where the genome is fragmented into defined length reads is followed by assembly, using purely bioinformatic-based techniques. The second approach, which is more appropriate for larger genomes, utilizes an added step to reduce the computational requirement in assembling the final sequence (Figure 1b). Firstly, the genome is broken into larger fragments, which are in a known order; these fragments are then subsequently subjected to sequencing using the normal shotgun approach. This method requires less computational intervention in assembling the reads into the correct order. Information is already known about the order of each subset of reads and thus less error is incurred in the final assembly. Of course, there are disadvantages with each of these approaches. For instance, with the whole-genome approach, there is the uncertainty as to whether the assembly is correct due to the total reliance on bioinformatics tools to join and order the reads; in addition, coverage may be insufficient (i.e., overlap between the fragments). The second approach is time consuming and labor intensive due to the addition of the extra step at the beginning of the protocol [10]; this approach is also susceptible to incomplete coverage [11]. Further advances have been made since the advent of shotgun sequencing but the central concepts remain the same.

Technologies currently used in genome sequencing include high-throughput methods such as 454 [12], SOLiD (Applied Biosystems), and Solexa [13]. These methods differ from older technologies in their throughput: hundreds of thousands of DNA molecules are sequenced at the same time, instead of a single DNA clone being processed [14]. The reads returned from each of these technologies are very short; thus, assembly is rather difficult. This disadvantage is offset by the fact that so much DNA is sequenced. The sequencing methodology of these approaches, in particular 454, is called pyrosequencing. This is essentially the sequencing of DNA utilizing the detection of enzymatic activity to identify the bases, a process termed "sequencing-by-synthesis" [15]. Future developments will of course increase the length of reads produced by the technologies, as well as the accuracy of the programs with which the fragments are assembled.

Past discussion has provided some insight into the pitfalls of each method and may aid the decision-making process [14, 16, 17]. One thing is certain: the higher the coverage a method is able to achieve, the higher the likelihood that the assembly tool will produce the correct result, so coverage should be one of the foremost considerations when choosing a method.

4. Base Calling and Genome Assembly

After genome sequencing is complete, it then becomes necessary to reconstruct the sequence fragments into a meaningful order that will accurately reflect the original orientation and order of the gene and junk (noncoding regions and pseudogenes) content. The most common and popular manner in which this is achieved is through the Phred [18, 19]–PHRAP [20]–CONSED [21] pipeline of tools (all of which originate from the University of Washington).

When assembling sequences from the myriad of reads that encompass a genome, several factors must be accounted for. Firstly, base-calling (the operation of determining the nucleotide base sequence from the chromatograph) must be completed with a minimum of erroneous interpretations of the chromatograph. The nucleotide sequence is determined for each read by the base-caller; the assembler then is utilized to piece the reads together into their original order, but must account for insertions, deletions, rearrangements, inversions, and sequence divergence in doing so. In particular, these events are important when assembling using a comparative method (i.e., using the scaffold of an existing genome to predict the locations of the fragments in the newly sequenced genome). No assembler (to date) proposes to handle all of these complications successfully but some do claim to be more capable than others under certain circumstances. For example, Pop et al. [22] reported that PHRAP [20] is more adept at creating long contigs (collection of contiguous pieces of DNA (reads)) than other available methods such as TIGR Assembler [23] or Celera Assembler (WGS-Assembler) [24]. This can be valuable and has been used in the past as an indication of the success of an assembler. More recently, it has been reported that a reduction in the length of contigs across the assembly is an acceptable outcome if the error rate is reduced [25]. Probably the most widely used base-calling algorithm is implemented in Phred [18, 19]. Others include GeneObject [26] and Life-Trace [27].

PHRAP has been widely adopted as an integral component of assembly pipelines such as implemented by Havlak et al. [28] in the Atlas Genome Assembly System and Mullikin and Ning [29] in the Phusion Assembler. It is considered the standard way in which to assemble smaller genomes with larger genomes relying on more complex algorithms provided by programs such as the WGS-Assembler.

Traditionally, assembly algorithms employ a method known as "overlap–layout–consensus" [30] (Figure 2). Initially, the reads are compared to one another to identify overlapping regions using a strategy known as hashing to minimize the time required to complete the computation [31]. When the potentially overlapping reads are positioned, a computationally intensive multiple sequence alignment is carried out to produce a consensus sequence. This consensus sequence is a draft of the genome and requires further computational and manual intervention to reach completion. In some genome assembly pipelines, a further step is introduced, in which information from sequencing in both directions of each fragment is utilized to join contigs into larger sections. These sections combine to create a scaffold, minimizing the amount of potential misassembly that may be introduced. Newer methods, such as that described by Pop et al. [31], eliminate the overlap identification step in favor of moving directly to the creation of the multiple sequence alignment, thus considerably reducing the time required to construct a draft assembly. These methods have been termed "alignment layout consensus" and are implemented in the AMOS Comparative Assembler (AMOS-Cmp). AMOS takes advantage of already available programs in its creation of multiple sequence alignments and scaffolds. Bambus [32] is designed to create scaffolds based on the discrete reads resulting from the shearing process of the shotgun technique. It aids in resolving the placement and direction of the reads using the mate-pair information produced by sequencing each read in both directions (a process known as double-ended shotgun sequencing). Using this scaffolding approach interleaved with other assembly techniques gives an elevated probability of producing a high quality complete genome.

Figure 2

Overlap–layout–consensus genome assembly algorithm: Reads are provided to the algorithm. Overlapping regions are identified. Each read is graphed as a node and the overlaps are represented as edges joining the two nodes involved. The algorithm determines the best path through the graph (Hamiltonian path). Redundant information (i.e., unused nodes and edges) is discarded. This process is carried out multiple times and resulting sequences are combined to give the final consensus sequence that represents the genome.
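The overlap and layout steps described above can be sketched on a toy example. The code below is a deliberately simplified illustration and not the algorithm of PHRAP or any other assembler named in this review: it assumes error-free reads from a single strand, replaces the search for the best path through the overlap graph with a greedy merge of the largest overlap, and omits the consensus (multiple alignment) step entirely. All function names are hypothetical.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(a), len(b))
    for size in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Greedily merge the pair of sequences with the largest overlap.

    Assumptions: reads are error-free and on the same strand; repeats
    longer than `min_overlap` may cause misassembly in this naive scheme.
    """
    contigs = list(dict.fromkeys(reads))  # remove exact duplicates
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs
               if not any(r != other and r in other for other in contigs)]
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # no remaining overlaps: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j) and c not in merged]
        contigs.append(merged)

if __name__ == "__main__":
    # Three overlapping reads taken from the toy sequence ATGCGTACGTTAGC.
    reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
    print(greedy_assemble(reads))
```

The all-against-all comparison used here is quadratic in the number of reads; the hashing strategy mentioned in the text exists precisely to avoid that cost on real data.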

No up-to-date objective comparison of genome assemblers is available that takes into account the continual development being carried out on each project. Comparisons carried out by groups such as Huang and Madan [33] and Chen and Skiena [34] are works that seek to validate recently released methods. Chen and Skiena [34] come closest to an objective comparison in their rigorous testing of their own creation, STROLL, against the then-latest versions of PHRAP by Green [20] and the TIGR Assembler by Sutton et al. [23]. In their evaluation, they reported that PHRAP was consistently more accurate in producing the correct assembly and had the lowest error rates of the group. STROLL produced similar results to PHRAP, while the TIGR Assembler produced a considerably more erroneous assembly: it yielded significantly more and smaller contigs and left a higher proportion of gaps unclosed. In addition, running the TIGR Assembler on the read data took approximately five times longer than either of the other two programs evaluated.

In the race to publish the human genome in the early 2000s, the Celera Whole Genome Assembler was engineered to accommodate large genomes. Its first use was described by Myers et al. [24] in the paper reporting the completion of the Drosophila genome. It was later enhanced and used in the initial assembly of the human genome [35] and in the publication of the whole human genome assembly [36], in addition to the mouse [37], dog [38], and mosquito [39] genomes. While Celera is a private corporation, it has released the Celera Assembler as open source software for free usage.

In early 2007, a new assembler, Minimus, was described by Sommer et al. [25]. It is a streamlined approach aimed at providing a simple, faster, and more efficient means of assembling fragmented sequences. Minimus [25] performs best on small assembly jobs such as small genomes, genes, and bacterial artificial chromosome clones [40]. It has also been assessed with respect to assembling larger sets of fragmented DNA, such as those found in bacterial genomes, and has been found to produce fewer assembly errors than PHRAP. The cost of this reduction in error rate is that the number of contigs is greater and, consequently, the size of the contigs is smaller, resulting in a more fragmented assembly [25]. In addition, all test assemblies produced by Minimus were completed in approximately half the time that PHRAP required. It remains to be seen whether this new assembler will work its way into common use in assembly systems such as Phusion and Atlas, but it is unlikely to remain at an advantage for long, as the development of new and reworked assemblers is swift and continuous. It has been suggested that it is beneficial for more than one method to be used, so that the exclusive advantages of each method may be exploited [33]. This strategy may well be more time consuming, but if the time is affordable, it should be implemented.

5. Annotating the Genome

Distilling information from the assembled genome is the next obvious step in building a biological understanding of each newly sequenced individual or species. Genome annotation has three main levels: nucleotide-level annotation, protein-level annotation, and process-level annotation. Nucleotide-level annotation itself comprises several procedures. The first, mapping, is the process of identifying known genes, markers, and landmarks within the genome. This is usually carried out using sequence similarity searching programs such as BLAST [41]. The second, gene finding, as the name indicates, involves the prediction of gene locations within the genome. Within the genes, the locations of introns and exons are sought in an effort to divide the DNA into coding and junk categories. This is not a trivial process and often results in very poor sensitivity and specificity; in particular, results are poor when the signal-to-noise ratio is low, i.e., when the amount of noncoding DNA is high (for a more elaborate review and comparison of gene prediction algorithms, see [42]).
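To make the idea of gene finding concrete, the simplest possible signal is an open reading frame (ORF): a start codon followed in the same frame by a stop codon. The sketch below is only a naive illustration under stated assumptions (forward strand only, ATG as the sole start codon, the standard stop codons, no introns); the gene prediction programs reviewed in [42] rely on far richer statistical models, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=3):
    """Scan the three forward reading frames for ATG...stop stretches.

    Returns a list of (start, end, frame) tuples, where `end` is exclusive
    and includes the stop codon. `min_codons` counts the start and stop
    codons. Assumption: only the forward strand is scanned.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

In a genome with a high proportion of noncoding DNA, a scan like this reports many spurious ORFs, which illustrates why sensitivity and specificity suffer when the signal-to-noise ratio is low.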

Due to the extraordinary numbers of genes and sequences that have already been characterized in one species or another, much of the effort required to identify genes is saved. Also to be identified are noncoding regions including, for example, tRNAs and rRNAs. These are mostly characterized, once again, by means of similarity searches and by using programs such as tRNAscan-SE [43]. Other regions that must be discovered are regulatory regions, such as transcription factor binding sites, a topic covered in detail in a review paper [44]. In brief, methods have been developed to identify these regions by looking for patterns that occur more often than would be expected by chance; often this strategy is carried out in conjunction with similarity searching techniques.
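The notion of a pattern occurring "more often than would be expected by chance" can be sketched by comparing observed k-mer counts with the counts expected from base composition alone. This is a minimal sketch under a strong assumption (bases are independent and drawn with the sequence's own frequencies); it is not the method of any specific tool covered in [44], and the function names and the `min_ratio` threshold are hypothetical.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count all (overlapping) substrings of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def expected_count(kmer, sequence):
    """Expected occurrences of `kmer` if bases were independent.

    Assumption: a zero-order background model using the base
    frequencies of the sequence itself.
    """
    n = len(sequence)
    freqs = {base: sequence.count(base) / n for base in set(sequence)}
    prob = 1.0
    for base in kmer:
        prob *= freqs.get(base, 0.0)
    return prob * (n - len(kmer) + 1)

def overrepresented_kmers(sequence, k, min_ratio=2.0):
    """List (kmer, observed, expected) with observed/expected >= min_ratio."""
    results = []
    for kmer, observed in count_kmers(sequence, k).items():
        expected = expected_count(kmer, sequence)
        if expected > 0 and observed / expected >= min_ratio:
            results.append((kmer, observed, expected))
    return sorted(results, key=lambda r: r[1] / r[2], reverse=True)

if __name__ == "__main__":
    toy = "ACGT" * 5  # toy sequence in which ACGT is clearly enriched
    for kmer, observed, expected in overrepresented_kmers(toy, 4):
        print(kmer, observed, round(expected, 3))
```

A ratio alone is not a significance test; a real analysis would also need a proper statistical assessment and a more realistic background model.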

At the protein-level annotation step, characterization is carried out. Genes are named and assigned functions, mostly by means of comparison to already annotated genomes. Often this results in the categorization of many proteins into "unknown function" or "hypothetical protein" categories until experimentation sheds light on the purpose of the gene at hand.

The final level of annotation is the process level. Here, the biological processes affected by the gene are identified. Process categories usually include cell cycle, cell death, immune response, and metabolism, to name but a few. Once again, the processes affected are usually determined via comparison with the information that is already available. It is useful here to note the existence of a few well-established databases that have devised naming conventions and controlled vocabularies for the description of new genes. Probably the most commonly utilized of these are Panther [45–48] and GO [49, 50]. Both of these are freely available for use via the World Wide Web and are widely accepted and adhered to by the genome analysis community.

Much work has been done to develop quicker and more reliable ways of identifying the protein-coding regions of a genome; noncoding regions, while not completely neglected, have been the less studied of the two. Detecting either coding or noncoding regions is not easy, and the development of reliable and robust methods is not nearing a plateau. Constant progress is being made in this field; thus, the literature should be watched closely in order to stay up to date with current best practices in annotation.

6. Closing the Genome

Closing and completing a genome-sequencing project has proved to be an important step in ensuring the accuracy and reliability of the output deposited in public databases. While the release of draft sequences is very useful, they are notoriously erroneous, both in sequence and in assembly [17]. Error rates for draft sequencing have been reported to be 1 in 1,000–2,000 base pairs [51], in contrast to the rates of 1 in 10,000 reported by Selkov et al. [51] and 1 in 100,000 reported by Fleischmann [52] for whole genome sequencing. The typical errors found in draft sequences are sequencing errors, sequence misassemblies, and the inclusion of contaminant sequences from foreign DNA as bona fide reads [17]. Finding the source of such problems is difficult and time consuming and is often carried out manually. The most important factor to take into account here is the economic tradeoff and whether it is worth the compromise. For example, are there enough financial resources to allow the whole genome sequencing to be brought to a close? It is important to realize that the quality of the sequencing, or lack thereof, will propagate forward into whatever analysis is carried out using the DNA sequence. Negative effects will be evident in all downstream analyses; everything from annotation and gene recognition to the subsequent identification of homologs, gene families, and phylogenetic relationships will be affected.
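The practical weight of these error rates is easier to appreciate as an expected number of erroneous bases per genome. The short calculation below is a worked illustration only: the genome size of 4,000,000 bp is a hypothetical round figure chosen for the example and does not refer to any genome discussed in this review, and the function name is likewise hypothetical.

```python
def expected_errors(genome_length_bp, bases_per_error):
    """Expected number of erroneous bases at one error per `bases_per_error`."""
    return genome_length_bp / bases_per_error

if __name__ == "__main__":
    genome_length = 4_000_000  # hypothetical genome size, for illustration
    rates = [
        ("draft, 1 in 1,000", 1_000),
        ("draft, 1 in 2,000", 2_000),
        ("finished, 1 in 10,000", 10_000),
        ("finished, 1 in 100,000", 100_000),
    ]
    for label, bases_per_error in rates:
        errors = expected_errors(genome_length, bases_per_error)
        print(f"{label}: about {errors:.0f} expected errors")
```

Under these assumptions, the difference between a draft and a finished sequence amounts to one or two orders of magnitude in the number of erroneous bases that downstream analyses inherit.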

While the methods of sequence assembly discussed above are thorough and have relatively low error rates, they are not capable of producing a completely reconstructed genome sequence without manual intervention and some potential resequencing. What the methods do produce is a draft sequence that would normally cover approximately 99% of the genome under reconstruction [17]. This draft stage of assembly can be reached within a small number of days. In contrast, the process of closing the assembly may require months to complete and in some instances may take years. For example, the draft human genome was published in 2001, 4 years ahead of the predicted date of availability (2005). The complete whole genome was, however, not finished until 2003 and subsequently published in 2004 [36]. The time, and consequently the monetary cost, incurred is a sacrifice that those in the area of comparative genomics are willing to make, as the quality provided by a closed genome is well worth the wait. Moreover, while useful in their own right, draft assemblies are constantly changing and potentially erroneous.

To meet the need for high quality complete genome sequences, several strategies have been developed at facilities such as TIGR, Washington University and Sanger. In some cases, a certain amount of error checking is carried out in conjunction with assembly. Programs such as EULER [53] and Arachne [54] are examples of assembly systems that include error correction components. Other approaches include the use of correction algorithms a posteriori to the assembly process. Examples of this type of program are Autofinish (of the wider package—CONSED) [55], MisEd [56], and ReDit [57]. Autofinish, one of the most popular computer programs, is used in many genome sequencing centers, such as The Genome Center at the University of Washington, the Berkeley Drosophila Genome Project at Lawrence Berkeley National Laboratory, and the Lita Annenberg Hazen Genome Center at Cold Spring Harbor Laboratory among others [55]. The product of the program must be manually inspected to ensure the quality and accuracy, but the amount of human intervention in this program is significantly reduced. In projects that had sequence coverage as low as four and five times, the human time required to close the project was reduced by more than 51% and 83%, respectively [55]. As the sequence coverage increased up to 14 times, the difference diminished, but consistently less human effort was required when Autofinish was utilized.

The finishing techniques that are employed in programs such as Autofinish reflect what a human finisher does in identifying problem areas in the assembly that has been produced. They go on to propose possible means of resolving the issues, indicating regions to be resequenced and potential reads to aid in closing any gaps that are present. Due to the nature of the problems that are found in draft genome sequences, finishing is an iterative process that can require many cycles through a workflow to resolve all issues; frequently, it is necessary for a human finisher to get involved toward the end to complete the process. This intervention must be as efficient as possible, and many graphical viewers and editors are available for this purpose. Examples of manual finishing software include components of the aforementioned CONSED (sequence finishing tool) [21] and ReDit (shotgun assembly finishing aid) [57], as well as BaCCardI (validation and assistance in finishing) [58] and DNPTrapper (analysis of complex regions and finishing tool) [59]. Each of these software programs aims to make the editing process as user friendly as possible while offering the best possible combination of editing capabilities.

7. Comparative Genomics: Solving the Puzzle

Comparative genomics is one of the most promising areas that logically follows the success in improving genome sequencing. More and more comparative genomics programs are being demanded to identify protein-coding genome regions, the placement of regulatory elements, and the main evolutionary dynamics affecting the complexity of genome organization. Despite their apparent simplicity, such comparative methods have to face many technical as well as theoretical problems. One of the most important problems is aligning whole genomes and visualizing such alignments in a comprehensive and comprehensible way. This problem in sequence alignment leads to other genomic problems such as the finding of orthologs between genomes. The magnitude of this problem increases when the comparison is made between genomes with different population dynamics and hence different mutational rates, as we will explain below.

7.1. The First Hurdle—How to Determine the Homologs (Orthologs and/or Paralogs)?

Identification of homologous genes relies on the appropriate definition of a homolog. The most widely accepted definition is that homologous genes share a common ancestry. This definition, however, is not precise as to the nature of this common ancestry and comprises two types of homologs (as described by Fitch [60] and Fitch and Margoliash [61]): orthologs (common species ancestry caused by a speciation event, in such a way that the homologous genes are in different species) and paralogs (common gene ancestry caused by a gene duplication event and, as a consequence, the homologous genes are present in the same species).

Irrespective of the nature of the ancestry considered, homologs are usually identified on the basis of sequence similarity: the higher the similarity, the more likely it is that the sequences have derived from a common ancestor. One of the first and most commonly used programs for detecting the degree of similarity between sequences is BLAST [62], along with its more recent extension PSI-BLAST [63]. BLAST uses predefined scoring matrices, whereas PSI-BLAST uses position-specific scoring matrices derived from the scoring hits of the initial search. The two programs yield a score for each comparison and a measure of its statistical significance, called the e-value. Sequences with the highest scores, and therefore with the lowest e-values, are considered to be the closest relatives in the searched database. The assumption underlying this software is that the phylogenetic relationship between any two sequences and their degree of similarity are positively correlated. This, however, leads to another theoretical problem: how to determine whether a sequence is significantly more similar to one particular sequence than it is to another. Unfortunately, setting a statistical cutoff value to determine when two sequences are significantly similar is rather difficult and problematic when determining a set of possible homologs. The lower the e-value cutoff, the larger the number of false negatives. On the other hand, the higher the cutoff, the larger the number of false positives. As an additional drawback, the sequences with the highest score and lowest e-value are not always more closely related to each other than those identified as hits with a lower score [64].

In BLAST searches for homologs, many types of relationships between the homologs can be investigated, including many-to-many, one-to-many, or very strict one-to-one relationships. The first two are a result of duplication events after speciation. A very effective way to identify one-to-one relationships is the approach generally called reciprocal best BLAST hits [65, 66]. This method is based on the assumption that genes that are each other's best hits when performing a BLAST search are more likely to be orthologs than ones that are not. The reason for this is that although gene A in genome 1 may find gene B in genome 2 as its best match, this match may be worse than that between gene B in genome 2 and gene C in genome 1. This approach is again limited by the assumption that best hits ensure orthology, which might not be the case when a particular gene underwent a recent duplication in a particular lineage. The consequence is that when a gene finds a paralog as its top BLAST hit instead of its ortholog, both the gene and its paralog are excluded from downstream analyses [67]. These limitations of BLAST searches have fuelled the development of other ways to identify putative orthologs over the last few years. One such method uses sequence distances instead of similarities to identify orthologs and is known as the reciprocal smallest distance algorithm [67]. It uses global sequence alignment and maximum likelihood to estimate the evolutionary distances between genes in order to detect orthologous genes. This approach has also been used to determine orthologs in databases like Roundup [68]. Another simple approach that has contributed significantly to the reduction in the number of false positive results when conducting BLAST searches is PSI-BLAST [69].
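The reciprocal best hit criterion can be sketched in a few lines once the search results are available as (query, subject, score) records. The code below is a minimal sketch rather than the procedure of [65, 66]: it assumes that a higher score means a better hit, ignores ties and e-value thresholds, and uses made-up gene names and scores that mirror the A, B, and C example in the text.

```python
def best_hits(hits):
    """Map each query gene to its highest-scoring subject gene.

    `hits` is an iterable of (query, subject, score) tuples.
    Assumption: higher score is better; the first of any tied hits wins.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(hits_1_vs_2, hits_2_vs_1):
    """Return sorted (gene_in_genome_1, gene_in_genome_2) reciprocal pairs."""
    best_12 = best_hits(hits_1_vs_2)
    best_21 = best_hits(hits_2_vs_1)
    return sorted((a, b) for a, b in best_12.items() if best_21.get(b) == a)

if __name__ == "__main__":
    # Hypothetical scores: A's best hit is B, but B's best hit is C.
    genome1_vs_genome2 = [("A", "B", 200), ("C", "B", 350), ("D", "E", 500)]
    genome2_vs_genome1 = [("B", "C", 350), ("B", "A", 200), ("E", "D", 500)]
    print(reciprocal_best_hits(genome1_vs_genome2, genome2_vs_genome1))
```

In this toy case gene A is left without an ortholog assignment, which is the behavior the text describes for genes whose top hit is not reciprocated.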

Homology may also be ascertained by means of phylogenetic methods such as BranchClust by Poptsova and Gogarten [70]. This type of method is capable of determining orthology and distinguishing it from paralogy. BranchClust utilizes similarity searching during the execution of its algorithm but does not rely solely on it. Hits within a certain threshold are used, rather than only the best hit, in order to include both paralogs and orthologs. These results are then grouped into what Poptsova and Gogarten have termed superfamilies. The sequences in each superfamily are aligned and phylogenetic trees are constructed. The step of phylogenetic inference is then followed by a complex algorithm that is described fully in the application's article [70]. BranchClust is reported to outperform traditional similarity search methods, owing to its lower false negative rate compared with the reciprocal best BLAST hit method.
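The first stage described above, grouping all hits within a threshold rather than only the best hit, can be illustrated as follows. This is not the BranchClust algorithm, whose tree-based steps are described in [70]; it is a hypothetical sketch that assumes single-linkage grouping (any two genes connected by a chain of qualifying hits end up in the same group) and an e-value threshold chosen purely for illustration.

```python
def cluster_hits(genes, hits, max_evalue=1e-5):
    """Group genes linked by hits whose e-value is at most `max_evalue`.

    `hits` is an iterable of (gene_a, gene_b, evalue) tuples and every
    gene named in a hit must appear in `genes`.
    Assumption: single-linkage grouping implemented with union-find.
    """
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links to the group representative, compressing paths.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for gene_a, gene_b, evalue in hits:
        if evalue <= max_evalue:
            root_a, root_b = find(gene_a), find(gene_b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = {}
    for gene in genes:
        groups.setdefault(find(gene), []).append(gene)
    return sorted(sorted(group) for group in groups.values())

if __name__ == "__main__":
    genes = ["a1", "a2", "b1", "c1"]
    hits = [("a1", "a2", 1e-30), ("a2", "b1", 1e-10), ("b1", "c1", 0.5)]
    print(cluster_hits(genes, hits))
```

Each resulting group would then be aligned and subjected to phylogenetic inference, which is where orthologs and paralogs are actually told apart.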

Irrespective of the method used to identify homologs, visualizing results is a common way to inspect the data and gain first insights into trends and patterns in genome evolutionary dynamics. This fact has inspired the creation of software for comparative genomics with graphical solutions to assist in the interpretation of the results. These solutions provide user-friendly environments in which tasks such as navigation along alignments are easy and reliable. The question remains, however, whether visualization tools can solve the puzzle of genome rearrangements. An argument against relying on visual inspection is that the process is neither repeatable nor statistically sound. Undoubtedly, insights will be yielded, but all perceived trends should be investigated in a more analytically robust manner.

7.2. Pairwise Genome Comparisons

Many groups have devoted a substantial amount of their resources to the development of tools aimed at comparing two genomes and have validated such tools by comparing circular prokaryotic genomes. Some visualization software tools have specialized in performing direct comparisons of synteny information through scatter plots of pairwise genome comparisons. For example, software such as DAGchainer [71], GeneOrder [72], GenePlot from NCBI [73], Genome v/s Genome Protein Hits Scatter Plot from The Comprehensive Microbial Resource (CMR) [74], and GenomePlot from PLATCOM [75] achieve this by presenting a plot where one axis represents the positions of the genes within one of the genomes while the other represents the genes for the other genome (Figure 3). The scatter plot then represents homologous genes for both genomes determined by either total hits or best BLAST hit. Perfectly syntenic genes between the two genomes would therefore represent a linear relationship between the two axes (Figure 3a) whereas alternative arrangements of the scattered dots may indicate that genome rearrangements have taken place in one of the genomes (Figure 3b). As an alternative to these visual representations, other programs such as GRAST mark the hits between the two genomes and represent them in a circular way [76]. Finally, other programs such as ACGT [77], GOAL from BROP [78], BugView [79], and GenomeComp [80] have contributed to the field of comparative genomics by linearly representing rearrangements or syntenic information by linking homologous regions between the compared genomes using lines. The advantage of programs such as these is that in addition to yielding information about genome rearrangements, they can also spot conserved and nonconserved regions between the two genomes in much greater detail than other programs.

Figure 3

Genome rearrangement plots comparing two genomes. Genome plots can provide information on the kind of rearrangements undergone. These plots represent the location of each gene on one axis for one of the genomes against the location of its ortholog on the other axis for the second genome. a Comparative genomic plot of two genomes showing no lineage-specific genome rearrangements. In this case, the plot was produced for the comparison of two primary symbiotic bacteria of insects (B. aphidicola strain A. pisum versus B. aphidicola strain Schizaphis graminum). Since no rearrangements have occurred in either of the two genomes, the comparison yields a straight diagonal line. b Comparative genomic plot of two genomes showing lineage-specific genome rearrangements. In this case, the plot compares the genome of E. coli K12 to the genome of B. aphidicola strain A. pisum. Other patterns that can be observed are X-like patterns (as in this case, B. aphidicola strain A. pisum and E. coli K12), where the rearrangements have occurred over the replication axis. c Comparison between Chlamydophila pneumoniae CWL029 and Chlamydia trachomatis 434/Bu, which shows an even better example of rearrangements that have occurred over the replication axis (this example has also been shown in [102]). As shown, many rearrangements including inversions and translocations have occurred, and consequently, the orthologs are not located on the major diagonal of the plot but rather show an X-shaped distribution. This is expected if inversions have taken place around the origin or terminus of replication of the chromosome.
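The visual distinction between a collinear plot and an X-shaped plot can also be summarized numerically. The following Python sketch is an illustration only and is not taken from any of the tools cited above: it assumes that ortholog positions have been normalized to the range 0 to 1 by genome length, treats the genomes as linear (a shifted origin of a circular genome is ignored), and uses a hypothetical `tolerance` parameter to decide whether a point lies on a diagonal.

```python
from typing import List, Tuple


def diagonal_fractions(pairs: List[Tuple[float, float]],
                       tolerance: float = 0.05) -> Tuple[float, float]:
    """Fractions of ortholog pairs near the main and the anti-diagonal.

    pairs holds (x, y): relative positions (0..1) of an ortholog in
    genome 1 and genome 2. Points near the centre of the plot can be
    counted on both diagonals.
    """
    if not pairs:
        raise ValueError("at least one ortholog pair is required")
    on_main = sum(1 for x, y in pairs if abs(y - x) <= tolerance)
    on_anti = sum(1 for x, y in pairs if abs(y - (1.0 - x)) <= tolerance)
    return on_main / len(pairs), on_anti / len(pairs)


# Hypothetical toy data: a perfectly syntenic comparison and an X-shaped one.
collinear = [(i / 10, i / 10) for i in range(11)]
x_shaped = [(i / 10, i / 10) if i % 2 == 0 else (i / 10, 1 - i / 10)
            for i in range(11)]

main_c, anti_c = diagonal_fractions(collinear)
main_x, anti_x = diagonal_fractions(x_shaped)
print(main_c, anti_c)
print(main_x, anti_x)
```

A comparison in which nearly all points fall on the main diagonal corresponds to panel a, whereas a substantial fraction on both diagonals corresponds to the X-shaped distribution of panel c.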

Aside from syntenic analyses using visualization tools, other programs have been developed to search for other types of information in comparative genomics. For example, GC Comparison Graph from The CMR [74] compares the GC content between two genomes by plotting orthologs along the axes according to their GC content, highlighting GC compositional shifts at the genome level between two genomes. Although useful, these programs are subject to several practical drawbacks, the most important of which is their inability to perform multiple genome comparisons and hence to establish the ancestry of genome rearrangement dynamics.
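The quantity underlying a GC comparison of orthologs is simple to compute. The sketch below is a hedged illustration and does not reproduce the CMR tool: the function names and the short toy sequences are hypothetical, and ambiguous bases are simply ignored.

```python
from typing import Dict, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / counted


def gc_shift(orthologs: Dict[str, Tuple[str, str]]) -> Dict[str, float]:
    """GC content in genome 2 minus GC content in genome 1, per ortholog."""
    return {name: gc_content(seq2) - gc_content(seq1)
            for name, (seq1, seq2) in orthologs.items()}


# Hypothetical toy pair: the second copy has drifted toward A and T.
toy_orthologs = {"geneX": ("ATGGCGCC", "ATGATAAC")}
shifts = gc_shift(toy_orthologs)
print(shifts)
```

Plotting the two GC values of each ortholog pair against each other gives the kind of graph described above; consistently negative shifts would indicate a genome-wide move toward lower GC content in the second genome.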

7.3. Multiple Genome Comparisons

As the number of genomes increased over the last decade, the demand for an understanding of the dynamics of genome evolution also increased. Dealing with the complexity of multiple genome comparisons was long held back by the lack of appropriate software tools, but several such tools are now available. An example of a multiple genome comparison tool is GenColors from the Jena Prokaryotic Genome Viewer (JPGV) [81]. This program allows the user to display a number of features on the genome, such as CDS, RNA genes, tRNA genes, rRNA genes, misc RNA, GC content, GC skew, and keto excess. It also represents genomes either in a circular diagram or in a linear plot. Although several genomes can be examined at the same time using this tool, the results are human observations of the genomes rather than real phylogenetic studies of the genome properties. JPGV allows multiple genome comparisons by determining a core gene set of two or more genomes, defined by the set of best-bidirectional hits for all possible pairs of genes. Other methods of the JPGV are implemented to perform pairwise comparisons only.
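One way to read the core gene set definition above is that a gene family belongs to the core only if its members are best-bidirectional hits in every pair of genomes. The Python sketch below illustrates that reading under stated assumptions; it is not the JPGV implementation, and the data layout (`rbh[(a, b)]` mapping a gene of genome `a` to its best-bidirectional partner in genome `b`, with `a` listed before `b`) and all gene names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# rbh[(a, b)] maps each gene of genome a to its best-bidirectional hit in b.
RbhTable = Dict[Tuple[str, str], Dict[str, str]]


def core_gene_sets(genomes: List[str], rbh: RbhTable) -> List[Tuple[str, ...]]:
    """Gene tuples (one gene per genome) that are pairwise best-bidirectional hits."""
    if len(genomes) < 2:
        raise ValueError("at least two genomes are required")
    reference, others = genomes[0], genomes[1:]
    core = []
    for ref_gene in sorted(rbh[(reference, others[0])]):
        # Collect the partner of the reference gene in every other genome.
        members = {reference: ref_gene}
        for other in others:
            members[other] = rbh[(reference, other)].get(ref_gene)
        if None in members.values():
            continue  # the gene is missing from at least one genome
        # Require the partners to be best-bidirectional hits of each other too.
        if all(rbh[(a, b)].get(members[a]) == members[b]
               for a, b in combinations(genomes, 2)):
            core.append(tuple(members[g] for g in genomes))
    return core


# Hypothetical toy data for three genomes.
toy_genomes = ["G1", "G2", "G3"]
toy_rbh = {
    ("G1", "G2"): {"a1": "a2", "b1": "b2", "c1": "c2"},
    ("G1", "G3"): {"a1": "a3", "b1": "b3"},
    ("G2", "G3"): {"a2": "a3", "b2": "x3"},
}
core = core_gene_sets(toy_genomes, toy_rbh)
print(core)
```

Here only the `a` family is retained: the `b` family is inconsistent between G2 and G3, and the `c` family has no partner in G3.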

In addition, there are computational tools that compare multiple circular prokaryotic genomes and present their similarities in a circular diagram. Some of these tools perform these comparisons in addition to the BLAST searches; the CGView server is an example [82]. Others also display information about the percentage of GC for each of the genomes, as is the case with GenomeViz [83].

To gain more information about genome rearrangements and inversions, there has been a great effort in developing tools that perform linear comparisons between genomes. These tools compare genomes by performing genome alignments where possible and then conducting multiple genome comparisons. There are several types of multiple genome alignment algorithms. The first type is based on defining a reference genome and performing alignments with respect to that reference genome. This type of alignment algorithm is implemented in a program called Vista [84]. The second type performs an iterative pairwise alignment under the control of a guide tree, which defines the order in which the genomes should be added to the alignment. The third type of algorithm determines anchors present in all genomes and then proceeds to align them. Once the anchors are aligned, the last step is to close the gaps between them by aligning the substrings that separate them. Examples of programs implementing this type of algorithm are MGA [85], M-GCAT [86], and Mauve [87], each of which has its own algorithm for identifying the anchors and for aligning the interanchor regions afterward.
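The anchor-finding step of the third type of algorithm can be sketched in Python. This is a deliberately simplified stand-in and not the algorithm of MGA, M-GCAT, or Mauve: it assumes that an anchor is an exact substring of fixed length `k` occurring exactly once in every genome, it considers the forward strand only, and the sequences are hypothetical toy data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def unique_kmers(seq: str, k: int) -> Dict[str, int]:
    """Map every k-mer that occurs exactly once in seq to its start position."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: seq.find(kmer) for kmer, n in counts.items() if n == 1}


def shared_anchors(genomes: List[str], k: int) -> List[Tuple[Tuple[int, ...], str]]:
    """k-mers unique in every genome, with their position in each genome."""
    per_genome = [unique_kmers(genome, k) for genome in genomes]
    common = set(per_genome[0]).intersection(*per_genome[1:])
    return sorted((tuple(positions[kmer] for positions in per_genome), kmer)
                  for kmer in common)


# Hypothetical toy sequences sharing a single 4-mer anchor.
toy_sequences = ["AAACGTCCC", "GGACGTGG", "TACGTTT"]
anchors = shared_anchors(toy_sequences, k=4)
print(anchors)
```

In a full aligner, the regions between consecutive anchors would then be aligned with a conventional alignment method, and anchors whose order differs between genomes would point to rearrangements.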

Other tools allow the user to go beyond the alignment of genomes. For example, MANTIS [88] is a phylogenetic-group-specific tool (metazoan phylogeny) that analyzes the patterns of gene gains and losses at specific branches of the phylogeny. The program then infers the gene content of the ancestral genome of the clade and identifies over- or underrepresentation of certain processes among the gained or lost genes.
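To illustrate how gene gains and losses can be placed on a phylogeny, the sketch below applies Fitch parsimony to the presence or absence of a single gene. This technique is used here only as a simple example and should not be taken as the method implemented in MANTIS; the tree encoding (nested tuples), the species names, and the presence patterns are all hypothetical.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, presence: Dict[str, int]) -> Tuple[Set[int], int]:
    """Bottom-up Fitch pass for one gene.

    Returns the set of most parsimonious states (1 = present, 0 = absent)
    at the root of the tree and the minimum number of gain or loss events.
    """
    if isinstance(tree, str):
        return {presence[tree]}, 0
    left_states, left_events = fitch(tree[0], presence)
    right_states, right_events = fitch(tree[1], presence)
    shared = left_states & right_states
    if shared:
        return shared, left_events + right_events
    # The children disagree, so one gain or loss is needed on this node's branches.
    return left_states | right_states, left_events + right_events + 1


# Hypothetical four-species tree ((A, B), (C, D)).
toy_tree = (("A", "B"), ("C", "D"))
gained_in_a = {"A": 1, "B": 0, "C": 0, "D": 0}
lost_in_d = {"A": 1, "B": 1, "C": 1, "D": 0}

root_gain, events_gain = fitch(toy_tree, gained_in_a)
root_loss, events_loss = fitch(toy_tree, lost_in_d)
print(root_gain, events_gain)
print(root_loss, events_loss)
```

For the first pattern the ancestor is inferred to lack the gene, so the single event is a gain; for the second the ancestor is inferred to carry the gene, so the single event is a loss.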

Despite all these efforts to develop more robust and accurate methods for comparative genomic studies, several biological phenomena pose difficulties in identifying the real genome dynamic processes in organisms. For example, genome duplication, genome shrinkage in intracellular symbiotic bacteria, and lateral gene transfer may well hide the real genome rearrangement processes undergone in particular genomes. To illustrate the importance of organismal biology for understanding genome dynamics, we will focus the rest of the review on intracellular bacterial genomes.

8. Comparative Genomics of Intracellular Bacteria

Intracellular bacteria are a special group of organisms that have been able to adapt to intracellular life, establishing either a symbiotic or a pathogenic relationship with the host. Because many of the genes that were important for the free-living lifestyle are no longer needed by these bacteria, these genes underwent nonfunctionalization followed by disintegration [89]. This process has been enhanced by the fact that the host provides these bacteria with some of the components they need and with a chemically stable, rich environment. Genome shrinkage is therefore a fact in most if not all strictly intracellular bacteria, and this process has mostly been accompanied by genome rearrangements and fast evolutionary rates of proteins. Because of these genomic and evolutionary events associated with intracellular life, comparative genomics (including identification of orthologs and paralogs, synteny analyses, and others) poses a great challenge when these bacteria are compared with free-living bacteria, and biological information must be included in the comparative genomics analyses to increase the accuracy of the results.

In the case of symbiotic relationships, the difficulty of comparative genomics acquires another dimension of complexity, specifically associated with the mutational dynamics of these organisms. There are two main groups of symbiotic bacteria: the facultative and the obligate. A facultative association implies that each partner can survive without the other under special environmental conditions. This is seen, for example, between the pea aphid Acyrthosiphon pisum and the facultative endosymbiont Hamiltonella defensa, which protects the aphid against parasitism by the solitary endoparasitoids Aphidius ervi and Aphidius eadyi [90–92]. In the other case, the obligate association, the relationship between the two organisms becomes so close that the host's relative biological fitness would be seriously compromised if it were deprived of the symbiont. This is the case of the symbiotic relationship between the bacterium Buchnera aphidicola sp. and the aphid insect [93], an example in which the host (the aphid) has evolved specialized cells, so-called bacteriocytes, to house its endosymbionts [94]. This relationship is one of the best characterized in the literature, so the remaining part of this review will focus on endosymbionts contained in bacteriocytes and the challenges that their mutational dynamics impose on the comparative genomics of bacteria.

8.1. Genome Evolution of Intracellular Bacteria

The clonal vertical transmission of small populations of many intracellular symbiotic bacteria and pathogens to the next host generation imposes a strong bottleneck on the effective population size of these bacteria. This results in relaxed selective constraints on the symbiont genomes, which are channeled into a dynamic of neutral fixation of slightly deleterious mutations and an irreversible increase in the endosymbiont genome mutational load (a phenomenon named Muller's ratchet [95]). However, these bacteria are also subject to selection imposed mostly on their insect hosts. Because of their clonal transmission and their confinement to the interior of bacteriocytes, symbiotic bacteria have little or no opportunity for recombination and hence have no alternative means for the removal of these slightly deleterious mutations.
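The effect of a small effective population size on the fate of slightly deleterious mutations can be illustrated numerically. The sketch below uses Kimura's diffusion approximation for the fixation probability of a new mutation, which is a standard population genetics result and is not taken from this review; it assumes a haploid Wright-Fisher population whose effective size equals its census size, a mutation starting as a single copy, and a hypothetical selection coefficient of -0.001.

```python
import math


def fixation_probability(s: float, n_e: int) -> float:
    """Kimura's approximation for a new mutation in a haploid population.

    s is the selection coefficient (negative for deleterious mutations)
    and n_e the effective population size; the mutation starts at
    frequency 1 / n_e. Computes (1 - exp(-2s)) / (1 - exp(-2 * n_e * s)).
    """
    if s == 0.0:
        return 1.0 / n_e  # neutral case
    try:
        return math.expm1(-2.0 * s) / math.expm1(-2.0 * n_e * s)
    except OverflowError:
        # Only reached for deleterious mutations in very large populations,
        # where the probability is effectively zero.
        return 0.0


s = -0.001  # hypothetical slightly deleterious mutation
for n_e in (100, 100_000):
    relative = fixation_probability(s, n_e) / (1.0 / n_e)
    print(n_e, fixation_probability(s, n_e), relative)
```

Under these assumptions the mutation behaves almost neutrally in the small population, whereas in the large population its fixation probability is negligible compared with that of a neutral mutation, which is the contrast described in the paragraph above.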

Is there a minimum set of genes necessary for the maintenance of intracellular life? Numerous scientists have addressed this question and many have been attempting to answer it through the study of the smallest endosymbiotic genome [96]. Comparative genomics studies in a large number of organisms have shown that the minimal gene content will depend on the environmental conditions the organism lives under [97, 98].

The process of gene loss in intracellular organisms has an important effect on rewiring the functional relationships among genes. This would lead to different organisms containing different genes that perform the same essential functions in the cell. When looking at the gene content of intracellular bacteria, we should therefore talk about functional groups of genes rather than individual genes [99].

8.2. Difficulties with Comparative Genomics of Bacteriocyte-Housed Insect Endosymbionts

Comparison of bacterial genomes may provide clues about the main genome rearrangement dynamics supporting different lifestyles, for example, through the comparative genomics of intracellular symbiotic bacteria and their closest free-living relatives. Comparing bacteria that are in an intermediate stage between the free-living lifestyle and host-specific symbiosis (primary endosymbionts) with each of these two groups could shed some light on the establishment of symbiosis itself. These intermediate bacteria are the ones we refer to as secondary endosymbionts; they are distinguishable from primary symbionts by their larger genomes and by the fact that they do not live under the protection of the bacteriocytes provided by their hosts.

As a consequence of Muller's ratchet in intracellular bacteria, in combination with mutational bias, their genomes present a higher AT content than that observed in their free-living relatives [100]. As a result, programs like BLAST have increased difficulty in determining homologs, especially between the intracellular bacteria and their free-living relatives.

The difficulty of doing comparative genomics with intracellular bacteria is that few, if any, software programs have been designed to deal with the theoretical problems seen in these organisms. Most software and methods have been directed toward the mainstream task of comparing genomes of similar sizes that belong to bacteria with minor differences in lifestyle or environmental conditions. The challenge, however, lies in identifying important genomic dynamics that occurred during the transition between two lifestyles and hence between potentially different biological systems.

One of the biggest problems in comparing the genomes of endosymbiotic and pathogenic bacteria with those of their closest free-living relatives is the different evolutionary forces under which they evolve. Because the populations of endocellular symbiotic bacteria undergo a strong bottleneck during intergenerational transmission, many of the stochastically produced amino acid mutations are fixed by genetic drift despite their slightly deleterious effects. This implies that the mean mutational load in the endocellular bacteria will increase dramatically, posing serious difficulties in finding their orthologs in free-living bacteria. Comparing endosymbionts with each other can yield valuable information about endosymbiosis, but it is crucial to compare the endosymbionts with free-living bacteria to be able to investigate the transition from a free-living to an intracellular lifestyle and to predict the shift in evolutionary forces. Novel methods are hence required to account for the biological and population genetics differences of the organisms whose genomes are being compared.

8.3. Databases and Methods for the Analysis of Endosymbionts

BuchneraBASE [101] is a database that contains information on Buchnera sp. APS. To our knowledge, this database is the only one of its kind devoted completely to a primary symbiont. Unlike many other databases, it does not offer the user any direct comparative genomics tool, but it contains some data obtained from the comparison between symbiotic gamma-proteobacteria and an in silico model of Escherichia coli. The database was built so as to integrate newly sequenced genomes from symbiotic bacteria as they became available. It performs comparisons between different genomes using gene orthology information. The database also has a summary page that shows two user-interactive tables. The first table gives the number of genes in each of the genomes that fall into a certain category, for example, the total number of complete genes, the total number of pseudogenes, or the genes with an E. coli ortholog that are shared or not shared with the endosymbiont Wigglesworthia glossinidia. The second table can be used to browse through each of the functional classifications for each of the symbionts stored in the database.

To our knowledge, there is only one program, GRAST [76], that has been developed with the sole purpose of investigating the evolutionary dynamics of endosymbionts. It performs a pairwise comparison between a free-living genome (the reference genome) and an endosymbiotic genome and allows the user to choose between different output options, providing valuable insights into how the genome dynamics of endosymbionts have changed in comparison with those of their free-living relatives. The outputs range from genome plots in which the sets of orthologous and nonorthologous genes are plotted for the two genomes being compared to plots analyzing the distribution of genome rearrangements or dynamics in one of the genomes (Figure 3). Among other types of information, the program yields information about conserved regions between the two genomes, the distribution of percentage differences in the number of genes present in the different functional categories between the two genomes, deviations from the expected percentage of orthologs between the genomes, and intergenic regions according to their position or rearrangement in the two genomes.

A brief look at the genome sizes of bacteria suffices to reveal the incredible diversity of the genomic dynamic events that have taken place throughout evolution. These events are key to understanding the different evolutionary processes shaping organismal organization. Intracellular organisms account for a minority of this diversity, but they represent extreme cases in which most of the genomic dynamics become dramatically manifest. New methods should therefore be developed to perform in-depth comparative genomic analyses of these bacteria to infer important shifts in the evolution of genomes.

The genomic era has exploded and generated new research avenues that go beyond all expectations. A plethora of novel experimental designs and computational tools has been fuelled by the information generated from the first comparative genomics analyses. The challenge that remains is to design new comprehensive and accurate bioinformatics tools capable of overcoming our limited capacity to analyze the overwhelming amount of genomic data generated.

References

  1. Rappe MS, Giovannoni SJ: The uncultured microbial majority. Annu Rev Microbiol. 2003, 57: 369-394. 10.1146/annurev.micro.57.030502.090759.

  2. Hugenholtz P: Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002, 3 (2): REVIEWS0003-10.1186/gb-2002-3-2-reviews0003.

  3. Chen K, Pachter L: Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput Biol. 2005, 1 (2): 106-112. 10.1371/journal.pcbi.0010024.

  4. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004, 428 (6978): 37-43. 10.1038/nature02340.

  5. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative metagenomics of microbial communities. Science. 2005, 308 (5721): 554-557. 10.1126/science.1107851.

  6. Riesenfeld CS, Schloss PD, Handelsman J: Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004, 38: 525-552. 10.1146/annurev.genet.38.072902.091216.

  7. Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA. 1977, 74 (12): 5463-5467. 10.1073/pnas.74.12.5463.

  8. Edwards A, Voss H, Rice P, Civitello A, Stegemann J, Schwager C, Zimmermann J, Erfle H, Caskey CT, Ansorge W: Automated DNA sequencing of the human HPRT locus. Genomics. 1990, 6 (4): 593-608. 10.1016/0888-7543(90)90493-E.

  9. Green P: Whole-genome disassembly. Proc Natl Acad Sci USA. 2002, 99 (7): 4143-4144. 10.1073/pnas.082095999.

  10. Kaiser O, Bartels D, Bekel T, Goesmann A, Kespohl S, Puhler A, Meyer F: Whole genome shotgun sequencing guided by bioinformatics pipelines—an optimized approach for an established technique. J Biotechnol. 2003, 106 (2–3): 121-133. 10.1016/j.jbiotec.2003.08.008.

  11. Tauch A, Homann I, Mormann S, Ruberg S, Billault A, Bathe B, Brand S, Brockmann-Gretza O, Ruckert C, Schischka N, Wrenger C, Hoheisel J, Mockel B, Huthmacher K, Pfefferle W, Puhler A, Kalinowski J: Strategy to sequence the genome of Corynebacterium glutamicum ATCC 13032: use of a cosmid and a bacterial artificial chromosome library. J Biotechnol. 2002, 95 (1): 25-38. 10.1016/S0168-1656(01)00443-6.

  12. Goldberg SMD, Johnson J, Busam D, Feldblyum T, Ferriera S, Friedman R, Halpern A, Khouri H, Kravitz SA, Lauro FM, Li K, Rogers YH, Strausberg R, Sutton G, Tallon L, Thomas T, Venter E, Frazier M, Venter JC: A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes (vol 103., pg 11240., 2006). Proc Natl Acad Sci USA. 2006, 103 (43): 16057-10.1073/pnas.0607197103.

  13. Potera C: New gene sequencer targets productivity—Solexa says its novel system offers better cost-effectiveness via use of short-read sequences. Genet Eng News. 2006, 26 (17): 10–+-

  14. Graveley BR: Molecular biology—power sequencing. Nature. 2008, 453 (7199): 1197-1198. 10.1038/4531197b.

  15. Wicker T, Schlagenhauf E, Graner A, Close TJ, Keller B, Stein N: 454 sequencing put to the test using the complex genome of barley. BMC Genomics. 2006, 7: 275-10.1186/1471-2164-7-275.

  16. Branscomb E, Predki P: On the high value of low standards. J Bacteriol. 2002, 184 (23): 6406-6409. 10.1128/JB.184.23.6406-6409.2002.

  17. Fraser CM, Eisen JA, Nelson KE, Paulsen IT, Salzberg SL: The value of complete microbial genome sequencing (you get what you pay for). J Bacteriol. 2002, 184 (23): 6403-6405. 10.1128/JB.184.23.6403-6405.2002. discusion 5.

  18. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8 (3): 175-185.

  19. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8 (3): 186-194.

  20. Green P: PHRAP., unpublished. 1994, [http://www.phrap.org/]

  21. Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res. 1998, 8 (3): 195-202.

  22. Pop M, Salzberg SL, Shumway M: Genome sequence assembly: algorithms and issues. Computer. 2002, 35 (7): 47-54. 10.1109/MC.2002.1016901.

  23. Sutton G, White O, Adams MD, Kerlavage AR: TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Technol. 1995, 1: 9-19.

  24. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC: A whole-genome assembly of Drosophila. Science. 2000, 287 (5461): 2196-2204. 10.1126/science.287.5461.2196.

  25. Sommer DD, Delcher AL, Salzberg SL, Pop M: Minimus: a fast, lightweight genome assembler. BMC Bioinformatics. 2007, 8: 64-10.1186/1471-2105-8-64.

  26. Gilchrist R, Chi V: Visible Genetics Inc.., assignee. GeneObject. 1999, inventors. USA patent 5916747.

  27. Walther D, Bartha G, Morris M: Basecalling with LifeTrace. Genome Res. 2001, 11 (5): 875-888. 10.1101/gr.177901.

  28. Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Weinstock GM, Gibbs RA: The Atlas genome assembly system. Genome Res. 2004, 14 (4): 721-732. 10.1101/gr.2264004.

  29. Mullikin JC, Ning Z: The phusion assembler. Genome Res. 2003, 13 (1): 81-90. 10.1101/gr.731003.

  30. Peltola H, Soderlund H, Ukkonen E: SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res. 1984, 12 (1 Pt 1): 307-321. 10.1093/nar/12.1Part1.307.

  31. Pop M, Phillippy A, Delcher AL, Salzberg SL: Comparative genome assembly. Brief Bioinform. 2004, 5 (3): 237-248. 10.1093/bib/5.3.237.

  32. Pop M, Kosack DS, Salzberg SL: Hierarchical scaffolding with Bambus. Genome Res. 2004, 14 (1): 149-159. 10.1101/gr.1536204.

  33. Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-877. 10.1101/gr.9.9.868.

  34. Chen T, Skiena SS: A case study in genome-level fragment assembly. Bioinformatics. 2000, 16 (6): 494-500. 10.1093/bioinformatics/16.6.494.

  35. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigo R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, 
Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X: The sequence of the human genome. Science. 2001, 291 (5507): 1304-1351. 10.1126/science.1058040.

  36. Istrail S, Sutton GG, Florea L, Halpern AL, Mobarry CM, Lippert R, Walenz B, Shatkay H, Dew I, Miller JR, Flanigan MJ, Edwards NJ, Bolanos R, Fasulo D, Halldorsson BV, Hannenhalli S, Turner R, Yooseph S, Lu F, Nusskern DR, Shue BC, Zheng XH, Zhong F, Delcher AL, Huson DH, Kravitz SA, Mouchard L, Reinert K, Remington KA, Clark AG, Waterman MS, Eichler EE, Adams MD, Hunkapiller MW, Myers EW, Venter JC: Whole-genome shotgun assembly and comparison of human genome assemblies. Proc Natl Acad Sci USA. 2004, 101 (7): 1916-1921. 10.1073/pnas.0307971100.

  37. Mural RJ, Adams MD, Myers EW, Smith HO, Miklos GL, Wides R, Halpern A, Li PW, Sutton GG, Nadeau J, Salzberg SL, Holt RA, Kodira CD, Lu F, Chen L, Deng Z, Evangelista CC, Gan W, Heiman TJ, Li J, Li Z, Merkulov GV, Milshina NV, Naik AK, Qi R, Shue BC, Wang A, Wang J, Wang X, Yan X, Ye J, Yooseph S, Zhao Q, Zheng L, Zhu SC, Biddick K, Bolanos R, Delcher AL, Dew IM, Fasulo D, Flanigan MJ, Huson DH, Kravitz SA, Miller JR, Mobarry CM, Reinert K, Remington KA, Zhang Q, Zheng XH, Nusskern DR, Lai Z, Lei Y, Zhong W, Yao A, Guan P, Ji RR, Gu Z, Wang ZY, Zhong F, Xiao C, Chiang CC, Yandell M, Wortman JR, Amanatides PG, Hladun SL, Pratts EC, Johnson JE, Dodson KL, Woodford KJ, Evans CA, Gropman B, Rusch DB, Venter E, Wang M, Smith TJ, Houck JT, Tompkins DE, Haynes C, Jacob D, Chin SH, Allen DR, Dahlke CE, Sanders R, Li K, Liu X, Levitsky AA, Majoros WH, Chen Q, Xia AC, Lopez JR, Donnelly MT, Newman MH, Glodek A, Kraft CL, Nodell M, Ali F, An HJ, Baldwin-Pitts D, Beeson KY, Cai S, Carnes M, Carver A, Caulk PM, Center A, Chen YH, Cheng ML, Coyne MD, Crowder M, Danaher S, Davenport LB, Desilets R, Dietz SM, Doup L, Dullaghan P, Ferriera S, Fosler CR, Gire HC, Gluecksmann A, Gocayne JD, Gray J, Hart B, Haynes J, Hoover J, Howland T, Ibegwam C, Jalali M, Johns D, Kline L, Ma DS, MacCawley S, Magoon A, Mann F, May D, McIntosh TC, Mehta S, Moy L, Moy MC, Murphy BJ, Murphy SD, Nelson KA, Nuri Z, Parker KA, Prudhomme AC, Puri VN, Qureshi H, Raley JC, Reardon MS, Regier MA, Rogers YH, Romblad DL, Schutz J, Scott JL, Scott R, Sitter CD, Smallwood M, Sprague AC, Stewart E, Strong RV, Suh E, Sylvester K, Thomas R, Tint NN, Tsonis C, Wang G, Wang G, Williams MS, Williams SM, Windsor SM, Wolfe K, Wu MM, Zaveri J, Chaturvedi K, Gabrielian AE, Ke Z, Sun J, Subramanian G, Venter JC, Pfannkoch CM, Barnstead M, Stephenson LD: A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science. 2002, 296 (5573): 1661-1671. 10.1126/science.1069193.

  38. Kirkness EF, Bafna V, Halpern AL, Levy S, Remington K, Rusch DB, Delcher AL, Pop M, Wang W, Fraser CM, Venter JC: The dog genome: survey sequencing and comparative analysis. Science. 2003, 301 (5641): 1898-1903. 10.1126/science.1086432.

  39. Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, Nusskern DR, Wincker P, Clark AG, Ribeiro JM, Wides R, Salzberg SL, Loftus B, Yandell M, Majoros WH, Rusch DB, Lai Z, Kraft CL, Abril JF, Anthouard V, Arensburger P, Atkinson PW, Baden H, de Berardinis V, Baldwin D, Benes V, Biedler J, Blass C, Bolanos R, Boscus D, Barnstead M, Cai S, Center A, Chaturverdi K, Christophides GK, Chrystal MA, Clamp M, Cravchik A, Curwen V, Dana A, Delcher A, Dew I, Evans CA, Flanigan M, Grundschober-Freimoser A, Friedli L, Gu Z, Guan P, Guigo R, Hillenmeyer ME, Hladun SL, Hogan JR, Hong YS, Hoover J, Jaillon O, Ke Z, Kodira C, Kokoza E, Koutsos A, Letunic I, Levitsky A, Liang Y, Lin JJ, Lobo NF, Lopez JR, Malek JA, McIntosh TC, Meister S, Miller J, Mobarry C, Mongin E, Murphy SD, O'Brochta DA, Pfannkoch C, Qi R, Regier MA, Remington K, Shao H, Sharakhova MV, Sitter CD, Shetty J, Smith TJ, Strong R, Sun J, Thomasova D, Ton LQ, Topalis P, Tu Z, Unger MF, Walenz B, Wang A, Wang J, Wang M, Wang X, Woodford KJ, Wortman JR, Wu M, Yao A, Zdobnov EM, Zhang H, Zhao Q, Zhao S, Zhu SC, Zhimulev I, Coluzzi M, della Torre A, Roth CW, Louis C, Kalush F, Mural RJ, Myers EW, Adams MD, Smith HO, Broder S, Gardner MJ, Fraser CM, Birney E, Bork P, Brey PT, Venter JC, Weissenbach J, Kafatos FC, Collins FH, Hoffman SL: The genome sequence of the malaria mosquito Anopheles gambiae. Science. 2002, 298 (5591): 129-149. 10.1126/science.1076181.

  40. Shizuya H, Birren B, Kim UJ, Mancino V, Slepak T, Tachiiri Y, Simon M: Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci USA. 1992, 89 (18): 8794-8797. 10.1073/pnas.89.18.8794.

  41. Stein L: Genome annotation: from sequence to biology. Nat Rev Genet. 2001, 2 (7): 493-503. 10.1038/35080529.

  42. Reese MG, Hartzell G, Harris NL, Ohler U, Abril JF, Lewis SE: Genome annotation assessment in Drosophila melanogaster. Genome Res. 2000, 10 (4): 483-501. 10.1101/gr.10.4.483.

  43. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997, 25 (5): 955-964. 10.1093/nar/25.5.955.

  44. Pennacchio LA, Rubin EM: Genomic strategies to identify mammalian regulatory sequences. Nat Rev Genet. 2001, 2 (2): 100-109. 10.1038/35052548.

  45. Mi H, Guo N, Kejariwal A, Thomas PD: PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res. 2007, 35 (Database issue): D247-D252. 10.1093/nar/gkl869.

  46. Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD: The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 2005, 33 (Database issue): D284-D288. 10.1093/nar/gki078.

  47. Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003, 13 (9): 2129-2141. 10.1101/gr.772403.

  48. Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-Lazareva B, Muruganujan A, Rabkin S, Vandergriff JA, Doremieux O: PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res. 2003, 31 (1): 334-341. 10.1093/nar/gkg115.

  49. Blake JA, Harris MA: The Gene Ontology (GO) project: structured vocabularies for molecular biology and their application to genome and expression analysis. Curr Protoc Bioinformatics. 2002, Chapter 7 (Unit 7.2).

  50. Camon E, Barrell D, Brooksbank C, Magrane M, Apweiler R: The Gene Ontology Annotation (GOA) Project—application of GO in SWISS-PROT, TrEMBL and InterPro. Comp Funct Genomics. 2003, 4 (1): 71-74. 10.1002/cfg.235.

  51. Selkov E, Overbeek R, Kogan Y, Chu L, Vonstein V, Holmes D, Silver S, Haselkorn R, Fonstein M: Functional analysis of gapped microbial genomes: amino acid metabolism of Thiobacillus ferrooxidans. Proc Natl Acad Sci USA. 2000, 97 (7): 3509-3514. 10.1073/pnas.97.7.3509.

  52. Fleischmann R: Single nucleotide polymorphisms in Mycobacterium tuberculosis structural genes—response to Dr. Musser. Emerg Infect Dis. 2001, 7 (3): 487-488.

  53. Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001, 98 (17): 9748-9753. 10.1073/pnas.171285098.

  54. Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES: ARACHNE: a whole-genome shotgun assembler. Genome Res. 2002, 12 (1): 177-189. 10.1101/gr.208902.

  55. Gordon D, Desmarais C, Green P: Automated finishing with autofinish. Genome Res. 2001, 11 (4): 614-625. 10.1101/gr.171401.

  56. Tammi MT, Arner E, Kindlund E, Andersson B: Correcting errors in shotgun sequences. Nucleic Acids Res. 2003, 31 (15): 4663-4672. 10.1093/nar/gkg653.

  57. Tammi MT, Arner E, Kindlund E, Andersson B: ReDiT: Repeat Discrepancy Tagger—a shotgun assembly finishing aid. Bioinformatics. 2004, 20 (5): 803-804. 10.1093/bioinformatics/bth004.

  58. Bartels D, Kespohl S, Albaum S, Druke T, Goesmann A, Herold J, Kaiser O, Puhler A, Pfeiffer F, Raddatz G, Stoye J, Meyer F, Schuster SC: BACCardI—a tool for the validation of genomic assemblies, assisting genome finishing and intergenome comparison. Bioinformatics. 2005, 21 (7): 853-859. 10.1093/bioinformatics/bti091.

  59. Arner E, Tammi MT, Tran AN, Kindlund E, Andersson B: DNPTrapper: an assembly editing tool for finishing and analysis of complex repeat regions. BMC Bioinformatics. 2006, 7: 155. 10.1186/1471-2105-7-155.

  60. Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool. 1970, 19 (2): 99-113. 10.2307/2412448.

  61. Fitch WM, Margoliash E: The usefulness of amino acid and nucleotide sequences in evolutionary studies. Evol Biol. 1970, 4: 67-109.

  62. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.

  63. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.

  64. Koski LB, Golding GB: The closest BLAST hit is often not the nearest neighbor. J Mol Evol. 2001, 52 (6): 540-542.

  65. Hirsh AE, Fraser HB: Protein dispensability and rate of evolution. Nature. 2001, 411 (6841): 1046-1049. 10.1038/35082561.

  66. Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 2002, 12 (6): 962-968.

  67. Wall DP, Fraser HB, Hirsh AE: Detecting putative orthologs. Bioinformatics. 2003, 19 (13): 1710-1711. 10.1093/bioinformatics/btg213.

  68. Deluca TF, Wu IH, Pu J, Monaghan T, Peshkin L, Singh S, Wall DP: Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics. 2006, 22 (16): 2044-2046. 10.1093/bioinformatics/btl286.

  69. Lee MM, Chan MK, Bundschuh R: Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches. Bioinformatics. 2008, 24: 1339-1343. 10.1093/bioinformatics/btn130.

  70. Poptsova MS, Gogarten JP: BranchClust: a phylogenetic algorithm for selecting gene families. BMC Bioinformatics. 2007, 8: 120. 10.1186/1471-2105-8-120.

  71. Haas BJ, Delcher AL, Wortman JR, Salzberg SL: DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics. 2004, 20 (18): 3643-3646. 10.1093/bioinformatics/bth397.

  72. Celamkoti S, Kundeti S, Purkayastha A, Mazumder R, Buck C, Seto D: GeneOrder3.0: software for comparing the order of genes in pairs of small bacterial genomes. BMC Bioinformatics. 2004, 5: 52. 10.1186/1471-2105-5-52.

  73. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008, 36 (Database issue): D13-D21.

  74. Peterson JD, Umayam LA, Dickinson T, Hickey EK, White O: The Comprehensive Microbial Resource. Nucleic Acids Res. 2001, 29 (1): 123-125. 10.1093/nar/29.1.123.

  75. Choi K, Ma Y, Choi JH, Kim S: PLATCOM: a Platform for Computational Comparative Genomics. Bioinformatics. 2005, 21 (10): 2514-2516. 10.1093/bioinformatics/bti350.

  76. Toft C, Fares MA: GRAST: a new way of genome reduction analysis using comparative genomics. Bioinformatics. 2006, 22 (13): 1551-1561. 10.1093/bioinformatics/btl139.

  77. Xie T, Hood L: ACGT—a comparative genomics tool. Bioinformatics. 2003, 19 (8): 1039-1040. 10.1093/bioinformatics/btg121.

  78. Chen T, Abbey K, Deng WJ, Cheng MC: The bioinformatics resource for oral pathogens. Nucleic Acids Res. 2005, 33 (Web Server issue): W734-W740. 10.1093/nar/gki361.

  79. Leader DP: BugView: a browser for comparing genomes. Bioinformatics. 2004, 20 (1): 129-130. 10.1093/bioinformatics/btg383.

  80. Yang J, Wang J, Yao ZJ, Jin Q, Shen Y, Chen R: GenomeComp: a visualization tool for microbial genome comparison. J Microbiol Methods. 2003, 54 (3): 423-426. 10.1016/S0167-7012(03)00094-0.

  81. Romualdi A, Felder M, Rose D, Gausmann U, Schilhabel M, Glockner G, Platzer M, Suhnel J: GenColors: annotation and comparative genomics of prokaryotes made easy. Methods Mol Biol. 2007, 395: 75-96.

  82. Grant JR, Stothard P: The CGView Server: a comparative genomics tool for circular genomes. Nucleic Acids Res. 2008, 36: W181-W184. 10.1093/nar/gkn179.

  83. Ghai R, Chakraborty T: Comparative microbial genome visualization using GenomeViz. Methods Mol Biol. 2007, 395: 97-108.

  84. Dubchak I, Ryaboy DV: VISTA family of computational tools for comparative analysis of DNA sequences and whole genomes. Methods Mol Biol. 2006, 338: 69-89.

  85. Hohl M, Kurtz S, Ohlebusch E: Efficient multiple genome alignment. Bioinformatics. 2002, 18 (Suppl 1): S312-S320.

  86. Treangen TJ, Messeguer X: M-GCAT: interactively and efficiently constructing large-scale multiple genome comparison frameworks in closely related species. BMC Bioinformatics. 2006, 7: 433. 10.1186/1471-2105-7-433.

  87. Darling AC, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004, 14 (7): 1394-1403. 10.1101/gr.2289704.

  88. Tzika AC, Helaers R, Van de Peer Y, Milinkovitch MC: MANTIS: a phylogenetic framework for multi-species genome comparisons. Bioinformatics. 2008, 24 (2): 151-157. 10.1093/bioinformatics/btm567.

  89. Andersson SG, Kurland CG: Reductive evolution of resident genomes. Trends Microbiol. 1998, 6 (7): 263-268. 10.1016/S0966-842X(98)01312-2.

  90. Oliver KM, Russell JA, Moran NA, Hunter MS: Facultative bacterial symbionts in aphids confer resistance to parasitic wasps. Proc Natl Acad Sci USA. 2003, 100 (4): 1803-1807. 10.1073/pnas.0335320100.

  91. Bensadia F, Boudreault S, Guay JF, Michaud D, Cloutier C: Aphid clonal resistance to a parasitoid fails under heat stress. J Insect Physiol. 2006, 52 (2): 146-157. 10.1016/j.jinsphys.2005.09.011.

  92. Degnan PH, Moran NA: Evolutionary genetics of a defensive facultative symbiont of insects: exchange of toxin-encoding bacteriophage. Mol Ecol. 2008, 17 (3): 916-929. 10.1111/j.1365-294X.2007.03616.x.

  93. Douglas AE: Reproductive failure and the free amino acid pools in pea aphids (Acyrthosiphon pisum) lacking symbiotic bacteria. J Insect Physiol. 1996, 42 (3): 247-255. 10.1016/0022-1910(95)00105-0.

  94. Buchner P: Endosymbiosis of animals with plant microorganisms. 1965, Interscience, New York, NY.

  95. Muller HJ: The relation of recombination to mutation advance. Mutat Res. 1964, 1: 2-9.

  96. Perez-Brocal V, Gil R, Ramos S, Lamelas A, Postigo M, Michelena JM, Silva FJ, Moya A, Latorre A: A small microbial genome: the end of a long symbiotic relationship?. Science. 2006, 314 (5797): 312-313. 10.1126/science.1130441.

  97. Koonin EV: How many genes can make a cell: the minimal-gene-set concept. Annu Rev Genomics Hum Genet. 2000, 1: 99-116. 10.1146/annurev.genom.1.1.99.

  98. Gerdes SY, Scholle MD, Campbell JW, Balazsi G, Ravasz E, Daugherty MD, Somera AL, Kyrpides NC, Anderson I, Gelfand MS, Bhattacharya A, Kapatral V, D'Souza M, Baev MV, Grechkin Y, Mseeh F, Fonstein MY, Overbeek R, Barabasi AL, Oltvai ZN, Osterman AL: Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. J Bacteriol. 2003, 185 (19): 5673-5684. 10.1128/JB.185.19.5673-5684.2003.

  99. Koonin EV: Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol. 2003, 1 (2): 127-136. 10.1038/nrmicro751.

  100. Moran NA: Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc Natl Acad Sci USA. 1996, 93 (7): 2873-2878. 10.1073/pnas.93.7.2873.

  101. Prickett MD, Page M, Douglas AE, Thomas GH: BuchneraBASE: a post-genomic resource for Buchnera sp. APS. Bioinformatics. 2006, 22 (5): 641-642. 10.1093/bioinformatics/btk024.

  102. Tillier ER, Collins RA: Genome rearrangement by replication-directed translocation. Nat Genet. 2000, 26 (2): 195-197. 10.1038/79918.

Author information

Corresponding author

Correspondence to Mario A Fares.

Additional information

Jennifer Commins and Christina Toft contributed equally to this work.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Commins, J., Toft, C. & Fares, M.A. Computational Biology Methods and Their Application to the Comparative Genomics of Endocellular Symbiotic Bacteria of Insects. Biol Proced Online 11, 52 (2009). https://doi.org/10.1007/s12575-009-9004-1

  • DOI: https://doi.org/10.1007/s12575-009-9004-1

Keywords