
Unlocking the Potential of Comparative Genomics in Bacterial Research

3/07/2025

The microbial world is incredibly diverse, encompassing organisms from all three domains of life: Bacteria, Archaea, and some Eukaryotes such as fungi (including yeasts) and microalgae. Viruses are also integral contributors to microbial ecosystems. This vast diversity presents tremendous opportunities to study microbial populations, understand evolutionary traits, and address pressing challenges such as antimicrobial resistance.

Harnessing this diversity has become increasingly possible thanks to remarkable advancements in sequencing technology. Over the last decade, the dramatic reduction in sequencing costs has made Whole Genome Sequencing (WGS) more accessible, and databases like the Genome Taxonomy Database (GTDB) have grown rapidly as a result. For example, the number of bacterial and archaeal genomes listed in GTDB expanded from 402,709 genomes in April 2023 to 732,475 genomes in April 2025, an increase of roughly 82% in two years.

This explosion in genomic data has fueled the development of innovative tools capable of comparing bacterial strains, uncovering evolutionary traits, and analyzing microbial populations with unprecedented precision. These advancements are transforming epidemiological studies, research, and regulatory assessments, offering powerful insights into genetic diversity, functional adaptations, and antibiotic resistance.

Comparative genomics tools have emerged as cornerstone methodologies, helping scientists, researchers, and industries tackle critical challenges in health, biotechnology, pharmaceuticals, and agri-food. But how are these tools tailored to specific study objectives, and what value do they bring to the table?

This article explores the key steps and methodologies in bacterial comparative genomics, from DNA sequencing and genome assembly to advanced analyses of resistance genes and genetic variation. It aims to show how comparative genomics is driving innovation while addressing complex global challenges like antimicrobial resistance.

1. Bacterial Comparative Genomics: Tools and Techniques

Comparative genomics for exploring bacterial genomes offers numerous tools tailored to specific research needs, contexts (e.g., epidemiology, clinical research), and study objectives. These tools can analyze different types and numbers of strains, examine specific genomic regions, or compare genomes with a reference strain to uncover variations and evolutionary relationships.

1.1 First Steps in Bacterial Genomics

Understanding bacterial genomes begins with DNA extraction from a pure culture and progresses through the construction of sequencing libraries. This preparation involves three main steps: i) amplification of the DNA using Polymerase Chain Reaction (PCR) to obtain a quantifiable signal; ii) hybridization of the DNA fragments to the sequencing flowcell; and iii) definition of sample origins through barcode indexing, enabling simultaneous sequencing of several samples.
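To make the barcode indexing step concrete, here is a minimal Python sketch of how pooled reads could be assigned back to their samples. It assumes the barcode sits at the start of each read, which is only one possible library design, and the barcode sequences and sample names are invented for illustration.

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample based on its leading barcode.

    barcodes maps a barcode sequence to a sample name (hypothetical values).
    The barcode is trimmed from the read once the sample is identified.
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    unassigned = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched: keep the read aside rather than guessing
            unassigned.append(read)
    return by_sample, unassigned


barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
pooled_reads = ["ACGTTTGACC", "TGCAGGATCA", "GGGGAATTCC"]
by_sample, unassigned = demultiplex(pooled_reads, barcodes)
print(by_sample)
print(unassigned)
```

Production demultiplexing tools usually also tolerate a small number of sequencing errors in the barcode, which this sketch does not.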

Sequencing determines the order of nucleotides (A, T, C, and G) in short DNA fragments, referred to as “reads”. This step generates millions of reads that together cover the bacterial genome. The reads are then evaluated for quality (length, quality score, etc.) and cleaned to retain only good-quality reads and limit technical errors.
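The quality control step can be sketched in a few lines of Python. This example assumes reads are already parsed into (name, sequence, quality string) records with the common Phred+33 encoding; the thresholds (minimum length of 50 and mean quality of 20) are illustrative choices, not recommendations.

```python
from statistics import mean


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def filter_reads(records, min_length=50, min_mean_quality=20):
    """Keep only reads that are long enough and of sufficient mean quality."""
    kept = []
    for name, seq, qual in records:
        if len(seq) >= min_length and mean(phred_scores(qual)) >= min_mean_quality:
            kept.append((name, seq, qual))
    return kept


records = [
    ("read1", "ACGT" * 15, "I" * 60),  # long, high quality (Phred 40)
    ("read2", "ACGT" * 15, "#" * 60),  # long, low quality (Phred 2)
    ("read3", "ACGTACGT", "I" * 8),    # high quality but too short
]
kept_reads = filter_reads(records)
print([name for name, _, _ in kept_reads])
```

Dedicated tools typically go further, for example by trimming low-quality ends and adapter sequences instead of discarding whole reads.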

Figure 1. Bacterial genome sequencing: from pure culture in selective media to sequencing reads.

Once sequencing and quality control have been completed, genome assembly represents the next critical step. This process organizes fragmented reads into a coherent sequence, akin to assembling pieces of a puzzle whose overall pattern is not necessarily known.

Genome assembly can be conducted in two different ways:

De Novo Assembly: Reads are assembled into longer fragments without relying on a reference genome. Many assemblers rely on a data structure called a de Bruijn graph: reads are split into short words of fixed length k (k-mers), which are linked in the graph whenever they overlap. Paths through these overlaps are then gradually merged into larger DNA fragments known as contigs.
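The de Bruijn idea can be illustrated with a toy Python example. This is a deliberately simplified sketch: it assumes error-free reads, a linear sequence, and no repeated k-mers, none of which hold for real data. The reads and the value k=4 are made up for the demonstration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph):
    """Follow unambiguous edges, starting from the node with no incoming edge."""
    targets = {t for successors in graph.values() for t in successors}
    node = [n for n in graph if n not in targets][0]  # toy case: exactly one start
    contig = node
    visited = {node}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop on cycles instead of looping forever
            break
        contig += nxt[-1]   # each step extends the contig by one base
        visited.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_contig(graph)
print(contig)
```

Real assemblers must additionally handle sequencing errors, repeats, and branching paths, which is why assemblies usually end up as several contigs rather than one sequence.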

Reference-Based Alignment: Reads are aligned against a known reference genome, selected for its similarity to the target species. This method employs algorithms such as those based on the Burrows-Wheeler transform. The alignment of the reads with the reference is then assessed according to:
- Depth: The average number of reads covering each position of the reference.
- Coverage: The percentage of the reference genome covered by at least one read.
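The two metrics above can be computed as follows. This is a minimal sketch that assumes each aligned read is summarized by a (start, end) interval on the reference, with a 0-based start and an exclusive end; the genome length and intervals are invented.

```python
def depth_and_coverage(genome_length, alignments):
    """Return (mean depth, percent of positions covered by at least one read)."""
    per_base = [0] * genome_length
    for start, end in alignments:
        # Clip intervals to the reference boundaries before counting
        for pos in range(max(start, 0), min(end, genome_length)):
            per_base[pos] += 1
    mean_depth = sum(per_base) / genome_length
    coverage_pct = 100 * sum(1 for d in per_base if d > 0) / genome_length
    return mean_depth, coverage_pct


mean_depth, coverage_pct = depth_and_coverage(10, [(0, 5), (3, 8), (4, 8)])
print(f"mean depth: {mean_depth:.1f}x, coverage: {coverage_pct:.0f}%")
```

In this toy case the last two positions receive no reads, so coverage is below 100% even though the mean depth is above 1.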

Following assembly, the process of genomic annotation ascribes biological information to genome components, such as the names of coding sequences, ribosomal RNA (rRNA), transfer RNA (tRNA), etc.

Figure 2. Genomic annotation: assembly-based versus read-based approach

1.2 What Can Be Analysed in Comparative Genomics?

Comparative genomics unlocks a multitude of possibilities once a genome has been assembled, enabling various types of analysis.

I. Characterization of Nucleotide Variations Between Strains

Genetic differences between strains can be studied through:

- Similarity score (Average Nucleotide Identity (ANI))
- Detection of variants (Single Nucleotide Polymorphisms (SNPs)) (see example 2)
- Molecular typing to characterize and differentiate strains.
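As a rough illustration of the first two items, the sketch below computes a percent identity and lists SNP positions for two sequences that are assumed to be already aligned (gaps shown as "-"). It is a simplified stand-in for ANI, which dedicated tools compute over many aligned genome fragments; the sequences are invented.

```python
def percent_identity(seq_a, seq_b):
    """Percent of identical bases over aligned, gap-free columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # gap columns are not counted
        compared += 1
        matches += a == b
    return 100 * matches / compared if compared else 0.0


def snp_positions(seq_a, seq_b):
    """0-based alignment columns where both strains have a base and they differ."""
    return [
        i for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b and "-" not in (a, b)
    ]


strain_a = "ATGGCGTACGTTAGC-TA"
strain_b = "ATGGCATACGTTAGCGTA"
identity = percent_identity(strain_a, strain_b)
snps = snp_positions(strain_a, strain_b)
print(f"identity: {identity:.2f}%, SNP columns: {snps}")
```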

II. Phylogenetic Analysis

Phylogeny helps to organize biological diversity and understand the origins and evolution of organisms. By aligning genomic data, researchers can study the development of specific characteristics or traits across evolutionary timelines.
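One simple way to turn aligned sequences into a tree is distance-based clustering. The sketch below uses UPGMA on pairwise p-distances (the fraction of differing positions) and returns the topology in Newick format, without branch lengths. UPGMA is chosen here only because it is short to write; it assumes a constant evolutionary rate, and real studies generally rely on neighbor-joining or maximum-likelihood tools. The four sequences are invented.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions that differ between two sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(sequences):
    """Cluster aligned sequences with UPGMA and return a Newick topology."""
    sizes = {name: 1 for name in sequences}  # cluster label -> number of leaves
    dist = {
        tuple(sorted(pair)): p_distance(sequences[pair[0]], sequences[pair[1]])
        for pair in combinations(sequences, 2)
    }
    while len(sizes) > 1:
        # Pick the closest pair; ties are broken by label for reproducibility
        a, b = min(dist, key=lambda pair: (dist[pair], pair))
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del dist[(a, b)]
        for other in sizes:
            d_a = dist.pop(tuple(sorted((a, other))))
            d_b = dist.pop(tuple(sorted((b, other))))
            # Size-weighted average distance to the new cluster
            dist[tuple(sorted((merged, other)))] = (
                (d_a * size_a + d_b * size_b) / (size_a + size_b)
            )
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


aligned = {
    "A": "ATGGCGTACG",
    "B": "ATGGCGTACA",
    "C": "ATGACGTTCA",
    "D": "TTGACCTTCA",
}
tree = upgma(aligned)
print(tree)
```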

III. Feature Identification Using Specialized Databases:

Numerous databases are available and allow researchers to identify critical genomic features:

- Functional annotation of genes,
- Orthology annotation and prediction of bacterial protein sequences,
- Secondary metabolites,
- Resistance genes, virulence genes and insertion sequences.

Figure 3. Comparative genomics examples of analysis between two bacteria.

1.3 Application Example: Comparison of Antibiotic Resistance Genes

Antibiotic resistance is a natural phenomenon that arises from mutations in bacterial genes or from the acquisition of resistance genes from other bacteria. The excessive and inappropriate use of antibiotics has accelerated the emergence and spread of antibiotic-resistant bacteria.

In Europe, monitoring of antibiotic resistance is mandatory for both human and animal bacterial strains. Recent studies by the European Centre for Disease Prevention and Control (ECDC) and the European Food Safety Authority (EFSA) show that antimicrobial resistance in Salmonella and Campylobacter continues to be frequently observed in humans and animals (ECDC, EFSA).

WGS has revolutionized this field, enabling researchers to analyze the presence of antibiotic resistance genes within bacterial isolates.

Two types of approaches are used, depending on the type of sequencing data, available computing resources, and the study objectives:

I. Analysis of Contigs: Contigs are directly aligned with reference sequences in a database to identify and annotate the resistance genes they contain. This commonly used method requires a tool capable of de novo assembly as well as substantial computational capacity.

II. Analysis of Reads: Reads are aligned to reference genomes to predict resistance determinants. This approach is faster and less computationally demanding than the former, making it suitable for large datasets.

Figure 4. Two different approaches for the investigation of antibiotic resistance genes
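The read-based approach can be sketched with a simple k-mer screen: a reference gene is reported when most of its k-mers are found in the reads. This is only an illustration of the principle, not the method of any specific tool; the gene names, sequences, k value, and 90% threshold are all hypothetical.

```python
def kmers(seq, k):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads, reference_genes, k=8, min_fraction=0.9):
    """Report reference genes whose k-mers are mostly present in the reads."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    hits = {}
    for name, gene_seq in reference_genes.items():
        gene_kmers = kmers(gene_seq, k)
        fraction = len(gene_kmers & read_kmers) / len(gene_kmers)
        if fraction >= min_fraction:
            hits[name] = round(fraction, 2)
    return hits


# Hypothetical reference database (names and sequences invented)
reference_genes = {
    "resA_hypothetical": "ATGAAACGCATTAGCACCACCATTACC",
    "resB_hypothetical": "ATGTCGGGTTTACCGGAACTGCTG",
}
# Simulated reads: two overlapping fragments of the first gene only
gene = reference_genes["resA_hypothetical"]
reads = [gene[:18], gene[8:]]
hits = screen_reads(reads, reference_genes)
print(hits)
```

Because no assembly is needed, this kind of screen scales well to large datasets, at the cost of losing the genomic context that contigs provide.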

The accuracy of resistance gene identification depends on the comprehensiveness and quality of the databases employed. These databases fall into three main categories:

- General Databases: These databases cover a wide array of antimicrobial resistance mechanisms.
- Specialized Databases: These databases cater to specific microorganisms or resistance types, such as enzymes or particular variants.
- Novel Variant Databases: These databases are designed to uncover novel antimicrobial resistance variants by detecting functional similarities even in sequences with lower identity.

These databases are continually evolving, with increasing completeness, especially for extensively researched samples like those derived from the human intestine.

1.4 Bacterial Variant Analysis

Variant analysis is the process of identifying differences between closely related genomic sequences. It is particularly useful for comparing DNA sequences from the same bacterial species.

Variants include SNPs and indels (insertions and deletions) in the genome. A SNP is a single-base difference between a sequence and the reference for that species. Indels are insertions or deletions of a few base pairs; in a coding region, they can cause a localized change in the peptide sequence or, when they shift the open reading frame, a more significant protein modification.
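The distinction between these variant types can be expressed as a small Python function. This sketch assumes the variant lies in a coding region and is described by its reference and alternate alleles; an indel whose length is not a multiple of three shifts the reading frame. The labels returned are illustrative, not a standard nomenclature.

```python
def classify_variant(ref, alt):
    """Classify a variant from its reference and alternate alleles."""
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    length_change = abs(len(ref) - len(alt))
    if length_change == 0:
        return "other"  # identical alleles or multi-base substitution
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    # Codons are 3 bases long, so only multiples of 3 preserve the frame
    frame = "in-frame" if length_change % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


examples = [("A", "G"), ("A", "ATTG"), ("ACG", "A")]
labels = [classify_variant(ref, alt) for ref, alt in examples]
print(labels)
```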

The figure below summarizes the steps involved in variant identification:

Figure 6. Workflow of the SNP calling pipeline.

For the identification of a variant, the evaluation of SNPs is a critical step: the number of reads supporting the SNP must be high, as must be the quality of the alignment. The functional consequences of a SNP (impact on the encoded protein, etc.) can also be studied with specific tools.
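The evaluation criteria described above can be sketched as a filter over candidate SNPs. The record fields and the thresholds (depth of at least 10, 90% of reads supporting the alternate base, mean mapping quality of at least 30) are assumptions chosen for illustration; appropriate values depend on the organism, the sequencing depth, and the study.

```python
from dataclasses import dataclass


@dataclass
class CandidateSNP:
    position: int     # position on the reference
    ref: str          # reference base
    alt: str          # alternate base
    depth: int        # total reads covering the position
    alt_reads: int    # reads supporting the alternate base
    mean_mapq: float  # mean mapping quality of the covering reads


def passes_filters(snp, min_depth=10, min_alt_fraction=0.9, min_mapq=30):
    """Keep a SNP only if read support and alignment quality are both high."""
    if snp.depth < min_depth:
        return False
    alt_fraction = snp.alt_reads / snp.depth
    return alt_fraction >= min_alt_fraction and snp.mean_mapq >= min_mapq


candidates = [
    CandidateSNP(1200, "A", "G", depth=45, alt_reads=44, mean_mapq=58.0),
    CandidateSNP(5310, "C", "T", depth=6, alt_reads=6, mean_mapq=60.0),    # too shallow
    CandidateSNP(8874, "G", "A", depth=40, alt_reads=22, mean_mapq=59.0),  # mixed support
    CandidateSNP(9921, "T", "C", depth=38, alt_reads=37, mean_mapq=12.0),  # poor alignment
]
confident = [snp.position for snp in candidates if passes_filters(snp)]
print(confident)
```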

2. Other Applications

Other applications of comparative microbial genomics include clinical research and species analysis.

In clinical research, machine learning is used to build models that predict antibiotic resistance from genomic data. See also our article on machine learning applied to omics data.

The analysis of a species can be described by its pangenome, i.e. all its gene families, not only its core genome (shared by all individuals within a species) but also the accessory genes that provide additional functions and selective advantages (ecological adaptation, virulence mechanisms, antibiotic resistance, colonization of a new host, etc.). Pangenome studies illuminate genetic diversity within species and trace their evolutionary dynamics, offering insights into selective advantages and adaptive traits.
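The pangenome, core genome, and accessory genes can be derived with basic set operations once each genome is summarized as a set of gene families. The sketch below assumes gene families have already been assigned; the strain names and the two accessory gene names are hypothetical.

```python
def pangenome(gene_content):
    """Split gene families into pangenome, core genome, and accessory genes.

    gene_content maps each strain to the set of gene families it carries.
    """
    genomes = list(gene_content.values())
    pan = set().union(*genomes)          # every gene family seen in any strain
    core = set.intersection(*genomes)    # families shared by all strains
    accessory = pan - core               # families present in only some strains
    return pan, core, accessory


gene_content = {
    "strain_1": {"dnaA", "gyrB", "recA", "blaX_hypothetical"},
    "strain_2": {"dnaA", "gyrB", "recA"},
    "strain_3": {"dnaA", "gyrB", "recA", "virY_hypothetical"},
}
pan, core, accessory = pangenome(gene_content)
print(f"pangenome: {len(pan)} families, core: {sorted(core)}, accessory: {sorted(accessory)}")
```

In practice, the hard part is the step this sketch skips: clustering genes from different strains into families, typically by sequence similarity.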

3. Conclusion

Comparative genomics represents a powerful tool for unraveling the complexities of microbial species, enabling profound insights into genetic diversity, evolutionary dynamics, and practical applications such as antibiotic resistance monitoring. From identifying key genomic variations to advancing predictive models through machine learning, the field has become indispensable in research, healthcare, agriculture, and biotechnology.

With the continued advancements in sequencing technologies, coupled with a growing repository of specialized databases, the potential for innovative discoveries and targeted solutions is expanding rapidly.

Need Help?

Our comparative genomics experts can assist you with the following services:

- Characterize genomes and search for specific genes (taxonomy, MLST, ABR, virulence, SNP, etc.)
- Compare bacterial strains: determination of orthologs and construction of phylogeny.
- Build comprehensive bioinformatic pipelines and visualisation platforms for genomic comparison data.
