How to conduct a metagenomics project


Metagenomics is the genetic analysis of microbial communities, that is, of all the microorganisms (bacteria, viruses, fungi including yeasts, plankton, etc.) present in a specific environment (skin, organs, marine environments, soil, air, etc.). It has become an indispensable tool for understanding the complexity and dynamics of microbial ecosystems. Metagenomics projects, whether they study the impact of microorganisms on their environment or the consequences of disturbances on these communities, are complex and require a rigorous experimental strategy and careful planning. This article offers a simplified methodological overview, designed to guide researchers through the stages of a metagenomics project, from the experimental design to the in-depth analysis of the results.

1) Setting up the experimental design

Defining the objectives and research questions at the very start of a project is essential to ensuring the relevance of the final results. The objectives condition many aspects of the experiment: the choice of approach (targeted or shotgun), the sample size estimation, and the metadata to be collected for each sample (for example, for an intestinal microbiota study: patient data such as age or gender, the presence of a particular medical condition and treatment, or the type of diet). The number of samples must be estimated in advance so that the study has enough statistical power to detect relevant effects. Several approaches can be used, depending on the subsequent analyses and statistical tests, such as power analysis based on an effect size (e.g., Cohen’s d) or the precision approach.
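As an illustration of effect-size-based sample size estimation, the sketch below uses the standard normal approximation for comparing two groups on a single continuous outcome (for example, an alpha diversity index). The function name, the default significance level (0.05) and the default power (0.80) are assumptions chosen for the example, not recommendations from this article. Microbiome data are often compositional and non-normal, so this should be read as a first-order estimate only.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of samples per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2.
    It slightly underestimates the requirement of an exact t-test.
    """
    if effect_size_d <= 0:
        raise ValueError("effect_size_d must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


if __name__ == "__main__":
    # Conventional "small", "medium" and "large" values of Cohen's d.
    for d in (0.2, 0.5, 0.8):
        print(f"Cohen's d = {d}: about {samples_per_group(d)} samples per group")
```

The smaller the expected effect, the faster the required number of samples grows, which is why the expected effect size is worth discussing before any sample is collected.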

Irrespective of the chosen approach, particular attention must be paid to the conditions under which the samples are acquired, the DNA extraction process, and the storage of the samples. In the case of a targeted metagenomics approach, it is important to choose the marker region carefully according to the species to be screened and the available databases.

2) Sample sequencing

Once the experimental design has been determined and the DNA samples generated, an appropriate sequencing technology must be selected. The research question and the specificities of the project can guide the choice towards second-generation short reads, such as Illumina, or towards long reads from third-generation technologies, such as PacBio or Nanopore. Long reads generally allow a better assembly of metagenomes, but they have historically had a higher error rate, which can compromise the resolution of the taxonomic classification. Short reads benefit from a rich panel of tools and methods for the different stages of analysis, but encounter difficulties when assembling complex, highly diverse metagenomes.

Sequencing depth is an important criterion: greater depth improves the detection of low-abundance species and strains. For a fixed budget, however, it is generally preferable to increase the number of samples rather than the sequencing depth, in order to gain statistical power.
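The link between depth and detection of rare taxa can be sketched with a simple binomial sampling model. This is an idealized assumption: each read is treated as an independent draw, and extraction, amplification and classification biases are ignored. The function names and the example abundance are hypothetical.

```python
import math


def detection_probability(relative_abundance: float, n_reads: int, min_reads: int = 1) -> float:
    """Probability of observing at least `min_reads` reads from a taxon.

    Assumes each of the `n_reads` reads independently comes from the taxon
    with probability `relative_abundance` (binomial sampling).
    """
    if not 0.0 < relative_abundance < 1.0:
        raise ValueError("relative_abundance must be strictly between 0 and 1")
    if n_reads < min_reads:
        return 0.0
    log_p = math.log(relative_abundance)
    log_q = math.log1p(-relative_abundance)
    prob_fewer = 0.0
    # Sum the binomial probabilities of seeing 0 .. min_reads - 1 reads (in log space).
    for k in range(min_reads):
        log_pmf = (
            math.lgamma(n_reads + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_reads - k + 1)
            + k * log_p
            + (n_reads - k) * log_q
        )
        prob_fewer += math.exp(log_pmf)
    return max(0.0, 1.0 - prob_fewer)


def reads_for_detection(relative_abundance: float, target_probability: float = 0.95) -> int:
    """Smallest depth giving at least one read from the taxon with the target probability."""
    if not 0.0 < relative_abundance < 1.0 or not 0.0 < target_probability < 1.0:
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log1p(-target_probability) / math.log1p(-relative_abundance))


if __name__ == "__main__":
    abundance = 1e-6  # hypothetical taxon making up one read in a million
    for depth in (10**5, 10**6, 10**7):
        print(f"{depth:>10,} reads: P(detect) = {detection_probability(abundance, depth):.3f}")
    print("Reads needed for 95% detection:", f"{reads_for_detection(abundance):,}")
```

Requiring more than one supporting read (`min_reads`) is a common way to guard against spurious assignments, and it raises the depth needed accordingly.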

Because the quality of all downstream results depends on the sequencing step, the choice of the sequencing platform deserves careful attention.

3) Data processing

Once sequencing is complete, the next step is data processing, which starts from either the raw sequencer output or demultiplexed read files in FASTQ format. Quality control and preprocessing come first, to remove low-quality sequences and/or reads originating from the host (e.g., human cells in a gut microbiome sample), using the methods best suited to the data. Taxonomic profiling then describes the microbial population within each sample. The achievable resolution depends on the technology used: shotgun metagenomics can reach the strain level, whereas targeted sequencing is generally limited to the genus level. Optionally, the genomes of the organisms in the sample can also be assembled, and statistical analyses can be performed to associate the profiles of the different samples with the available metadata or to describe the impact of the conditions under study. Functional profiling, i.e., a description of the metabolic potential or biological properties of the samples, is also possible, but it usually requires describing the whole set of genes present in the sample, which can only be obtained with the shotgun approach. An overview of the different steps of metagenomic analysis is presented in the diagram below.
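To make the quality-control step concrete, here is a minimal sketch of read filtering on FASTQ data. It assumes simple four-line records with Phred+33 quality encoding. The thresholds and function names are illustrative assumptions, and host-read removal is not covered. In practice, established dedicated tools are normally used for this step.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a FASTQ stream made of simple four-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        yield header, sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(reads: Iterable[Read], min_mean_quality: float = 20.0,
                 min_length: int = 50) -> Iterator[Read]:
    """Keep reads that are long enough and whose mean quality passes the threshold."""
    for header, sequence, quality in reads:
        if len(sequence) >= min_length and mean_phred(quality) >= min_mean_quality:
            yield header, sequence, quality


if __name__ == "__main__":
    # Toy data: 'I' encodes Phred 40 and '#' encodes Phred 2.
    toy = "@good\nACGTACGT\n+\nIIIIIIII\n@bad\nACGTACGT\n+\n########\n"
    for header, _, _ in filter_reads(parse_fastq(io.StringIO(toy)), min_length=5):
        print("kept:", header)
```

Filtering on the mean quality of a read is only one possible criterion; sliding-window trimming and adapter removal are common complements.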

Figure 1: Overview of the main steps of a metagenomic analysis. The steps in italics are generally only accessible to shotgun metagenomics approaches.

The results of the data processing describe the taxonomic assignment of the reads, the abundance of the different taxa within the samples, the abundance of the metabolic reactions (i.e. the sum of the abundances of the genes that make up each reaction) and the sequences of the assembled metagenomes (depending on the needs of the study).
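The two abundance tables described above can be illustrated with a short sketch. Taxon counts are normalized into relative abundances, and the abundance of each reaction is computed as the sum of the abundances of its genes. All identifiers (`geneA`, `R1`, the phylum counts) are hypothetical placeholders, and the gene-to-reaction mapping would in practice come from a functional annotation database.

```python
from typing import Dict, Iterable, Mapping


def relative_abundance(counts: Mapping[str, float]) -> Dict[str, float]:
    """Convert raw read counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("the sample contains no assigned reads")
    return {taxon: count / total for taxon, count in counts.items()}


def reaction_abundance(gene_abundance: Mapping[str, float],
                       reaction_genes: Mapping[str, Iterable[str]]) -> Dict[str, float]:
    """Abundance of each reaction as the sum of the abundances of its genes.

    Genes absent from the sample contribute zero.
    """
    return {
        reaction: sum(gene_abundance.get(gene, 0.0) for gene in genes)
        for reaction, genes in reaction_genes.items()
    }


if __name__ == "__main__":
    # Hypothetical taxon counts for one sample.
    print(relative_abundance({"Proteobacteria": 600, "Bacteroidota": 300, "Firmicutes": 100}))

    # Hypothetical gene abundances and gene-to-reaction mapping.
    genes = {"geneA": 2.0, "geneB": 3.0, "geneC": 1.0}
    reactions = {"R1": ["geneA", "geneB"], "R2": ["geneC", "geneD"]}
    print(reaction_abundance(genes, reactions))
```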

Figure 2: Microbial composition (at phylum rank) of water samples taken from five treatment plants along a river (WP1-5) in summer (S18) or winter (W18).

4) Data analysis

Bioinformatics analysis of metagenomic data is complex and requires specialized skills. It includes taxonomic classification of sequences, analysis of microbial diversity, prediction of metabolic functions and comparison with other datasets. Results must be interpreted with caution, taking into account potential biases and the limitations of the methods used. Several additional analyses can be conducted on the results:

  • Comparison of the compositions of the different samples (differential abundance analysis, PERMANOVA multivariate analysis).
  • Functional annotation and sequence comparison for unidentified assembled genomes.
  • Supervised learning applied to the prediction of sample metadata based on their microbiota.
  • Analysis of alpha diversity (within-sample biodiversity, for example species richness) and beta diversity (difference in species composition between samples), and study of the impact of a variable (treatment, sample characteristics, etc.).

Figure 3 (left): Distribution of alpha diversity measures (Shannon) for samples of water leaving the treatment plant, taken in summer (S18) or winter (W18). We observe greater richness in the winter samples.

Figure 4 (right): Distance (beta diversity) between samples. The ordination was carried out with the NMDS method (non-metric multidimensional scaling) using Bray-Curtis distances, and highlights the difference in sample composition depending on the season.

  • Study of the phylogeny of the species present in the samples studied.

Figure 5: Phylogenetic tree of the species found in the different treatment plants. The green to blue gradient indicates the number of samples collected in winter containing OTUs belonging to a given taxon. The width of a node represents the total number of OTUs found for a given taxon.
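The two diversity measures used in the analyses above, the Shannon index (alpha diversity) and the Bray-Curtis dissimilarity (beta diversity), can be written in a few lines. The sketch below works on plain dictionaries of taxon counts. The sample contents are invented for illustration, and the code does not reproduce the figures of this article. Real analyses usually normalize or rarefy counts first and rely on dedicated ecology libraries.

```python
import math
from typing import Mapping


def shannon_index(counts: Mapping[str, float]) -> float:
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("the sample contains no reads")
    proportions = [count / total for count in counts.values() if count > 0]
    return -sum(p * math.log(p) for p in proportions)


def bray_curtis(sample_a: Mapping[str, float], sample_b: Mapping[str, float]) -> float:
    """Bray-Curtis dissimilarity: 0 for identical samples, 1 when no taxon is shared."""
    taxa = set(sample_a) | set(sample_b)
    difference = sum(abs(sample_a.get(t, 0) - sample_b.get(t, 0)) for t in taxa)
    total = sum(sample_a.get(t, 0) + sample_b.get(t, 0) for t in taxa)
    if total <= 0:
        raise ValueError("both samples are empty")
    return difference / total


if __name__ == "__main__":
    # Hypothetical counts for two samples.
    summer = {"taxon1": 80, "taxon2": 15, "taxon3": 5}
    winter = {"taxon1": 40, "taxon2": 30, "taxon3": 20, "taxon4": 10}
    print(f"Shannon (summer): {shannon_index(summer):.3f}")
    print(f"Shannon (winter): {shannon_index(winter):.3f}")
    print(f"Bray-Curtis (summer vs winter): {bray_curtis(summer, winter):.3f}")
```

A matrix of pairwise Bray-Curtis values between all samples is the typical input of an ordination such as NMDS or of a PERMANOVA test.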

5) Validation and further studies

To confirm metagenomic discoveries, further studies may be required. These might include culturing experiments to isolate and characterize specific microorganisms, functional assays to validate metagenomic predictions, or transcriptomic and proteomic analyses to examine gene and protein expression.


Metagenomics is a complex specialty that makes it possible to extract rich and varied information about a microbiota, such as its impact on its environment (host, physico-chemical properties, location, etc.) or, conversely, the external factors that can perturb it (effect of a treatment, temperature, etc.). The implications of metagenomic discoveries can be far-reaching, influencing our understanding of ecosystems, the development of new medical therapies, and the creation of innovative biotechnologies.

Advances in the field of metagenomics continue to broaden our knowledge of the microbial world and its impact on human health, the environment and industry. However, metagenomic projects are complex processes requiring specific expertise in study design, sequencing and bioinformatics techniques, and data interpretation.

Why should you choose Efor to support you in your metagenomics analyses?

Our metagenomics experts can offer you support at all stages of a metagenomics project and provide you with a complete and personalized analysis to achieve your objectives:

  • Support and project design: choice of approach (targeted or shotgun), sample size estimation, connection with sequencing facilities, etc.
  • Data analysis: raw data processing, taxonomic profiling, functional profiling, etc.
  • Statistical analysis and interpretation: differential analysis, comparison with databases or other datasets, biological interpretation, etc.

For further information, contact our Data Centre of Technical Expertise at: