Machine Learning applied to OMICS data


Machine Learning is widely used in data science, but did you know that it can also be applied to OMICS data?

Machine Learning is a branch of artificial intelligence concerned with developing algorithms and statistical models that allow a computer to improve its performance through experience. Deep Learning, which is often mentioned alongside it, is a particular subset of Machine Learning.

There are two main categories in Machine Learning: supervised learning and unsupervised learning. Clustering is the best-known form of unsupervised learning.

Unsupervised learning  

Unsupervised learning aims to find a “hidden” structure in the data without using labels, i.e., without characterizing individuals by an attribute already known about them. Instead, an algorithm is left to characterize the data on its own, and we examine what it finds.

There are several types of algorithms for unsupervised learning, of which k-means, hierarchical clustering and PCA (principal component analysis, a dimensionality reduction method) are the best known.

Example of use: clustering based on OMICS data (e.g., mutation or SNP profiles obtained by DNAseq, expression profiles obtained by RNAseq, etc.) can enable the imputation of missing data from similar individuals. In clinical studies, for example, some clinical variables are often missing for several patients, while removing those patients from the study, or even the variables involved, is generally undesirable or impossible.

A common practice is to replace the missing data with the median of the values of these variables across all individuals. The imputed value is then the same for every individual and may therefore be far from reality. Unsupervised learning makes it possible to group patients using OMICS data, without a priori assumptions, and thus to impute a missing value with the mean of that variable among individuals in the same group, giving a more coherent imputed value.

In the following figure, 3 groups of similar patients were formed using a clustering method. The patient in cluster 3 whose variable is unknown will receive the mean of this variable among the three other individuals of the cluster, instead of the mean over the individuals of all the clusters, who appear quite different from that patient.
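The cluster-based imputation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in any particular study: the expression profiles, the clinical values and all function names are hypothetical, k-means is written out by hand with a deterministic "farthest point" initialization so that the example runs with the standard library alone, and in practice a dedicated library such as scikit-learn would normally be used.

```python
from statistics import mean, median

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def initial_centroids(points, k):
    # Deterministic "farthest point" start (an assumption of this sketch):
    # begin with the first profile, then repeatedly add the profile that is
    # farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, n_iter=20):
    """Plain k-means; returns one cluster index per point."""
    centroids = initial_centroids(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels

def impute_by_cluster(values, labels):
    """Replace each missing value (None) with the mean of its cluster."""
    imputed = list(values)
    for i, value in enumerate(values):
        if value is None:
            peers = [
                v for v, lab in zip(values, labels)
                if lab == labels[i] and v is not None
            ]
            if peers:  # leave the gap if the whole cluster is missing
                imputed[i] = mean(peers)
    return imputed

# Hypothetical expression levels of two genes for ten patients.
profiles = [
    [1.0, 1.2], [0.8, 1.0], [1.1, 0.9],              # first group
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],              # second group
    [9.0, 1.0], [9.2, 1.1], [8.9, 0.8], [9.1, 0.9],  # third group
]
# Hypothetical clinical variable; the last patient's value is missing.
clinical = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0, 50.0, 52.0, 54.0, None]

labels = kmeans(profiles, k=3)
imputed = impute_by_cluster(clinical, labels)
global_median = median(v for v in clinical if v is not None)

print("cluster-based imputation:", imputed[-1])
print("global median imputation:", global_median)
```

With this toy data, the patient with the missing value receives the mean of the three other patients in the same cluster, which differs markedly from the median computed over all patients.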

Supervised learning 

This time, the goal is to learn from labeled training data (i.e., data for which we know the value that we want to learn to determine) in order to produce predictions on unlabeled data afterwards. There are two main categories of supervised learning, and their use depends on the nature of the quantity to be predicted:

  • Classification: the variable to be predicted takes discrete values (classes or categories).
  • Regression: the variable to be predicted is continuous.
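
To make the distinction concrete, here is a minimal sketch using a k-nearest-neighbors approach, chosen purely for illustration: the same neighbors yield a class by majority vote (classification) or a number by averaging (regression). The gene expression values, the response labels, the biomarker values and the function names are all hypothetical.

```python
from collections import Counter
from statistics import mean

def nearest_neighbors(train_X, query, k):
    """Indices of the k training profiles closest to `query` (Euclidean)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    return order[:k]

def knn_classify(train_X, train_labels, query, k=3):
    # Classification: the prediction is the most frequent class
    # among the k nearest neighbors.
    idx = nearest_neighbors(train_X, query, k)
    return Counter(train_labels[i] for i in idx).most_common(1)[0][0]

def knn_regress(train_X, train_values, query, k=3):
    # Regression: the prediction is the average of the continuous
    # values of the k nearest neighbors.
    idx = nearest_neighbors(train_X, query, k)
    return mean([train_values[i] for i in idx])

# Hypothetical expression levels of two genes for six labeled patients.
profiles = [
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0],
]
response = ["non-responder"] * 3 + ["responder"] * 3  # categorical target
biomarker = [2.0, 2.2, 1.8, 6.0, 6.3, 5.7]            # continuous target

new_patient = [4.0, 4.0]
predicted_class = knn_classify(profiles, response, new_patient)
predicted_value = knn_regress(profiles, biomarker, new_patient)

print("classification:", predicted_class)
print("regression:", round(predicted_value, 2))
```

The input profiles are identical in both cases; only the nature of the target variable, and therefore the way the neighbors' values are combined, changes.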

The use of Machine Learning in a project involves two main phases

Phase 1: learning information from the available individuals. Ideally, there must be enough individuals to be representative of the diversity of the population under study. In practice, the number of individuals available is often quite limited and one must make the best of it.

At the end of this training phase, the performance of the model must be evaluated, ideally on individuals that were not used for training, and compared with the objectives set at the start of the study.

Phase 2: application of the model to unlabeled data. Once trained, the model can be used on new individuals to predict their class (or their value, in the case of regression), based on what it has learned from the labeled individuals.
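
The two phases can be sketched as follows. This is a simplified illustration under stated assumptions: the model is a nearest-centroid classifier (one mean profile per class), the train/held-out split is fixed by hand, and the data, the 0.9 accuracy objective and the function names are hypothetical.

```python
from statistics import mean

def fit_centroids(X, y):
    """Phase 1 (training): compute one mean profile per class."""
    centroids = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids

def predict(centroids, x):
    """Return the class whose mean profile is closest to `x`."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )

def accuracy(centroids, X, y):
    """Fraction of individuals whose predicted class matches the known one."""
    hits = sum(predict(centroids, x) == lab for x, lab in zip(X, y))
    return hits / len(y)

# Hypothetical labeled individuals (expression levels of two genes).
labeled_X = [
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [1.1, 1.2],
    [4.0, 4.1], [3.9, 4.2], [4.2, 3.8], [4.1, 4.0],
]
labeled_y = ["non-responder"] * 4 + ["responder"] * 4

# Keep some labeled individuals aside so that the evaluation is not
# performed on the data used for training.
train_idx, held_out_idx = [0, 1, 2, 4, 5, 6], [3, 7]
train_X = [labeled_X[i] for i in train_idx]
train_y = [labeled_y[i] for i in train_idx]
held_out_X = [labeled_X[i] for i in held_out_idx]
held_out_y = [labeled_y[i] for i in held_out_idx]

# Phase 1: train, then evaluate against the objective set for the study.
model = fit_centroids(train_X, train_y)
held_out_accuracy = accuracy(model, held_out_X, held_out_y)
target_accuracy = 0.9  # hypothetical objective
meets_objective = held_out_accuracy >= target_accuracy

# Phase 2: apply the trained model to new, unlabeled individuals.
new_individuals = [[0.8, 1.3], [3.7, 4.4]]
predictions = [predict(model, x) for x in new_individuals]

print("held-out accuracy:", held_out_accuracy)
print("objective met:", meets_objective)
print("predictions:", predictions)
```

With real OMICS data the number of variables is far larger and the split is usually repeated (for example by cross-validation), but the sequence stays the same: train, evaluate against the objective, then predict.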

Examples of use:

  • On clinical data:

In recent decades, personalized medicine has emerged as one of the most suitable approaches for successfully treating complex pathologies. It involves adapting a patient's treatment to their specific characteristics and to the particular features of their illness. Machine Learning, applied to RNAseq data, can enable the stratification of patients according to specific variables of interest (different levels of response to a treatment, long-term clinical status, etc.) based on their transcription profile, i.e., the expression levels of their genes at a given time. Eventually, this could make it possible to replace certain routine hospital measurements used to support decision-making, which are often time-consuming and not very reproducible from one hospital to another, and thus to offer treatments adapted to different groups of patients.

  • On industrial data:

Machine Learning applied to your industrial data can enable the replacement of certain measurements taken at various steps of the production process, by predicting these values from a single RNA sample collected at the corresponding time. This can replace many measurements that are sometimes time-consuming and costly.

  • On cosmetic data:

With cosmetics-related data, Machine Learning makes it possible, for example, to predict the most suitable cosmetic products for a specific patient's or client's skin, based on its microbiota as characterized by a metagenomics study [metagenomics article link].


Machine Learning applied to OMICS data offers a wide range of analysis possibilities, from the imputation of missing data to the prediction of variables, and can help you make the most of your data regardless of your field of study.

Why should Efor support you?

Our Machine Learning experts will be able to find the most suitable learning models for your data and apply them to unlock the full potential of your OMICS data.

Contact them directly on