Large-scale Machine Learning for Metagenomics Sequence Classification

Kevin Vervier, Pierre Mahé, Maud Tournoud, Jean-Baptiste Veyrieras and Jean-Philippe Vert

Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Due to the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions.

In this work, we investigate the potential of modern, large-scale machine learning implementations for taxonomic affectation of next-generation sequencing reads based on their k-mers profile. We show that machine learning-based compositional approaches benefit from increasing the number of fragments sampled from reference genome to tune their parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning these models involves training a machine learning model on about 10^8 samples in 10^7 dimensions, which is out of reach of standard softwares but can be done efficiently with modern implementations for large-scale machine learning. The resulting models are competitive in terms of accuracy with well-established alignment tools for problems involving a small to moderate number of candidate species, and for reasonable amounts of sequencing errors.We show, however, that compositional approaches are still limited in their ability to deal with problems involving a greater number of species, and more sensitive to sequencing errors. We finally confirm that compositional approach achieve faster prediction times, with a gain of 3 to 15 times with respect to the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise.

Data and code used in this project

The following file contains the data used in the paper (the small and large databases), as well as programs allowing to reproduce the results

large-scale-metagenomics-1.0.tar.gz (2.5 Go)

This archive is structured as follows :

the data directory contains the sequence data used to carry out the experiments
the src directory contains script allowing to reproduce the experiments
the tools directory contains source code of two utilities used to draw fragments from genomic sequences

We now detail the contents of the three directories present in the archive.

data directory

test-dataset/

test-db.abundance: a two-columns file (sequence_id : relative_abundance) required for simulations with Grinder software.
test-db.fasta: a FASTA file that contains the 398 sequences related to 193 species used to simulate validation sets.
test-db.meta: a five-columns file containing metadata on test-db genomes, like sequence.id, genome.id and the corresponding taxon IDs at strain, species and genus levels.
test-db.species-level.taxid: a one-column file that contains the species-level taxid for each sequence in the test database.

train-dataset/

train_large-db.fasta: a FASTA file that contains the 2768 sequences related to 774 species used to train on the large database.
train_large-db.meta: a five-columns file containing metadata on large-db genomes, like sequence.id, genome.id and the corresponding taxon IDs at strain, species and genus levels.
train_large-db.species-level.taxid: a one-column file that contains the species-level taxid for each sequence in the large database.
train_small-db.fasta: a FASTA file that contains the 1564 sequences related to 193 species used to train on the small database.
train_small-db.meta: a five-columns file containing metadata on small-db genomes, like sequence.id, genome.id and the corresponding taxon IDs at strain, species and genus levels.
train_small-db.species-level.taxid: a one-column file that contains the species-level taxid for each sequence in the small database.

src directory

Each src directory contains three sub-directories (input, output and src) allowing to reproduce the results presented in our paper (see the Tutorial section below):

1-generate-test-datasets/

input/

grinder.profile: contains parameters for Grinder software.
grinder-b-parameter_vs_mut-freq: a two-columns file providing equivalence between Grinder parameter for mutation rate and observed mutation rate.

output/

All the simulated data (fragments and sequences) will be stored in this directory.

src/

01.generate-dataset-fragments.sh: Use drawfrag tool to generate fragments of length L from the test database.
02.generate-dataset-reads-homo.sh: Use Grinder software to simulate reads, including homopolymer errors, from the test database.
03.generate-dataset-reads-mutation.sh: Use Grinder software to simulate reads, including mutation errors, from the test database.
convert-gi-to-taxid.R: Match genome.id on species-level taxid.

2-build-models/

output/

All the sequences converted into VW bag-of-words will be (temporarily) stored in this directory, and discarded after learning.
The resulting classification models will also be stored in this directory.

src/

01.main.sh: Use VW to iteratively learn models.

3-make-predictions/

output/

All the VW predictions will be stored in this directory.

src/

01.make-predictions.sh: Use VW to predict labels on validation sets.
02.generate-graphs.R: Create result graphics similar to those present in the article.
vw-class-to-taxid.R: Match VW class labels on species-level taxid.

tools directory

This directory contains source codes of two C programs involved in the Vowpal Wabbit (VW) based read classification pipeline :

drawfrag is used to draw fragments from reference genomes
fasta2vw is used to convert a fasta file into a plain text file compliant with VW

Third-party softwares

To run the experiments, you need in addition to install the following third-party softwares:

Vowpal Wabbit (version 7.7.0), the machine learning algorithm
Grinder (version 0.5.3) to simulate datasets
R (version 3.0.0) to analyze and plot the results

Optionally, you may want to install also the following softwares which we used in our paper for comparison to existing methods (but you do not need them if you just want to run our method):

FCP (version 1.0.6), the Fragment Classification Package which provides an implementation of Naive Bayes
BWA-MEM (version 0.7.4), the alignment-based approach tested in our benchmark

Our implementation uses the following libraries, which you do not need to install since they are provided in our software for convenience

kseq library to parse FASTA/FASTQ files
GDL library implementing several structures like lists and hashtables (note that we provide a slightly updated version that properly compiles on Mac OS X)

Tutorial: reproducing the results of the paper

After downloading the project archive, first untar it:

$ tar zxvf large-scale-metagenomics-1.0.tar.gz

Then, run the BASH script INSTALL.sh that can be found in the tools directory. This script will process the installation of the GDL library, and create the binary executables:

$ cd large-scale-metagenomics-1.0/tools

$ sh INSTALL.sh

In order to check that everything went well during the installation, please use test.sh in the tools/test directory. It will use installed tools to simulate a small dataset.

$ cd test/

$ sh test.sh

$ ls output/
                frags.fasta	frags.gi2taxid	frags.taxid	frags.vw	vw-dico.txt

The following instructions have to be executed in the given order. At each step, we also point on tunable parameters.

```
$ cd src/1-generate-test-datasets/src
```
```
$ sh 01.generate-dataset-fragments.sh
```
Generates the fragments test dataset based on the following parameters:

L: fragment length (default: 200)
COVERAGE: mean coverage for all genomes in the database (default: 1)

```
$ sh 02.generate-dataset-reads-homo.sh
```
Generates the homopolymer test datasets based on the following parameters:

L: fragment length (default: 200)
COVERAGE: mean coverage for all genomes in the database (default: 1)
model: homopolymer errors (Balzer, Margulies, Richter)

```
$ sh 03.generate-dataset-reads-mutation.sh
```
Generates the mutations test datasets based on the following parameters:

L: fragment length (default: 200)
COVERAGE: mean coverage for all genomes in the database (default: 1)

```
$ cd ../../2-build-models/src
```
```
$ sh 01.main.sh
```
Uses VW for an iterative learning (WARNING: can take time!) based on the following parameters:

DB: sequences database (small or large)
NBATCHES: number of train batches (default: 10)
L: fragment length (default: 200)
COVERAGE: mean coverage for all genomes in the database
K: k-mer size (default: 12)
LAMBDA1: L1 regularization parameter (default: 0)
LAMBDA2: L2 regularization parameter (default: 0)
NPASSES: number of training passes on each batch (default: 1)
BITS: hash table size (default: 31)

WARNING

```
$ cd ../../3-make-predictions/src
```
```
$ sh 01.make-predictions.sh
```
Uses VW for predictions on validation sets based on the following parameters :

DB: sequences database (small or large)
NBATCHES: number of train batches (default: 10)
L: fragment length (default: 200)
K: k-mer size (default: 12)

```
$ Rscript 02.generate-graphs.R
```
Uses the previous results to plot the performance indicator considered (median species-level accuray), based on the following parameters:

DB: sequences database (small or large)
NBATCHES: number of train batches (default: 10)

Classifying your own metagenomics sequences

You can train your own classifier by using the same scripts but changing some of the parameters.

In particular, you should be able to easily adapt the script 2-build-models/src/01.main.sh

Please remember that some parameters are mandatory:

a training database (genome sequences stored as a multi-fasta file) and the corresponding taxids (defining the classes of the classification problem)
a number of training batches to consider (NBATCHES)
a mean coverage value to considere while generating training batches (COVERAGE)
a k-mer size (K)

You may also want to directly use the already-built models to make predictions for your own sequencing data.

To do this, please change the scripts in 3-make-predictions/src/ in order to point to your FASTA file.

References

K. Vervier, P. Mahé, M. Tournoud, J.-B. Veyrieras, and J.-P. Vert. Large-scale machine learning for metagenomics sequence classification, Bioinformatics, 32(7):1023-1032, 2016.

Contacts

If you have any questions or suggestions, please email pierre dot mahe at biomerieux dot fr and/or jean-philippe dot vert at mines dot org