NeSSM: a Next-generation Sequencing Simulator for Metagenomics
I. Introduction
NeSSM is a tool that generates Next-Generation Sequencing (NGS) reads according to user-defined parameters. Its goal is to produce metagenome sequencing reads that closely resemble real data. Currently, the 454 and Illumina sequencing platforms are supported. NeSSM can help in developing methods or systems for metagenomics analysis.
II. System requirements
Linux operating system with at least 1 GB of memory; Perl 5.8.5 or later and gcc 4.1.2 or later. To run the GPU version of NeSSM, CUDA 4.0 or later is also required (**).
Download Required Tools or Drivers:
CUDA driver, which can be downloaded from the NVIDIA website: http://www.nvidia.com.
BWA, which can be downloaded from the BWA website: http://bio-bwa.sourceforge.net/.
Samtools, which can be downloaded from the Samtools website: http://samtools.sourceforge.net/.
III. Files and Directories
The system is packed into a single file, "NeSSM.tarz", which can be downloaded from here. The important files inside this tar file are listed below.
|-NeSSM_CPU.c | The CPU version of NeSSM. |
|-NeSSM_GPU.cu | The GPU version of NeSSM. |
|-composition-table.pl | A Perl script to analyse the composition of a real metagenome dataset. |
|-simulation.config | The configuration file of NeSSM. |
|-mk_index.pl | A Perl script to make the index file. |
|-complete_update_step.pl | A Perl script to download the genome sequences from NCBI. |
|-NCBI_PowerScripting.pm | A Perl module for the NCBI database connection. |
|-quality-value.pl | A Perl script to obtain the distribution of quality values in every base from FASTQ files. |
|-error-type.pl | A Perl script to obtain the error model from BWA result files. |
|-coverage-bias.pl | A Perl script to estimate sequencing coverage bias from a BWA result file. |
|-startcuda.sh | A Shell script to initialize CUDA. |
IV. Install NeSSM
After you download the tarball, you can install NeSSM as follows.
1. Unpack NeSSM.tarz:
tar -xzf NeSSM.tarz
2.1. To use the CPU version:
cd NeSSM/NeSSM_CPU/
make
2.2. To use the GPU version:
cd NeSSM/NeSSM_GPU/
make
V. Run sequencing simulation
1. Download the NCBI genome database:
If you do not have the NCBI genome database, use "complete_update_step.pl" to download it. If you already have the genome database, you can skip this step.
cd NeSSM/scripts/
perl complete_update_step.pl string1
string1: the directory in which to store the genome database
for example: perl complete_update_step.pl NeSSM/example/data/
2. Generate the index file:
The index file contains the name, length, path and other details of each genome. It is used both in the simulation and in analysing the composition of a metagenome.
cd NeSSM/scripts/
perl mk_index.pl string1 string2
string1: the directory of the whole database (generated in step 1)
string2: the directory in which to store the index file generated by this script
for example: perl mk_index.pl NeSSM/example/data/ NeSSM/example/
The "index" file is generated in the directory NeSSM/example/
ATTENTION: the directory of the whole database given in string1 must be an absolute path!
3. Create a composition structure table:
The composition structure table lists the genomes in a metagenome by name together with their abundances. Here the abundance can be the proportion of an organism in the metagenome, based on its read number. There are two ways to obtain the composition structure table.
3.1. Provide the composition structure table yourself. If the abundance is the proportion of reads, make sure that all abundances sum to one. If the abundance is the proportion of organisms, use "adjust.pl" to adjust the table.
cd NeSSM/scripts/
perl adjust.pl string1 string2
string1: the composition structure table provided by the user
string2: the index file generated in step 2
for example: perl adjust.pl NeSSM/example/percentage.txt NeSSM/example/index
The "new-percentage.txt" file is generated in the directory NeSSM/scripts/
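The requirement in step 3.1 that read-based abundances sum to one can be checked before running the simulator. The following is a minimal sketch and not part of NeSSM. It assumes a hypothetical table layout of one genome per line, with the name and the abundance separated by a tab, which may differ from the format NeSSM actually expects. The normalize helper only rescales the values; it is not a substitute for "adjust.pl".

```python
import math

def parse_composition(lines):
    """Parse 'name<TAB>abundance' lines into a dict (assumed layout, not NeSSM's documented format)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, value = line.rsplit("\t", 1)
        table[name] = float(value)
    return table

def sums_to_one(table, tol=1e-6):
    """Return True when all abundances add up to one within a tolerance."""
    return math.isclose(sum(table.values()), 1.0, abs_tol=tol)

def normalize(table):
    """Rescale abundances so that they sum to one."""
    total = sum(table.values())
    if total <= 0:
        raise ValueError("abundances must have a positive sum")
    return {name: value / total for name, value in table.items()}

example = ["genome_A\t0.5", "genome_B\t0.3", "genome_C\t0.1"]  # hypothetical names
table = parse_composition(example)
print(sums_to_one(table))             # False: the values add up to 0.9
print(sums_to_one(normalize(table)))  # True after rescaling
```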
3.2. Provide a metagenome dataset. The "composition-table.pl" script can generate a composition structure table from the metagenome data.
First, use BWA to map the metagenome data. For reads shorter than 200 bp, the recommended BWA algorithm is "is" (the recommended parameters are "-I -N" in the "aln" step and "-n 100" in the "samse/sampe" step); for reads longer than 200 bp, the "bwasw" algorithm works better.
Then, use "composition-table.pl" to analyse the BWA result.
cd NeSSM/scripts/
perl composition-table.pl string1 string2
string1: the index file generated in step 2
string2: the BWA result
for example: perl composition-table.pl NeSSM/example/index NeSSM/example/example.sam
The "percentage.txt" file is generated in the directory NeSSM/scripts/
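The BWA mapping recommended in step 3.2 can be written out as explicit commands. The sketch below only builds the command strings for single-end reads and does not run them. The file names and the ".sai" intermediate name are hypothetical. The text does not say which algorithm to use for reads of exactly 200 bp, so sending them to "bwasw" is an assumption, as is leaving the index algorithm at BWA's default in that branch.

```python
import shlex

def bwa_commands(reference, reads, out_sam, read_length):
    """Return the BWA shell commands suggested in step 3.2 for single-end reads."""
    ref, fq, sam = (shlex.quote(p) for p in (reference, reads, out_sam))
    if read_length < 200:
        sai = shlex.quote(out_sam + ".sai")  # intermediate alignment file (hypothetical name)
        return [
            f"bwa index -a is {ref}",                      # index with the "is" algorithm
            f"bwa aln -I -N {ref} {fq} > {sai}",           # "-I -N" in the aln step
            f"bwa samse -n 100 {ref} {sai} {fq} > {sam}",  # "-n 100" in the samse step
        ]
    return [
        f"bwa index {ref}",              # index algorithm left at BWA's default (assumption)
        f"bwa bwasw {ref} {fq} > {sam}",  # long-read alignment
    ]

for command in bwa_commands("genomes.fa", "reads.fq", "example.sam", read_length=100):
    print(command)
```

Printing the commands first makes it easy to review them before pasting them into a shell.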
4. Run NeSSM:
There are two versions of the NeSSM program: a CPU version in the directory NeSSM/NeSSM_CPU/ and a GPU version in the directory NeSSM/NeSSM_GPU/. The usage below takes the CPU version as an example.
cd NeSSM/NeSSM_CPU/
./NeSSM -list string1 -index string2 -m string3 -o string4
string1: the composition structure table generated in step 3
string2: the index file generated in step 2
string3: the platform to simulate, 454 or illumina
string4: the output file
for example: ./NeSSM -list NeSSM/scripts/percentage.txt -index NeSSM/scripts/index -m illumina -o NeSSM/example/simulation
The "simulation.fq" file is generated in the directory NeSSM/example/
The four parameters -list, -index, -m and -o are required to run NeSSM. The following optional parameters are available in both the CPU and GPU versions:
-r <int> : number of reads to simulate, default is 1000
-l <int> : length of the reads to simulate, default is 50 (bp)
-e <int> : simulate single reads or paired-end reads; 0 means single reads and 1 means paired-end reads, default is 0
-w <int> : length of the gap when simulating paired-end reads, default is 200 (bp)
-c <string> : the configuration file used for the simulation, default is "simulation.config"
-exact <int> : 0 means the read length is set by the parameter "-l"; 1 means the read length follows the length distribution of a real dataset, default is 0
-b <string> : the sequencing coverage bias file
Two parameters are used only in the GPU version:
-block <int> : number of blocks used on the GPU, default is 100
-thread <int> : number of threads used on the GPU, default is 200
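When many simulations are run with different settings, it can help to assemble the NeSSM command line programmatically. The sketch below is a hypothetical helper, not part of NeSSM. It uses only the flags listed above and leaves out any option that is not set, so that NeSSM falls back to its own defaults. The GPU-only flags are omitted for brevity.

```python
def nessm_argv(list_file, index_file, platform, output, *, reads=None, length=None,
               paired=None, gap=None, config=None, exact=None, bias=None,
               executable="./NeSSM"):
    """Build the argument list for one NeSSM run (hypothetical helper)."""
    if platform not in ("454", "illumina"):
        raise ValueError("platform must be '454' or 'illumina'")
    # The four required parameters
    argv = [executable, "-list", list_file, "-index", index_file,
            "-m", platform, "-o", output]
    optional = [
        ("-r", reads),      # number of reads
        ("-l", length),     # read length
        ("-e", None if paired is None else int(bool(paired))),  # 0 single, 1 paired-end
        ("-w", gap),        # gap length for paired-end reads
        ("-c", config),     # configuration file
        ("-exact", exact),  # 0 or 1, see the parameter list
        ("-b", bias),       # coverage bias file
    ]
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    return argv

argv = nessm_argv("NeSSM/scripts/percentage.txt", "NeSSM/scripts/index",
                  "illumina", "NeSSM/example/simulation", reads=5000, length=100)
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) from within NeSSM/NeSSM_CPU/ once the printed command looks right.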
VI. Error model estimation
Users can estimate error models from a FASTQ file with two Perl scripts.
1. Estimate the distribution of quality values at every base position with "quality-value.pl":
cd NeSSM/scripts/
perl quality-value.pl string1 string2
string1: the FASTQ file provided by the user
string2: the platform of the FASTQ file, 454 or illumina
for example: perl quality-value.pl NeSSM/example/test.fq 454
The "quality.txt" file is generated in the directory NeSSM/scripts/
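To make "the distribution of quality values at every base position" concrete, the sketch below tabulates, for each read position, the relative frequency of each Phred quality value. It illustrates the idea only and is not the logic of "quality-value.pl", whose output format is not described here. It assumes four-line FASTQ records and a quality offset of 33; older Illumina files use 64.

```python
from collections import Counter

def quality_distribution(fastq_lines, offset=33):
    """Return, per read position, a dict mapping quality value to relative frequency."""
    counts = []  # counts[i] is a Counter of quality values seen at position i
    for number, line in enumerate(fastq_lines):
        if number % 4 != 3:
            continue  # the quality string is the fourth line of each record
        for position, char in enumerate(line.rstrip("\r\n")):
            if position == len(counts):
                counts.append(Counter())
            counts[position][ord(char) - offset] += 1
    # Convert raw counts to relative frequencies at each position
    return [
        {q: n / sum(c.values()) for q, n in sorted(c.items())}
        for c in counts
    ]

records = ["@r1", "ACGT", "+", "IIII",   # 'I' encodes quality 40 with offset 33
           "@r2", "ACG", "+", "I5I"]     # '5' encodes quality 20 with offset 33
for position, dist in enumerate(quality_distribution(records)):
    print(position, dist)
```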
2. Map the reads in the FASTQ file to the reference genomes with BWA:
If the reads in the FASTQ file are shorter than 200 bp, use the "is" option with default parameters. If they are longer than 200 bp, use the "bwasw" option with default parameters. After BWA mapping, use Samtools to adjust the BWA result.
cd directory of Samtools
./samtools calmd -S string1 string2 > string3
string1: the BWA result
string2: the reference genomes used in BWA; this file must be in FASTA format
string3: the output file
3. Estimate the error type with "error-type.pl":
cd NeSSM/scripts/
perl error-type.pl string1 string2
string1: the BWA result generated in step 2
string2: the platform of the FASTQ file, 454 or illumina
for example: perl error-type.pl NeSSM/example/test.sam 454
The "self-simulation.config" file is generated in the directory NeSSM/scripts/
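The "samtools calmd" step in this section writes an MD tag into each alignment, which records mismatches and deleted reference bases. How "error-type.pl" uses the alignments is not described here, so the sketch below is only an illustration of the kind of information that becomes available: it counts matches, mismatches and deleted bases from one MD tag. Insertions are not recorded in the MD tag; they appear in the CIGAR string instead.

```python
import re

def md_tag(sam_line):
    """Return the MD tag value of one SAM alignment line, or None if absent."""
    # Optional fields start after the 11 mandatory SAM columns
    for field in sam_line.rstrip("\r\n").split("\t")[11:]:
        if field.startswith("MD:Z:"):
            return field[5:]
    return None

def md_counts(md):
    """Return (matches, mismatches, deleted_bases) encoded in an MD tag."""
    matches = sum(int(n) for n in re.findall(r"\d+", md))
    deleted = sum(len(d) - 1 for d in re.findall(r"\^[A-Z]+", md))  # drop the '^' marker
    without_deletions = re.sub(r"\^[A-Z]+", "", md)
    mismatches = len(re.findall(r"[A-Z]", without_deletions))
    return matches, mismatches, deleted

line = "r1\t0\tref\t1\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:1\tMD:Z:3A4"  # made-up record
print(md_counts(md_tag(line)))  # (7, 1, 0)
```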
VII. Sequencing coverage bias estimation
Users can estimate sequencing coverage bias from a real metagenome dataset. First, map the reads with BWA using the parameters described above. Then estimate the sequencing coverage bias with "coverage-bias.pl".
cd NeSSM/scripts/
perl coverage-bias.pl string1 string2 string3
string1: the index file generated in step V-2
string2: the composition structure table generated in step V-3
string3: the BWA result
for example: perl coverage-bias.pl NeSSM/scripts/index NeSSM/scripts/percentage.txt NeSSM/example/example.sam
The "coverage.txt" file is generated in the directory NeSSM/scripts/
VIII. Datasets in the paper
All datasets mentioned in the paper are provided here, except those larger than 2Gb.
- Sequencing datasets derived from Dataset A, B, and C.
- LC-100 dataset (in Table 2)
- LC-250 dataset (in Table 2)
- MC-100 dataset (in Table 2)
- MC-250 dataset (in Table 2)
- HC-100 dataset (in Table 2)
- HC-250 dataset (in Table 2)
- LC dataset used in assembly (in Table 7)
- Simulated sequencing data according to Dataset E and F
- Simulated data according to Dataset E by NeSSM using the composition table estimated by NeSSM (in Table 5)
- Simulated data according to Dataset E by NeSSM using the composition table supplied by Morgan et al. (in Table 5)
- Simulated data according to Dataset E by MetaSim using the composition table estimated by NeSSM (in Table 5)
- Simulated data according to Dataset E by MetaSim using the composition table supplied by Morgan et al. (in Table 5)
- Simulated data according to Dataset E by GemSIM using the composition table estimated by NeSSM (in Table 5)
- Simulated data according to Dataset E by GemSIM using the composition table supplied by Morgan et al. (in Table 5)
- Simulated data according to Dataset E by Grinder using the composition table estimated by NeSSM (in Table 5)
- Simulated data according to Dataset E by Grinder using the composition table supplied by Morgan et al. (in Table 5)
- Simulated data according to Dataset F by NeSSM (in Table 6)
- Simulated data according to Dataset F by MetaSim (in Table 6)
- Simulated data according to Dataset F by GemSIM (in Table 6)
- Simulated data according to Dataset F by Grinder (in Table 6)
- Simulated single-genome data according to Dataset F by pIRS (in Figure 7).
Notes:
1. If this is the first time you run CUDA, run "startcuda.sh" with root permissions to initialize CUDA.
The startcuda.sh file is under NeSSM/NeSSM_GPU/ and its usage is: ./startcuda.sh start
2. If your CUDA version is above 4.0, run the command "nvidia-smi" with root permissions.
3. You can generate the simulation datasets used in the paper "NeSSM: a Next-generation Sequencing Simulator for Metagenomics" according to the commands.
Contact:
If you have any questions, feel free to contact us.
< chenmodexiaoxi@126.com >
< ccwei@sjtu.edu.cn >
Please send your comments or bug reports to Dr. Wei.