Getting Started

Setup

Conda package manager

Atlas has one dependency: conda. All databases and other dependencies are installed on the fly. Atlas is based on snakemake, which allows steps of the workflow to run in parallel on a cluster.

If you want to try atlas and have a Linux computer (OSX may also work), you can use our example data for testing.

For real metagenomic data, atlas should be run on a Linux system with enough memory (at least ~50 GB, although assembly usually requires 250 GB).

You need to install anaconda or miniconda. If you haven’t done so already, configure conda with the bioconda and conda-forge channels. These are sources for packages beyond the default one:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

The order matters: each --add call puts the channel at the top of the list, so conda-forge ends up with the highest priority.
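To confirm that the channels ended up in the intended order, the output of conda config --show channels can be checked. The following is a minimal sketch; check_channel_order is a hypothetical helper written for this guide, not part of conda or atlas.

```shell
# Reads the output of `conda config --show channels` on stdin and checks that
# the list starts with conda-forge, bioconda, defaults (highest priority first).
# check_channel_order is a hypothetical helper, not part of conda or atlas.
check_channel_order() {
  # Keep only the list items ("  - name") and join them with commas.
  order=$(sed -n 's/^[[:space:]]*-[[:space:]]*//p' | tr '\n' ',')
  case "$order" in
    conda-forge,bioconda,defaults,*) echo "channel order OK" ;;
    *) echo "unexpected channel order: $order" >&2; return 1 ;;
  esac
}
```

With conda installed, it could be used as: conda config --show channels | check_channel_order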

Install mamba

Conda can be a bit slow because there are so many packages. A good way around this is to use mamba (another snake):

conda install mamba

From now on you can replace conda install with mamba install and see how much faster this snake is.

Install metagenome-atlas

We recommend installing metagenome-atlas into a conda environment, e.g. one named atlasenv:

mamba create -y -n atlasenv metagenome-atlas
source activate atlasenv

Example Data

If you want to test atlas on a small dataset, here is a minimal metagenome dataset with two samples and three genomes. Although atlas runs faster on the test data, it still downloads all the databases and requirements needed for a complete run, which can take a while and needs a lot of disk space (>100 GB).

The database directory of the test run should be the same as the one used for later atlas runs.
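Because the databases need a lot of disk space, it can help to check the free space on the target file system first. This is a sketch under stated assumptions: has_free_gb is a hypothetical helper, and the threshold you pass is your own choice.

```shell
# has_free_gb DIR MIN_GB: succeed if the file system holding DIR has at least
# MIN_GB gigabytes available. Hypothetical helper, not part of atlas.
has_free_gb() {
  dir=$1
  min_gb=$2
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 of the
  # second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  avail_gb=$((avail_kb / 1024 / 1024))
  echo "$dir: ${avail_gb} GB available, ${min_gb} GB wanted"
  [ "$avail_gb" -ge "$min_gb" ]
}
```

For example, has_free_gb . 100 checks the file system of the current directory. If the database directory does not exist yet, pass its parent directory instead.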

The example data can be downloaded as follows:

wget https://zenodo.org/record/3992790/files/test_reads.tar.gz
tar -xzf test_reads.tar.gz

Usage

Start a new project

Let’s apply atlas to your data or to our example data:

atlas init --db-dir databases path/to/fastq

This command creates a samples.tsv and a config.yaml in the working directory.

Have a look at them with a normal text editor and check that the sample names are inferred correctly. Sample names should be alphanumeric and can be dash-delimited. Underscores should be fine too. See the example sample table.
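The naming rule above can also be checked mechanically. The sketch below assumes that samples.tsv is tab-separated, has a header on the first line, and keeps the sample names in the first column; check_sample_names is a hypothetical helper, not an atlas command.

```shell
# check_sample_names FILE: report sample names that contain anything other
# than letters, digits, dashes, or underscores.
# Assumes a tab-separated table with a header line and names in column 1.
check_sample_names() {
  awk -F '\t' '
    NR > 1 && NF > 0 && $1 !~ /^[A-Za-z0-9_-]+$/ {
      print "suspicious sample name: " $1
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Running check_sample_names samples.tsv prints nothing and succeeds when all names follow the rule.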

The BinGroup parameter is used during the genomic binning. In short: all samples in which you expect the same strain to be found should belong to the same group, e.g. all metagenome samples from mice in the same cage or location. If you want to use long reads for a hybrid assembly, you can also specify them in the sample table.
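To see how the samples are distributed over the groups, the sample table can be summarized per group. This sketch assumes a tab-separated samples.tsv whose header row contains a column named BinGroup; count_bin_groups is a hypothetical helper.

```shell
# count_bin_groups FILE: print each BinGroup value and the number of samples
# assigned to it, one group per line, separated by a tab.
# Assumes a tab-separated table with a header column named BinGroup.
count_bin_groups() {
  awk -F '\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "BinGroup") col = i
      if (!col) { print "no BinGroup column found" > "/dev/stderr"; exit }
      next
    }
    NF > 0 { count[$col]++ }
    END { for (g in count) print g "\t" count[g] }
  ' "$1" | sort
}
```

A group that contains only the samples you expect to share strains, for instance all mice from one cage, shows up as a single line with the matching sample count.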

You should also check the config.yaml file, especially:

  • You may want to add host genomes to be removed.
  • You may want to change the resources configuration, depending on the system you run atlas on.

Details about the parameters can be found in the section Configure Atlas.

Keep in mind that all databases are installed in the directory specified with --db-dir, so choose it wisely.

Usage: atlas init [OPTIONS] PATH_TO_FASTQ

  Write the file CONFIG and complete the sample names and paths for all
  FASTQ files in PATH.

  PATH is traversed recursively and adds any file with '.fastq' or '.fq' in
  the file name with the file name minus extension as the sample ID.

Options:
  -d, --db-dir PATH               location to store databases (need ~50GB)
                                  [default: /Users/silas/Documents/GitHub/atla
                                  s/databases]
  -w, --working-dir PATH          location to run atlas
  --assembler [megahit|spades]    assembler  [default: spades]
  --data-type [metagenome|metatranscriptome]
                                  sample data type  [default: metagenome]
  --interleaved-fastq             fastq files are paired-end in one files
                                  (interleaved)
  --threads INTEGER               number of threads to use per multi-threaded
                                  job
  --skip-qc                       Skip QC, if reads are already pre-processed
  -h, --help                      Show this message and exit.
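According to the help text, atlas init picks up every file below the given path that has '.fastq' or '.fq' in its name. To preview which files would match before running it, something like the following can be used. It only approximates the rule stated in the help text, and list_fastq is a hypothetical helper.

```shell
# list_fastq DIR: recursively list files whose names contain ".fastq" or ".fq",
# approximating the file selection described in the atlas init help text.
list_fastq() {
  find "$1" -type f \( -name '*.fastq*' -o -name '*.fq*' \) | sort
}
```

For the example data, list_fastq test_reads should list the read files of both samples, assuming the archive unpacks into a directory of that name.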

Run atlas

atlas run all

atlas run needs to know the working directory with a samples.tsv inside it.

Take note of the --dryrun parameter, which tests the execution without running any jobs. See the section Useful command line options for other handy snakemake arguments.
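One way to make use of --dryrun is to run it first and start the real run only if it succeeds. The wrapper below is a sketch: run_after_dryrun and the ATLAS_CMD variable are hypothetical names introduced here, and ATLAS_CMD only exists so that the executable can be swapped out.

```shell
# run_after_dryrun [ATLAS RUN ARGS...]: perform a dry run and start the real
# run only if the dry run succeeds.
# ATLAS_CMD is a hypothetical override that defaults to the atlas executable.
run_after_dryrun() {
  atlas_cmd=${ATLAS_CMD:-atlas}
  if "$atlas_cmd" run "$@" --dryrun; then
    "$atlas_cmd" run "$@"
  else
    echo "dry run failed; not starting the real run" >&2
    return 1
  fi
}
```

For example, run_after_dryrun all first executes atlas run all --dryrun and then atlas run all.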

We recommend using atlas on a cluster execution system, which can be set up with a few more commands.

Usage: atlas run [OPTIONS]
                 [[qc|assembly|binning|genomes|genecatalog|None|all]]
                 [SNAKEMAKE_ARGS]...

  Runs the ATLAS pipline

  By default all steps are executed but a sub-workflow can be specified.
  Needs a config-file and expects to find a sample table in the working-
  directory. Both can be generated with 'atlas init'

  Most snakemake arguments can be appended to the command for more info see
  'snakemake --help'

  For more details, see: https://metagenome-atlas.readthedocs.io

Options:
  -w, --working-dir PATH  location to run atlas.
  -c, --config-file PATH  config-file generated with 'atlas init'
  -j, --jobs INTEGER      use at most this many jobs in parallel (see cluster
                          submission for mor details).  [default: 8]
  --profile TEXT          snakemake profile e.g. for cluster execution.
  -n, --dryrun            Test execution.  [default: False]
  -h, --help              Show this message and exit.

Execute Atlas

Cluster execution

Automatic submitting to cluster systems

Thanks to the underlying snakemake system, atlas can submit parts of the pipeline to clusters and cloud systems. Instead of running all steps of the pipeline in one cluster job, atlas can automatically submit each step to your cluster system, specifying the necessary threads, memory, and runtime based on the values in the config file. Atlas periodically checks the status of each cluster job and can re-run failed jobs or continue with other jobs. As soon as one job has finished, it launches the next one. This allows you to use the full capacity of your cluster system; in fact, you need to take care not to crowd out the other users of the cluster.

See atlas scheduling jobs on a cluster in action https://asciinema.org/a/337467.

If you have a common cluster system (Slurm, LSF, PBS …) we have an easy setup (see below). Otherwise, file a GitHub issue (feature request) so we can help you bring the magic of atlas to your cluster system. For more information about cluster and cloud submission, have a look at the snakemake cluster docs.

Set up of cluster execution

You need cookiecutter, which is installed together with atlas.

Then run:

cookiecutter --output-dir ~/.config/snakemake https://github.com/metagenome-atlas/clusterprofile.git

This opens an interactive shell dialog and asks you for the name of the profile and your cluster system. We recommend keeping the default name cluster. The profile was tested on Slurm, LSF and PBS.

The resources (threads, memory and time) are defined in the atlas config file (time in hours, memory in GB).

If you need to specify queues or accounts, you can do this for all rules or for specific rules in ~/.config/snakemake/cluster/cluster_config.yaml. In addition, you can use this file to override the resources defined in the config file.

Example for cluster_config.yaml with queues defined:

__default__:
# default parameter for all rules
  queue: normal
  nodes: 1


# The following rules in atlas need more time/memory.
# If you need to submit them to different queues you can configure this as outlined.

run_megahit:
  queue: bigmem
run_spades:
  queue: bigmem

# These rules can take longer
run_checkm_lineage_wf:
  queue: long

Now, you can run atlas on a cluster with:

atlas run <options> --profile cluster

As the whole pipeline can take several days, I usually run this command in a screen session on the head node, even though system administrators don’t normally like that. On the head node atlas only schedules the jobs and combines tables, so it doesn’t use many resources. You can also submit the atlas command as a long-running job.

If a job fails, you will find the “external jobid” in the error message. You can investigate the job via this ID.

The atlas argument --jobs now becomes the number of jobs simultaneously submitted to the cluster system. You can set this as high as 99 if your colleagues don’t mind you over-using the cluster system.

Single machine execution

If you cannot use the automatic scheduling, you can still use atlas on a single machine (local execution), ideally one with a lot of memory and threads. In this case I recommend the following options. The same applies if you run atlas inside a single cluster job.

In theory you don’t need to adapt the parameters in the config file. However, you should tell atlas how many threads and how much memory (GB) are available on your system so that it can take this into account.

For local execution, the --jobs command line argument defines the total number of threads used. Set it to the number of processors available on your machine. If you have fewer cores available than specified in the config file, the jobs are scaled down. If you have more, atlas tries to start multiple jobs to make optimal use of the cores on your machine. The same applies to memory.

For example on a machine with 16 processors and 250GB memory you might want to run:

atlas run all --resources mem=245 --jobs 16
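The arguments in this example can be derived from the machine's core count and memory. The sketch below is only an illustration: suggest_local_args is a hypothetical helper, and the 5 GB of headroom is an assumption that mirrors the example (250 GB of memory, mem=245).

```shell
# suggest_local_args CORES MEM_GB: print atlas arguments for local execution.
# The 5 GB headroom mirrors the example above (250 GB -> mem=245); it is an
# assumption, not a documented atlas rule.
suggest_local_args() {
  cores=$1
  mem_gb=$2
  headroom_gb=5
  echo "--resources mem=$((mem_gb - headroom_gb)) --jobs $cores"
}

# On a Linux machine the inputs could be detected automatically:
# cores=$(nproc)
# mem_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo)
# atlas run all $(suggest_local_args "$cores" "$mem_gb")
```

Calling suggest_local_args 16 250 prints the arguments used in the example command.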

The whole pipeline can take more than a day. If for any reason the pipeline stops you can just rerun the same command after having inspected the error.

Cloud execution

Atlas, like any other snakemake pipeline, can also easily be submitted to cloud systems. I suggest looking at the snakemake docs. Keep in mind that any snakemake command line argument can simply be appended to the atlas command.

Useful command line options

Atlas builds on snakemake. We designed the command line interface in a way that additional snakemake arguments can be added to an atlas run call.

One example is the --profile argument used for cluster execution. Other handy snakemake command line arguments include:

--keep-going, which allows atlas in the case of a failed job to continue with independent steps.

For a full list of snakemake arguments see the snakemake doc.