Samples are defined under “samples” and have a unique name that does not
contain spaces or underscores (dashes are accepted). To only annotate, a
samples: sample-1: fasta: /project/bins/sample-1_contigs.fasta sample-2: fasta: /project/bins/sample-2_contigs.fasta
All other annotation parameters can be defined following samples. See Annotation.
Annotation and Quantification¶
To get counts, we need to define FASTQ file paths that will be mapped back to the FASTA regions.
In addition to specifying FASTQ paths for each sample, the configuration will also need to contain:
Reads are always assumed to be paired-end, so we only need to specify
samples: sample-1: fasta: /project/bins/sample-1_contigs.fasta fastq: /project/data/sample-1_pe.fastq sample-2: fasta: /project/bins/sample-2_contigs.fasta fastq: /project/data/sample-2_pe.fastq
In this case, we create a list using YAML syntax for both R1 and R2 indexes:
samples: sample-1: fasta: /project/bins/sample-1_contigs.fasta fastq: - /project/data/sample-1_R1.fastq - /project/data/sample-1_R2.fastq sample-2: fasta: /project/bins/sample-2_contigs.fasta fastq: - /project/data/sample-2_R1.fastq - /project/data/sample-2_R2.fastq
As data are assumed to be paired-end, we need to add
samples: sample-1: fasta: /project/bins/sample-1_contigs.fasta fastq: /project/data/sample-1_se.fastq paired: false sample-2: fasta: /project/bins/sample-2_contigs.fasta fastq: /project/data/sample-2_se.fastq paired: false
A complete example for annotation and quantification for samples with paired-end reads in separate and in interleaved FASTQs:
samples: sample-1: fasta: /project/bins/sample-1_contigs.fasta fastq: /project/data/sample-1_pe.fastq sample-2: fasta: /project/bins/sample-2_contigs.fasta fastq: - /project/data/sample-2_R1.fastq - /project/data/sample-2_R2.fastq quantification: true tmpdir: /scratch threads: 24 refseq_namemap: /pic/projects/mint/atlas_databases/refseq.db refseq_tree: /pic/projects/mint/atlas_databases/refseq.tree diamond_db: /pic/projects/mint/atlas_databases/refseq.dmnd # 'fast' or 'sensitive' diamond_run_mode: fast # setting top_seqs to 5 will report all alignments whose score is # at most 5% lower than the top alignment score for a query diamond_top_seqs: 2 # maximum e-value to report alignments diamond_e_value: "0.000001" # minimum identity % to report an alignment diamond_min_identity: 50 # minimum query cover % to report an alignment diamond_query_coverage: 60 # gap open penalty diamond_gap_open: 11 # gap extension penalty diamond_gap_extend: 1 # Block size in billions of sequence letters to be processed at a time. # This is the main parameter for controlling DIAMOND's memory usage. # Bigger numbers will increase the use of memory and temporary disk space, # but also improve performance. The program can be expected to roughly use # six times this number of memory (in GB). diamond_block_size: 6 # The number of chunks for processing the seed index (default=4). This # option can be additionally used to tune the performance. It is # recommended to set this to 1 on a high memory server, which will # increase performance and memory usage, but not the usage of temporary # disk space. diamond_index_chunks: 1 # 'lca', 'majority', or 'best'; summary method for annotating ORFs; when # using LCA, it's recommended that one limits the number of hits using a # low top_fraction summary_method: lca # 'lca', 'lca-majority', or 'majority'; summary method for aggregating ORF # taxonomic assignments to contig level assignment; 'lca' will result in # most stringent, least specific assignments aggregation_method: lca-majority # constitutes a majority fraction at tree node for 'lca-majority' ORF # aggregation method majority_threshold: 0.51