Day 2 - de novo Assembly

SPAdes

  1. Assemble the reads

    $ mkdir 4_assembly
    $ mkdir 4_assembly/hpy
    $ mamba activate assembler
    (assembler)$ spades.py --careful --pe1-1 2_bb_out/hpy_a45_1.fastq.gz \
                                     --pe1-2 2_bb_out/hpy_a45_2.fastq.gz \
                                     -o 4_assembly/hpy --cov-cutoff auto -t 6
  2. Check the number of contigs/scaffolds in the fasta file.

    (assembler)$ grep -i '>' scaffolds.fasta -c
  3. Barrnap (Optional)

    (annotator)$ barrnap -o 4_assembly/hpy/rrna.fa < 4_assembly/hpy/scaffolds.fasta > 4_assembly/hpy/rrna.gff

Alternatives

  1. IDBA

    $ fq2fa --merge --filter 2_bb_out/hpy_a45_1.fastq.gz 2_bb_out/hpy_a45_2.fastq.gz idba/hpy_a45_reads.fa
    $ idba_ud -r idba/hpy_a45_reads.fa -o idba/
  2. Velvet

    $ VelvetOptimiser.pl -s 79 -e 159 -f '-shortPaired -fastq -separate 2_bb_out/hpy_a45_1.fastq.gz 2_bb_out/hpy_a45_2.fastq.gz' -t 12 -d velvet/ -v
  3. Unicycler

    $ unicycler -1 2_bb_out/hpy_a45_1.fastq.gz -2 2_bb_out/hpy_a45_2.fastq.gz -o unicycler -t 6
  4. Shovill

    $ shovill --outdir 4_assembly/hpy --R1 2_bb_out/hpy_a45_1.fastq.gz --R2 2_bb_out/hpy_a45_2.fastq.gz

Contig Management

MeDuSa

  1. Check if there are any colons in the header of fasta and replace them (Optional)

    $ sed -i 's/:/_/g' scaffolds.fasta
  2. Make a new directory - 5_scaffolding, navigate to it and make sub directories hpy, hpy/Ref

    $ mkdir 5_scaffolding
    $ cd 5_scaffolding
    $ mkdir hpy
    $ mkdir hpy/Ref
  3. Copy scaffolds.fasta to hpy directory and reference genome of H. pylori to the hpy/Ref directory

    Keep only the required reference genomes in the Ref directory

    $ cp ../4_assembly/hpy/scaffolds.fasta hpy/
    $ cp ../ref_genome/hpy_a45.fasta hpy/Ref/
  4. Activate scaffolder env and run medusa

    $ mamba activate scaffolder
    (scaffolder)$ medusa -d -f hpy/Ref -i hpy/scaffolds.fasta -random 10 -w2 -v

    If you face cPickle error open python file and Change cPickle to pickle in ~/mambaforge/envs/scaffolder/share/medusa-1.6-2/script/netcon_mummer.py

Pilon

  1. Before running Pilon you have to index the fasta file and reads

    (mauve)$ mkdir 6_polishing
    (mauve)$ mkdir 6_polishing/hpy
    (mauve)$ cd 6_polishing/hpy
    (mauve)$ mamba deactivate
    $ mamba activate mappers
  2. Copy the scaffolds and index

    (mappers)$ cp ../../5_scaffolding/hpy/scaffolds.fastaScaffold.fasta ./scaffolds.fasta
    (mappers)$ bowtie2-build scaffolds.fasta hpy_assembly
  3. Align reads to genome (5 mins with 12 cores)

    (mappers)$ bowtie2 -x hpy_a45 -1 ../../2_bb_out/hpy_a45_1.fastq.gz -2 ../../2_bb_out/hpy_a45_2.fastq.gz -S reads_on_assembly.sam -p 6
  4. Convert SAM to BAM, sort and index

    (mappers)$ samtools view reads_on_assembly.sam -b -o reads_on_assembly.bam -@ 6
    (mappers)$ samtools sort reads_on_assembly.bam -o reads_on_assembly_sorted.bam -@ 6
    (mappers)$ samtools index reads_on_assembly_sorted.bam
  5. Run pilon

    (mappers)$ mamba deactivate
    (mappers)$ mamba activate shovill
    (shovill)$ pilon --genome scaffolds.fasta --frags reads_on_assembly_sorted.bam

    If there is a memory error open ~/mambaforge/envs/shovill/bin/pilon and increase the max memory option

    default_jvm_mem_opts = ['-Xms512m', '-Xmx1g'] to
    default_jvm_mem_opts = ['-Xms512m', '-Xmx4g']

GapCloser (Alternative)

  1. Activate filler env

    (shovill)$ mamba deactivate 
    $ mamba activate filler
  2. Create a hpy_a45_GC.config file

    max_read_len=250
    [LIB]
    name=a45
    avg_ins=468
    reverse_seq=0
    asm_flags=4
    rank=1
    pair_num_cutoff=3
    map_len=32
    q1=2_bb_out/hpy_a45_1.fastq.gz
    q2=2_bb_out/hpy_a45_2.fastq.gz
  3. Create directories and run GapCloser

    (filler)$ mkdir 7_filling
    (filler)$ mkdir 7_filling/hpy
    (filler)$ GapCloser -a 5_scaffolding/hpy/scaffolds.fastaScaffold.fasta -b hpy_a45_GC.config -o 7_filling/hpy/hpy_a45_GC.fasta -t 6

QC for the assembled genome

BUSCO

  1. Activate busco env.

    (shovill)$ cd ../
    (shovill)$ mamba deactivate
    $ mamba activate busco
    (busco)$ mkdir 8_assemblyQC
    (busco)$ mkdir 8_assemblyQC/busco
    (busco)$ cd 8_assemblyQC
  2. Run BUSCO

    (busco)$ busco -m genome -i ../6_polishing/hpy/pilon.fasta -o busco/hpy_a45 --auto-lineage-prok -c 6

    Check for memory error while running SEPP. Minimum 6GB is required

CheckM

  1. Activate checkm env

    (busco)$ mamba deactivate
    (busco)$ mamba activate checkm
    (checkm)$ mkdir checkm
  2. Make a new directory hpy and copy pilon.fasta

    (checkm)$ mkdir checkm/hpy
    (checkm)$ cp ../6_polishing/hpy/pilon.fasta checkm/hpy/hpy_a45.fasta
  3. Run checkm

    (checkm)$ checkm lineage_wf -x fasta checkm/hpy/ ./ -r -f checkm/hpy.txt -t 6

    Requires up to 40GB memory. May need to increase the swap size.

  4. Run assembly-stats

    (checkm)$ assembly-stats checkm/hpy/*.fasta > assembly_stats.txt
  5. Make genomes directory and copy the pilon.fasta

    $ mkdir draft_genomes
    $ cp 6_polishing/hpy/pilon.fasta draft_genomes/hpy_a45.fasta

Quast

Another QC tool for genome assemblies

Annotation

Gene detection and annotation

RAST

Online tool

Prokka

  1. Activate annotator env.

    $ mamba activate annotator
  2. Make a directory and run prokka

    (annotator)$ mkdir 9_annotation
    (annotator)$ prokka --outdir 9_annotation/hpy --prefix hpy_a45 --genus Helicobacter draft_genomes/hpy_a45.fasta

Phylogeny

Mafft

  1. Create Mafft env. and install

    (roary)$ mamba deactivate
    (base)$ mamba create -n phylogeny -y
    (base)$ mamba activate phylogeny
    (phylogeny)$ mamba install -c bioconda mafft
  2. Run MAFFT

    (phylogeny)$ mafft --maxiterate 100 --reorder --thread 10 16S_a45-Ref-Out.fasta > 16S_a45-Ref-Out_aln.fasta

RaxML

RaxML GUI

raxmlHPC-PTHREADS-SSE3 -T 10 -f a -x 288426 -p 288426 -N 100 -m GTRGAMMA -O -o H_acinonychis -n 16S -s 16S_BSE-Ref-Out_aln_modified.fasta 

TYGS

Type Strain detection
Online server

Comparison

Pair-wise or multiple

D-Genies

Pair-wise comparison with dot plots

$ mamba activate dotplot
(dotplot)$ dgenies run

Mauve

Multiple genome comparison

Perform progressive alignment

Roary

  1. Activate pangenome env.

    Roary takes multiple gff files produced by prokka as input.
    So, download the required genomes (those you want to comapre) from a public database like NCBI.
    Run prokka on all those genomes.

    $ mamba activate pangenome
    (pangenome)$ cd ../
    (pangenome)$ mamba activate pangenome
  2. Make a new directory raory
    Run roary

    (pangenome)$ mkdir roary_out
    (pangenome)$ roary 9_annotation/*/*.gff -e -n -r -v -f roary_out/hpy -p 6

    If Error: not found File::Find::Rule

    (pangenome)$ cpan File::Find::Rule
  3. FriPan
    View the Roary results
    Copy roary output to FriPan root and rename it to filename.roary

    Change the server module in server.sh, as suggested in this site

Biosynthetic Gene Clusters

antiSMASH

Online or local

$ mamba activate bgc
(bgc)$ mkdir antismash_out
(bgc)$ antismash draft_genomes/genome_name.fasta --output-dir antismash_out/genome_name --genefinding-tool prodigal

Visualization

Circos

Display scaffolds, genes, GC content, etc

Online servers

  • KBase
  • Galaxy
  • BV-BRC



Pre-requisites | QC & mapping | de novo | Ampliseq

Drafted by: Abhishek Khatri
Edited by: Anwesh Maile