Day 2 - de novo Assembly
SPAdes
Assemble the reads
$ mkdir 4_assembly
$ mkdir 4_assembly/hpy
$ mamba activate assembler
(assembler)$ spades.py --careful --pe1-1 2_bb_out/hpy_a45_1.fastq.gz \
--pe1-2 2_bb_out/hpy_a45_2.fastq.gz \
-o 4_assembly/hpy --cov-cutoff auto -t 6
Check the number of contigs/scaffolds in the fasta file.
(assembler)$ grep -i '>' scaffolds.fasta -c
Barrnap (Optional)
(annotator)$ barrnap -o 4_assembly/hpy/rrna.fa < 4_assembly/hpy/scaffolds.fasta > 4_assembly/hpy/rrna.gff
Alternatives
IDBA
$ fq2fa --merge --filter 2_bb_out/hpy_a45_1.fastq.gz 2_bb_out/hpy_a45_2.fastq.gz idba/hpy_a45_reads.fa
$ idba_ud -r idba/hpy_a45_reads.fa -o idba/
Velvet
$ VelvetOptimiser.pl -s 79 -e 159 -f '-shortPaired -fastq -separate 2_bb_out/hpy_a45_1.fastq.gz 2_bb_out/hpy_a45_2.fastq.gz' -t 12 -d velvet/ -v
Unicycler
$ unicycler -1 2_bb_out/hpy_a45_1.fastq.gz -2 2_bb_out/hpy_a45_2.fastq.gz -o unicycler -t 6
Shovill
$ shovill --outdir 4_assembly/hpy --R1 2_bb_out/hpy_a45_1.fastq.gz --R2 2_bb_out/hpy_a45_2.fastq.gz
Contig Management
MeDuSa
Check if there are any colons in the header of fasta and replace
them (Optional)
$ sed -i 's/:/_/g' scaffolds.fasta
Make a new directory - 5_scaffolding
, navigate to it
and make sub directories hpy
, hpy/Ref
$ mkdir 5_scaffolding
$ cd 5_scaffolding
$ mkdir hpy
$ mkdir hpy/Ref
Copy scaffolds.fasta
to hpy
directory
and reference genome of H. pylori to the hpy/Ref
directory
Keep only the required reference genomes in the Ref
directory
$ cp ../4_assembly/hpy/scaffolds.fasta hpy/
$ cp ../ref_genome/hpy_a45.fasta hpy/Ref/
Activate scaffolder
env and run
medusa
$ mamba activate scaffolder
(scaffolder)$ medusa -d -f hpy/Ref -i hpy/scaffolds.fasta -random 10 -w2 -v
If you face cPickle error open python file and Change
cPickle
to pickle
in
~/mambaforge/envs/scaffolder/share/medusa-1.6-2/script/netcon_mummer.py
Pilon
Before running Pilon you have to index the fasta file and
reads
(mauve)$ mkdir 6_polishing
(mauve)$ mkdir 6_polishing/hpy
(mauve)$ cd 6_polishing/hpy
(mauve)$ mamba deactivate
$ mamba activate mappers
Copy the scaffolds and index
(mappers)$ cp ../../5_scaffolding/hpy/scaffolds.fastaScaffold.fasta ./scaffolds.fasta
(mappers)$ bowtie2-build scaffolds.fasta hpy_assembly
Align reads to genome (5 mins with 12 cores)
(mappers)$ bowtie2 -x hpy_a45 -1 ../../2_bb_out/hpy_a45_1.fastq.gz -2 ../../2_bb_out/hpy_a45_2.fastq.gz -S reads_on_assembly.sam -p 6
Convert SAM to BAM, sort and index
(mappers)$ samtools view reads_on_assembly.sam -b -o reads_on_assembly.bam -@ 6
(mappers)$ samtools sort reads_on_assembly.bam -o reads_on_assembly_sorted.bam -@ 6
(mappers)$ samtools index reads_on_assembly_sorted.bam
Run pilon
(mappers)$ mamba deactivate
(mappers)$ mamba activate shovill
(shovill)$ pilon --genome scaffolds.fasta --frags reads_on_assembly_sorted.bam
If there is a memory error open
~/mambaforge/envs/shovill/bin/pilon
and increase the max
memory option
default_jvm_mem_opts = ['-Xms512m', '-Xmx1g']
to
default_jvm_mem_opts = ['-Xms512m', '-Xmx4g']
GapCloser (Alternative)
Activate filler
env
(shovill)$ mamba deactivate
$ mamba activate filler
Create a hpy_a45_GC.config
file
max_read_len=250
[LIB]
name=a45
avg_ins=468
reverse_seq=0
asm_flags=4
rank=1
pair_num_cutoff=3
map_len=32
q1=2_bb_out/hpy_a45_1.fastq.gz
q2=2_bb_out/hpy_a45_2.fastq.gz
Create directories and run GapCloser
(filler)$ mkdir 7_filling
(filler)$ mkdir 7_filling/hpy
(filler)$ GapCloser -a 5_scaffolding/hpy/scaffolds.fastaScaffold.fasta -b hpy_a45_GC.config -o 7_filling/hpy/hpy_a45_GC.fasta -t 6
QC for the assembled genome
BUSCO
Activate busco
env.
(shovill)$ cd ../
(shovill)$ mamba deactivate
$ mamba activate busco
(busco)$ mkdir 8_assemblyQC
(busco)$ mkdir 8_assemblyQC/busco
(busco)$ cd 8_assemblyQC
Run BUSCO
(busco)$ busco -m genome -i ../6_polishing/hpy/pilon.fasta -o busco/hpy_a45 --auto-lineage-prok -c 6
Check for memory error while running SEPP. Minimum 6GB is
required
CheckM
Activate checkm
env
(busco)$ mamba deactivate
(busco)$ mamba activate checkm
(checkm)$ mkdir checkm
Make a new directory hpy
and copy
pilon.fasta
(checkm)$ mkdir checkm/hpy
(checkm)$ cp ../6_polishing/hpy/pilon.fasta checkm/hpy/hpy_a45.fasta
Run checkm
(checkm)$ checkm lineage_wf -x fasta checkm/hpy/ ./ -r -f checkm/hpy.txt -t 6
Requires up to 40GB memory. May need to increase the swap size.
Run assembly-stats
(checkm)$ assembly-stats checkm/hpy/*.fasta > assembly_stats.txt
Make genomes
directory and copy the
pilon.fasta
$ mkdir draft_genomes
$ cp 6_polishing/hpy/pilon.fasta draft_genomes/hpy_a45.fasta
Quast
Another QC tool for genome assemblies
Annotation
Gene detection and annotation
Prokka
Activate annotator
env.
$ mamba activate annotator
Make a directory and run prokka
(annotator)$ mkdir 9_annotation
(annotator)$ prokka --outdir 9_annotation/hpy --prefix hpy_a45 --genus Helicobacter draft_genomes/hpy_a45.fasta
Phylogeny
Mafft
Create Mafft env. and install
(roary)$ mamba deactivate
(base)$ mamba create -n phylogeny -y
(base)$ mamba activate phylogeny
(phylogeny)$ mamba install -c bioconda mafft
Run MAFFT
(phylogeny)$ mafft --maxiterate 100 --reorder --thread 10 16S_a45-Ref-Out.fasta > 16S_a45-Ref-Out_aln.fasta
RaxML
RaxML GUI
raxmlHPC-PTHREADS-SSE3 -T 10 -f a -x 288426 -p 288426 -N 100 -m GTRGAMMA -O -o H_acinonychis -n 16S -s 16S_BSE-Ref-Out_aln_modified.fasta
TYGS
Type Strain detection
Online server
Comparison
Pair-wise or multiple
D-Genies
Pair-wise comparison with dot plots
$ mamba activate dotplot
(dotplot)$ dgenies run
Mauve
Multiple genome comparison
Perform progressive alignment
Roary
Activate pangenome
env.
Roary takes multiple gff files produced by prokka as input.
So, download the required genomes (those you want to comapre) from a
public database like NCBI.
Run prokka on all those genomes.
$ mamba activate pangenome
(pangenome)$ cd ../
(pangenome)$ mamba activate pangenome
Make a new directory raory
Run roary
(pangenome)$ mkdir roary_out
(pangenome)$ roary 9_annotation/*/*.gff -e -n -r -v -f roary_out/hpy -p 6
If Error: not found File::Find::Rule
(pangenome)$ cpan File::Find::Rule
FriPan
View the Roary results
Copy roary output to FriPan root and rename it to filename.roary
Change the server module in server.sh
, as suggested in
this site
Biosynthetic Gene Clusters
antiSMASH
Online or local
$ mamba activate bgc
(bgc)$ mkdir antismash_out
(bgc)$ antismash draft_genomes/genome_name.fasta --output-dir antismash_out/genome_name --genefinding-tool prodigal
Visualization
Circos
Display scaffolds, genes, GC content, etc