Day 1 - Bacterial Genome Analysis

Raw reads QC

Check the quality with

FastQC

  1. Navigate to bga directory and activate qc env

    $ cd ~/ws_jul2023/bga
    $ mamba activate qc
  2. Open FastQC GUI. Analyze and save the reports

    (qc)$ fastqc
  3. Check the Basic statistics, Per base quality, Sequence length distribution, Overrepresented sequences and Adapter content sections

  4. Make a directory 1_raw_qc and save results to it

BBDuk

Copied adapters.fa from ~/mambaforge/envs/qc/opt/bbmap-39.01-0/resources/adapters.fa to current directory as bb_adapters.fa

  1. Run bbduk.sh

    (qc)$ mkdir 2_bb_out
    (qc)$ bbduk.sh -Xmx4g in1=raw_reads/hpy_a45_1.fastq.gz \
                          in2=raw_reads/hpy_a45_2.fastq.gz \
                          out1=2_bb_out/hpy_a45_1.fastq.gz \
                          out2=2_bb_out/hpy_a45_2.fastq.gz \
                          ref=bb_adapters.fa \
                          k=23 mink=7 ktrim=r hdist=1 qtrim=r trimq=20 minlen=100 tpe tbo

    Expected result

    Input:              1741880 reads           436749684 bases.
    QTrimmed:           1521272 reads (87.34%)  103531426 bases (23.70%)
    KTrimmed:           394948 reads (22.67%)   13236420 bases (3.03%)
    Trimmed by overlap: 8601 reads (0.49%)      88509 bases (0.02%)
    Total Removed:      170588 reads (9.79%)    116856355 bases (26.76%)
    Result:             1571292 reads (90.21%)  319893329 bases (73.24%)
  2. Open FastQC GUI. Analyze and save the reports

    (qc)$ fastqc

Trimmomatic (Alternative)

  1. Run trimmomatic. You can use the same adapters file

    (qc)$ mkdir trim_out
    (qc)$ cd trim_out
    (qc)$ trimmomatic PE -phred33 \
                         ../raw_reads/hpy_a45_1.fastq.gz \
                         ../raw_reads/hpy_a45_2.fastq.gz \
                         hpy_a45_1_paired.fq.gz hpy_a45_1_unpaired.fq.gz \
                         hpy_a45_2_paired.fq.gz hpy_a45_2_unpaired.fq.gz \
                         ILLUMINACLIP:../bb_adapters.fa:2:30:10 \
                         SLIDINGWINDOW:4:20 MINLEN:100

    Expected result

    Input Read Pairs:         870940 
    Both Surviving: 5         99798 (68.87%) 
    Forward Only Surviving:   160897 (18.47%) 
    Reverse Only Surviving:   28249 (3.24%) 
    Dropped:                  81996 (9.41%)
  2. Open FastQC GUI. Analyze and save the reports

    (qc)$ fastqc

After filtering the raw reads, you can choose either of the following methods depending on the availability of a reference genome or intra-species variations.

Basically, if you have a reference genome and do not expect much variation from it, then the reads can be mapped to the reference. Else, de novo assembly is preferred.

Mapping to a Reference Genome

  1. Index the reference genome sequence

    (qc)$ mamba activate mappers
    (mappers)$ mkdir 3_mapping
    (mappers)$ mkdir 3_mapping/hpy
    (mappers)$ cd 3_mapping/hpy
    (mappers)$ cp ../../ref_genome/hpy_a45.fasta ./
    (mappers)$ bwa index -a is hpy_a45.fasta
  2. Align Reads separately

    (mappers)$ cp ../../2_bb_out/hpy* ./
    (mappers)$ bwa aln -t 6 hpy_a45.fasta hpy_a45_1.fastq.gz > hpy_a45_1.sai
    (mappers)$ bwa aln -t 6 hpy_a45.fasta hpy_a45_2.fastq.gz > hpy_a45_2.sai
  3. Create SAM file and convert it to BAM

    (mappers)$ bwa sampe hpy_a45.fasta hpy_a45_1.sai hpy_a45_2.sai hpy_a45_1.fastq.gz hpy_a45_2.fastq.gz > hpy_a45_aln.sam
    (mappers)$ samtools view -S hpy_a45_aln.sam -b -o hpy_a45_aln.bam -@ 6
  4. Sort and index BAM file

    (mappers)$ samtools sort hpy_a45_aln.bam -o hpy_a45_sorted.bam -@ 6
    (mappers)$ samtools index hpy_a45_sorted.bam
  5. Creating a consensus

    (mappers)$ bcftools mpileup -f hpy_a45.fasta hpy_a45_sorted.bam | bcftools call -c | vcfutils.pl vcf2fq > hpy_a45_consensus.fq
  6. Convert to fasta & change the identifier

    (mappers)$ mamba activate emboss
    (emboss)$ seqret -osformat fasta hpy_a45_consensus.fq -out2 hpy_a45_consensus.fa
  7. Exporting unmapped reads (Optional)

    (emboss)$ mamba activate bam2fastq
    (bam2fastq)$ bam2fastq --no-aligned --force --strict -o hpy_a45_unmapped#.fq a45_sorted.bam
  8. Check the rRNA Genes (Optional)

    (bam2fastq)$ mamba activate annotator
    (annotator)$ barrnap -o cons_rrna.fa < hpy_a45_consensus.fa > cons_rrna.gff



    Pre-requisites | QC & mapping | de novo | Ampliseq

    ````