Day 1 - Bacterial Genome Analysis

Raw reads QC

Check the quality with

FastQC

Navigate to bga directory and activate qc env

$ cd ~/ws_jul2023/bga
$ mamba activate qc

Open FastQC GUI. Analyze and save the reports
```
(qc)$ fastqc
```
Check the Basic statistics, Per base quality, Sequence length distribution, Overrepresented sequences and Adapter content sections
Make a directory 1_raw_qc and save results to it

BBDuk

Copied adapters.fa from ~/mambaforge/envs/qc/opt/bbmap-39.01-0/resources/adapters.fa to current directory as bb_adapters.fa

Run bbduk.sh

(qc)$ mkdir 2_bb_out
(qc)$ bbduk.sh -Xmx4g in1=raw_reads/hpy_a45_1.fastq.gz \
                      in2=raw_reads/hpy_a45_2.fastq.gz \
                      out1=2_bb_out/hpy_a45_1.fastq.gz \
                      out2=2_bb_out/hpy_a45_2.fastq.gz \
                      ref=bb_adapters.fa \
                      k=23 mink=7 ktrim=r hdist=1 qtrim=r trimq=20 minlen=100 tpe tbo

Expected result

Input:              1741880 reads           436749684 bases.
QTrimmed:           1521272 reads (87.34%)  103531426 bases (23.70%)
KTrimmed:           394948 reads (22.67%)   13236420 bases (3.03%)
Trimmed by overlap: 8601 reads (0.49%)      88509 bases (0.02%)
Total Removed:      170588 reads (9.79%)    116856355 bases (26.76%)
Result:             1571292 reads (90.21%)  319893329 bases (73.24%)

Open FastQC GUI. Analyze and save the reports
```
(qc)$ fastqc
```

Trimmomatic (Alternative)

Run trimmomatic. You can use the same adapters file

(qc)$ mkdir trim_out
(qc)$ cd trim_out
(qc)$ trimmomatic PE -phred33 \
                     ../raw_reads/hpy_a45_1.fastq.gz \
                     ../raw_reads/hpy_a45_2.fastq.gz \
                     hpy_a45_1_paired.fq.gz hpy_a45_1_unpaired.fq.gz \
                     hpy_a45_2_paired.fq.gz hpy_a45_2_unpaired.fq.gz \
                     ILLUMINACLIP:../bb_adapters.fa:2:30:10 \
                     SLIDINGWINDOW:4:20 MINLEN:100

Expected result

Input Read Pairs:         870940 
Both Surviving: 5         99798 (68.87%) 
Forward Only Surviving:   160897 (18.47%) 
Reverse Only Surviving:   28249 (3.24%) 
Dropped:                  81996 (9.41%)

Open FastQC GUI. Analyze and save the reports
```
(qc)$ fastqc
```

After filtering the raw reads, you can choose either of the following methods depending on the availability of a reference genome or intra-species variations.

Basically, if you have a reference genome and do not expect much variation from it, then the reads can be mapped to the reference. Else, de novo assembly is preferred.

Mapping to a Reference Genome

Index the reference genome sequence

(qc)$ mamba activate mappers
(mappers)$ mkdir 3_mapping
(mappers)$ mkdir 3_mapping/hpy
(mappers)$ cd 3_mapping/hpy
(mappers)$ cp ../../ref_genome/hpy_a45.fasta ./
(mappers)$ bwa index -a is hpy_a45.fasta

Align Reads separately

(mappers)$ cp ../../2_bb_out/hpy* ./
(mappers)$ bwa aln -t 6 hpy_a45.fasta hpy_a45_1.fastq.gz > hpy_a45_1.sai
(mappers)$ bwa aln -t 6 hpy_a45.fasta hpy_a45_2.fastq.gz > hpy_a45_2.sai

Create SAM file and convert it to BAM

(mappers)$ bwa sampe hpy_a45.fasta hpy_a45_1.sai hpy_a45_2.sai hpy_a45_1.fastq.gz hpy_a45_2.fastq.gz > hpy_a45_aln.sam
(mappers)$ samtools view -S hpy_a45_aln.sam -b -o hpy_a45_aln.bam -@ 6

Sort and index BAM file

(mappers)$ samtools sort hpy_a45_aln.bam -o hpy_a45_sorted.bam -@ 6
(mappers)$ samtools index hpy_a45_sorted.bam

Creating a consensus

(mappers)$ bcftools mpileup -f hpy_a45.fasta hpy_a45_sorted.bam | bcftools call -c | vcfutils.pl vcf2fq > hpy_a45_consensus.fq

Convert to fasta & change the identifier

(mappers)$ mamba activate emboss
(emboss)$ seqret -osformat fasta hpy_a45_consensus.fq -out2 hpy_a45_consensus.fa

Exporting unmapped reads (Optional)

(emboss)$ mamba activate bam2fastq
(bam2fastq)$ bam2fastq --no-aligned --force --strict -o hpy_a45_unmapped#.fq a45_sorted.bam

Check the rRNA Genes (Optional)

(bam2fastq)$ mamba activate annotator
(annotator)$ barrnap -o cons_rrna.fa < hpy_a45_consensus.fa > cons_rrna.gff

Pre-requisites | QC & mapping | de novo | Ampliseq

````