@chrisamiller
Last active November 18, 2024 19:11
Long Read Alignment

Let's start by looking at some outputs of long read sequencing from the Oxford Nanopore (ONT) platform. These are sequences from the K562 cell line, prepared with the ONT cDNA sequencing kit (poly-A selected). Off the machine, the data consist of FAST5 or POD5 files, which are compressed representations of the raw signal. These are subsequently run through a basecalling algorithm (such as Dorado) to generate FASTQ files.
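For context, a basecalling run with Dorado might look roughly like the sketch below. This is a hypothetical invocation, not something we'll run in this course: model shorthands and flags vary by Dorado version, and `pod5_dir/` is a made-up path.

```shell
# Hypothetical sketch: "hac" is Dorado's high-accuracy model shorthand,
# and pod5_dir/ is a placeholder for a directory of POD5 files
dorado basecaller hac pod5_dir/ --emit-fastq > k562_ont_raw.fastq
gzip k562_ont_raw.fastq
```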

The choice of basecalling algorithm and parameters goes pretty deep, so we'll assume that reasonable choices have been made. For simplicity, we've also subset the data to include just small portions of the genome, including a few genes of interest.

Go ahead and pull down this fastq file:

wget https://storage.googleapis.com/bfx_workshop_tmp/k562_ont_raw.fastq.gz

To start, let's go ahead and trim the data with the pychopper tool from ONT. It's used to identify, orient, trim, and un-fuse reads. We don't have a local install of this tool, so let's use docker to run it:

docker run -it -v $(pwd -P):/data ontresearch/wf-isoforms

pychopper -t 8 -m edlib -k PCS111 -r pychopper_report.pdf -S pychopper_stats.txt k562_ont_raw.fastq.gz | gzip > k562_ont_trimmed.fastq.gz

After that finishes, we can look at the output reports, in the forms of some statistics (pychopper_stats.txt) and some plots about them (pychopper_report.pdf). In this case, our data has been pre-selected, and so essentially all of the reads are usable, but in some real-life runs, a substantial fraction of them might not be.

Sequence QC

Next, let's do some basic QC on this FASTQ using a tool called NanoPlot.

mkdir nanoplot
NanoPlot --fastq k562_ont_trimmed.fastq.gz --prefix nanoplot/

Open the NanoPlot-report.html file and note that our reads here are quite long compared to short read data - over 1000 bp long. ONT typically advertises reads reaching tens of thousands of base pairs, though - why isn't this the case here?

Answer

This data is created from RNA, which means that the lengths of the molecules are dependent on the lengths of the transcripts, which are not typically tens of thousands of bases long
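You can sanity-check the mean read length yourself straight from a FASTQ. The sketch below uses a tiny made-up two-read FASTQ rather than the real file, but the same awk one-liner works on any uncompressed FASTQ:

```shell
# Make a tiny, made-up FASTQ (two reads of length 10 and 4) for illustration
cat > mini.fastq <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGT
+
IIII
EOF

# In FASTQ, the sequence is every 4th line starting at line 2
awk 'NR % 4 == 2 { total += length($0); n++ } END { print total / n }' mini.fastq
```

For a gzipped file like ours, prepend a `zcat k562_ont_trimmed.fastq.gz |` and drop the filename argument.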

Scrolling down, you can see this length, along with other stats, represented graphically with embedded dynamic plots, or you can retrieve simple png images from the output folder. Also note the mean read quality, which is substantially lower than what we'd see with Illumina short-read sequencing.
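Those quality scores are Phred values encoded as ASCII characters (Phred+33): subtracting 33 from a character's ASCII code gives its Q score. You can decode a quality line by hand; here `q` is a made-up quality string, not one from the real data:

```shell
# Decode a made-up Phred+33 quality string: ASCII value minus 33 = Q score
q="II5#"
scores=""
for (( i=0; i<${#q}; i++ )); do
  c=${q:$i:1}
  scores+="$(( $(printf '%d' "'$c") - 33 )) "
done
echo "${scores% }"   # 'I' decodes to Q40, '5' to Q20, '#' to Q2
```

Q20 corresponds to a 1% per-base error rate, so a read averaging Q10-Q20 is far noisier than a typical Q30+ Illumina read.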

Looking at the read lengths plots, they look a little multimodal. In a real experiment, that would be weird, but nothing to worry about in this case - it's an artifact that appears because we're only using a very small amount of data.

Aligning the reads

Next up, let's align this data, using a tool called minimap2. It's generally the go-to aligner for long-read data, whether from RNA or DNA.

Just like with short-read DNA data, we'll need a reference genome to align against. Pull down the human build 38 reference and untar it:

wget https://storage.googleapis.com/bfx_workshop_tmp/ref_genome_sm.tar
tar -xvf ref_genome_sm.tar

Your command for running the alignment should look like this:

minimap2 -ax splice -uf ref_genome_sm.fa.gz k562_ont_trimmed.fastq.gz >k562_ont.sam

While that's running, a few things to notice here:

  1. We don't need to create an index ahead of time for minimap2. It creates its index of the reference genome very quickly and efficiently compared to short-read aligners like bwa.

  2. We're using the -x splice parameter, which tells it to perform spliced alignment. This handles RNAseq data, which has to account for introns interrupting the reads. We'll talk about spliced alignment in more detail during the RNAseq portion of this course. We're also using the -uf flag, which helps it find exon boundaries more effectively by looking at the sequence context.
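One concrete way to see what spliced alignment means: in the output SAM, introns show up as N operations in the CIGAR string (column 6). The sketch below parses a single toy alignment line (made up for illustration, not from the real k562 data) with awk:

```shell
# A single made-up SAM alignment line (tab-separated fields)
line=$'read1\t0\tchr21\t100\t60\t10M200N10M\t*\t0\t0\tACGTACGTACGTACGTACGT\t*'

# Column 2 is the FLAG, column 4 the position, column 6 the CIGAR:
# 10 matched bases, a 200 bp intron skip (N), then 10 more matched bases
echo "$line" | awk -F'\t' '{ print "flag=" $2, "pos=" $4, "cigar=" $6 }'
```

Long RNA reads routinely carry several N operations in one CIGAR, one per intron they span.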

Now we've got an aligned SAM file, and you can look at it with less and see that it follows the familiar format of a header followed by alignments. As before, we're going to want to convert it to BAM, then sort and index it to make it useful for downstream steps.

samtools view -Sb k562_ont.sam | samtools sort >k562_ont.bam
samtools index k562_ont.bam
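As an aside, recent versions of samtools auto-detect SAM input, so the view and sort steps can usually be collapsed into one command. This is a sketch; confirm the behavior with your samtools version:

```shell
# samtools sort accepts SAM directly and writes coordinate-sorted BAM
samtools sort -o k562_ont.bam k562_ont.sam
samtools index k562_ont.bam
```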

At this point, let's load this up in IGV and get a feel for what we've got going on here. Let's grab an illumina bam, generated from the same cell line, for comparison:

wget https://storage.googleapis.com/bfx_workshop_tmp/k562_illumina.bam https://storage.googleapis.com/bfx_workshop_tmp/k562_illumina.bam.bai

Exercise:

  • Open both the Illumina and ONT bams in IGV, using the Human (hg38) reference genome.

  • Navigate to the U2AF1 gene and answer some questions:

    • How many exons are spanned by a typical short read? How about a typical long read? Which has more even coverage over the exons of these genes?

    • Right click on the Gene track and choose "Expanded". Can you match up individual long reads with full-length transcripts? What about short reads? How might this affect your ability to understand alternative splicing?

    • Zoom into an exon, until you can see basepair level changes. Which data has a higher error rate? Are the classes of errors you see between the two platforms different? What kinds of analyses would be limited by this error rate?

  • Now navigate to the DNMT3A gene and zoom in on the 3' (left) end.

    • What do you notice about the reads and the coverage?
    • Do these look like full-length transcripts to you?
    • Go to the ensembl website and look at the gold transcripts - the most stable, well characterized. How long is the dominant isoform?
    • Compare this to the short read data. Which is more useful in this situation? What kinds of biases does this filtering create when estimating gene expression or transcript abundance?

Expression Quantification

Transcript assembly and quantification is too CPU intensive to do in this course, but we recently ran ESPRESSO on some cDNA from 3 "normal" CD34+ samples and compared them to 4 samples from AML patients. We've taken those results and subset them to just the values from the U2AF1 gene.

Start a new session in IGV using human b38 then load the GTF file:

https://storage.googleapis.com/bfx_workshop_tmp/u2af1.gtf

As with the short-read data, the GTF file contains the transcript models. Navigate to the U2AF1 gene and expand the track (scroll down past the "Gene" track if needed).

Most transcripts have Ensembl IDs, but those that are marked as ESPRESSO* are novel assemblies. Do you think this one is a real event? Could it be explained by any of the artifacts we talked about?

Now let's look at some transcript abundance data from this gene. Pull down a TSV file with that info:

wget https://storage.googleapis.com/bfx_workshop_tmp/u2af1_abundance.tsv

What do you notice about the relative abundances? Let's make a plot comparing them. In R, grab the data:

expr = read.table("u2af1_abundance.tsv", sep="\t", header=T)

Drop some columns we don't need and set the transcript id as the row names:

txid=expr$transcript_ID
expr=expr[,4:ncol(expr)]
rownames(expr)=txid

Now make a plot of transcript abundance:

colors = c('#a6cee3','#1f78b4','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f','#ff7f00','#cab2d6','#6a3d9a','#ffff99')

pdf("u2af1_abundance.pdf",width=10,height=8)

barplot(as.matrix(expr),col=colors, las=2, ylab="Abundance")
legend("topright", legend=rownames(expr), fill=colors, cex=0.8, xpd = TRUE, ncol=2)

dev.off()

It can also be useful to look at a proportional barplot:

#for each column, convert to a proportion
expr.prop <- apply(expr, 2, function(x){x*100/sum(x,na.rm=T)})

pdf("u2af1_abundance_prop.pdf",width=10,height=8)

barplot(as.matrix(expr.prop),col=colors, las=2, ylab="Percent of total")
legend("topright", legend=rownames(expr), fill=colors, cex=0.8, xpd = TRUE, ncol=2)
dev.off()

Are there any obvious changes in transcript abundance between these two sets of samples?

Look up U2AF1 on the ensembl website and find the two most common transcripts. Are they the transcripts you'd expect? (click on "region in detail" if needed).
