Created
September 5, 2016 08:22
-
-
Save claczny/adefc4d6be84c4d42a6498d1b69ac31a to your computer and use it in GitHub Desktop.
A Makefile to compute the average genome coverage, coverage distribution, and other things from an input BAM-file. N.B. Secondary expansion (.SECONDEXPANSION) is used to create and populate a dedicated directory per sample.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SHELL=/bin/bash | |
SAMPLE?=<YOUR_SAMPLE> | |
DOUBLED_SAMPLE = $(SAMPLE)/$(SAMPLE) | |
RDIR?=results | |
DDIR?=data | |
##### | |
# BEAUTY TARGETS | |
##### | |
.PHONY: all extract_fastq bam2sam sort_bam index_bam genomecov | |
all: extract_fastq genomecov | |
extract_fastq: $(RDIR)/$(DOUBLED_SAMPLE).fq | |
bam2sam: $(RDIR)/$(DOUBLED_SAMPLE).sam | |
sort_bam: $(RDIR)/$(DOUBLED_SAMPLE).srtd.bam | |
index_bam: $(RDIR)/$(DOUBLED_SAMPLE).srtd.bai | |
genomecov: $(RDIR)/$(DOUBLED_SAMPLE).srtd.cov_avg.txt | |
# SOME BASIC STATISTICS | |
get_unique_seq_count: $(RDIR)/$(DOUBLED_SAMPLE).sam | |
awk '{print $$1}' $^ | sort |uniq -c | wc -l | |
get_mapq_distribution: $(RDIR)/ $(DOUBLED_SAMPLE).sam | |
awk -F"\t" '{print $$5}' $^ | sort |uniq -c | |
get_cigar_distribution: $(RDIR)/$(DOUBLED_SAMPLE).sam | |
awk -F"\t" '{print $$6}' $^ | sort |uniq -c | |
##### | |
# ACTUAL TARGETS | |
##### | |
.SECONDARY: | |
.SECONDEXPANSION: | |
$(RDIR)/%.fq: $(DDIR)/$$(notdir $$*).bam | |
mkdir -p $(dir $@) | |
@date | |
time bedtools bamtofastq -i $^ -fq $@ | |
@date | |
$(RDIR)/%.sam: $(DDIR)/$$(notdir $$*).bam | |
@date | |
time samtools view $^ > $@ | |
@date | |
$(RDIR)/%.srtd.bam: $(DDIR)/$$(notdir $$*).bam | |
@date | |
time samtools sort $^ $(@:.bam=) | |
@date | |
%.bai: %.bam | |
@date | |
time samtools index $^ $@ | |
@date | |
%.srtd.cov_hist.txt: %.srtd.bam %.srtd.bai | |
@date | |
bedtools genomecov -ibam $(word 1,$^) > $@ | |
@date | |
%.cov_avg.txt: %.cov_hist.txt | |
@date | |
awk -F"\t" 'BEGIN {pc=""} \ | |
{\ | |
c=$$1;\ | |
if (c == pc) {\ | |
cov=cov+$$2*$$5;\ | |
} else {\ | |
print pc,cov;\ | |
cov=$$2*$$5;\ | |
pc=c}\ | |
} END {print pc,cov}' $^ | tail -n +2 > $@ | |
@date | |
##### | |
# CLEAN-UP | |
##### | |
clean: | |
echo "TODO: clean" |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
DESCRIPTION
This Makefile can be used to compute the average coverage, coverage histogram, and other results given a BAM-file.
Secondary expansion is used to create a separate directory which is named after the sample.
This should make the organization easier as all the results will be grouped by sample.
There may be other ways to achieve a similar effect using
make
, however, this is the only one that I stumbled upon so far.SETUP
You must provide the name of your sample in the
SAMPLE
variable. Moreover, please set yourresults
(output) anddata
(input) directories accordingly.While they can be specified within the Makefile, they may be specified in the
make
call, e.g.,make SAMPLE=FOO RDIR=/path/to/my/results DDIR=/path/to/my/bam/files
, thus, with the input BAM-file being/path/to/my/bam/files/FOO.bam
.DOUBLED_SAMPLE
is used to create a dedicated results directory for each input BAM-file, e.g.extract_fastq: $(RDIR)/$(DOUBLED_SAMPLE).fq
aims at extracting the mapped sequences for the sample of interest into the respective results-directory.SECONDARY EXPANSION
.SECONDEXPANSION
is used to name the dependencies according to the target.The aim is to make the whole approach more generic and make sure that all the newly "made" files are put into their respective, sample-specific folder.
$(RDIR)/%.fq
is the typical way of creating a generalmake
-target (s.a. pattern rules).$(DDIR)/$$(notdir $$*).bam
is where the secondary expansion is used. The$$
(double dollar sign) highlights this.More precisely, on the first expansion, the
$$*
will evaluate to$*
. In this way, upon the secondary expansion, the$*
can be evaluated to the value of the place holder%
, e.g., FOO, which is known/set after the first expansion.In a way, it is similar to creating a dedicated Makefile for
FOO.bam
with the rule$(RDIR)/FOO/FOO.fq
: $(DDIR)/FOO.bam`, but much more generic :)CARRYING ON
Once the files that are dependent on the input BAM-file are created in the sample-specific results folder, the
make
-rules can be written as usual, e.g.,%.bai: %.bam
. This target-line means: For a given BAM-file, create a respective BAI-file. For example, it the target is/path/to/your/results/FOO.srtd.bam
, the dependency will be/path/to/your/results/FOO.srtd.bam
. It represents a pattern-rule, s. above.WHAT'S THE POINT OF THIS?
The main point of this is that I often encounter the case where I have multiple individual files and a set of independent computations has to performed on each of them.
This results in multiple files being generated per input file - think, Single Instruction Multi Data (SIMD) or even Multiple Instruction Multiple Data (MIMD).
Using pattern rules, this is extremely convenient to achieve with
make
, yet the organization of the results suffers as all the results would typically be created in the same results folder, without any subdirectories.This Makefile should help alleviate this cluttering while preserving all the nice functionalities of
make
, e.g., running multiple jobs concurrently using the-j
option.