Last active
February 17, 2025 16:19
[Python] Read VCF (variant call format) as pandas.DataFrame
#!/usr/bin/env python

import io
import os
import pandas as pd


def read_vcf(path):
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})
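Note that the script above only defines read_vcf; nothing is printed until the function is called. A minimal usage sketch follows, assuming a tiny two-record VCF written to a temporary file. The file contents and the sample name sample1 are made up for illustration and are not part of the gist.

```python
import io
import os
import tempfile

import pandas as pd


def read_vcf(path):
    # Same logic as the gist: drop '##' meta lines, keep the '#CHROM' header.
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})


# Hypothetical two-record VCF, invented for this example.
vcf_text = (
    '##fileformat=VCFv4.2\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
    'chr1\t100\t.\tG\tA\t.\t.\t.\tGT\t0/1\n'
    'chr2\t101\t.\tT\tC\t.\t.\t.\tGT\t1/1\n'
)

with tempfile.TemporaryDirectory() as tmpdir:
    vcf_path = os.path.join(tmpdir, 'example.vcf')
    with open(vcf_path, 'w') as f:
        f.write(vcf_text)
    # The file is read fully inside the function, so df outlives the temp dir.
    df = read_vcf(vcf_path)

print(df)
```

With a real file, only the last two steps are needed: df = read_vcf('/path/to/your.vcf') followed by print(df).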
Here is a way of doing it that uses all fields of any VCF, using PyVCF (https://pyvcf.readthedocs.io/en/v0.4.6/INTRO.html):
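import pandas as pd
import vcf


def read(f):
    # Parse every record with PyVCF, closing the file once all records are read.
    with open(f) as handle:
        reader = vcf.Reader(handle)
        df = pd.DataFrame([vars(r) for r in reader])
    # Expand the per-record INFO dicts into their own columns.
    out = df.merge(pd.DataFrame(df.INFO.tolist()),
                   left_index=True, right_index=True)
    return out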
Then run read(your_vcf), where your_vcf is the path to your VCF file.
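If PyVCF is not available, the INFO expansion step can be approximated with pandas alone. The following is only a rough sketch of that idea, not the PyVCF method: parse_info and expand_info are hypothetical helper names, the INFO_ column prefix is an arbitrary choice, and all values are kept as strings (no type conversion from the VCF header, no splitting of comma-separated lists).

```python
import pandas as pd


def parse_info(info):
    """Turn an INFO string such as 'DP=10;AF=0.5;DB' into a dict."""
    if info in ('.', ''):
        return {}
    fields = {}
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        # Flag entries (no '=') are recorded as True.
        fields[key] = value if sep else True
    return fields


def expand_info(df):
    """Return df with one extra column per INFO key (hypothetical helper)."""
    info_df = pd.DataFrame([parse_info(s) for s in df['INFO']], index=df.index)
    return df.join(info_df.add_prefix('INFO_'))


# Made-up records for illustration only.
df = pd.DataFrame({
    'CHROM': ['chr1', 'chr2'],
    'POS': [100, 101],
    'INFO': ['DP=10;AF=0.5', 'DP=7;DB'],
})
expanded = expand_info(df)
print(expanded)
```

Keys missing from a record show up as NaN in that row, so the result has one column for every INFO key seen anywhere in the file.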
If anyone's interested, I was looking for a way to do this too and ended up writing the pyvcf submodule.
A quick example of pyvcf.VcfFrame:
data = {
'CHROM': ['chr1', 'chr2'],
'POS': [100, 101],
'ID': ['.', '.'],
'REF': ['G', 'T'],
'ALT': ['A', 'C'],
'QUAL': ['.', '.'],
'FILTER': ['.', '.'],
'INFO': ['.', '.'],
'FORMAT': ['GT', 'GT'],
'Steven': ['0/1', '1/1']
}
vf = pyvcf.VcfFrame.from_dict([], data)
vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1
To read a VCF file into VcfFrame:
vf = pyvcf.VcfFrame.from_file('example.vcf')
This was so so useful. Thank you very much @dceoy
It works great. Thanks
Hi,
Did you find a solution for not seeing any result after running the Python script? I am facing the same issue.
This was all I needed for now. Thank you very much!! :)
That was indeed useful! Thank you very much!!
@pdorsaint I developed the pdbio package. Please use it: https://github.com/dceoy/pdbio
This package is a pandas-based data handling tool that can also be used from the command line.
Example of VCF data handling: