@claraj
Last active March 31, 2021 13:47
Python bioinformatics 101
# Example of real-world use of Python string manipulation - DNA analysis
#
# DNA is made of four bases, written as A, T, G and C
#
# A-T are always paired
# G-C are always paired
#
# So if you have a sequence of one side of a DNA molecule, can you use Python to generate the other side?
#
dna1 = 'ACCAGTACCAGTGT'
dna2 = 'GTACACCAGGTCTA'
# Remember
# A pairs with T and T pairs with A
# C pairs with G and G pairs with C
# so for dna1, it begins ACCA... so output string will start TGGT...
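#
# A minimal sketch of one way to do this (not necessarily the intended solution),
# assuming the sequence is upper-case and contains only A, T, G and C.
# The names COMPLEMENT_TABLE and complement are made up for this example.
COMPLEMENT_TABLE = str.maketrans('ATGC', 'TACG')


def complement(strand):
    """Return the paired strand, base by base, in the same left-to-right order."""
    return strand.translate(COMPLEMENT_TABLE)


print(complement('ACCAGTACCAGTGT'))  # same letters as dna1; starts TGGT...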
# In DNA, the A, T, C, G are codes for generating proteins.
# You are made of thousands of different kinds of proteins.
# Most of your DNA doesn't appear to make proteins - only about 1% of it encodes protein.
# A part of DNA that encodes a protein is called a gene. So how do you find which parts do,
# or where your genes are in your DNA?
#
# So biologists are interested in where certain codes are. One code is ATG. This
# is called a 'start codon', and it means 'start making a protein here'.
#
# Does dna1 have any start codons (possible genes) in it?
# Does dna2 have any start codons (possible genes) in it?
#
# If so, what's the index where that start codon begins?
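#
# One possible way to answer this, as a sketch: str.find returns the index of the
# first match, or -1 if there is none. The helper name find_all is made up for
# this example, and it only reads the strand as written, left to right.
def find_all(sequence, codon='ATG'):
    """Return a list of every index where codon starts in sequence."""
    positions = []
    index = sequence.find(codon)
    while index != -1:
        positions.append(index)
        # Continue searching one character past the last match
        index = sequence.find(codon, index + 1)
    return positions


print(find_all('CCATGGATGA'))      # made-up test sequence, prints [2, 6]
print(find_all('ACCAGTACCAGTGT'))  # same letters as dna1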
dna3 = 'ACGACGGATACGCGGGAGCTATTCATCTGTGTTGAGAAACACCGGAGAACTTATTGGTCTGTCAAGATTGCGACTGTGGTATAGCTCACCCGGTCGCGGCTTTCTAGT' \
'TAGTGGCCAGCTCCCGTGTATTTGGAAGCTGAGAGAAGGACCCCTGTGGTTCGAATCAGCTCACGAGCGCTGGCACACCGCAATCAGCCGGCTAATAAAATTCGTACG' \
'GACTGCCCCACACAAGAAGACGGTAAATTTATCAACACTATAGTTGCTATACACCAGGAGCGAGCGTAAATTTGTAGCGGTCAGATTAACTTGCTGGGAACGAACCAT' \
'TGTCGCCCTCTGCAGCAAGTTAGTTGGCATCATTGGTACTGCCCTTCACTGGTAGCAGCTCCCCCTGTAATATATCCGTGGCCACTATTCAAGGGCTCAAATAGGCGA' \
'CCCAGAGACCATTATAGGCGGTACAGCGCTGGTAGGTTTGCCTGGGCAGATATCGTTAGCCCCTTCTGCGCGCTATAAGATAGCGAAGGATAATTCTGCGGGACCA' \
'TGGTCGTCTCCTAACCTCAGGGTGGGATTCCTGGCAGGTGGACCGGGCGCGCATCGAGAGCATTCGGGGTTCCTACCAGCCAGGGAAATCGGGTCGACCACTAGGCAA' \
'TGAGCGGCTCACACCGATTTTCTTAAGAGACGTAACAAAGCCCGCATTAACGGCTGGAGTGAATCACCGTACGACTACCTAAGCCTCATTGGGATCCACTGTAAACCC' \
'CTTCGCCGGTGTTGGGTGTCCGCAACGCCTCTGCTTTTTGCGTACAGTCGGCGTGGTGGAGTCCGCGGCCATACTGGCGGTTGGTTTGTAGAACAGTGTAACGACGTG' \
'TGTCACTGCCCCCCGTAGCTTCTATTGCCCTGTTTGGGAGGTTCTATAGGGGTTACAGAGTAGTTTTAAGTTTTAGCACGACAGCACCAGTATTGCCAGTGACGCCGT' \
'TGAGGCCGCAAAAGTGATTAACCCCCGTGGGACCGGATACGTTCCCAGCGGCAATCCTTGTCTTACCGCCGGACTGCGGAGCGAAGGGAGAAGTAACCGTGGTAATTA'
dna4 = 'CAGAGCAATGTCTGTTAGATAATCTCTCGTCTGGATAGCGAGAAGTTTCCGGAAGACGATTGTTTCCAACGAAAGGGCTGATAACTACACTCTGTCGCGCTTCTTTCG' \
'TGTTCGCCAAGGGCACATTGGTTTAAAAGTGATCTCGAGAGACGTTTTCCTGACTTGTTGTGTTATATCAACGTAACTTTTAAGTCATATTTTCTCCCTACCCCAGAC' \
'TAGACGGGTTCCTTTCATCGTCCACCGAGTTGCTTACGAGCAUGACACTTAGCCGGGGAAAAGTTCGCAATTCCGCGACAGCGTCAGGTGTCAAACAGATCCAAGCGA' \
'AGGCCGCCGTGTAACGGAGAATTGTGGGCGCAGTCAAATAGCTAATTATTGGGAAAGGCCAAGTGGAGTCCGTCAGCGGAACAGCCTGGGCGGACGCGCTGCCGCTCG' \
'TTCACCTCGCCTGCCTTCGTGTTGGGGACCGGATACGTTCCCAGCGGCAATCCTTGTCTTACCGCCGGACTGCGGAGCGAAGGGAGAAGTAACCGTGGTAATTAGCGA' \
'GAGACCGTTGAGGCGCGGGGCGATCCGCCCTTGAGTGGACTCCAAACACATTCGACGAAGGGGTGGGAACATAAGTTAATTGGAGGGTCGGGGAAGTCCCACGCCCGG' \
'TCCCTACATGATTGCACATAGTTCGTTCACCAACGGGCGATCTTCCTCACACTAGAGGAACGAGTAGTACTCCAGACATTGAGTCAGTTGCAGACCAAGTGGAGGGAA' \
'CGATTTTTAUGGGCCGCTCAGGTACTAGTGCTAGACCCTACAAACGGCACTGGTGACCCGCTCCCGAGTTTGCGCTGTTACGTGTCCCTTAAAGTATACTTCGATCCT' \
'AACATCGCGGCCATACGACGCTTAAATATTTCACCAGTTGTGTTTCGCGCAUGGAGTTGTTCTGTGTTATCGGCGAGTCTCCATTGCACGTCATCAACTAAAAACCAC' \
'GGCCACACAGACATGCCTTGATTCTTCCCGCGACGGTAGGTTTGCCTGGGCAGATATCGTTAGCCCCTTCTGCGCGCTATAAGATAGCGATAGTAGGTTTAACTATCA'
print(dna4.upper())
# Why use the \ line continuation and not triple-quoted strings?
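#
# One likely reason, shown as a sketch with short made-up strings: adjacent string
# literals joined with \ become one string with nothing in between, while a
# triple-quoted string keeps each line break as a '\n' character.
continued = 'ACGT' \
            'TGCA'
triple_quoted = """ACGT
TGCA"""
print(len(continued))      # 8
print(len(triple_quoted))  # 9, because of the newline character
print(triple_quoted.replace('\n', '') == continued)  # True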