@jo-lang
Last active January 26, 2023 09:30
Example script using a Counter to count occurrences of all words in a given text file
from collections import Counter

# Change the path to whatever file you want to be read
pth = 'MOBY-DICK.txt'
with open(pth) as f:
    words = f.read().lower().split()
ct = Counter(words)
print(ct.most_common(100))
# Just splitting on whitespace will produce strings that have non-alphabetic
# characters at the beginning or end (like 'in,' or "'the")
print(ct['in,'])
# target = '-wordCount.'.join(pth.rsplit('.', 1))
# with open(target, 'w') as f:
#     f.write('\n'.join([f'{k}: {v}' for k, v in ct.most_common()]))
jo-lang commented Jan 25, 2023

To strip the non-alphabetic characters from these strings, a line of regex could help.
There are two examples:
– include only alphabetic characters
– exclude some non-alphabetic characters

from collections import Counter
import re


pth = 'MOBY-DICK.txt'

with open(pth) as f:
    raw = f.read().lower()  # read once; a second f.read() would return an empty string

# choose one of the two lines below
txt = re.sub(r'[^a-zA-Z\s]+', ' ', raw)  # keeps only ASCII letters; this will also remove the æ in vertebræ (8 occurrences)
# txt = re.sub(r'[\.\,\(\)\'\"\;”\“\—\!\?\:]+', ' ', raw)  # removes only the listed characters; this might not find all non-alphabetic characters
words = txt.split()
ct = Counter(words)

print(ct.most_common(100))
print(ct['in,'])  # now 0, because the comma has been stripped
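
The first pattern drops non-ASCII letters such as the æ in vertebræ, and the second depends on listing every punctuation mark by hand. A possible third option, sketched here as an assumption and not part of the original script, is to extract runs of Unicode letters with re.findall. The helper name count_words and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def count_words(text):
    """Hypothetical helper: count lowercase words made of letters only."""
    # [^\W\d_] matches any word character that is not a digit or an underscore,
    # i.e. letters, including non-ASCII ones such as æ.
    words = re.findall(r'[^\W\d_]+', text.lower())
    return Counter(words)


# Made-up sample text, used only to show the behaviour
sample = "In, the vertebræ; 'the' whale. In the sea!"
ct = count_words(sample)
print(ct.most_common(3))
print(ct['vertebræ'])
```

One trade-off of this sketch: apostrophes also act as separators, so a contraction like "don't" would be counted as "don" and "t".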
