Last active
January 26, 2023 09:30
Example script using a Counter to count occurrences of all words in a given text file
from collections import Counter
# Change the path to whatever file you want to be read
pth = 'MOBY-DICK.txt'
with open(pth) as f:
    words = f.read().lower().split()
ct = Counter(words)
print(ct.most_common(100))
print(ct['in,'])  # just splitting on spaces will produce strings that have non-alphabetic characters at the beginning or end (like 'in,' or "'the")
# target = '-wordCount.'.join(pth.rsplit('.', 1))
# with open(target, 'w') as f:
#     f.write('\n'.join([f'{k}: {v}' for k,v in ct.most_common()]))
To remove the non-alphabetic characters from the strings, a line of regex could help.
There are two approaches:
– include only alphabetic characters
– exclude some non-alphabetic characters
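
The two approaches could look roughly like this. This is only a sketch: the sample sentence, the variable names and the exact set of excluded characters are my own assumptions, not taken from the script above, and the `[a-z]` class assumes plain ASCII English text that has already been lowercased.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the contents of the file
text = "In the beginning, 'the whale' swam in, and out."

# Approach 1: include only alphabetic characters.
# Every run of the letters a-z becomes one word; everything else acts as a separator.
words_alpha = re.findall(r"[a-z]+", text.lower())

# Approach 2: exclude some non-alphabetic characters.
# The characters in the class below (an assumed, incomplete list) are deleted,
# then the text is split on whitespace as before.
cleaned = re.sub(r"[.,;:!?\"'()]", "", text.lower())
words_excl = cleaned.split()

ct_alpha = Counter(words_alpha)
ct_excl = Counter(words_excl)
print(ct_alpha["in"])
print(ct_excl["in"])
```

The two can give different results on other input: the first would split "don't" into "don" and "t" and would drop digits and accented letters, while the second would turn it into "dont" and keep any character that is not in the excluded set.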