@jo-lang
Last active November 16, 2018 16:05
Find words that contain one or more of a selected set of characters in a source file and save them to a new file.
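# -*- coding: utf-8 -*-
import re
from random import shuffle

# Characters to look for; a word is kept if it contains at least one of them.
look_for = 'áàãâéêçíóôõúü'
# Punctuation to replace with spaces before matching (a regex character class).
omit = '[,:;.()!"?]'
# Upper limit on the number of words written to the target file.
max_amount = 2000000

# A run of non-whitespace characters containing at least one character from look_for.
pattern = r'[^\s]*' + '[' + re.escape(look_for) + ']' + r'[^\s]*'

found_words = []
with open('source-text.txt', encoding='utf-8') as f:
    for line in f:
        line = re.sub(omit, ' ', line)
        found_words.extend(re.findall(pattern, line))

with open('filtered_words.txt', 'w', encoding='utf-8') as target:
    # Deduplicate, then shuffle so the truncated output is a random selection.
    result = list(set(found_words))
    shuffle(result)
    # Slicing clamps to the list length, so no explicit length check is needed.
    target.write('\n'.join(result[:max_amount]))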
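
To try the matching step without any files, here is a minimal sketch that applies the same `look_for`, `omit` and `pattern` to an in-memory string. The helper name `find_accented_words` and the sample sentence are illustrative assumptions and are not part of the original script.

```python
import re

# Same character set and punctuation class as in the script above.
look_for = 'áàãâéêçíóôõúü'
omit = '[,:;.()!"?]'
pattern = r'[^\s]*[' + re.escape(look_for) + r'][^\s]*'


def find_accented_words(text):
    """Hypothetical helper: return the unique whitespace-delimited words
    in `text` that contain at least one character from look_for."""
    cleaned = re.sub(omit, ' ', text)  # punctuation becomes a word separator
    return set(re.findall(pattern, cleaned))


# Illustrative sample text (an assumption, not taken from the gist).
sample = 'Olá, você está bem? Sim: tudo ótimo (e a canção também).'
print(sorted(find_accented_words(sample)))
```

Words such as `bem` or `tudo` are dropped because they contain none of the listed characters. Since `look_for` holds only lowercase letters, a word whose only accented letter is uppercase (for example `Água`) is not matched either; add the uppercase forms to `look_for` if those words should be kept.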