Created November 29, 2012 18:18
When given a blog RSS feed, count_the_Is.py counts the first-person pronouns in each post and reports them as a fraction of the post's total word count.
#!/usr/bin/env python3
import sys

import feedparser
from pyquery import PyQuery as pq

# Each pronoun is padded with spaces so that only whole words match.
FIRST_PERSON_PRONOUNS = [
    ' I ', ' I\'d ', ' I\'m ',
    ' me ', ' my ', ' mine ', ' myself ', ' Me ', ' My ', ' Mine ', ' Myself ',
]


def occurrences(string, sub):
    """Count the occurrences of a substring in a string, allowing overlaps."""
    count = start = 0
    while True:
        # find() returns -1 when there is no further match, so start becomes 0.
        start = string.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count


def main():
    if len(sys.argv) != 2:
        print("Wrong number of arguments, please supply the RSS feed url as a command line argument")
        sys.exit(1)
    feed = feedparser.parse(sys.argv[1])
    for item in feed['items']:
        if 'content' in item:
            content = item.content[0]['value']
        elif 'summary' in item:
            content = item.summary
        else:
            continue  # this entry has no text to analyze
        content_text = pq(content).text()
        i_count = sum(occurrences(content_text, x) for x in FIRST_PERSON_PRONOUNS)
        word_count = len(content_text.split())
        if word_count == 0:
            continue  # avoid dividing by zero on an empty post
        print("{title}\nTotal 'I' count: {total}\nTotal word count: {word_count}\nAvg: {avg}\n\n".format(
            title=item.title,
            total=i_count,
            word_count=word_count,
            avg=i_count / word_count,
        ))


if __name__ == '__main__':
    main()
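The script matches space-padded substrings such as ' I ' and ' me ', so a pronoun at the very start of the text or directly next to punctuation ("me.", "mine,") is not counted. As a possible alternative, here is a minimal sketch that uses regular-expression word boundaries instead. It is not part of the original script: the names `FIRST_PERSON_RE` and `count_first_person`, and the choice to also count "I'll" and "I've", are assumptions made for illustration.

```python
import re

# Hypothetical alternative to the space-padded substring list.
# "I" stays case-sensitive so that a lowercase "i" (as in "i.e.") is not counted;
# the other pronouns are matched case-insensitively via a scoped (?i:...) group,
# which needs Python 3.6 or later.
FIRST_PERSON_RE = re.compile(
    r"\b(?:I(?:['’](?:d|m|ll|ve))?|(?i:me|my|mine|myself))\b"
)


def count_first_person(text):
    """Return the number of whole-word first-person pronouns in text."""
    return len(FIRST_PERSON_RE.findall(text))


sample = "I think I'm done. Give me my book; it's mine."
print(count_first_person(sample))
```

Because the counting rule differs, totals from this sketch are not directly comparable with the numbers the original script reports.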
Some examples:
Tim Ferriss
$ ./count_the_is.py http://feeds.feedburner.com/TimFerriss
The 4-Hour Chef Launch — Marketing/PR Summary of Week One
Total 'I' count: 13
Total word count: 1497
Avg: 0.00868403473614
Food Photography Made Easy — Simple Tricks and Pro Tips from The 4-Hour Chef
Total 'I' count: 29
Total word count: 1730
Avg: 0.0167630057803
What to Do If Boycotted by 1,000+ Bookstores? Open Your Own Bookstores, Of Course.
Total 'I' count: 9
Total word count: 255
Avg: 0.0352941176471
Meet The New York City Food Marathon: 26.2 Dishes in 26 Locations in 24 Hours
Total 'I' count: 0
Total word count: 105
Avg: 0.0
The 4-Hour Chef is LIVE — Dr. Oz, NYC Cabs, TaskRabbit, London, and More
Total 'I' count: 5
Total word count: 352
Avg: 0.0142045454545
37signals
$ ./count_the_is.py http://feeds.feedburner.com/37signals/beMH
Pattern vision
Total 'I' count: 0
Total word count: 410
Avg: 0.0
Deconstructing the Cityscape
Total 'I' count: 25
Total word count: 402
Avg: 0.0621890547264
VIDEO: Inspiring talk by Adam Savage of Mythbusters…
Total 'I' count: 0
Total word count: 12
Avg: 0.0
Cities with signals
Total 'I' count: 0
Total word count: 128
Avg: 0.0
Tablets are waiting for their Movable Type
Total 'I' count: 3
Total word count: 205
Avg: 0.0146341463415
Publishers shouldn't be app developers
Total 'I' count: 0
Total word count: 228
Avg: 0.0
VIDEO: A really fun and smart TEDx talk by Rodney…
Total 'I' count: 3
Total word count: 77
Avg: 0.038961038961
Seeing the world, on the clock
Total 'I' count: 40
Total word count: 537
Avg: 0.0744878957169
The British are coming!
Total 'I' count: 0
Total word count: 185
Avg: 0.0
Better remote collaboration will make protectionism harder
Total 'I' count: 2
Total word count: 413
Avg: 0.00484261501211
Fred Wilson
$ ./count_the_is.py http://feeds.feedburner.com/avc
Media Metrix Multi Platform
Total 'I' count: 7
Total word count: 453
Avg: 0.0154525386313
The # Discover Tab
Total 'I' count: 10
Total word count: 134
Avg: 0.0746268656716
MBA Mondays: The Revenue Model Hackpad, Take Two
Total 'I' count: 15
Total word count: 232
Avg: 0.0646551724138
MBA Mondays: The Revenue Model Hackpad
Total 'I' count: 26
Total word count: 426
Avg: 0.0610328638498
What Has Changed
Total 'I' count: 5
Total word count: 970
Avg: 0.00515463917526
The Flow and The Balance
Total 'I' count: 12
Total word count: 434
Avg: 0.0276497695853
How Boxee Saved Our Thanksgiving (And How The Jets Ruined It)
Total 'I' count: 17
Total word count: 317
Avg: 0.0536277602524
Giving Thanks
Total 'I' count: 14
Total word count: 175
Avg: 0.08
The Missing Ad Unit
Total 'I' count: 15
Total word count: 347
Avg: 0.0432276657061
CSEdWeek
Total 'I' count: 6
Total word count: 281
Avg: 0.0213523131673
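Each "Avg" above is a per-post ratio, so very short posts can swing it a lot. One way to compare feeds is to pool the counts across all posts in a run. The sketch below parses text in the output format shown above and computes that pooled ratio. The helper name `aggregate` and the `SAMPLE` string are hypothetical; the two sample entries are copied from the Tim Ferriss run.

```python
import re

# These patterns assume the exact labels printed by the script.
COUNT_RE = re.compile(r"^Total 'I' count: (\d+)$", re.MULTILINE)
WORDS_RE = re.compile(r"^Total word count: (\d+)$", re.MULTILINE)


def aggregate(report):
    """Return (pronoun total, word total, pooled ratio) for one run's output."""
    total_i = sum(int(n) for n in COUNT_RE.findall(report))
    total_words = sum(int(n) for n in WORDS_RE.findall(report))
    # Guard against an empty report so the division cannot fail.
    ratio = total_i / total_words if total_words else 0.0
    return total_i, total_words, ratio


SAMPLE = """\
What to Do If Boycotted by 1,000+ Bookstores? Open Your Own Bookstores, Of Course.
Total 'I' count: 9
Total word count: 255
Avg: 0.0352941176471
Meet The New York City Food Marathon: 26.2 Dishes in 26 Locations in 24 Hours
Total 'I' count: 0
Total word count: 105
Avg: 0.0
"""

print(aggregate(SAMPLE))
```

For these two posts the pooled ratio is 9 / 360 = 0.025, which weights each post by its length instead of averaging 0.0353 and 0.0 equally.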