@michaelHL
Created November 18, 2018 08:21
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read in all three of the data files.\n",
"Split the play in `midsummer.txt` up so each scene can be considered individually."
]
},
{
"cell_type": "code",
"execution_count": 199,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SCENES:\n",
"SCENE I. Athens. A room in the Palace of THESEUS.\n",
"\n",
"[Enter THESEUS, HIPPOLYTA, PHILOSTRATE, and Atten\n",
"--------------------------------------------------\n",
"SCENE II. The Same. A Room in a Cottage.\n",
"\n",
"[Enter SNUG, BOTTOM, FLUTE, SNOUT, QUINCE, and STARVELING.\n",
"--------------------------------------------------\n",
"SCENE I. A wood near Athens.\n",
"\n",
"[Enter a FAIRY at One door, and PUCK at another.]\n",
"\n",
"PUCK\n",
"How now, spiri\n",
"--------------------------------------------------\n",
"SCENE II. Another part of the wood.\n",
"\n",
"[Enter TITANIA, with her Train.]\n",
"\n",
"TITANIA\n",
"Come, now a roundel a\n",
"==================================================\n",
"POSITIVE WORDS:\n",
"a+\tabound\tabounds\tabundance\tabundant\taccessable\taccessible\tacclaim\tacclaimed\tacclamation\n",
"==================================================\n",
"NEGATIVE WORDS:\n",
"2-faced\t2-faces\tabnormal\tabolish\tabominable\tabominably\tabominate\tabomination\tabort\taborted\n"
]
}
],
"source": [
"import re\n",
"\n",
"# ------------------------------------\n",
"# 全局变量\n",
"# ------------------------------------\n",
"\n",
"PLAY_FILE = 'midsummer.txt'\n",
"PLAY_FILE_ENCODING = 'UTF-8-Sig'\n",
"NEGATIVE_WORDS_FILE = 'negative-words.txt'\n",
"NEGATIVE_WORDS_FILE_ENCODING = 'cp1252'\n",
"POSITIVE_WORDS_FILE = 'positive-words.txt'\n",
"POSITIVE_WORDS_FILE_ENCODING = 'cp1252'\n",
"\n",
"# ------------------------------------\n",
"# 剧本文字\n",
"# ------------------------------------\n",
"\n",
"play_text = open(PLAY_FILE, encoding=PLAY_FILE_ENCODING).read()\n",
"\n",
"# 每幕(ACT)包含两场(SCENE)\n",
"# 注意后续处理并不需要明确具体 ACT 以及 SCENE,故直接利用正则进行匹配\n",
"# 首先匹配每幕\n",
"acts_pat = re.compile(\n",
" r'(?<=^ACT)(?:.*?\\n)(.*?)(?=ACT|End of Project)', re.S | re.M)\n",
"acts_text = act_pat.findall(play_text)\n",
"\n",
"# 两场戏剧文字匹配模式\n",
"scene1_pat = re.compile(r'(SCENE I.*?)(?=SCENE II)', re.S)\n",
"scene2_pat = re.compile(r'(SCENE II.*?\\Z)', re.S)\n",
"\n",
"# 将每幕中文字归到所有场次中去\n",
"scenes = []\n",
"for act in acts_text:\n",
" scenes.append(scene1_pat.search(act).group())\n",
" scenes.append(scene2_pat.search(act).group())\n",
"\n",
"# ------------------------------------\n",
"# 积极、消极词汇\n",
"# ------------------------------------\n",
"\n",
"pos_text = open(POSITIVE_WORDS_FILE,\n",
" encoding=POSITIVE_WORDS_FILE_ENCODING).read()\n",
"neg_text = open(NEGATIVE_WORDS_FILE,\n",
" encoding=NEGATIVE_WORDS_FILE_ENCODING).read()\n",
"\n",
"\n",
"def parseValidWords(s):\n",
" \"\"\"\n",
" 从文本中析出有效词语\n",
" \"\"\"\n",
"\n",
" words = []\n",
" lines = s.splitlines()\n",
" for line in lines:\n",
" if line and not line.startswith(';'):\n",
" words.append(line)\n",
" return words\n",
"\n",
"\n",
"neg_words = parseValidWords(neg_text)\n",
"pos_words = parseValidWords(pos_text)\n",
"\n",
"# ------------------------------------\n",
"# 粗略预览\n",
"# ------------------------------------\n",
"\n",
"print('SCENES:')\n",
"for i, scene in enumerate(scenes[:4]):\n",
" if i:\n",
" print('-' * 50)\n",
" print(scene[:100])\n",
"print('=' * 50)\n",
"print('POSITIVE WORDS:')\n",
"print('\\t'.join(pos_words[:10]))\n",
"print('=' * 50)\n",
"print('NEGATIVE WORDS:')\n",
"print('\\t'.join(neg_words[:10]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Develop a single measure based on the word occurrences that will describe the positivity/negativity of the scene."
]
},
{
"cell_type": "code",
"execution_count": 200,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(165, 154), (41, 60), (136, 174), (85, 109), (115, 122), (215, 354), (128, 111), (17, 20), (179, 246), (26, 41)]\n"
]
}
],
"source": [
"# 编译积极、消极词汇正则,这里开启忽略大小写\n",
"pos_pat = re.compile('|'.join(map(re.escape, pos_words)), re.I)\n",
"neg_pat = re.compile('|'.join(map(re.escape, neg_words)), re.I)\n",
"\n",
"# 所有场戏的积极消极词语数量统计\n",
"scene_emotions = []\n",
"for scene in scenes:\n",
" pos_cnt = len(pos_pat.findall(scene))\n",
" neg_cnt = len(neg_pat.findall(scene))\n",
" scene_emotions.append((pos_cnt, neg_cnt))\n",
"\n",
"print(scene_emotions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这里指定一个简单的策略:对于一场戏中出现的积极、消极词语出现的次数数组(积极词数,消极词数),计算其均值数,如果:\n",
"- 积极词数高于均值数的一个百分比(比如 5%),那就说这场戏是积极的;\n",
"- 消极词数高于均值数的一个百分比(比如 5%),那就说这场戏是消极的;\n",
"- 其他情况为情感中立的。"
]
},
{
"cell_type": "code",
"execution_count": 201,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(165, 154) ==> Neutral\n",
"(41, 60) ==> Negative\n",
"(136, 174) ==> Negative\n",
"(85, 109) ==> Negative\n",
"(115, 122) ==> Neutral\n",
"(215, 354) ==> Negative\n",
"(128, 111) ==> Positive\n",
"(17, 20) ==> Negative\n",
"(179, 246) ==> Negative\n",
"(26, 41) ==> Negative\n"
]
}
],
"source": [
"def judge_emotion(pairs, threshold=0.05):\n",
" \"\"\"\n",
" 判断给定情感词数元组代表的积极性与消极性\n",
" \"\"\"\n",
"\n",
" emo = ''\n",
" mean = sum(pairs) / 2\n",
" if pairs[0] > pairs[1] and pairs[0] / mean - 1 > threshold:\n",
" emo = 'Positive'\n",
" elif pairs[0] < pairs[1] and pairs[1] / mean - 1 > threshold:\n",
" emo = 'Negative'\n",
" else:\n",
" emo = 'Neutral'\n",
"\n",
" return emo\n",
"\n",
"\n",
"for p in scene_emotions:\n",
" print('{} ==> {}'.format(p, judge_emotion(p)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以看出在 `情感因子` 为 5% 时,这 10 场戏中很少有积极的戏(仅 1 场)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Make a plot of the measure as a y-axis, with scene number as an x-axis."
]
},
{
"cell_type": "code",
"execution_count": 202,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# 参考:\n",
"# https://matplotlib.org/gallery/lines_bars_and_markers/barchart.html\n",
"# https://python-graph-gallery.com/10-barplot-with-number-of-observation\n",
"\n",
"scene_emotions_pos = [x[0] for x in scene_emotions]\n",
"scene_emotions_neg = [x[1] for x in scene_emotions]\n",
"\n",
"ind = np.arange(1, len(scenes)+1)\n",
"width = 0.35\n",
"\n",
"plt.bar(ind - width / 2, scene_emotions_pos, width, label='Positive')\n",
"plt.bar(ind + width / 2, scene_emotions_neg, width, label='Negative')\n",
"plt.xticks(ind)\n",
"plt.grid(False)\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a character starts speaking, their name appears in capitals, on its own line. Which character(s) speak most often?"
]
},
{
"cell_type": "code",
"execution_count": 203,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\"LYSANDER\" appears 50 times.\n",
"\"THESEUS\" appears 48 times.\n",
"\"HERMIA\" appears 48 times.\n",
"\"DEMETRIUS\" appears 47 times.\n",
"\"BOTTOM\" appears 47 times.\n",
"\"QUINCE\" appears 38 times.\n",
"\"HELENA\" appears 36 times.\n",
"\"PUCK\" appears 33 times.\n",
"\"OBERON\" appears 29 times.\n",
"\"TITANIA\" appears 23 times.\n"
]
}
],
"source": [
"from collections import defaultdict\n",
"\n",
"# 全部戏剧正文文本\n",
"play_content = '\\n'.join(scenes)\n",
"\n",
"# 人名匹配规则\n",
"name_pat = re.compile(r'^[A-Z]+$', re.M)\n",
"\n",
"# 匹配所有人名\n",
"characters = name_pat.findall(play_content)\n",
"\n",
"# 人物出现次数列表\n",
"characters_dct = defaultdict(int)\n",
"for c in characters:\n",
" characters_dct[c] += 1\n",
"\n",
"sorted_characters = sorted(characters_dct.items(), key=lambda kv: kv[1], reverse=True)\n",
"for k, v in sorted_characters[:10]:\n",
" print('\"{}\" appears {} times.'.format(k, v))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\"Lysander\" is a Talkaholic!"
]
},
{
"cell_type": "code",
"execution_count": 204,
"metadata": {},
"outputs": [],
"source": [
"from jupyterthemes import jtplot\n",
"jtplot.reset()\n",
"# jtplot.style(theme='oceans16')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
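
The notebook's second task asks for a single measure per scene, while the cells above keep a (positive, negative) pair and then a label. One possible way to collapse each pair into one number is a normalized polarity score, `(pos - neg) / (pos + neg)`. This is only a sketch, not the author's code: the names `polarity_score` and `label` are hypothetical, and the counts are copied from the notebook's printed `scene_emotions` output. Algebraically the score equals `pairs[0] / mean - 1` in `judge_emotion`, so thresholding it at 0.05 should reproduce the same labels.

```python
def polarity_score(pos_cnt, neg_cnt):
    """Hypothetical helper: score in [-1, 1], positive when positive words dominate."""
    total = pos_cnt + neg_cnt
    if total == 0:
        return 0.0  # no sentiment words at all: treat as neutral
    return (pos_cnt - neg_cnt) / total


def label(score, threshold=0.05):
    """Hypothetical helper: map a score to a label with a symmetric threshold."""
    if score > threshold:
        return 'Positive'
    if score < -threshold:
        return 'Negative'
    return 'Neutral'


# Counts copied from the notebook's printed scene_emotions output
scene_emotions = [(165, 154), (41, 60), (136, 174), (85, 109), (115, 122),
                  (215, 354), (128, 111), (17, 20), (179, 246), (26, 41)]

scores = [polarity_score(p, n) for p, n in scene_emotions]
for i, s in enumerate(scores, start=1):
    print('Scene {:2d}: {:+.3f} ==> {}'.format(i, s, label(s)))
```

Because the score is a single signed number per scene, it could be plotted directly on the y-axis against scene number, with the zero line separating positive from negative scenes.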