《山林道》 - "They once said that one day the trees here would become a road; by the time we woke up, the road had already been forced through.... I only hope that one day this place turns back into trees and the road is taken back; what was neglected, hurry and mend it while we are not yet old..."
Applied Text Mining in Python, the fourth course of Applied Data Science with Python, goes further into processing text and language. I imagine the final course, on social networks, will probably get to work with data collected from social networking platforms?
The most basic part is handling text with regular expressions, already touched on briefly in the first course. A few commonly used methods:
### Regular expression
import re
text = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
UNSG @NY @NY_Society for Ethical Culture bit.ly/2guVelr'  # the raw string
text_list = text.split(' ')  # split the string on spaces
[w for w in text_list if re.search('@[A-Za-z0-9_]+', w)]  # keep only tokens containing an @mention
# String processing on a DataFrame
import pandas as pd
time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])
# look for patterns that match
df['text'].str.split().str.len()                         # find the number of tokens for each string in df['text']
df['text'].str.findall(r'(\d?\d):(\d\d)')                # group and find the hours and minutes
df['text'].str.replace(r'\w+day\b', '???')               # replace weekdays with '???'
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')  # extract the entire time, the hours, the minutes, and the period
Then the course moves on to the natural language toolkit, NLTK. The course is English-centric, but in Hong Kong the more practical need is definitely Chinese processing. Back in my R days I had heard of the Jieba package, and in the Python world Jieba also seems to be the best known, so I will have to find a chance to try Jieba later as well.
So, first a review of the workflow for English as taught in the course. Start by tokenizing the text (Tokenization); then, to handle the different tenses and word forms of English grammar, tag parts of speech (Noun/Verb/Adj/...) and reduce each word to its stem (Stemming) or its dictionary form (Lemmatization). Only after that can we do word-frequency counts, build prediction models after vectorization, or compare the similarity of two documents.
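As a quick aside on the stemming-versus-lemmatization distinction just mentioned, here is a minimal standalone sketch (my own example words, not from the course notebook; assumes the WordNet data has been downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer  # requires nltk.download('wordnet') for the lemmatizer

words = ['studies', 'studying', 'better', 'ran']
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(w) for w in words])                   # crude suffix stripping: 'studies' and 'studying' both collapse to 'studi'
print([lemmatizer.lemmatize(w, pos='v') for w in words])  # dictionary- and POS-based: 'studies'/'studying' become 'study', 'ran' becomes 'run'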
### Reading the file
import nltk

with open('moby.txt', 'r') as f:
    text_raw = f.read()
text_rawlower = text_raw.lower()
# NLTK word_tokenize tokenization: "This is Python." => "This", "is", "Python", "."
text_tokens = nltk.word_tokenize(text_rawlower)
text1 = nltk.Text(text_tokens)  # Text object
# Lemmatization - map the different forms of the same word to a single word. A similar approach is stemming, which reduces a word to its stem.
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
def get_wordnet_pos(treebank_tag):
    # map Penn Treebank POS tags to WordNet POS tags
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

text_lemmatized = []
lemmatizer = WordNetLemmatizer()
for word, pos in pos_tag(text1):
    wordnet_pos = get_wordnet_pos(pos) or wordnet.NOUN  # default to noun when the tag has no WordNet match
    text_lemmatized.append(lemmatizer.lemmatize(word, pos=wordnet_pos))
# word occurrence counts
from nltk import FreqDist
text_dist = FreqDist(text_lemmatized)
sortedToken = sorted(list(set(text_lemmatized)), key=lambda token: text_dist[token], reverse=True)
[(token, text_dist[token]) for token in sortedToken if len(token) >= 5][:50]  # the 50 most frequent words with length >= 5
(text_dist['abc']+text_dist['Abc'])/sum(text_dist.values())*100  # percentage of occurrences for a given word
### Prediction model - predicting y_test values
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np

vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)  # transformer fitted on the training text
X_train_vectorized = vect.transform(X_train)  # sparse matrix of type '<class 'numpy.int64'>'

vect = TfidfVectorizer(min_df=3).fit(X_train)  # TfidfVectorizer - overwrites the CountVectorizer above
# len(vect.get_feature_names())
# vect.get_feature_names()[::200]
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))
feature_names = np.array(vect.get_feature_names())
sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()
sorted_coef_index = model.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
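A small side note of my own, not from the course code: computing AUC from hard 0/1 predictions understates the score, so it is usually fairer to pass the classifier's continuous scores to roc_auc_score (assuming binary labels in y_test):

scores = model.decision_function(vect.transform(X_test))  # continuous decision scores instead of hard class labels
print('AUC (decision scores): ', roc_auc_score(y_test, scores))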
Jieba Chinese Word Segmentation 1 - Word Frequency Counts
Jieba is the real point here. Take the 2017 Policy Address as an example: first copy the full text into the text file 'self_policy2017.txt', then add some extra vocabulary to 'selfdict_policy.txt' to suit the content. The change in word frequency between the two years is visible: 「一帶一路」 (Belt and Road), for instance, was a hot term in 2016 and, while still a focus in 2017, was mentioned fewer times; and presumably reflecting the changes in young people's social participation in recent years, 「年青人」 (young people) came up repeatedly in 2017. The code that produces the counts is as follows:
import jieba
# read the file
with open('self_policy2017.txt', 'r') as f:
    text_raw = f.read()
text_rawlower = text_raw.lower()
from nltk import FreqDist
from collections import Counter
# Chinese word segmentation with Jieba
jieba.set_dictionary('dict_jiebaZhTW.txt')   # use the Traditional Chinese (Taiwan) dictionary
jieba.load_userdict("selfdict_policy.txt")   # extra user-defined vocabulary
text_tokens = jieba.lcut(text_raw)

# word frequency counts
text_dist = FreqDist(text_tokens)
c = Counter()
for w in text_tokens:
    if len(w) >= 3 and w != '\r\n':
        c[w] += 1

print('2017年施政報告 常用詞頻度統計結果')
maxNum = c.most_common(1)[0][1]
for (k, v) in c.most_common(50):
    print('%s%s %s %d' % (' '*(5-len(k)), k, '*'*int(v*20/maxNum), v))
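The 2016 numbers mentioned above would come from running the same counting code on the 2016 address; a rough sketch of the side-by-side comparison, assuming the 2016 text is saved in a similarly named file (the filename 'self_policy2016.txt' is my own guess):

with open('self_policy2016.txt', 'r') as f:   # hypothetical filename for the 2016 Policy Address
    tokens_2016 = jieba.lcut(f.read())
c2016 = Counter(w for w in tokens_2016 if len(w) >= 3 and w != '\r\n')
for term in ['一帶一路', '年青人']:
    print(term, '2016:', c2016[term], '2017:', c[term])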
Jieba Chinese Word Segmentation 2 - Word Cloud & Text Similarity
When searching for ways to handle Chinese on my own, it is clear Jieba has become much more common than it was a few years ago (there is, for example, a write-up using 五月天 lyrics), so I followed suit and used Jieba on lyrics data, with songs by 謝安琪 and The Pancakes: Binary from 2008, Kontinue from 2013, plus two singles, 山林道 and 拾回; and from The Pancakes, 1,2,3,4,5,6,cheese! from 2007 and 腦殘遊記 from 2017. Everything was copied from Mojim.com and saved into an Excel file in the format shown.
[Image: the Excel lyrics data, ready to be loaded into a pandas DataFrame]
### Preparing all the packages
import nltk
import jieba
import jieba.analyse
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from gensim import corpora,models,similarities
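One gap in the copied code: text_pd, which is used below, is never created in the post. Presumably the lyrics spreadsheet is read the same way as the test file further down; a sketch assuming a hypothetical filename 'self_lyric.xls' with Album / Singer / Title / Lyric columns:

text_pd = pd.read_excel('self_lyric.xls', sheet_name='Sheet1', header=0, na_values=['...'])  # hypothetical filename
# assumed columns: 'Album', 'Singer', 'Title', 'Lyric'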
The first goal is to look at the frequently used words, whether across all the lyrics or just part of the DataFrame, and then learn to build a word cloud with wordcloud.
[Word cloud image] Album 《腦殘遊記》
[Word cloud image] Album 《1,2,3,4,5,6, cheese!》
[Word cloud image] Album 《Kontinue》
[Word cloud image] Album 《Binary》
# use the Traditional Chinese (Taiwan) dictionary
jieba.set_dictionary('dict_jiebaZhTW.txt')
jieba.load_userdict("selfdict_lyric.txt")
word1 = ""
for line in text_pd[text_pd['Album']=='Binary']['Lyric']:
    tags = jieba.cut(line)
    word1 = word1 + " ".join(tags)
wc = WordCloud(background_color="white",              # background colour
               font_path="NotoSansCJKtc-Regular.otf", # a font that supports Chinese characters
               max_words=2000,                        # maximum number of words shown in the cloud
               stopwords=["Oh","沒有","不會","哪個","之後","怎麼","都不","and","to","the","of","is","are","have","we","our","they","them","you","your"])  # stop words
wc.generate(word1)
wc.to_file("show_word1.png")
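matplotlib is imported in the package list above but never used in the copied snippet; a quick way to preview the cloud inline, besides saving the PNG:

plt.figure(figsize=(10, 6))
plt.imshow(wc, interpolation='bilinear')  # render the generated word cloud
plt.axis('off')
plt.show()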
Another song (嘉琳's 看風景) was prepared as a test case for comparing lyric similarity.
The closest match in the sample is 「家明」, but overall nothing is particularly similar; feeding in the same song would give similarity ≈ 1.
# use the original lyrics as the reference corpus
word_list_all = []
for doc in text_pd['Lyric']:
    word_list = [word for word in jieba.cut(doc)]
    word_list_all.append(word_list)
#word_list_all
dictionary = corpora.Dictionary(word_list_all)
corpus = [dictionary.doc2bow(doc) for doc in word_list_all]
tfidf = models.TfidfModel(corpus)

# the test lyrics - "看風景"
test_pd = pd.read_excel('self_lyric_test.xls', sheet_name='Sheet1', skiprows=0, skip_footer=0,
                        header=0, names=None, na_values=['...'])
doc_test = jieba.cut(test_pd['Lyric'][1])
doc_test_vec = dictionary.doc2bow(doc_test)
tfidf[doc_test_vec]
# compute the similarity
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary.keys()))
sim = index[tfidf[doc_test_vec]]
df_sim = pd.DataFrame({'Title' : text_pd['Title'],
                       'Singer' : text_pd['Singer'],
                       'sim' : sim}).sort_values(by=['sim'], ascending=False)
print("相似度 of: ",test_pd['Title'][1])
df_sim
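As a sanity check on the similarity-≈-1 claim above, a song that is already in the corpus can be pushed through the same pipeline (row 0 of text_pd is an arbitrary choice):

self_vec = dictionary.doc2bow(jieba.lcut(text_pd['Lyric'][0]))  # a song already in the corpus
print(index[tfidf[self_vec]].max())  # the best match is the song itself, so this should be close to 1.0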