《山林道》 - "They said, back then, that one day the trees here would turn into a road; by the time we woke up, the road had already been cut through... I only hope that one day this place turns back into trees, the road withdrawn; hurry to mend what was neglected, while we are not yet old..."
The fourth course of Applied Data Science with Python, Applied Text Mining in Python, moves further into processing text and language. Presumably the final course, on social networks, will get a chance to work with data collected from online social platforms?
The most basic part is handling text with regular expressions, which we already used briefly in the first course. A few commonly used methods include:
### Regular expression

```python
import re

# the string to process
text = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
UNSG @NY @NY_Society for Ethical Culture bit.ly/2guVelr'

# split the string on spaces
text_list = text.split(' ')

# find the tokens that match the pattern (callouts such as @UN)
[w for w in text_list if re.search('@[A-Za-z0-9_]+', w)]
```

String handling on a DataFrame, looking for matching patterns:

```python
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])

# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()

# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

# replace weekdays with '???' (regex=True is required on recent pandas versions)
df['text'].str.replace(r'\w+day\b', '???', regex=True)

# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
```
After that, the course moves on to the natural language toolkit, NLTK. The course is English-centric, but in Hong Kong the more practical need is definitely Chinese text processing. Back in my R days I had already heard of the Jieba package, and in the Python world Jieba also seems to be the most widely known, so I will have to find a chance to try it out later.
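As a teaser, and not part of the course: a minimal sketch of what Chinese word segmentation with Jieba might look like. The sample sentence is my own illustration, and note that Jieba's default dictionary targets simplified Chinese, so traditional-Chinese or Cantonese text may need the alternative big dictionary or a custom one.

```python
import jieba

# accurate mode (default): segment a sentence into a list of words
print(jieba.lcut('我來到香港科技大學'))

# full mode: enumerate all possible words found in the sentence
print(jieba.lcut('我來到香港科技大學', cut_all=True))
```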
So, let us first review the workflow for English taught in the course. The text is first split into tokens (Tokenization); then, to handle the different tenses and word forms of English grammar, each word is tagged with its part of speech (Noun/Verb/Adj/...) and reduced to a root, either by stemming or by lemmatization. Only after that can we do things like word-frequency statistics, build prediction models after vectorization, or compare the similarity of two documents.
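A minimal sketch of that pipeline with NLTK; the sample sentence and variable names are my own, not from the course, and the `nltk.download` calls assume the data packages are not yet installed locally:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time downloads of the required NLTK data packages
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

sentence = "The children were singing while the leaves fell."

tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # POS tagging (NN/VB/JJ/...)

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]           # stemming, e.g. 'leaves' -> 'leav'

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # lemmatization, e.g. 'leaves' -> 'leaf'

print(tagged)
print(stems)
print(lemmas)
```

The difference shows up clearly here: the stemmer simply chops off suffixes, while the lemmatizer maps each word to a dictionary base form (treating it as a noun by default).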