Wednesday, June 6, 2018

Coursera - Applied Data Science with Python Study Notes 03 - Machine Learning

This is the third course; the final course of the specialization has now opened on Coursera, so I had better finish this note quickly and catch up. Machine learning, together with artificial intelligence, has become a buzzword in recent years, yet before neural networks (Neural Networks) came in, many of the concepts already existed in statistics and mathematics: regression or classification models such as regression, logistic regression and the KNN classifier; K-means clustering; PCA for dimensionality reduction; decision trees as a guide for decisions. A machine learning course simply places more emphasis on implementation.
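K-means and PCA are only mentioned in passing and do not appear in the snippets below; as a rough sketch of what they look like in scikit-learn (applied to some feature matrix X, with the n_clusters/n_components values chosen purely as examples):
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)   # assign each row of X to one of 3 clusters
print(kmeans.labels_)
pca = PCA(n_components=2).fit(X)   # project X onto its 2 main directions of variance
X_2d = pca.transform(X)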

This time it is mainly about learning Python within the framework of the course. After using the techniques from the <first course> to wrangle the data into a usable format, and the plots from the <second course> to get a visual impression of it, the third course moves on to building models to describe/predict the data. Other examples can be found in earlier posts, for instance these older ones written in R:
[R] ML4B class review - a brief look at the KNN (K-Nearest Neighbors) algorithm
[R] Show Me The Code - an easy introduction to Machine Learning

All the exercises in this course use the scikit-learn library. From the structure of the library you can already pick out the main groups - model selection, preprocessing, models, metrics - corresponding to the different needs when building a model:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import validation_curve

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures

from sklearn.neighbors import KNeighborsClassifier 
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.metrics.scorer import SCORERS
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.datasets import make_classification, make_blobs, make_friedman1, load_breast_cancer

The different machine learning models below all follow roughly the same workflow: prepare the dataset and a classifier, fit the model on the training dataset, and then use the trained classifier/model to predict on the testing dataset. The accuracy also has to be checked.

The Training/Testing split exists so that part of the data is held back after modelling and can be used to validate the accuracy on new data. If the model overfits the training data, its accuracy when predicting new data (such as the testing data) may drop; ideally the accuracy should be similarly high on both the training and the testing data.

# Split the data into Training / Testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

[Optional Start]
# After picking a classifier, cross_val_score() can also split the whole dataset into 'cv' folds and cross-validate the classifier several times to get an average accuracy score.
scores = cross_val_score(clf, X_full, y_full, cv=5, scoring='accuracy')   

# For models that take parameters, try different values and compare the results, e.g. experimenting with the gamma parameter of SVC:
import numpy as np
train_scores, test_scores = validation_curve(SVC(), X, y, param_name='gamma', param_range=np.logspace(-3, 3, 4), cv=3)
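validation_curve() returns one row of scores per gamma value (one column per fold), so averaging over the folds shows how the score moves with the parameter - a small follow-up sketch:
print('mean train score per gamma:', train_scores.mean(axis=1))
print('mean test score per gamma:', test_scores.mean(axis=1))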

# sklearn.preprocessing has tools to rescale the variables beforehand, or to expand them to polynomial terms
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Polynomial powers of the X variables
poly = PolynomialFeatures(degree=2)
poly.fit_transform(X_train)   # [x, y] -> [1, x, y, x^2, xy, y^2]
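The expanded features are normally fed straight into a linear model; a minimal sketch of the combination, using the make_friedman1 toy dataset imported above purely as an example input:
X_F1, y_F1 = make_friedman1(n_samples=100, n_features=7, random_state=0)
X_F1_poly = PolynomialFeatures(degree=2).fit_transform(X_F1)   # expand to degree-2 terms
X_trainp, X_testp, y_trainp, y_testp = train_test_split(X_F1_poly, y_F1, random_state=0)
polyreg = LinearRegression().fit(X_trainp, y_trainp)
print('poly + linear regression R-squared (test): {:.3f}'.format(polyreg.score(X_testp, y_testp)))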
[Optional End]


Starting with the basic K Nearest Neighbour classification as an example: the K data points closest to the target vote on which class the target should belong to:
knn = KNeighborsClassifier(n_neighbors = 5)   # set up the KNN classifier
knn.fit(X_train, y_train)   # fit the model with the training data (or the scaled X_train_scaled)
print('Accuracy on Training Set={:.2f}, Testing Set={:.2f}'.format( knn.score(X_train, y_train) , knn.score(X_test, y_test) ))   # accuracy on both sets
knn.predict(X_test)   # predict new observations

A Dummy Classifier is a classifier with no real training behind it - for example always predicting "Yes", always predicting the majority class, or predicting according to the class proportions - and serves as a baseline for comparison.
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
# dummy_classprop = DummyClassifier(strategy='stratified').fit(X_train, y_train)
dummy_majority.score(X_test, y_test)

For cases where y is continuous, the first thing statistics would think of is usually Linear Regression: find an equation such that the sum of squared errors of the predictions is minimised - Minimize: $ ||Y-X\beta||_2^2 $. The variables can be polynomial transformations, because "Linear" refers to the equation being linear in the model's parameters (these parameters are usually written as $\beta$ - Beta). For a dataset to be suitable for Linear Regression there are also some assumptions on the data itself.
linreg = LinearRegression().fit(X_train, y_train)
print('linear model coeff (w): {}'.format(linreg.coef_))
print('linear model intercept (b): {:.3f}'.format(linreg.intercept_))
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))

import matplotlib.pyplot as plt
plt.figure(figsize=(5,4))
plt.scatter(X_R1, y_R1, marker='o', s=50, alpha=0.8)   # X_R1, y_R1: a one-feature regression dataset from the course
plt.plot(X_R1, linreg.coef_ * X_R1 + linreg.intercept_, 'r-')   # fitted regression line

When several of the variables may be highly correlated with each other, the Beta estimates become unreliable: repeat the sampling many many times and the estimated Betas swing wildly (their variance is inflated, even though ordinary least squares itself remains unbiased). One approach is to look at the correlation matrix / scatter plots of the variables and manually drop the correlated ones, so that the remaining variables are independent. Another approach is to penalise large beta estimators at the same time as minimising the error, keeping extreme beta estimators in check, which gives the Ridge and Lasso variants:
Ridge Regression minimises: $\frac{1}{n} ||Y-X\beta||_2^2 + \lambda \sum_j \beta_j^2$
while Lasso Regression minimises: $\frac{1}{n} ||Y-X\beta||_2^2 + \lambda \sum_j |\beta_j|$
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)
# linlasso = Lasso(alpha=2.0, max_iter = 10000).fit(X_train_scaled, y_train)

print('Crime dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))
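Ridge shrinks coefficients but rarely pushes them exactly to zero, so the "number of non-zero features" count is more interesting with Lasso, which does zero coefficients out and thereby performs a kind of feature selection; a quick sketch using the commented-out linlasso above:
linlasso = Lasso(alpha=2.0, max_iter=10000).fit(X_train_scaled, y_train)
print('lasso: number of non-zero features: {}'.format(np.sum(linlasso.coef_ != 0)))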

Logistic Regression changes the assumption on the distribution of y from Normal to Binomial, connected to the linear predictor through the logit function. It can be used to classify, and to predict the probability of yes/no.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1).fit(X_train, y_train)

h = 6; w = 8
print('Item with var1 {} and var2 {} is predicted to be: {}'.format(h,w, ['False', 'True'][clf.predict([[h,w]])[0]]))
print('Accuracy on training set: {:.2f}'.format( clf.score(X_train, y_train) ))
print('Accuracy on test set: {:.2f}'.format( clf.score(X_test, y_test) ))

y_scores_lr = clf.decision_function(X_test)   # clf is the fitted LogisticRegression above; signed distance from the decision boundary
y_proba_lr = clf.predict_proba(X_test)   # predicted probability for each class
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))

y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))
y_score_list
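These scores are exactly what the precision_recall_curve and roc_curve imported at the top take as input; a minimal sketch of both (assuming a binary y_test):
precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
fpr, tpr, _ = roc_curve(y_test, y_scores_lr)
print('AUC: {:.3f}'.format(auc(fpr, tpr)))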

From Support Vector Machines onwards, these are methods seen much more in the big data and machine learning fields.
from sklearn.svm import LinearSVC
clf = LinearSVC(C=5, random_state = 0).fit(X_train, y_train)

from sklearn.svm import SVC
clf = SVC(kernel = 'linear', C=1.0).fit(X_train, y_train)
clf = SVC(kernel = 'poly', degree = 3).fit(X_train, y_train)
clf = SVC(kernel = 'rbf', gamma=1.0).fit(X_train, y_train)   

# Decision tree classifier, related to the Random Forest later on
from sklearn.tree import DecisionTreeClassifier
clf= DecisionTreeClassifier(max_depth = 3).fit(X_train, y_train)
plot_decision_tree(clf, iris.feature_names, iris.target_names)   # plotting; plot_decision_tree is a helper from the course materials
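With plain scikit-learn a similar chart can be produced roughly like this, exporting the tree to Graphviz dot format (which then needs the graphviz tool to render):
from sklearn.tree import export_graphviz
export_graphviz(clf, out_file='tree.dot', feature_names=iris.feature_names, class_names=iris.target_names, filled=True)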

# And a few more models:
nbclf = GaussianNB().fit(X_train, y_train)
rfclf = RandomForestClassifier().fit(X_train, y_train)
gbclf = GradientBoostingClassifier().fit(X_train, y_train)

The Multi-layer Perceptron (MLP), a neural network model, is closer to what is usually brought up when explaining artificial intelligence. Take human vision as an analogy: each receptor on the retina is triggered by light into a pixel; these connect to all the neurons in layer 2, which fire on recognising short line segments in different directions; those connect to all the neurons in layer 3, which fire on recognising small shapes; layer by layer this builds up to shapes, objects, scenes and so on. One simple description of deep learning I have heard is just a Neural Network with many, many layers.
Although today's "artificial intelligence" Neural Networks use the activation of neurons and the complexity of the brain's network as their metaphor, I suspect the brain is more complex precisely because it may not divide its work into neat layers. Will we one day truly imitate that kind of complicated neuron network? With so many species, are the underlying mechanisms even the same?
this_activation = 'logistic'   # ['logistic', 'tanh', 'relu']
nnclf = MLPClassifier(hidden_layer_sizes = [10, 10], solver='lbfgs', random_state = 0, activation = this_activation, alpha = 0.1).fit(X_train, y_train)
plot_class_regions_for_classifier(nnclf, X_train, y_train, X_test, y_test, 'Title')


# Finally, the metrics for measuring Performance:
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN)  Also known as sensitivity, or True Positive Rate
# F1 = 2 * Precision * Recall / (Precision + Recall)
from sklearn.metrics import confusion_matrix
y_dummy_predicted = dummy_majority.predict(X_test)
confusion = confusion_matrix(y_test, y_dummy_predicted)
print('Most frequent class (dummy classifier)\n', confusion)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
tree_predicted = clf.predict(X_test)   # e.g. test-set predictions from the decision tree classifier fitted above
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, tree_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, tree_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, tree_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, tree_predicted)))

from sklearn.metrics import classification_report
print(classification_report(y_test, tree_predicted, target_names=['not 1', '1']))

print("Mean squared error: {:.2f}".format(mean_squared_rror(y_test, y_predict)))
print("r2_score: {:.2f}".format(r2_score(y_test, y_predict)))
print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))

from sklearn.metrics.scorer import SCORERS
print(sorted(list(SCORERS.keys())))
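Any name in that list can be passed as the scoring argument of cross_val_score (or validation_curve); for example, re-using the earlier cross-validation call (roc_auc assumes a binary target):
scores_auc = cross_val_score(clf, X_full, y_full, cv=5, scoring='roc_auc')
print('Cross-validated AUC: {:.3f}'.format(scores_auc.mean()))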



