diff --git a/5.MachineLearning/1-codingLanguage.md b/5.MachineLearning/1-codingLanguage.md
new file mode 100644
index 0000000..e541276
--- /dev/null
+++ b/5.MachineLearning/1-codingLanguage.md
@@ -0,0 +1,15 @@
+## Common programming languages for implementing algorithms, with pros and cons
+
+Three languages are commonly used to implement algorithms: Python, Scala, and R.
+
+Python is a general-purpose scripting language that has grown rapidly in recent years. Its active open-source community keeps contributing machine learning and deep learning packages. Because modeling is simple and production engineering is well supported, it is popular with both data scientists and software engineers.
+
+R is a scripting language designed for classical statistics, with a very complete and mature collection of open-source machine learning packages. However, R favors rapid modeling and minimal syntax, so in many cases it is a poor fit for production engineering.
+
+Scala is a statically typed language widely used for big data, and it is the native language of the big data compute engine Spark. It is well suited to engineering big data applications, but its open-source libraries for machine learning and visualization are not yet mature.
+
+Packages and activity levels for the three languages. Python shows relatively high activity in both algorithms and engineering:
+
+![alt_text](images/codingLanguage.jpg)
+
+[Source data](https://activewizards.com/blog/comparison-of-top-data-science-libraries-for-python,-r-and-scala-infographic/)
\ No newline at end of file
diff --git a/5.MachineLearning/2-classifier.md b/5.MachineLearning/2-classifier.md
new file mode 100644
index 0000000..5315a8d
--- /dev/null
+++ b/5.MachineLearning/2-classifier.md
@@ -0,0 +1,263 @@
+## Developing a classification algorithm
+
+### Loading the data
+
+The example uses the digits dataset bundled with sklearn:
+
+```python
+from sklearn.datasets import load_digits
+digits = load_digits()
+print("Data shape (rows, columns):", digits.data.shape)
+print("Label shape:", digits.target.shape)
+```
+
+Data shape (rows, columns): (1797, 64)
+
+Label shape: (1797,)
+
+### Displaying sample images
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+fig, ax = plt.subplots(2,5,figsize=(20,4))
+for index, (image, label) in enumerate(zip(digits.data[0:5], digits.target[0:5])):
+    ax[0,index].imshow(np.reshape(image, (8,8)), cmap='gray')  # top row: the 8x8 image
+    ax[0,index].set_title('%i\n' % label, fontsize = 20)
+    ax[1,index].text(0,0,np.reshape(image, (8,8)))  # bottom row: the raw pixel values
+    ax[1,index].set_xticks([])
+    ax[1,index].set_yticks([])
+```
+![alt_text](images/digits.jpg)
+
+### Creating the training and test sets
+
+```python
+from sklearn.model_selection import train_test_split
+x_train, x_test, y_train, y_test = train_test_split(
+    digits.data, digits.target, test_size=0.1, random_state=0)
+print('Training samples: {0}, test samples: {1}'.format(x_train.shape[0],x_test.shape[0]))
+```
+Training samples: 1617, test samples: 180
+
+### Four-step modeling with scikit-learn
+
+**Step 1:** Import the algorithm package (this is a classification problem, so logistic regression is used as the example):
+
+```python
+from sklearn.linear_model import LogisticRegression
+```
+
+**Step 2:** Create a model instance named logisticRegr:
+
+```python
+logisticRegr = LogisticRegression()
+```
+
+**Step 3:** Train the model. The inputs are the training data and labels x_train, y_train; the output is the fitted model parameters:
+
+```python
+logisticRegr.fit(x_train, y_train)
+```
+![alt_text](images/lr.jpg)
+
+**Step 4:** Predict labels for new samples. The input is the test set; the output is an array of labels:
+
+```python
+predictions = logisticRegr.predict(x_test)
+```
+
+Plot the first 5 predictions:
+
+```python
+fig, ax = plt.subplots(2,5,figsize=(20,4))
+for index, (image, label) in enumerate(zip(x_test[0:5], predictions[0:5])):
+    ax[0,index].imshow(np.reshape(image, (8,8)), cmap='gray')
+    ax[0,index].set_title('%i\n' % label, fontsize = 20)  # title is the predicted label
+    ax[1,index].text(0,0,np.reshape(image, (8,8)))
+    ax[1,index].set_xticks([])
+    ax[1,index].set_yticks([])
+```
+![alt_text](images/digitsResult.jpg)
+
+### Measuring model performance
+
+The simplest metric for a classification model is accuracy = number of correctly predicted images / total number of test images
+
+```python
+score = logisticRegr.score(x_test, y_test)
+print('Logistic regression accuracy: {0:.2%}'.format(score))
+```
+Logistic regression accuracy: 95.56%
+
+Another, more rigorous way to evaluate a classification model is the confusion matrix.
+
+A confusion matrix, also called an error matrix, is a standard format for reporting classification accuracy, laid out as an n-by-n matrix. In machine learning it serves as a visualization tool, especially for supervised learning. Each column represents a predicted class, and the column total is the number of samples predicted as that class. Each row represents the true class, and the row total is the number of samples that actually belong to that class. Each cell therefore counts the samples of the row's true class that were predicted as the column's class. Numbers on the diagonal are correct predictions; every number off the diagonal is a misclassification.
+
+Metrics other than accuracy can be derived from the confusion matrix, such as precision and recall. On imbalanced datasets these two metrics reflect model performance better than accuracy.
+
+Import the metrics package and compute the confusion matrix. The inputs are the true test labels y_test and the predicted labels predictions:
+
+```python
+from sklearn import metrics
+cm = metrics.confusion_matrix(y_test, predictions)
+print(cm)
+```
+![alt_text](images/confusionMatrix.jpg)
+
+It can also be shown as a heatmap (matplotlib colormap choices are listed [here](https://matplotlib.org/examples/color/colormaps_reference.html)):
+
+```python
+plt.figure(figsize=(6,6))
+plt.imshow(cm, cmap='Wistia')
+plt.title('confusion matrix', size = 15)
+plt.colorbar()
+tick_marks = np.arange(10)
+plt.xticks(tick_marks, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], size = 10)
+plt.yticks(tick_marks, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], size = 10)
+plt.grid(False)
+plt.ylabel('Actual label', size = 10)
+plt.xlabel('Predicted label', size = 10)
+width, height = cm.shape
+
+for x in np.arange(width):
+    for y in np.arange(height):
+        plt.annotate(str(cm[x][y]), xy=(y,x),  # row x is the actual label, column y the predicted one
+                     horizontalalignment='center',
+                     verticalalignment='center')
+```
+![alt_text](images/confusionMatrixHeatmap.jpg)
+
+### Other sklearn classification models
+
+Random forest classifier:
+
+```python
+from sklearn.ensemble import RandomForestClassifier
+randomForest = RandomForestClassifier(random_state=4711)
+randomForest.fit(x_train, y_train)
+score_rfc = randomForest.score(x_test, y_test)
+print('Random forest accuracy: {0:.2%}'.format(score_rfc))
+```
+Random forest accuracy: 95.00%
+
+Support vector machine classifier:
+
+```python
+from sklearn.svm import SVC
+svc = SVC(kernel="linear")
+svc.fit(x_train, y_train)
+score_svc = svc.score(x_test, y_test)
+print('Linear support vector machine accuracy: {0:.2%}'.format(score_svc))
+```
+Linear support vector machine accuracy: 97.78%
+
+KNN (k-nearest neighbors) classifier:
+
+```python
+from sklearn.neighbors import KNeighborsClassifier
+kn = KNeighborsClassifier(10)  # vote among the 10 nearest neighbors
+kn.fit(x_train, y_train)
+score_kn = kn.score(x_test, y_test)
+print('Nearest neighbors accuracy: {0:.2%}'.format(score_kn))
+```
+Nearest neighbors accuracy: 96.67%
+
+[Simple neural network classifier](http://scikit-learn.org/stable/modules/neural_networks_supervised.html):
+
+```python
+from sklearn.neural_network import MLPClassifier
+nn = MLPClassifier()
+nn.fit(x_train, y_train)
+score_nn = nn.score(x_test, y_test)
+print('Neural network accuracy: {0:.2%}'.format(score_nn))
+```
+Neural network accuracy: 96.11%
+
+Naive Bayes classifier:
+
+```python
+from sklearn.naive_bayes import GaussianNB
+nb = GaussianNB()
+nb.fit(x_train, y_train)
+score_nb = nb.score(x_test, y_test)
+print('Naive Bayes accuracy: {0:.2%}'.format(score_nb))
+```
+Naive Bayes accuracy: 85.00%
+
+### Why parameter tuning matters
+
+Take the random forest model from the earlier example. Its max_features parameter sets how many features are considered when splitting a node, which trades off variance against bias:
+
+Try every max_features value from 1 to 9:
+
+```python
+list_max_features = np.arange(1,10,1)
+
+list_scores = []
+for nrFeatures in list_max_features:
+    randomForest = RandomForestClassifier(max_features=nrFeatures,random_state=4711)
+    randomForest.fit(x_train, y_train)
+    score_rfc = randomForest.score(x_test, y_test)
+    list_scores.append(score_rfc)
+
+plt.figure(figsize=(5,5))
+plt.plot(list_max_features,list_scores)
+plt.ylabel('random forest accuracy')
+plt.xlabel('max_feature')
+```
+![alt_text](images/paramTuning.jpg)
+
+
+### Why data preprocessing matters
+
+Preprocessing method 1: [standardization](http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py). It rescales every feature to mean 0 and variance 1, so that differences in feature range and spread do not distort the model's results.
+
+```python
+from sklearn.preprocessing import StandardScaler
+scaler = StandardScaler()
+scaler.fit(x_train)  # learn mean and variance from the training set only
+x_train_scaled = scaler.transform(x_train)
+x_test_scaled = scaler.transform(x_test)
+```
+
+Preprocessing method 2: [principal component analysis](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), which transforms the features into linearly uncorrelated components and reduces dimensionality:
+
+```python
+from sklearn.decomposition import PCA
+pca = PCA(.99)  # keep enough components to explain 99% of the variance
+pca.fit(x_train_scaled)
+x_train_decomposed = pca.transform(x_train_scaled)
+x_test_decomposed = pca.transform(x_test_scaled)
+```
+
+Training and predicting again with the same naive Bayes classifier gives a higher accuracy:
+
+```python
+nb_decomposed = GaussianNB()
+nb_decomposed.fit(x_train_decomposed, y_train)
+score_nb_decomposed = nb_decomposed.score(x_test_decomposed, y_test)
+print('Naive Bayes accuracy: {0:.2%}'.format(score_nb_decomposed))
+```
+Naive Bayes accuracy: 95.00%
+
+
+References:
+
+[python tutorial](https://github.com/mGalarnyk/Python_Tutorials/blob/master/Sklearn/Logistic_Regression/LogisticRegression_toy_digits_Codementor.ipynb)
+
+[sklearn classifiers](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/5.MachineLearning/3-preprocessing.md b/5.MachineLearning/3-preprocessing.md
new file mode 100644
index 0000000..4c99b69
--- /dev/null
+++ b/5.MachineLearning/3-preprocessing.md
@@ -0,0 +1,614 @@
+## Introduction to feature engineering
+
+![alt_text](images/preprocessingGraph.jpg)
+
+[Source](https://blog.csdn.net/u010089444/article/details/70053104)
+
+What we usually call feature engineering is the feature processing part of this diagram. The steps below walk through it using sklearn's feature processing library and the iris dataset.
+
+```python
+import numpy as np
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+from sklearn.datasets import load_iris
+
+iris = load_iris()
+irisData = pd.concat([pd.DataFrame(iris.data, columns=iris.feature_names),
+                      pd.DataFrame(iris.target,columns=['species'])],axis=1)
+```
+
+### Feature processing · Feature cleaning · Removing anomalous samples
+
+The most common way to find anomalous samples is visual inspection, for example using box plots, histograms, or scatter plots to show the relationships between dimensions and spot the outliers.
+
+```python
+sns.boxplot(x="species", y="petal length (cm)", data=irisData)
+plt.show()
+```
+![alt_text](images/boxplot.jpg)
+
+```python
+sns.pairplot(irisData, hue="species", vars=irisData.columns.tolist()[0:4])
+```
+![alt_text](images/pairplot.jpg)
+
+You can also assume a Gaussian distribution and treat every data point more than N sigma from the mean as an outlier:
+
+```python
+sepalwid = irisData.loc[irisData.species==0, "sepal width (cm)"]
+sns.distplot(sepalwid)
+```
+![alt_text](images/distplot.jpg)
+
+```python
+threshold_pos = np.mean(sepalwid) + 4 * np.std(sepalwid)  # sigma is the standard deviation, not the variance
+threshold_neg = np.mean(sepalwid) - 4 * np.std(sepalwid)
+outlier = irisData.loc[(irisData.species==0) & (
+    (irisData["sepal width (cm)"]>threshold_pos) |
+    (irisData["sepal width (cm)"]