Reposted from the CSDN blog of Daeyeon7
Description
Besides the traditional decision tree (rpart) algorithm, the conditional inference tree (ctree) is another commonly used tree-based classification algorithm. The difference between the two lies in how the split variable is selected: a conditional inference tree chooses it based on significance tests (permutation tests), rather than on an information-maximization measure (rpart uses the Gini index).
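As a minimal sketch of this contrast (assuming the same churn training set `trainset` used throughout this recipe), the two algorithms are invoked almost identically; only the internal split-selection criterion differs:

```r
library(rpart)   # CART-style tree: splits chosen by Gini impurity
library(party)   # conditional inference tree: splits chosen by significance tests

rpart.model = rpart(churn ~ ., data = trainset)  # Gini-based splits
ctree.model = ctree(churn ~ ., data = trainset)  # p-value-based splits; growing
                                                 # stops when no split is significant
```

One practical consequence is that ctree needs no separate pruning step: growth stops once no candidate split passes the significance test.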
How to do it
Call the ctree function from the party package to build the classifier:
```r
library(zoo)
library(party)
ctree.model = ctree(churn ~ ., data = trainset)
ctree.model
```

```
         Conditional inference tree with 18 terminal nodes

Response:  churn
Inputs:  international_plan, voice_mail_plan, number_vmail_messages, total_day_minutes, total_day_calls, total_day_charge, total_eve_minutes, total_eve_calls, total_eve_charge, total_night_minutes, total_night_calls, total_night_charge, total_intl_minutes, total_intl_calls, total_intl_charge, number_customer_service_calls
Number of observations:  2315

1) international_plan == {no}; criterion = 1, statistic = 173.582
  2) number_customer_service_calls <= 3; criterion = 1, statistic = 133.882
    3) total_day_minutes <= 259.3; criterion = 1, statistic = 232.371
      4) total_eve_minutes <= 258.7; criterion = 1, statistic = 39.065
        5)* weights = 1544
      4) total_eve_minutes > 258.7
        6) total_day_minutes <= 222.9; criterion = 1, statistic = 47.453
          7)* weights = 209
        6) total_day_minutes > 222.9
          8) voice_mail_plan == {yes}; criterion = 1, statistic = 20
            9)* weights = 8
          8) voice_mail_plan == {no}
            10)* weights = 28
    3) total_day_minutes > 259.3
      11) voice_mail_plan == {no}; criterion = 1, statistic = 46.262
        12) total_eve_charge <= 14.09; criterion = 1, statistic = 37.877
          13)* weights = 21
        12) total_eve_charge > 14.09
          14) total_night_minutes <= 178.3; criterion = 1, statistic = 19.789
            15)* weights = 23
          14) total_night_minutes > 178.3
            16)* weights = 60
      11) voice_mail_plan == {yes}
        17)* weights = 34
  2) number_customer_service_calls > 3
    18) total_day_minutes <= 159.4; criterion = 1, statistic = 34.903
      19) total_eve_minutes <= 233.2; criterion = 0.991, statistic = 11.885
        20) voice_mail_plan == {no}; criterion = 0.99, statistic = 11.683
          21)* weights = 40
        20) voice_mail_plan == {yes}
          22)* weights = 7
      19) total_eve_minutes > 233.2
        23)* weights = 16
    18) total_day_minutes > 159.4
      24)* weights = 96
1) international_plan == {yes}
  25) total_intl_charge <= 3.51; criterion = 1, statistic = 35.28
    26) total_intl_calls <= 2; criterion = 1, statistic = 28.013
      27)* weights = 40
    26) total_intl_calls > 2
      28) number_customer_service_calls <= 3; criterion = 0.957, statistic = 8.954
        29) total_day_minutes <= 271.5; criterion = 1, statistic = 25.328
          30) total_eve_charge <= 25.82; criterion = 0.987, statistic = 11.167
            31)* weights = 116
          30) total_eve_charge > 25.82
            32)* weights = 7
        29) total_day_minutes > 271.5
          33)* weights = 11
      28) number_customer_service_calls > 3
        34)* weights = 14
  25) total_intl_charge > 3.51
    35)* weights = 41
```
Visualizing the conditional inference tree
```r
plot(ctree.model)
```
By reducing the number of features and replotting the classification tree, we obtain a simplified conditional inference tree:
```r
daycharge.model = ctree(churn ~ total_day_charge, data = trainset)
plot(daycharge.model)
```
The inference tree obtained with total_day_charge as the only split variable
The output plot shows, for each inner node, its split variable and the associated p-value; the split conditions are displayed on the left and right branches, and each terminal node shows the number of samples n plus a bar indicating the probability of each class. From the plot we can see that when total_day_charge exceeds 48.18, the light-gray area of node 9 is larger than the dark-gray area, which means that customers with a daily charge above 48.18 have a much higher probability of churning (class label yes).
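To relate individual observations to the plotted nodes, a quick sketch (the `where` function is part of the party package) lists which terminal node each training case falls into:

```r
# Terminal-node IDs for the training cases of the single-feature tree;
# the table shows how many cases land in each leaf, including node 9.
head(where(daycharge.model))
table(where(daycharge.model))
```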
#### Measuring the prediction performance of the conditional inference tree
```r
ctree.predict = predict(ctree.model, testset)
table(ctree.predict, testset$churn)
```

```
ctree.predict yes  no
          yes  99  15
          no   42 862
```
This can also be done with the confusionMatrix function from the caret package:
```r
library(lattice)
library(ggplot2)
library(caret)
confusionMatrix(table(ctree.predict, testset$churn))
```

```
Confusion Matrix and Statistics

ctree.predict yes  no
          yes  99  15
          no   42 862

               Accuracy : 0.944
                 95% CI : (0.9281, 0.9573)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7449
 Mcnemar's Test P-Value : 0.0005736

            Sensitivity : 0.70213
            Specificity : 0.98290
         Pos Pred Value : 0.86842
         Neg Pred Value : 0.95354
             Prevalence : 0.13851
         Detection Rate : 0.09725
   Detection Prevalence : 0.11198
      Balanced Accuracy : 0.84251

       'Positive' Class : yes
```
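To see where caret's headline numbers come from, here is a minimal sketch that recomputes the main metrics by hand from the confusion table above:

```r
# Rows are predictions, columns are actual labels, matching caret's table.
tab = matrix(c(99, 42, 15, 862), nrow = 2,
             dimnames = list(predicted = c("yes", "no"),
                             actual    = c("yes", "no")))

accuracy    = sum(diag(tab)) / sum(tab)              # (99 + 862) / 1018 = 0.944
sensitivity = tab["yes", "yes"] / sum(tab[, "yes"])  # 99 / 141  = 0.70213
specificity = tab["no", "no"]   / sum(tab[, "no"])   # 862 / 877 = 0.98290
```

Note that because "yes" (churn) is the positive class and the data are imbalanced (prevalence 0.13851), the high overall accuracy hides a noticeably lower sensitivity.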
Call the treeresponse() function to output the class probabilities:
```r
tr = treeresponse(ctree.model, newdata = testset[1:5,])
tr
```

```
[[1]]
[1] 0.03497409 0.96502591

[[2]]
[1] 0.02586207 0.97413793

[[3]]
[1] 0.02586207 0.97413793

[[4]]
[1] 0.02586207 0.97413793

[[5]]
[1] 0.03497409 0.96502591
```
This section first used the predict function to label the test dataset (assign class labels), then called the table function to build the classification table, and finally evaluated prediction performance with the confusionMatrix function built into the caret package. Besides predict, the treeresponse function can also be used to estimate class probabilities; the class with the higher probability is usually chosen as the label for each sample.
The example above used the first five records of the test dataset testset to obtain estimated class probabilities; calling treeresponse returns the five probability vectors, from which the class label of each sample can be determined.
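That last step can be sketched as follows (assuming the probability vectors in `tr` are ordered by the factor levels of `trainset$churn`, as treeresponse reports them):

```r
# Stack the five probability vectors into a 5 x 2 matrix, then pick the
# column (class) with the higher probability for each row.
probs  = do.call(rbind, tr)
labels = levels(trainset$churn)[max.col(probs)]
labels
```

For the five records shown above, the second probability (~0.97) dominates in every row, so each would be labeled with the second factor level.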