Data Mining Algorithms: Association Rules

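Association rules are widely used to discover relationships in real-world data, most often relationships between products bought together in retail stores. This article uses a shopping-mall survey questionnaire to show how association rules are applied in R.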

A quick aside: I did not manage to post an update over the National Day holiday. Sorry!

Before reading this article, I recommend reading the previous one on the principles behind association rules:

A Journey of Beer and Diapers: Association Rules

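The data are 8993 survey questionnaire records. So that everyone can follow along, I use a dataset that ships with the association rule package arules, which means everyone works with the same data.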

First install the packages. For newcomers who may not know how, here are the install commands for the two packages needed:

install.packages('arules')

install.packages('arulesViz')

The dataset 'IncomeESL' bundled with arules has 8993 rows and 14 columns. The columns are:

income

an ordered factor with levels [0,10) < [10,15) < [15,20) < [20,25) < [25,30) < [30,40) < [40,50) < [50,75) < 75+

sex

a factor with levels male female

marital status

a factor with levels married cohabitation divorced widowed single

age

an ordered factor with levels 14-17 < 18-24 < 25-34 < 35-44 < 45-54 < 55-64 < 65+

education

an ordered factor with levels grade <9 < grades 9-11 < high school graduate < college (1-3 years) < college graduate < graduate study

occupation

a factor with levels professional/managerial sales laborer clerical/service homemaker student military retired unemployed

years in bay area

an ordered factor with levels <1 < 1-3 < 4-6 < 7-10 < >10

dual incomes

a factor with levels not married yes no

number in household

an ordered factor with levels 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9+

number of children

an ordered factor with levels 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9+

householder status

a factor with levels own rent live with parents/family

type of home

a factor with levels house condominium apartment mobile Home other

ethnic classification

a factor with levels american indian asian black east indian hispanic pacific islander white other

language in home

a factor with levels english spanish other

For details, see the help page: ?IncomeESL

First load the package and the data it contains:

> dim(IncomeESL)

[1] 6876 14

dat <- as(IncomeESL,'transactions') converts the data into a transactions object, the form that association rule mining works on.
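The loading code itself does not appear above. What follows is a minimal sketch of the steps, assuming (this is an inference, not the author's code) that rows with missing answers were dropped with complete.cases, which would account for dim reporting 6876 rows rather than the 8993 in the raw data:

```r
library(arules)

# Load the raw survey data frame that ships with arules.
data("IncomeESL")

# Assumption: keep only fully answered questionnaires.
IncomeESL <- IncomeESL[complete.cases(IncomeESL), ]
dim(IncomeESL)

# Every column is a factor, so each row becomes one transaction
# whose items are "column=level" pairs.
dat <- as(IncomeESL, "transactions")
summary(dat)
```

Each item label has the form column=level, for example householder status=own, which is the form used when filtering rules later on.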

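Next, generate the rules with the apriori function, setting the minimum support threshold to 0.1 and the minimum confidence threshold to 0.6, then review them with summary: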
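The call is not reproduced in the text. A minimal sketch, assuming the transactions object is built from the complete rows of IncomeESL and that the result is stored under the name rules, as the later plotting and filtering commands suggest:

```r
library(arules)

data("IncomeESL")
dat <- as(IncomeESL[complete.cases(IncomeESL), ], "transactions")

# Mine rules with minimum support 0.1 and minimum confidence 0.6.
rules <- apriori(dat, parameter = list(support = 0.1, confidence = 0.6))

# Rule count, rule length distribution, and summary statistics
# for support, confidence and lift.
summary(rules)
```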

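This produces 2360 rules in total, ranging from rules with a single item to rules with several; the number of items per rule runs from 1 to 6. Plot support against confidence as a scatter plot: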

plot(rules,method = 'scatterplot')

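How should this plot be read? The horizontal axis is support (>= 0.1), the vertical axis is confidence (>= 0.6), and the color bar on the right maps each point's color to its lift (>= 1). The darker the point, the higher the lift.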

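With this many rules, each emphasizing something different, there is usually no need to show them all. It is enough to pull out the rules relevant to the question being studied.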

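For example, to find out what kind of people own their home, filter for rules whose RHS (consequent) is {householder status=own} and that are [significant] (lift greater than 1), then sort by support and take the top 5 for closer inspection: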

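Why take the top five by support? For the same reason support was required to be at least 0.1: we want rules that are backed by a large amount of data. A rule with a qualifying lift but very low support is not good enough, because it rests on very few records.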
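The filtering command is not shown in the text. A sketch of how it could look, where rulesOwn and top5 are illustrative names of my own and the rules are re-mined so the snippet runs on its own:

```r
library(arules)

data("IncomeESL")
dat <- as(IncomeESL[complete.cases(IncomeESL), ], "transactions")
rules <- apriori(dat, parameter = list(support = 0.1, confidence = 0.6))

# Keep rules whose consequent is home ownership and whose lift exceeds 1.
rulesOwn <- subset(rules, subset = rhs %in% "householder status=own" & lift > 1)

# sort() orders by decreasing support by default; head() keeps the first five.
top5 <- head(sort(rulesOwn, by = "support"), n = 5)
inspect(top5)
```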

The top 5 rules are:

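First, a note on notation: lhs is the rule's condition (antecedent) and rhs is its consequent, as in [thunder (lhs) ==> rain (rhs)].

What do the results mean? Among the five rules with the highest support:

married ==> owns home: support 0.26, confidence 0.68, lift 1.8

married, speaks English at home ==> owns home: support 0.25, confidence 0.70, lift 1.85

...

I won't go through the rest one by one. The five rules with home ownership as the consequent include some with a single condition item and some with two, and the two-item conditions come in various combinations.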

Now back to the scatter plot above:

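Notice that lift is highest toward the lower left of the plot. In other words, the rules with the smallest (support, confidence) appear to be the strongest, while toward the upper right, where (support, confidence) is largest, lift is smaller and the rules look weaker. The reason is that support and confidence do not determine lift. High support means the condition and the consequent occur together often; high confidence means the consequent is common among the records that satisfy the condition. But the consequent may be just as common, or even more common, outside the condition. Lift is confidence divided by the overall frequency of the consequent, so when the consequent's overall probability exceeds its conditional probability, lift falls below 1 even though confidence is high.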
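A small arithmetic sketch makes the point concrete. The counts below are made up purely for illustration and do not come from the survey data:

```r
# Made-up counts for illustration only.
n      <- 1000  # all transactions
n_lhs  <- 500   # transactions containing the condition
n_rhs  <- 800   # transactions containing the consequent
n_both <- 350   # transactions containing both

support    <- n_both / n               # P(lhs and rhs)
confidence <- n_both / n_lhs           # P(rhs | lhs)
lift       <- confidence / (n_rhs / n) # P(rhs | lhs) / P(rhs)

c(support = support, confidence = confidence, lift = lift)
```

This rule clears both thresholds used in the article (support 0.1, confidence 0.6), yet its lift is below 1: the consequent appears in 80% of all transactions but in only 70% of those that satisfy the condition.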

To study other rules, change the filter conditions to select the ones you need. I won't list them all here.

Suppose we are interested in income. Let's look for rules with income=[40,50) as the consequent:

> income <- subset(rules,subset=rhs %in% 'income=[40,50)' & lift<1)

> income

set of 0 rules

The filter returns nothing: no rule with income=[40,50) as its consequent turns up.

In fact:

> income <- subset(rules,subset=rhs %in% 'income=[0,10)' & lift>1)

> income

set of 2 rules

Even income=[0,10) has only two rules. But we want rules about high-income people, so what can be done, and why are there so few rules about income?

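The reason is that each individual income bracket has low support: few records fall into any one bracket compared with the other items. So how can rules about income be produced? The listing below uses a minimum support of 0.2. Its consequent label, income=$40 (apparently cut off), suggests that income was also merged into broader brackets first, though that step is not shown.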
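One way such a listing could be produced is sketched here. Everything in it is an assumption rather than the author's code: the two-level recoding follows the example given in the arules documentation for this dataset, the labels "$0-$40,000" and "$40,000+" are taken from that example, and dat2 and rules2 are illustrative names.

```r
library(arules)

data("IncomeESL")
IncomeESL <- IncomeESL[complete.cases(IncomeESL), ]

# Assumption: merge the nine income brackets into two. Levels 7 to 9
# ([40,50), [50,75) and 75+) become the high-income class.
IncomeESL[["income"]] <- factor(
  (as.numeric(IncomeESL[["income"]]) > 6) + 1,
  levels = 1:2, labels = c("$0-$40,000", "$40,000+"))

dat2 <- as(IncomeESL, "transactions")

# Minimum support 0.2, minimum confidence 0.6.
rules2 <- apriori(dat2, parameter = list(support = 0.2, confidence = 0.6))

# Rules that predict the high-income class with lift above 1.
rulesIncome <- subset(rules2, subset = rhs %in% "income=$40,000+" & lift > 1)
inspect(sort(rulesIncome, by = "confidence"))
```

With only two brackets, each one covers a large share of the respondents, so itemsets containing an income item can reach a support threshold that a single narrow bracket cannot.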

> inspect(sort(rulesIncome,by='confidence'))

    lhs                                                                       rhs           support   confidence lift     count
[1] {householder status=own, type of home=house, language in home=english} => {income=$40 } 0.2020070 0.6762415  1.791154 1389
[2] {householder status=own, type of home=house}                           => {income=$40 } 0.2107330 0.6665133  1.765387 1449
[3] {householder status=own, language in home=english}                     => {income=$40 } 0.2325480 0.6555966  1.736472 1599
[4] {householder status=own}                                               => {income=$40 } 0.2436009 0.6482198  1.716934 1675
[5] {marital status=married, language in home=english}                     => {income=$40 } 0.2248400 0.63376    1.676853 1546
[6] {marital status=married}                                               => {income=$40 } 0.2370564 0.6146305  1.627966 1630

This yields the 6 rules we were after. They show that marital status and home ownership are the two main antecedent conditions.
