Random forest is a highly versatile machine learning method with numerous applications ranging from marketing to healthcare and insurance. It can be used to model the impact of marketing on customer acquisition, retention, and churn or to predict disease risk and susceptibility in patients.

Random forest is capable of both regression and classification. It can handle a large number of features, and it's helpful for estimating which of your variables are important in the underlying data being modeled.

This is a post about random forests using Python.

What is a Random Forest?

Random forest is a solid choice for nearly any prediction problem (even non-linear ones). It's a relatively new machine learning strategy (it came out of Bell Labs in the 90s) and it can be used for just about anything. It belongs to a larger class of machine learning algorithms called ensemble methods.

Ensemble Learning

Ensemble learning involves the combination of several models to solve a single prediction problem. It works by generating multiple classifiers/models which learn and make predictions independently. Those predictions are then combined into a single (mega) prediction that should be as good as or better than the prediction made by any one classifier.

Random forest is a brand of ensemble learning, as it relies on an ensemble of decision trees. More on ensemble learning in Python here: Scikit-Learn docs.
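To make the "as good as or better than any one classifier" idea concrete, here's a minimal simulation. It is a sketch under a strong assumption that real ensembles only approximate: every model is right 60% of the time and the models make their mistakes independently of each other. The numbers (101 models, 10,000 cases, 60% accuracy) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 101 models, each correct on 60% of cases,
# with errors that are independent across models (an idealization).
n_models, n_cases, p_correct = 101, 10_000, 0.6

# correct[i, j] is True when model i gets case j right.
correct = rng.random((n_models, n_cases)) < p_correct

# Accuracy of one model on its own.
single_acc = correct[0].mean()

# The majority vote is right whenever more than half the models are right.
majority_acc = (correct.sum(axis=0) > n_models / 2).mean()

print(f"single model accuracy:  {single_acc:.3f}")
print(f"majority vote accuracy: {majority_acc:.3f}")
```

In practice the models in an ensemble are correlated, so the gain is smaller than in this idealized setup, but the direction of the effect is the same.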

Randomized Decision Trees

So we know that random forest is an aggregation of other models, but what types of models is it aggregating? As you might have guessed from its name, random forest aggregates Classification (or Regression) Trees. A decision tree is composed of a series of decisions that can be used to classify an observation in a dataset.
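If you want to see what that "series of decisions" looks like, here's a small sketch using scikit-learn's DecisionTreeClassifier on the iris dataset. The dataset and the max_depth=2 setting are my choices for illustration, picked to keep the printed tree short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as nested if/else rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the output is one decision; following the branches from top to bottom leads to a predicted class.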

Random Forest

The algorithm to induce a random forest will create a bunch of random decision trees automatically. Since the trees are generated at random, most won't be all that meaningful to learning your classification/regression problem (maybe 99.9% of trees).

If an observation has a length of 45, blue eyes, and 2 legs, it's going to be classified as red.
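Here's a rough sketch of how a bunch of random trees could be grown by hand. The two sources of randomness used here (a bootstrap sample of the rows for each tree, and a random subset of features at each split via max_features="sqrt") are the usual ones, but the grow_forest helper, the iris data, and the choice of 25 trees are assumptions of mine, not something from a particular library's internals.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)


def grow_forest(X, y, n_trees=25):
    """Hypothetical helper: grow n_trees randomized decision trees."""
    trees = []
    for i in range(n_trees):
        # Bootstrap: sample rows with replacement, same size as the data.
        idx = rng.integers(0, len(X), size=len(X))
        # max_features="sqrt" makes each split consider a random feature subset.
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


forest = grow_forest(X, y)
print(len(forest), "trees grown")
```

Every tree sees a slightly different version of the data, which is why the individual trees disagree with each other.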

Arboreal Voting

So what good are 10,000 (probably) bad models? Well, it turns out that they really aren't that helpful. But what is helpful are the few really good decision trees that you also generated along with the bad ones.

When you make a prediction, the new observation gets pushed down each decision tree and assigned a predicted value/label. Once each of the trees in the forest has reported its predicted value/label, the predictions are tallied up and the mode vote of all trees is returned as the final prediction.

Simply put, the 99.9% of trees that are irrelevant make predictions that are all over the map and cancel each other out. The predictions of the minority of trees that are good rise above that noise and yield a good prediction.
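The tallying step can be sketched by asking every tree in a fitted scikit-learn forest for its own prediction and taking the most common answer. One caveat: as far as I know, scikit-learn's RandomForestClassifier averages the trees' predicted probabilities rather than counting hard votes, so the hand-rolled majority below is an approximation of what predict() returns, not a copy of it. The iris data and 100 trees are my choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One row per tree, one column per observation.
# Individual trees report class indices as floats, so cast to int.
votes = np.array([tree.predict(X) for tree in forest.estimators_]).astype(int)

# Tally the votes for each observation and keep the most common label.
majority = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

# How often does the hard majority vote match the forest's own prediction?
agreement = (majority == forest.predict(X)).mean()
print(f"agreement with forest.predict: {agreement:.3f}")
```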

Why should I use it?

It's Easy

Random forest is the Leatherman of learning methods. You can throw pretty much anything at it and it'll do a serviceable job. It does a particularly good job of picking up transformations of the input that you would otherwise have to specify by hand, and, as a result, doesn't require as much tuning as something like an SVM (i.e. it's good for folks with tight deadlines).

An Example Transformation

Random forest is capable of learning without carefully crafted data transformations. Take the f(x) = log(x) function for example.

Create some fake data and add a little noise.

import numpy as np
x = np.random.uniform(1, 100, 1000)
y = np.log(x) + np.random.normal(0, .3, 1000)


If we try and build a basic linear model to predict y using x we wind up with a straight line that sort of bisects the log(x) function. Whereas if we use a random forest, it does a much better job of approximating the log(x) curve and we get something that looks much more like the true function.

You could argue that the random forest overfits the log(x) function a little bit. Either way, I think this does a nice job of illustrating how the random forest isn't bound by linear constraints.
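Without the plot, the same comparison can be made numerically. This is a sketch that regenerates the fake data with a fixed seed, holds out the last 250 points, and compares mean squared error. The train/test split, the seed, and min_samples_leaf=10 (added to tame the overfitting mentioned above) are my assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Same fake data as above: log(x) plus a little noise.
x = rng.uniform(1, 100, 1000)
y = np.log(x) + rng.normal(0, 0.3, 1000)

# scikit-learn expects a 2-D feature matrix.
X = x.reshape(-1, 1)
X_train, X_test = X[:750], X[750:]
y_train, y_test = y[:750], y[750:]

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

linear_mse = mean_squared_error(y_test, linear.predict(X_test))
forest_mse = mean_squared_error(y_test, forest.predict(X_test))

print(f"linear model MSE:  {linear_mse:.3f}")
print(f"random forest MSE: {forest_mse:.3f}")
```

A lower error for the forest on held-out points is the numeric version of "looks much more like the true function".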

Uses

Variable Selection

One of the best use cases for random forest is feature selection. One of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.

When a certain tree uses one variable and another doesn't, you can compare the value lost or gained from the inclusion/exclusion of that variable. The good random forest implementations are going to do that for you, so all you need to do is know which method or variable to look at.

In the following examples, we're trying to figure out which variables are most important for classifying a wine as being red or white.
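In scikit-learn the thing to look at is the feature_importances_ attribute of a fitted forest. The sketch below uses scikit-learn's built-in load_wine dataset as a stand-in. Note that this is not the red/white wine data: it classifies wines into three cultivars, so treat it as an illustration of the mechanics only.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset (three cultivars), NOT the red/white wine data.
wine = load_wine()

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(wine.data, wine.target)

# Pair each feature with its importance score and sort, best first.
ranked = sorted(
    zip(wine.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked[:5]:
    print(f"{name:30s} {score:.3f}")
```

The scores sum to 1, so each one can be read as that variable's share of the forest's total importance.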

Classification

Random forest is also great for classification. It can be used to make predictions for categories with multiple possible values and it can be calibrated to output probabilities as well. One thing you do need to watch out for is overfitting. Random forest can be prone to overfitting, especially when working with relatively small datasets. You should be suspicious if your model is making predictions that look "too good" on your test set.

One way to avoid overfitting is to only use really relevant features in your model. While this isn't always cut and dried, using a feature selection technique (like the one mentioned previously) can make it a lot easier.
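A quick sanity check for overfitting is to compare accuracy on the training data with accuracy on data the model has never seen. Here's a minimal sketch. The breast cancer dataset, the 70/30 split, and the seed are my choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# score() returns mean accuracy for classifiers.
train_acc = forest.score(X_train, y_train)
test_acc = forest.score(X_test, y_test)
gap = train_acc - test_acc

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap:            {gap:.3f}")
```

A large gap between the two numbers suggests the forest has memorized the training set; a small gap is what you're hoping for.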

Regression

Yep. It does regression too.

I've found that random forest--unlike other algorithms--does really well learning on categorical variables or a mixture of categorical and real variables. Categorical variables with high cardinality (# of possible values) can be tricky, so having something like this in your back pocket can come in quite useful.
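Here's a sketch of regression on a mix of one categorical and one real variable. One caveat: scikit-learn's forests need numeric input, so the categorical column is integer-coded with OrdinalEncoder first. The data is entirely made up (a hypothetical "city" column with 20 levels and a "size" column), and ordinal coding is just one of several possible encodings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n = 1000

# Fake data: each of 20 hypothetical cities shifts the target by its own offset.
cities = np.array([f"city_{i}" for i in range(20)])
city_effect = dict(zip(cities, rng.normal(0, 5, size=20)))

df = pd.DataFrame({
    "city": rng.choice(cities, size=n),
    "size": rng.uniform(20, 200, size=n),
})
y = df["city"].map(city_effect) + 0.1 * df["size"] + rng.normal(0, 1, size=n)

# Integer-code the categorical column; pass the numeric column through as-is.
encode = ColumnTransformer(
    [("city", OrdinalEncoder(), ["city"])], remainder="passthrough"
)
model = make_pipeline(encode, RandomForestRegressor(n_estimators=200, random_state=0))

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)
model.fit(X_train, y_train)

# score() returns R^2 for regressors.
r2 = model.score(X_test, y_test)
print(f"held-out R^2: {r2:.3f}")
```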

A Short Python Example

Scikit-Learn is a great way to get started with random forest. The scikit-learn API is extremely consistent across algorithms, so you can horse race and switch between models very easily. A lot of times I start with something simple and then move to random forest.
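Because every estimator exposes the same fit/predict interface, a horse race is just a loop. Here's a minimal sketch using 5-fold cross-validation on iris. The three models, the dataset, and the settings are my picks for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Start simple, then move up to random forest.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Same call for every model: mean accuracy over 5 cross-validation folds.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:20s} {score:.3f}")
```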

One of the best features of the random forest implementation in scikit-learn is the n_jobs parameter. This will automatically parallelize fitting your random forest based on the number of cores you want to use. Here's a great presentation by scikit-learn contributor Olivier Grisel where he talks about training a random forest on a 20 node EC2 cluster.

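from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Randomly assign roughly 75% of the rows to the training set
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
# pd.Factor has been removed from pandas; Categorical.from_codes replaces it
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()
train, test = df[df['is_train']], df[~df['is_train']]
features = df.columns[:4]
# n_jobs=2 fits the trees on two cores in parallel
clf = RandomForestClassifier(n_jobs=2)
# Use the category codes so the integer labels line up with iris.target_names
y = train['species'].cat.codes
clf.fit(train[features], y)
# Map the predicted codes back to species names
preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])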

Looks pretty good!

Final Thoughts

Random forests are remarkably easy to use given how advanced they are. As with any modeling, be wary of overfitting. If you're interested in getting started with random forest in R, check out the randomForest package.
