Random Forest, a particular type of Ensemble Method
What you will learn
- What Ensemble Methods are
- Random Forest, a particular type of Ensemble Method
- How to use Random Forest
- Random Forest’s properties
- Random Forest’s variable importance
Ensemble Methods (EM for short) are a popular set of algorithms that improve classification or regression by combining the predictions of several base models. The combined prediction is usually better compared to the prediction of any single constituent learner.
Among ensemble methods, two of the most popular families of methods are:
- Averaging (Random Forest, Extremely Randomized Trees): build a set of multiple independent models and then average (or vote) the predictions of each model. Averaging improves prediction by reducing variance and thus is well suited with complex models as base learners.
- Boosting (Gradient Boosted Trees): incrementally build an ensemble of weak classifiers to produce a powerful ‘commitee’. In boosting each new model is trained with a re-weighted version of the data to emphasize the training instances that previous models mis-classified. Boosting starts with a simple model and works by gradually reduce its error at each step.
Why are Ensemble Methods so popular? Despite there is not a single winning algorithm in Machine Learning (“no free lunch” theorem), EM have generally many positive aspect and they are often used as a first (really good) starting point in many competitions and problems.
- They are a non-parametric model, meaning that the user doesn’t need to have a prior knowledge of the distribution behind the dataset or no prior knowledge of the problem.
- They automatically treat continuos and categorical variables, without the need of encoding and they are insensitive of different variables scales (no preprocessing).
- EM can capture high non-linear and complex interaction between variables and, depending on the algorithm, they are parallelizable (with an appropriate backend) thus capable of treating huge amount of data.
- They require few hyperparameter to tune and are robust to overfitting.
Any algorithm can be used as base learner but one of the most frequent choices is to use Trees. DecisionTrees are a supervised learning algorithms used for classification and regression. The following graph represents a Decision Tree.
The algorithm starts by building the Tree at the root and performs a binary split on a single variable, dividing the input space in two areas. The node’s two children are further divided and the process iterates until a certain criterion is met. At each node the splitting point (the splitting point is the pair value, split value) is chosen in order to minimize impurity in the two branches. Impurity is measured by Gini impurity, that is roughly how many samples belong to a single class w.r.t. the total number of samples. A gini impurity of zero means that every example belong to the same class. The algorithm ends in leaves that contain constant values, that is the value predicted by the algorithm. The algorithm’s goal is to have pure leaves, meaning that each leaf have samples of the same class.
Trees are a powerful methods, but they tend to learn a very complex model, if not treated differently for example with pruning or deliberately specifying its depth. For this reason they can be used with both methods: averaging full grown trees or boosting several small trees.
In this series of posts we will describe two popular Ensemble Methods based on Trees: Random Forest and Gradient Tree Boosting. In the next paragraphs we will concentrate on Random Forest, while in the next post we will discuss about Gradient Boosting Trees.
Random Forests (RF) are an ensemble method based on a modified version of Bagging (Bootstrapaggregating) invented by Leo Breiman in 2001. Bagging is a form of Averaging that build each base learner with a random sample with replacement (values can be duplicated) drawn from the dataset. The rationale of this algorithm is that complex models have high variance and averaging many of them contributes to reducing it. In principle a complex model is able to perfectly learn the true function, but this is also its drawback. In practice it adapts to much to the subset of the data that is seeing, thus several complex model that are trained with different subset of the data display high variability in the resulting model. This type of problem can be addressed by averaging many complex models as RF does: it buils a large ensemble of trees and then averages their prediction to obtain a final model.
In pure Bagging samples are not necessarily independent and thus averaging doesn’t account for all variance. For this reason the idea of Random Forests is to reduce the correlation of each tree (and thus decrease variance) by randomly selecting a subset of input variables for each split in the tree. This procedure builds trees that are less correlated with each other and thus averaging them improve performance.
Random Forests are a popular method because they work surprisingly well with hyperparameters’ default values and, even if they are less interpretable than trees, they can nevertheless provide precious insights into the data, as we will see later.
In the picture below we can see a representation of a Random Forest. Each rectangular box is a tree of the forest and each bar in the box represent a level of the tree. Each color is associated with a feature. The size of each color is the number of nodes in the tree that uses that feature as splitting point. It is possible to note that trees are quite similar in structure and that features showing up at the root of the trees are only a subset of all the features (clearly the most important ones). Rectangles appear not to be perfect because leaves can show up at intermediate levels.
The beauty of RF: basically there is just ONE hyperparameter to tune
The main hyperparameter to adjust is the number of variables selected at random for each split. Let be the total number of features, and the number of features chosen at each split (), the recommended value for is for classification problems andfor regression problems. These are rules of thumb and they work well for most datasets. In principle lower values of reduce the correlation between any pair of trees and hence averaging works better in reducingvariance (at the cost of less powerful trees).
The second important parameter to tune is the number of trees in the forest. Since RF are anaveragingmethod they do not usually overfit by adding more trees and the larger the better (but it increases computational cost). In addition, note that results will stop getting significantly better beyond a sufficient number of trees.
Random Forests are said to hardly overfit the data. This is not always the case and the average of fully grown trees can result in a model that is too rich. If this is a concern, there are few ways to reduce tree depth, either by specifying the limit directly or by setting the number of training samples in the leaf, or the minimum number of samples to split.
Extension to Random Forests
In Extremely Randomized Trees randomness goes one step further, in the way splits are computed. After the random subset of candidate features is selected, instead of looking for the most discriminative thresholds,thresholds are drawn at random for each candidate feature. The relation with the output is retained by selecting the best pair feature and random-threshold as the splitting point (if= 1 the trees would be totally random). The rationale of this choice is that if a feature is important, a significant fraction of the trees will have (approximately) the same feature with the same splitting point at the same position in the tree, increasing the correlation of each tree (hence increasing variance). By randomly selecting the threshold the algorithm allows to further reduce the Variance of the model.
Random Forests Properties
RF – Properties inherited from Trees:
- Can handle categorical predictors
- Do not require normalization
- Support natively multiclass problems
- Some concepts (for example XOR and additive functions) are hard to learn
- Tend to favor categorical features with many categories
- Random Forest algorithm has its own peculiar properties and moreover they introduce additional features that are useful to further analyze the data
RF – Peculiar properties:
- Few hyperparameters to tune (default values are often sufficent)
- Almost insensitive to overfitting
- If the number of relevant features is small, Random Forests can perform poorly because, at each split, the probabilty of picking irrelevant variables is higher
RF – Additional features:
- Out Of Bag (OOB) score. During the building phase of each tree some samples are left out (OOB samples). OOB samples are used to calculate a reliable error measure without relying on an test set
- Variable importance. Trees (and ensemble of trees) can provide a measure indicating how important or useful a variable is in predicting the outcome
- Feature selection: allows to reduce dimensionaly and thus improve algorithm speed and convergence while keeping the most of the capabilities. The procedure can be automated to remove the last feature until certain stopping criterion (e.g. decrease in accuracy) is met
- Partial dependence: it shows the relation between the target and a chosen set of varialbe (at most two at a time), marginalizing over the other variables. The chosen variables are usually the most important and this plot is used to gain insight about the function learned by the ensemble and how it models the dependence between the target and the most important variables
- Proximity measure: random forest can grow a proximity matrix, constructed by passing the OOB samples through each tree and increasing the proximity of two sample if they ends up in the same terminal node. Plotting this matrix should provide insight on which data points are effectively close, at least as learned by the random forest classfier. However it tends to produce similar graphs and its utility has been doubted
Variable importance is a measure of how much a variable is deemed important or useful by the Random Forest’s model. There are mainly two methods for calculating variable importance: decrease in node impurity and mean decrease in accuracy.
- mean decrease in node impurity: feature importance is calculated by looking at the splits of each tree. The importance of the splitting variable is proportional to the improvement to the gini index given by that split and it is accumulated (for each variable) over all the trees in the forest.
- mean decrease in accuracy: This method, proposed in the original paper, passes the OOB samples down the tree and records prediction accuracy. A variable is then selected and its values in the OOB samples are randomly permuted. OOB samples are passed down the tree and accuracy is computed again. A decrease in accuracy obtained by this permutation is averaged over all trees for each variable and it provides the importance of that variable (the higher the decreas the higher the importance). This method is known in the literature to have better statistical properties with respect to gini impurity and it is available in many packages and implementations. Since at the moment is not availabe in scikit-learn we propose our implementation in the addutils package.
In the pictures below we can see an example of variable importance graph. The two pictures refers to the dataset “Boston housing” that telates house values of Boston suburbs to a serie of variables. As can be seen both variable importance methods agrees that the two most important features are “Avg Rooms per Dwelling” and “% Lower pop Status”. With this simple dataset the two methods looks similar, with few exceptions for the least important variables. This method of RF can be used to either discard irrelevant features (for example “Land_zn / lots_over_25k_sqft”) or further understand the process. For example we can say that if the number of rooms is an important parameter in determining the price of the house as well as the perchentage of lower-status population.
Fig. 1: Decrease in node impurity
Fig. 2: Mean decrease in accuracy
Where RF can be succesfully used:
- Good starting point with new dataset, can be used as baseline for performance comparison
- Complex and etherogeneous problems with interaction effects
- No a-priori knowledge of the model (non-parametric model)
Where RF (usually) fail:
- Sparse problems → Use Lasso, Elastic-Net or other sparse models
- Additive functions → Gradient Boosted Trees or Generalized Additve Models
- Categorical features with many categories → preprocessing to reduce categories (or try with naïve bayes)