Ensemble Methods: Gradient Boosted Trees


What You Will Learn

• Introduction to Gradient Boosting
• Using Gradient Boosting with Trees
• Gradient Boosted Trees hyperparameters
• Gradient Boosted Trees tuning
• Gradient Boosted Trees properties

Notebooks available on GitHub: Ensemble Methods, Ensemble Methods Advanced


As popular as Random Forests and used in many contexts (for example, Yahoo's web page ranking algorithm), Gradient Boosted Trees are another type of Ensemble Method. In this post we will describe Gradient Boosted Trees, concentrating in particular on the meaning of each hyperparameter and on tuning them to achieve better performance.

Gradient Boosting

This is a form of boosting that learns a function sequentially: each member of the ensemble learns from the errors of its predecessor. The final model is the aggregation of all individual predictions. The errors at each iteration are called residuals. Suppose you fit the model to minimize the squared loss \(L(y_i, f(x_i)) = {1 \over 2}[y_i - f(x_i)]^2\). In this case the residuals (computed as \(y_i - f(x_i)\)) can be viewed as the negative gradient of the loss with respect to \(f(x_i)\), so fitting the residuals is really fitting the model to the negative gradient, that is, updating the function based on the negative gradient. We are actually updating the ensemble model using gradient descent. Boosting can be generalized to arbitrary loss functions: derive the gradient of the chosen loss function and use it as the rule to compute the residuals.
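The equivalence between residuals and the negative gradient of the squared loss can be checked in a few lines of plain Python (the numbers below are purely illustrative):

```python
y = [3.0, -1.0, 2.0]   # targets y_i
f = [2.5,  0.0, 2.0]   # current model predictions f(x_i)

# Squared loss per sample: L_i = 1/2 * (y_i - f_i)^2
# Its gradient with respect to f_i is (f_i - y_i),
# so the negative gradient is exactly the residual (y_i - f_i).
residuals = [yi - fi for yi, fi in zip(y, f)]
neg_gradient = [-(fi - yi) for yi, fi in zip(y, f)]
```

Fitting the next learner to `residuals` is therefore the same as taking a gradient-descent step on the squared loss.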

Gradient Boosted Trees

Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a special case of Gradient Boosting where the weak learners are regression trees. The method was invented by Jerome H. Friedman in 1999.

In the picture below there is an example of residual fitting by a GBRT with 3 trees.

The first tree fits a step function to the sample data. Tree 2 sees only the residuals (the part of the target that tree 1 failed to predict), and so on for every iteration. The final model is the sum of all the models.
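The residual-fitting loop described above can be sketched with three shallow regression trees, assuming scikit-learn is available; the synthetic dataset and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative 1-D regression problem: noisy sine wave.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 100)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 100)

prediction = np.zeros_like(y)
trees, errors = [], []
for _ in range(3):
    residuals = y - prediction                      # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    prediction += tree.predict(X)                   # final model = sum of all trees
    errors.append(np.mean((y - prediction) ** 2))   # training error shrinks each round
```

Each tree fits a step function to what the previous trees left unexplained, so the training error decreases at every iteration.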

GBRT Hyperparameters

Unlike Random Forests, GBRT have many hyperparameters to set in order to achieve good performance. They can be grouped into two meta-parameter areas: Tree Structure and Regularization.

In the following section we will use this notation:

rmin = recommended minimum value: lower values can be used but are not recommended for common applications.
default = default value set by the particular implementation or recommended by the algorithm’s authors
rmax = recommended maximum value: higher values can be used but are not recommended for common applications.

Tree Structure

The optimal tree structure is problem dependent and can be controlled with these parameters:

  • depth: it controls the maximum allowed level of interaction between variables. A tree of depth = 2 can include the interaction effect of at most two variables. The interaction order between variables is generally unknown, but in most cases it is low. In many applications depth = 2 is too low, while depth > 10 is unlikely to be required. Specifying depth implies that every internal node is expanded until the desired depth is reached. The resulting trees are balanced, with \(2^{depth}\) leaves. Recommended values: [rmin = 4; default = 6; rmax = 8].
  • max leaf nodes: an alternative way to control the depth of the trees. A tree with max leaf nodes = n has at most n – 1 split nodes and can model interactions of order n – 1; this behavior is similar to depth = n – 1 but the algorithm is somewhat different. Trees are grown in a greedy best-first fashion: at each step the node with the highest impurity is chosen for further splitting, while nodes with lower impurity become leaves. The resulting tree is unbalanced, with leaves at every level. Recommended values: [rmin = 3; default = 5; rmax = 7].
  • min sample leaf: it puts a constraint on the minimum number of samples in each leaf, hence it reduces the effect of outliers (you cannot have, for example, a leaf with a single sample). This parameter interacts with the two previous hyperparameters. With shallow trees it is very difficult to obtain leaves with few samples, and this parameter is less important. Suppose instead you grow deeper trees (depth = 100): the probability of overfitting increases. If you increase min sample leaf you can expect fewer leaves, because some leaves that would previously have formed no longer meet the minimum requirement. Recommended values: [rmin = 1; default = 1; rmax = samples_num * 0.05].


Regularization

The optimal number of boosting iterations is also problem dependent. Each iteration reduces the training error, so given a sufficient number of iterations the training error can be made arbitrarily small. However, this can cause overfitting, thus there is an optimal number of iterations that should be found. There are also other ways to perform regularization. Let's review the main parameters in this area:

  • shrinkage:
    • iterations: the number of boosting iterations to perform (i.e. the number of Trees). The more Trees the better, but at higher computational cost. It is the main parameter to control model performance, and limiting it is a form of regularization. Recommended values: [rmin = 100; default = 100; rmax = matches computational resources].
    • learning rate: another form of regularization. It scales the contribution of each Tree to the current model. A decrease in learning rate increases the shrinkage and has the effect of reinforcing concepts: the redundancy between Trees increases. Thus small learning rate values require a higher number of iterations. Recommended values: [rmin = 0.01; default = 0.1; rmax = 0.5].
  • subsampling:
    • subsample: the fraction of samples to be used for fitting the individual base learners. Choosing \( subsample < 1.0 \) leads to a reduction of variance and an increase in bias. As in Random Forests, the left-out samples are used to calculate the OOB estimate.
    • m: the number of features to consider when looking for the best split. It has the same meaning as in Random Forests. Lower values of \(m\) reduce the correlation between any pair of trees and can improve performance.
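The regularization knobs above correspond to the following `GradientBoostingRegressor` arguments (a sketch with illustrative values):

```python
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(
    n_estimators=200,     # "iterations": number of boosting rounds / Trees
    learning_rate=0.1,    # shrinkage: scales each tree's contribution
    subsample=0.8,        # fraction of samples per tree; < 1.0 enables OOB estimates
    max_features="sqrt",  # "m": features considered at each split
)
```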

Hyperparameters Tuning

There is no unique way to tune them, and several best practices are available. We suggest two of the most common methods.

A three-step approach based on the scikit-learn implementation:

  1. set iterations to a high value (based on available computational resources).
  2. select a range of values for the remaining hyperparameters and find the best values by comparing all possible combinations.
  3. finally, set iterations even higher and fine-tune the learning rate.
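A sketch of the steps above with scikit-learn's `GridSearchCV`; the dataset and the parameter grid are illustrative, not prescriptive:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_friedman1(n_samples=300, random_state=0)

# Step 1: fix iterations to a high value.
# Step 2: grid-search the remaining hyperparameters.
grid = GridSearchCV(
    GradientBoostingRegressor(n_estimators=200, random_state=0),
    param_grid={"max_depth": [4, 6], "learning_rate": [0.05, 0.1]},
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_
# Step 3 would then re-fit with even more iterations and fine-tune learning_rate.
```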

Another approach (based on the work of Friedman) is somewhat different.

  1. Set learning rate to a small value.
  2. Keep the Trees small with low values of either depth or max leaf nodes.
  3. Find the corresponding number of iterations by using the OOB samples to calculate the error. Small values of learning rate require a high number of iterations. This increases computational costs, but since the trees are usually small the effect is lessened.
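This recipe can be sketched with scikit-learn, where subsampling exposes an `oob_improvement_` array whose cumulative sum peaks near the optimal number of iterations (dataset and values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=400, random_state=0)

gbrt = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.05,   # step 1: small learning rate
    max_depth=3,                            # step 2: small trees
    subsample=0.5, random_state=0,          # subsampling enables OOB estimates
).fit(X, y)

# Step 3: the cumulative OOB improvement peaks near the optimal iteration count.
cumulative_oob = np.cumsum(gbrt.oob_improvement_)
best_iterations = int(np.argmax(cumulative_oob)) + 1
```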

GBRT Properties

GBRT – Properties inherited from Trees:

  • Natural handling of data of mixed type (numerical and categorical)
  • Automatic detection of non-linear feature interactions
  • Cannot extrapolate (it is not possible to predict beyond the minimum and maximum limits of the response variable in the training data)
  • Work well with heterogeneous data (features can have different scale)

GBRT – Peculiar properties:

  • High predictive power.
  • Robustness to outliers in input space (via robust loss functions)
  • Support for different Loss functions.
  • Fits naturally additive functions.
  • Requires careful tuning (RF are faster to tune, they have essentially one parameter)
  • Slow to train (but fast in prediction)
  • Depending on the implementation it is possible to add new trees to an ensemble.
  • Scalability issues: due to its sequential nature, boosting is difficult to parallelize.

GBRT – Additional features:

  • Out Of Bag (OOB) score. If subsampling is selected, GBRT behaves similarly to Random Forests. It is possible to use the OOB samples to compute the optimal number of boosting iterations. This has the advantage of being computed on the fly as a by-product of the algorithm, but the disadvantage of being a pessimistic estimate of the actual loss, especially for a high number of trees.
  • Variable importance. Trees (and ensemble of trees) can provide a measure indicating how important or useful a variable is in predicting the outcome.
  • Partial dependence: it shows the relationship between the target and a chosen set of variables (at most two at a time), marginalizing over the other variables. The chosen variables are usually the most important ones, and this plot is used to gain insight into the function learned by the ensemble and how it models the dependence between the target and the most important variables.
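Variable importance and partial dependence can be sketched with scikit-learn's `feature_importances_` attribute and `sklearn.inspection.partial_dependence` (the dataset is illustrative):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_friedman1(n_samples=300, random_state=0)
gbrt = GradientBoostingRegressor(random_state=0).fit(X, y)

# Variable importance: one non-negative score per feature, summing to 1.
importances = gbrt.feature_importances_

# Partial dependence of the target on the most important feature,
# marginalizing over all the other variables.
top_feature = int(np.argmax(importances))
pd_result = partial_dependence(gbrt, X, features=[top_feature])
```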
