Random Forests

Like Boosting, Random Forests bring together very simple ('weak') classifiers <math>h_j(\cdot)</math> to create an ensemble classifier that is not only easy to understand but also fast to run.

Objective

We want to create binary decision trees that we will traverse from the root down a single path until we reach a leaf. The leaves will inform us about the probability of a new example <math>m</math> belonging to each of the classes <math>c</math>.

We will create <math>Q</math> different trees and aggregate all their results in an ensemble. This aggregation of trees is called a forest.

Every tree will differ from the rest because it is created at random. More precisely, there are typically two sources of randomization: one is the training dataset each tree sees (as in bootstrapping), and the other is the randomized way of generating weak classifiers.

Model and Learning pseudocode

The creation of a tree is a recursive process. In every step, a specific weak classifier is chosen from a randomly generated pool, linked to the node, and then the process is repeated for each of its children. The particularity of these weak classifiers is that they split the input hyperspace into two subregions, so every child node has an easier classification task than its parent because the subspace left to segment is smaller.

As the example in the image above tries to depict (check the explanatory video), every classifier divides the space into two subregions by testing whether one particular dimension of the input vector <math>X_{m}</math> satisfies an inequality. Other kinds of decision questions could be asked when dealing with categorical features.
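
To make this concrete, here is a minimal Python sketch (not taken from the original page) of an axis-aligned weak classifier embedded in a tree node; the `Node` fields and the `route_to_leaf` name are illustrative assumptions.

```python
# Minimal sketch: a binary tree whose internal nodes test one input
# dimension against a threshold, and whose leaves store class histograms.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # dimension n tested at this node
    threshold: Optional[float] = None  # gamma: split value for that dimension
    left: Optional["Node"] = None      # child for x[feature] <= threshold
    right: Optional["Node"] = None     # child for x[feature] >  threshold
    histogram: Optional[dict] = None   # class-count histogram stored at a leaf

def route_to_leaf(node: Node, x) -> Node:
    """Follow the single root-to-leaf path selected by the weak classifiers."""
    while node.histogram is None:          # internal nodes carry no histogram
        if x[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node
```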

Step 0 - Initialization

Let's define:

<math>h_{jk}(\cdot)</math> as the family of <math>J</math> weak classifiers generated at node <math>k</math>.

<math>S_k</math> as the subspace of <math>X</math> assigned to each node <math>k</math>.

We will also initialize <math>S_0 = \mathbb{R}^n</math> so the root of the tree <math>k = 0</math> has visibility of all the examples <math>X_m</math>. In practice, the training dataset for every tree will be a random subset of the original database, which adds diversity to the forest.
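
As a sketch of this per-tree randomization (source of randomness #1), one common choice, assumed here, is to bootstrap a sample of the same size as the original dataset, drawn with replacement using NumPy:

```python
# Minimal sketch of building one tree's training subset by bootstrapping.
import numpy as np

def bootstrap_sample(X, Y, rng=None):
    """Draw a random subset of (X, Y) with replacement for one tree."""
    rng = rng or np.random.default_rng()
    M = X.shape[0]
    idx = rng.integers(0, M, size=M)   # sample M row indices with replacement
    return X[idx], Y[idx]
```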

Step 1 - Generating weak classifiers

There are many possible weak classifiers we could choose from, and we will try to pick the simplest ones we can think of. A common choice is to check whether an input feature <math>n</math> is greater than a specific threshold <math>\gamma_{jn}</math> or not. An efficient way to pick thresholds, for example, is to take them from the examples visible in the subspace <math>S_k</math>.

The dimension <math>n</math> to check is chosen randomly, which helps add diversity to the forest.
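
A minimal sketch of this generation step, assuming NumPy arrays and the hypothetical function name `generate_candidates`: each candidate tests a randomly chosen dimension against a threshold taken from one of the examples visible at the node.

```python
# Minimal sketch: J candidate weak classifiers as (feature, threshold) pairs.
import numpy as np

def generate_candidates(X_node, J, rng=None):
    """Return J (feature, threshold) pairs drawn from the node's examples."""
    rng = rng or np.random.default_rng()
    N = X_node.shape[1]
    features = rng.integers(0, N, size=J)             # random dimension n
    rows = rng.integers(0, X_node.shape[0], size=J)   # random visible example
    thresholds = X_node[rows, features]               # gamma_{jn} taken from the data
    return list(zip(features.tolist(), thresholds.tolist()))
```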

Step 2 - Creating histograms

For every weak classifier <math>h_j(\cdot)</math>, we create the two class histograms of the two splits that would result if weak classifier <math>j</math> were chosen.

We then calculate the entropy <math>H</math> of each histogram using Shannon's definition from information theory:

<math> H_k = E[I(S_k)] = - \sum_c p_k(c) \cdot \log(p_k(c)) </math>, where <math>p_k(c)=\frac{\textrm{# of class 'c' examples in } S_k}{\textrm{total # of examples in } S_k}</math>
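
A minimal pure-Python sketch of this step (the helper names are illustrative assumptions): build the class histogram of one split and compute its Shannon entropy.

```python
# Minimal sketch: class histogram and its Shannon entropy.
import math
from collections import Counter

def class_histogram(labels):
    """Count how many examples of each class 'c' are present."""
    return Counter(labels)

def entropy(hist):
    """H = -sum_c p(c) * log(p(c)) over the non-empty bins."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in hist.values() if n > 0)
```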

Step 3 - Choosing best information gain

We want the entropy <math>H</math> to decrease as much as possible when we descend from a node to its children, and we can measure each step in terms of information gain, which is the difference in entropy between the parent and its children.

It could be the case that a split decreases entropy on one side and increases it on the other, so we want to make sure that we are gaining something overall in the binary split. Therefore we will take a weighted average of the entropies of the two children and compare it to the parent. That lets us define a score <math>\epsilon_j</math> (the information gain) for classifier <math>j</math> at node <math>k</math>, with children <math>ka</math> and <math>kb</math>:

<math> \epsilon_j = H_k - \frac{\textrm{# examples in ka}}{\textrm{# examples in k}} \cdot H_{ka} - \frac{\textrm{# examples in kb}}{\textrm{# examples in k}} \cdot H_{kb} </math>
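
Putting Steps 2 and 3 together, here is a hedged sketch, assuming NumPy arrays, that scores every candidate <math>(n, \gamma_{jn})</math> by its information gain and keeps the best one. The node is then split with the winning classifier and the recursion continues on the two children.

```python
# Minimal sketch: choose the candidate split with the largest information gain.
import numpy as np

def entropy_of(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def best_split(X_node, y_node, candidates):
    """Return the (feature, threshold) pair with the largest gain, and the gain."""
    H_k = entropy_of(y_node)
    best, best_gain = None, 0.0
    for feature, threshold in candidates:
        mask = X_node[:, feature] <= threshold
        y_a, y_b = y_node[mask], y_node[~mask]
        if len(y_a) == 0 or len(y_b) == 0:
            continue                          # degenerate split, no gain
        gain = H_k - (len(y_a) / len(y_node)) * entropy_of(y_a) \
                   - (len(y_b) / len(y_node)) * entropy_of(y_b)
        if gain > best_gain:
            best, best_gain = (feature, threshold), gain
    return best, best_gain
```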

Step 4 - Stopping the recursion

There are a number of stopping conditions we can come up with:

  • When the histogram contains only one class (or entropy is close to zero)
  • When no weak classifier is found that provides an information gain (in which case we might want to regenerate the collection of weak classifiers instead)
  • ...
The heuristics about when to stop the recursion are also called pre-pruning. In general, if we stop too late, we will probably overfit, and if we stop too soon, the trees will be too coarse to capture the structure of the data and will not generalize well.
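
A small sketch of such a pre-pruning check, written against the quantities already computed at a node; the `max_depth` cap is an extra assumption, added here as one more common heuristic.

```python
# Minimal sketch: decide whether to turn the current node into a leaf.
def should_stop(node_entropy, best_gain, depth, max_depth=20, min_entropy=1e-6):
    if node_entropy < min_entropy:   # histogram is (almost) pure: one class left
        return True
    if best_gain <= 0.0:             # no weak classifier provides information gain
        return True
    if depth >= max_depth:           # assumed depth cap (not mentioned above)
        return True
    return False
```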

Step 5 - Ensembling trees in a forest

The final step of the algorithm is, for every given input vector <math>X_m</math>, to traverse each tree down to a leaf, take the leaf histograms, and average them across the <math>Q</math> trees. This average will be the output <math>\widehat{Y}</math> of our random forest.
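
A minimal sketch of this averaging, assuming <math>X_m</math> has already been routed to one leaf per tree (for example with the `route_to_leaf` sketch above) and that each leaf carries a class-count histogram:

```python
# Minimal sketch: average the normalized leaf histograms of the Q trees.
from collections import defaultdict

def average_leaf_histograms(leaf_histograms):
    """Return the averaged class probabilities Y_hat for one example."""
    Q = len(leaf_histograms)
    scores = defaultdict(float)
    for hist in leaf_histograms:
        total = sum(hist.values())
        for c, count in hist.items():
            scores[c] += (count / total) / Q   # normalize per tree, then average
    return dict(scores)
```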

Discussion

Advantages

Random forests are very suitable for distributed computing because they make efficient use of memory and computation:

  • No tree needs to see all <math>M</math> examples; each one can load just a subset of the database into memory (#1 source of randomness).
  • Weak classifiers look at only one column of the data, so each tree could load only a subset of the available features <math>N</math> (#2 source of randomness).
  • Every tree could be learned and computed by a different processor on the cloud, and the information then gathered and averaged by the master.
Random forests are also quite fast: a new example is expected to be classified in <math>O(\log(M))</math> time per tree.

By comparing the weak classifiers used by the different trees in the forest, we can count how often each feature appears and rank feature relevance, which makes random forests useful for feature selection.
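
A small sketch of this counting, assuming the hypothetical `Node` structure sketched earlier (internal nodes carry `feature`, leaves carry `histogram`):

```python
# Minimal sketch: count how often each input dimension is tested across the forest.
from collections import Counter

def feature_usage(trees):
    counts = Counter()

    def walk(node):
        if node is None or node.histogram is not None:   # leaf (or missing child)
            return
        counts[node.feature] += 1                         # this dimension was used
        walk(node.left)
        walk(node.right)

    for root in trees:
        walk(root)
    return counts.most_common()   # features ranked by how often they were chosen
```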

Extensions

In order to avoid overfitting, and after some amount of pre-pruning, we might want to do some post-pruning. We will decide which branches to cut off and which ones to keep, like learning to turn a normal tree into a bonsai. This can be done by cross-validation.

If we want a faster method, we can compare the number of terminal nodes with the quadratic classification error and come up with a cost function to optimize.
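
One way to write such a cost function, sketched here as a standard cost-complexity criterion rather than the page's original formulation, is to penalize the classification error with a term proportional to the number of terminal nodes <math>|T|</math>:

<math> C_\alpha(T) = Err(T) + \alpha \cdot |T| </math>, where <math>\alpha</math> controls how strongly extra leaves are penalized.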

Instead of using Shannon entropy to score weak classifiers, we can use the Gini index.
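
For reference, the Gini index of node <math>k</math> uses the same class probabilities <math>p_k(c)</math> as the entropy above (this is the standard definition, not specific to this page):

<math> G_k = 1 - \sum_c p_k(c)^2 </math>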

Limitations

Decision trees are fast to traverse, but since no special attention is paid to the boundary between two classes, their generalization ability is somewhat limited. Moreover, all decision boundaries are not only piecewise linear but also parallel to the axes, so a diagonal boundary between classes will be approximated by a stair-like boundary.

Further reading

  • Shannon-Hartley Theorem - Wikipedia
  • Entropy - Wikipedia
  • Random Forest - Wikipedia
