Gradient boosting
While looking at ResNets, a thought came to my mind: “So isn’t this boosting?”. Researching the topic led me to this Stack Overflow discussion, and later to a review session of gradient boosting algorithms.
Gradient boosting
The Wikipedia page on gradient boosting actually summarizes the algorithm fairly well. A high-level description in my own words:
“An additive model of weak learners (trees) built in a forward stagewise fashion, trained by gradient descent on the loss so that each new stage learns to correct the errors made in the previous stages (trees).”
How gradient boosting works:
- Define a boosting model (\(F_M\)) with \(M\) stages (\(M\) weak learners/trees): \(\hat{F}_M(x_i) = \Sigma_{m=1}^M \alpha_m h_m(x_i) + \text{const.}\), where \(\alpha_m\) is the learning rate.
- Define the loss at the \(m^{th}\) stage/base learner:
  - loss function: \(L(y, F_m(x))\)
  - example target for the base learner (the residual): \(h_m(x_i) \approx y_i - \hat{y}_{i,m-1} = y_i - F_{m-1}(x_i)\)
  - the goal is to minimize the loss
- At stage 0, there is no previous stage, so the very first tree is fit to \(y_i\) directly: \(F_0(x_i) = \arg \min_\alpha \Sigma_{i=1}^n L(y_i, \alpha)\)
- At stage \(m\) (where \(m \neq 0\)), the \(m^{th}\) tree is fit to the residual: \(h_m = \arg \min_{h} \Sigma_{i=1}^n L(y_i, F_{m-1}(x_i) + \alpha h(x_i))\)
- Repeat the previous step and keep updating the model until convergence: \(F_m(x_i) = F_{m-1}(x_i) + \alpha_m h_m(x_i)\).
The boosting algorithm is also nicely explained in CS 329P: Practical Machine Learning (2021 Fall):
```python
import numpy as np
from sklearn.base import clone

class GradientBoosting:
    def __init__(self, base_learner, n_learners, learning_rate):
        # One independent copy of the base learner per boosting stage
        self.learners = [clone(base_learner) for _ in range(n_learners)]
        self.lr = learning_rate

    def fit(self, X, y):
        # Each stage fits the residual left over by the previous stages
        residual = y.copy()
        for learner in self.learners:
            learner.fit(X, residual)
            residual -= self.lr * learner.predict(X)

    def predict(self, X):
        # Sum the (shrunken) predictions of all stages
        preds = [learner.predict(X) for learner in self.learners]
        return np.array(preds).sum(axis=0) * self.lr
```
Here, we can plug in different base_learner models (e.g., regression or classification models with different objective functions).
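For example, a minimal usage sketch with scikit-learn's DecisionTreeRegressor as the base learner (the data here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Shallow trees as the weak learners
gb = GradientBoosting(DecisionTreeRegressor(max_depth=2),
                      n_learners=50, learning_rate=0.1)
gb.fit(X, y)
print(gb.predict(X[:5]))
```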
Gradient boosting regression
Loss function: MSE!
\[\begin{align*} L(y, F_{m-1}(x_i)) & = \frac{1}{n}\Sigma_{i=1}^n(F_{m-1}(x_i)-y_i)^2\\ -\frac{\partial L(y, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} & \propto y_i-F_{m-1}(x_i) \rightarrow \boxed{r_{i,m} = \text{pseudo-residual}}\\ \arg \min_{\gamma} L(y, F_{m-1}(x_i)+\gamma) &\approx \arg \min_{\gamma} [L(y, F_{m-1}(x_i)) +\frac{\partial L(y, F_{m-1}(x_i))}{\partial F}\gamma +\frac{1}{2}\frac{\partial^2 L(y, F_{m-1}(x_i))}{\partial F^2}\gamma^2]\\ \frac{\partial}{\partial \gamma}\left[\,\cdot\,\right] & = \frac{\partial L(y, F_{m-1}(x_i))}{\partial F} + \frac{\partial^2 L(y, F_{m-1}(x_i))}{\partial F^2}\gamma \stackrel{\text{set}}{=} 0\\ \gamma & = - \frac{\frac{\partial L(y, F_{m-1}(x_i))}{\partial F}}{\frac{\partial^2 L(y, F_{m-1}(x_i))}{\partial F^2}} = \boxed{\frac{\Sigma_{i=1}^n (y_i - F_{m-1}(x_i))}{n}} \end{align*}\]
Source code from sklearn
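As a quick numeric check of the boxed result (a minimal sketch with made-up numbers, not taken from the post or from sklearn), the Newton step for a constant update under MSE indeed equals the mean residual:

```python
import numpy as np

# Toy labels y and previous-stage predictions F_{m-1}(x_i)
y = np.array([3.0, -1.0, 2.5, 0.0])
F_prev = np.array([2.0, 0.5, 1.0, -0.5])

# First and second derivatives of the MSE loss w.r.t. a constant shift gamma (at gamma = 0)
grad = np.mean(2 * (F_prev - y))   # dL/dgamma
hess = 2.0                         # d^2L/dgamma^2

gamma_newton = -grad / hess
gamma_mean_residual = np.mean(y - F_prev)
print(gamma_newton, gamma_mean_residual)  # both 0.375
```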
Great video materials:
Gradient boosting classification
Loss function: we can use the same one as logistic regression; details are illustrated here.
\[\begin{align*} \mathcal{L}(y, F_{m-1}(x_i)) & = \Pi_{i=1}^n F_{m-1}(x_i)^{y_i}(1-F_{m-1}(x_i))^{1-y_i}\\ \ell(y, F_{m-1}(x_i)) & = \Sigma_{i=1}^n [y_i\log(F_{m-1}(x_i)) + (1-y_i)\log(1-F_{m-1}(x_i))] \\ & = \Sigma_{i=1}^n \left[y_i\log\left(\frac{F_{m-1}(x_i)}{1-F_{m-1}(x_i)}\right) + \log(1-F_{m-1}(x_i))\right]\\ & = \Sigma_{i=1}^n \left[y_i\log(\text{Odds}) + \log\left(1-\frac{\exp(\log(\text{Odds}))}{1+\exp(\log(\text{Odds}))}\right)\right]\\ & = \Sigma_{i=1}^n \left[y_i\log(\text{Odds}) - \log(1+\exp(\log(\text{Odds})))\right]\\ \frac{\partial \ell(y, F_{m-1}(x_i))}{\partial \log(\text{Odds})} & = \Sigma_{i=1}^n \left[y_i - \frac{\exp(\log(\text{Odds}))}{1+\exp(\log(\text{Odds}))}\right] \\ &= \Sigma_{i=1}^n \left[y_i - F_{m-1}(x_i)\right] \rightarrow \boxed{r_{i,m} = \text{pseudo-residual}}\\ \arg \max_{\gamma} \ell(y, F_{m-1}(x_i)+\gamma) &\approx \arg \max_{\gamma} [\ell(y, F_{m-1}(x_i)) +\frac{\partial \ell(y, F_{m-1}(x_i))}{\partial F}\gamma +\frac{1}{2}\frac{\partial^2 \ell(y, F_{m-1}(x_i))}{\partial F^2}\gamma^2]\\ \frac{\partial}{\partial \gamma}\left[\,\cdot\,\right] & = \frac{\partial \ell(y, F_{m-1}(x_i))}{\partial F} + \frac{\partial^2 \ell(y, F_{m-1}(x_i))}{\partial F^2}\gamma \stackrel{\text{set}}{=} 0\\ \gamma & = - \frac{\frac{\partial \ell(y, F_{m-1}(x_i))}{\partial F}}{\frac{\partial^2 \ell(y, F_{m-1}(x_i))}{\partial F^2}} = \boxed{\frac{\Sigma_{i=1}^n (y_i - F_{m-1}(x_i))}{\Sigma_{i=1}^n F_{m-1}(x_i)(1-F_{m-1}(x_i))}} \end{align*}\]
Source code from sklearn
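And a similar toy sketch for the classification case (made-up numbers; p denotes the probability obtained from the previous stage's log-odds): the pseudo-residuals are \(y_i - p_i\) and the Newton-step update is \(\Sigma (y_i - p_i) / \Sigma p_i(1-p_i)\):

```python
import numpy as np

# Toy binary labels and previous-stage log-odds predictions F_{m-1}(x_i)
y = np.array([1.0, 0.0, 1.0, 1.0])
log_odds_prev = np.array([0.2, -0.4, 0.0, 0.7])

# Convert log-odds to probabilities with the sigmoid
p = 1.0 / (1.0 + np.exp(-log_odds_prev))

# Pseudo-residuals: negative gradient of the log-loss w.r.t. the log-odds
residual = y - p

# Newton-step output value for a region containing all four samples
gamma = residual.sum() / (p * (1.0 - p)).sum()
print(residual, gamma)
```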
Great video materials: