Cengiz Zopluoglu: How Does Extreme Gradient Boosting (XGBoost) Work?

Cengiz Zopluoglu

In one of the working papers under review, I am using the Extreme Gradient Boosting (XGBoost) to identify examinees with potential item preknowledge in a certification exam. In the first draft I submitted, I did not talk much about the technical aspects of XGBoost and went straight to my application. However, one of the reviewers asked more description of the XGBoost to help readers get a conceptual understanding of how XGBoost works.

After spending a few weeks and going over the original XGBoost paper and this presentation so many times, I finally felt that I had a good grasp of it. Then, I came up with a small numerical illustration in the context of the problem I am working on. Here, I provide an informal introduction of this illustration with accompanying R code, so that anyone can play with it.

You can download the illustration dataset from this link.

Dataset

Suppose, we have 30 observations with three variables, \(RT_1\), \(RT_2\), and \(y\). \(RT_1\) and \(RT_2\) are the response time (in seconds) for two test items and our features (predictors). \(y\) is a binary outcome variable with 0 indicating that the individual did not have item preknowledge and 1 indicating that the individual had item preknowledge. The goal is to develop a tree ensemble model to predict the outcome given the feature variables.


d <- read.csv("data/illustration2.csv")
d$Obs <- 1:30
d <- d[,c(4,1,2,3)]

require(knitr)
require(kableExtra)

t = kable(d,digits=1,row.names=FALSE)
column_spec(t,column=1:4,width="1em")

Obs	RT1	RT2	y
1	114	36	1
2	92	121	1
3	49	24	1
4	58	30	1
5	40	17	1
6	67	66	1
7	30	79	1
8	19	14	1
9	33	8	1
10	172	38	1
11	55	37	0
12	103	79	0
13	68	117	0
14	194	17	0
15	68	84	0
16	48	127	0
17	48	200	0
18	61	47	0
19	37	155	0
20	58	84	0
21	139	97	0
22	99	84	0
23	47	50	0
24	41	54	0
25	29	90	0
26	117	31	0
27	44	104	0
28	34	62	0
29	83	20	0
30	64	57	0

First, we need an objective function to optimize. So, we can look back and check if we are improving our model. In the context of predicting a binary outcome, XGBoost is using a logistic loss function.

\[\ell(y_i,\hat{y}_i)=y_i ln(1+e^{-\hat{y}_i})+(1-y_i)ln(1+e^{\hat{y}_i}),\]

where \(y_i\) and \(\hat{y}_i\) are the observed and predicted outcome for the \(i^{th}\) observation. Before we start iterations, XGBoost assigns a predicted value of 0 for each observation as our initial prediction. Let’s call this Iteration 0.


d$y0 <- 0
colnames(d)[5] <- "$\\hat{y}^0$"
  
t = kable(d,digits=1,row.names=FALSE)
column_spec(t,column=1:5,width="1em")

Obs	RT1	RT2	y
1	114	36	1
2	92	121	1
3	49	24	1
4	58	30	1
5	40	17	1
6	67	66	1
7	30	79	1
8	19	14	1
9	33	8	1
10	172	38	1
11	55	37	0
12	103	79	0
13	68	117	0
14	194	17	0
15	68	84	0
16	48	127	0
17	48	200	0
18	61	47	0
19	37	155	0
20	58	84	0
21	139	97	0
22	99	84	0
23	47	50	0
24	41	54	0
25	29	90	0
26	117	31	0
27	44	104	0
28	34	62	0
29	83	20	0
30	64	57	0

The value of the loss function based on \(y_i\) and \(\hat{y}^0\) at Iteration 0 is 20.79442.


y  <- d[,4]
y0 <- d[,5]

loss0 <- sum(y*log(1+exp(-y0)) + (1-y)*log(1+exp(y0)))
loss0


[1] 20.79442

Iteration 1

To build the first tree, we need the first and second order gradient statistics of the loss function from Iteration 0. Chen and Guestrin (2016) used a second-order Taylor approximation for the loss function and derived the equations for the first- and second-order gradient statistics. In this case, we can compute the first- and second-order gradient statistics for each observation as the following:

\[g_i^{(1)} = -\frac{(y_i-1)e^{\hat{y_i}^{(0)}}+y_i}{e^{\hat{y_i}^{(0)}}+1}\]

\[h_i^{(1)} = \frac{e^{\hat{y_i}^{(0)}}}{(e^{\hat{y_i}^{(0)}}+1)^2}\]


d$g1 <- -((y-1)*exp(y0)+y)/(exp(y0)+1)
d$h1 <- exp(y0)/((exp(y0)+1)^2)

t = kable(d,digits=1,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 2))

					Iteration 1
Obs	RT1	RT2	y	\(\hat{y}^0\)	g1	h1
1	114	36	1	0	-0.5	0.2
2	92	121	1	0	-0.5	0.2
3	49	24	1	0	-0.5	0.2
4	58	30	1	0	-0.5	0.2
5	40	17	1	0	-0.5	0.2
6	67	66	1	0	-0.5	0.2
7	30	79	1	0	-0.5	0.2
8	19	14	1	0	-0.5	0.2
9	33	8	1	0	-0.5	0.2
10	172	38	1	0	-0.5	0.2
11	55	37	0	0	0.5	0.2
12	103	79	0	0	0.5	0.2
13	68	117	0	0	0.5	0.2
14	194	17	0	0	0.5	0.2
15	68	84	0	0	0.5	0.2
16	48	127	0	0	0.5	0.2
17	48	200	0	0	0.5	0.2
18	61	47	0	0	0.5	0.2
19	37	155	0	0	0.5	0.2
20	58	84	0	0	0.5	0.2
21	139	97	0	0	0.5	0.2
22	99	84	0	0	0.5	0.2
23	47	50	0	0	0.5	0.2
24	41	54	0	0	0.5	0.2
25	29	90	0	0	0.5	0.2
26	117	31	0	0	0.5	0.2
27	44	104	0	0	0.5	0.2
28	34	62	0	0	0.5	0.2
29	83	20	0	0	0.5	0.2
30	64	57	0	0	0.5	0.2

Now the tricky part here is to find the best split of these observations that would improve the predictions most (reduce the value of loss function most). The observations can be split based on the RT1 or RT2 features. For instance, suppose we divide the observations into two groups based on whether the RT1 value is below or above 58.5 seconds. Below is a hypothetical tree structure with two leaves. There are 16 observations in the first leaf (\(S_1\)), and 14 observations in the second leaf (\(S_2\)).

A hypothetical split of observations based on the RT1 feature

There are many potential splits available. We can say the number of possible theoretical splits are \(2^{30}\). However, most of them are not relevant here. In this case, there are only 53 possible splits of interest, 27 based on RT1 and 26 based on RT2 (as many as the number of unique elements for RT1 and RT2).

So, how could we know if this particular split is good or bad? Or, how could we know the best split that would improve our predictions most? To evaluate a tree structure like this, we need a quantity to measure its quality. Chen and Guestrin (2016) derived the Gain score for evaluating a fixed tree structure.

\[ Gain = \left [ \sum_{l=1}^{L} \left ( \frac{\sum_{i \epsilon S_l}g_i)^2}{\sum_{i \epsilon S_l}h_i+\lambda} - \frac{(\sum_{i}g_i)^2}{\sum_{i}h_i+\lambda} \right ) \right ]-\gamma \]

Here, \(l\) indexes the leaf, \(L\) is the number of leaves in the tree structure, and \(S_l\) indicates the set of observations in the \(l^{th}\) leaf. \(\lambda\) and \(\gamma\) are the parameters to penalize the complexity of the tree structure and should be set to specific values. In practice, these values are typically determined through a parameter tuning procedure. For the sake of this illustration, I set the values to 1.5 and 1 for \(\lambda\) and \(\gamma\), respectively.

Now, we will compute the gain score for each possible split based on the RT1 and RT2 features, and then check which one gives the best solution.


g <- d$g1
h <- d$h1

lambda = 1.5
gamma = 1

# All Splits based on RT1

gain <- c()
s <- sort(unique(d$RT1))+.5

for(i in 1:length(s)) {
  
  split  <- (d$RT1<s[i])*1
  GL     <- sum(g[which(split==1)])
  GR     <- sum(g[which(split==0)])
  HL     <- sum(h[which(split==1)])
  HR     <- sum(h[which(split==0)])
  
  gain[i]  <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}

pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")

# All Splits for RT2

gain <- c()
s <- sort(unique(d$RT2))+.5

for(i in 1:length(s)) {
  
  split <- (d$RT2<s[i])*1
  GL     <- sum(g[which(split==1)])
  GR     <- sum(g[which(split==0)])
  HL     <- sum(h[which(split==1)])
  HR     <- sum(h[which(split==0)])
  
  gain[i]  <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}

pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")

pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]

kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
38.5	3.512	RT2
36.5	3.081	RT2
30.5	2.742	RT2
47.5	2.444	RT2
37.5	2.011	RT2
31.5	1.651	RT2

As we can see, the split with the largest gain is the one based on a threshold level of 38.5 for the RT2 feature. So, the first branch of the first tree is developed.

The algorithm will be repeated for the observations on each leaf. The algorithm will compute a gain score for all potential splits on each leaf separately to see if there is a new branch with a positive gain that can be added to any of the two leaves (or to both leaves). Let’s first look at all potential splits on the first leaf (RT2 < 38.5).


# All RT1 Splits on Leaf 1

    gain  <- c()
    s     <- sort(d[d$RT2<38.5,]$RT1)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT2<38.5 & d$RT1<s[i])
     gr2 <- which(d$RT2<38.5 & d$RT1>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp1) <- c("Threshold","Gain","Feature")


# All RT2 Splits on Leaf 1

    gain  <- c()
    s     <- sort(d[d$RT2<38.5,]$RT2)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT2<38.5 & d$RT2<s[i])
     gr2 <- which(d$RT2<38.5 & d$RT2>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp2) <- c("Threshold","Gain","Feature")


    pp <- rbind(pp1,pp2)
  pp <- pp[order(pp$Gain,decreasing=TRUE),]

  kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
49.5	0.148	RT1
58.5	-0.105	RT1
114.5	-0.275	RT1
172.5	-0.387	RT1
40.5	-0.529	RT1
55.5	-0.711	RT1

The results indicate that there is only one potential split with a positive gain, a split based on a threshold level of 49.5 for the RT1 feature. Now, let’s check all potential splits on the second leaf (RT2 > 38.5).


# All RT1 Splits on Leaf 2

    gain  <- c()
    s     <- sort(d[d$RT2>38.5,]$RT1)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT2>38.5 & d$RT1<s[i])
     gr2 <- which(d$RT2>38.5 & d$RT1>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp1) <- c("Threshold","Gain","Feature")


# All RT2 Splits on Leaf 2

    gain  <- c()
    s     <- sort(d[d$RT2>38.5,]$RT2)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT2>38.5 & d$RT2<s[i])
     gr2 <- which(d$RT2>38.5 & d$RT2>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp2) <- c("Threshold","Gain","Feature")


    pp <- rbind(pp1,pp2)
  pp <- pp[order(pp$Gain,decreasing=TRUE),]

  kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
30.5	-0.412	RT1
139.5	-1.000	RT1
200.5	-1.000	RT1
34.5	-1.103	RT1
64.5	-1.512	RT1
37.5	-1.598	RT1

It seems none of the possible splits on the second leaf yield a positive gain score. In this case, the search algorithm suggests adding a new branch on the first leaf based on a threshold level of 49.5 for the RT1 feature. Our first tree structure now looks like this.

The first tree structure with a depth of 2

The depth of our tree is now 2. As you can imagine, we can continue to search, and add even more branches to the tree by picking the best split as long as there is a positive gain. This would increase the depth of the tree. The more depth a tree structure has, the more complex interactions it accounts for when predicting the outcome. For the sake of this illustration, we stop and end Iteration 1 here.

Once, the tree structure is determined, we can now compute the weights for each leaf. Our first tree structure has three leaves now. Chen and Guestrin (2016) derived that the optimal leaf weight for a fixed tree structure is equal to

\[w_l=- \frac{\sum_{i \epsilon S_l}g_i}{\sum_{i \epsilon S_l}h_i+\lambda}\]

Let’s compute the leaf weights on the first tree structure.


  gr1 <- which(d$RT2 > 38.5)
  gr1


 [1]  2  6  7 12 13 15 16 17 18 19 20 21 22 23 24 25 27 28 30


  gr2 <- which(d$RT2 <38.5 & d$RT1<49.5)
  gr2


[1] 3 5 8 9


  gr3 <- which(d$RT2 <38.5 & d$RT1>49.5)
  gr3


[1]  1  4 10 11 14 26 29


        w1 = -sum(g[gr1])/(sum(h[gr1])+lambda)
        w2 = -sum(g[gr2])/(sum(h[gr2])+lambda)
        w3 = -sum(g[gr3])/(sum(h[gr3])+lambda)
        
        w1


[1] -1.04

w2


[1] 0.8

w3


[1] -0.1538462

These weights represent how we will improve our predictions at the end of Iteration 1 before we move on to Iteration 2. Each observation in our dataset is now assigned to one of these weights depending on which leaf they belong to. In the end, we add these estimated leaf weights to the previous predictions (\(\hat{y}^0\)) and obtain the new predictions at the end of Iteration 1 (\(\hat{y}^1\)).


 d$w  <- NA
 d$y1 <- NA
 
 d[gr1,]$w = w1
 d[gr2,]$w = w2
 d[gr3,]$w = w3
 
 d$y1 <- d[,5] + d$w
 
 colnames(d)[8] <- "$w^{1}$"
 colnames(d)[9] <- "$\\hat{y}^1$"
   
   
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4))

					Iteration 1
Obs	RT1	RT2	y	\(\hat{y}^0\)	g1	h1	\(w^{1}\)	\(\hat{y}^1\)
1	114	36	1	0	-0.5	0.25	-0.154	-0.154
2	92	121	1	0	-0.5	0.25	-1.040	-1.040
3	49	24	1	0	-0.5	0.25	0.800	0.800
4	58	30	1	0	-0.5	0.25	-0.154	-0.154
5	40	17	1	0	-0.5	0.25	0.800	0.800
6	67	66	1	0	-0.5	0.25	-1.040	-1.040
7	30	79	1	0	-0.5	0.25	-1.040	-1.040
8	19	14	1	0	-0.5	0.25	0.800	0.800
9	33	8	1	0	-0.5	0.25	0.800	0.800
10	172	38	1	0	-0.5	0.25	-0.154	-0.154
11	55	37	0	0	0.5	0.25	-0.154	-0.154
12	103	79	0	0	0.5	0.25	-1.040	-1.040
13	68	117	0	0	0.5	0.25	-1.040	-1.040
14	194	17	0	0	0.5	0.25	-0.154	-0.154
15	68	84	0	0	0.5	0.25	-1.040	-1.040
16	48	127	0	0	0.5	0.25	-1.040	-1.040
17	48	200	0	0	0.5	0.25	-1.040	-1.040
18	61	47	0	0	0.5	0.25	-1.040	-1.040
19	37	155	0	0	0.5	0.25	-1.040	-1.040
20	58	84	0	0	0.5	0.25	-1.040	-1.040
21	139	97	0	0	0.5	0.25	-1.040	-1.040
22	99	84	0	0	0.5	0.25	-1.040	-1.040
23	47	50	0	0	0.5	0.25	-1.040	-1.040
24	41	54	0	0	0.5	0.25	-1.040	-1.040
25	29	90	0	0	0.5	0.25	-1.040	-1.040
26	117	31	0	0	0.5	0.25	-0.154	-0.154
27	44	104	0	0	0.5	0.25	-1.040	-1.040
28	34	62	0	0	0.5	0.25	-1.040	-1.040
29	83	20	0	0	0.5	0.25	-0.154	-0.154
30	64	57	0	0	0.5	0.25	-1.040	-1.040

The value of the loss function based on \(y_i\) and \(\hat{y}^1\) at the end of Iteration 1 is 15.15 (reduced from 20.79)


y  <- d[,4]
y1 <- d[,9]

loss1 <- sum(y*log(1+exp(-y1)) + (1-y)*log(1+exp(y1)))
loss1


[1] 15.15075

Iteration 2

The remaining of this analysis is simply a repetition of Iteration 1. At the end of Iteration 1, we developed the first tree in our model. Now, we will develop the second one. First, we have to update our first- and second-order gradient statistics based on the new predictions at the end of Iteration 1.

\[g_i^{(2)} = -\frac{(y_i-1)e^{\hat{y_i}^{(1)}}+y_i}{e^{\hat{y_i}^{(1)}}+1}\]

\[h_i^{(2)} = \frac{e^{\hat{y_i}^{(1)}}}{(e^{\hat{y_i}^{(1)}}+1)^2}\]


d$g2 <- -((y-1)*exp(y1)+y)/(exp(y1)+1)
d$h2 <- exp(y1)/((exp(y1)+1)^2)

t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 2))

					Iteration 1				Iteration 2
Obs	RT1	RT2	y	\(\hat{y}^0\)	g1	h1	\(w^{1}\)	\(\hat{y}^1\)	g2	h2
1	114	36	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249
2	92	121	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193
3	49	24	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214
4	58	30	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249
5	40	17	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214
6	67	66	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193
7	30	79	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193
8	19	14	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214
9	33	8	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214
10	172	38	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249
11	55	37	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249
12	103	79	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
13	68	117	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
14	194	17	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249
15	68	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
16	48	127	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
17	48	200	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
18	61	47	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
19	37	155	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
20	58	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
21	139	97	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
22	99	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
23	47	50	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
24	41	54	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
25	29	90	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
26	117	31	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249
27	44	104	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
28	34	62	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193
29	83	20	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249
30	64	57	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193

The algorithm computes the same search as in the Iteration 1 for all possible splits based on the RT1 and RT2 features using the updated gradient statistics.


g <- d$g2
h <- d$h2

# All Splits based on RT1

gain <- c()
s <- sort(unique(d$RT1))+.5

for(i in 1:length(s)) {
  
  split  <- (d$RT1<s[i])*1
  GL     <- sum(g[which(split==1)])
  GR     <- sum(g[which(split==0)])
  HL     <- sum(h[which(split==1)])
  HR     <- sum(h[which(split==0)])
  
  gain[i]  <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}

pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")

# All Splits for RT2

gain <- c()
s <- sort(unique(d$RT2))+.5

for(i in 1:length(s)) {
  
  split <- (d$RT2<s[i])*1
  GL     <- sum(g[which(split==1)])
  GR     <- sum(g[which(split==0)])
  HL     <- sum(h[which(split==1)])
  HR     <- sum(h[which(split==0)])
  
  gain[i]  <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}

pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")

pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]

kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
33.5	0.009	RT1
79.5	0.001	RT2
38.5	-0.124	RT2
36.5	-0.248	RT2
40.5	-0.315	RT1
30.5	-0.350	RT2

It seems the best split for the first branch of the second tree is based on a threshold value of 33.5 for the RT1 feature.

Similar to Iteration 1, the algorithm will compute a gain score for all potential splits on each leaf of the second tree to see if there is a new branch with a positive gain that can be added to any of the two leaves (or to both leaves).

Let’s first look at all potential splits on the first leaf (RT1 > 33.5).


# All RT1 Splits on Leaf 1

    gain  <- c()
    s     <- sort(d[d$RT1>33.5,]$RT1)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT1>33.5 & d$RT1<s[i])
     gr2 <- which(d$RT1>33.5 & d$RT1>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp1) <- c("Threshold","Gain","Feature")


# All RT2 Splits on Leaf 1

    gain  <- c()
    s     <- sort(d[d$RT1>33.5,]$RT2)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT1>33.5 & d$RT2<s[i])
     gr2 <- which(d$RT1>33.5 & d$RT2>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp2) <- c("Threshold","Gain","Feature")


    pp <- rbind(pp1,pp2)
  pp <- pp[order(pp$Gain,decreasing=TRUE),]

  kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
38.5	-0.319	RT1
36.5	-0.517	RT1
47.5	-0.567	RT1
66.5	-0.631	RT1
30.5	-0.680	RT1
83.5	-0.702	RT1

The results indicate that there is no split with a positive gain that can be added to the first leaf. Let’s check the second leaf (RT1 < 33.5).


# All RT1 Splits on Leaf 2

    gain  <- c()
    s     <- sort(d[d$RT1<33.5,]$RT1)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT1<33.5 & d$RT1<s[i])
     gr2 <- which(d$RT1<33.5 & d$RT1>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp1) <- c("Threshold","Gain","Feature")


# All RT2 Splits on Leaf 2

    gain  <- c()
    s     <- sort(d[d$RT1<33.5,]$RT2)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT1<33.5 & d$RT2<s[i])
     gr2 <- which(d$RT1<33.5 & d$RT2>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp2) <- c("Threshold","Gain","Feature")


    pp <- rbind(pp1,pp2)
  pp <- pp[order(pp$Gain,decreasing=TRUE),]

  kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
79.5	-0.610	RT1
29.5	-0.943	RT1
33.5	-1.000	RT1
90.5	-1.000	RT1
19.5	-1.169	RT1
30.5	-1.169	RT1

There is also no split with a positive gain that can be added to the second leaf. Therefore, we stop and Iteration 2 ends. Our second tree has a depth of one. Let’s compute the leaf weights on the second tree structure.


gr1 <- which(d$RT1 > 33.5)
gr2 <- which(d$RT1 < 33.5)

w1 = -sum(g[gr1])/(sum(h[gr1])+lambda)
w2 = -sum(g[gr2])/(sum(h[gr2])+lambda)

w1


[1] -0.2951779

w2


[1] 0.4744527

These weights similarly represent how we will improve our predictions at the end of Iteration 2 before we move on to Iteration 3. Each observation in our dataset is now assigned to one of these weights depending on which leaf they belong to in Tree 2. In the end, we add these estimated leaf weights to the previous predictions (\(\hat{y}^1\)) and obtain the new predictions at the end of Iteration 2 (\(\hat{y}^2\)).


 d$w  <- NA
 d$y2 <- NA
 
 d[gr1,]$w = w1
 d[gr2,]$w = w2
 
 d$y2 <- d[,9] + d$w
 
 colnames(d)[12] <- "$w^{2}$"
 colnames(d)[13] <- "$\\hat{y}^2$"
   
   
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 4))

					Iteration 1				Iteration 2
Obs	RT1	RT2	y	\(\hat{y}^0\)	g1	h1	\(w^{1}\)	\(\hat{y}^1\)	g2	h2	\(w^{2}\)	\(\hat{y}^2\)
1	114	36	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249	-0.295	-0.449
2	92	121	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193	-0.295	-1.335
3	49	24	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	-0.295	0.505
4	58	30	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249	-0.295	-0.449
5	40	17	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	-0.295	0.505
6	67	66	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193	-0.295	-1.335
7	30	79	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193	0.474	-0.566
8	19	14	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	0.474	1.274
9	33	8	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	0.474	1.274
10	172	38	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249	-0.295	-0.449
11	55	37	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449
12	103	79	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
13	68	117	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
14	194	17	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449
15	68	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
16	48	127	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
17	48	200	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
18	61	47	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
19	37	155	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
20	58	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
21	139	97	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
22	99	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
23	47	50	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
24	41	54	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
25	29	90	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	0.474	-0.566
26	117	31	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449
27	44	104	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
28	34	62	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335
29	83	20	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449
30	64	57	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335

The value of the loss function based on \(y_i\) and \(\hat{y}^2\) at the end of Iteration 2 is 14.35 (reduced from 15.15).


y  <- d[,4]
y2 <- d[,13]

loss2 <- sum(y*log(1+exp(-y2)) + (1-y)*log(1+exp(y2)))
loss2


[1] 14.34646

Our tree ensemble model at the end of Iteration 2 now looks like this.

Iteration 3

We keep repeating the same cycle. First, we update our first- and second-order gradient statistics based on the new predictions at the end of Iteration 2.

\[g_i^{(3)} = -\frac{(y_i-1)e^{\hat{y_i}^{(2)}}+y_i}{e^{\hat{y_i}^{(2)}}+1}\]

\[h_i^{(3)} = \frac{e^{\hat{y_i}^{(2)}}}{(e^{\hat{y_i}^{(2)}}+1)^2}\]


d$g3 <- -((y-1)*exp(y2)+y)/(exp(y2)+1)
d$h3 <- exp(y2)/((exp(y2)+1)^2)

t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 4, "Iteration 3" = 2))

					Iteration 1				Iteration 2				Iteration 3
Obs	RT1	RT2	y	\(\hat{y}^0\)	g1	h1	\(w^{1}\)	\(\hat{y}^1\)	g2	h2	\(w^{2}\)	\(\hat{y}^2\)	g3	h3
1	114	36	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249	-0.295	-0.449	-0.610	0.238
2	92	121	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193	-0.295	-1.335	-0.792	0.165
3	49	24	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	-0.295	0.505	-0.376	0.235
4	58	30	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249	-0.295	-0.449	-0.610	0.238
5	40	17	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	-0.295	0.505	-0.376	0.235
6	67	66	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193	-0.295	-1.335	-0.792	0.165
7	30	79	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193	0.474	-0.566	-0.638	0.231
8	19	14	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	0.474	1.274	-0.218	0.171
9	33	8	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	0.474	1.274	-0.218	0.171
10	172	38	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249	-0.295	-0.449	-0.610	0.238
11	55	37	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449	0.390	0.238
12	103	79	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
13	68	117	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
14	194	17	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449	0.390	0.238
15	68	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
16	48	127	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
17	48	200	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
18	61	47	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
19	37	155	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
20	58	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
21	139	97	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
22	99	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
23	47	50	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
24	41	54	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
25	29	90	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	0.474	-0.566	0.362	0.231
26	117	31	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449	0.390	0.238
27	44	104	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
28	34	62	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165
29	83	20	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449	0.390	0.238
30	64	57	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165

The algorithm searches among all possible splits based on the RT1 and RT2 features using the updated gradient statistics.


g <- d$g3
h <- d$h3

# All Splits based on RT1

gain <- c()
s <- sort(unique(d$RT1))+.5

for(i in 1:length(s)) {
  
  split  <- (d$RT1<s[i])*1
  GL     <- sum(g[which(split==1)])
  GR     <- sum(g[which(split==0)])
  HL     <- sum(h[which(split==1)])
  HR     <- sum(h[which(split==0)])
  
  gain[i]  <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}

pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")

# All Splits for RT2

gain <- c()
s <- sort(unique(d$RT2))+.5

for(i in 1:length(s)) {
  
  split <- (d$RT2<s[i])*1
  GL     <- sum(g[which(split==1)])
  GR     <- sum(g[which(split==0)])
  HL     <- sum(h[which(split==1)])
  HR     <- sum(h[which(split==0)])
  
  gain[i]  <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}

pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")

pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]

kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
79.5	0.116	RT2
38.5	-0.131	RT2
36.5	-0.356	RT2
47.5	-0.382	RT2
66.5	-0.436	RT2
30.5	-0.542	RT2

The best split was found based on a threshold value of 79..5 for the RT2 feature.

The algorithm computes a gain score for all potential splits on each leaf of the third tree to see if there is a new branch with a positive gain that can be added to any of the two leaves (or to both leaves).

Let’s first look at all potential splits on the first leaf (RT2 > 79.5).


# All RT1 Splits on Leaf 1

    gain  <- c()
    s     <- sort(d[d$RT2>79.5,]$RT1)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT2>79.5 & d$RT1<s[i])
     gr2 <- which(d$RT2>79.5 & d$RT1>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp1) <- c("Threshold","Gain","Feature")


# All RT2 Splits on Leaf 1

    gain  <- c()
    s     <- sort(d[d$RT2>79.5,]$RT2)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT2>79.5 & d$RT2<s[i])
     gr2 <- which(d$RT2>79.5 & d$RT2>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp2) <- c("Threshold","Gain","Feature")


    pp <- rbind(pp1,pp2)
  pp <- pp[order(pp$Gain,decreasing=TRUE),]

  kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
68.5	-0.399	RT1
68.5	-0.399	RT1
117.5	-0.650	RT1
58.5	-0.846	RT1
104.5	-0.846	RT1
48.5	-0.995	RT1

The results indicate that there is no split with a positive gain that can be added to the first leaf. Let’s check the second leaf (RT2 < 79.5).


# All RT1 Splits on Leaf 2

    gain  <- c()
    s     <- sort(d[d$RT2<79.5,]$RT1)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT2<79.5 & d$RT1<s[i])
     gr2 <- which(d$RT2<79.5 & d$RT1>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp1) <- c("Threshold","Gain","Feature")


# All RT2 Splits on Leaf 2

    gain  <- c()
    s     <- sort(d[d$RT2<79.5,]$RT2)+.5

    for(i in 1:length(s)) {

       gr1 <- which(d$RT2<79.5 & d$RT2<s[i])
     gr2 <- which(d$RT2<79.5 & d$RT2>s[i])

       GL     <- sum(g[gr1])
       GR     <- sum(g[gr2])
       HL     <- sum(h[gr1])
       HR     <- sum(h[gr2])
       gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma            
    }

    pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
    colnames(pp2) <- c("Threshold","Gain","Feature")


    pp <- rbind(pp1,pp2)
  pp <- pp[order(pp$Gain,decreasing=TRUE),]

  kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
172.5	-0.610	RT1
67.5	-0.687	RT1
62.5	-0.743	RT1
114.5	-0.784	RT1
40.5	-0.842	RT1
33.5	-0.879	RT1

There is also no split with a positive gain that can be added to the second leaf. Therefore, we stop and Iteration 3 ends. Our third tree has a depth of one as the second tree. Let’s compute the leaf weights on the third tree structure.


gr1 <- which(d$RT2 > 79.5)
gr2 <- which(d$RT2 < 79.5)

w1 = -sum(g[gr1])/(sum(h[gr1])+lambda)
w2 = -sum(g[gr2])/(sum(h[gr2])+lambda)

w1


[1] -0.4275847

w2


[1] 0.3063322

We use these weights to update the predictions at the end of Iteration 3 before we move on to Iteration 4. Each observation in our dataset is assigned to one of these weights depending on which leaf they belong to in Tree 3. Then, we add these estimated leaf weights to the previous predictions (\(\hat{y}^2\)) and obtain the new predictions at the end of Iteration 3 (\(\hat{y}^3\)).


 d$w  <- NA
 d$y3 <- NA
 
 d[gr1,]$w = w1
 d[gr2,]$w = w2
 
 d$y3 <- d[,13] + d$w
 
 colnames(d)[16] <- "$w^{3}$"
 colnames(d)[17] <- "$\\hat{y}^3$"
   
   
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 4, "Iteration 3" = 4))

					Iteration 1				Iteration 2				Iteration 3
Obs	RT1	RT2	y	\(\hat{y}^0\)	g1	h1	\(w^{1}\)	\(\hat{y}^1\)	g2	h2	\(w^{2}\)	\(\hat{y}^2\)	g3	h3	\(w^{3}\)	\(\hat{y}^3\)
1	114	36	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249	-0.295	-0.449	-0.610	0.238	0.306	-0.143
2	92	121	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193	-0.295	-1.335	-0.792	0.165	-0.428	-1.763
3	49	24	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	-0.295	0.505	-0.376	0.235	0.306	0.811
4	58	30	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249	-0.295	-0.449	-0.610	0.238	0.306	-0.143
5	40	17	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	-0.295	0.505	-0.376	0.235	0.306	0.811
6	67	66	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193	-0.295	-1.335	-0.792	0.165	0.306	-1.029
7	30	79	1	0	-0.5	0.25	-1.040	-1.040	-0.739	0.193	0.474	-0.566	-0.638	0.231	0.306	-0.259
8	19	14	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	0.474	1.274	-0.218	0.171	0.306	1.581
9	33	8	1	0	-0.5	0.25	0.800	0.800	-0.310	0.214	0.474	1.274	-0.218	0.171	0.306	1.581
10	172	38	1	0	-0.5	0.25	-0.154	-0.154	-0.538	0.249	-0.295	-0.449	-0.610	0.238	0.306	-0.143
11	55	37	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449	0.390	0.238	0.306	-0.143
12	103	79	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	0.306	-1.029
13	68	117	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	-0.428	-1.763
14	194	17	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449	0.390	0.238	0.306	-0.143
15	68	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	-0.428	-1.763
16	48	127	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	-0.428	-1.763
17	48	200	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	-0.428	-1.763
18	61	47	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	0.306	-1.029
19	37	155	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	-0.428	-1.763
20	58	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	-0.428	-1.763
21	139	97	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	-0.428	-1.763
22	99	84	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	-0.428	-1.763
23	47	50	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	0.306	-1.029
24	41	54	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	0.306	-1.029
25	29	90	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	0.474	-0.566	0.362	0.231	-0.428	-0.993
26	117	31	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449	0.390	0.238	0.306	-0.143
27	44	104	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	-0.428	-1.763
28	34	62	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	0.306	-1.029
29	83	20	0	0	0.5	0.25	-0.154	-0.154	0.462	0.249	-0.295	-0.449	0.390	0.238	0.306	-0.143
30	64	57	0	0	0.5	0.25	-1.040	-1.040	0.261	0.193	-0.295	-1.335	0.208	0.165	0.306	-1.029

The value of the loss function based on \(y_i\) and \(\hat{y}^3\) at the end of Iteration 3 is 13.57 (reduced from 14.35).


y  <- d[,4]
y3 <- d[,17]

loss3 <- sum(y*log(1+exp(-y3)) + (1-y)*log(1+exp(y3)))
loss3


[1] 13.5685

Our tree ensemble model at the end of Iteration 3 looks like this.

Iteration 4

We update our first- and second-order gradient statistics based on the new predictions at the end of Iteration 3. As can be seen below, the search process did not find any split with a positive gain score. Therefore, the whole search process ended because there is no fourth tree structure that can be added with an improvement in our predictions. So, our final model remains the structure at the end of Iteration 3.

\[g_i^{(4)} = -\frac{(y_i-1)e^{\hat{y_i}^{(3)}}+y_i}{e^{\hat{y_i}^{(3)}}+1}\]

\[h_i^{(4)} = \frac{e^{\hat{y_i}^{(3)}}}{(e^{\hat{y_i}^{(3)}}+1)^2}\]


d$g4 <- -((y-1)*exp(y3)+y)/(exp(y3)+1)
d$h4 <- exp(y3)/((exp(y3)+1)^2)

g <- d$g4
h <- d$h4

# All Splits based on RT1

gain <- c()
s <- sort(unique(d$RT1))+.5

for(i in 1:length(s)) {
  
  split  <- (d$RT1<s[i])*1
  GL     <- sum(g[which(split==1)])
  GR     <- sum(g[which(split==0)])
  HL     <- sum(h[which(split==1)])
  HR     <- sum(h[which(split==0)])
  
  gain[i]  <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}

pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")

# All Splits for RT2

gain <- c()
s <- sort(unique(d$RT2))+.5

for(i in 1:length(s)) {
  
  split <- (d$RT2<s[i])*1
  GL     <- sum(g[which(split==1)])
  GR     <- sum(g[which(split==0)])
  HL     <- sum(h[which(split==1)])
  HR     <- sum(h[which(split==0)])
  
  gain[i]  <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}

pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")

pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]

kable(head(pp),digits=3,row.names=FALSE)

Threshold	Gain	Feature
38.5	-0.662	RT2
33.5	-0.690	RT1
36.5	-0.721	RT2
30.5	-0.769	RT2
40.5	-0.787	RT1
79.5	-0.797	RT2

Using the tree model to predict the outcome for a new observation

Since we just developed (trained) our model, we can now use it to predict the probability of the outcome for a new observation. Suppose a new test taker comes and responds the first item and second item in 80 and 40 seconds, respectively (RT1 = 80, RT2= 40). We first place this examinee on the corresponding leaf for each tree depending on these inputs.

For Tree 1, this person is on the first leaf since RT2 > 38.5, so the corresponding leaf weight is -1.040.
For Tree 2, this person is on the first leaf since RT1 > 33.5, so the corresponding leaf weight is -0.295.
For Tree 3, this person is on the second leaf since RT2 < 79.5, so the corresponding leaf weight is 0.306.

We sum the weights for this person from all trees, \[(-1.040) + (-0.295) + (-0.306)= -1.029.\] Note that this is on the logit scale, and we can transform it to a probability as \[\frac{e^{-1.641}}{e^{-1.641}+1} = 0.263.\] So, a person with this pattern of response time, our model predicts that there is a 26% chance that this person had item preknowledge.

Let’s do another example. Suppose another test taker comes and responds the first item and second item in 25 and 15 seconds, respectively (RT1 = 25, RT2= 15). This pattern corresponds to a weight of 0.800 (RT2 < 38.5 and RT1 < 49.5) for the first tree, a weight of 0.474 (RT1 < 33.5) for the second tree, and a weight of 0.306 (RT < 79.5) for the third tree. The sum of the weights is 1.58, corresponding to a probability of 0.83. So, a person with this particular pattern of response time, our model predicts that there is a 83% chance that this person had item preknowledge.

Notice that the smaller the response time is (and exceeds certain threshold levels in the model), the more likely the predicted probability of item preknowledge is. This makes sense in this context, so the model works as it is supposed to.

Running the analysis using the `xgboost` package in R


require(xgboost)

dtrain <- xgb.DMatrix(data = data.matrix(d[,2:3]), label=d$y)

fit <- xgb.train(data=dtrain,
                 nround = 5,
                 eta = 1,
                 min_child_weight = 0,
                 max_depth = 2,
                 gamma = 1,
                 max_delta_step=0,
                 subsample = 1,
                 colsample_bytree = 1,
                 lambda = 1.5,
                 alpha =0,
                 scale_pos_weight =1,
                 num_parallel_tree = 1,
                 nthread = 1, 
                 objective = "binary:logistic",
                 eval_metric="rmse",predict=TRUE)

First, let me briefly comment on some of the parameters here, and you can find many other webpage to give you more information about them.

As you can see, I set the gamma = 1 and lambda = 1.5. These are the same numbers I set the value for gamma and lambda in my demonstration above.
nround = 5 is the number of maximum iterations you want to run, so this is technically the maximum number of trees you want to have in your tree ensemble model.
eta is the learning parameter. In the above demonstration, I use eta=1 without explicitly saying because I added the exact amount of weight to update the prediction at each iteration. Instead, one can decide to add a smaller fraction of the weight (such as half of it, or 10% of it) to update the prediction. In this case, for instance, you can set eta=.1, and this tells the algorithm to add 10% of the leaf weight estimated at each iteration when updating the predictions.
min_child_weight is the minimum sum of the second-order gradient statistics to add a new branch.
max_depth is the maximum depth of a tree. I set it to 2 as in my demo above.
alpha is another regularization parameter such as lambda and gamma. The original paper does not mention it, so I am not quite sure how it changes the formulas I provided in this tutorial, but it can be in the source code the algorithm is using (https://github.com/dmlc/xgboost/blob/master/src/tree/param.h, see Line 45, Line 142-145, Line 303-310). I set it to 0, so the XGBoost runs the same way in my demonstration.

Now, let’s check the tree structure the original XGBoost algorithm comes up.


xgb.plot.tree(model = fit)

Tree Model Calibrated by the Original XGBoost function

Let’s try to understand the numbers in this model calibrated by the XGboost function and check if it is aligned with the numbers in our illustration above. I will briefly go over the numbers for the first tree and will leave it to you to check the numbers for the other two trees.

Tree 0

There are two numbers listed in the first box of Tree 0.

Cover: 7.5
Gain: 4.51163387

What Cover: 7.5 here refers is the sum of the second order gradient statistics for all 30 observations in the first iteration. So, if you go back to the big table reported above and sum the values of h1, that gives you 7.5. The Gain score reported is 4.5116. If we go back and check the gain score we found, it is 3.512. If you check the formula for the Gain above, you will see the \(\gamma\) is substracted at the end in our computation. So, the difference of 1 (after rounding to the third decimal) between the Gain Score I computed and the Gain Score XGBoost reported is probably due to the \(\gamma\) parameter used, \(\gamma = 1\).

I believe the original XGBoost must be using the regularization term for the search as we specify \(\gamma = 1\) in the function; however, it is probably reporting the Gain Score without substracting \(\gamma\) when it comes to reporting in this plot.

First, the threshold to split the first branch based on RT2 value is reported as 42.5. In my search, I found it as 38.5. So, this is a bit confusing. However, if we sort the values available on RT2, we see that there is no value in the data available between 38 and 47.


sort(d$RT2)


 [1]   8  14  17  17  20  24  30  31  36  37  38  47  50  54  57  62
[17]  66  79  79  84  84  84  90  97 104 117 121 127 155 200

So, practically, there is no difference in splitting this training sample based on a threshold value of 38.5 or a threshold value of 42.5. The number comes up in my computation by design. In my naive search, I added .5 to each unique value in the dataset and then did a brute-force search for every possible score. That’s why 38.5 came up in my search as the threshold value. I am not sure how XGBoost comes up with 42.5, but I guess it is picking the middle of the interval. I don’t think it is coincidence that 42.5 is the mean of 38 and 47. In short, there is no difference between choosing 38.5 as threshold or 42.5 as threshold for this small training sample.

Within the ellipse, there are two values Cover:4.75 and Value:-1.03999996. Cover is again the sum of the second gradient statistics for those observations on this particular leaf (RT2 > 42.5), and Value corresponds to the leaf weight for those observations on this leaf. Note that, the leaf weight we computed above in my computation is -1.04. So, we got the same weight in our computation.

Within the box, there are two values Cover:2.75 and Gain:1.14751124. Cover is the sum of the second gradient statistics for those observations on this particular leaf (RT2 < 42.5), and Gain is the gain score obtained for the next additional best split on this leaf. Note that the gain score for this split I computed above is 0.148. Again, there is a difference of 1 (\(\gamma = 1\)) because it seems XGBoost doesn’t care about the regularization parameter when it comes to reporting the gain scores in this plot.

Again, the threshold value for the second split on the RT1 variable is 52. In my naive search, I came up with 49.5. Similarly, there is no value in the data between 49 and 55 for the RT1 variable, and it is no coincidence that XGBoost is reporting the middle of 49 and 55 as threshold. So, practically, there is no difference picking a threshold of 49.5 or 52 for this training sample.


sort(d$RT1)


 [1]  19  29  30  33  34  37  40  41  44  47  48  48  49  55  58  58
[17]  61  64  67  68  68  83  92  99 103 114 117 139 172 194

Within the first ellipse above, we see Cover:1 and Value:0.80000000012. Cover is the sum of the second gradient statistics for those observations on this particular leaf (RT2 < 42.5 and RT1 < 52) and Value corresponds to the leaf weight. Note that we computed the same weight in our computation for this leaf.

Within the second ellipse above, we see Cover:1.75 and Value:-0.15384616. Cover is the sum of the second gradient statistics for those observations on this particular leaf (RT2 < 42.5 and RT1 > 52) and Value corresponds to the leaf weight. Again, note that we computed the exact same weight for this leaf as well.

Final Word

I hope this detailed illustration helps interested people understand better the mechanics of XGBoost. It certainly helped me. I have a better grasp of what is happening behind the scene whenever I use XGBoost for most complex problems, and the numbers make more sense whenever I look at the calibrated tree ensemble model.

Please feel free to send me any feedback and comments. I am happy to address any errors/typos in the post.

Thanks for reading.

Obs	RT1	RT2	y
1	114	36	1
2	92	121	1
3	49	24	1
4	58	30	1
5	40	17	1
6	67	66	1
7	30	79	1
8	19	14	1
9	33	8	1
10	172	38	1
11	55	37	0
12	103	79	0
13	68	117	0
14	194	17	0
15	68	84	0
16	48	127	0
17	48	200	0
18	61	47	0
19	37	155	0
20	58	84	0
21	139	97	0
22	99	84	0
23	47	50	0
24	41	54	0
25	29	90	0
26	117	31	0
27	44	104	0
28	34	62	0
29	83	20	0
30	64	57	0

Obs	RT1	RT2	y
1	114	36	1
2	92	121	1
3	49	24	1
4	58	30	1
5	40	17	1
6	67	66	1
7	30	79	1
8	19	14	1
9	33	8	1
10	172	38	1
11	55	37	0
12	103	79	0
13	68	117	0
14	194	17	0
15	68	84	0
16	48	127	0
17	48	200	0
18	61	47	0
19	37	155	0
20	58	84	0
21	139	97	0
22	99	84	0
23	47	50	0
24	41	54	0
25	29	90	0
26	117	31	0
27	44	104	0
28	34	62	0
29	83	20	0
30	64	57	0

Obs	RT1	RT2	y
1	114	36	1
2	92	121	1
3	49	24	1
4	58	30	1
5	40	17	1
6	67	66	1
7	30	79	1
8	19	14	1
9	33	8	1
10	172	38	1
11	55	37	0
12	103	79	0
13	68	117	0
14	194	17	0
15	68	84	0
16	48	127	0
17	48	200	0
18	61	47	0
19	37	155	0
20	58	84	0
21	139	97	0
22	99	84	0
23	47	50	0
24	41	54	0
25	29	90	0
26	117	31	0
27	44	104	0
28	34	62	0
29	83	20	0
30	64	57	0

Obs	RT1	RT2	y
1	114	36	1
2	92	121	1
3	49	24	1
4	58	30	1
5	40	17	1
6	67	66	1
7	30	79	1
8	19	14	1
9	33	8	1
10	172	38	1
11	55	37	0
12	103	79	0
13	68	117	0
14	194	17	0
15	68	84	0
16	48	127	0
17	48	200	0
18	61	47	0
19	37	155	0
20	58	84	0
21	139	97	0
22	99	84	0
23	47	50	0
24	41	54	0
25	29	90	0
26	117	31	0
27	44	104	0
28	34	62	0
29	83	20	0
30	64	57	0

How Does Extreme Gradient Boosting (XGBoost) Work?