In one of my working papers currently under review, I use Extreme Gradient Boosting (XGBoost) to identify examinees with potential item preknowledge in a certification exam. In the first draft I submitted, I did not say much about the technical aspects of XGBoost and went straight to my application. However, one of the reviewers asked for more description of XGBoost to help readers get a conceptual understanding of how it works.
After spending a few weeks going over the original XGBoost paper and this presentation many times, I finally felt that I had a good grasp of it. I then came up with a small numerical illustration in the context of the problem I am working on. Here, I provide an informal introduction to this illustration with accompanying R code, so that anyone can play with it.
You can download the illustration dataset from this link.
Suppose we have 30 observations with three variables, \(RT_1\), \(RT_2\), and \(y\). \(RT_1\) and \(RT_2\) are the response times (in seconds) for two test items and are our features (predictors). \(y\) is a binary outcome variable, with 0 indicating that the individual did not have item preknowledge and 1 indicating that the individual had item preknowledge. The goal is to develop a tree ensemble model to predict the outcome given the feature variables.
d <- read.csv("data/illustration2.csv")
d$Obs <- 1:30
d <- d[,c(4,1,2,3)]
require(knitr)
require(kableExtra)
t = kable(d,digits=1,row.names=FALSE)
column_spec(t,column=1:4,width="1em")
Obs | RT1 | RT2 | y |
---|---|---|---|
1 | 114 | 36 | 1 |
2 | 92 | 121 | 1 |
3 | 49 | 24 | 1 |
4 | 58 | 30 | 1 |
5 | 40 | 17 | 1 |
6 | 67 | 66 | 1 |
7 | 30 | 79 | 1 |
8 | 19 | 14 | 1 |
9 | 33 | 8 | 1 |
10 | 172 | 38 | 1 |
11 | 55 | 37 | 0 |
12 | 103 | 79 | 0 |
13 | 68 | 117 | 0 |
14 | 194 | 17 | 0 |
15 | 68 | 84 | 0 |
16 | 48 | 127 | 0 |
17 | 48 | 200 | 0 |
18 | 61 | 47 | 0 |
19 | 37 | 155 | 0 |
20 | 58 | 84 | 0 |
21 | 139 | 97 | 0 |
22 | 99 | 84 | 0 |
23 | 47 | 50 | 0 |
24 | 41 | 54 | 0 |
25 | 29 | 90 | 0 |
26 | 117 | 31 | 0 |
27 | 44 | 104 | 0 |
28 | 34 | 62 | 0 |
29 | 83 | 20 | 0 |
30 | 64 | 57 | 0 |
First, we need an objective function to optimize, so that we can check whether we are improving our model. In the context of predicting a binary outcome, XGBoost uses the logistic loss function
\[\ell(y_i,\hat{y}_i)=y_i \ln(1+e^{-\hat{y}_i})+(1-y_i)\ln(1+e^{\hat{y}_i}),\]
where \(y_i\) and \(\hat{y}_i\) are the observed and predicted outcome for the \(i^{th}\) observation. Before we start iterations, XGBoost assigns a predicted value of 0 for each observation as our initial prediction. Let’s call this Iteration 0.
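For reference, the loss in the equation above can be written as a small R helper (a sketch of my own; the computations below simply write the same expression out inline).
# Logistic loss used throughout this post
logistic_loss <- function(y, yhat) {
  sum(y * log(1 + exp(-yhat)) + (1 - y) * log(1 + exp(yhat)))
}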
d$y0 <- 0
colnames(d)[5] <- "$\\hat{y}^0$"
t = kable(d,digits=1,row.names=FALSE)
column_spec(t,column=1:5,width="1em")
Obs | RT1 | RT2 | y | \(\hat{y}^0\) |
---|---|---|---|---|
1 | 114 | 36 | 1 | 0 |
2 | 92 | 121 | 1 | 0 |
3 | 49 | 24 | 1 | 0 |
4 | 58 | 30 | 1 | 0 |
5 | 40 | 17 | 1 | 0 |
6 | 67 | 66 | 1 | 0 |
7 | 30 | 79 | 1 | 0 |
8 | 19 | 14 | 1 | 0 |
9 | 33 | 8 | 1 | 0 |
10 | 172 | 38 | 1 | 0 |
11 | 55 | 37 | 0 | 0 |
12 | 103 | 79 | 0 | 0 |
13 | 68 | 117 | 0 | 0 |
14 | 194 | 17 | 0 | 0 |
15 | 68 | 84 | 0 | 0 |
16 | 48 | 127 | 0 | 0 |
17 | 48 | 200 | 0 | 0 |
18 | 61 | 47 | 0 | 0 |
19 | 37 | 155 | 0 | 0 |
20 | 58 | 84 | 0 | 0 |
21 | 139 | 97 | 0 | 0 |
22 | 99 | 84 | 0 | 0 |
23 | 47 | 50 | 0 | 0 |
24 | 41 | 54 | 0 | 0 |
25 | 29 | 90 | 0 | 0 |
26 | 117 | 31 | 0 | 0 |
27 | 44 | 104 | 0 | 0 |
28 | 34 | 62 | 0 | 0 |
29 | 83 | 20 | 0 | 0 |
30 | 64 | 57 | 0 | 0 |
The value of the loss function based on \(y_i\) and \(\hat{y}^0\) at Iteration 0 is 20.79442.
y <- d[,4]
y0 <- d[,5]
loss0 <- sum(y*log(1+exp(-y0)) + (1-y)*log(1+exp(y0)))
loss0
[1] 20.79442
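This number is easy to verify by hand: with \(\hat{y}_i = 0\), every observation contributes \(\ln(1+e^{0}) = \ln(2)\) to the loss regardless of its \(y_i\), so the total is
\[30 \times \ln(2) \approx 20.794.\]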
To build the first tree, we need the first- and second-order gradient statistics of the loss function from Iteration 0. Chen and Guestrin (2016) used a second-order Taylor approximation of the loss function and derived the equations for the first- and second-order gradient statistics. In this case, we can compute them for each observation as follows:
\[g_i^{(1)} = -\frac{(y_i-1)e^{\hat{y}_i^{(0)}}+y_i}{e^{\hat{y}_i^{(0)}}+1}\]
\[h_i^{(1)} = \frac{e^{\hat{y}_i^{(0)}}}{(e^{\hat{y}_i^{(0)}}+1)^2}\]
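These two quantities are simply the first and second derivatives of the logistic loss above with respect to the predicted score. If we write \(p_i = e^{\hat{y}_i^{(0)}}/(1+e^{\hat{y}_i^{(0)}})\) for the predicted probability, they reduce to
\[g_i^{(1)} = p_i - y_i, \qquad h_i^{(1)} = p_i(1-p_i).\]
At \(\hat{y}_i^{(0)} = 0\), every observation has \(p_i = 0.5\), so \(g_i^{(1)}\) is either \(-0.5\) or \(0.5\) and \(h_i^{(1)}\) is \(0.25\), the values shown in the table below.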
d$g1 <- -((y-1)*exp(y0)+y)/(exp(y0)+1)
d$h1 <- exp(y0)/((exp(y0)+1)^2)
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 2))
Obs | RT1 | RT2 | y | \(\hat{y}^0\) | g1 | h1 |
---|---|---|---|---|---|---|
1 | 114 | 36 | 1 | 0 | -0.5 | 0.25 |
2 | 92 | 121 | 1 | 0 | -0.5 | 0.25 |
3 | 49 | 24 | 1 | 0 | -0.5 | 0.25 |
4 | 58 | 30 | 1 | 0 | -0.5 | 0.25 |
5 | 40 | 17 | 1 | 0 | -0.5 | 0.25 |
6 | 67 | 66 | 1 | 0 | -0.5 | 0.25 |
7 | 30 | 79 | 1 | 0 | -0.5 | 0.25 |
8 | 19 | 14 | 1 | 0 | -0.5 | 0.25 |
9 | 33 | 8 | 1 | 0 | -0.5 | 0.25 |
10 | 172 | 38 | 1 | 0 | -0.5 | 0.25 |
11 | 55 | 37 | 0 | 0 | 0.5 | 0.25 |
12 | 103 | 79 | 0 | 0 | 0.5 | 0.25 |
13 | 68 | 117 | 0 | 0 | 0.5 | 0.25 |
14 | 194 | 17 | 0 | 0 | 0.5 | 0.25 |
15 | 68 | 84 | 0 | 0 | 0.5 | 0.25 |
16 | 48 | 127 | 0 | 0 | 0.5 | 0.25 |
17 | 48 | 200 | 0 | 0 | 0.5 | 0.25 |
18 | 61 | 47 | 0 | 0 | 0.5 | 0.25 |
19 | 37 | 155 | 0 | 0 | 0.5 | 0.25 |
20 | 58 | 84 | 0 | 0 | 0.5 | 0.25 |
21 | 139 | 97 | 0 | 0 | 0.5 | 0.25 |
22 | 99 | 84 | 0 | 0 | 0.5 | 0.25 |
23 | 47 | 50 | 0 | 0 | 0.5 | 0.25 |
24 | 41 | 54 | 0 | 0 | 0.5 | 0.25 |
25 | 29 | 90 | 0 | 0 | 0.5 | 0.25 |
26 | 117 | 31 | 0 | 0 | 0.5 | 0.25 |
27 | 44 | 104 | 0 | 0 | 0.5 | 0.25 |
28 | 34 | 62 | 0 | 0 | 0.5 | 0.25 |
29 | 83 | 20 | 0 | 0 | 0.5 | 0.25 |
30 | 64 | 57 | 0 | 0 | 0.5 | 0.25 |
Now, the tricky part is to find the split of these observations that improves the predictions the most (i.e., reduces the value of the loss function the most). The observations can be split based on either the RT1 or the RT2 feature. For instance, suppose we divide the observations into two groups based on whether the RT1 value is below or above 58.5 seconds. Below is a hypothetical tree structure with two leaves. There are 16 observations in the first leaf (\(S_1\)) and 14 observations in the second leaf (\(S_2\)).
There are many potential splits available; in theory, the number of ways to split 30 observations into two groups is \(2^{30}\). However, most of them are not relevant here. In this case, there are only 53 splits of interest, 27 based on RT1 and 26 based on RT2 (as many as the number of unique values of RT1 and RT2).
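A quick check of that count:
length(unique(d$RT1)) + length(unique(d$RT2))  # 27 + 26 = 53 candidate splits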
So, how do we know whether this particular split is good or bad? Or, how do we find the best split, the one that improves our predictions the most? To evaluate a tree structure like this, we need a quantity that measures its quality. Chen and Guestrin (2016) derived a Gain score for evaluating a fixed tree structure.
\[ Gain = \left[ \sum_{l=1}^{L} \frac{\left(\sum_{i \in S_l}g_i\right)^2}{\sum_{i \in S_l}h_i+\lambda} - \frac{\left(\sum_{i}g_i\right)^2}{\sum_{i}h_i+\lambda} \right] - \gamma \]
Here, \(l\) indexes the leaves, \(L\) is the number of leaves in the tree structure, and \(S_l\) is the set of observations in the \(l^{th}\) leaf. \(\lambda\) and \(\gamma\) are parameters that penalize the complexity of the tree structure and must be set to specific values. In practice, these values are typically determined through a parameter-tuning procedure. For the sake of this illustration, I set \(\lambda = 1.5\) and \(\gamma = 1\).
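Before running the full search, it may help to plug the hypothetical RT1 < 58.5 split shown above directly into this formula (a small sketch of my own; by my arithmetic the gain comes out to about -1.25, so that split would not be a good choice).
# Gain for the hypothetical split RT1 < 58.5 at Iteration 1
lambda <- 1.5
gamma  <- 1
left <- d$RT1 < 58.5                      # the 16 observations in S1
GL <- sum(d$g1[left]);  HL <- sum(d$h1[left])
GR <- sum(d$g1[!left]); HR <- sum(d$h1[!left])
GL^2/(HL + lambda) + GR^2/(HR + lambda) -
  (GL + GR)^2/(HL + HR + lambda) - gamma  # about -1.25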
Now, we will compute the gain score for each possible split based on the RT1 and RT2 features, and then check which one gives the best solution.
g <- d$g1
h <- d$h1
lambda = 1.5
gamma = 1
# All Splits based on RT1
gain <- c()
s <- sort(unique(d$RT1))+.5
for(i in 1:length(s)) {
split <- (d$RT1<s[i])*1
GL <- sum(g[which(split==1)])
GR <- sum(g[which(split==0)])
HL <- sum(h[which(split==1)])
HR <- sum(h[which(split==0)])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All Splits for RT2
gain <- c()
s <- sort(unique(d$RT2))+.5
for(i in 1:length(s)) {
split <- (d$RT2<s[i])*1
GL <- sum(g[which(split==1)])
GR <- sum(g[which(split==0)])
HL <- sum(h[which(split==1)])
HR <- sum(h[which(split==0)])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
38.5 | 3.512 | RT2 |
36.5 | 3.081 | RT2 |
30.5 | 2.742 | RT2 |
47.5 | 2.444 | RT2 |
37.5 | 2.011 | RT2 |
31.5 | 1.651 | RT2 |
As we can see, the split with the largest gain is the one based on a threshold level of 38.5 for the RT2 feature. So, the first branch of the first tree is developed.
The same search is now repeated for the observations on each leaf: the algorithm computes a gain score for all potential splits on each leaf separately to see whether a new branch with a positive gain can be added to either of the two leaves (or to both). Let's first look at all potential splits on the first leaf (RT2 < 38.5).
# All RT1 Splits on Leaf 1
gain <- c()
s <- sort(d[d$RT2<38.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT2<38.5 & d$RT1<s[i])
gr2 <- which(d$RT2<38.5 & d$RT1>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 1
gain <- c()
s <- sort(d[d$RT2<38.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT2<38.5 & d$RT2<s[i])
gr2 <- which(d$RT2<38.5 & d$RT2>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
49.5 | 0.148 | RT1 |
58.5 | -0.105 | RT1 |
114.5 | -0.275 | RT1 |
172.5 | -0.387 | RT1 |
40.5 | -0.529 | RT1 |
55.5 | -0.711 | RT1 |
The results indicate that there is only one potential split with a positive gain, a split based on a threshold level of 49.5 for the RT1 feature. Now, let’s check all potential splits on the second leaf (RT2 > 38.5).
# All RT1 Splits on Leaf 2
gain <- c()
s <- sort(d[d$RT2>38.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT2>38.5 & d$RT1<s[i])
gr2 <- which(d$RT2>38.5 & d$RT1>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 2
gain <- c()
s <- sort(d[d$RT2>38.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT2>38.5 & d$RT2<s[i])
gr2 <- which(d$RT2>38.5 & d$RT2>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
30.5 | -0.412 | RT1 |
139.5 | -1.000 | RT1 |
200.5 | -1.000 | RT2 |
34.5 | -1.103 | RT1 |
64.5 | -1.512 | RT1 |
37.5 | -1.598 | RT1 |
It seems none of the possible splits on the second leaf yield a positive gain score. In this case, the search algorithm suggests adding a new branch on the first leaf based on a threshold level of 49.5 for the RT1 feature. Our first tree structure now looks like this.
The depth of our tree is now 2. As you can imagine, we could continue to search and add even more branches by picking the best split as long as there is a positive gain. This would increase the depth of the tree, and the more depth a tree structure has, the more complex the interactions it can account for when predicting the outcome. For the sake of this illustration, we stop here and end Iteration 1.
Once the tree structure is determined, we can compute the weights for each leaf. Our first tree structure now has three leaves. Chen and Guestrin (2016) showed that the optimal weight of the \(l^{th}\) leaf for a fixed tree structure is equal to
\[w_l=- \frac{\sum_{i \in S_l}g_i}{\sum_{i \in S_l}h_i+\lambda}\]
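For example, a leaf containing four observations that all have \(g_i = -0.5\) and \(h_i = 0.25\) at this iteration would get a weight of
\[w = -\frac{4 \times (-0.5)}{4 \times 0.25 + 1.5} = \frac{2}{2.5} = 0.8,\]
which is exactly what we find below for the leaf with RT2 < 38.5 and RT1 < 49.5.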
Let’s compute the leaf weights on the first tree structure.
gr1 <- which(d$RT2 > 38.5)
gr1
[1] 2 6 7 12 13 15 16 17 18 19 20 21 22 23 24 25 27 28 30
gr2 <- which(d$RT2 <38.5 & d$RT1<49.5)
gr2
[1] 3 5 8 9
gr3 <- which(d$RT2 <38.5 & d$RT1>49.5)
gr3
[1] 1 4 10 11 14 26 29
w1 = -sum(g[gr1])/(sum(h[gr1])+lambda)
w2 = -sum(g[gr2])/(sum(h[gr2])+lambda)
w3 = -sum(g[gr3])/(sum(h[gr3])+lambda)
w1
[1] -1.04
w2
[1] 0.8
w3
[1] -0.1538462
These weights represent how we will improve our predictions at the end of Iteration 1 before we move on to Iteration 2. Each observation in our dataset is now assigned to one of these weights depending on which leaf they belong to. In the end, we add these estimated leaf weights to the previous predictions (\(\hat{y}^0\)) and obtain the new predictions at the end of Iteration 1 (\(\hat{y}^1\)).
d$w <- NA
d$y1 <- NA
d[gr1,]$w = w1
d[gr2,]$w = w2
d[gr3,]$w = w3
d$y1 <- d[,5] + d$w
colnames(d)[8] <- "$w^{1}$"
colnames(d)[9] <- "$\\hat{y}^1$"
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4))
Obs | RT1 | RT2 | y | \(\hat{y}^0\) | g1 | h1 | \(w^{1}\) | \(\hat{y}^1\) |
---|---|---|---|---|---|---|---|---|
1 | 114 | 36 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 |
2 | 92 | 121 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 |
3 | 49 | 24 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 |
4 | 58 | 30 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 |
5 | 40 | 17 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 |
6 | 67 | 66 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 |
7 | 30 | 79 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 |
8 | 19 | 14 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 |
9 | 33 | 8 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 |
10 | 172 | 38 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 |
11 | 55 | 37 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 |
12 | 103 | 79 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
13 | 68 | 117 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
14 | 194 | 17 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 |
15 | 68 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
16 | 48 | 127 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
17 | 48 | 200 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
18 | 61 | 47 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
19 | 37 | 155 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
20 | 58 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
21 | 139 | 97 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
22 | 99 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
23 | 47 | 50 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
24 | 41 | 54 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
25 | 29 | 90 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
26 | 117 | 31 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 |
27 | 44 | 104 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
28 | 34 | 62 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
29 | 83 | 20 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 |
30 | 64 | 57 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 |
The value of the loss function based on \(y_i\) and \(\hat{y}^1\) at the end of Iteration 1 is 15.15 (reduced from 20.79).
y <- d[,4]
y1 <- d[,9]
loss1 <- sum(y*log(1+exp(-y1)) + (1-y)*log(1+exp(y1)))
loss1
[1] 15.15075
The remainder of this analysis is simply a repetition of Iteration 1. At the end of Iteration 1, we developed the first tree in our model; now we will develop the second one. First, we have to update our first- and second-order gradient statistics based on the new predictions from the end of Iteration 1.
\[g_i^{(2)} = -\frac{(y_i-1)e^{\hat{y}_i^{(1)}}+y_i}{e^{\hat{y}_i^{(1)}}+1}\]
\[h_i^{(2)} = \frac{e^{\hat{y}_i^{(1)}}}{(e^{\hat{y}_i^{(1)}}+1)^2}\]
d$g2 <- -((y-1)*exp(y1)+y)/(exp(y1)+1)
d$h2 <- exp(y1)/((exp(y1)+1)^2)
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 2))
Obs | RT1 | RT2 | y | \(\hat{y}^0\) | g1 | h1 | \(w^{1}\) | \(\hat{y}^1\) | g2 | h2 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 114 | 36 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 |
2 | 92 | 121 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 |
3 | 49 | 24 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 |
4 | 58 | 30 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 |
5 | 40 | 17 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 |
6 | 67 | 66 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 |
7 | 30 | 79 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 |
8 | 19 | 14 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 |
9 | 33 | 8 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 |
10 | 172 | 38 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 |
11 | 55 | 37 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 |
12 | 103 | 79 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
13 | 68 | 117 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
14 | 194 | 17 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 |
15 | 68 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
16 | 48 | 127 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
17 | 48 | 200 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
18 | 61 | 47 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
19 | 37 | 155 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
20 | 58 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
21 | 139 | 97 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
22 | 99 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
23 | 47 | 50 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
24 | 41 | 54 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
25 | 29 | 90 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
26 | 117 | 31 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 |
27 | 44 | 104 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
28 | 34 | 62 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
29 | 83 | 20 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 |
30 | 64 | 57 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 |
The algorithm performs the same search as in Iteration 1 over all possible splits based on the RT1 and RT2 features, using the updated gradient statistics.
g <- d$g2
h <- d$h2
# All Splits based on RT1
gain <- c()
s <- sort(unique(d$RT1))+.5
for(i in 1:length(s)) {
split <- (d$RT1<s[i])*1
GL <- sum(g[which(split==1)])
GR <- sum(g[which(split==0)])
HL <- sum(h[which(split==1)])
HR <- sum(h[which(split==0)])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All Splits for RT2
gain <- c()
s <- sort(unique(d$RT2))+.5
for(i in 1:length(s)) {
split <- (d$RT2<s[i])*1
GL <- sum(g[which(split==1)])
GR <- sum(g[which(split==0)])
HL <- sum(h[which(split==1)])
HR <- sum(h[which(split==0)])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
33.5 | 0.009 | RT1 |
79.5 | 0.001 | RT2 |
38.5 | -0.124 | RT2 |
36.5 | -0.248 | RT2 |
40.5 | -0.315 | RT1 |
30.5 | -0.350 | RT2 |
It seems the best split for the first branch of the second tree is based on a threshold value of 33.5 for the RT1 feature.
Similar to Iteration 1, the algorithm will compute a gain score for all potential splits on each leaf of the second tree to see if there is a new branch with a positive gain that can be added to any of the two leaves (or to both leaves).
Let’s first look at all potential splits on the first leaf (RT1 > 33.5).
# All RT1 Splits on Leaf 1
gain <- c()
s <- sort(d[d$RT1>33.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT1>33.5 & d$RT1<s[i])
gr2 <- which(d$RT1>33.5 & d$RT1>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 1
gain <- c()
s <- sort(d[d$RT1>33.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT1>33.5 & d$RT2<s[i])
gr2 <- which(d$RT1>33.5 & d$RT2>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
38.5 | -0.319 | RT2 |
36.5 | -0.517 | RT2 |
47.5 | -0.567 | RT2 |
66.5 | -0.631 | RT2 |
30.5 | -0.680 | RT2 |
83.5 | -0.702 | RT1 |
The results indicate that there is no split with a positive gain that can be added to the first leaf. Let’s check the second leaf (RT1 < 33.5).
# All RT1 Splits on Leaf 2
gain <- c()
s <- sort(d[d$RT1<33.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT1<33.5 & d$RT1<s[i])
gr2 <- which(d$RT1<33.5 & d$RT1>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 2
gain <- c()
s <- sort(d[d$RT1<33.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT1<33.5 & d$RT2<s[i])
gr2 <- which(d$RT1<33.5 & d$RT2>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
79.5 | -0.610 | RT2 |
29.5 | -0.943 | RT1 |
33.5 | -1.000 | RT1 |
90.5 | -1.000 | RT2 |
19.5 | -1.169 | RT1 |
30.5 | -1.169 | RT1 |
There is also no split with a positive gain that can be added to the second leaf. Therefore, we stop and Iteration 2 ends. Our second tree has a depth of one. Let’s compute the leaf weights on the second tree structure.
gr1 <- which(d$RT1 > 33.5)
gr2 <- which(d$RT1 < 33.5)
w1 = -sum(g[gr1])/(sum(h[gr1])+lambda)
w2 = -sum(g[gr2])/(sum(h[gr2])+lambda)
w1
[1] -0.2951779
w2
[1] 0.4744527
These weights similarly represent how we will improve our predictions at the end of Iteration 2 before we move on to Iteration 3. Each observation in our dataset is now assigned to one of these weights depending on which leaf they belong to in Tree 2. In the end, we add these estimated leaf weights to the previous predictions (\(\hat{y}^1\)) and obtain the new predictions at the end of Iteration 2 (\(\hat{y}^2\)).
d$w <- NA
d$y2 <- NA
d[gr1,]$w = w1
d[gr2,]$w = w2
d$y2 <- d[,9] + d$w
colnames(d)[12] <- "$w^{2}$"
colnames(d)[13] <- "$\\hat{y}^2$"
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 4))
Obs | RT1 | RT2 | y | \(\hat{y}^0\) | g1 | h1 | \(w^{1}\) | \(\hat{y}^1\) | g2 | h2 | \(w^{2}\) | \(\hat{y}^2\) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 114 | 36 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 | -0.295 | -0.449 |
2 | 92 | 121 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 | -0.295 | -1.335 |
3 | 49 | 24 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | -0.295 | 0.505 |
4 | 58 | 30 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 | -0.295 | -0.449 |
5 | 40 | 17 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | -0.295 | 0.505 |
6 | 67 | 66 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 | -0.295 | -1.335 |
7 | 30 | 79 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 | 0.474 | -0.566 |
8 | 19 | 14 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | 0.474 | 1.274 |
9 | 33 | 8 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | 0.474 | 1.274 |
10 | 172 | 38 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 | -0.295 | -0.449 |
11 | 55 | 37 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 |
12 | 103 | 79 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
13 | 68 | 117 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
14 | 194 | 17 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 |
15 | 68 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
16 | 48 | 127 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
17 | 48 | 200 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
18 | 61 | 47 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
19 | 37 | 155 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
20 | 58 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
21 | 139 | 97 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
22 | 99 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
23 | 47 | 50 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
24 | 41 | 54 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
25 | 29 | 90 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | 0.474 | -0.566 |
26 | 117 | 31 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 |
27 | 44 | 104 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
28 | 34 | 62 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
29 | 83 | 20 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 |
30 | 64 | 57 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 |
The value of the loss function based on \(y_i\) and \(\hat{y}^2\) at the end of Iteration 2 is 14.35 (reduced from 15.15).
y <- d[,4]
y2 <- d[,13]
loss2 <- sum(y*log(1+exp(-y2)) + (1-y)*log(1+exp(y2)))
loss2
[1] 14.34646
Our tree ensemble model at the end of Iteration 2 now looks like this.
We keep repeating the same cycle. First, we update our first- and second-order gradient statistics based on the new predictions at the end of Iteration 2.
\[g_i^{(3)} = -\frac{(y_i-1)e^{\hat{y}_i^{(2)}}+y_i}{e^{\hat{y}_i^{(2)}}+1}\]
\[h_i^{(3)} = \frac{e^{\hat{y}_i^{(2)}}}{(e^{\hat{y}_i^{(2)}}+1)^2}\]
d$g3 <- -((y-1)*exp(y2)+y)/(exp(y2)+1)
d$h3 <- exp(y2)/((exp(y2)+1)^2)
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 4, "Iteration 3" = 2))
Obs | RT1 | RT2 | y | \(\hat{y}^0\) | g1 | h1 | \(w^{1}\) | \(\hat{y}^1\) | g2 | h2 | \(w^{2}\) | \(\hat{y}^2\) | g3 | h3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 114 | 36 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 | -0.295 | -0.449 | -0.610 | 0.238 |
2 | 92 | 121 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 | -0.295 | -1.335 | -0.792 | 0.165 |
3 | 49 | 24 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | -0.295 | 0.505 | -0.376 | 0.235 |
4 | 58 | 30 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 | -0.295 | -0.449 | -0.610 | 0.238 |
5 | 40 | 17 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | -0.295 | 0.505 | -0.376 | 0.235 |
6 | 67 | 66 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 | -0.295 | -1.335 | -0.792 | 0.165 |
7 | 30 | 79 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 | 0.474 | -0.566 | -0.638 | 0.231 |
8 | 19 | 14 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | 0.474 | 1.274 | -0.218 | 0.171 |
9 | 33 | 8 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | 0.474 | 1.274 | -0.218 | 0.171 |
10 | 172 | 38 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 | -0.295 | -0.449 | -0.610 | 0.238 |
11 | 55 | 37 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 | 0.390 | 0.238 |
12 | 103 | 79 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
13 | 68 | 117 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
14 | 194 | 17 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 | 0.390 | 0.238 |
15 | 68 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
16 | 48 | 127 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
17 | 48 | 200 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
18 | 61 | 47 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
19 | 37 | 155 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
20 | 58 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
21 | 139 | 97 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
22 | 99 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
23 | 47 | 50 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
24 | 41 | 54 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
25 | 29 | 90 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | 0.474 | -0.566 | 0.362 | 0.231 |
26 | 117 | 31 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 | 0.390 | 0.238 |
27 | 44 | 104 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
28 | 34 | 62 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
29 | 83 | 20 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 | 0.390 | 0.238 |
30 | 64 | 57 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 |
The algorithm searches among all possible splits based on the RT1 and RT2 features using the updated gradient statistics.
g <- d$g3
h <- d$h3
# All Splits based on RT1
gain <- c()
s <- sort(unique(d$RT1))+.5
for(i in 1:length(s)) {
split <- (d$RT1<s[i])*1
GL <- sum(g[which(split==1)])
GR <- sum(g[which(split==0)])
HL <- sum(h[which(split==1)])
HR <- sum(h[which(split==0)])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All Splits for RT2
gain <- c()
s <- sort(unique(d$RT2))+.5
for(i in 1:length(s)) {
split <- (d$RT2<s[i])*1
GL <- sum(g[which(split==1)])
GR <- sum(g[which(split==0)])
HL <- sum(h[which(split==1)])
HR <- sum(h[which(split==0)])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
79.5 | 0.116 | RT2 |
38.5 | -0.131 | RT2 |
36.5 | -0.356 | RT2 |
47.5 | -0.382 | RT2 |
66.5 | -0.436 | RT2 |
30.5 | -0.542 | RT2 |
The best split is the one based on a threshold value of 79.5 for the RT2 feature.
The algorithm computes a gain score for all potential splits on each leaf of the third tree to see if there is a new branch with a positive gain that can be added to any of the two leaves (or to both leaves).
Let’s first look at all potential splits on the first leaf (RT2 > 79.5).
# All RT1 Splits on Leaf 1
gain <- c()
s <- sort(d[d$RT2>79.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT2>79.5 & d$RT1<s[i])
gr2 <- which(d$RT2>79.5 & d$RT1>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 1
gain <- c()
s <- sort(d[d$RT2>79.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT2>79.5 & d$RT2<s[i])
gr2 <- which(d$RT2>79.5 & d$RT2>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
68.5 | -0.399 | RT1 |
68.5 | -0.399 | RT1 |
117.5 | -0.650 | RT2 |
58.5 | -0.846 | RT1 |
104.5 | -0.846 | RT2 |
48.5 | -0.995 | RT1 |
The results indicate that there is no split with a positive gain that can be added to the first leaf. Let’s check the second leaf (RT2 < 79.5).
# All RT1 Splits on Leaf 2
gain <- c()
s <- sort(d[d$RT2<79.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT2<79.5 & d$RT1<s[i])
gr2 <- which(d$RT2<79.5 & d$RT1>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 2
gain <- c()
s <- sort(d[d$RT2<79.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 <- which(d$RT2<79.5 & d$RT2<s[i])
gr2 <- which(d$RT2<79.5 & d$RT2>s[i])
GL <- sum(g[gr1])
GR <- sum(g[gr2])
HL <- sum(h[gr1])
HR <- sum(h[gr2])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
172.5 | -0.610 | RT1 |
67.5 | -0.687 | RT1 |
62.5 | -0.743 | RT2 |
114.5 | -0.784 | RT1 |
40.5 | -0.842 | RT1 |
33.5 | -0.879 | RT1 |
There is also no split with a positive gain that can be added to the second leaf. Therefore, we stop and Iteration 3 ends. Our third tree has a depth of one, like the second tree. Let's compute the leaf weights on the third tree structure.
gr1 <- which(d$RT2 > 79.5)
gr2 <- which(d$RT2 < 79.5)
w1 = -sum(g[gr1])/(sum(h[gr1])+lambda)
w2 = -sum(g[gr2])/(sum(h[gr2])+lambda)
w1
[1] -0.4275847
w2
[1] 0.3063322
We use these weights to update the predictions at the end of Iteration 3 before we move on to Iteration 4. Each observation in our dataset is assigned to one of these weights depending on which leaf they belong to in Tree 3. Then, we add these estimated leaf weights to the previous predictions (\(\hat{y}^2\)) and obtain the new predictions at the end of Iteration 3 (\(\hat{y}^3\)).
d$w <- NA
d$y3 <- NA
d[gr1,]$w = w1
d[gr2,]$w = w2
d$y3 <- d[,13] + d$w
colnames(d)[16] <- "$w^{3}$"
colnames(d)[17] <- "$\\hat{y}^3$"
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 4, "Iteration 3" = 4))
Obs | RT1 | RT2 | y | \(\hat{y}^0\) | g1 | h1 | \(w^{1}\) | \(\hat{y}^1\) | g2 | h2 | \(w^{2}\) | \(\hat{y}^2\) | g3 | h3 | \(w^{3}\) | \(\hat{y}^3\) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 114 | 36 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 | -0.295 | -0.449 | -0.610 | 0.238 | 0.306 | -0.143 |
2 | 92 | 121 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 | -0.295 | -1.335 | -0.792 | 0.165 | -0.428 | -1.763 |
3 | 49 | 24 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | -0.295 | 0.505 | -0.376 | 0.235 | 0.306 | 0.811 |
4 | 58 | 30 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 | -0.295 | -0.449 | -0.610 | 0.238 | 0.306 | -0.143 |
5 | 40 | 17 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | -0.295 | 0.505 | -0.376 | 0.235 | 0.306 | 0.811 |
6 | 67 | 66 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 | -0.295 | -1.335 | -0.792 | 0.165 | 0.306 | -1.029 |
7 | 30 | 79 | 1 | 0 | -0.5 | 0.25 | -1.040 | -1.040 | -0.739 | 0.193 | 0.474 | -0.566 | -0.638 | 0.231 | 0.306 | -0.259 |
8 | 19 | 14 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | 0.474 | 1.274 | -0.218 | 0.171 | 0.306 | 1.581 |
9 | 33 | 8 | 1 | 0 | -0.5 | 0.25 | 0.800 | 0.800 | -0.310 | 0.214 | 0.474 | 1.274 | -0.218 | 0.171 | 0.306 | 1.581 |
10 | 172 | 38 | 1 | 0 | -0.5 | 0.25 | -0.154 | -0.154 | -0.538 | 0.249 | -0.295 | -0.449 | -0.610 | 0.238 | 0.306 | -0.143 |
11 | 55 | 37 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 | 0.390 | 0.238 | 0.306 | -0.143 |
12 | 103 | 79 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | 0.306 | -1.029 |
13 | 68 | 117 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | -0.428 | -1.763 |
14 | 194 | 17 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 | 0.390 | 0.238 | 0.306 | -0.143 |
15 | 68 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | -0.428 | -1.763 |
16 | 48 | 127 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | -0.428 | -1.763 |
17 | 48 | 200 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | -0.428 | -1.763 |
18 | 61 | 47 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | 0.306 | -1.029 |
19 | 37 | 155 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | -0.428 | -1.763 |
20 | 58 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | -0.428 | -1.763 |
21 | 139 | 97 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | -0.428 | -1.763 |
22 | 99 | 84 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | -0.428 | -1.763 |
23 | 47 | 50 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | 0.306 | -1.029 |
24 | 41 | 54 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | 0.306 | -1.029 |
25 | 29 | 90 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | 0.474 | -0.566 | 0.362 | 0.231 | -0.428 | -0.993 |
26 | 117 | 31 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 | 0.390 | 0.238 | 0.306 | -0.143 |
27 | 44 | 104 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | -0.428 | -1.763 |
28 | 34 | 62 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | 0.306 | -1.029 |
29 | 83 | 20 | 0 | 0 | 0.5 | 0.25 | -0.154 | -0.154 | 0.462 | 0.249 | -0.295 | -0.449 | 0.390 | 0.238 | 0.306 | -0.143 |
30 | 64 | 57 | 0 | 0 | 0.5 | 0.25 | -1.040 | -1.040 | 0.261 | 0.193 | -0.295 | -1.335 | 0.208 | 0.165 | 0.306 | -1.029 |
The value of the loss function based on \(y_i\) and \(\hat{y}^3\) at the end of Iteration 3 is 13.57 (reduced from 14.35).
y <- d[,4]
y3 <- d[,17]
loss3 <- sum(y*log(1+exp(-y3)) + (1-y)*log(1+exp(y3)))
loss3
[1] 13.5685
Our tree ensemble model at the end of Iteration 3 looks like this.
We update our first- and second-order gradient statistics based on the new predictions at the end of Iteration 3. As can be seen below, the search does not find any split with a positive gain score. Therefore, the whole process ends, because there is no fourth tree that can be added to improve our predictions. So, our final model remains the ensemble from the end of Iteration 3.
\[g_i^{(4)} = -\frac{(y_i-1)e^{\hat{y}_i^{(3)}}+y_i}{e^{\hat{y}_i^{(3)}}+1}\]
\[h_i^{(4)} = \frac{e^{\hat{y}_i^{(3)}}}{(e^{\hat{y}_i^{(3)}}+1)^2}\]
d$g4 <- -((y-1)*exp(y3)+y)/(exp(y3)+1)
d$h4 <- exp(y3)/((exp(y3)+1)^2)
g <- d$g4
h <- d$h4
# All Splits based on RT1
gain <- c()
s <- sort(unique(d$RT1))+.5
for(i in 1:length(s)) {
split <- (d$RT1<s[i])*1
GL <- sum(g[which(split==1)])
GR <- sum(g[which(split==0)])
HL <- sum(h[which(split==1)])
HR <- sum(h[which(split==0)])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) <- c("Threshold","Gain","Feature")
# All Splits for RT2
gain <- c()
s <- sort(unique(d$RT2))+.5
for(i in 1:length(s)) {
split <- (d$RT2<s[i])*1
GL <- sum(g[which(split==1)])
GR <- sum(g[which(split==0)])
HL <- sum(h[which(split==1)])
HR <- sum(h[which(split==0)])
gain[i] <- ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda)) - (((GL+GR)^2)/(HL+HR+lambda))) - gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) <- c("Threshold","Gain","Feature")
pp <- rbind(pp1,pp2)
pp <- pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold | Gain | Feature |
---|---|---|
38.5 | -0.662 | RT2 |
33.5 | -0.690 | RT1 |
36.5 | -0.721 | RT2 |
30.5 | -0.769 | RT2 |
40.5 | -0.787 | RT1 |
79.5 | -0.797 | RT2 |
Since we have now developed (trained) our model, we can use it to predict the probability of the outcome for a new observation. Suppose a new test taker responds to the first and second items in 80 and 40 seconds, respectively (RT1 = 80, RT2 = 40). We first place this examinee on the corresponding leaf of each tree based on these inputs.
We sum the weights for this person from all trees, \[(-1.040) + (-0.295) + 0.306 = -1.029.\] Note that this is on the logit scale, and we can transform it to a probability as \[\frac{e^{-1.029}}{e^{-1.029}+1} = 0.263.\] So, for a person with this pattern of response times, our model predicts that there is about a 26% chance that this person had item preknowledge.
Let’s do another example. Suppose another test taker responds to the first and second items in 25 and 15 seconds, respectively (RT1 = 25, RT2 = 15). This pattern corresponds to a weight of 0.800 (RT2 < 38.5 and RT1 < 49.5) in the first tree, a weight of 0.474 (RT1 < 33.5) in the second tree, and a weight of 0.306 (RT2 < 79.5) in the third tree. The sum of the weights is 1.58, corresponding to a probability of 0.83. So, for a person with this particular pattern of response times, our model predicts that there is an 83% chance that this person had item preknowledge.
Notice that the shorter the response times are (once they fall below the threshold levels in the model), the higher the predicted probability of item preknowledge. This makes sense in this context, so the model works as it is supposed to.
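To make these hand calculations easy to reproduce, here is a minimal sketch that wraps the three hand-built trees into a single R function (the function name is mine, and the leaf weights are the values computed above, rounded to three decimals).
# Apply the three hand-built trees and return the predicted probability
# of item preknowledge
predict_preknowledge <- function(rt1, rt2) {
  w1 <- ifelse(rt2 > 38.5, -1.040, ifelse(rt1 < 49.5, 0.800, -0.154)) # Tree 1
  w2 <- ifelse(rt1 > 33.5, -0.295, 0.474)                             # Tree 2
  w3 <- ifelse(rt2 > 79.5, -0.428, 0.306)                             # Tree 3
  score <- w1 + w2 + w3             # sum of leaf weights (logit scale)
  exp(score) / (1 + exp(score))     # convert to a probability
}
predict_preknowledge(80, 40)  # about 0.26
predict_preknowledge(25, 15)  # about 0.83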
Now, let's fit the same model using the xgboost package in R.
require(xgboost)
dtrain <- xgb.DMatrix(data = data.matrix(d[,2:3]), label=d$y)
fit <- xgb.train(data=dtrain,
nround = 5,
eta = 1,
min_child_weight = 0,
max_depth = 2,
gamma = 1,
max_delta_step=0,
subsample = 1,
colsample_bytree = 1,
lambda = 1.5,
alpha =0,
scale_pos_weight =1,
num_parallel_tree = 1,
nthread = 1,
objective = "binary:logistic",
eval_metric="rmse",predict=TRUE)
First, let me briefly comment on some of the parameters here; you can find many other web pages that give more information about them.
As you can see, I set gamma = 1 and lambda = 1.5, the same values I used for \(\gamma\) and \(\lambda\) in my demonstration above.
nround = 5 is the maximum number of iterations you want to run, so it is technically the maximum number of trees in your tree ensemble model.
eta is the learning rate parameter. In the demonstration above, I implicitly used eta = 1, because I added the full amount of the estimated leaf weight when updating the prediction at each iteration. Instead, one can add a smaller fraction of the weight (such as half of it, or 10% of it) when updating the prediction. For instance, setting eta = .1 tells the algorithm to add 10% of the estimated leaf weight at each iteration, as in the sketch below.
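In terms of the updates above, the shrinkage would look like this (eta = 1 recovers exactly the updates used in the illustration).
# Shrinkage: add only a fraction (eta) of the estimated leaf weight
eta      <- 0.1
yhat_old <- 0       # prediction from the previous iteration
w_leaf   <- -1.04   # leaf weight estimated in the current iteration
yhat_old + eta * w_leaf  # new prediction: -0.104 instead of -1.04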
min_child_weight is the minimum sum of the second-order gradient statistics (the h values) required in a leaf to add a new branch. For instance, the leaf with four observations in the first tree above has a sum of \(4 \times 0.25 = 1\), so that branch would not be added if min_child_weight were set above 1; setting it to 0, as in the call above, means no split is blocked by this criterion.
max_depth is the maximum depth of a tree. I set it to 2, as in my demo above.
alpha is another regularization parameter, like lambda and gamma. The original paper does not mention it, so I am not quite sure how it changes the formulas I provided in this tutorial, but it can be seen in the source code the algorithm uses (https://github.com/dmlc/xgboost/blob/master/src/tree/param.h, see Line 45, Lines 142-145, Lines 303-310). I set it to 0, so XGBoost runs the same way as in my demonstration.
Now, let's check the tree structure that the original XGBoost algorithm comes up with.
xgb.plot.tree(model = fit)
Let's try to understand the numbers in the model calibrated by the xgboost function and check whether they align with the numbers in our illustration above. I will briefly go over the numbers for the first tree and leave it to you to check the numbers for the other two trees.
There are two numbers listed in the first box of Tree 0: Cover: 7.5 and Gain: 4.51163387. What Cover: 7.5 refers to here is the sum of the second-order gradient statistics for all 30 observations in the first iteration; if you go back to the big table reported above and sum the values of h1, you get 7.5. The Gain score reported is 4.5116, whereas the gain score we computed above for this split is 3.512. If you check the formula for the Gain above, you will see that \(\gamma\) is subtracted at the end in our computation. So, the difference of 1 (after rounding to the third decimal) between the Gain score I computed and the Gain score XGBoost reports is probably due to the \(\gamma\) parameter, \(\gamma = 1\).
I believe the original XGBoost must be using the regularization term during the search, as we specify \(\gamma = 1\) in the function; however, it is probably reporting the Gain score without subtracting \(\gamma\) in this plot.
Next, the threshold for the first split, based on the RT2 value, is reported as 42.5, whereas my search found 38.5, which is a bit confusing at first. However, if we sort the values available for RT2, we see that there is no value in the data between 38 and 47.
sort(d$RT2)
[1] 8 14 17 17 20 24 30 31 36 37 38 47 50 54 57 62
[17] 66 79 79 84 84 84 90 97 104 117 121 127 155 200
So, practically, there is no difference in splitting this training sample at a threshold of 38.5 or 42.5. The 38.5 comes up in my computation by design: in my naive search, I added .5 to each unique value in the dataset and then did a brute-force search over every resulting candidate. I am not sure exactly how XGBoost comes up with 42.5, but I suspect it picks the middle of the interval; I don't think it is a coincidence that 42.5 is the mean of 38 and 47. In short, there is no difference between choosing 38.5 or 42.5 as the threshold for this small training sample.
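A quick check of that guess (this is my own assumption, based only on the numbers in the plot):
rt2 <- sort(unique(d$RT2))
rt2[rt2 >= 38 & rt2 <= 47]  # 38 47: no observed RT2 value in between
mean(c(38, 47))             # 42.5, the threshold reported in the plot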
Within the ellipse, there are two values, Cover: 4.75 and Value: -1.03999996. Cover is again the sum of the second-order gradient statistics for the observations on this particular leaf (RT2 > 42.5), and Value corresponds to the leaf weight for the observations on this leaf. Note that the leaf weight we computed above was -1.04, so we obtained the same weight in our computation.
Within the box, there are two values, Cover: 2.75 and Gain: 1.14751124. Cover is the sum of the second-order gradient statistics for the observations on this particular leaf (RT2 < 42.5), and Gain is the gain score for the next best split on this leaf. Note that the gain score I computed above for this split is 0.148. Again, there is a difference of 1 (\(\gamma = 1\)), because XGBoost does not subtract the regularization parameter when reporting the gain scores in this plot.
Again, the threshold value for the second split, on the RT1 variable, is reported as 52, whereas my naive search came up with 49.5. Similarly, there is no value in the data between 49 and 55 for the RT1 variable, and it is no coincidence that XGBoost reports the middle of 49 and 55 as the threshold. So, practically, there is no difference between picking a threshold of 49.5 or 52 for this training sample.
sort(d$RT1)
[1] 19 29 30 33 34 37 40 41 44 47 48 48 49 55 58 58
[17] 61 64 67 68 68 83 92 99 103 114 117 139 172 194
Within the first ellipse above, we see Cover: 1 and Value: 0.80000000012. Cover is the sum of the second-order gradient statistics for the observations on this particular leaf (RT2 < 42.5 and RT1 < 52), and Value corresponds to the leaf weight. Note that we computed the same weight (0.8) for this leaf.
Within the second ellipse above, we see Cover: 1.75 and Value: -0.15384616. Cover is the sum of the second-order gradient statistics for the observations on this particular leaf (RT2 < 42.5 and RT1 > 52), and Value corresponds to the leaf weight. Again, note that we computed the exact same weight for this leaf as well.
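Finally, if you want predictions from the fitted xgboost model itself for the two hypothetical examinees above, something like the sketch below should work. Keep in mind that this fitted model is not identical to the hand-built one (nround = 5 allows more trees, and the thresholds differ slightly), so the probabilities need not match the hand-computed 0.26 and 0.83 exactly.
# Score the two new examinees with the fitted xgboost model
newx <- matrix(c(80, 40,
                 25, 15),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("RT1", "RT2")))
predict(fit, newdata = newx)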
I hope this detailed illustration helps interested people better understand the mechanics of XGBoost. It certainly helped me. I have a better grasp of what is happening behind the scenes whenever I use XGBoost for more complex problems, and the numbers make more sense whenever I look at a calibrated tree ensemble model.
Please feel free to send me any feedback and comments. I am happy to address any errors/typos in the post.
Thanks for reading.