Quick Reference · gradient-boosted trees · single source of truth

xgboost cheat sheet

Every workflow moves a model through four stages: your data, the DMatrix + parameters you configure, the Booster you train, and the predictions/artifacts you deploy. Learn the pipeline once and the parameters stop being a list to memorize. Covers the native Booster API, the scikit-learn wrapper, every booster type, constraints, custom objectives, distributed training, and best practices — current through XGBoost 3.2 (Feb 2026).

Legend: data · DMatrix & params · training · predict / deploy · gotcha · most common

Distilled & cross-checked across: xgboost.readthedocs.io (parameter reference · Python API · tutorials · release notes 3.0–3.2) · xgboost.ai · arXiv:1603.02754 · github.com/dmlc/xgboost · geeksforgeeks.org · wikipedia.org

The pipeline & the calls that move a model through it
Pipeline: Data → DMatrix + params → Booster → Predictions
Data: NumPy / pandas / Arrow / Polars / SciPy / cuDF (X, y, categories, NaN)
DMatrix + params: binned & wrapped, "the index" (max_depth, eta, objective…)
Booster: trained tree ensemble (bst.best_iteration)
Predictions: scores · SHAP · saved model (bst.predict(dtest))
Calls between stages: DMatrix(X,y) → xgb.train() → .predict() / feature_importances_
Feedback loop: xgb.cv() / early_stopping_rounds (tunes params from held-out score)
Before training: Data, DMatrix + params · After training: Booster, Predictions

Underneath every round: minimize Σ l(yᵢ,ŷᵢ) + Σ Ω(fₖ) — a loss term plus a regularization term Ω(f) = γT + ½λ‖w‖² + α‖w‖₁, solved via 2nd-order (gradient + Hessian) Taylor expansion each boosting round.

Data in — containers, formats, categories, missing values

Everything that happens before a single tree is grown.

01 · Install & Import: get set up
02 · Global Configuration: library-wide switches
03 · DMatrix: the native container
04 · QuantileDMatrix & External Memory: big-data containers
05 · Accepted Input Formats: what X can be
06 · Categorical Data: no manual encoding
07 · Missing Values & Sparsity: sparsity-aware splits

Configure — booster types, tree shape, sampling, constraints

The parameter dict: what each knob does and its default.

08 · Parameter Dict & Booster Type: what you configure
09 · Tree Growth Control: shape of each tree
10 · Sampling (Bagging Rows & Cols): fights overfitting
11 · Learning Rate, Rounds & Forests: step size × count
12 · tree_method, Device & Updaters: how splits are found
13 · Regularization: penalize complexity
14 · Monotone & Interaction Constraints: domain knowledge in
15 · DART Booster: dropout for trees
16 · gblinear Booster: linear base learner
17 · Multi-Target & Vector Leaf: many outputs, one model

Objectives, metrics & custom losses

What the trees minimize, and how progress is scored.

18 · Objectives (Classification & Regression): the everyday set
19 · Objectives (Specialized): rank · survival · counts
20 · eval_metric: how to score rounds
21 · Custom Objective & Metric: bring your own loss

Train — native loop, sklearn wrappers, stopping, CV, continuation

Both APIs, and everything that controls the boosting loop.

22 · xgb.train() (Native Loop): the core API
23 · Early Stopping: stop at the right round
24 · Cross-Validation: robust round count
25 · Callbacks: hook into training
26 · Training Continuation: warm starts
27 · XGBClassifier: sklearn API
28 · XGBRegressor: sklearn API
29 · Ranker & Random-Forest Variants: specialized estimators
30 · Scikit-learn Interop: plugs into the ecosystem

Predict, explain, inspect & persist

Everything you do with a trained Booster.

31 · predict() Options: shape the output
32 · inplace_predict(): serving-path inference
33 · Model Inspection: open the box
34 · Booster Slicing & Attributes: the model as an object
35 · Feature Importance: what mattered
36 · SHAP Values: explain a prediction
37 · Plotting: see the trees
38 · Save / Load Models: persistence

Scale out & best practices

GPU, Dask, Spark — and the tuning playbook that ties it together.

39 · GPU Training: device='cuda'
40 · Dask (xgboost.dask): multi-node Python
41 · Spark (xgboost.spark): PySpark pipelines
42 · Ecosystem & Beyond: who plays well with it
43 · Tuning Playbook: official guidance, condensed
44 · Performance Best Practices: speed & memory
Most-Used Defaults: at a glance
objective → default eval_metric: quick lookup
Version Milestones: what changed when

The math under the hood — from loss to split, in six steps

The whole algorithm in six formulae, straight from the paper and the official "Introduction to Boosted Trees" tutorial. Most parameters you tune appear in at least one of them; λ and γ appear in several.

1 · the model is a sum of trees

Prediction is additive — K trees, each added one round at a time and shrunk by eta. No tree is ever re-fit; new trees fix what's left.

ŷᵢ = Σₖ₌₁..ᴷ fₖ(xᵢ)
ŷ⁽ᵗ⁾ = ŷ⁽ᵗ⁻¹⁾ + η·fₜ(xᵢ)
where η = eta / learning_rate · t = boosting round · K = num_boost_round
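The additive update can be illustrated with a deliberately tiny toy. This is a sketch under strong assumptions: squared-error loss, no regularization, and single-leaf "trees" that just predict the mean residual. The function name boost_constant_trees is hypothetical and not part of XGBoost.

```python
def boost_constant_trees(y, eta=0.3, rounds=10, base_score=0.0):
    """Additive updates y_hat <- y_hat + eta * f_t with single-leaf 'trees'.

    With squared error, the best single leaf is the mean residual, so each
    round closes a fraction eta of the remaining gap to mean(y).
    """
    y_hat = [base_score] * len(y)
    history = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, y_hat)]
        f_t = sum(residuals) / len(residuals)      # the new "tree"
        y_hat = [pi + eta * f_t for pi in y_hat]   # earlier trees are never re-fit
        history.append(y_hat[0])
    return history


# mean(y) is 2.0; with eta=0.5 the prediction halves the remaining gap each round.
history = boost_constant_trees([1.0, 2.0, 3.0], eta=0.5, rounds=4)
print(history)
```

Each round only adds a new term; nothing learned earlier is revisited, which is why a smaller eta needs more rounds to reach the same fit.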

2 · the regularized objective

Loss + complexity penalty. This Ω term is the "regularized" in the paper's title — most of your anti-overfit knobs live inside it.

obj = Σᵢ l(yᵢ, ŷᵢ) + Σₖ Ω(fₖ)
Ω(f) = γT + ½λ‖w‖² + α‖w‖₁
where T = leaves (gamma) · w = leaf weights (reg_lambda, reg_alpha)
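To make the penalty concrete, the Ω term can be evaluated for one tree given its leaf weights. This is a minimal sketch; the helper name omega and the example weights are assumptions for illustration, not an XGBoost function.

```python
def omega(leaf_weights, gamma=0.0, reg_lambda=1.0, reg_alpha=0.0):
    """Complexity penalty of one tree: gamma*T + 0.5*lambda*||w||^2 + alpha*||w||_1."""
    num_leaves = len(leaf_weights)                 # T
    l2 = sum(w * w for w in leaf_weights)          # ||w||^2
    l1 = sum(abs(w) for w in leaf_weights)         # ||w||_1
    return gamma * num_leaves + 0.5 * reg_lambda * l2 + reg_alpha * l1


# Three leaves: 1.0*3 + 0.5*1.0*5.25 + 0.1*3.5
penalty = omega([0.5, -1.0, 2.0], gamma=1.0, reg_lambda=1.0, reg_alpha=0.1)
print(penalty)
```

Raising gamma charges every extra leaf, while reg_lambda and reg_alpha charge large leaf values.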

3 · second-order Taylor expansion

Each round, the loss is approximated with gradients and Hessians (Newton boosting) — which is exactly why any twice-differentiable custom loss plugs straight in.

obj⁽ᵗ⁾ ≈ Σᵢ [gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ)] + Ω(fₜ)
gᵢ = ∂ŷ l(yᵢ, ŷ⁽ᵗ⁻¹⁾) · hᵢ = ∂²ŷ l(yᵢ, ŷ⁽ᵗ⁻¹⁾)
A custom objective returns exactly these: (grad, hess); see card 21.
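As a sanity check on the expansion, the sketch below computes g and h for binary log loss with respect to the raw margin and compares the second-order approximation with the exact loss after a small step. It assumes log loss as the example and keeps the constant term l(y, ŷ⁽ᵗ⁻¹⁾) that the formula above drops; all function names are hypothetical helpers.

```python
import math


def sigmoid(margin):
    return 1.0 / (1.0 + math.exp(-margin))


def logloss(y, margin):
    p = sigmoid(margin)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def grad_hess(y, margin):
    """First and second derivative of log loss with respect to the raw margin."""
    p = sigmoid(margin)
    return p - y, p * (1.0 - p)


def taylor_approx(y, margin, step):
    """l(y, margin + step) approximated by l + g*step + 0.5*h*step^2."""
    g, h = grad_hess(y, margin)
    return logloss(y, margin) + g * step + 0.5 * h * step * step


exact = logloss(1, 0.2 + 0.1)
approx = taylor_approx(1, 0.2, 0.1)
print(exact, approx)
```

A custom objective for xgb.train() would return the same two quantities, computed per row as arrays.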

4 · optimal leaf weight, closed form

For a fixed tree shape, the best value of every leaf — and the score of the whole structure — have exact solutions. λ sits in the denominator, shrinking weights.

w*ⱼ = −Gⱼ / (Hⱼ + λ)
obj* = −½ Σⱼ Gⱼ²/(Hⱼ+λ) + γT
where Gⱼ = Σ gᵢ and Hⱼ = Σ hᵢ over the samples in leaf j · min_child_weight sets a floor on Hⱼ (a split is rejected if a child's Hⱼ would fall below it)
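The closed form can be checked numerically. This sketch assumes α = 0 (with an L1 term the optimum involves soft-thresholding, which is not shown) and uses made-up gradient sums; the helper names are hypothetical.

```python
def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf value w* = -G / (H + lambda)."""
    return -G / (H + reg_lambda)


def leaf_objective(w, G, H, reg_lambda=1.0):
    """Per-leaf part of the objective: G*w + 0.5*(H + lambda)*w^2."""
    return G * w + 0.5 * (H + reg_lambda) * w * w


def structure_score(leaves, reg_lambda=1.0, gamma=0.0):
    """obj* = -0.5 * sum(G^2 / (H + lambda)) + gamma * T for a list of (G, H) pairs."""
    total = sum(G * G / (H + reg_lambda) for G, H in leaves)
    return -0.5 * total + gamma * len(leaves)


G, H = -4.0, 3.0
w_star = leaf_weight(G, H, reg_lambda=1.0)          # -(-4) / (3 + 1)
best = leaf_objective(w_star, G, H, reg_lambda=1.0)
score = structure_score([(G, H)], reg_lambda=1.0)
print(w_star, best, score)
```

Moving w away from w_star in either direction increases the per-leaf objective, and a larger reg_lambda pulls w_star toward zero.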

5 · the split gain

Every candidate split is scored by how much the structure score improves. γ is subtracted at the end — a split that doesn't beat γ never happens.

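Gain = ½[ G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ) ] − γ
where L and R are the left and right children of the candidate split.
A split is kept only if Gain > 0, so gamma is a pruning threshold on the split gain. This is the "gain" that feature importance reports.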
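The gain formula translates directly into a few lines. This is a sketch with invented gradient and Hessian sums; split_gain is a hypothetical helper, not an XGBoost call.

```python
def split_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=0.0):
    """Improvement in structure score from splitting a node into L and R, minus gamma."""
    def score(G, H):
        return G * G / (H + reg_lambda)

    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma


# Children pull in opposite directions, so the split is worth making.
gain = split_gain(-4.0, 3.0, 5.0, 4.0, reg_lambda=1.0, gamma=0.5)

# The same split with a much larger gamma is no longer worth a new leaf.
gain_high_gamma = split_gain(-4.0, 3.0, 5.0, 4.0, reg_lambda=1.0, gamma=5.0)
print(gain, gain_high_gamma)
```

Only the sums G and H per child are needed, which is what makes scanning many candidate thresholds cheap.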

6 · margin → prediction (the link)

Trees always sum to a raw margin; the objective's link function turns it into the number you see. Custom objectives skip this step — apply the link yourself.

margin: F(x) = Σ ηfₖ
link: p = 1 / (1 + e^(−F(x)))
binary:logistic → sigmoid · count:poisson / reg:gamma → exp · reg:squarederror → identity
output_margin=True returns F(x) itself, and a model trained with a custom objective ALWAYS returns F(x): apply the link yourself.
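Applying the link by hand looks like the sketch below. The LINKS table and apply_link are hypothetical helpers covering only the objectives named above, and the sketch ignores base_score for brevity.

```python
import math

# Objective name -> function that maps a raw margin to the reported prediction.
LINKS = {
    "binary:logistic": lambda m: 1.0 / (1.0 + math.exp(-m)),  # sigmoid
    "count:poisson": math.exp,
    "reg:gamma": math.exp,
    "reg:squarederror": lambda m: m,                           # identity
}


def apply_link(objective, margin):
    """Turn a raw margin F(x) into the prediction scale of the given objective."""
    return LINKS[objective](margin)


prob = apply_link("binary:logistic", 0.0)
rate = apply_link("count:poisson", 0.0)
value = apply_link("reg:squarederror", 2.5)
print(prob, rate, value)
```

With a custom log-loss objective, for example, the sigmoid would be applied to the output of predict() in the same way.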

Four levers, one goal: generalize

Different mechanisms, same purpose — stop the ensemble from memorizing the training set. Complementary, not mutually exclusive.

eta (shrinkage)

Small steps + many rounds beats big steps + few rounds — each tree corrects only a fraction of the remaining error.

Figure: eta=0.05, 400 rounds vs. eta=0.3, 30 rounds

subsample / colsample

Each tree only sees a random slice of rows and columns, so no single tree can over-fit one quirky sample or feature.

Figure: full data (rows × cols) → one tree's sampled subset, with subsample=0.8 and colsample_bytree=0.8

max_depth / min_child_weight

Shallow trees with a higher child-weight floor generalize better than deep trees chasing every training point.

Figure: max_depth=3 generalizes better; max_depth=12 memorizes noise

gamma / lambda / alpha

Structural penalty (γ) prunes weak splits outright; L2/L1 (λ/α) shrink surviving leaf weights toward zero.

Figure: raw leaf weights before λ, α vs. shrunk weights after λ, α (one leaf pruned by γ)
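The four levers are usually set together in one parameter dict. The sketch below reuses the values shown in this section (eta=0.05 with 400 rounds, subsample and colsample_bytree of 0.8, max_depth=3); the min_child_weight, gamma, lambda, and alpha values are illustrative assumptions, not tuned recommendations.

```python
# One parameter dict combining the four levers, for the native xgb.train() API.
params = {
    "objective": "binary:logistic",
    # Lever 1: shrinkage
    "eta": 0.05,
    # Lever 2: row and column sampling per tree
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    # Lever 3: tree shape
    "max_depth": 3,
    "min_child_weight": 5,   # assumed value
    # Lever 4: structural and weight penalties
    "gamma": 1.0,            # assumed value
    "lambda": 1.0,           # assumed value
    "alpha": 0.1,            # assumed value
}
num_boost_round = 400  # a small eta is paired with many rounds
```

Because the levers are complementary, they are typically tuned jointly against a held-out score, not one at a time in isolation.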

Mental models & caveats, visually

Four pictures that explain most XGBoost debugging sessions — why hist is fast, where missing values go, what early stopping actually keeps, and why importance rankings disagree.

histogram binning (why hist is fast)

Continuous features are quantized into ≤ max_bin buckets once; split search then scans bin edges, not raw values. Fewer bins = faster + leaner, slightly coarser splits.

Figure: raw values (thousands of candidates) → quantile sketch → binned (≤ max_bin=256 candidates). Splits land only on bin edges; QuantileDMatrix stores just the bins.
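The effect of binning on the number of split candidates can be shown with plain quantiles. This is a simplified sketch: XGBoost uses a weighted quantile sketch, whereas the hypothetical helper below just takes evenly spaced order statistics from the sorted values.

```python
import random


def quantile_bin_edges(values, max_bin=256):
    """Return at most max_bin - 1 interior cut points at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for b in range(1, max_bin):
        cut = ordered[(b * n) // max_bin]
        if not edges or cut > edges[-1]:   # skip duplicate cut points
            edges.append(cut)
    return edges


random.seed(0)
feature = [random.gauss(0.0, 1.0) for _ in range(10_000)]

edges = quantile_bin_edges(feature, max_bin=256)
raw_candidates = len(set(feature)) - 1   # one threshold between each pair of distinct values
print(raw_candidates, len(edges))
```

Split search over the binned feature scans len(edges) thresholds, not raw_candidates, at the cost of slightly coarser split points.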

sparsity-aware default direction

During training both branches are tried for the missing bucket; the higher-gain side becomes that node's learned default direction. Imputing beforehand erases this signal.

Figure: a node testing x<7? with yes and no branches; NaN rows follow the learned default direction. The direction is chosen per node, by gain: missing values are information, not noise.
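Choosing the default direction amounts to scoring the same split twice, once with the missing rows sent left and once sent right. The sketch below assumes the gradient and Hessian sums for each group are already known; the numbers and helper names are invented for illustration.

```python
def structure_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0):
    """Split gain before subtracting gamma."""
    def score(G, H):
        return G * G / (H + reg_lambda)

    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R))


def best_default_direction(left, right, missing, reg_lambda=1.0):
    """Each argument is a (G, H) pair: rows below the threshold, rows at or
    above it, and rows where the feature is missing. Returns (direction, gain)."""
    (G_L, H_L), (G_R, H_R), (G_M, H_M) = left, right, missing
    gain_left = structure_gain(G_L + G_M, H_L + H_M, G_R, H_R, reg_lambda)
    gain_right = structure_gain(G_L, H_L, G_R + G_M, H_R + H_M, reg_lambda)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


# The missing rows have a negative gradient sum, like the left group.
direction, gain = best_default_direction(
    left=(-6.0, 4.0), right=(5.0, 4.0), missing=(-3.0, 2.0)
)
print(direction, gain)
```

If the missing rows had been imputed with a value above the threshold, they would have been forced right and the higher-gain arrangement would never have been considered.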

early stopping keeps the LAST model

Training halts early_stopping_rounds after the best round — so the returned Booster contains those extra, worse trees unless you slice or use save_best=True.

Figure: train and eval curves, with best_iteration and the later stop marked; the trees in between stay in the model. Fix: iteration_range=(0, best_iteration+1), or EarlyStopping(save_best=True).

importance types disagree

weight counts splits, gain averages loss reduction, cover averages samples touched — the same three features can rank three different ways.

Figure: weight, gain and cover bar charts: same 3 features, 3 different rankings. High-cardinality features inflate weight; prefer SHAP for decisions.

Worth memorizing

2 APIs, 2 defaults: native num_boost_round=10 vs. sklearn n_estimators=100
hist is default: since 2.0; required for categorical & GPU training
device replaced gpu_id: gpu_id/gpu_hist/use_gpu are gone; use device=
last eval_metric wins: only the final metric in the list drives early stopping
train() keeps last iter: not the best; slice with iteration_range or save_best=True
sklearn ES in constructor: early_stopping_rounds / eval_metric moved off fit() in 2.0
trees skip scaling: threshold splits mean no need to standardize numeric features
NaN handled natively: missing values learn a default split direction; no imputation
categorical re-coding: since 3.1, the Booster stores & auto re-codes training categories
QDM val needs ref=: QuantileDMatrix validation sets must reference the train matrix
lower eta ⇒ more rounds: shrinking eta needs a proportionally larger num_boost_round
save as JSON/UBJSON: legacy binary .model is gone; pickle isn't version-safe
weight ≠ gain ≠ cover: three importance types can rank features differently
inplace_predict serves: thread-safe, no DMatrix; predict() holds a per-Booster lock