Python · gradient boosting

XGBoost Cheat Sheet

eXtreme Gradient Boosting — a scalable, regularized implementation of gradient-boosted decision trees. Additive ensemble: each new tree corrects the residual error of all trees before it. Covers the native Booster API, the scikit-learn wrapper, tree/regularization parameters, early stopping, and explainability — current through XGBoost 3.2 (Feb 2026).


Signature model: boosting is additive, not voting

F_M(x) = F₀(x) + η·f₁(x) + η·f₂(x) + ··· + η·f_M(x)

  • F₀(x): base_score
  • f₁(x): fits the residual of F₀
  • M: num_boost_round
  • bst.predict(dtest): final ensemble score F_M(x) → link fn → prediction

Every round, minimize: Σᵢ l(yᵢ, ŷᵢ) + Σₖ Ω(fₖ)

  • l = loss (objective param)
  • Ω(f) = γT + ½λ‖w‖² + α‖w‖₁
  • a 2nd-order Taylor expansion (grad + hess) approximates the loss each round
  • η = eta / learning_rate · T = leaves per tree · γ = gamma · λ = reg_lambda · α = reg_alpha
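The additive form above can be sketched in plain Python. The toy "trees" below are hypothetical stand-in functions rather than XGBoost objects, `base_margin` stands in for the initial score expressed as a raw margin, and the sigmoid link assumes a binary:logistic-style objective.

```python
import math

def ensemble_score(x, trees, eta=0.3, base_margin=0.0):
    """Raw additive score: F_M(x) = F_0 + eta * sum of f_k(x)."""
    return base_margin + eta * sum(tree(x) for tree in trees)

def sigmoid(z):
    """Logistic link: maps a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-split "trees": each returns a leaf weight for a scalar x.
toy_trees = [
    lambda x: 1.0 if x > 0.5 else -1.0,
    lambda x: 0.5 if x > 0.2 else -0.5,
]

margin = ensemble_score(0.7, toy_trees, eta=0.3)  # 0.3 * (1.0 + 0.5)
probability = sigmoid(margin)
```

Every tree contributes to the same running sum, which is why removing or reordering trees changes the score, unlike a voting ensemble.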
Setup & data
01
Install & import · setup
  • pip install xgboost: full package incl. GPU (CUDA) support
  • pip install xgboost-cpu: smaller wheel, no GPU / federated learning
  • import xgboost as xgb: conventional alias
  • xgb.__version__: check installed version (current: 3.2)
  • conda install -c conda-forge py-xgboost: conda auto-detects GPU variant
02
DMatrix: native data container · setup
  • xgb.DMatrix(X, label=y): core input; wraps NumPy / pandas / SciPy sparse
  • DMatrix(X, weight=w, base_margin=m): per-sample weights, custom initial score
  • xgb.QuantileDMatrix(X, y, ref=None): pre-binned; faster, lower-memory construction for hist
  • ExtMemQuantileDMatrix(iterator): out-of-core / external-memory, TB-scale (3.0+)
  • dtrain.save_binary('train.dmatrix'): cache preprocessed matrix to disk
03
Categorical & missing values · setup
  • DMatrix(X, enable_categorical=True): auto-detects pandas category dtype
  • tree_method='hist' (or 'approx'): required for native categorical splits; 'exact' does not support them
  • missing=np.nan: default; NaN handled natively (learned split direction)
  • Auto-recoding (3.1+): the Booster stores the training categories and re-codes categories in new data to match at inference
Native API (core)
04
Parameters dict & booster type · core
  • params = {'max_depth': 6, 'eta': 0.3, 'objective': 'binary:logistic'}: plain dict or list of (key, value) pairs
  • booster = 'gbtree': default; additive regression trees
  • booster = 'dart': gbtree + dropout of trees each round
  • booster = 'gblinear': linear base learner (deprecated in recent releases)
  • device = 'cpu' | 'cuda' | 'cuda:0': replaces removed gpu_id/gpu_hist
05
xgb.train(): training loop · core
  • bst = xgb.train(params, dtrain, num_boost_round=100): returns a fitted Booster
  • evals=[(dtrain, 'train'), (dval, 'eval')]: watchlist, printed / logged each round
  • early_stopping_rounds=20: stop once the last metric on the last evals entry has not improved for 20 rounds
  • evals_result={}: dict populated with per-round metric history
  • verbose_eval=10: print every N rounds (False to silence)
06
Booster methods · core
  • bst.predict(dtest): inference on a DMatrix
  • bst.save_model('model.json'): JSON/UBJSON; see card 25
  • bst.get_score(importance_type='gain'): per-feature importance dict
  • bst.num_boosted_rounds(): boosting rounds actually run
  • bst.best_iteration / bst.best_score: set when early stopping is used
07
predict() options · core
  • output_margin=True: raw score, before the link function
  • pred_contribs=True: SHAP values, one per feature + bias
  • pred_interactions=True: pairwise SHAP interaction values
  • iteration_range=(0, bst.best_iteration+1): use only the rounds up to and including the best one
Scikit-learn API
08
XGBClassifier · sklearn
  • clf = XGBClassifier(n_estimators=300, max_depth=6,
      learning_rate=0.05, tree_method='hist')
    sklearn-compatible estimator
  • clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)]): early_stopping_rounds passed in constructor
  • clf.predict(X_test): class labels
  • clf.predict_proba(X_test): class probabilities
09
XGBRegressor · sklearn
  • reg = XGBRegressor(objective='reg:squarederror'): default squared-error regression
  • reg.fit(X, y): standard sklearn fit/predict
  • objective='reg:absoluteerror': MAE; robust to outliers
  • objective='reg:quantileerror': with quantile_alpha=[0.1, 0.5, 0.9]
  • objective='reg:pseudohubererror': smooth Huber-style loss
10
Ranker & random-forest variants · sklearn
  • XGBRanker(objective='rank:ndcg'): needs qid or group per query
  • XGBRFClassifier() / XGBRFRegressor(): single-round bagged forest, not boosting
  • subsample=0.8, colsample_bynode=0.8: RF defaults, bootstrap-style sampling
  • n_estimators: here, the number of trees in the forest (no shrinkage)
11
Scikit-learn interop · sklearn
  • Pipeline([('sc', StandardScaler()), ('xgb', clf)]): drop-in step; trees don't need scaling though
  • GridSearchCV(clf, param_grid, cv=5): standard hyperparameter search
  • clf.get_booster(): underlying native Booster
  • clf.feature_importances_: array of per-feature importances; uses gain by default
Tree / booster parameters
12
Tree growth control · tree
  • max_depth = 6: default; deeper → more complex, more overfit risk
  • min_child_weight = 1: min sum of Hessians in a child; ↑ = more conservative
  • gamma / min_split_loss = 0: min loss reduction required to split a leaf
  • max_leaves = 0: cap on leaves, used with grow_policy='lossguide'
  • grow_policy = 'depthwise' | 'lossguide': level-wise (default) vs best-gain-first
13
Sampling (prevents overfitting) · tree
  • subsample = 1: row-sample ratio per tree; try 0.6–0.9
  • colsample_bytree = 1: column-sample once per tree
  • colsample_bylevel / colsample_bynode: re-sample columns per depth level / per split
  • sampling_method = 'uniform' | 'gradient_based': gradient-based needs GPU (hist)
14
Learning-rate control · tree
  • eta / learning_rate = 0.3: shrinkage per round; typical range 0.01–0.3
  • num_boost_round / n_estimators: number of boosting rounds; pair a lower eta with a higher count
  • max_delta_step = 0: no cap by default; a small positive value can help logistic objectives on imbalanced classes
15
tree_method & device · tree
  • tree_method = 'hist': default since 2.0; fast histogram binning, the usual choice for categorical features and GPU
  • tree_method = 'exact' | 'approx': exact = greedy, no binning; approx = quantile sketch recomputed each iteration
  • device = 'cuda' / 'cuda:0': default GPU, or pick one by ordinal; combine with tree_method='hist' (multi-GPU goes through Dask/Spark, card 26)
  • multi_strategy = 'one_output_per_tree' | 'multi_output_tree': vector-leaf multi-target trees
Regularization & objectives
16
Regularization · reg
  • reg_lambda / lambda = 1: L2 penalty on leaf weights (Ω term)
  • reg_alpha / alpha = 0: L1 penalty; pushes weak leaves toward 0
  • scale_pos_weight: for imbalanced binary classes; a common starting point is #neg / #pos
  • base_score: initial prediction; auto-estimated per objective since 3.1
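The #neg / #pos heuristic for scale_pos_weight is simple enough to compute by hand. The helper name and the toy labels below are hypothetical, and the ratio is only a starting point for tuning.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples for binary 0/1 labels."""
    positives = sum(1 for v in labels if v == 1)
    negatives = sum(1 for v in labels if v == 0)
    if positives == 0:
        raise ValueError("no positive examples; the ratio is undefined")
    return negatives / positives

y_toy = [0] * 95 + [1] * 5  # made-up, heavily imbalanced labels
spw = suggested_scale_pos_weight(y_toy)
params = {"objective": "binary:logistic", "scale_pos_weight": spw}
```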
17
Objective functions · reg
  • binary:logistic: binary classification → probability
  • multi:softmax / multi:softprob: multiclass; needs num_class
  • reg:squarederror / reg:absoluteerror: MSE / MAE regression
  • rank:ndcg / rank:pairwise / rank:map: learning-to-rank objectives
  • survival:cox / survival:aft: survival analysis
  • count:poisson · reg:tweedie: count data / zero-inflated continuous
18
eval_metric · reg
  • rmse · mae · logloss · error · mlogloss: standard regression / classification metrics
  • auc · aucpr · ndcg@k · map: ranking-oriented metrics
  • eval_metric=['auc', 'logloss']: multiple metrics; last one drives early stopping
  • custom_metric=fn: fn(y_pred, dtrain) returns (name, value); set maximize=True when higher is better
Training control
19
Early stopping · train
  • early_stopping_rounds=20: stop after N rounds with no improvement
  • bst.best_iteration / clf.best_iteration: round to use at inference time
  • requires evals / eval_set: at least one validation watch pair
  • maximize=True: flip direction for metrics like AUC/NDCG
20
Cross-validation · train
  • xgb.cv(params, dtrain, num_boost_round=500,
      nfold=5, early_stopping_rounds=20, seed=42)
    returns a DataFrame: train/test mean ± std per round
  • stratified=True: preserve class ratios across folds
  • cross_val_score(clf, X, y, cv=5): via the sklearn wrapper instead
21
Callbacks · train
  • xgb.callback.EarlyStopping(rounds=20, save_best=True): object form of early stopping
  • xgb.callback.LearningRateScheduler(fn): vary eta across rounds
  • xgb.callback.TrainingCheckPoint(directory=...): periodic model snapshots
  • callbacks=[...]: pass list to train() or .fit()
Post-training
22
Feature importance · post
  • bst.get_score(importance_type='gain'): types are weight, gain, cover, total_gain, total_cover
  • clf.feature_importances_: sklearn API array, default gain
  • xgb.plot_importance(bst, max_num_features=15): quick bar-chart view
23
SHAP values · post
  • bst.predict(dtest, pred_contribs=True): native, exact SHAP per feature + bias column
  • shap.TreeExplainer(bst).shap_values(X): via the shap library; richer plots
  • pred_interactions=True: pairwise SHAP interaction values
24
Plotting · post
  • xgb.plot_tree(bst, num_trees=0): renders one tree; needs graphviz
  • xgb.to_graphviz(bst, num_trees=0): returns a Graphviz Source object
  • xgb.plot_importance(bst): matplotlib importance bar chart
25
Save / load models · post
  • bst.save_model('model.json'): JSON, or .ubj for UBJSON (the default binary format); portable across bindings
  • bst.load_model('model.json'): re-hydrate a Booster
  • pickle.dump(clf, f): sklearn wrapper; Python-only, less portable
  • legacy .model binary format: removed; use JSON/UBJSON going forward
26
GPU & distributed · post
  • device='cuda', tree_method='hist': single-GPU training
  • xgboost.dask · xgboost.spark: distributed multi-node / multi-GPU training
  • DataIter + ExtMemQuantileDMatrix: stream TB-scale data via external memory (3.0+)
Quick reference
Most-used defaults, at a glance · quick read
  • max_depth=6 · eta=0.3 · min_child_weight=1: defaults; usually the first three to tune
  • subsample=1 · colsample_bytree=1: defaults; lower for regularization
  • gamma=0 · reg_lambda=1 · reg_alpha=0: defaults; raise to fight overfitting
  • n_estimators=100 (sklearn) · num_boost_round=10 (native): watch out, the two APIs default differently
objective → default eval_metric · quick read
  • binary:logistic → logloss
  • reg:squarederror → rmse
  • multi:softmax / softprob → mlogloss
  • rank:ndcg → ndcg
  • survival:cox → cox-nloglik

Why the defaults are shaped the way they are

Four mechanics behind XGBoost's speed and regularization story.

1 · split gain

A node with stats (G, H) splits into a left child (G_L, H_L) and a right child (G_R, H_R):

Gain = ½ [ G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ) ] − γ

Split only if Gain > 0, so gamma prunes weak splits.

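Every candidate split is scored from sums of 1st- and 2nd-order gradient statistics (Newton boosting). Because the score needs only those sums, hist can accumulate G and H per bin and evaluate candidate thresholds from the histogram instead of rescanning rows.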
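The gain formula above can be checked with a few lines of plain Python. This sketch assumes reg_alpha = 0, as the formula does, and the gradient and Hessian sums are made-up numbers.

```python
def leaf_score(G, H, lam):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain from splitting a node with stats (G_L + G_R, H_L + H_R) in two."""
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (leaf_score(G_L, H_L, lam)
                  + leaf_score(G_R, H_R, lam)
                  - leaf_score(G, H, lam)) - gamma

# Children with opposite gradient sums: the parent's own score is 0.
gain_free = split_gain(-4.0, 3.0, 4.0, 3.0, lam=1.0, gamma=0.0)
# The same split with a large gamma falls below 0 and would be rejected.
gain_pruned = split_gain(-4.0, 3.0, 4.0, 3.0, lam=1.0, gamma=5.0)
```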

2 · shrinkage vs. λ

Figure: leaf weight vs. boosting round, comparing small η with many trees against large η with few trees. Leaf weight w* = −G/(H+λ); λ shrinks w toward 0.

Low eta with more rounds usually generalizes better than high eta with few rounds; reg_lambda shrinks leaf weights in the same direction.
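A small numeric sketch of the leaf-weight formula w* = −G/(H+λ), showing how a larger λ pulls the weight toward 0 before eta scales the step. The G and H values are made up.

```python
def optimal_leaf_weight(G, H, lam=1.0):
    """Closed-form leaf weight: w* = -G / (H + lambda)."""
    return -G / (H + lam)

G, H = -6.0, 2.0  # made-up gradient / Hessian sums for one leaf
weights = {lam: optimal_leaf_weight(G, H, lam) for lam in (0.0, 1.0, 10.0)}

# eta then scales whatever weight the leaf ends up with.
eta = 0.1
step = eta * weights[1.0]
```

Here the weight falls from 3.0 (λ = 0) to 2.0 (λ = 1) to 0.5 (λ = 10), while eta scales the contribution without changing the fitted weight.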

3 · early stopping

Figure: train and eval loss vs. boosting round. Eval loss rises past best_iteration, which signals overfitting.

early_stopping_rounds halts training N rounds after the watched metric last improved, keeping best_iteration.
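The stopping rule can be sketched in plain Python for a lower-is-better metric. This is a simplified model of the behaviour described above, not XGBoost's own implementation, and the loss values are invented.

```python
def early_stop_round(eval_losses, patience):
    """Return (best_iteration, rounds_run) for a lower-is-better metric."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            return best_iter, i + 1  # stopped after `patience` stale rounds
    return best_iter, len(eval_losses)

losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.49]
best_iteration, rounds_run = early_stop_round(losses, patience=3)
```

With these numbers the best round is index 3, and training halts after 7 rounds, so the last three trees are the ones iteration_range would exclude at inference.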

4 · importance types disagree

Figure: weight, gain, and cover bar charts for the same 3 features, giving 3 different rankings.

weight = split count, gain = avg. loss reduction, cover = avg. samples affected — pick the one matching your question.
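A toy illustration of how the three definitions can rank the same features differently. The split records are hypothetical, and cover is modelled as a plain sample count for simplicity, whereas XGBoost derives it from Hessian sums.

```python
from collections import defaultdict

# Hypothetical split records: (feature, loss reduction, samples covered).
splits = [
    ("age", 1.0, 50), ("age", 1.0, 40), ("age", 1.0, 30),
    ("income", 9.0, 100),
    ("region", 2.0, 400), ("region", 2.0, 300),
]

count = defaultdict(int)
gain_sum = defaultdict(float)
cover_sum = defaultdict(float)
for feature, gain, cover in splits:
    count[feature] += 1
    gain_sum[feature] += gain
    cover_sum[feature] += cover

weight = dict(count)                                        # number of splits
avg_gain = {f: gain_sum[f] / count[f] for f in count}       # mean loss reduction
avg_cover = {f: cover_sum[f] / count[f] for f in count}     # mean samples affected

top_by = {name: max(scores, key=scores.get)
          for name, scores in [("weight", weight),
                               ("gain", avg_gain),
                               ("cover", avg_cover)]}
```

Each importance type picks a different top feature here: the most frequently split one, the one with the largest average gain, and the one touching the most samples.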

Worth memorizing

The handful of facts that explain most XGBoost debugging sessions.

Two APIs, two defaults
Native xgb.train defaults num_boost_round=10; sklearn's XGBRegressor/Classifier defaults n_estimators=100. Don't assume they match.
hist is the default tree_method
Since 2.0, hist is used unless you set otherwise. Native categorical splits need hist or approx, and hist is the usual choice for GPU training.
device replaced gpu_id
gpu_id, gpu_hist, and use_gpu were removed; use device='cuda' / 'cuda:0' instead.
Last eval_metric drives early stopping
When passing a list, only the final metric is used to decide when to stop — order matters.
Trees don't need feature scaling
Splits are threshold-based, so standardizing/normalizing numeric features is unnecessary (unlike linear models or gblinear).
NaN is handled natively
Missing values learn a default split direction per node during training — no imputation required for tree boosters.
Categorical re-coding since 3.1
The Booster now stores the training category mapping and auto re-codes new data at inference — no more silent mismatches.
Lower eta ⇒ more rounds
Shrinking eta almost always needs a proportionally larger num_boost_round to reach the same fit.
Save as JSON/UBJSON
The legacy binary .model format is gone; UBJSON is the current default and preserves feature names across bindings.