XGBoost (Python) Cheat Sheet

01 Install & Import (get set up)

pip install xgboost★
Full package, incl. GPU (CUDA) support.
pip install xgboost-cpu
Smaller wheel — no GPU / federated learning.
import xgboost as xgb★
Conventional alias.
xgb.__version__
Check installed version (current: 3.2).
conda install -c conda-forge py-xgboost
Conda auto-detects the GPU variant.

02 Global Configuration (library-wide switches)

with xgb.config_context(verbosity=0): ...
Scoped global config (silence warnings, etc.).
xgb.set_config(verbosity=2) / xgb.get_config()
Set / inspect global configuration.
verbosity: 0–3
Per-booster: silent, warning, info, debug.
validate_parameters=True
Warn on unknown params (default in Python).
seed / random_state · seed_per_iteration
Reproducibility; re-seed sampling each round.
nthread / n_jobs
Thread count; default = all available cores.

03 DMatrix (the native container)

xgb.DMatrix(X, label=y)★
Core input; wraps NumPy / pandas / SciPy sparse.
DMatrix(X, weight=w, base_margin=m)
Per-sample weights, custom initial score.
DMatrix(X, feature_names=[...], feature_types=[...])
Explicit column labels/types (auto from pandas).
DMatrix(X, feature_weights=fw)
Per-feature selection probability for colsample.
dtrain.num_row() · dtrain.num_col() · dtrain.slice(idx)
Inspect dimensions; row-subset a DMatrix.
dtrain.save_binary('train.dmatrix')
Cache the preprocessed matrix to disk.
DMatrix('train.dmatrix')
Reload the cached binary — skips re-parsing.

04 QuantileDMatrix & External Memory (big-data containers)

xgb.QuantileDMatrix(X, y)★
Pre-binned for hist; faster + far lower memory.
QuantileDMatrix(X_val, y_val, ref=dtrain) [gotcha]
Validation sets MUST pass ref=; 3.0+ errors without it.
ExtMemQuantileDMatrix(data_iter) [3.0+]
Out-of-core streaming; TB-scale on one machine.
class MyIter(xgb.DataIter): def next(self, input_data)...
Custom batch iterator feeding external memory.
max_quantile_batches · min_cache_page_bytes
Sketching / cache-page tuning knobs.
cache_host_ratio [3.1+]
Split GPU external-memory cache between host/device RAM.

05 Accepted Input Formats (what X can be)

np.ndarray · pd.DataFrame★
The everyday inputs; pandas dtypes are honored.
scipy.sparse.csr_matrix / csc_matrix
Sparse input. In dense arrays a 0 is a real value, not missing; in sparse matrices the absent (unstored) entries are treated as missing.
pyarrow.Table
Zero-copy Arrow ingestion.
polars.DataFrame / LazyFrame [3.0+]
Native Polars support (categoricals from 3.1).
cudf.DataFrame · cupy.ndarray
GPU-resident inputs — no host round-trip.
text files (libsvm/csv URIs) [warns]
Text input is discouraged since 3.1 — load via a DataFrame library instead.

06 Categorical Data (no manual encoding)

DMatrix(X, enable_categorical=True)★
Auto-detects pandas/Polars category dtype.
XGBClassifier(enable_categorical=True)
Same switch on the sklearn wrapper.
tree_method must be 'hist'/'approx' [gotcha]
Native categorical splits need histogram methods.
max_cat_to_onehot
Below this cardinality, use one-hot style splits.
max_cat_threshold
Cap categories considered per partition split.
Auto-recoding [3.1+]
Booster stores training categories (incl. strings) and re-codes new/unseen values at inference automatically.
bst.get_categories() / dtrain.get_categories() [3.1+]
Export the stored category index (Arrow-friendly).

07 Missing Values & Sparsity (sparsity-aware splits)

missing=np.nan★
Default; NaN handled natively — no imputation needed.
DMatrix(X, missing=-999)
Treat a sentinel value as missing instead.
learned default direction
Each split learns which branch missing values take — the paper's "sparsity-aware" algorithm.
imputing to 0 or mean first [anti-pattern]
Usually hurts — it hides the missingness signal XGBoost would otherwise exploit.

08 Parameter Dict & Booster Type (what you configure)

params = {'max_depth':6, 'eta':0.3, 'objective':'binary:logistic'}★
Plain dict, or a list of (key, value) pairs.
booster: 'gbtree'
Default — additive regression trees.
booster: 'dart'
gbtree + dropout of trees each round (card 15).
booster: 'gblinear'
Linear base learner (deprecated since 3.3, card 16).
device: 'cpu' | 'cuda' | 'cuda:0'
Replaces the removed gpu_id/gpu_hist.
disable_default_eval_metric=True
Suppress the objective's built-in metric (custom-metric workflows).

09 Tree Growth Control (shape of each tree)

max_depth = 6★
Default; deeper → more complex, more overfit risk.
min_child_weight = 1★
Min sum-Hessian in a child; ↑ = more conservative.
gamma / min_split_loss = 0
Min loss reduction required to split a leaf.
max_leaves = 0
Cap on leaves; used with grow_policy='lossguide'.
grow_policy: 'depthwise' | 'lossguide'
Level-wise (default) vs. best-gain-first (LightGBM-style).
max_bin = 256
Histogram bins per feature (hist); ↑ = finer splits, slower.
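
To make gamma, lambda and min_child_weight concrete, here is a plain-Python sketch of the textbook split-gain formula from the XGBoost paper. It ignores the L1 term (alpha), the helper names are hypothetical, and the library's internal scaling of the gain may differ, so treat it as arithmetic, not as the library's code path.

```python
def leaf_weight(G, H, reg_lambda=1.0):
    """Closed-form optimal leaf value for gradient sum G and Hessian sum H."""
    return -G / (H + reg_lambda)


def split_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting one leaf into a left and a right child."""
    def score(G, H):
        return G * G / (H + reg_lambda)

    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma


def split_allowed(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=0.0,
                  min_child_weight=1.0):
    """Reject splits with a light child or with no positive gain."""
    if min(H_L, H_R) < min_child_weight:
        return False
    return split_gain(G_L, H_L, G_R, H_R, reg_lambda, gamma) > 0.0


gain = split_gain(-4.0, 3.0, 5.0, 4.0)               # 0.5 * (4 + 5 - 0.125)
gain_pruned = split_gain(-4.0, 3.0, 5.0, 4.0, gamma=5.0)
print(gain, gain_pruned, leaf_weight(-4.0, 3.0), leaf_weight(5.0, 4.0))
```

Raising gamma turns the same candidate from a gain of 4.4375 into a negative one, which is why it acts as a pruning threshold.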

10 Sampling: Bagging Rows & Cols (fights overfitting)

subsample = 1★
Row-sample ratio per tree; try 0.6–0.9.
colsample_bytree = 1★
Column-sample once per tree.
colsample_bylevel / colsample_bynode
Re-sample columns per depth level / per split.
colsample_* multiply [gotcha]
All three compound: 0.5 × 0.5 × 0.5 leaves ~12.5% of columns per split.
sampling_method: 'uniform' | 'gradient_based'
gradient_based requires GPU (hist); allows tiny subsample.
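
The compounding of the three colsample_* ratios is plain arithmetic. A sketch with a hypothetical 64-feature dataset; the exact rounding the library applies is an assumption here.

```python
n_features = 64  # hypothetical dataset width
rates = {
    "colsample_bytree": 0.5,   # drawn once per tree
    "colsample_bylevel": 0.5,  # drawn again at each depth level
    "colsample_bynode": 0.5,   # drawn again at each split
}

remaining = n_features
for name, rate in rates.items():
    # Each stage samples from what the previous stage left over.
    remaining = max(1, int(remaining * rate))
    print(f"{name}: {remaining} columns left")

fraction = remaining / n_features
print(f"{fraction:.1%} of columns are candidates at each split")
```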

11 Learning Rate, Rounds & Forests (step size × count)

eta / learning_rate = 0.3★
Shrinkage per round; typical range 0.01–0.3.
num_boost_round / n_estimators
Tree count — pair a lower eta with a higher count.
num_parallel_tree = 1
Trees per round → boosted random forest when >1.
max_delta_step = 0
Helps logistic regression on imbalanced classes.

12 tree_method, Device & Updaters (how splits are found)

tree_method: 'hist'★
Default since 2.0; fast histogram binning.
tree_method: 'exact' | 'approx'
Exact = greedy, no binning; approx = global sketch.
device: 'cuda' / 'cuda:0'
One GPU per process; the ordinal (cuda:0) picks which. Pair with tree_method='hist'.
updater: 'refresh', process_type: 'update'
Refresh leaf stats/values on new data — no new trees.
max_cached_hist_node
Histogram-cache cap for deep trees.
gpu_id, gpu_hist, use_gpu [removed]
Deprecated params removed in 3.1 — use device=.

13 Regularization (penalize complexity)

reg_lambda / lambda = 1★
L2 penalty on leaf weights (the Ω term).
reg_alpha / alpha = 0★
L1 penalty; pushes weak leaves toward zero.
scale_pos_weight
Imbalanced binary classes ≈ #neg / #pos.
base_score
Initial prediction; auto-estimated per objective since 3.1.
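
The scale_pos_weight rule of thumb is a one-line ratio. A sketch with made-up label counts (950 negatives, 50 positives); the params dict only shows where the value would go.

```python
# Hypothetical imbalanced labels: 950 negatives, 50 positives.
y = [0] * 950 + [1] * 50

n_pos = sum(y)
n_neg = len(y) - n_pos

# Rule of thumb from the card: #neg / #pos.
scale_pos_weight = n_neg / n_pos

params = {
    "objective": "binary:logistic",
    "scale_pos_weight": scale_pos_weight,
    "eval_metric": "aucpr",
}
print(params)
```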

14 Monotone & Interaction Constraints (domain knowledge in)

monotone_constraints=(1,-1,0)
Force ↑ / ↓ / free relation per feature — key for credit/risk models.
monotone_constraints={'age':1}
Dict-by-name form on the sklearn wrapper.
interaction_constraints=[[0,1],[2,3,4]]
Only features in the same group may co-occur on a path.
hist + monotone → shallow trees [caveat]
Constraints can wipe out all bin candidates; raise max_bin to compensate.

15 DART Booster (dropout for trees)

rate_drop = 0.0
Fraction of existing trees dropped per round.
skip_drop = 0.0
Probability of skipping dropout entirely that round.
one_drop = 0
Set to 1 to always drop at least one tree per round (binomial-plus-one).
sample_type: 'uniform'|'weighted' · normalize_type: 'tree'|'forest'
How victims are picked; how new trees are re-weighted.
predict() on non-training data [gotcha]
DART inference uses all trees by default; set training=True only to reproduce dropout behavior.

16 gblinear Booster (linear base learner)

updater: 'coord_descent' | 'shotgun'
Deterministic vs. parallel (non-deterministic) solver.
feature_selector: 'cyclic'|'shuffle'|'greedy'|'thrifty'
Coordinate selection strategy.
top_k
Top features per round for greedy/thrifty selectors.
clf.coef_ · clf.intercept_
Linear coefficients — only defined for gblinear.
booster='gblinear' [deprecated]
Deprecated in 3.3 — prefer sklearn's linear models going forward.

17 Multi-Target & Vector Leaf (many outputs, one model)

multi_strategy='one_output_per_tree'
Default: independent tree set per target.
multi_strategy='multi_output_tree' [3.2]
Vector-leaf trees capture cross-target correlation; major expansion in 3.2 (still evolving).
XGBRegressor().fit(X, Y_2d)
Multi-output regression: just pass a 2-D target.
multi-label classification
2-D 0/1 label matrix works the same way.

18 Objectives: Classification & Regression (the everyday set)

binary:logistic★
Binary classification → probability.
binary:logitraw · binary:hinge
Raw margin output; hinge (0/1, no probabilities).
multi:softmax / multi:softprob★
Multiclass; needs num_class; softprob → per-class probs.
reg:squarederror / reg:absoluteerror★
MSE / MAE regression.
reg:logistic
Regression on [0,1] targets via logistic loss.
reg:squaredlogerror
RMSLE-style loss; targets must be > −1.

19 Objectives: Specialized (rank · survival · counts)

rank:ndcg / rank:pairwise / rank:map
LambdaMART learning-to-rank; ndcg supports position debiasing.
lambdarank_pair_method: 'topk'|'mean' · lambdarank_num_pair_per_sample
Pair construction for ranking objectives (2.0+).
survival:cox / survival:aft
Survival analysis; AFT takes label bounds in the DMatrix.
aft_loss_distribution: 'normal'|'logistic'|'extreme' + _scale
AFT error distribution and its scale.
count:poisson · reg:tweedie · reg:gamma
Counts / zero-inflated / gamma-distributed (insurance severity); tune tweedie_variance_power.
reg:quantileerror + quantile_alpha=[0.1,0.5,0.9]
Pinball loss → prediction intervals in one model.
reg:pseudohubererror + huber_slope
Smooth Huber-style robust loss.

20 eval_metric (how to score rounds)

rmse · mae · rmsle · mape · logloss · error · mlogloss
Standard regression / classification metrics.
auc · aucpr · pre@k · ndcg@k · map
Ranking-oriented; aucpr better for heavy imbalance.
poisson-nloglik · gamma-nloglik · cox-nloglik · aft-nloglik
Likelihood metrics for the specialized objectives.
eval_metric=['auc','logloss'] [gotcha]
Multiple metrics OK — but the last one drives early stopping.
ndcg- / map- (trailing dash)
Score empty lists as 0 instead of 1 in ranking eval.

21 Custom Objective & Metric (bring your own loss)

def obj(preds, dtrain): return grad, hess
Native API: supply 1st & 2nd derivatives of your loss.
xgb.train(params, dtrain, obj=obj, custom_metric=fn)
custom_metric returns (name, value).
XGBRegressor(objective=my_obj, eval_metric=mean_absolute_error)
Sklearn wrapper accepts callables — even sklearn metrics directly.
disable_default_eval_metric=True
Silence the objective's built-in metric when using your own.
custom obj outputs raw margin [gotcha]
With a custom objective, predict() returns untransformed scores — apply the link (e.g. sigmoid) yourself.

22 xgb.train(): Native Loop (the core API)

bst = xgb.train(params, dtrain, num_boost_round=100)★
Returns a fitted Booster.
evals=[(dtrain,'train'), (dval,'eval')]★
Watchlist, printed / logged each round.
early_stopping_rounds=20★
Stop when the last metric on the last evals entry stops improving.
evals_result={}
Dict populated with per-round metric history.
verbose_eval=10
Print every N rounds (False to silence).

23 Early Stopping (stop at the right round)

early_stopping_rounds=20★
Stop after N rounds with no improvement.
bst.best_iteration / clf.best_iteration
Round to use at inference time.
requires evals / eval_set [gotcha]
Needs at least one validation set; with none supplied, training raises an error rather than stopping early.
train() returns the LAST iteration [gotcha]
Not the best — use iteration_range at predict, or EarlyStopping(save_best=True).
maximize=True
Flip direction for metrics like AUC/NDCG.
sklearn: predict()/score()/apply() auto-use best_iteration
The wrapper handles the truncation for you.

24 Cross-Validation (robust round count)

xgb.cv(params, dtrain, num_boost_round=500, nfold=5, early_stopping_rounds=20)★
Returns a DataFrame — train/test mean ± std per round.
stratified=True★
Preserve class ratios across folds.
folds=KFold(...) or explicit index tuples
Custom splitters — time-series folds, group folds.
metrics=['auc'], shuffle=True, seed=42
Override watched metrics; reproducible shuffling.
cross_val_score(clf, X, y, cv=5)
Via the sklearn wrapper instead.

25 Callbacks (hook into training)

xgb.callback.EarlyStopping(rounds=20, save_best=True)
Object form — save_best keeps the best model, not the last.
EarlyStopping(metric_name='logloss', data_name='validation_0')
Pin which metric + eval set decides stopping.
xgb.callback.LearningRateScheduler(fn)
Vary eta across rounds.
xgb.callback.TrainingCheckPoint(directory=...)
Periodic model snapshots.
class Cb(xgb.callback.TrainingCallback): def after_iteration(...)
Custom hooks; return True to stop training.
callbacks are stateful [gotcha]
Re-initialize callback objects for every training run — they can't be reused.

26 Training Continuation (warm starts)

xgb.train(params, dnew, xgb_model=bst)
Add more rounds on top of an existing model (path or Booster).
clf.fit(X, y, xgb_model=prev.get_booster())
Same warm start through the sklearn wrapper.
process_type='update', updater='refresh', refresh_leaf=1
Keep tree structure; refresh leaf values on new data.
continuation ≠ online learning [caveat]
Old trees are frozen; new trees fit new-data residuals only.

27 XGBClassifier (sklearn API)

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05, tree_method='hist')★
Sklearn-compatible estimator.
clf.fit(X_tr, y_tr, eval_set=[(X_val,y_val)])★
early_stopping_rounds/eval_metric/callbacks go in the constructor since 2.0, not fit().
clf.predict(X_test) · clf.predict_proba(X_test)
Class labels / class probabilities.
clf.evals_result()
Per-round metric history for every eval set.
clf.apply(X)
Leaf index per tree — tree-embedding features.

28 XGBRegressor (sklearn API)

reg = XGBRegressor(objective='reg:squarederror')★
Default squared-error regression.
reg.fit(X, y)
Standard sklearn fit/predict.
objective='reg:absoluteerror'
MAE — robust to outliers.
objective='reg:quantileerror'
With quantile_alpha=[0.1,0.5,0.9].
objective='reg:pseudohubererror'
Smooth Huber-style loss.

29 Ranker & Random-Forest Variants (specialized estimators)

XGBRanker(objective='rank:ndcg')
Needs qid or group per query.
rnk.fit(X, y, qid=qid)
Data must be sorted by query group.
XGBRFClassifier() / XGBRFRegressor()
Single-round bagged forest, not boosting.
subsample=0.8, colsample_bynode=0.8
RF defaults — bootstrap-style sampling.
n_estimators
Here = trees in the forest (no shrinkage).

30 Scikit-learn Interop (plugs into the ecosystem)

Pipeline([('sc',StandardScaler()),('xgb',clf)])
Drop-in step (trees don't need scaling though).
GridSearchCV(clf, param_grid, cv=5)
Standard hyperparameter search; RandomizedSearchCV too.
clf.get_booster()
Underlying native Booster.
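clf.feature_importances_
Per-feature importance array; gain-based by default for tree boosters.
n_jobs thrashing [caveat]
XGBoost uses all threads by default, so parallelize one level only: keep sklearn's search at n_jobs=1, or limit XGBoost's n_jobs and parallelize the search, not both.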

31 predict() Options (shape the output)

output_margin=True
Raw score, before the link function.
pred_contribs=True★
SHAP values, one per feature + bias.
pred_interactions=True
Pairwise SHAP interaction values.
pred_leaf=True
Leaf index per tree (native-API twin of apply()).
iteration_range=(0, bst.best_iteration+1)
Use only the first N trees.
strict_shape=True
Always return (n_samples, n_groups) — stable shapes for pipelines.

32 inplace_predict() (serving-path inference)

bst.inplace_predict(X)★
Predict straight from NumPy/pandas — no DMatrix build.
thread-safe & lock-free
Safe to call concurrently — the recommended serving path.
bst.inplace_predict(cupy_array)
GPU input → GPU output (CuPy) when device='cuda'.
bst.predict() is NOT thread-safe [gotcha]
Per-Booster lock; use inplace_predict or bst.copy() per thread.

33 Model Inspection (open the box)

bst.trees_to_dataframe()
All trees as a tidy DataFrame — splits, gains, covers.
bst.dump_model('dump.txt', with_stats=True)
Human-readable text/JSON/dot dump (+ feature map).
bst.get_dump(dump_format='json')
Same dump as in-memory list of strings.
bst.get_split_value_histogram('age')
Where a feature's split thresholds land.
bst.save_config() / bst.load_config()
Full internal parameter config as JSON.
bst.num_boosted_rounds() · bst.num_features()
Trees actually built; feature count.

34 Booster Slicing & Attributes (the model as an object)

sub = bst[10:20]
Slice a tree range into a new Booster (1.3+); best_* attrs dropped.
bst.attr('k') · bst.set_attr(k='v') · bst.attributes()
String metadata saved with JSON/UBJSON models.
bst.copy()
Deep copy — e.g. one Booster per serving thread.
bst.eval(dtest) · bst.eval_set(evals)
One-off metric evaluation on a dataset.
bst.reset() [3.0+]
Release training data caches — shrink a kept Booster.

35 Feature Importance (what mattered)

bst.get_score(importance_type='gain')★
Types: weight, gain, cover, total_gain, total_cover.
clf.feature_importances_
Sklearn API array, default gain; set via importance_type=.
xgb.plot_importance(bst, max_num_features=15)
Quick bar-chart view.
importance is biased [caveat]
Split-based importance inflates high-cardinality features — prefer SHAP or permutation importance for decisions.

36 SHAP Values (explain a prediction)

bst.predict(dtest, pred_contribs=True)★
Native, exact TreeSHAP per feature + bias column.
shap.TreeExplainer(bst).shap_values(X)
Via the shap library — richer plots.
pred_interactions=True
Pairwise SHAP interaction values.
approx_contribs=True
Faster approximate contributions (Saabas method).
SHAP rows sum to the margin
Contributions + bias = raw score — sanity-check your pipeline.

37 Plotting (see the trees)

xgb.plot_tree(bst, num_trees=0)
Renders one tree — needs graphviz.
xgb.to_graphviz(bst, num_trees=0)
Returns a Graphviz Source object.
xgb.plot_importance(bst)
Matplotlib importance bar chart.
pd.DataFrame(evals_result['eval']).plot()
Learning curves from the recorded metric history.

38 Save / Load Models (persistence)

bst.save_model('model.ubj')★
UBJSON (default) or .json — portable across bindings.
bst.load_model('model.ubj')
Re-hydrate a Booster.
raw = bst.save_raw() · Booster(model_file=raw)
In-memory buffer round-trip — for DBs / blob stores.
clf.save_model('clf.json')
Sklearn wrapper persists estimator config too.
pickle.dump(clf, f) [caveat]
Python-only and NOT guaranteed across XGBoost versions; prefer save_model.
legacy .model binary format [removed]
Use JSON/UBJSON going forward.

39 GPU Training (device='cuda')

device='cuda', tree_method='hist'★
Single-GPU training; works on both APIs.
QuantileDMatrix(cudf_df, y)
GPU-resident data end-to-end — no host copies.
use_rmm=True / CUDA async pool [3.2]
Pooled GPU memory allocation (RMM plugin or built-in async pool).
sampling_method='gradient_based', subsample≈0.1
GPU-only gradient-based sampling — tiny subsamples stay accurate.
ExtMemQuantileDMatrix + device='cuda' [3.0+]
Host RAM as GPU cache — TB-scale on one Grace-Hopper-class node.

40 Dask: xgboost.dask (multi-node Python)

from xgboost import dask as dxgb
The distributed module (Dask arrays / DataFrames in).
dtrain = dxgb.DaskDMatrix(client, X, y)
Distributed DMatrix; DaskQuantileDMatrix for hist/GPU.
out = dxgb.train(client, params, dtrain); out['booster']
Functional API returns booster + metric history dict.
dxgb.DaskXGBClassifier() / DaskXGBRegressor()
Sklearn-style wrappers with a .client attribute.
dxgb.inplace_predict(client, booster, X)
Distributed no-DMatrix inference.

41 Spark: xgboost.spark (PySpark pipelines)

from xgboost.spark import SparkXGBClassifier
Also SparkXGBRegressor, SparkXGBRanker.
SparkXGBClassifier(features_col='features', label_col='y')
Drop-in PySpark ML Estimator — fits in a Pipeline.
num_workers=4, device='cuda'
Distributed CPU/GPU training across executors.
external memory on GPU [3.0+]
Spark package can spill to host memory when device is GPU.

42 Ecosystem & Beyond (who plays well with it)

optuna / hyperopt / FLAML
Bayesian & adaptive tuning — pair with pruning callbacks.
shap · sklearn.inspection
Explainability beyond built-ins (dependence, beeswarm).
ONNX (onnxmltools) · Treelite
Compile models for portable, low-latency serving.
federated learning plugin
Cross-silo training without sharing raw data (full wheel only).
R · JVM/Scala · Ruby · Julia · Swift · C/C++
Same JSON/UBJSON model files work across all bindings.

43 Tuning Playbook (official guidance, condensed)

1. fix eta≈0.1, tune tree shape
max_depth (3–10), min_child_weight (1–10) via CV.
2. tune sampling
subsample, colsample_bytree ∈ [0.6, 1.0].
3. tune regularization
gamma, reg_lambda, reg_alpha — log-scale search.
4. drop eta, raise rounds, early-stop
e.g. eta=0.01–0.05 with early_stopping_rounds finding N.
overfit? ↓depth ↓eta ↑gamma ↑min_child_weight ↓subsample
The five levers, in the order they usually help.
imbalance? scale_pos_weight + eval_metric='aucpr'
Or max_delta_step≈1 for calibrated probabilities.

44 Performance Best Practices (speed & memory)

QuantileDMatrix for hist training
Skips the raw-data copy — big memory win.
inplace_predict for serving
No DMatrix per request; thread-safe.
save_binary() heavy DMatrices
Parse once, reload instantly across experiments.
lower max_bin before lowering data
128 bins often matches 256 accuracy at less memory.
external memory before downsampling
ExtMemQuantileDMatrix keeps all rows when RAM is short.
deep trees eat memory [caveat]
Memory grows fast with max_depth — the docs warn beyond ~12.

★ Most-Used Defaults (at a glance)

max_depth=6 · eta=0.3 · min_child_weight=1
Usually the first three to tune.
subsample=1 · colsample_bytree=1 · max_bin=256
Lower the first two for regularization.
gamma=0 · reg_lambda=1 · reg_alpha=0
Raise to fight overfitting.
n_estimators=100 (sklearn) vs num_boost_round=10 (native)
Watch out — the two APIs default differently.

★ objective → default eval_metric (quick lookup)

binary:logistic → logloss
reg:squarederror → rmse
multi:softmax / softprob → mlogloss
rank:ndcg → ndcg · rank:map → map
count:poisson → poisson-nloglik
survival:cox → cox-nloglik · survival:aft → aft-nloglik

★ Version Milestones (what changed when)

2.0 (2023)
device= param, hist default, multi-target trees, constructor-based sklearn early stopping.
3.0 (Feb 2025)
ExtMemQuantileDMatrix, distributed external memory, Polars input, bst.reset().
3.1 (Sep 2025)
Categorical re-coder (+strings), categorical support no longer experimental, gpu_id/gpu_hist removed, vector intercept.
3.2 (Feb 2026)
Vector-leaf multi-target expansion, CUDA async memory pool, adaptive external-memory cache, CLI removed.


Data in — containers, formats, categories, missing values

Configure — booster types, tree shape, sampling, constraints

Objectives, metrics & custom losses

Train — native loop, sklearn wrappers, stopping, CV, continuation

Predict, explain, inspect & persist

Scale out & best practices

The math under the hood — from loss to split, in six steps

1 · the model is a sum of trees

2 · the regularized objective

3 · second-order Taylor expansion

4 · optimal leaf weight, closed form

5 · the split gain

6 · margin → prediction (the link)
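
The six step titles above correspond to the standard derivation from the XGBoost paper. The formulas below are a reconstruction under that assumption, not text recovered from this sheet. They ignore the L1 term (alpha) and use the usual symbols: g_i and h_i are the first and second derivatives of the loss at the previous round's prediction, G and H are their sums over a leaf, T is the number of leaves, and w_j is the value of leaf j.

```latex
% 1. The model is a sum of K trees
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)

% 2. The regularized objective
\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

% 3. Second-order Taylor expansion at round t (constant terms dropped)
\mathcal{L}^{(t)} \approx \sum_i \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l(y_i, \hat{y}_i^{(t-1)})}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial\, (\hat{y}_i^{(t-1)})^2}

% 4. Optimal leaf weight, with G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i
w_j^{*} = -\frac{G_j}{H_j + \lambda}

% 5. The split gain
\mathrm{Gain} = \tfrac{1}{2} \left[
  \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma

% 6. Margin to prediction, shown for binary:logistic
p_i = \sigma(\hat{y}_i) = \frac{1}{1 + e^{-\hat{y}_i}}
```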

Four levers, one goal: generalize [figure panels: eta (shrinkage) · subsample / colsample · max_depth / min_child_weight · gamma / lambda / alpha]

Mental models & caveats, visually [figure panels: histogram binning (why hist is fast) · sparsity-aware default direction · early stopping keeps the LAST model · importance types disagree]

Worth memorizing