Quick Reference v2 · machine learning in Python · single source of truth

scikit-learn cheat sheet v2

Every algorithm in scikit-learn shares one shape: an estimator with .fit() and .predict() (or .transform()), assembled from the same moving parts — preprocessing, a model, cross-validation, a metric — chained into a Pipeline. v2 covers the full modern surface: HistGradientBoosting, set_output DataFrames, TargetEncoder, threshold tuning, calibration, outlier detection, the Display plotting API, inspection, out-of-core learning, GPU/Array-API, callbacks, and safe persistence.
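
A minimal sketch of that shared shape follows. The iris dataset, StandardScaler and LogisticRegression are illustrative choices, not something this sheet prescribes, and everything is fit on the full dataset only to show the call pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit() learns per-column mean and std, transform() applies them.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Predictor: fit() learns coefficients, predict() outputs class labels.
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
labels = clf.predict(X_scaled)
```

In a real project the scaler would be fit on training rows only, which is what a Pipeline takes care of.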

Legend. Categories: load & split · preprocess · estimators / models · fit & inspect · select & tune · evaluate · gotcha. Badges: ★ = most common · 1.5+ = version-gated · ext = outside sklearn.

Targets scikit-learn 1.9 (June 2026); features new since 1.3 carry version badges. Distilled & cross-checked across: scikit-learn.org official docs, user guide, API reference, release highlights 1.4–1.9 & the "Choosing the right estimator" map · the official DataCamp PDF cheat sheet · ODSC's estimator-selection guide · skops & imbalanced-learn docs.

Choosing the right estimator — simplified from the official scikit-learn map
START: more than 50 samples? If no, get more data. If yes, what's the goal?

Classification (labeled): LogisticRegression / LinearSVC → KNeighbors / SVC (rbf) → HistGradientBoosting / Forest
Regression (quantity): Ridge / Lasso / ElasticNet → SVR (linear, then rbf) → HistGradientBoosting / Forest
Clustering (no labels): KMeans (k known) → HDBSCAN / DBSCAN → GaussianMixture (GMM)
Dimensionality Reduction: PCA (TruncatedSVD if sparse) → t-SNE (visualization) → Isomap / LLE / UMAP (ext)

Each → is a "try next" arrow: if an estimator underperforms, move on to the next one. Full map: scikit-learn.org/stable/machine_learning_map.html

01 · Setup, Import & Config · load & split
02 · Loading Data · load & split
03 · Splitting & CV Splitters · load & split
04 · The Estimator API · fit & inspect
05 · Preprocessing: Scaling · preprocess
06 · Non-linear Transforms & Discretization · preprocess
07 · Encoding Categorical Data · preprocess
08 · Missing Values · preprocess
09 · Feature Selection · select & tune
10 · Text Feature Extraction · preprocess
11 · Pipelines & ColumnTransformer · estimators / models
12 · DataFrame In / DataFrame Out · preprocess
13 · Linear Classifiers · estimators / models
14 · Linear, Robust & GLM Regression · estimators / models
15 · Trees & HistGradientBoosting · estimators / models
16 · Ensemble Meta-estimators · estimators / models
17 · SVM, KNN, Naive Bayes & MLP · estimators / models
18 · Clustering · estimators / models
19 · Dimensionality Reduction · estimators / models
20 · Manifold Learning (Visualization) · estimators / models
21 · Outlier & Novelty Detection · estimators / models
22 · Probability Calibration · select & tune
23 · Decision-Threshold Tuning · select & tune
24 · Cross-Validation Functions · select & tune
25 · Hyperparameter Search · select & tune
26 · Classification Metrics · evaluate
27 · Regression Metrics · evaluate
28 · Clustering Metrics · evaluate
29 · Scorers & the scoring Parameter · evaluate
30 · Plotting: the Display API · evaluate
31 · Model Inspection & Explainability · fit & inspect
32 · Imbalanced Classification Playbook · select & tune
33 · Multiclass, Multilabel & Multioutput · estimators / models
34 · Semi-supervised Learning · estimators / models
35 · Gaussian Processes & Kernel Approximation · estimators / models
36 · Big Data & Out-of-Core Learning · estimators / models
37 · Performance, GPU & Callbacks · handle with care
38 · Model Persistence & Deployment · load & split
39 · Baselines & Sanity Checks · evaluate
40 · Metadata Routing · fit & inspect
41 · Version Highlights 1.3 → 1.9 · fit & inspect
42 · Top Gotchas · handle with care
Common Fitted Attributes · after .fit()
Which Metric, When? · quick-read
Which Scaler / Encoder, When? · quick-read

The ML workflow, visually

Six visuals behind almost every scikit-learn project: the split, rotating CV folds, Pipeline chaining, per-column routing with ColumnTransformer, the complexity sweet spot, and the movable decision threshold.

train_test_split() ★

The dataset is split once — the model never sees the test rows until final evaluation.

[Figure: a single bar split into X_train, y_train (80%) and test (20%). Test rows are held out until the final .score() / metrics.]
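
The 80/20 split can be sketched as below. The iris data, stratify=y and random_state=0 are assumptions made for the example, not part of the figure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows; stratify keeps class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Everything that learns from data (scalers, encoders, the model) should see only X_train and y_train from here on.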

K-Fold Cross-Validation ★

Each row is one fold's split; the validation block (amber) rotates so every sample is validated on exactly once.

[Figure: five rows, fold 1 to fold 5; blue = train, amber = validation.] cross_val_score returns one score per fold (5 here); average them with .mean().
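
A small sketch of 5-fold cross-validation, assuming the iris data and a LogisticRegression purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# cv=5 on a classifier uses stratified 5-fold splitting.
# The result is an array with one score per fold.
scores = cross_val_score(clf, X, y, cv=5)
mean_score = scores.mean()
```

Reporting scores.std() alongside the mean gives a rough idea of how much the estimate moves between folds.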

Pipeline chaining ★

One .fit() call runs every step in order; the same chain applies identically at predict time.

[Figure: raw data → Imputer → Scaler → Model → pred.] Pipeline([('impute', ...), ('scale', ...), ('clf', ...)]) is refit correctly inside every cross-validation fold.
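
One possible version of that three-step chain is sketched here. SimpleImputer, StandardScaler and LogisticRegression stand in for the elided steps, and the missing values are injected artificially so the imputer has something to do.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = X.copy()
X[::10, 0] = np.nan  # artificial gaps, for illustration only

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold refits imputer, scaler and model on that fold's training rows only.
scores = cross_val_score(pipe, X, y, cv=5)

# A single fit() runs all three steps; predict() replays the same chain.
pipe.fit(X, y)
preds = pipe.predict(X[:5])
```

Because the preprocessing lives inside the estimator, no statistics from validation rows leak into the imputer or scaler.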

ColumnTransformer routing

Numeric and categorical columns take different preprocessing paths, then rejoin into one matrix for the model.

[Figure: DataFrame (num + cat) → num cols: impute + scale; cat cols: OneHotEncoder → hstack features → Model.] ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
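
The routing can be sketched as follows. The tiny DataFrame, its column names and the labels are made up for the example; num_pipe, cat_pipe, num_cols and cat_cols mirror the names in the snippet above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data, purely for illustration.
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 51.0, 46.0, 38.0],
    "income": [40.0, 55.0, 61.0, None, 90.0, 72.0],
    "city": ["A", "B", "A", "C", "B", "C"],
})
y = [0, 0, 1, 1, 1, 0]

num_cols = ["age", "income"]
cat_cols = ["city"]

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
cat_pipe = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

pre = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(df, y)

# The fitted preprocessor stacks 2 numeric and 3 one-hot columns side by side.
features = model.named_steps["pre"].transform(df)
preds = model.predict(df)
```

handle_unknown="ignore" is a defensive choice here so that a category unseen during fit does not raise at predict time.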

Under- vs. overfitting ★

Training error typically keeps falling as complexity grows, while validation error is U-shaped. The gap between them is what to watch.

[Figure: error (y-axis) against model complexity (x-axis); curves: train error and val error; regions from left to right: underfit, sweet spot, overfit.]
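
One way to trace these curves numerically is validation_curve. The sketch below assumes a synthetic dataset and uses a decision tree's max_depth as the complexity knob; both are choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

# Rows follow the depths, columns follow the CV folds.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

gap = train_mean - val_mean  # a widening gap signals overfitting
best_depth = depths[int(np.argmax(val_mean))]
```

These are accuracy scores rather than errors, so the picture is flipped: the sweet spot is where val_mean peaks.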

Tuning the decision threshold 1.5+

Sliding the cut-off along the score axis trades precision against recall — TunedThresholdClassifierCV picks the spot that maximizes your metric.

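[Figure: axis of predicted score / probability with a movable threshold separating class 0 (left) from class 1 (right); sliding the threshold left raises recall, sliding it right raises precision.]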

Worth memorizing

fit / transform / predict: fit learns, transform reshapes data, predict outputs labels
fit on train only: transform test/validation data with the already-fit object
Pipeline prevents leakage: preprocessing is refit correctly inside each CV fold
random_state everywhere: reproducible splits, shuffling, and stochastic models
scale for distance/gradient models: KNN, SVM, linear models & PCA need it; trees don't
HistGradientBoosting first: the modern tabular default, with NaN and categoricals handled natively
Grid vs Randomized search: exhaustive vs. sampled; randomized scales better on big grids
accuracy misleads on imbalance: PR-AUC / F1 + confusion matrix; tune the threshold (1.5+)
cross_val_score defaults: stratified k-fold automatically for classifiers
step__param syntax: tune anything nested inside a Pipeline or ColumnTransformer
set_output(transform='pandas'): named columns out of every transformer
beat DummyClassifier first: a model that can't beat majority-vote learned nothing
joblib for yourself, skops to share: pickles execute code on load, so never load untrusted files
error metrics are negated scorers: 'neg_root_mean_squared_error', so greater is always better
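
Two of these points, the step__param syntax and negated error scorers, can be seen together in a small grid search. The diabetes dataset, the step names and the alpha grid are assumptions made for this sketch.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])

# "<step name>__<parameter>" reaches the alpha of the step named "ridge".
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    pipe, param_grid, scoring="neg_root_mean_squared_error", cv=5
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
best_rmse = -search.best_score_  # flip the sign to recover the RMSE
```

The search maximizes the score, so the negated RMSE closest to zero wins; negating best_score_ gives the error back in the target's units.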