Scikit-learn Cheat Sheet v2

01 Setup, Import & Config [load & split]

pip install scikit-learn
Install (imports as sklearn); conda: conda install -c conda-forge scikit-learn.
import sklearn; sklearn.show_versions()
Version + full dependency report for bug reports.
from sklearn.model_selection import train_test_split★
Hold-out split helper; one of the most commonly imported functions in sklearn.
sklearn.set_config(display='diagram')
Rich HTML repr of pipelines in notebooks (default). 1.9 adds a fitted-attributes table.
sklearn.config_context(assume_finite=True)
Skip NaN/inf validation inside the block — faster, riskier.
random_state=42★
Set anywhere randomness appears (splits, shuffles, stochastic models).

02 Loading Data [load & split]

load_iris(return_X_y=True)★
Built-in toy datasets (sklearn.datasets); as_frame=True for DataFrames.
fetch_california_housing() fetch_20newsgroups()
Larger real-world datasets, downloaded on first use.
fetch_openml('titanic', version=1, as_frame=True)★
Any of thousands of OpenML datasets by name or data_id.
make_classification(n_samples=200, n_features=4)★
Synthetic classification data; weights=[.9,.1] for imbalance.
make_regression() make_blobs() make_moons()
Synthetic regression / clustering / non-linear demo data.

03 Splitting & CV Splitters [load & split]

train_test_split(X, y, test_size=0.2, random_state=42)★
One-shot hold-out split.
train_test_split(..., stratify=y)★
Preserve class proportions — always for classification.
KFold(n_splits=5, shuffle=True) StratifiedKFold(...)★
Plain / class-balanced folds. Classifiers get stratified folds by default in CV helpers.
GroupKFold(...) StratifiedGroupKFold(...)
Keep all rows of a group (patient, user, site) in one fold — prevents group leakage.
TimeSeriesSplit(n_splits=5, gap=0)★
Expanding-window splits; train always precedes test.
ShuffleSplit() RepeatedStratifiedKFold()
Random resampling / repeated k-fold for tighter score estimates.
shuffled CV on time series [leakage]
Future rows leak into training — use TimeSeriesSplit.
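A minimal sketch of the two splitters most often confused, on made-up toy arrays (the data and sizes are illustrative assumptions): a stratified hold-out split, and TimeSeriesSplit, where every training index precedes every test index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Toy data: 16 time-ordered rows with a 3:1 class imbalance (illustrative only).
X = np.arange(32).reshape(16, 2)
y = np.array([0, 0, 0, 1] * 4)

# Stratified hold-out: the 3:1 class ratio is preserved in the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Expanding-window splits: the training window grows, the test block moves forward.
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Shuffled KFold on the same rows would mix later rows into the training folds, which is the leakage the card warns about.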

04 The Estimator API [fit & inspect]

model = Estimator(**params)★
Every algorithm is a class configured by hyperparameters.
model.fit(X_train, y_train)★
Learn from training data — the universal verb. Returns self, so calls chain.
model.predict(X_test)★
Labels / values for new data.
model.predict_proba(X) model.decision_function(X)
Class probabilities / raw margin scores (classifiers).
model.score(X_test, y_test)★
Built-in default metric: accuracy (classifiers), R² (regressors).
transformer.fit_transform(X_train)
Fit + transform in one call — training data only.
model.get_params() model.set_params(C=1.0)
Read / change hyperparameters; works on nested step__param too.
fitted attributes end with an underscore: coef_
Anything learned from data is suffixed _; params are not.
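The calls above fit together roughly as follows. This is a sketch on the built-in iris data; the choice of LogisticRegression is just an example, and any classifier follows the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters go in the constructor; nothing is learned yet.
model = LogisticRegression(C=1.0, max_iter=1000)

# fit() returns the estimator itself, so calls can be chained.
y_pred = model.fit(X_train, y_train).predict(X_test)

accuracy = model.score(X_test, y_test)  # default metric for classifiers
proba = model.predict_proba(X_test)     # one column per class, rows sum to 1

# Hyperparameters (no underscore) vs. learned state (trailing underscore).
print(model.get_params()["C"], model.coef_.shape, model.classes_)
```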

05 Preprocessing: Scaling [preprocess]

StandardScaler()★
Zero mean, unit variance — the default choice.
MinMaxScaler(feature_range=(0,1))
Squash into a fixed range; sensitive to outliers.
RobustScaler()★
Median/IQR based — resistant to outliers.
MaxAbsScaler()
Scale by max |value|; preserves sparsity (no centering).
Normalizer()
Scale each row to unit norm (not each column) — e.g. text vectors.
scaler.fit(X_train) → .transform(X_test) [fit once]
Fit only on train — never refit on test.
scale for distance/gradient models; trees don't care
KNN, SVM, linear models, PCA, MLP need it; trees/forests/HGB don't.
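A short sketch of the fit-once rule, using synthetic data (the distribution parameters are arbitrary assumptions): the scaler learns its statistics from the training rows and reuses them on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean_ / scale_ from train only
X_test_s = scaler.transform(X_test)        # reuse the training statistics

print(X_train_s.mean(axis=0).round(6))  # zero up to floating-point error
print(X_test_s.mean(axis=0).round(2))   # near zero, but not exactly
```

The test means are not exactly zero, and that is expected: the test set is measured with the training set's ruler.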

06 Non-linear Transforms & Discretization [preprocess]

PowerTransformer(method='yeo-johnson')★
Make skewed features Gaussian-like; 'box-cox' needs positive data.
QuantileTransformer(output_distribution='normal')
Rank-based map to uniform/normal — crushes outliers.
KBinsDiscretizer(n_bins=5, strategy='quantile')
Bin continuous features; strategies: uniform / quantile / kmeans.
SplineTransformer(degree=3, n_knots=5)
B-spline basis — smooth non-linearity for linear models.
PolynomialFeatures(degree=2, interaction_only=False)
Polynomial & interaction terms; feature count explodes fast.
FunctionTransformer(np.log1p)
Wrap any function as a pipeline-compatible transformer.

07 Encoding Categorical Data [preprocess]

OneHotEncoder(handle_unknown='ignore')★
One binary column per category; unseen categories become all-zeros. 'warn' option since 1.6.
OneHotEncoder(min_frequency=10, max_categories=20)
Group rare levels into one "infrequent" bucket.
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
Integer codes — for genuinely ordered categories, or as tree-model input.
TargetEncoder()★ [1.3+]
Mean-target encoding; fit_transform uses internal cross-fitting to limit leakage, so fit(X, y).transform(X) is not equivalent. Best for high-cardinality features.
LabelEncoder on feature columns [misuse]
LabelEncoder is for the target y only; use OrdinalEncoder for X.
pd.get_dummies() — quick one-hot outside a Pipeline
Fine for exploration; use OneHotEncoder inside pipelines to handle unseen categories.
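To make the unseen-category behaviour concrete, here is a small sketch with made-up colour values ("purple" is deliberately absent from the training data):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X_train = np.array([["red"], ["green"], ["blue"]], dtype=object)
X_new = np.array([["green"], ["purple"]], dtype=object)  # "purple" is unseen

# Categories are sorted at fit time: blue, green, red.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X_train)
onehot = ohe.transform(X_new).toarray()  # unseen row becomes all zeros

ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(X_train)
codes = ordinal.transform(X_new)  # unseen row becomes -1

print(onehot)
print(codes.ravel())
```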

08 Missing Values [preprocess]

SimpleImputer(strategy='median')★
mean / median / most_frequent / constant fill.
KNNImputer(n_neighbors=5)
Fill using similar rows' values; scale features first.
IterativeImputer()
Models each feature from the others (MICE-style); still experimental — needs enable_iterative_imputer import.
SimpleImputer(add_indicator=True) MissingIndicator()
Add binary "was missing" columns — missingness is often informative.
HistGradientBoosting & trees handle NaN natively
HGB (always) and DecisionTree/RandomForest (recent versions) can skip imputation entirely.
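A sketch of median imputation with missingness indicators on a tiny hand-made matrix (values chosen only so the medians are easy to check by eye):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
    [100.0, np.nan],
])

# Column medians ignore NaN: 3.0 for column 0, 20.0 for column 1.
# add_indicator appends one 0/1 column per feature that had missing values.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imp = imputer.fit_transform(X)

print(X_imp)
```

The median leaves the outlier 100.0 without influence on the fill value, which is why it is often preferred over the mean.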

09 Feature Selection [select & tune]

VarianceThreshold(threshold=0.0)
Drop near-constant, uninformative features first.
SelectKBest(f_classif, k=10)★
Univariate scoring; also mutual_info_classif, chi2 (non-negative X).
SelectFromModel(Lasso(alpha=.01))★
Keep features an L1 / tree model rates important.
RFE(estimator, n_features_to_select=5) RFECV(...)
Recursively drop the weakest feature; RFECV picks the count by CV.
SequentialFeatureSelector(model, direction='forward')
Greedy forward/backward selection by CV score — slow but model-agnostic.
selecting features on the full dataset [leakage]
Selection is learning — do it inside the Pipeline/CV.

10 Text Feature Extraction [preprocess]

CountVectorizer(ngram_range=(1,2), min_df=2)★
Token counts → sparse matrix; fit learns the vocabulary.
TfidfVectorizer(stop_words='english')★
Counts re-weighted by inverse document frequency — the text baseline.
HashingVectorizer(n_features=2**20)
Stateless (no fit needed) — streaming / out-of-core text.
DictVectorizer()
List-of-dicts → feature matrix.
pair TfidfVectorizer with LinearSVC / MultinomialNB / SGD
Linear models shine on high-dimensional sparse text.

11 Pipelines & ColumnTransformer [estimators / models]

Pipeline([('scale', StandardScaler()), ('clf', model)])★
Chain preprocessing + model into one estimator; prevents leakage by construction.
make_pipeline(SimpleImputer(), StandardScaler(), model)
Same thing with auto-named steps.
ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])★
Different preprocessing per column group; remainder='passthrough' keeps the rest.
make_column_selector(dtype_include=object)★
Select columns by dtype or regex instead of hard-coded lists.
FeatureUnion([('pca', PCA()), ('kbest', SelectKBest())])
Concatenate outputs of parallel transformers.
pipe.named_steps['clf'] pipe[-1] pipe[:-1]
Index into steps; slicing returns a sub-pipeline.
param_grid = {'clf__C': [...]}★
Tune nested params with step__param syntax.
Pipeline(..., memory='./cache')
Cache fitted transformers across grid-search candidates.

12 DataFrame In / DataFrame Out [preprocess]

transformer.set_output(transform='pandas')★
Transformers return DataFrames with named columns instead of bare arrays.
set_output(transform='polars') [1.4+]
Polars output; 1.9 uses narwhals internally for broader dataframe interop.
sklearn.set_config(transform_output='pandas')
Same, globally for every transformer.
pipe[:-1].get_feature_names_out()★
Post-transform feature names — align coefficients/importances to columns.
model.feature_names_in_
Column names seen at fit when X was a DataFrame; predict-time names must match.

13 Linear Classifiers [estimators / models]

LogisticRegression(C=1.0, max_iter=1000)★
Fast, calibrated-ish, interpretable baseline; smaller C = stronger regularization.
LogisticRegression(penalty='l1', solver='liblinear')
Sparse coefficients — built-in feature selection.
LinearSVC()
Linear SVM, scales to wide sparse data (text).
SGDClassifier(loss='log_loss')
Any linear model via SGD — huge data, supports partial_fit.
RidgeClassifier() Perceptron()
Least-squares / classic online linear classifiers.
always scale features for linear models
Regularization assumes comparable feature scales.

14 Linear, Robust & GLM Regression [estimators / models]

LinearRegression()★
Ordinary least squares baseline.
Ridge(alpha=1.0) Lasso() ElasticNet()★
L2 / L1 / both; RidgeCV, LassoCV tune alpha internally.
HuberRegressor() RANSACRegressor() TheilSenRegressor()
Robust to outliers: soft down-weighting / inlier consensus / median-of-slopes.
QuantileRegressor(quantile=0.9)
Predict conditional quantiles — pinball loss.
PoissonRegressor() GammaRegressor() TweedieRegressor()
GLMs for counts, positive-skew, and insurance-style targets.
SVR(kernel='rbf') KernelRidge()
Kernel non-linear regression; KernelRidge fits faster, SVR predicts faster.

15 Trees & HistGradientBoosting [estimators / models]

HistGradientBoostingClassifier() / ...Regressor()★
The modern default for tabular data — LightGBM-style, fast on 10k+ rows.
HistGradientBoosting...(categorical_features='from_dtype')
Native categorical splits from pandas category dtype — no one-hot needed.
HistGradientBoosting...(early_stopping=True)
Auto validation-based stopping; handles NaN natively too.
HistGradientBoosting...(monotonic_cst={'price': 1})
Force monotone feature-response relationships.
RandomForestClassifier(n_estimators=300, oob_score=True)★
Bagged trees — strong low-tuning default; OOB is a free validation score.
ExtraTreesClassifier() DecisionTreeClassifier(ccp_alpha=.01)
Extra-random ensembles; single trees need pruning (ccp_alpha).
legacy GradientBoosting* — prefer Hist* except tiny data
Old GBM is exact but far slower; Hist* bins features (≤255 bins).

16 Ensemble Meta-estimators [estimators / models]

VotingClassifier(estimators=[...], voting='soft')★
Combine models by (probability-weighted) vote.
StackingClassifier(estimators=[...], final_estimator=LogisticRegression())
Meta-model learns from base models' CV predictions.
BaggingClassifier(estimator, n_estimators=10)
Any model on bootstrap samples — variance reduction.
AdaBoostClassifier()
Sequentially reweights hard examples.
XGBoost / LightGBM / CatBoost use the sklearn API [ext]
Drop into Pipelines, GridSearchCV, and cross_val_score unchanged.

17 SVM, KNN, Naive Bayes & MLP [estimators / models]

SVC(kernel='rbf', C=1, gamma='scale')★
Max-margin, non-linear boundaries; fit time grows roughly between O(n²) and O(n³) in the number of rows, so avoid on >20k rows.
SVC(probability=True)
Enables predict_proba via internal CV — slower fit.
KNeighborsClassifier(n_neighbors=5)
Vote among k nearest points — scale features first.
GaussianNB() MultinomialNB() ComplementNB()
Probabilistic baselines: continuous / counts (text) / imbalanced text.
MLPClassifier(hidden_layer_sizes=(100,), early_stopping=True)
Small feed-forward nets; for deep learning use PyTorch/JAX instead.
LinearDiscriminantAnalysis() QuadraticDiscriminantAnalysis()
Gaussian class-conditional classifiers; LDA doubles as supervised projection.

18 Clustering [estimators / models]

KMeans(n_clusters=3, n_init='auto')★
Fast, spherical clusters, k chosen in advance; scale features.
MiniBatchKMeans() BisectingKMeans()
KMeans for big data / hierarchical top-down variant.
HDBSCAN(min_cluster_size=5)★ [1.3+]
Density-based, variable-density clusters + noise labels (-1); no eps to tune.
DBSCAN(eps=0.5, min_samples=5) OPTICS()
Classic density clustering; OPTICS sweeps eps ranges.
AgglomerativeClustering(linkage='ward')
Hierarchical bottom-up merging; dendrogram-friendly.
SpectralClustering() Birch() MeanShift()
Graph-based / streaming / mode-seeking alternatives.
GaussianMixture(n_components=3).bic(X)
Soft (probabilistic) clustering; pick components by lowest BIC/AIC.

19 Dimensionality Reduction [estimators / models]

PCA(n_components=0.95)★
Float = keep 95% of variance; int = exact count. Check explained_variance_ratio_.
TruncatedSVD(n_components=100)
PCA-like for sparse matrices (LSA on tf-idf).
IncrementalPCA(batch_size=1000)
PCA in mini-batches — larger-than-memory data.
KernelPCA(kernel='rbf')
Non-linear PCA via the kernel trick.
NMF(n_components=10)
Non-negative parts-based factorization — topics, spectra.
LinearDiscriminantAnalysis(n_components=2)
Supervised projection that maximizes class separation.
fitting PCA before the train/test split [leakage]
PCA learns from data — fit on train only, inside the Pipeline.
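A sketch of variance-targeted PCA fitted on the training rows only, using the built-in digits data; scaling before PCA is an assumption made here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# A float n_components keeps the fewest components reaching that variance share.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
Z_train = reducer.fit_transform(X_train)  # statistics learned from train only
Z_test = reducer.transform(X_test)        # same projection applied to test

pca = reducer[-1]
kept = pca.explained_variance_ratio_.sum()
print(pca.n_components_, "components keep", round(kept, 3), "of the variance")
```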

20 Manifold Learning (Visualization) [estimators / models]

TSNE(n_components=2, perplexity=30)★
2D/3D visualization only — no transform for new data; distances between clusters aren't meaningful.
Isomap() LocallyLinearEmbedding() MDS()
Geodesic / local-linear / distance-preserving embeddings.
SpectralEmbedding()
Graph-Laplacian embedding.
umap-learn: UMAP(n_neighbors=15) [ext]
Faster than t-SNE, supports transform on new data — sklearn-compatible API.
PCA to ~50 dims first, then t-SNE/UMAP
Standard recipe: denoise + speed up the manifold step.

21 Outlier & Novelty Detection [estimators / models]

IsolationForest(contamination=0.05)★
Random-split isolation — fast, high-dimensional-friendly; predicts +1/-1.
LocalOutlierFactor(n_neighbors=20)
Local density comparison; novelty=True to score unseen data.
OneClassSVM(nu=0.05)
Learn the boundary of "normal" — novelty detection.
EllipticEnvelope()
Robust Gaussian fit — assumes elliptical normal data.
est.decision_function(X) est.score_samples(X)
Continuous anomaly scores instead of hard labels.

22 Probability Calibration [select & tune]

CalibratedClassifierCV(model, method='isotonic', cv=5)★
Make predict_proba honest; 'sigmoid' (Platt) for small data, isotonic for large.
CalibratedClassifierCV(model, method='temperature') [1.8+]
Temperature scaling — single-parameter, preserves accuracy/ranking.
calibration_curve(y_test, proba, n_bins=10)
Reliability diagram data: predicted vs. actual frequency.
brier_score_loss(y_test, proba)
Proper scoring rule for probability quality (lower = better).
calibrate when probabilities drive decisions
SVMs, boosted trees and forests are often over/under-confident.

23 Decision-Threshold Tuning [select & tune]

TunedThresholdClassifierCV(model, scoring='f1')★ [1.5+]
Post-tune the 0.5 cut-off by internal CV to maximize any metric; the clean fix for cost-sensitive / imbalanced problems.
FixedThresholdClassifier(model, threshold=0.2) [1.5+]
Set a business-chosen threshold explicitly.
FrozenEstimator(fitted_model) [1.6+]
Wrap an already-fitted model so meta-estimators won't refit it.
tuned.best_threshold_ tuned.best_score_
The chosen cut-off and its CV score.
tuning the threshold on training data [overfit]
With cv='prefit', tune on held-out data only.

24 Cross-Validation Functions [select & tune]

cross_val_score(pipe, X, y, cv=5, scoring='f1')★
k-fold scores in one call — pass the Pipeline, not the bare model.
cross_validate(pipe, X, y, scoring=['f1','roc_auc'], return_train_score=True)
Multiple metrics + fit/score times; train-vs-test gap reveals overfitting.
cross_val_predict(pipe, X, y, cv=5)
Out-of-fold predictions — honest confusion matrices & stacking inputs.
learning_curve(pipe, X, y)
Score vs. training size — "would more data help?"
validation_curve(pipe, X, y, param_name='clf__C', param_range=...)
Score vs. one hyperparameter — under/overfit sweep.
permutation_test_score(pipe, X, y)
p-value: is the score better than label-shuffled chance?
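As a sketch of the multi-metric helper, the snippet below cross-validates a scaled logistic regression on the built-in breast-cancer data and compares train and test F1; the model choice is only an example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Pass the whole pipeline so the scaler is refit inside every fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

results = cross_validate(
    pipe, X, y, cv=5, scoring=["f1", "roc_auc"], return_train_score=True
)

# A large train-minus-test gap is a sign of overfitting.
gap = results["train_f1"].mean() - results["test_f1"].mean()
print(sorted(results.keys()))
print("mean test ROC-AUC:", results["test_roc_auc"].mean().round(3), "F1 gap:", gap.round(3))
```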

25 Hyperparameter Search [select & tune]

GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)★
Exhaustive search; refits the best config on all training data.
RandomizedSearchCV(pipe, param_distributions, n_iter=50)★
Sampled search — use scipy.stats.loguniform(1e-3, 1e2) for scale params.
HalvingGridSearchCV / HalvingRandomSearchCV
Successive halving: eliminate losers early on small budgets (experimental import).
search.best_params_ .best_score_ .best_estimator_★
Winning config, its CV score, and the refit model.
pd.DataFrame(search.cv_results_)
Full per-candidate results table for analysis.
reporting best_score_ as the final performance [optimistic]
Selection bias — confirm on a held-out test set or nested CV.
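A sketch of a grid search that keeps a test set aside for the final number. The grid values are arbitrary assumptions for illustration, not recommended defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Keys use step__param; 3 x 2 = 6 candidates, each scored by 5-fold CV.
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

# best_score_ was used to pick the winner, so it is optimistic;
# the untouched test set gives the figure to report.
test_score = search.score(X_test, y_test)
print(search.best_params_, round(search.best_score_, 3), round(test_score, 3))
```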

26 Classification Metrics [evaluate]

accuracy_score(y_test, y_pred)★
Fraction correct — misleading if imbalanced.
precision_score / recall_score / f1_score(..., average='macro')★
Per-class quality; average = binary / micro / macro / weighted.
balanced_accuracy_score() matthews_corrcoef()
Imbalance-robust single numbers; MCC uses the whole confusion matrix.
confusion_matrix(y_test, y_pred) classification_report(...)★
Counts per true/predicted class; report = P/R/F1 for every class.
roc_auc_score(y_test, y_scores)★
Ranking quality across thresholds; multi_class='ovr' for multiclass.
average_precision_score(y_test, y_scores)
PR-AUC — preferred over ROC-AUC on heavy imbalance.
log_loss(y_test, proba) brier_score_loss(...)
Probability-quality metrics (lower = better).
cohen_kappa_score() class_likelihood_ratios()
Chance-corrected agreement; LR+/LR− for diagnostic tests.

27 Regression Metrics [evaluate]

root_mean_squared_error(y_test, y_pred)★ [1.4+]
Same units as target; replaces mean_squared_error(squared=False).
mean_absolute_error(y_test, y_pred)★
Average absolute error; less sensitive to outliers than squared-error metrics.
r2_score(y_test, y_pred)★
Variance explained; can be negative for bad models.
mean_absolute_percentage_error(...) median_absolute_error(...)
Relative error (beware near-zero targets) / outlier-immune median.
mean_pinball_loss(y, pred, alpha=0.9)
Quantile-regression loss.
d2_absolute_error_score() mean_poisson_deviance()
R²-style generalizations / GLM deviances for count targets.

28 Clustering Metrics [evaluate]

silhouette_score(X, labels)★
Cohesion vs. separation, [-1, 1] — no ground truth needed.
davies_bouldin_score(X, labels) calinski_harabasz_score(...)
Internal indices: lower DB / higher CH = better.
adjusted_rand_score(y_true, labels)
Agreement with known labels, chance-corrected.
adjusted_mutual_info_score() v_measure_score()
Information-theoretic agreement; V = harmonic mean of homogeneity and completeness.
elbow (inertia) + silhouette to choose k
Plot both across k; agreement is a good sign.
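For the label-based scores, a small check on hand-made labelings (the label vectors are arbitrary examples) shows how V-measure relates to its two components: it is their harmonic mean.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels = [0, 0, 1, 1, 1, 1, 2, 2, 0]  # an imperfect clustering

h = homogeneity_score(y_true, labels)    # each cluster contains one class?
c = completeness_score(y_true, labels)   # each class sits in one cluster?
v = v_measure_score(y_true, labels)

# With the default beta=1, V-measure is the harmonic mean of h and c.
harmonic = 2 * h * c / (h + c)

ari = adjusted_rand_score(y_true, labels)
print(round(h, 3), round(c, 3), round(v, 3), round(ari, 3))
```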

29 Scorers & the scoring Parameter [evaluate]

scoring='f1_macro' 'neg_root_mean_squared_error'★
String scorers for CV/search; error metrics are negated so greater = better.
sklearn.metrics.get_scorer_names()
List every built-in scoring string.
make_scorer(fbeta_score, beta=2)
Wrap any metric (plus kwargs) into a scorer.
make_scorer(my_loss, greater_is_better=False)
Custom losses get auto-negated for maximization.
scoring={'f1': 'f1', 'auc': 'roc_auc'}, refit='f1'
Multi-metric search — pick which one selects the winner.
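The sign convention is easy to verify. In this sketch, my_loss is a hypothetical custom loss that simply wraps mean absolute error, so its negated scorer should match the built-in 'neg_mean_absolute_error' string.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)


def my_loss(y_true, y_pred):
    """Hypothetical custom loss: plain mean absolute error."""
    return mean_absolute_error(y_true, y_pred)


# greater_is_better=False makes the scorer return the negated loss.
loss_scorer = make_scorer(my_loss, greater_is_better=False)

custom = cross_val_score(Ridge(), X, y, cv=5, scoring=loss_scorer)
builtin = cross_val_score(Ridge(), X, y, cv=5, scoring="neg_mean_absolute_error")

print(custom.round(2))
print(np.allclose(custom, builtin))
```

Flip the sign back when reporting the error to a human reader.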

30 Plotting: the Display API [evaluate]

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)★
One-line confusion-matrix plot; normalize='true' for rates.
RocCurveDisplay.from_predictions(y_test, y_scores)★
ROC from stored scores; most metric Displays offer both from_estimator and from_predictions (some, such as DecisionBoundaryDisplay, only from_estimator).
PrecisionRecallDisplay DetCurveDisplay CalibrationDisplay
PR curve, detection-error tradeoff, reliability diagram.
LearningCurveDisplay ValidationCurveDisplay
Plot learning/validation curves without manual matplotlib.
PredictionErrorDisplay.from_estimator(reg, X, y)
Predicted-vs-actual and residual plots for regression.
DecisionBoundaryDisplay.from_estimator(clf, X)
2D decision-surface plot.

31 Model Inspection & Explainability [fit & inspect]

permutation_importance(model, X_test, y_test, n_repeats=10)★
Model-agnostic importance on held-out data — the trustworthy default.
PartialDependenceDisplay.from_estimator(model, X, ['age'])
Average feature-response curves; kind='individual' adds ICE lines.
trusting feature_importances_ blindly [bias]
Impurity (MDI) importance inflates high-cardinality features and uses train data only.
coef_ on scaled features ≈ comparable effect sizes
Unscaled coefficients are not comparable across features.
shap.TreeExplainer(model) [ext]
Per-prediction attributions for tree models — the SHAP library.
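A sketch of permutation importance on held-out data. The synthetic dataset is built with shuffle=False so that, by construction, only the first two columns carry signal; that setup is an assumption made to give the example a known answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Columns 0 and 1 are informative; columns 2 to 5 are pure noise.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle one column at a time on the test set and measure the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```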

32 Imbalanced Classification Playbook [select & tune]

class_weight='balanced'★
Reweight classes inversely to frequency; first thing to try.
model.fit(X, y, sample_weight=w)
Per-row weights for cost-sensitive learning; supported by most estimators.
TunedThresholdClassifierCV(model, scoring='f1') [1.5+]
Move the decision threshold instead of distorting the data.
imblearn: SMOTE().fit_resample(X_train, y_train) [ext]
Synthetic minority oversampling (imbalanced-learn); resample training folds only (use imblearn's own Pipeline).
resampling before the split / inside plain Pipeline [leakage]
Synthetic points contaminate validation folds.
evaluate with PR-AUC, F1, balanced accuracy — not accuracy
Stratify every split.

33 Multiclass, Multilabel & Multioutput [estimators / models]

OneVsRestClassifier(est) OneVsOneClassifier(est)
Force a binary strategy; most sklearn classifiers are natively multiclass already.
MultiOutputClassifier(est) MultiOutputRegressor(est)
One independent model per target column.
ClassifierChain(est) RegressorChain(est)
Feed earlier targets into later ones — captures label correlations.
TransformedTargetRegressor(reg, func=np.log1p, inverse_func=np.expm1)
Train on transformed y, predict back in original units.
MultiLabelBinarizer() LabelBinarizer()
Label sets / labels ↔ indicator matrices.
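A sketch of TransformedTargetRegressor on synthetic data whose target is, by assumption, linear in log1p space with slope 0.8; the wrapper fits in that space and returns predictions in original units.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(200, 1))
# Skewed target: log1p(y) = 0.8 * x + small noise (illustrative assumption).
y = np.expm1(0.8 * X[:, 0] + rng.normal(scale=0.1, size=200))

ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions afterwards
)
ttr.fit(X, y)
pred = ttr.predict(X)  # back in the original units of y

# The inner regressor was trained on log1p(y), so its slope is close to 0.8.
slope = float(np.ravel(ttr.regressor_.coef_)[0])
print(round(slope, 3))
```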

34 Semi-supervised Learning [estimators / models]

SelfTrainingClassifier(base_clf, threshold=0.75)
Pseudo-label confident unlabeled rows iteratively.
LabelPropagation() LabelSpreading()
Graph-based label diffusion; Spreading is noise-tolerant.
mark unlabeled samples with y = -1
The sklearn convention for semi-supervised targets.

35 Gaussian Processes & Kernel Approximation [estimators / models]

GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
Bayesian regression with composable kernels — small data.
gpr.predict(X, return_std=True)
Predictions with uncertainty — GP's superpower.
Nystroem(kernel='rbf', n_components=300)
Approximate kernel features + linear model ≈ fast kernel SVM.
RBFSampler(gamma=1.0)
Random Fourier features — same trick, cheaper.

36 Big Data & Out-of-Core Learning [estimators / models]

est.partial_fit(X_batch, y_batch, classes=all_classes)★
Incremental learning; pass the full class list on the first call.
SGDClassifier / SGDRegressor / PassiveAggressive*
Linear learners built for streaming batches.
MiniBatchKMeans IncrementalPCA MiniBatchNMF
Batched unsupervised counterparts.
HashingVectorizer + SGDClassifier
Classic out-of-core text pipeline — no vocabulary state.
X = np.load('X.npy', mmap_mode='r')
Memory-mapped arrays let estimators read larger-than-RAM data.

37 Performance, GPU & Callbacks [handle with care]

n_jobs=-1 [safe]
All CPU cores where supported (forests, KNN, CV, search).
set_config(array_api_dispatch=True) [1.8+]
Pass PyTorch / CuPy arrays to supported estimators so computation runs on GPU (experimental; estimator coverage varies by release).
model.set_callbacks(ProgressBar(), ScoringMonitor(scoring=...)) [1.9]
Progress bars & per-iteration metric logging (experimental, sklearn.callback).
from sklearnex import patch_sklearn; patch_sklearn() [ext]
Intel extension; can give large CPU speedups on the algorithms it accelerates, same API.
float32 input ≈ half the memory, often enough precision
Downcast with X.astype(np.float32) for big matrices.
free-threaded CPython supported [1.8+]
nogil builds let n_jobs use threads instead of processes.

38 Model Persistence & Deployment [load & split]

joblib.dump(pipe, 'model.joblib') joblib.load(...)★
Standard save/load; persist the whole Pipeline, not just the model.
skops.io.dump(pipe, 'model.skops') [ext]
Security-audited format; loads without executing arbitrary code.
loading pickles from untrusted sources [unsafe]
Pickle/joblib can execute arbitrary code on load; use skops for sharing.
loading across sklearn versions [fragile]
InconsistentVersionWarning; pin the training version in requirements.
skl2onnx.to_onnx(pipe, X[:1]) [ext]
Export to ONNX for cross-language, dependency-free inference.
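A sketch of a joblib round trip for a whole pipeline, written to a temporary directory so nothing is left behind; the file name is arbitrary.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(pipe, path)        # preprocessing and model saved together
    restored = joblib.load(path)   # only load files from sources you trust

same = np.array_equal(pipe.predict(X), restored.predict(X))
print(same)
```

Saving the pipeline rather than the bare model means the restored object applies the same scaling at predict time.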

39 Baselines & Sanity Checks [evaluate]

DummyClassifier(strategy='most_frequent')★
Majority-class baseline — beat this before celebrating.
DummyRegressor(strategy='mean')
Predict-the-mean baseline (R² = 0 on the training data by construction; about 0 or slightly negative on new data).
suspiciously perfect scores → hunt for leakage
Duplicated rows across splits, target-derived features, post-outcome features.
start simple: Dummy → linear → HistGradientBoosting
Each step must justify its complexity.
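A sketch of both baselines on synthetic data (the 90/10 class weights are an illustrative assumption): the majority-class classifier already scores near 0.9 accuracy on imbalanced labels, and the mean regressor scores R² of zero on the data it was fit on.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.dummy import DummyClassifier, DummyRegressor

# Roughly 90% of rows belong to class 0.
Xc, yc = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
baseline_acc = DummyClassifier(strategy="most_frequent").fit(Xc, yc).score(Xc, yc)

Xr, yr = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)
baseline_r2 = DummyRegressor(strategy="mean").fit(Xr, yr).score(Xr, yr)

print("majority-class accuracy:", round(baseline_acc, 3))
print("predict-the-mean R2 on training data:", round(baseline_r2, 6))
```

A real model that reports 0.9 accuracy on this classification task has therefore learned nothing yet.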

40 Metadata Routing [fit & inspect]

sklearn.set_config(enable_metadata_routing=True) [1.6]
Route sample_weight, groups etc. through meta-estimators; rollout largely complete in 1.6.
model.set_fit_request(sample_weight=True)
Declare which metadata an inner estimator consumes.
scorer.set_score_request(sample_weight=True)
Weighted scoring inside CV/search.
cross_val_score(pipe, X, y, params={'sample_weight': w})
Pass routed metadata via params once routing is on.

41 Version Highlights 1.3 → 1.9 [fit & inspect]

1.3 HDBSCAN, TargetEncoder, ValidationCurveDisplay
Plus sample_weight in KMeans init.
1.4 root_mean_squared_error, polars in set_output
Native categorical dtype in HGB; PCA on sparse data.
1.5 TunedThresholdClassifierCV, FixedThresholdClassifier
Post-hoc decision-threshold tuning arrives.
1.6 FrozenEstimator, metadata-routing rollout
Missing-value support in ExtraTrees; handle_unknown='warn'.
1.7 richer HTML repr with parameter values
Non-default params highlighted; copy-ready step__param names.
1.8 Array-API/GPU estimators, temperature scaling
Free-threaded CPython support.
1.9 callbacks (ProgressBar, ScoringMonitor), narwhals
Fitted-attributes in HTML repr; sparse_interface='sparray' config.

42 Top Gotchas [handle with care]

fit scaler/encoder/selector on all data [leakage]
Fit on train, transform both; or better, put it in the Pipeline.
preprocessing outside cross-validation [leakage]
CV must refit preprocessing per fold; Pipelines do this automatically.
fit_transform() on test data [leakage]
Test data gets transform() only.
imputing/oversampling before the split [leakage]
Test-set statistics bleed into training.
always set random_state [safe]
Reproducible splits, shuffles, and stochastic fits.
ConvergenceWarning ignored [unfit]
Raise max_iter and/or scale features; the model didn't finish learning.
predict columns in a different order than fit [silent]
Arrays are positional; DataFrames are checked against feature_names_in_.

★ Common Fitted Attributes [after .fit()]

model.coef_ model.intercept_★
Learned weights (linear models).
model.feature_importances_★
Impurity-based importance (trees/forests) — see MDI caveat, card 31.
model.classes_ model.n_features_in_ model.feature_names_in_
Label order, feature count, and column names seen at fit.
pca.explained_variance_ratio_ kmeans.cluster_centers_ kmeans.inertia_
Variance kept / centroids / within-cluster SSE.
forest.oob_score_ search.cv_results_
Out-of-bag validation score; full search results table.
check_is_fitted(model)
Raises if the estimator hasn't been fit (sklearn.utils.validation).

★ Which Metric, When? [quick-read]

balanced classification★
Accuracy is fine; add a confusion matrix anyway.
imbalanced classification★
PR-AUC / F1 / balanced accuracy; tune the threshold (card 23).
probability outputs matter
log_loss + Brier + calibration curve.
regression★
RMSE/MAE for error size, R² for variance explained; MAE if outliers.
clustering
Silhouette (no labels) / ARI & AMI (labels available).
ranking / retrieval
ROC-AUC, average precision, top_k_accuracy_score, ndcg_score.

★ Which Scaler / Encoder, When? [quick-read]

default numeric★
StandardScaler; RobustScaler if outliers; MinMax for bounded inputs (NN).
skewed numeric
PowerTransformer (yeo-johnson) or log via FunctionTransformer.
low-cardinality categorical★
OneHotEncoder(handle_unknown='ignore').
high-cardinality categorical
TargetEncoder (1.3+) or hashing; one-hot explodes width.
tree / HGB models
No scaling needed; OrdinalEncoder or native categorical dtype suffices.

The ML workflow, visually (figures not reproduced): train_test_split() ★; K-Fold Cross-Validation ★; Pipeline chaining ★; ColumnTransformer routing; Under- vs. overfitting ★; Tuning the decision threshold [1.5+].

Worth memorizing