Quick Reference v2 · machine learning in Python · single source of truth
scikit-learn cheat sheet v2
Every algorithm in scikit-learn shares one shape: an estimator with .fit() and .predict() (or .transform()), assembled from the same moving parts — preprocessing, a model, cross-validation, a metric — chained into a Pipeline. v2 covers the full modern surface: HistGradientBoosting, set_output DataFrames, TargetEncoder, threshold tuning, calibration, outlier detection, the Display plotting API, inspection, out-of-core learning, GPU/Array-API, callbacks, and safe persistence.
Card categories: load & split · preprocess · estimators / models · fit & inspect · select & tune · evaluate · gotcha
Badges: ★ most common · 1.5+ version-gated · ext outside sklearn
Targets scikit-learn 1.9 (June 2026); features new since 1.3 carry version badges. Distilled & cross-checked across: scikit-learn.org official docs, user guide, API reference, release highlights 1.4–1.9 & the "Choosing the right estimator" map · the official DataCamp PDF cheat sheet · ODSC's estimator-selection guide · skops & imbalanced-learn docs.
01Setup, Import & Configload & split
pip install scikit-learnInstall (imports as sklearn); conda: conda install -c conda-forge scikit-learn.
import sklearn; sklearn.show_versions()Version + full dependency report for bug reports.
from sklearn.model_selection import train_test_split★The single most-imported function in sklearn.
sklearn.set_config(display='diagram')Rich HTML repr of pipelines in notebooks (default). 1.9 adds a fitted-attributes table.
sklearn.config_context(assume_finite=True)Skip NaN/inf validation inside the block — faster, riskier.
random_state=42★Set anywhere randomness appears (splits, shuffles, stochastic models).
02Loading Dataload & split
load_iris(return_X_y=True)★Built-in toy datasets (sklearn.datasets); as_frame=True for DataFrames.
fetch_california_housing() fetch_20newsgroups()Larger real-world datasets, downloaded on first use.
fetch_openml('titanic', version=1, as_frame=True)★Any of thousands of OpenML datasets by name or data_id.
make_classification(n_samples=200, n_features=4)★Synthetic classification data; weights=[.9,.1] for imbalance.
make_regression() make_blobs() make_moons()Synthetic regression / clustering / non-linear demo data.
03Splitting & CV Splittersload & split
train_test_split(X, y, test_size=0.2, random_state=42)★One-shot hold-out split.
train_test_split(..., stratify=y)★Preserve class proportions — always for classification.
KFold(n_splits=5, shuffle=True) StratifiedKFold(...) ★ · Plain / class-balanced folds. With an integer cv, the CV helpers use StratifiedKFold for classifiers and KFold otherwise.
GroupKFold(...) StratifiedGroupKFold(...)Keep all rows of a group (patient, user, site) in one fold — prevents group leakage.
TimeSeriesSplit(n_splits=5, gap=0)★Expanding-window splits; train always precedes test.
ShuffleSplit() RepeatedStratifiedKFold()Random resampling / repeated k-fold for tighter score estimates.
shuffled CV on time seriesleakageFuture rows leak into training — use TimeSeriesSplit.
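A minimal sketch of the splitters in this card, on toy arrays. The data and the `groups` labels are made up for illustration; only the sklearn calls come from the card.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit, train_test_split

# Toy data: 12 rows, an imbalanced binary target, 4 hypothetical groups of 3 rows.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)
groups = np.repeat([0, 1, 2, 3], 3)

# Stratified hold-out: class proportions are preserved in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Expanding-window splits: every training index precedes every test index.
ts_ok = all(
    train_idx.max() < test_idx.min()
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X)
)

# Group-aware folds: no group appears on both sides of a split.
group_ok = all(
    set(groups[train_idx]).isdisjoint(groups[test_idx])
    for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups)
)
```

Checking `ts_ok` and `group_ok` on your own data is a cheap guard against temporal and group leakage before any model is fit.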
04The Estimator APIfit & inspect
model = Estimator(**params)★Every algorithm is a class configured by hyperparameters.
model.fit(X_train, y_train)★Learn from training data — the universal verb. Returns self, so calls chain.
model.predict(X_test)★Labels / values for new data.
model.predict_proba(X) model.decision_function(X)Class probabilities / raw margin scores (classifiers).
model.score(X_test, y_test)★Built-in default metric: accuracy (classifiers), R² (regressors).
transformer.fit_transform(X_train)Fit + transform in one call — training data only.
model.get_params() model.set_params(C=1.0)Read / change hyperparameters; works on nested step__param too.
fitted attributes end with an underscore: coef_Anything learned from data is suffixed _; params are not.
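The estimator contract above can be sketched end to end on the iris toy dataset. The choice of LogisticRegression and the value C=0.5 are arbitrary illustrations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)  # hyperparameters go in the constructor
model.set_params(C=0.5)                    # ...or are changed later via set_params
fitted = model.fit(X_train, y_train)       # fit returns the estimator itself

labels = model.predict(X_test)             # hard class labels
proba = model.predict_proba(X_test)        # one probability column per class
accuracy = model.score(X_test, y_test)     # default metric: accuracy for classifiers
```

After fitting, learned state such as `model.coef_` and `model.classes_` carries the trailing underscore, while `model.get_params()` still lists only the constructor hyperparameters.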
05Preprocessing: Scalingpreprocess
StandardScaler()★Zero mean, unit variance — the default choice.
MinMaxScaler(feature_range=(0,1))Squash into a fixed range; sensitive to outliers.
RobustScaler()★Median/IQR based — resistant to outliers.
MaxAbsScaler()Scale by max |value|; preserves sparsity (no centering).
Normalizer()Scale each row to unit norm (not each column) — e.g. text vectors.
scaler.fit(X_train) → .transform(X_test)fit onceFit only on train — never refit on test.
scale for distance/gradient models; trees don't careKNN, SVM, linear models, PCA, MLP need it; trees/forests/HGB don't.
06Non-linear Transforms & Discretizationpreprocess
PowerTransformer(method='yeo-johnson')★Make skewed features Gaussian-like; 'box-cox' needs positive data.
QuantileTransformer(output_distribution='normal')Rank-based map to uniform/normal — crushes outliers.
KBinsDiscretizer(n_bins=5, strategy='quantile')Bin continuous features; strategies: uniform / quantile / kmeans.
SplineTransformer(degree=3, n_knots=5)B-spline basis — smooth non-linearity for linear models.
PolynomialFeatures(degree=2, interaction_only=False)Polynomial & interaction terms; feature count explodes fast.
FunctionTransformer(np.log1p)Wrap any function as a pipeline-compatible transformer.
07Encoding Categorical Datapreprocess
OneHotEncoder(handle_unknown='ignore')★One binary column per category; unseen categories become all-zeros. 'warn' option since 1.6.
OneHotEncoder(min_frequency=10, max_categories=20)Group rare levels into one "infrequent" bucket.
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)Integer codes — for genuinely ordered categories, or as tree-model input.
TargetEncoder()★1.3+Mean-target encoding with internal cross-fitting to prevent leakage — best for high-cardinality features.
LabelEncoder on feature columnsmisuseLabelEncoder is for the target y only; use OrdinalEncoder for X.
pd.get_dummies() — quick one-hot outside a PipelineFine for exploration; use OneHotEncoder inside pipelines to handle unseen categories.
08Missing Valuespreprocess
SimpleImputer(strategy='median')★mean / median / most_frequent / constant fill.
KNNImputer(n_neighbors=5)Fill using similar rows' values; scale features first.
IterativeImputer()Models each feature from the others (MICE-style); still experimental — needs enable_iterative_imputer import.
SimpleImputer(add_indicator=True) MissingIndicator()Add binary "was missing" columns — missingness is often informative.
HistGradientBoosting & trees handle NaN natively · HGB (always), DecisionTree (1.3+) and RandomForest (1.4+) can skip imputation entirely.
09Feature Selectionselect & tune
VarianceThreshold(threshold=0.0)Drop near-constant, uninformative features first.
SelectKBest(f_classif, k=10)★Univariate scoring; also mutual_info_classif, chi2 (non-negative X).
SelectFromModel(Lasso(alpha=.01))★Keep features an L1 / tree model rates important.
RFE(estimator, n_features_to_select=5) RFECV(...)Recursively drop the weakest feature; RFECV picks the count by CV.
SequentialFeatureSelector(model, direction='forward')Greedy forward/backward selection by CV score — slow but model-agnostic.
selecting features on the full datasetleakageSelection is learning — do it inside the Pipeline/CV.
10Text Feature Extractionpreprocess
CountVectorizer(ngram_range=(1,2), min_df=2)★Token counts → sparse matrix; fit learns the vocabulary.
TfidfVectorizer(stop_words='english')★Counts re-weighted by inverse document frequency — the text baseline.
HashingVectorizer(n_features=2**20)Stateless (no fit needed) — streaming / out-of-core text.
DictVectorizer()List-of-dicts → feature matrix.
pair TfidfVectorizer with LinearSVC / MultinomialNB / SGDLinear models shine on high-dimensional sparse text.
11Pipelines & ColumnTransformerestimators / models
Pipeline([('scale', StandardScaler()), ('clf', model)])★Chain preprocessing + model into one estimator; prevents leakage by construction.
make_pipeline(SimpleImputer(), StandardScaler(), model)Same thing with auto-named steps.
ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])★Different preprocessing per column group; remainder='passthrough' keeps the rest.
make_column_selector(dtype_include=object)★Select columns by dtype or regex instead of hard-coded lists.
FeatureUnion([('pca', PCA()), ('kbest', SelectKBest())])Concatenate outputs of parallel transformers.
pipe.named_steps['clf'] pipe[-1] pipe[:-1]Index into steps; slicing returns a sub-pipeline.
param_grid = {'clf__C': [...]}★Tune nested params with step__param syntax.
Pipeline(..., memory='./cache')Cache fitted transformers across grid-search candidates.
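Putting this card together: a sketch of a ColumnTransformer inside a Pipeline, cross-validated as one unit. The DataFrame, its column names (`age`, `income`, `city`) and the target rule are synthetic assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic table with two numeric columns, one categorical, and some NaNs.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
df.loc[df.index % 10 == 0, "age"] = np.nan
y = (df["income"] > 50_000).astype(int).to_numpy()

num_cols, cat_cols = ["age", "income"], ["city"]
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# The imputer, scaler and encoder are refit inside every fold.
scores = cross_val_score(pipe, df, y, cv=5, scoring="accuracy")

# Nested parameters are reachable with the step__param syntax.
pipe.set_params(clf__C=0.1)
```

Because preprocessing lives inside `pipe`, the same object can be passed to GridSearchCV with keys such as `clf__C` without any manual refitting.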
12DataFrame In / DataFrame Outpreprocess
transformer.set_output(transform='pandas')★Transformers return DataFrames with named columns instead of bare arrays.
set_output(transform='polars')1.4+Polars output; 1.9 uses narwhals internally for broader dataframe interop.
sklearn.set_config(transform_output='pandas')Same, globally for every transformer.
pipe[:-1].get_feature_names_out()★Post-transform feature names — align coefficients/importances to columns.
model.feature_names_in_Column names seen at fit when X was a DataFrame; predict-time names must match.
13Linear Classifiersestimators / models
LogisticRegression(C=1.0, max_iter=1000)★Fast, calibrated-ish, interpretable baseline; smaller C = stronger regularization.
LogisticRegression(penalty='l1', solver='liblinear')Sparse coefficients — built-in feature selection.
LinearSVC()Linear SVM, scales to wide sparse data (text).
SGDClassifier(loss='log_loss')Any linear model via SGD — huge data, supports partial_fit.
RidgeClassifier() Perceptron()Least-squares / classic online linear classifiers.
always scale features for linear modelsRegularization assumes comparable feature scales.
14Linear, Robust & GLM Regressionestimators / models
LinearRegression()★Ordinary least squares baseline.
Ridge(alpha=1.0) Lasso() ElasticNet()★L2 / L1 / both; RidgeCV, LassoCV tune alpha internally.
HuberRegressor() RANSACRegressor() TheilSenRegressor()Robust to outliers: soft down-weighting / inlier consensus / median-of-slopes.
QuantileRegressor(quantile=0.9)Predict conditional quantiles — pinball loss.
PoissonRegressor() GammaRegressor() TweedieRegressor()GLMs for counts, positive-skew, and insurance-style targets.
SVR(kernel='rbf') KernelRidge()Kernel non-linear regression; KernelRidge fits faster, SVR predicts faster.
15Trees & HistGradientBoostingestimators / models
HistGradientBoostingClassifier() / ...Regressor()★The modern default for tabular data — LightGBM-style, fast on 10k+ rows.
HistGradientBoosting...(categorical_features='from_dtype')Native categorical splits from pandas category dtype — no one-hot needed.
HistGradientBoosting...(early_stopping=True)Auto validation-based stopping; handles NaN natively too.
HistGradientBoosting...(monotonic_cst={'price': 1})Force monotone feature-response relationships.
RandomForestClassifier(n_estimators=300, oob_score=True)★Bagged trees — strong low-tuning default; OOB is a free validation score.
ExtraTreesClassifier() DecisionTreeClassifier(ccp_alpha=.01)Extra-random ensembles; single trees need pruning (ccp_alpha).
legacy GradientBoosting* — prefer Hist* except tiny dataOld GBM is exact but far slower; Hist* bins features (≤255 bins).
16Ensemble Meta-estimatorsestimators / models
VotingClassifier(estimators=[...], voting='soft')★Combine models by (probability-weighted) vote.
StackingClassifier(estimators=[...], final_estimator=LogisticRegression())Meta-model learns from base models' CV predictions.
BaggingClassifier(estimator, n_estimators=10)Any model on bootstrap samples — variance reduction.
AdaBoostClassifier()Sequentially reweights hard examples.
XGBoost / LightGBM / CatBoost use the sklearn APIextDrop into Pipelines, GridSearchCV, and cross_val_score unchanged.
17SVM, KNN, Naive Bayes & MLPestimators / models
SVC(kernel='rbf', C=1, gamma='scale') ★ · Max-margin, non-linear boundaries; fit time grows at least quadratically with the number of rows, so avoid on >20k rows.
SVC(probability=True)Enables predict_proba via internal CV — slower fit.
KNeighborsClassifier(n_neighbors=5)Vote among k nearest points — scale features first.
GaussianNB() MultinomialNB() ComplementNB()Probabilistic baselines: continuous / counts (text) / imbalanced text.
MLPClassifier(hidden_layer_sizes=(100,), early_stopping=True)Small feed-forward nets; for deep learning use PyTorch/JAX instead.
LinearDiscriminantAnalysis() QuadraticDiscriminantAnalysis()Gaussian class-conditional classifiers; LDA doubles as supervised projection.
18Clusteringestimators / models
KMeans(n_clusters=3, n_init='auto')★Fast, spherical clusters, k chosen in advance; scale features.
MiniBatchKMeans() BisectingKMeans()KMeans for big data / hierarchical top-down variant.
HDBSCAN(min_cluster_size=5)★1.3+Density-based, variable-density clusters + noise labels (-1); no eps to tune.
DBSCAN(eps=0.5, min_samples=5) OPTICS()Classic density clustering; OPTICS sweeps eps ranges.
AgglomerativeClustering(linkage='ward')Hierarchical bottom-up merging; dendrogram-friendly.
SpectralClustering() Birch() MeanShift()Graph-based / streaming / mode-seeking alternatives.
GaussianMixture(n_components=3).fit(X).bic(X) · Soft (probabilistic) clustering; bic/aic need a fitted model, and the lowest BIC/AIC picks the component count.
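Choosing the number of mixture components by BIC can be sketched as a loop over candidate counts. The blob data and the candidate range 1 to 6 are assumptions for the demo.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three compact synthetic clusters.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# Fit one mixture per candidate count; bic() requires a fitted model.
bic = {
    k: GaussianMixture(n_components=k, random_state=42).fit(X).bic(X)
    for k in range(1, 7)
}
best_k = min(bic, key=bic.get)  # lowest BIC wins
```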
19Dimensionality Reductionestimators / models
PCA(n_components=0.95)★Float = keep 95% of variance; int = exact count. Check explained_variance_ratio_.
TruncatedSVD(n_components=100)PCA-like for sparse matrices (LSA on tf-idf).
IncrementalPCA(batch_size=1000)PCA in mini-batches — larger-than-memory data.
KernelPCA(kernel='rbf')Non-linear PCA via the kernel trick.
NMF(n_components=10)Non-negative parts-based factorization — topics, spectra.
LinearDiscriminantAnalysis(n_components=2)Supervised projection that maximizes class separation.
fitting PCA before the train/test splitleakagePCA learns from data — fit on train only, inside the Pipeline.
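A sketch of PCA used the leakage-safe way, as a step inside a Pipeline so it is refit on each training fold. The digits dataset and the downstream LogisticRegression are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# A float n_components keeps just enough components to reach 95% of the variance.
pipe = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=2000)
)
scores = cross_val_score(pipe, X, y, cv=5)

# Inspect the fitted PCA step after a final fit on all data.
pca = pipe.fit(X, y).named_steps["pca"]
kept = pca.n_components_
variance_kept = pca.explained_variance_ratio_.sum()
```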
20Manifold Learning (Visualization)estimators / models
TSNE(n_components=2, perplexity=30)★2D/3D visualization only — no transform for new data; distances between clusters aren't meaningful.
Isomap() LocallyLinearEmbedding() MDS()Geodesic / local-linear / distance-preserving embeddings.
SpectralEmbedding()Graph-Laplacian embedding.
umap-learn: UMAP(n_neighbors=15)extFaster than t-SNE, supports transform on new data — sklearn-compatible API.
PCA to ~50 dims first, then t-SNE/UMAPStandard recipe: denoise + speed up the manifold step.
21Outlier & Novelty Detectionestimators / models
IsolationForest(contamination=0.05)★Random-split isolation — fast, high-dimensional-friendly; predicts +1/-1.
LocalOutlierFactor(n_neighbors=20)Local density comparison; novelty=True to score unseen data.
OneClassSVM(nu=0.05)Learn the boundary of "normal" — novelty detection.
EllipticEnvelope()Robust Gaussian fit — assumes elliptical normal data.
est.decision_function(X) est.score_samples(X)Continuous anomaly scores instead of hard labels.
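A minimal IsolationForest sketch on synthetic 2D points. The inlier cloud and the block of far-away points are fabricated for the demo, and contamination=0.05 mirrors the card rather than a recommended value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
inliers = rng.normal(0, 1, size=(200, 2))   # dense cloud around the origin
outliers = rng.uniform(6, 8, size=(10, 2))  # a few points far from the cloud
X = np.vstack([inliers, outliers])

iso = IsolationForest(contamination=0.05, random_state=42).fit(X)
labels = iso.predict(X)        # +1 = inlier, -1 = outlier
scores = iso.score_samples(X)  # lower score = more anomalous
```

The continuous `scores` are usually more useful than the hard labels, since the cut-off can then be chosen after looking at the score distribution.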
22Probability Calibrationselect & tune
CalibratedClassifierCV(model, method='isotonic', cv=5)★Make predict_proba honest; 'sigmoid' (Platt) for small data, isotonic for large.
CalibratedClassifierCV(model, method='temperature')1.8+Temperature scaling — single-parameter, preserves accuracy/ranking.
calibration_curve(y_test, proba, n_bins=10)Reliability diagram data: predicted vs. actual frequency.
brier_score_loss(y_test, proba)Proper scoring rule for probability quality (lower = better).
calibrate when probabilities drive decisionsSVMs, boosted trees and forests are often over/under-confident.
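A sketch of calibrating a margin-based classifier and checking the result. LinearSVC is used here only because it has no predict_proba of its own; the synthetic data and method='sigmoid' are illustrative choices.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The wrapper fits the SVM on CV folds and learns a sigmoid map from margin to probability.
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=5000), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]

brier = brier_score_loss(y_test, proba)  # lower is better
# Points for a reliability diagram: observed frequency vs. mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
```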
23Decision-Threshold Tuningselect & tune
TunedThresholdClassifierCV(model, scoring='f1')★1.5+Post-tune the 0.5 cut-off by internal CV to maximize any metric — the clean fix for cost-sensitive / imbalanced problems.
FixedThresholdClassifier(model, threshold=0.2)1.5+Set a business-chosen threshold explicitly.
FrozenEstimator(fitted_model)1.6+Wrap an already-fitted model so meta-estimators won't refit it.
tuned.best_threshold_ tuned.best_score_The chosen cut-off and its CV score.
tuning the threshold on training dataoverfitWith cv='prefit', tune on held-out data only.
24Cross-Validation Functionsselect & tune
cross_val_score(pipe, X, y, cv=5, scoring='f1')★k-fold scores in one call — pass the Pipeline, not the bare model.
cross_validate(pipe, X, y, scoring=['f1','roc_auc'], return_train_score=True)Multiple metrics + fit/score times; train-vs-test gap reveals overfitting.
cross_val_predict(pipe, X, y, cv=5)Out-of-fold predictions — honest confusion matrices & stacking inputs.
learning_curve(pipe, X, y)Score vs. training size — "would more data help?"
validation_curve(pipe, X, y, param_name='clf__C', param_range=...)Score vs. one hyperparameter — under/overfit sweep.
permutation_test_score(pipe, X, y)p-value: is the score better than label-shuffled chance?
25Hyperparameter Searchselect & tune
GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)★Exhaustive search; refits the best config on all training data.
RandomizedSearchCV(pipe, param_distributions, n_iter=50)★Sampled search — use scipy.stats.loguniform(1e-3, 1e2) for scale params.
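HalvingGridSearchCV / HalvingRandomSearchCV · Successive halving: score all candidates on a small budget, keep the best, and repeat with more resources. Experimental: first run from sklearn.experimental import enable_halving_search_cv.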
search.best_params_ .best_score_ .best_estimator_★Winning config, its CV score, and the refit model.
pd.DataFrame(search.cv_results_)Full per-candidate results table for analysis.
reporting best_score_ as the final performanceoptimisticSelection bias — confirm on a held-out test set or nested CV.
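A sketch of a grid search over a Pipeline with a separate held-out check, following the advice in this card. The dataset, the grid values, and the f1 scoring choice are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)  # refits the best config on all of X_train

cv_score = search.best_score_               # optimistic: used to pick the winner
test_score = search.score(X_test, y_test)   # held-out estimate with the same metric
```

Report `test_score` (or a nested-CV estimate) rather than `cv_score`, since the latter was already used for selection.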
26Classification Metricsevaluate
accuracy_score(y_test, y_pred)★Fraction correct — misleading if imbalanced.
precision_score / recall_score / f1_score(..., average='macro')★Per-class quality; average = binary / micro / macro / weighted.
balanced_accuracy_score() matthews_corrcoef()Imbalance-robust single numbers; MCC uses the whole confusion matrix.
confusion_matrix(y_test, y_pred) classification_report(...)★Counts per true/predicted class; report = P/R/F1 for every class.
roc_auc_score(y_test, y_scores)★Ranking quality across thresholds; multi_class='ovr' for multiclass.
average_precision_score(y_test, y_scores)PR-AUC — preferred over ROC-AUC on heavy imbalance.
log_loss(y_test, proba) brier_score_loss(...)Probability-quality metrics (lower = better).
cohen_kappa_score() class_likelihood_ratios()Chance-corrected agreement; LR+/LR− for diagnostic tests.
27Regression Metricsevaluate
root_mean_squared_error(y_test, y_pred)★1.4+Same units as target; replaces mean_squared_error(squared=False).
mean_absolute_error(y_test, y_pred) ★ · Average absolute error; less sensitive to outliers than squared-error metrics.
r2_score(y_test, y_pred)★Variance explained; can be negative for bad models.
mean_absolute_percentage_error(...) median_absolute_error(...)Relative error (beware near-zero targets) / outlier-immune median.
mean_pinball_loss(y, pred, alpha=0.9)Quantile-regression loss.
d2_absolute_error_score() mean_poisson_deviance()R²-style generalizations / GLM deviances for count targets.
28Clustering Metricsevaluate
silhouette_score(X, labels)★Cohesion vs. separation, [-1, 1] — no ground truth needed.
davies_bouldin_score(X, labels) calinski_harabasz_score(...)Internal indices: lower DB / higher CH = better.
adjusted_rand_score(y_true, labels)Agreement with known labels, chance-corrected.
adjusted_mutual_info_score() v_measure_score() · Information-theoretic agreement; V = harmonic mean of homogeneity and completeness.
elbow (inertia) + silhouette to choose kPlot both across k; agreement is a good sign.
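As a quick check on how V-measure relates to its two components, the sketch below recomputes it as the harmonic mean of homogeneity and completeness (the default beta=1 case). The label vectors are made up.

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

# Hypothetical ground truth and cluster assignments for nine points.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels = [0, 0, 1, 1, 1, 1, 2, 2, 0]

h = homogeneity_score(y_true, labels)   # each cluster contains one class?
c = completeness_score(y_true, labels)  # each class falls in one cluster?
v = v_measure_score(y_true, labels)

harmonic = 2 * h * c / (h + c)          # matches v up to floating-point error
```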
29Scorers & the scoring Parameterevaluate
scoring='f1_macro' 'neg_root_mean_squared_error'★String scorers for CV/search; error metrics are negated so greater = better.
sklearn.metrics.get_scorer_names()List every built-in scoring string.
make_scorer(fbeta_score, beta=2)Wrap any metric (plus kwargs) into a scorer.
make_scorer(my_loss, greater_is_better=False)Custom losses get auto-negated for maximization.
scoring={'f1': 'f1', 'auc': 'roc_auc'}, refit='f1'Multi-metric search — pick which one selects the winner.
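The sign convention for error metrics can be sketched by comparing a built-in scoring string with the equivalent custom scorer. Ridge and the synthetic regression data are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Built-in string scorer: MAE is negated so that greater is better.
builtin = cross_val_score(Ridge(), X, y, cv=5, scoring="neg_mean_absolute_error")

# Equivalent custom scorer: greater_is_better=False applies the same negation.
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
custom = cross_val_score(Ridge(), X, y, cv=5, scoring=mae_scorer)

mae_per_fold = -builtin  # flip the sign back to report the error itself
```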
30Plotting: the Display APIevaluate
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)★One-line confusion-matrix plot; normalize='true' for rates.
RocCurveDisplay.from_predictions(y_test, y_scores) ★ · ROC from stored scores; most Displays offer both from_estimator and from_predictions (DecisionBoundaryDisplay and PartialDependenceDisplay have from_estimator only).
PrecisionRecallDisplay DetCurveDisplay CalibrationDisplayPR curve, detection-error tradeoff, reliability diagram.
LearningCurveDisplay ValidationCurveDisplayPlot learning/validation curves without manual matplotlib.
PredictionErrorDisplay.from_estimator(reg, X, y)Predicted-vs-actual and residual plots for regression.
DecisionBoundaryDisplay.from_estimator(clf, X)2D decision-surface plot.
31Model Inspection & Explainabilityfit & inspect
permutation_importance(model, X_test, y_test, n_repeats=10)★Model-agnostic importance on held-out data — the trustworthy default.
PartialDependenceDisplay.from_estimator(model, X, ['age']) · Average feature-response curves; kind='both' overlays ICE lines, kind='individual' shows ICE lines only.
trusting feature_importances_ blindlybiasImpurity (MDI) importance inflates high-cardinality features and uses train data only.
coef_ on scaled features ≈ comparable effect sizesUnscaled coefficients are not comparable across features.
shap.TreeExplainer(model)extPer-prediction attributions for tree models — the SHAP library.
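A sketch of permutation importance on held-out data. The synthetic dataset is built with shuffle=False so that, by construction, the two informative features sit in columns 0 and 1 and the rest are noise; that layout is an assumption of the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Columns 0-1 are informative, columns 2-5 are pure noise.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each column of the test set 10 times and record the drop in score.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)
ranking = result.importances_mean.argsort()[::-1]  # most important first
```

`result.importances_std` gives the spread across repeats, which helps judge whether a small importance is distinguishable from zero.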
32Imbalanced Classification Playbookselect & tune
class_weight='balanced'★Reweight classes inversely to frequency — first thing to try.
model.fit(X, y, sample_weight=w)Per-row weights — cost-sensitive learning, most estimators.
TunedThresholdClassifierCV(model, scoring='f1')1.5+Move the decision threshold instead of distorting the data.
imblearn: SMOTE().fit_resample(X_train, y_train)extSynthetic minority oversampling — imbalanced-learn; resample training folds only (use imblearn's own Pipeline).
resampling before the split / inside plain PipelineleakageSynthetic points contaminate validation folds.
evaluate with PR-AUC, F1, balanced accuracy — not accuracyStratify every split.
33Multiclass, Multilabel & Multioutputestimators / models
OneVsRestClassifier(est) OneVsOneClassifier(est)Force a binary strategy; most sklearn classifiers are natively multiclass already.
MultiOutputClassifier(est) MultiOutputRegressor(est)One independent model per target column.
ClassifierChain(est) RegressorChain(est)Feed earlier targets into later ones — captures label correlations.
TransformedTargetRegressor(reg, func=np.log1p, inverse_func=np.expm1)Train on transformed y, predict back in original units.
MultiLabelBinarizer() LabelBinarizer()Label sets / labels ↔ indicator matrices.
34Semi-supervised Learningestimators / models
SelfTrainingClassifier(base_clf, threshold=0.75)Pseudo-label confident unlabeled rows iteratively.
LabelPropagation() LabelSpreading()Graph-based label diffusion; Spreading is noise-tolerant.
mark unlabeled samples with y = -1The sklearn convention for semi-supervised targets.
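The y = -1 convention can be sketched with SelfTrainingClassifier on iris, hiding about 70% of the labels. The masking rate and the base classifier are assumptions for the demo.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)

# Hide roughly 70% of the labels by replacing them with -1.
rng = np.random.default_rng(42)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1

# Rows predicted with probability >= 0.75 are pseudo-labeled and the model refits.
self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.75)
self_training.fit(X, y_partial)

accuracy = self_training.score(X, y)  # scored against the true labels
```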
35Gaussian Processes & Kernel Approximationestimators / models
GaussianProcessRegressor(kernel=RBF() + WhiteKernel())Bayesian regression with composable kernels — small data.
gpr.predict(X, return_std=True)Predictions with uncertainty — GP's superpower.
Nystroem(kernel='rbf', n_components=300)Approximate kernel features + linear model ≈ fast kernel SVM.
RBFSampler(gamma=1.0)Random Fourier features — same trick, cheaper.
36Big Data & Out-of-Core Learningestimators / models
est.partial_fit(X_batch, y_batch, classes=all_classes)★Incremental learning; pass the full class list on the first call.
SGDClassifier / SGDRegressor / PassiveAggressive*Linear learners built for streaming batches.
MiniBatchKMeans IncrementalPCA MiniBatchNMFBatched unsupervised counterparts.
HashingVectorizer + SGDClassifierClassic out-of-core text pipeline — no vocabulary state.
X = np.load('X.npy', mmap_mode='r')Memory-mapped arrays let estimators read larger-than-RAM data.
37Performance, GPU & Callbackshandle with care
n_jobs=-1safeAll CPU cores where supported (forests, KNN, CV, search).
set_config(array_api_dispatch=True)1.8+Pass PyTorch / CuPy arrays to supported estimators — computation runs on GPU.
model.set_callbacks(ProgressBar(), ScoringMonitor(scoring=...))1.9Progress bars & per-iteration metric logging (experimental, sklearn.callback).
from sklearnex import patch_sklearn; patch_sklearn()extIntel extension — order-of-magnitude speedups on CPU, same API.
float32 input ≈ half the memory, often enough precisionDowncast with X.astype(np.float32) for big matrices.
free-threaded CPython supported1.8+nogil builds let n_jobs use threads instead of processes.
38Model Persistence & Deploymentload & split
joblib.dump(pipe, 'model.joblib') joblib.load(...)★Standard save/load — persist the whole Pipeline, not just the model.
skops.io.dump(pipe, 'model.skops')extSecurity-audited format — loads without executing arbitrary code.
loading pickles from untrusted sourcesunsafePickle/joblib can execute arbitrary code on load — use skops for sharing.
loading across sklearn versionsfragileInconsistentVersionWarning — pin the training version in requirements.
skl2onnx.to_onnx(pipe, X[:1])extExport to ONNX for cross-language, dependency-free inference.
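A sketch of a joblib round trip for a whole Pipeline, written to a temporary directory so nothing is left behind. Only load files you created yourself, as this card warns.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(pipe, path)       # saves scaler and classifier together
    restored = joblib.load(path)  # trusted file only: loading can run code

same = np.array_equal(pipe.predict(X), restored.predict(X))
```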
39Baselines & Sanity Checksevaluate
DummyClassifier(strategy='most_frequent')★Majority-class baseline — beat this before celebrating.
DummyRegressor(strategy='mean') · Predict-the-mean baseline (R² = 0 on the training data, about 0 or slightly negative on held-out data).
suspiciously perfect scores → hunt for leakageDuplicated rows across splits, target-derived features, post-outcome features.
start simple: Dummy → linear → HistGradientBoostingEach step must justify its complexity.
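A sketch of the baseline check on a roughly 90/10 synthetic problem. It shows why accuracy flatters the majority-class dummy while balanced accuracy does not; the dataset and the LogisticRegression comparison are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

dummy = DummyClassifier(strategy="most_frequent")
model = LogisticRegression(max_iter=1000)

# Plain accuracy looks high for the dummy because the majority class dominates.
dummy_acc = cross_val_score(dummy, X, y, cv=5).mean()

# Balanced accuracy averages per-class recall, so always predicting one class scores 0.5.
dummy_bal = cross_val_score(dummy, X, y, cv=5, scoring="balanced_accuracy").mean()
model_bal = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
```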
40Metadata Routingfit & inspect
sklearn.set_config(enable_metadata_routing=True)1.6Route sample_weight, groups etc. through meta-estimators; rollout largely complete in 1.6.
model.set_fit_request(sample_weight=True)Declare which metadata an inner estimator consumes.
scorer.set_score_request(sample_weight=True)Weighted scoring inside CV/search.
cross_val_score(pipe, X, y, params={'sample_weight': w})Pass routed metadata via params once routing is on.
41Version Highlights 1.3 → 1.9fit & inspect
1.3 HDBSCAN, TargetEncoder, ValidationCurveDisplayPlus sample_weight in KMeans init.
1.4 root_mean_squared_error, polars in set_outputNative categorical dtype in HGB; PCA on sparse data.
1.5 TunedThresholdClassifierCV, FixedThresholdClassifierPost-hoc decision-threshold tuning arrives.
1.6 FrozenEstimator, metadata-routing rolloutMissing-value support in ExtraTrees; handle_unknown='warn'.
1.7 richer HTML repr with parameter valuesNon-default params highlighted; copy-ready step__param names.
1.8 Array-API/GPU estimators, temperature scalingFree-threaded CPython support.
1.9 callbacks (ProgressBar, ScoringMonitor), narwhalsFitted-attributes in HTML repr; sparse_interface='sparray' config.
42Top Gotchashandle with care
fit scaler/encoder/selector on all dataleakageFit on train, transform both — or better, put it in the Pipeline.
preprocessing outside cross-validationleakageCV must refit preprocessing per fold — Pipelines do this automatically.
fit_transform() on test dataleakageTest data gets transform() only.
imputing/oversampling before the splitleakageTest-set statistics bleed into training.
always set random_statesafeReproducible splits, shuffles, and stochastic fits.
ConvergenceWarning ignoredunfitRaise max_iter and/or scale features — the model didn't finish learning.
predict columns in a different order than fitsilentArrays are positional; DataFrames are checked against feature_names_in_.
★Common Fitted Attributesafter .fit()
model.coef_ model.intercept_★Learned weights (linear models).
model.feature_importances_★Impurity-based importance (trees/forests) — see MDI caveat, card 31.
model.classes_ model.n_features_in_ model.feature_names_in_Label order, feature count, and column names seen at fit.
pca.explained_variance_ratio_ kmeans.cluster_centers_ kmeans.inertia_Variance kept / centroids / within-cluster SSE.
forest.oob_score_ search.cv_results_Out-of-bag validation score; full search results table.
check_is_fitted(model)Raises if the estimator hasn't been fit (sklearn.utils.validation).
★Which Metric, When?quick-read
balanced classification★Accuracy is fine; add a confusion matrix anyway.
imbalanced classification★PR-AUC / F1 / balanced accuracy; tune the threshold (card 23).
probability outputs matterlog_loss + Brier + calibration curve.
regression★RMSE/MAE for error size, R² for variance explained; MAE if outliers.
clusteringSilhouette (no labels) / ARI & AMI (labels available).
ranking / retrievalROC-AUC, average precision, top_k_accuracy_score, ndcg_score.
★Which Scaler / Encoder, When?quick-read
default numeric★StandardScaler; RobustScaler if outliers; MinMax for bounded inputs (NN).
skewed numericPowerTransformer (yeo-johnson) or log via FunctionTransformer.
low-cardinality categorical★OneHotEncoder(handle_unknown='ignore').
high-cardinality categoricalTargetEncoder (1.3+) or hashing; one-hot explodes width.
tree / HGB modelsNo scaling needed; OrdinalEncoder or native categorical dtype suffices.
Worth memorizing
fit / transform / predict · fit learns from data, transform returns the modified features, predict returns labels or values
fit on train onlytransform test/validation data with the already-fit object
Pipeline prevents leakagepreprocessing is refit correctly inside each CV fold
random_state everywherereproducible splits, shuffling, and stochastic models
scale for distance/gradient modelsKNN, SVM, linear models & PCA need it; trees don't
HistGradientBoosting firstthe modern tabular default — NaN + categoricals handled natively
Grid vs Randomized searchexhaustive vs. sampled — randomized scales better on big grids
accuracy misleads on imbalancePR-AUC / F1 + confusion matrix; tune the threshold (1.5+)
cross_val_score defaultsstratified k-fold automatically for classifiers
step__param syntaxtune anything nested inside a Pipeline or ColumnTransformer
set_output(transform='pandas')named columns out of every transformer
beat DummyClassifier firsta model that can't beat majority-vote learned nothing
joblib for yourself, skops to sharepickles execute code on load — never load untrusted files
error metrics are negated scorers'neg_root_mean_squared_error' — greater is always better