Quick Reference · machine learning in Python

scikit-learn cheat sheet

Every algorithm in scikit-learn shares one shape: an estimator object with .fit() plus .predict() or .transform(). Every project is built from the same handful of moving parts: preprocessing, a model, cross-validation, and a metric. Chain them with a Pipeline and the rest is choosing the right piece for the job.
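As a minimal sketch of that shared shape, the snippet below runs one transformer and one predictor through the same fit-then-use pattern. The iris dataset, StandardScaler, and LogisticRegression are illustrative choices, not something the text above prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data: 150 rows, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Transformer: fit() learns per-column mean and std, transform() applies them.
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)

# Predictor: fit() learns the coefficients, predict() outputs class labels.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y)
labels = clf.predict(X_scaled[:5])
print(labels)
```

Both objects are used the same way, which is what lets a Pipeline treat them as interchangeable steps.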

load & split · preprocess · estimators / models · fit & inspect · select & tune · evaluate · gotcha · most common

Distilled & cross-checked across: scikit-learn.org official docs & the "Choosing the right estimator" flowchart · the official DataCamp PDF cheat sheet · ODSC's estimator-selection guide · GeeksforGeeks · pickl.ai

Choosing the right estimator — simplified from the official scikit-learn map
START: more than 50 samples? If no, get more data. If yes, what's the goal?

Classification (labeled): Linear SVC → KNeighbors / SVC (rbf) → RandomForest / NaiveBayes
Regression (quantity): Ridge / Lasso → SVR (linear, then rbf) → Ensemble (RF / GBoost)
Clustering (no labels): KMeans (k known) → DBSCAN / MeanShift → GaussianMixture (GMM)
Dimensionality Reduction: PCA → t-SNE (visualization) → Isomap / LLE

Each → is a "try next" arrow (orange in the original figure): if an estimator underperforms, follow the arrow. Full map: scikit-learn.org/stable/machine_learning_map.html
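One way to act on the "try next" idea is to score a few candidates from the same branch under identical cross-validation and compare. This is a rough sketch for the classification branch only. The dataset, the scaling step, and the hyperparameters are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Candidates in the order the classification branch suggests trying them.
# LinearSVC and KNeighbors are scale-sensitive, so they get a scaler.
candidates = {
    "LinearSVC": make_pipeline(StandardScaler(), LinearSVC(max_iter=10000)),
    "KNeighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in candidates.items():
    # Same 5-fold evaluation for every candidate, so the means are comparable.
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```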
01 Setup & Import · load & split
02 Loading & Splitting Data · load & split
03 The Estimator API · fit & inspect
04 Preprocessing: Scaling · preprocess
05 Preprocessing: Encoding Categorical Data · preprocess
06 Preprocessing: Missing Values · preprocess
07 Feature Engineering & Selection · select & tune
08 Pipelines & ColumnTransformer · estimators / models
09 Classification Models · estimators / models
10 Regression Models · estimators / models
11 Clustering Models · estimators / models
12 Dimensionality Reduction · estimators / models
13 Ensemble Methods · estimators / models
14 Model Selection & Cross-Validation · select & tune
15 Hyperparameter Tuning · select & tune
16 Classification Metrics · evaluate
17 Regression Metrics · evaluate
18 Clustering Metrics · evaluate
19 Model Persistence · load & split
20 Performance Tips & Gotchas · handle with care
Common Estimator Attributes · after .fit()
Which Metric, When? · quick-read

The ML workflow, visually

Four visuals behind almost every scikit-learn project: how data gets split, how cross-validation folds rotate, how a Pipeline chains steps, and why model complexity has a sweet spot.

train_test_split() ★

The dataset is split once — the model never sees the test rows until final evaluation.

X_train, y_train (80%) | test (20%): test rows are held out until final .score() / metrics
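The 80/20 split pictured above might look like this in code. The dataset, the model, and the stratify and random_state settings are assumptions chosen for a reproducible illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Split once: 80% train, 20% test. stratify=y keeps class proportions
# the same in both parts; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The model only ever sees the training rows during fit().
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The held-out rows are used once, for the final evaluation.
test_accuracy = model.score(X_test, y_test)
print(X_train.shape, X_test.shape, round(test_accuracy, 3))
```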

K-Fold Cross-Validation ★

Each row is one fold's split; the validation block (amber) rotates so every sample is used for validation exactly once.

fold 1 · fold 2 · fold 3 · fold 4 · fold 5 (blue = train, amber = validation). cross_val_score returns one score per fold; average the 5 with .mean().
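A small sketch can check the rotation directly by counting how often each row lands in a validation block. The dataset, the model, and the shuffle and random_state settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each row appears in a validation block.
times_validated = np.zeros(len(X), dtype=int)
for train_idx, val_idx in cv.split(X):
    times_validated[val_idx] += 1

# cross_val_score returns one score per fold; the caller averages them.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
mean_score = scores.mean()
print(scores.shape, round(mean_score, 3))
```

Passing the same cv object to cross_val_score means the scores come from the same five splits that were counted in the loop.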

Pipeline chaining ★

One .fit() call fits every step in order; at predict time the same fitted steps are applied to new data, with no refitting.

raw data → Imputer → Scaler → Model → pred. Pipeline([('impute', ...), ('scale', ...), ('clf', ...)]) is refit correctly inside every cross-validation fold.
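Filling in that three-step chain could look like the sketch below. The step names follow the diagram. The concrete estimators (SimpleImputer, StandardScaler, LogisticRegression) and the artificially blanked values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Blank out roughly 5% of the values so the imputer has work to do.
X = X.copy()
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inside each fold, the imputer and scaler are fit on that fold's
# training rows only, so nothing leaks from the validation rows.
scores = cross_val_score(pipe, X, y, cv=5)

# One fit() call fits all three steps; predict() reuses the fitted steps.
pipe.fit(X, y)
pred = pipe.predict(X[:3])
print(scores.mean().round(3), pred)
```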

Under- vs. overfitting ★

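Training error generally keeps falling as complexity grows, while validation error is U-shaped; the widening gap between them is the overfitting signal to watch.

Figure: error vs. model complexity, with train error and val error curves; regions from left to right: underfit, sweet spot, overfit.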
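The two curves can be approximated numerically with validation_curve. This sketch uses decision-tree depth as the complexity knob. The dataset and the depth grid are assumptions, and the exact numbers will vary with the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# max_depth stands in for "model complexity" on the x-axis.
depths = [1, 2, 3, 5, 8, 12]

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)

# Convert mean accuracy per depth into error rates.
train_err = 1 - train_scores.mean(axis=1)
val_err = 1 - val_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_err, val_err):
    print(f"max_depth={depth:2d}  train error={tr:.3f}  val error={va:.3f}")
```

Reading the printed table top to bottom gives the same picture as the figure: pick the depth where validation error bottoms out rather than where training error is lowest.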

Worth memorizing

fit / transform / predict: fit learns from data, transform reshapes data, predict outputs labels or values
fit on train only: transform test/validation data with the already-fit object
Pipeline prevents leakage: preprocessing is refit correctly inside each CV fold
random_state everywhere: for reproducible splits, shuffling, and stochastic models
scale for distance/gradient models: KNN, SVM, linear models & PCA need it; trees don't
Grid vs Randomized search: exhaustive vs. sampled; randomized scales better on big grids
accuracy misleads on imbalance: check the confusion matrix and F1 too
fit_transform is for training data: never call it on test/validation data; use transform
joblib over pickle: more efficient for the large NumPy arrays inside fitted sklearn models
cross_val_score defaults: an integer cv uses stratified k-fold automatically for classifiers
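To make the "accuracy misleads on imbalance" item concrete, here is a minimal sketch with a synthetic 95/5 class split and a baseline that always predicts the majority class. The data and the DummyClassifier baseline are assumptions for illustration, and the metrics are computed on the same rows purely to show how they behave.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Synthetic labels: 95 negatives, 5 positives. Features are irrelevant here.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))

# Baseline that always predicts the most frequent class (0).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)            # 0.95: looks strong
f1 = f1_score(y, y_pred, zero_division=0)  # 0.0: no positive is ever found
cm = confusion_matrix(y, y_pred)           # [[95, 0], [5, 0]]

print(acc, f1)
print(cm)
```

The confusion matrix shows all 5 positives in the false-negative cell, which accuracy alone hides.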