Scikit-learn Cheat Sheet

01 Setup & Import · load & split

from sklearn.model_selection import train_test_split★
One of the most commonly imported functions in sklearn.
import numpy as np, pandas as pd
sklearn expects numeric arrays or DataFrames.
pip install scikit-learn
Install (imports as sklearn).
import sklearn; sklearn.__version__
Check the installed version.
random_state=42★
Set it anywhere randomness appears, for reproducibility.

02 Loading & Splitting Data · load & split

from sklearn.datasets import load_iris★
Small built-in toy datasets for practice.
fetch_california_housing()
Larger real-world datasets, downloaded on first use.
make_classification(n_samples=200, n_features=4)★
Synthetic classification data for testing.
make_regression(...) make_blobs(...)
Synthetic regression / clustering data.
train_test_split(X, y, test_size=0.2, random_state=42)★
Split into training and test sets.
train_test_split(..., stratify=y)★
Preserve class proportions in the split.
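
As a minimal sketch of the loading and splitting calls above, the following uses the built-in iris data (150 samples, 3 classes of 50 each). The variable names are illustrative only.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris: 150 samples, 4 features, 3 classes of 50 samples each.
X, y = load_iris(return_X_y=True)

# Hold out 20% for testing; stratify=y keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(Counter(y_test.tolist()))     # 10 test samples per class
```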

03 The Estimator API · fit & inspect

model = Estimator(**params)★
Every algorithm is a class configured by hyperparameters.
model.fit(X_train, y_train)★
Learn from training data — the universal verb.
model.predict(X_test)★
Generate predictions on new data.
model.predict_proba(X_test)
Class probabilities (classifiers only).
model.score(X_test, y_test)★
Quick built-in metric (accuracy / R²).
transformer.fit_transform(X)
Fit + transform in one call (preprocessors).
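
The fit / predict / score cycle might look like the sketch below. LogisticRegression and max_iter=1000 are stand-ins chosen for illustration; any classifier follows the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)  # configure hyperparameters
model.fit(X_train, y_train)                # learn from training data

y_pred = model.predict(X_test)             # hard class labels, shape (30,)
y_proba = model.predict_proba(X_test)      # per-class probabilities, shape (30, 3)
accuracy = model.score(X_test, y_test)     # mean accuracy for classifiers
```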

04 Preprocessing: Scaling · preprocess

StandardScaler()★
Zero mean, unit variance — the default choice.
MinMaxScaler()
Scale into a fixed [0, 1] range.
RobustScaler()★
Uses median/IQR — resistant to outliers.
Normalizer()
Scale each row to unit norm (not each column).
scaler.fit(X_train) → .transform(X_test) · fit once
Fit only on train — never refit on test.
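
To make the fit-once rule concrete, here is a small sketch on made-up numbers: the scaler learns its mean and standard deviation from the training rows and reuses them on the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(scaler.mean_)    # [ 2. 20.]
print(X_test_scaled)   # test row measured in train-set standard deviations
```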

05 Preprocessing: Encoding Categorical Data · preprocess

OneHotEncoder(handle_unknown='ignore')★
One binary column per category.
OrdinalEncoder()
Integer codes — only for genuinely ordered categories.
LabelEncoder()
Encodes the target y only, not feature columns.
pd.get_dummies() — pandas alternative
Quick one-hot outside a Pipeline.
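
A short sketch of the two feature encoders, using invented category values. It shows what handle_unknown='ignore' does with a category absent from fit, and how an explicit order can be passed to OrdinalEncoder.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors_train = [["red"], ["green"], ["blue"]]
colors_test = [["green"], ["purple"]]  # "purple" was never seen during fit

# Unseen categories become an all-zero row instead of raising an error.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(colors_train)
encoded = onehot.transform(colors_test).toarray()
# Columns follow the sorted categories: blue, green, red.

# Ordered categories: state the order explicitly rather than relying on sorting.
sizes = [["small"], ["large"], ["medium"]]
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(sizes)
```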

06 Preprocessing: Missing Values · preprocess

SimpleImputer(strategy='mean')★
Fill with mean / median / most_frequent / constant.
KNNImputer(n_neighbors=5)
Fill using similar rows' values.
IterativeImputer()
Models each feature from the others (experimental).
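
The imputers above can be tried on a tiny array with gaps, as in this sketch (values invented). IterativeImputer is left out of the runnable part because it additionally requires `from sklearn.experimental import enable_iterative_imputer` before it can be imported.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [7.0, 8.0]])

# Column means are computed from the observed values only:
# column 0 -> (1 + 5 + 7) / 3, column 1 -> (2 + 4 + 8) / 3.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Each gap is filled from the 2 nearest rows that have that feature observed.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```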

07 Feature Engineering & Selection · select & tune

PolynomialFeatures(degree=2)
Add interaction & polynomial terms.
SelectKBest(f_classif, k=10)★
Keep the k highest-scoring features.
RFE(estimator, n_features_to_select=5)
Recursively drop the weakest feature.
VarianceThreshold()
Drop near-constant, uninformative features.
SelectFromModel(estimator)★
Keep features a fitted model rates important (fits the estimator itself unless prefit=True).
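
A hedged sketch of univariate selection on synthetic data; the sample and feature counts are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 10 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=42
)

selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)   # keeps the 3 best-scoring columns

kept = selector.get_support(indices=True)   # column indices that survived
print(X_selected.shape, kept)
```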

08 Pipelines & ColumnTransformer · estimators / models

Pipeline([('scaler', StandardScaler()), ('clf', model)])★
Chain preprocessing + model as one estimator.
make_pipeline(StandardScaler(), model)
Same thing, with auto-named steps.
ColumnTransformer([('num', StandardScaler(), num_cols), ...])★
Different preprocessing per column group.
pipe.named_steps['clf']
Reach into a specific pipeline step.
param_grid = {'clf__C': [...]}★
Tune inside a pipeline with step__param.
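
Putting the pieces of this section together might look like the sketch below. The DataFrame, its column names, and the step names "prep" and "clf" are all invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up dataset.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30.0, 52.0, 81.0, 90.0, 60.0, 41.0],
    "city": ["A", "B", "A", "C", "B", "A"],
})
y = [0, 0, 1, 1, 1, 0]

num_cols = ["age", "income"]
cat_cols = ["city"]

# Scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
pipe.fit(df, y)

clf = pipe.named_steps["clf"]   # reach into a specific step
pipe.set_params(clf__C=0.1)     # same step__param addressing used in param_grid
```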

09 Classification Models · estimators / models

LogisticRegression()★
Fast, interpretable linear baseline.
KNeighborsClassifier()
Vote among the k nearest training points.
SVC(kernel='rbf')
Max-margin classifier; rbf for non-linear boundaries.
DecisionTreeClassifier()
Interpretable, prone to overfitting alone.
RandomForestClassifier()★
Bagged trees — strong, low-tuning default.
GaussianNB()
Fast probabilistic baseline for continuous features and small data; for text counts, MultinomialNB is the usual choice.

10 Regression Models · estimators / models

LinearRegression()★
Ordinary least squares baseline.
Ridge() Lasso() ElasticNet()★
L2 / L1 / both — regularized linear models.
SVR(kernel='rbf')
Support vector regression.
RandomForestRegressor()★
Bagged trees for continuous targets.
GradientBoostingRegressor()
Sequential boosted trees — strong but slower.

11 Clustering Models · estimators / models

KMeans(n_clusters=3)★
Fast, needs k chosen in advance.
DBSCAN(eps=0.5, min_samples=5)★
Density-based; finds arbitrary shapes + noise.
AgglomerativeClustering()
Hierarchical, bottom-up merging.
MeanShift()
Finds cluster count automatically, slower.

12 Dimensionality Reduction · estimators / models

PCA(n_components=2)★
Linear projection onto directions of max variance.
TruncatedSVD()
PCA variant that works on sparse data.
TSNE(n_components=2)★
Nonlinear — for 2D/3D visualization only.
LinearDiscriminantAnalysis()
Supervised projection that separates classes.
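
As one possible illustration of PCA, this sketch projects the iris features to two dimensions. Scaling first is an assumption made here because PCA is sensitive to feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)     # shape (150, 2)

# Share of total variance captured by each component, largest first.
print(pca.explained_variance_ratio_)
```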

13 Ensemble Methods · estimators / models

VotingClassifier(estimators=[...])★
Combine predictions by (weighted) vote.
StackingClassifier(estimators=[...], final_estimator=...)
A meta-model learns from the base models.
BaggingClassifier(estimator, n_estimators=10)
Many models on bootstrap samples.
AdaBoostClassifier()
Sequentially reweights hard examples.
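
A minimal voting-ensemble sketch; the three base models and the choice of soft voting are assumptions made for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each entry is a (name, estimator) pair; soft voting averages predict_proba.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
voter.fit(X_train, y_train)
accuracy = voter.score(X_test, y_test)
```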

14 Model Selection & Cross-Validation · select & tune

cross_val_score(model, X, y, cv=5)★
k-fold accuracy/R² in one call.
KFold(n_splits=5) StratifiedKFold(...)★
Plain folds / folds that preserve each class's proportion.
cross_validate(model, X, y, scoring=[...])
Multiple metrics + fit times, one call.
learning_curve(model, X, y)
Score vs. training-set size — diagnoses under/overfitting.
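
The cross-validation helpers could be combined as in this sketch. Wrapping the scaler and model in one pipeline means the scaler is refit inside every fold; the specific model and metrics are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Shuffled, stratified 5-fold splitter with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=cv)   # one accuracy value per fold
print(scores.mean(), scores.std())

# Several metrics plus timing information in one call.
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1_macro"])
```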

15 Hyperparameter Tuning · select & tune

GridSearchCV(estimator, param_grid, cv=5)★
Exhaustive search over a parameter grid.
RandomizedSearchCV(estimator, param_distributions, n_iter=20)★
Sample n_iter random candidates from the given distributions; cheaper than a full grid at scale.
grid.best_params_ .best_score_ .best_estimator_★
Winning config, its score, and the refit model.
HalvingGridSearchCV(...)
Successive halving; faster for large grids (experimental: import enable_halving_search_cv from sklearn.experimental first).
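
A small grid search over a pipeline, as a sketch: the SVC model, the step names, and the candidate values are assumptions picked to keep the grid tiny.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC(kernel="rbf"))])

# step__param keys route each value to the right pipeline step.
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.1]}

grid = GridSearchCV(pipe, param_grid, cv=5)   # 3 x 2 = 6 candidates, 5 folds each
grid.fit(X, y)

print(grid.best_params_)      # winning configuration
print(grid.best_score_)       # its mean cross-validated accuracy
best_model = grid.best_estimator_   # pipeline refit on all of X, y
```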

16 Classification Metrics · evaluate

accuracy_score(y_test, y_pred)★
Fraction correct — misleading if imbalanced.
precision_score / recall_score / f1_score★
Per-class quality on imbalanced data.
confusion_matrix(y_test, y_pred)★
True vs. predicted class counts.
classification_report(y_test, y_pred)★
Precision/recall/F1 for every class at once.
roc_auc_score(y_test, y_scores)
Ranking quality across all thresholds.
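
A worked example on hand-made, imbalanced labels (8 negatives, 2 positives) shows why accuracy alone can mislead: always predicting 0 would also score 0.8 here. All values are invented for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

y_test   = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred   = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.45]

acc = accuracy_score(y_test, y_pred)     # 8 of 10 correct = 0.8
prec = precision_score(y_test, y_pred)   # 1 TP / (1 TP + 1 FP) = 0.5
rec = recall_score(y_test, y_pred)       # 1 TP / (1 TP + 1 FN) = 0.5
f1 = f1_score(y_test, y_pred)            # harmonic mean of 0.5 and 0.5 = 0.5
cm = confusion_matrix(y_test, y_pred)    # rows = true class, columns = predicted
auc = roc_auc_score(y_test, y_scores)    # 15 of 16 pos/neg pairs ranked correctly
```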

17 Regression Metrics · evaluate

mean_squared_error(y_test, y_pred)★
Penalizes large errors heavily.
root_mean_squared_error(y_test, y_pred)
Same units as the target (sklearn 1.4+).
mean_absolute_error(y_test, y_pred)★
Average absolute error; less sensitive to outliers than MSE.
r2_score(y_test, y_pred)★
Variance explained — 1.0 is a perfect fit.
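
The regression metrics can be checked by hand on four made-up points. RMSE is taken here as the square root of MSE so the sketch also runs on versions older than 1.4.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])   # errors: -1, 0, +1, +3

mse = mean_squared_error(y_test, y_pred)    # (1 + 0 + 1 + 9) / 4 = 2.75
rmse = np.sqrt(mse)                         # back in the target's units
mae = mean_absolute_error(y_test, y_pred)   # (1 + 0 + 1 + 3) / 4 = 1.25
r2 = r2_score(y_test, y_pred)               # 1 - 11 / 20 = 0.45
```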

18 Clustering Metrics · evaluate

silhouette_score(X, labels)★
Cohesion vs. separation — no ground truth needed.
adjusted_rand_score(y_true, labels)
Agreement with known labels, chance-corrected.
calinski_harabasz_score(X, labels)
Ratio of between/within-cluster dispersion.
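
A sketch of scoring a clustering with and without ground truth, on synthetic blobs. The blob parameters and n_init=10 are arbitrary illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three synthetic blobs; y_true is known only because the data is generated.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

sil = silhouette_score(X, labels)            # needs no labels; range -1 to 1
ari = adjusted_rand_score(y_true, labels)    # needs labels; 1.0 is perfect agreement
print(sil, ari)
```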

19 Model Persistence · load & split

import joblib★
Preferred over pickle for NumPy-heavy sklearn models.
joblib.dump(model, 'model.pkl')★
Save a fitted model/pipeline to disk.
joblib.load('model.pkl')★
Reload it later, ready to .predict().
persist the whole Pipeline, not just the model
Keeps preprocessing bundled with it.
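
Saving and reloading a whole pipeline might look like this sketch. A temporary directory stands in for a real location, and the file name follows the example above. Since joblib files are pickle-based, only load files from sources you trust.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "model.pkl")
    joblib.dump(pipe, path)          # scaler and model are saved together
    restored = joblib.load(path)     # ready to predict, no refit needed

same = bool((restored.predict(X) == pipe.predict(X)).all())
print(same)  # True
```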

20 Performance Tips & Gotchas · handle with care

fit scaler/encoder on train, transform both · leakage
Fitting on all data leaks test information.
always set random_state · safe
Makes splits, shuffling & models reproducible.
class_weight='balanced'
Reweight classes automatically for imbalance.
n_jobs=-1
Use all CPU cores where supported (forests, CV, search).
preprocessing outside cross-validation · leakage
Avoid it: put preprocessing inside the Pipeline you cross-validate.
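
To show what class_weight='balanced' computes, this sketch calls the helper directly on made-up labels. The weight for each class is n_samples / (n_classes * class_count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 8 samples of class 0, 2 samples of class 1.
y = np.array([0] * 8 + [1] * 2)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
# class 0: 10 / (2 * 8) = 0.625, class 1: 10 / (2 * 2) = 2.5
print(weights)
```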

★ Common Estimator Attributes · after .fit()

model.coef_ model.intercept_★
Learned weights (linear models).
model.feature_importances_★
Split-based importance (trees / forests).
model.classes_
The label order used internally.
model.cluster_centers_
Centroid coordinates (KMeans).
model.n_features_in_
Number of features seen during fit.
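
The attributes listed above can be inspected as in this sketch. The three models are arbitrary examples, each fitted on the iris data (4 features, 3 classes).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

linear = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(linear.coef_.shape)            # (3, 4): one weight row per class
print(linear.intercept_.shape)       # (3,)
print(linear.classes_)               # [0 1 2], the column order of predict_proba
print(linear.n_features_in_)         # 4
print(forest.feature_importances_)   # one value per feature, summing to 1
print(kmeans.cluster_centers_.shape) # (3, 4): one centroid per cluster
```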

★ Which Metric, When? · quick-read

balanced classification★
Accuracy is fine.
imbalanced classification★
Use F1, precision/recall, or ROC-AUC instead.
regression★
RMSE/MAE for error size, R² for variance explained.
clustering
Silhouette score — works without labels.


[Figures: The ML workflow, visually: train_test_split() ★ · K-Fold Cross-Validation ★ · Pipeline chaining ★ · Under- vs. overfitting ★]

★ = Worth memorizing