🌊 Scikit-learn: Deep Dive & Best Practices

Concise, clear, and validated revision notes on scikit-learn: the estimator API, preprocessing, pipelines, model evaluation, and practical best practices for beginners and practitioners.

Table of Contents

  1. Introduction
  2. Core Concepts
  3. The Estimator API
  4. Data Preprocessing
  5. Feature Engineering
  6. Pipelines
  7. Model Selection and Evaluation
  8. Cross-Validation
  9. Hyperparameter Tuning
  10. Best Practices
  11. Common Pitfalls
  12. Jargon Tables

Introduction

Scikit-learn (sklearn) is a comprehensive, open-source machine learning library for Python. Built on top of NumPy, SciPy, and Matplotlib, it provides simple and efficient tools for data mining, data analysis, and predictive modeling. Scikit-learn offers a consistent interface across diverse machine learning algorithms including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

Key Features

  • Unified API: Consistent interface across all estimators
  • Extensive Algorithms: Classification, regression, clustering, dimensionality reduction
  • Built-in Preprocessing: Data transformation and feature engineering tools
  • Model Selection: Cross-validation, grid search, and evaluation metrics
  • Robust Documentation: Comprehensive guides and examples
  • Production Ready: Efficient implementations suitable for real-world applications

Core Concepts

What is an Estimator?

An estimator is any object that learns from data by implementing a fit() method. Estimators can be classifiers, regressors, clusterers, or transformers. All scikit-learn estimators follow a consistent API pattern.

Fundamental Principles

Consistency

All objects share a uniform interface with limited, well-documented methods:

  • fit(X, y): Learn parameters from training data
  • predict(X): Make predictions on new data
  • transform(X): Transform data (for transformers)
  • score(X, y): Evaluate model performance

Inspection

All learned parameters are accessible as public attributes with trailing underscores (e.g., coef_, feature_importances_).
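
As a minimal sketch of this convention, the snippet below fits a LinearRegression on a tiny made-up dataset (the values are illustrative only) and reads back the learned attributes. Before fit() is called, those attributes do not exist on the estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 exactly (illustrative values)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

# Learned parameters carry a trailing underscore and exist only after fit()
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```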

Non-proliferation of Classes

Datasets are represented as NumPy arrays or SciPy sparse matrices. Hyperparameters are standard Python strings or numbers.

Composition

Machine learning algorithms are expressed as sequences of fundamental operations. Scikit-learn reuses existing building blocks whenever possible.

Sensible Defaults

Models provide reasonable default parameter values to enable quick prototyping.


The Estimator API

Core Methods

fit(X, y=None)

Trains or fits the model to data X (and target y, if applicable). Returns self to enable method chaining.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)  # Learns parameters from training data

predict(X)

Makes predictions on new data X using learned parameters. Used in supervised learning (classifiers and regressors).

y_pred = model.predict(X_test)  # Returns predicted values

Important: predict() cannot be called before fit(). Attempting to do so raises NotFittedError.
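
The behaviour described above can be checked with a short sketch. It assumes a fresh LinearRegression and a small illustrative input array, and catches the exception instead of letting it propagate.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

model = LinearRegression()
X_new = np.array([[1.0], [2.0]])

try:
    model.predict(X_new)  # No fit() call has been made yet
    raised = False
except NotFittedError as exc:
    raised = True
    print(f"Caught: {type(exc).__name__}")
```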

transform(X)

Transforms input data X. Used by transformers (scalers, encoders, dimensionality reducers).

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # Fit and transform in one step
X_test_scaled = scaler.transform(X_test)  # Apply same transformation to test data

fit_transform(X, y=None)

Convenience method that combines fit() and transform() on the same data. Some transformers implement it more efficiently than two separate calls. Primarily used for transformers.

# Instead of:
scaler.fit(X_train)
X_scaled = scaler.transform(X_train)

# Use:
X_scaled = scaler.fit_transform(X_train)

fit_predict(X, y=None)

Fits the model and returns predictions on training data. Relevant for unsupervised learning (clustering algorithms).

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X)  # Fit and predict cluster labels

score(X, y)

Evaluates model performance. Returns different metrics based on estimator type:

  • Classifiers: Accuracy score
  • Regressors: R² coefficient of determination
  • Clusterers: Not typically available

accuracy = classifier.score(X_test, y_test)  # Mean accuracy
r2 = regressor.score(X_test, y_test)  # R²; named r2 so it does not shadow sklearn.metrics.r2_score

Estimator Types

Classifiers

Inherit from ClassifierMixin. Predict discrete class labels.

Key Methods:

  • fit(X, y): Train on features X and labels y
  • predict(X): Return predicted class labels
  • predict_proba(X): Return probability estimates for each class
  • predict_log_proba(X): Return log-probabilities
  • decision_function(X): Return confidence scores
  • score(X, y): Return accuracy score (default metric)

Examples: LogisticRegression, RandomForestClassifier, SVC

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)

Regressors

Inherit from RegressorMixin. Predict continuous target values.

Key Methods:

  • fit(X, y): Train on features X and continuous targets y
  • predict(X): Return predicted values
  • score(X, y): Return R² score (default metric)

Examples: LinearRegression, RandomForestRegressor, SVR

from sklearn.linear_model import Ridge

reg = Ridge(alpha=1.0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
r2 = reg.score(X_test, y_test)

Transformers

Inherit from TransformerMixin. Transform input data without changing the number of samples.

Key Methods:

  • fit(X, y=None): Learn transformation parameters
  • transform(X): Apply transformation
  • fit_transform(X, y=None): Fit and transform in one step
  • inverse_transform(X): Reverse transformation (when applicable)

Examples: StandardScaler, PCA, OneHotEncoder

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_original = scaler.inverse_transform(X_test_scaled)  # Reverse scaling

Clusterers

Inherit from ClusterMixin. Group similar samples together.

Key Methods:

  • fit(X, y=None): Learn cluster structure (y is ignored if provided)
  • predict(X): Assign cluster labels to new data (when applicable)
  • fit_predict(X): Fit and assign labels in one step

Learned Attributes:

  • labels_: Cluster labels for each training sample

Examples: KMeans, DBSCAN, AgglomerativeClustering

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=42)
cluster_labels = kmeans.fit_predict(X)
new_labels = kmeans.predict(X_new)  # Assign new samples to clusters

Data Preprocessing

Data preprocessing transforms raw data into a format suitable for machine learning algorithms. Scikit-learn provides comprehensive preprocessing tools.

Scaling and Normalization

StandardScaler

Standardizes features by removing mean and scaling to unit variance (z-score normalization).

Formula: z = (x - μ) / σ

Use When: Features have Gaussian distribution or different scales.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Learned parameters
print(scaler.mean_)  # Mean of each feature
print(scaler.scale_)  # Standard deviation of each feature
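
To make the z-score formula concrete, here is a small sketch that compares StandardScaler with a manual NumPy computation. The array values are made up for illustration. Note that the scaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Manual z-score: z = (x - mu) / sigma, computed per column
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # ddof=0, matching StandardScaler
X_manual = (X_train - mu) / sigma

matches = np.allclose(X_scaled, X_manual)
print("Manual and scaler results agree:", matches)
```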

MinMaxScaler

Scales features to a specified range (default [0, 1]).

Formula: X_scaled = (X - X_min) / (X_max - X_min)

Use When: Features need to be bounded within a specific range.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_train)

RobustScaler

Uses statistics robust to outliers (median and interquartile range).

Use When: Data contains many outliers.

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)

MaxAbsScaler

Scales each feature by its maximum absolute value. Preserves sparsity.

Use When: Data is already centered at zero or is sparse.

from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_train)

Normalizer

Normalizes samples individually to unit norm (L1, L2, or max norm).

Use When: Interested in the direction of feature vectors, not magnitude.

from sklearn.preprocessing import Normalizer

normalizer = Normalizer(norm='l2')
X_normalized = normalizer.fit_transform(X)

Encoding Categorical Variables

OneHotEncoder

Creates binary columns for each category (one-hot encoding/dummy variables).

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, drop='first')  # Drop first to avoid multicollinearity
X_encoded = encoder.fit_transform(X_categorical)

# Get feature names
feature_names = encoder.get_feature_names_out()
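
A small worked example may help show what the encoded matrix looks like. This sketch uses a made-up single-column color feature and the default encoder settings (no dropped column), converting the sparse result to a dense array only for display.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_categorical = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# Default output is a SciPy sparse matrix; densify only for inspection
X_encoded = encoder.fit_transform(X_categorical).toarray()

# Categories are stored in sorted order: blue, green, red
categories = list(encoder.categories_[0])
print(categories)
print(X_encoded)
```

Each row contains a single 1 in the column of its category, so "red" maps to the third column.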

OrdinalEncoder

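Encodes categorical features as integers. The integer order is meaningful only when the category order is passed explicitly through the categories parameter; by default, categories are sorted.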

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_encoded = encoder.fit_transform(X_ordinal)
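
The effect of the categories argument is easiest to see side by side. The sketch below uses a made-up column and compares an explicitly ordered encoder with one that uses the default sorted order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_ordinal = np.array([["low"], ["high"], ["medium"], ["low"]])

# Explicit order: low < medium < high
ordered = OrdinalEncoder(categories=[["low", "medium", "high"]])
X_ordered = ordered.fit_transform(X_ordinal)

# Default order is sorted alphabetically: high < low < medium
default = OrdinalEncoder()
X_default = default.fit_transform(X_ordinal)

print(X_ordered.ravel())
print(X_default.ravel())
```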

LabelEncoder

Encodes target labels as integers (for classification tasks).

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# Inverse transform to get original labels
y_original = encoder.inverse_transform(y_encoded)

TargetEncoder

Encodes categories using target statistics (supervised encoding).

from sklearn.preprocessing import TargetEncoder

encoder = TargetEncoder(smooth=0.25)
X_encoded = encoder.fit_transform(X_categorical, y)

Handling Missing Values

SimpleImputer

Replaces missing values using various strategies.

Strategies:

  • mean: Replace with column mean
  • median: Replace with column median
  • most_frequent: Replace with mode
  • constant: Replace with specified constant

from sklearn.impute import SimpleImputer

# For numerical features
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_numerical)

# For categorical features
imputer_cat = SimpleImputer(strategy='most_frequent')
X_cat_imputed = imputer_cat.fit_transform(X_categorical)
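
As a quick illustration of the mean strategy, the sketch below imputes one missing entry in a small made-up array and inspects the per-column fill values the imputer learned.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_numerical = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_numerical)

# The fill value learned for each column is stored in statistics_
print(imputer.statistics_)
print(X_imputed)
```

The missing entry in the second column is replaced by the mean of the observed values 4 and 6.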

KNNImputer

Imputes missing values using k-nearest neighbors.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)

IterativeImputer

Models each feature with missing values as a function of other features.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)

Discretization

KBinsDiscretizer

Bins continuous features into discrete intervals.

from sklearn.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_binned = discretizer.fit_transform(X)

Custom Transformations

FunctionTransformer

Applies custom functions to features.

from sklearn.preprocessing import FunctionTransformer
import numpy as np

# Apply log transformation
log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.fit_transform(X)

Feature Engineering

Feature engineering creates new features or transforms existing ones to improve model performance.

Polynomial Features

Creates polynomial and interaction features.

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Example: [a, b] becomes [a, b, a², ab, b²]
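
The expansion can be verified on a single row. This sketch assumes a = 2 and b = 3 as illustrative inputs.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # a = 2, b = 3

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Column order: a, b, a^2, a*b, b^2
print(X_poly)
```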

Feature Selection

Variance Threshold

Removes features with low variance.

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

SelectKBest

Selects k highest scoring features based on statistical tests.

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get selected feature indices
selected_features = selector.get_support(indices=True)

Recursive Feature Elimination (RFE)

Recursively removes the least important features and refits the model on those that remain.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=10, step=1)
X_selected = selector.fit_transform(X, y)

# Get feature rankings
print(selector.ranking_)

SelectFromModel

Selects features based on importance weights from a model.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

estimator = RandomForestClassifier(n_estimators=100)
selector = SelectFromModel(estimator, threshold='median')
X_selected = selector.fit_transform(X, y)

Dimensionality Reduction

Principal Component Analysis (PCA)

Reduces dimensionality by projecting data onto principal components.

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # Retain 95% variance
X_reduced = pca.fit_transform(X)

# Explained variance
print(pca.explained_variance_ratio_)

Linear Discriminant Analysis (LDA)

Supervised dimensionality reduction maximizing class separability.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# n_components must be at most min(n_classes - 1, n_features), so 2 needs at least 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)

t-SNE

Non-linear dimensionality reduction for visualization.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

Text Feature Extraction

CountVectorizer

Converts text documents to a matrix of token counts.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X_counts = vectorizer.fit_transform(documents)

TfidfVectorizer

Converts text to TF-IDF features.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(documents)

Cyclical Encoding

Encodes periodic features (hour, day, month) preserving cyclical nature.

import numpy as np

def encode_cyclical(data, period):
    """Encode cyclical feature using sine and cosine."""
    sin_feature = np.sin(2 * np.pi * data / period)
    cos_feature = np.cos(2 * np.pi * data / period)
    return sin_feature, cos_feature

# Example: Encode hour of day (24-hour period)
hour_sin, hour_cos = encode_cyclical(df['hour'], 24)
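
To see why the sine and cosine pair preserves the cycle, the following sketch compares distances in the encoded space. It re-declares a NumPy-only variant of the helper so the block runs on its own; the chosen hours are illustrative.

```python
import numpy as np

def encode_cyclical(data, period):
    """Encode a cyclical feature as a (sine, cosine) pair."""
    angle = 2 * np.pi * np.asarray(data) / period
    return np.sin(angle), np.cos(angle)

hours = np.array([0, 12, 23])
hour_sin, hour_cos = encode_cyclical(hours, 24)
points = np.column_stack([hour_sin, hour_cos])

# Euclidean distance between encoded hours
dist_23_to_0 = np.linalg.norm(points[2] - points[0])
dist_12_to_0 = np.linalg.norm(points[1] - points[0])
print(f"23h vs 0h: {dist_23_to_0:.3f}, 12h vs 0h: {dist_12_to_0:.3f}")
```

Hour 23 ends up close to hour 0, whereas with the raw integer values the two would sit at opposite ends of the range.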

Pipelines

Pipelines chain multiple processing steps together, ensuring consistent transformations across train/test data and preventing data leakage.

Creating Pipelines

make_pipeline

A quick way to create a pipeline without naming steps; step names are generated from the lowercased class names.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression()
)

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

Pipeline

Create a pipeline with named steps for better control.

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

pipe.fit(X_train, y_train)

ColumnTransformer

Apply different transformations to different columns.

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Selecting columns by name assumes X is a pandas DataFrame
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['city', 'gender'])
    ],
    remainder='passthrough'  # Keep other columns unchanged
)

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

FeatureUnion

Concatenate results of multiple transformer objects.

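from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Concatenates 10 PCA components with the 15 best-scoring original features
feature_union = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('select', SelectKBest(k=15))
])

pipe = Pipeline([
    ('features', feature_union),
    ('classifier', LogisticRegression())
])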

Pipeline Benefits

  1. Prevents Data Leakage: Transformations fitted only on training data
  2. Cleaner Code: Encapsulates entire workflow
  3. Easy Deployment: Single object for entire pipeline
  4. Cross-Validation Compatible: Works seamlessly with CV
  5. Hyperparameter Tuning: Can optimize all steps together
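
The first and fourth benefits can be sketched together. The example below is a minimal illustration on synthetic data from make_classification (an assumption made for this example, not a dataset from these notes): the whole pipeline is passed to cross_val_score, so the scaler is re-fitted on each fold's training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Each fold clones the pipeline and fits scaler + classifier on that fold's training data
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f}")
```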

Accessing Pipeline Steps

# Access specific step
scaler = pipe.named_steps['scaler']

# Get learned parameters
print(scaler.mean_)

# Access intermediate transformations
X_transformed = pipe[:-1].transform(X_test)

Model Selection and Evaluation

Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42,     # Reproducibility
    stratify=y          # Maintain class distribution
)

Evaluation Metrics

Classification Metrics

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report
)

# Basic metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

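# ROC-AUC from predicted probabilities (binary case: column 1 is the positive class;
# for multiclass, pass the full predict_proba output with multi_class='ovr')
y_proba = classifier.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)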

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Detailed report
print(classification_report(y_test, y_pred))

Metric Selection:

  • Accuracy: Overall correctness (use when classes are balanced)
  • Precision: True positives / Predicted positives (minimize false positives)
  • Recall: True positives / Actual positives (minimize false negatives)
  • F1-Score: Harmonic mean of precision and recall (balanced metric)
  • ROC-AUC: Trade-off between true positive and false positive rates
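
The definitions above can be checked by hand against a confusion matrix. This sketch uses ten made-up binary labels and predictions and recomputes precision, recall, and F1 from the raw counts.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_manual = tp / (tp + fp)   # True positives / predicted positives
recall_manual = tp / (tp + fn)      # True positives / actual positives
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(precision_manual, recall_manual, f1_manual, accuracy_score(y_true, y_hat))
```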

Regression Metrics

import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    r2_score, mean_absolute_percentage_error
)

# Common metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # Version-independent; squared=False is deprecated in recent releases
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)  # Returned as a fraction, not a percentage

Metric Selection:

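  • MSE/RMSE: Penalizes large errors more heavily
  • MAE: Less sensitive to outliers than MSE
  • R²: Proportion of variance explained (at most 1; negative when the model does worse than predicting the mean)
  • MAPE: Relative error as a fraction of the true value (multiply by 100 for a percentage); unstable when targets are near zero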
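
A short worked example, with made-up numbers, shows how these metrics behave for a poor model. In particular, R² drops below zero when predictions are worse than simply predicting the mean of the targets.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_bad = np.array([3.0, 3.0, 3.0])  # Constant prediction, worse than the mean (2.0)

mse = mean_squared_error(y_true, y_bad)   # (4 + 1 + 0) / 3
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_bad)  # (2 + 1 + 0) / 3
r2 = r2_score(y_true, y_bad)              # 1 - SS_res / SS_tot = 1 - 5 / 2

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R2={r2:.2f}")
```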

Handling Imbalanced Datasets

Class Weights

from sklearn.linear_model import LogisticRegression

# Automatically balance class weights
clf = LogisticRegression(class_weight='balanced')

# Manual weights
clf = LogisticRegression(class_weight={0: 1.0, 1: 5.0})

Resampling

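# imblearn is the separate imbalanced-learn package, not part of scikit-learn
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression

# Oversampling with SMOTE (resample the training set only)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Combined pipeline: resampling is applied during fit, not at prediction time
pipe = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression())
])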

Stratified Splitting

from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_idx, test_idx in splitter.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

Cross-Validation

Cross-validation assesses model generalization by training and testing on different data subsets.

K-Fold Cross-Validation

from sklearn.model_selection import cross_val_score, KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

Stratified K-Fold

Maintains class distribution in each fold.

from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(classifier, X, y, cv=skfold)

cross_validate

Returns multiple metrics and timing information.

from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    model, X, y, 
    cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1'],  # Binary targets; use e.g. 'f1_weighted' for multiclass
    return_train_score=True,
    return_estimator=True  # Return fitted models
)

print(cv_results.keys())
# ['fit_time', 'score_time', 'test_accuracy', 'train_accuracy', ...]

Time Series Cross-Validation

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
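
Printing the indices makes the expanding-window behaviour visible. The sketch below assumes 12 time-ordered samples and 3 splits, both chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)

# Every training index precedes every test index, so no future data leaks into training
no_lookahead = all(tr.max() < te.min() for tr, te in splits)
```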

Leave-One-Out Cross-Validation

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

Hyperparameter Tuning

GridSearchCV

Exhaustively searches specified parameter combinations.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)

# Use best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

RandomizedSearchCV

Samples random parameter combinations.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)  # uniform(loc, scale): samples from [0.1, 1.0]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,          # Number of parameter combinations to try
    cv=5,
    scoring='f1_weighted',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)

Pipeline Hyperparameter Tuning

Use double underscore notation to access nested parameters.

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('classifier', RandomForestClassifier())
])

# Keys follow the pattern <step name>__<parameter name>
param_grid = {
    'pca__n_components': [5, 10, 15, 20],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [10, 20, None]
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
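
When it is unclear which nested names are valid, one option is to list them with get_params(). The sketch below rebuilds a pipeline with the same step names so that it runs on its own, then sets two nested parameters directly.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("classifier", RandomForestClassifier()),
])

# Nested parameter names follow the pattern <step name>__<parameter name>
nested_names = [name for name in pipe.get_params() if "__" in name]
print(sorted(nested_names)[:5])

# The same names work with set_params
pipe.set_params(pca__n_components=5, classifier__max_depth=10)
```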

Analyzing Grid Search Results

import pandas as pd

# Convert results to DataFrame
results_df = pd.DataFrame(grid_search.cv_results_)

# Important columns
print(results_df[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']])

# Best results
print(results_df.nlargest(5, 'mean_test_score'))

Best Practices

1. Always Set random_state

Ensures reproducibility of results involving randomness.

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model with randomness
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

2. Use Pipelines to Prevent Data Leakage

Wrong Approach (causes data leakage):

# DON'T DO THIS - fits on entire dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

Correct Approach:

# Split first
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Use pipeline - fits only on training data
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

3. Apply Consistent Transformations

Always use the same transformer fitted on training data for test data.

# Fit on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Transform test data with same parameters
X_test_scaled = scaler.transform(X_test)  # NOT fit_transform!

4. Stratify When Splitting

Maintains class distribution in train/test splits.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,
    stratify=y,  # Maintains class proportions
    random_state=42
)

5. Use Cross-Validation for Model Evaluation

A single train-test split can give a noisy estimate of performance.

# Instead of single split
X_train, X_test, y_train, y_test = train_test_split(X, y)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

# Use cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")

6. Choose Appropriate Metrics

Don’t rely solely on accuracy, especially with imbalanced datasets.

from sklearn.metrics import classification_report, confusion_matrix

# For imbalanced classification
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Use appropriate scoring in cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')

7. Scale Features for Distance-Based Algorithms

Algorithms sensitive to feature scales: SVM, KNN, neural networks, and linear or logistic regression with regularization. These rely on distances, gradients, or penalty terms that depend on feature magnitude.

# Required for these algorithms
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),  # Essential!
    ('classifier', SVC())
])

Not required for tree-based algorithms: Decision Trees, Random Forest, Gradient Boosting.

8. Handle Missing Values Explicitly

# Check for missing values (assumes X is a pandas DataFrame)
print(X.isnull().sum())

# Handle in pipeline
from sklearn.impute import SimpleImputer

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

9. Use n_jobs for Parallelization

Speed up computations on multi-core machines.

# Model training
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)  # Use all cores

# Cross-validation
scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)

# Grid search
grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)

10. Save and Load Models

import joblib

# Save model
joblib.dump(model, 'model.pkl')

# Load model
loaded_model = joblib.load('model.pkl')
predictions = loaded_model.predict(X_test)
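
A self-contained round trip can confirm that a restored model behaves like the original. The sketch below is an illustration on synthetic data and writes to a temporary directory (both are assumptions made for this example).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

# The restored model reproduces the original predictions
same_predictions = np.array_equal(model.predict(X), loaded_model.predict(X))
print("Predictions identical:", same_predictions)
```

Because joblib files are pickle-based, load them only from trusted sources, and preferably with the same scikit-learn version that produced them.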

11. Feature Importance Analysis

from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_
feature_names = X.columns  # Assuming pandas DataFrame

# Sort and visualize
indices = importances.argsort()[::-1]
plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [feature_names[i] for i in indices], rotation=90)
plt.title('Feature Importances')
plt.tight_layout()
plt.show()

12. Monitor Training and Validation Performance

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, 
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    n_jobs=-1
)

# Plot learning curve
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training Size')
plt.ylabel('Score')
plt.legend()
plt.show()

13. Use Appropriate Validation Strategy

# Time series data - use TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

# Grouped data - use GroupKFold
from sklearn.model_selection import GroupKFold
gkfold = GroupKFold(n_splits=5)

# Stratified for classification
from sklearn.model_selection import StratifiedKFold
skfold = StratifiedKFold(n_splits=5)

14. Encode Target Variable for Classification

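from sklearn.preprocessing import LabelEncoder

# If target is categorical strings, fit the encoder on the training labels
# (scikit-learn classifiers also accept string labels directly)
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)

# Train model on the encoded training labels, which match X_train in length
model.fit(X_train, y_train_encoded)

# Decode predictions
y_pred = model.predict(X_test)
y_pred_original = le.inverse_transform(y_pred)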

15. Optimize Memory Usage

# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)

# Specify dtype for memory efficiency
import numpy as np
X = X.astype(np.float32)  # Instead of float64

# Use sparse output in encoders (the default for OneHotEncoder)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=True)

Common Pitfalls

1. Data Leakage

Problem: Information from test set influences training.

Examples:

  • Fitting scaler on entire dataset before split
  • Using target variable for feature engineering
  • Not handling time series data properly

Solution: Always split data first, fit only on training data.

# Wrong
scaler.fit(X)  # Uses information from test set
X_train, X_test = train_test_split(X)

# Correct
X_train, X_test = train_test_split(X)
scaler.fit(X_train)  # Only learns from training data

2. Not Using Pipelines

Problem: Inconsistent transformations between train and test data.

Solution: Always use pipelines for multi-step workflows.

# Use this pattern
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])
pipe.fit(X_train, y_train)

3. Overfitting

Symptoms:

  • High training accuracy, low test accuracy
  • Large gap between training and validation scores

Solutions:

# Regularization
model = LogisticRegression(C=0.1)  # Stronger regularization

# More training data or data augmentation

# Simpler model
model = RandomForestClassifier(max_depth=5)  # Limit complexity

# Cross-validation to detect
scores = cross_val_score(model, X, y, cv=5)

4. Ignoring Class Imbalance

Problem: Model biased toward majority class.

Solutions:

# Class weights
model = LogisticRegression(class_weight='balanced')

# Resampling
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Appropriate metrics
from sklearn.metrics import f1_score, roc_auc_score

5. Not Scaling Features

Problem: Algorithms like SVM, KNN perform poorly with unscaled features.

Solution: Always scale for distance-based algorithms.

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

6. Using predict() Instead of predict_proba()

Problem: predict() returns only the final class label and discards the probabilities needed for threshold tuning.

Solution: Use probabilities when available.

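import numpy as np
from sklearn.metrics import precision_recall_curve

# Get probabilities for the positive class
y_proba = classifier.predict_proba(X_test)[:, 1]

# Precision and recall at every candidate threshold
# (ideally computed on a validation set rather than the final test set)
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

# Choose the threshold that maximizes F1; the last precision/recall pair has no threshold
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
optimal_threshold = thresholds[np.argmax(f1)]
y_pred_adjusted = (y_proba >= optimal_threshold).astype(int)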

7. Incorrect Cross-Validation for Time Series

Problem: Using standard K-Fold for temporal data causes data leakage.

Solution: Use TimeSeriesSplit.

# Wrong for time series
kfold = KFold(n_splits=5)

# Correct for time series
tscv = TimeSeriesSplit(n_splits=5)
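
A quick way to check that TimeSeriesSplit respects temporal order is to print the indices of each fold. The sketch below uses 12 made-up observations; in every fold the training indices come strictly before the test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_time = np.arange(12).reshape(-1, 1)  # 12 observations in chronological order
tscv = TimeSeriesSplit(n_splits=3)

splits = list(tscv.split(X_time))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Every training index precedes every test index in the same fold
no_future_leak = all(train_idx.max() < test_idx.min() for train_idx, test_idx in splits)
```

The training window grows with each fold, so later folds train on more history.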

8. Not Handling Categorical Variables

Problem: Passing categorical strings directly to algorithms.

Solution: Encode categorical variables properly.

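from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# numerical_features and categorical_features are lists of column names
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])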

9. Forgetting to Set random_state

Problem: Results not reproducible.

Solution: Always set random_state.

# Everywhere randomness is involved
train_test_split(X, y, random_state=42)
RandomForestClassifier(random_state=42)
KFold(n_splits=5, shuffle=True, random_state=42)

10. Using Default Hyperparameters

Problem: Default parameters may not be optimal for your data.

Solution: Tune hyperparameters systematically.

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

Jargon Tables

Table 1: Machine Learning Lifecycle Terminology

| Scikit-learn Term | Alternative Terms | Definition | Context |
| fit | train, learn, build | Learn parameters from training data | Model training phase |
| transform | preprocess, convert, map | Apply learned transformation to data | Data preprocessing |
| predict | infer, forecast, estimate | Generate predictions on new data | Model inference |
| score | evaluate, assess, measure | Calculate performance metric | Model evaluation |
| fit_transform | learn and apply | Fit transformer and transform data in one step | Efficient preprocessing |
| fit_predict | train and infer | Fit model and predict on same data | Clustering/unsupervised |
| estimator | model, learner, algorithm | Object that learns from data | Core ML concept |
| pipeline | workflow, chain, sequence | Series of data transformations and model | End-to-end process |
| cross-validation | k-fold validation, CV | Assess model on multiple train-test splits | Model validation |
| hyperparameter | tuning parameter, meta-parameter | Parameter set before training | Model configuration |
| feature | attribute, variable, predictor | Input variable for model | Data representation |
| target | label, response, outcome, dependent variable | Variable to predict | Supervised learning |
| sample | instance, observation, example, data point | Single row of data | Dataset element |
| training set | train data, training data | Data used to fit model | Model training |
| test set | holdout set (not the same as a validation set, which is used for tuning) | Data used for the final evaluation of the model | Model evaluation |
| overfitting | memorization, high variance | Model too complex for data | Model diagnostic |
| underfitting | oversimplification, high bias | Model too simple for data | Model diagnostic |
| regularization | penalization, shrinkage | Technique to reduce overfitting | Model constraint |
| stratification | proportional sampling | Maintain class distribution in splits | Sampling technique |

Table 2: Hierarchical Differentiation of Estimator Types

| Level | Category | Subcategory | Examples | Primary Methods |
| 1 | Estimator | Base class for all | All sklearn objects | fit() |
| 2 | Predictor | Supervised learners | All classifiers/regressors | fit(), predict(), score() |
|  |  | Classifier | LogisticRegression, SVC | predict_proba(), decision_function() |
|  |  | Regressor | LinearRegression, SVR | (basic predict only) |
| 2 | Transformer | Data transformers | Scalers, encoders, PCA | fit(), transform(), fit_transform() |
|  |  | Feature Extractor | CountVectorizer, TfidfVectorizer | Text → numeric features |
|  |  | Feature Selector | SelectKBest, RFE | Reduce feature dimensions |
|  |  | Preprocessor | StandardScaler, OneHotEncoder | Data normalization/encoding |
|  |  | Dimensionality Reducer | PCA, LDA, t-SNE (TSNE offers fit_transform() only, no transform()) | Reduce feature space |
| 2 | Clusterer | Unsupervised learners | KMeans, DBSCAN | fit(), fit_predict() |
| 2 | Meta-Estimator | Wraps other estimators | Pipeline, GridSearchCV | Composite functionality |
|  |  | Ensemble | RandomForest, GradientBoosting | Combines multiple models |
|  |  | Multioutput | MultiOutputClassifier | Handles multiple targets |
|  |  | Calibration | CalibratedClassifierCV | Probability calibration |

Table 3: Data Splitting Terminology

| Term | Description | Use Case | Scikit-learn Function or Class |
| Train-Test Split | Single split into two sets | Quick model evaluation | train_test_split() |
| K-Fold CV | k equal-sized folds | General cross-validation | KFold |
| Stratified K-Fold | K-fold maintaining class distribution | Classification with imbalanced classes | StratifiedKFold |
| Time Series Split | Sequential splits for temporal data | Time series forecasting | TimeSeriesSplit |
| Leave-One-Out | n folds (one sample per fold) | Small datasets | LeaveOneOut |
| Leave-P-Out | All combinations leaving p samples out | Very small datasets | LeavePOut |
| Group K-Fold | K-fold respecting group boundaries | Grouped data (patients, sessions) | GroupKFold |
| Shuffle Split | Random permutation split | Flexible train/test sizes | ShuffleSplit |
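
As an illustration of "respecting group boundaries", the sketch below runs GroupKFold on a toy dataset with four hypothetical patient ids (the data and ids are made up). No group ever appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X_g = np.arange(12).reshape(-1, 1)
y_g = np.array([0, 1] * 6)
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)  # Hypothetical patient ids, 3 rows each

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in gkf.split(X_g, y_g, groups=groups):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    overlaps.append(train_groups & test_groups)  # Groups present on both sides
    print(f"test groups={sorted(test_groups)} train groups={sorted(train_groups)}")

no_group_leak = all(len(shared) == 0 for shared in overlaps)
```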

Table 4: Preprocessing Method Categories

| Operation Type | Technique | Purpose | Scikit-learn Class |
| Scaling | Standardization | Mean=0, Std=1 | StandardScaler |
|  | Min-Max Scaling | Range [0,1] or custom | MinMaxScaler |
|  | Robust Scaling | Use median and IQR | RobustScaler |
|  | Max-Abs Scaling | Divide by max absolute value | MaxAbsScaler |
| Normalization | L1/L2 Norm | Scale samples to unit norm | Normalizer |
| Encoding | One-Hot Encoding | Binary columns per category | OneHotEncoder |
|  | Ordinal Encoding | Integer encoding | OrdinalEncoder |
|  | Label Encoding | Encode target labels | LabelEncoder |
|  | Target Encoding | Use target statistics | TargetEncoder |
| Imputation | Mean/Median | Fill with central tendency | SimpleImputer |
|  | KNN Imputation | Use k-nearest neighbors | KNNImputer |
|  | Iterative | Model-based imputation | IterativeImputer |
| Discretization | Binning | Convert continuous to discrete | KBinsDiscretizer |
| Feature Creation | Polynomial | Generate polynomial features | PolynomialFeatures |

Table 5: Model Evaluation Terminology

| Metric Type | Metric Name | Formula/Description | Best Value | Use Case |
| Classification | Accuracy | (TP + TN) / Total | 1.0 | Balanced classes |
|  | Precision | TP / (TP + FP) | 1.0 | Minimize false positives |
|  | Recall/Sensitivity | TP / (TP + FN) | 1.0 | Minimize false negatives |
|  | F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | 1.0 | Balance precision/recall |
|  | ROC-AUC | Area under ROC curve | 1.0 | Binary classification |
|  | Log Loss | -(1/N) Σ(y log(p) + (1-y) log(1-p)) | 0.0 | Probability accuracy |
| Regression | MSE | Mean((y - ŷ)²) | 0.0 | Penalize large errors |
|  | RMSE | √MSE | 0.0 | Same units as target |
|  | MAE | Mean(abs(y - ŷ)) | 0.0 | Robust to outliers |
|  | R² | 1 - (SS_res / SS_tot) | 1.0 | Variance explained |
|  | MAPE | Mean(abs(y - ŷ) / abs(y)) × 100 | 0.0 | Percentage error |
| Clustering | Silhouette Score | (b - a) / max(a, b) | 1.0 | Cluster separation |
|  | Davies-Bouldin | Avg similarity of clusters | 0.0 | Cluster compactness |
|  | Calinski-Harabasz | Between/within variance ratio | Higher better | Cluster density |
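
The classification formulas in Table 5 can be checked by hand against scikit-learn. The sketch below uses ten made-up labels, derives TP, TN, FP and FN from the confusion matrix, and compares the hand-computed values with the library functions.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Made-up binary labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"by hand: acc={accuracy:.2f} prec={precision:.2f} rec={recall:.2f} f1={f1:.2f}")
print(f"sklearn: acc={accuracy_score(y_true, y_pred):.2f} "
      f"prec={precision_score(y_true, y_pred):.2f} "
      f"rec={recall_score(y_true, y_pred):.2f} f1={f1_score(y_true, y_pred):.2f}")
```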

Table 6: Hyperparameter Tuning Terminology

| Term | Description | Strategy | Scikit-learn Class |
| Grid Search | Exhaustive search over parameter grid | Try all combinations | GridSearchCV |
| Random Search | Random sampling from parameter distributions | Sample n_iter combinations | RandomizedSearchCV |
| Halving Grid Search | Successive halving with grid | Early stopping low performers | HalvingGridSearchCV |
| Halving Random Search | Successive halving with random | Progressive refinement | HalvingRandomSearchCV |
| Manual Search | User-defined parameter testing | Custom iteration | Loop with cross_val_score |
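
The halving searches are still experimental in scikit-learn and need an explicit opt-in import before they can be used. A minimal sketch, assuming a synthetic dataset and a small illustrative grid:

```python
# The experimental opt-in import must run before importing HalvingGridSearchCV
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingGridSearchCV

X_h, y_h = make_classification(n_samples=400, n_features=10, random_state=0)

param_grid = {"max_depth": [2, 4, None], "min_samples_split": [2, 10]}  # 6 candidates
search = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    factor=2,        # Keep roughly the best half of the candidates each round
    cv=3,
    random_state=0,
)
search.fit(X_h, y_h)

print("Candidates per round:", search.n_candidates_)
print("Samples per round:", search.n_resources_)
print("Best parameters:", search.best_params_)
```

Each round evaluates fewer candidates on more samples, which is where the savings over a plain grid search come from.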

Advanced Topics

Custom Estimators

Create custom estimators by inheriting from base classes.

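import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, factor=1.0):
        # Store constructor arguments unchanged so get_params() and clone() work
        self.factor = factor

    def fit(self, X, y=None):
        # Learn parameters from training data (fitted attributes end with an underscore)
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        self.std_[self.std_ == 0] = 1.0  # Avoid division by zero for constant columns
        return self

    def transform(self, X):
        # Apply transformation
        return (np.asarray(X, dtype=float) - self.mean_) / (self.std_ * self.factor)

    def inverse_transform(self, X):
        # Reverse transformation
        return (np.asarray(X, dtype=float) * self.std_ * self.factor) + self.mean_

# Use in pipeline
pipe = Pipeline([
    ('custom_scaler', CustomScaler(factor=2.0)),
    ('classifier', LogisticRegression())
])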

Ensemble Methods

Combine multiple models for better performance.

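from sklearn.ensemble import (
    VotingClassifier, StackingClassifier,
    BaggingClassifier, AdaBoostClassifier, RandomForestClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC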

# Voting Classifier - average predictions
voting = VotingClassifier([
    ('lr', LogisticRegression()),
    ('rf', RandomForestClassifier()),
    ('svc', SVC(probability=True))
], voting='soft')  # 'soft' uses probabilities

# Stacking Classifier - meta-model on predictions
stacking = StackingClassifier([
    ('rf', RandomForestClassifier()),
    ('svc', SVC())
], final_estimator=LogisticRegression())

# Bagging - bootstrap aggregation
bagging = BaggingClassifier(
    LogisticRegression(),
    n_estimators=10,
    max_samples=0.8,
    random_state=42
)

# Boosting - sequential learning
boosting = AdaBoostClassifier(
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)

Calibration

Improve probability predictions.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

# Calibrate classifier probabilities
calibrated = CalibratedClassifierCV(
    SVC(),  # By default SVC exposes only decision_function(); calibration maps it to probabilities
    method='sigmoid',  # or 'isotonic'
    cv=5
)

calibrated.fit(X_train, y_train)
y_proba = calibrated.predict_proba(X_test)
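
A runnable version of the same idea, sketched on synthetic data (the dataset and variable names are assumptions for illustration). The calibrated wrapper returns one probability column per class, and each row sums to 1.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_c, y_c = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, stratify=y_c, random_state=0)

# Sigmoid (Platt) calibration fitted with 3-fold cross-validation
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=3)
calibrated.fit(X_tr, y_tr)

proba = calibrated.predict_proba(X_te)  # Shape: (n_samples, n_classes)
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
print(proba[:3].round(3), rows_sum_to_one)
```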

Multi-Output Models

Handle multiple target variables.

from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor

# Multi-output classification
multi_clf = MultiOutputClassifier(RandomForestClassifier())
multi_clf.fit(X_train, y_train)  # y_train has multiple columns

# Multi-output regression
multi_reg = MultiOutputRegressor(LinearRegression())
multi_reg.fit(X_train, y_train)
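
For a self-contained illustration, the sketch below builds a synthetic target with three label columns (an assumption for demonstration). MultiOutputClassifier fits one copy of the base estimator per column.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Y_m has shape (200, 3): three binary targets per sample
X_m, Y_m = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

multi_clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=20, random_state=0))
multi_clf.fit(X_m, Y_m)

pred = multi_clf.predict(X_m[:5])  # One predicted column per target
print(pred.shape, len(multi_clf.estimators_))
```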

Handling Mixed Data Types

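import numpy as np
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Automatic selection by dtype (requires a pandas DataFrame as input)
preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object))
)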

# Use in pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

Partial Fitting for Large Datasets

Train on data that doesn’t fit in memory.

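import numpy as np
from sklearn.linear_model import SGDClassifier

# Models supporting partial_fit
model = SGDClassifier()

# Declare all class labels up front, because a single batch may not contain every class
all_classes = np.array([0, 1])  # Example labels; replace with your own

# Train in batches
for X_batch, y_batch in data_batches:
    model.partial_fit(X_batch, y_batch, classes=all_classes)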
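
A runnable sketch of the batch loop, with an in-memory synthetic dataset standing in for data streamed from disk. The iter_batches helper is hypothetical and only simulates a batch source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X_all, y_all = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_clusters_per_class=1, random_state=0
)
all_classes = np.unique(y_all)  # Known up front in this toy setting

def iter_batches(X, y, batch_size):
    """Hypothetical batch source; in practice this would read chunks from disk."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

model = SGDClassifier(random_state=0)
n_batches = 0
for X_batch, y_batch in iter_batches(X_all[:800], y_all[:800], batch_size=100):
    # classes is required on the first call and may be repeated afterwards
    model.partial_fit(X_batch, y_batch, classes=all_classes)
    n_batches += 1

holdout_accuracy = model.score(X_all[800:], y_all[800:])
print(f"{n_batches} batches, holdout accuracy={holdout_accuracy:.3f}")
```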

Feature Hashing

Efficient encoding for high-dimensional categorical features.

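from sklearn.feature_extraction import FeatureHasher

# With input_type='string', each sample is an iterable of strings, e.g. [['red', 'small'], ['blue']]
hasher = FeatureHasher(n_features=10, input_type='string')  # Fewer features means more hash collisions
X_hashed = hasher.transform(raw_data)  # Stateless: no fit needed; returns a sparse matrix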
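
A small runnable sketch with made-up categorical tokens, showing that the output width is fixed by n_features regardless of how many distinct categories appear:

```python
from sklearn.feature_extraction import FeatureHasher

# Made-up samples: each one is a list of "name=value" tokens
raw_data = [
    ["city=paris", "device=mobile"],
    ["city=tokyo", "device=desktop", "browser=firefox"],
    ["city=paris"],
]

hasher = FeatureHasher(n_features=16, input_type="string")
X_hashed = hasher.transform(raw_data)  # SciPy sparse matrix, no vocabulary stored

print(X_hashed.shape)
print(X_hashed.toarray())
```

Hashed values are signed, so entries can be +1 or -1, and different tokens may collide in the same column.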

Performance Optimization Tips

1. Use Appropriate Data Structures

# For sparse data, use sparse matrices
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)

# For dense data with consistent dtype
X = np.array(X, dtype=np.float32)  # Less memory than float64

2. Parallelize Operations

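# Use all CPU cores
# Tip: nesting n_jobs=-1 in both the model and the search can oversubscribe cores,
# so it is usually enough to parallelize one level
model = RandomForestClassifier(n_jobs=-1)
scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
grid_search = GridSearchCV(model, params, n_jobs=-1)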

3. Reduce Data Size

# Feature selection before training (fit on training data only to avoid leakage)
from sklearn.feature_selection import SelectKBest

selector = SelectKBest(k=20)
X_train_reduced = selector.fit_transform(X_train, y_train)
X_test_reduced = selector.transform(X_test)

# Sample data for prototyping
X_sample, _, y_sample, _ = train_test_split(X, y, train_size=0.1, random_state=42)

4. Use Warm Start for Iterative Models

# Continue training from previous state
model = GradientBoostingClassifier(warm_start=True, n_estimators=100)
model.fit(X_train, y_train)

# Add more trees
model.n_estimators = 200
model.fit(X_train, y_train)  # Continues from 100 trees
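
To confirm that the second fit() extends the ensemble, the sketch below counts the fitted boosting stages after each call. The synthetic data and the small tree counts are assumptions to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_w, y_w = make_classification(n_samples=300, n_features=10, random_state=0)

model = GradientBoostingClassifier(n_estimators=20, warm_start=True, random_state=0)
model.fit(X_w, y_w)
stages_after_first_fit = model.estimators_.shape[0]

# Raise the target and fit again: only the additional stages are trained
model.set_params(n_estimators=30)
model.fit(X_w, y_w)
stages_after_second_fit = model.estimators_.shape[0]

print(stages_after_first_fit, stages_after_second_fit)
```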

5. Profile Your Code

import time
from sklearn.model_selection import cross_validate

start = time.perf_counter()  # High-resolution timer intended for measuring intervals
model.fit(X_train, y_train)
end = time.perf_counter()
print(f"Training time: {end - start:.2f} seconds")

# Use cross_validate for detailed timing
cv_results = cross_validate(model, X, y, cv=5, return_train_score=True)
print(f"Avg fit time: {cv_results['fit_time'].mean():.3f}s")

Complete Example Workflow

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# 1. Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# 2. Identify feature types
numerical_features = X.select_dtypes(include=['number']).columns  # Covers all integer and float widths
categorical_features = X.select_dtypes(include=['object']).columns

# 3. Create preprocessing pipeline
numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# 4. Create full pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 5. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 6. Define hyperparameter grid
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [10, 20, None],
    'classifier__min_samples_split': [2, 5, 10]
}

# 7. Perform grid search with cross-validation
grid_search = GridSearchCV(
    pipe,
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

# 8. Get best model
best_model = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)

# 9. Evaluate on test set
y_pred = best_model.predict(X_test)
print("\nTest Set Performance:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# 10. Cross-validation on full dataset (sanity check only; it reuses rows seen during tuning, so it can be slightly optimistic)
cv_scores = cross_val_score(best_model, X, y, cv=5, scoring='f1_weighted')
print(f"\nCross-validation F1: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# 11. Save model
joblib.dump(best_model, 'best_model.pkl')

# 12. Load and use model later (only load files from sources you trust)
loaded_model = joblib.load('best_model.pkl')
new_predictions = loaded_model.predict(X_new)  # X_new: new rows with the same columns as X

This post is licensed under CC BY 4.0 by the author.