from sklearn.model_selection import train_test_split ★ | The single most-imported function in sklearn.
import numpy as np, pandas as pd | sklearn expects numeric arrays or DataFrames.
pip install scikit-learn | Install the package (imported as sklearn).
import sklearn; sklearn.__version__ | Check the installed version.
random_state=42 ★ | Set it anywhere randomness appears, for reproducibility.
from sklearn.datasets import load_iris ★ | Small built-in toy datasets for practice.
fetch_california_housing() | Larger real-world datasets, downloaded on first use.
make_classification(n_samples=200, n_features=4) ★ | Synthetic classification data for testing.
make_regression(...), make_blobs(...) | Synthetic regression / clustering data.
train_test_split(X, y, test_size=0.2, random_state=42) ★ | Split into training and test sets.
train_test_split(..., stratify=y) ★ | Preserve class proportions in the split.
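The split entries above can be sketched as follows. This is a minimal illustration on the built-in iris data; the variable names are illustrative choices, not something the sheet prescribes.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris: 150 samples, 4 features, 3 classes with 50 samples each.
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions identical in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(Counter(y_test.tolist()))  # per-class counts in the test set
```

Because the classes are equal in size here, the stratified test set holds the same number of samples from each class.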
model = Estimator(**params) ★ | Every algorithm is a class configured by hyperparameters.
model.fit(X_train, y_train) ★ | Learn from training data; the universal verb.
model.predict(X_test) ★ | Generate predictions on new data.
model.predict_proba(X_test) | Class probabilities (classifiers only).
model.score(X_test, y_test) ★ | Quick built-in metric (accuracy for classifiers, R² for regressors).
transformer.fit_transform(X) | Fit + transform in one call (preprocessors).
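As a rough sketch of the estimator API listed above, the following fits one classifier and calls each method in turn. LogisticRegression and max_iter=1000 are arbitrary choices for illustration; any classifier follows the same pattern.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)  # configure via hyperparameters
model.fit(X_train, y_train)                # learn from training data

y_pred = model.predict(X_test)             # hard class labels
proba = model.predict_proba(X_test)        # one probability column per class
acc = model.score(X_test, y_test)          # accuracy for classifiers

print(y_pred[:5], proba[:2].round(3), acc)
```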
StandardScaler() ★ | Zero mean, unit variance; the default choice.
MinMaxScaler() | Scale into a fixed range, [0, 1] by default.
RobustScaler() ★ | Uses median/IQR; resistant to outliers.
Normalizer() | Scale each row to unit norm (not each column).
scaler.fit(X_train) → .transform(X_test) [fit once] | Fit only on train; never refit on test.
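The "fit once" rule above can be illustrated with a tiny hand-checkable example. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses those training statistics; the test data is never fitted.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```

The test value is scaled with the training mean (2.0), so it lands well outside the range of the scaled training values, which is the expected behavior.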
OneHotEncoder(handle_unknown='ignore') ★ | One binary column per category.
OrdinalEncoder() | Integer codes; only for genuinely ordered categories.
LabelEncoder() | Encodes the target y only, not feature columns.
pd.get_dummies() (pandas alternative) | Quick one-hot outside a Pipeline.
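To show what handle_unknown='ignore' does in practice, here is a small sketch with made-up color categories, one of which appears only at transform time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["blue"]])
new = np.array([["green"], ["purple"]])  # "purple" was never seen during fit

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# The default output is a SciPy sparse matrix; densify it for printing.
encoded = enc.transform(new).toarray()

print(enc.categories_[0])  # learned categories, sorted
print(encoded)             # the unseen category becomes an all-zero row
```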
SimpleImputer(strategy='mean') ★ | Fill with mean / median / most_frequent / constant.
KNNImputer(n_neighbors=5) | Fill using similar rows' values.
IterativeImputer() | Models each feature from the others (experimental; import enable_iterative_imputer from sklearn.experimental first).
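A minimal sketch of mean imputation, using a made-up 3×2 array so the filled values can be checked by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")
# Column means are computed from the observed values only:
# column 0 -> (1 + 3) / 2 = 2, column 1 -> (2 + 4) / 2 = 3.
imputed = imputer.fit_transform(X)

print(imputed)
```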
PolynomialFeatures(degree=2) | Add interaction & polynomial terms.
SelectKBest(f_classif, k=10) ★ | Keep the k highest-scoring features.
RFE(estimator, n_features_to_select=5) | Recursively drop the weakest feature.
VarianceThreshold() | Drop constant or near-constant, uninformative features.
SelectFromModel(estimator) ★ | Keep features a model rates important (fits the estimator unless prefit=True).
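Univariate selection from the list above might look like this. The synthetic dataset and k=3 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 10 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=42
)

selector = SelectKBest(f_classif, k=3)
X_sel = selector.fit_transform(X, y)

# Column indices of the features that were kept.
kept = selector.get_support(indices=True)
print(X_sel.shape, kept)
```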
Pipeline([('scaler', StandardScaler()), ('clf', model)]) ★ | Chain preprocessing + model as one estimator.
make_pipeline(StandardScaler(), model) | Same thing, with auto-named steps.
ColumnTransformer([('num', StandardScaler(), num_cols), ...]) ★ | Different preprocessing per column group.
pipe.named_steps['clf'] | Reach into a specific pipeline step.
param_grid = {'clf__C': [...]} ★ | Tune inside a pipeline with step__param.
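The pipeline entries above can be combined roughly as follows. The DataFrame, its column names, and the labels are hypothetical toy data invented for this sketch.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63],
    "income": [28000, 52000, 91000, 60000, 39000, 75000],
    "city": ["A", "B", "A", "C", "B", "C"],
})
y = [0, 0, 1, 1, 0, 1]

num_cols = ["age", "income"]
cat_cols = ["city"]

# Scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
pipe.fit(df, y)

preds = pipe.predict(df)
clf = pipe.named_steps["clf"]  # reach into the fitted classifier
print(preds, clf.coef_.shape)
```

The classifier sees five inputs: two scaled numeric columns plus three one-hot columns for the city values.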
LogisticRegression() ★ | Fast, interpretable linear baseline.
KNeighborsClassifier() | Vote among the k nearest training points.
SVC(kernel='rbf') | Max-margin classifier; rbf for non-linear boundaries.
DecisionTreeClassifier() | Interpretable, but prone to overfitting on its own.
RandomForestClassifier() ★ | Bagged trees; strong, low-tuning default.
GaussianNB() | Fast probabilistic baseline for small, continuous-feature data (text counts usually call for MultinomialNB).
LinearRegression() ★ | Ordinary least squares baseline.
Ridge(), Lasso(), ElasticNet() ★ | L2 / L1 / both; regularized linear models.
SVR(kernel='rbf') | Support vector regression.
RandomForestRegressor() ★ | Bagged trees for continuous targets.
GradientBoostingRegressor() | Sequential boosted trees; strong but slower.
KMeans(n_clusters=3) ★ | Fast, needs k chosen in advance.
DBSCAN(eps=0.5, min_samples=5) ★ | Density-based; finds arbitrary shapes + noise.
AgglomerativeClustering() | Hierarchical, bottom-up merging.
MeanShift() | Finds cluster count automatically, slower.
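A short sketch contrasting the first two clusterers above on synthetic blobs. The data parameters and n_init=10 are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# KMeans needs the number of clusters up front.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# DBSCAN infers clusters from density; points labeled -1 are treated as noise.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
n_noise = int(np.sum(db.labels_ == -1))

print(km.cluster_centers_)
print(sorted(set(db.labels_.tolist())), n_noise)
```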
PCA(n_components=2) ★ | Linear projection onto directions of max variance.
TruncatedSVD() | PCA variant that works on sparse data.
TSNE(n_components=2) ★ | Nonlinear; for 2D/3D visualization only.
LinearDiscriminantAnalysis() | Supervised projection that separates classes.
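As an illustration of the PCA entry above, this sketch projects the iris features to two dimensions. Scaling first is a common choice here, not something the sheet requires.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is variance-based, so features are put on a common scale first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Share of total variance captured by each retained component.
ratio = pca.explained_variance_ratio_
print(X_2d.shape, ratio)
```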
VotingClassifier(estimators=[...]) ★ | Combine predictions by (weighted) vote.
StackingClassifier(estimators=[...], final_estimator=...) | A meta-model learns from the base models.
BaggingClassifier(estimator, n_estimators=10) | Many models on bootstrap samples.
AdaBoostClassifier() | Sequentially reweights hard examples.
cross_val_score(model, X, y, cv=5) ★ | k-fold scores (accuracy or R² by default) in one call.
KFold(n_splits=5), StratifiedKFold(...) ★ | Plain / class-stratified folds.
cross_validate(model, X, y, scoring=[...]) | Multiple metrics + fit times, one call.
learning_curve(model, X, y) | Score vs. training-set size; diagnoses under/overfitting.
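A minimal cross-validation sketch follows. It wraps the scaler and model in one pipeline so that preprocessing is refit inside every fold; the specific estimator is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler lives inside the pipeline, so each fold fits it on its own training part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv)  # one accuracy value per fold

print(scores, scores.mean())
```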
GridSearchCV(estimator, param_grid, cv=5) ★ | Exhaustive search over a parameter grid.
RandomizedSearchCV(estimator, param_distributions, n_iter=20) ★ | Sample the search space; cheaper at scale.
grid.best_params_, .best_score_, .best_estimator_ ★ | Winning config, its score, and the refit model.
HalvingGridSearchCV(...) | Successive halving; faster for large grids (experimental; import enable_halving_search_cv from sklearn.experimental first).
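Grid search over a pipeline might be sketched like this. The grid values are illustrative assumptions, and the step names ("scaler", "clf") determine the step__param prefixes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC(kernel="rbf"))])

# Keys follow the step__param convention: "clf__C" sets C on the "clf" step.
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.1]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)          # mean cross-validated score of the best config
best_model = grid.best_estimator_  # pipeline refit on all of X, y
```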
accuracy_score(y_test, y_pred) ★ | Fraction correct; misleading if imbalanced.
precision_score / recall_score / f1_score ★ | Per-class quality on imbalanced data.
confusion_matrix(y_test, y_pred) ★ | True vs. predicted class counts.
classification_report(y_test, y_pred) ★ | Precision/recall/F1 for every class at once.
roc_auc_score(y_test, y_scores) | Ranking quality across all thresholds.
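The classification metrics above can be checked by hand on a tiny made-up example with six labels.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_test = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

acc = accuracy_score(y_test, y_pred)    # 4 of 6 correct
prec = precision_score(y_test, y_pred)  # 1 true positive out of 2 predicted positives
rec = recall_score(y_test, y_pred)      # 1 true positive out of 2 actual positives
f1 = f1_score(y_test, y_pred)           # harmonic mean of precision and recall

print(cm)
print(acc, prec, rec, f1)
```

Accuracy looks reasonable here while precision and recall on the minority class are only 0.5, which is the imbalance effect the sheet warns about.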
mean_squared_error(y_test, y_pred) ★ | Penalizes large errors heavily.
root_mean_squared_error(y_test, y_pred) | Same units as the target (sklearn 1.4+).
mean_absolute_error(y_test, y_pred) ★ | Average absolute error; more robust to outliers than MSE.
r2_score(y_test, y_pred) ★ | Variance explained; 1.0 is a perfect fit.
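A hand-checkable sketch of the regression metrics, using made-up values. RMSE is computed with np.sqrt here so the example also runs on versions older than 1.4, where root_mean_squared_error is unavailable.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_test = [3, 5, 7, 9]
y_pred = [2, 5, 8, 11]  # errors: -1, 0, +1, +2

mse = mean_squared_error(y_test, y_pred)   # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)                        # back in the target's units
mae = mean_absolute_error(y_test, y_pred)  # (1 + 0 + 1 + 2) / 4
r2 = r2_score(y_test, y_pred)              # 1 - SS_res / SS_tot = 1 - 6 / 20

print(mse, rmse, mae, r2)
```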
silhouette_score(X, labels) ★ | Cohesion vs. separation; no ground truth needed.
adjusted_rand_score(y_true, labels) | Agreement with known labels, chance-corrected.
calinski_harabasz_score(X, labels) | Ratio of between/within-cluster dispersion.
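One common use of the silhouette score is comparing candidate cluster counts. The sketch below assumes KMeans on synthetic blobs and a candidate range of 2 to 5, all chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Higher is better; values range from -1 to 1.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```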
import joblib ★ | Preferred over pickle for NumPy-heavy sklearn models.
joblib.dump(model, 'model.pkl') ★ | Save a fitted model/pipeline to disk.
joblib.load('model.pkl') ★ | Reload it later, ready to .predict().
persist the whole Pipeline, not just the model | Keeps preprocessing bundled with it.
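A minimal round-trip sketch for the persistence entries above. It writes to a temporary directory rather than the working directory; the pipeline contents are an arbitrary example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Persist the whole pipeline so the scaler travels with the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions.
same = bool((restored.predict(X) == pipe.predict(X)).all())
print(same)
```

joblib files are pickle-based, so only load files from sources you trust, and reload with the same scikit-learn version where possible.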
fit scaler/encoder on train, transform both [leakage] | Fitting on all data leaks test information.
always set random_state [safe] | Makes splits, shuffling, and models reproducible.
class_weight='balanced' | Reweight classes automatically for imbalance.
n_jobs=-1 | Use all CPU cores where supported (forests, CV, search).
preprocess outside cross-validation [leakage] | Avoid this: put preprocessing inside the Pipeline you cross-validate.
model.coef_, model.intercept_ ★ | Learned weights (linear models).
model.feature_importances_ ★ | Split-based importance (trees / forests).
model.classes_ | The label order used internally.
model.cluster_centers_ | Centroid coordinates (KMeans).
model.n_features_in_ | Number of features seen during fit.
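The fitted attributes above can be inspected as in this sketch, which fits one linear model and one forest on iris. Both model choices and n_estimators=50 are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

linear = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Linear model: one weight row per class, one column per feature.
print(linear.coef_.shape, linear.intercept_.shape)
print(linear.classes_, linear.n_features_in_)

# Forest: one importance value per feature, normalized to sum to 1.
importances = forest.feature_importances_
print(importances)
```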
balanced classification ★ | Accuracy is fine.
imbalanced classification ★ | Use F1, precision/recall, or ROC-AUC instead.
regression ★ | RMSE/MAE for error size, R² for variance explained.
clustering | Silhouette score; works without labels.