
AWS SageMaker AI cheat sheet

SageMaker AI is AWS's fully managed ML platform — it handles infrastructure so you can focus on data prep → train → tune → register → deploy → monitor. Every component maps to one of these lifecycle phases.


Synthesized from: docs.aws.amazon.com/sagemaker · docs.aws.amazon.com/cli/latest/reference/sagemaker · skillcertpro.com AWS ML Specialty

The SageMaker AI ML lifecycle — six phases, each with its own managed tools
Your Data: Amazon S3 (CSV, JSON, Parquet, RecordIO, images, text) · EFS · FSx for Lustre
Prepare: Data Wrangler · Ground Truth · Feature Store · Processing Jobs
Train & Tune: Estimator.fit() · Built-in Algorithms · BYOC / Frameworks · HyperparameterTuner
Register: Model Registry · ModelPackage version · Approve / Reject · Model Cards
Deploy: Real-time endpoint · Serverless inference · Async inference · Batch Transform
Monitor & Govern: Model Monitor · Clarify (bias/explain) · Debugger / Experiments · Pipelines (CI/CD for ML)
Artifact flow: model.tar.gz → S3
Feedback loop: drift detected → trigger retraining via Pipelines

01 Platform Setup & Config: SDK + CLI + IAM
02 Studio & Dev Environments: where you write code
03 Data Prep & Labeling: 60–70% of ML effort
04 Built-in Algorithms: Supervised (tabular + time-series)
05 Built-in Algorithms: Unsupervised (no labels needed)
06 Built-in Algorithms: Text & Vision (NLP + computer vision)
07 Training Jobs: SDK + CLI + instance tips
08 JumpStart & AutoML: pre-built + zero-code ML
09 Hyperparameter Tuning (AMT): automatic model tuning
10 Inference Endpoints: 4 modes, pick the right one
11 MLOps: Pipelines & Model Registry (CI/CD for ML)
12 Responsible AI & Debugging: Clarify · Debugger · Experiments
13 CLI Quick Reference: aws sagemaker …
AWS AI/ML Services Landscape: beyond SageMaker

Workflow patterns & decision guides

Rendered via Mermaid — based on the official SageMaker Developer Guide architecture diagrams.

Choose your inference mode

Pick based on latency tolerance, payload size, and cost sensitivity.

flowchart TD
  A["Need inference?"] --> B{"Batch or offline?"}
  B -->|Yes| BT["Batch Transform\n(no endpoint, cheapest\nfor bulk S3 data)"]
  B -->|No| C{"Payload over 6 MB\nor inference over 60s?"}
  C -->|Yes| AS["Async Inference\n(queued, up to 1 GB\npayload, SNS notify)"]
  C -->|No| D{"Traffic spiky\nor unpredictable?"}
  D -->|Yes| SL["Serverless Inference\n(scale-to-zero, cold starts\nafter idle, pay-per-use)"]
  D -->|No| RT["Real-time Endpoint\n(always-on, sub-1s latency\nA/B via Prod Variants)"]
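
The decision tree above can be written as a small function. This is a minimal sketch in plain Python: the `Workload` fields and `choose_inference_mode` are hypothetical names, and the 6 MB and 60 s cut-offs are taken from the flowchart, not checked against current service quotas.

```python
from dataclasses import dataclass

# Thresholds copied from the flowchart; treat them as this sheet's
# rules of thumb, not as authoritative service quotas.
MAX_SYNC_PAYLOAD_MB = 6
MAX_SYNC_SECONDS = 60


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of an inference workload."""
    offline_batch: bool       # data already sits in S3, no caller is waiting
    payload_mb: float         # largest request body, in MB
    inference_seconds: float  # worst-case time for one prediction
    spiky_traffic: bool       # long idle periods or unpredictable bursts


def choose_inference_mode(w: Workload) -> str:
    """Ask the flowchart's questions in order and return the first match."""
    if w.offline_batch:
        return "Batch Transform"
    if w.payload_mb > MAX_SYNC_PAYLOAD_MB or w.inference_seconds > MAX_SYNC_SECONDS:
        return "Async Inference"
    if w.spiky_traffic:
        return "Serverless Inference"
    return "Real-time Endpoint"
```

The order of the checks matters: a large-payload workload with spiky traffic lands on Async Inference, because the payload question is asked before the traffic question.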
      

SageMaker Pipelines flow

A typical CI/CD ML pipeline — each step is a managed SageMaker job.

flowchart LR
  P["ProcessingStep\n(Data Wrangler /\nSKLearn script)"] --> T["TrainingStep\n(Estimator.fit)"]
  T --> E["EvalStep\n(Processing +\nCondition)"]
  E -->|metric OK| R["RegisterModel\nStep"]
  E -->|metric fails| FAIL["Fail / Notify\n(SNS alert)"]
  R --> D["Deploy Step\n(create-endpoint\nif Approved)"]
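
The branch after EvalStep comes down to comparing one number from an evaluation report against a threshold. The sketch below shows that gate in plain Python, not the SageMaker Pipelines SDK; the report layout (`metrics` → metric name → `value`) and the function name `gate_on_metric` are assumptions for illustration.

```python
import json


def gate_on_metric(evaluation_json: str, metric: str, threshold: float) -> str:
    """Return the name of the next step for a given evaluation report.

    Assumes a report shaped like {"metrics": {"<name>": {"value": <float>}}}
    and a metric where higher is better (for example accuracy or AUC).
    """
    report = json.loads(evaluation_json)
    value = report["metrics"][metric]["value"]
    # "metric OK" edge → register the model; otherwise take the failure edge.
    return "RegisterModel" if value >= threshold else "Fail"


# Example report, as an evaluation script might write it (hypothetical values).
report = json.dumps({"metrics": {"accuracy": {"value": 0.91}}})
next_step = gate_on_metric(report, "accuracy", threshold=0.85)
```

For a lower-is-better metric such as RMSE the comparison would flip to `<=`.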
      

SageMaker AI feature map by lifecycle phase

Every named feature sits in one or more lifecycle phases — use this to find the right tool fast.

flowchart LR
  subgraph Prep["Prepare"]
    GT["Ground Truth\n(labeling)"]
    DW["Data Wrangler\n(transforms)"]
    FS["Feature Store\n(online+offline)"]
    PJ["Processing Jobs\n(custom ETL)"]
  end
  subgraph Build["Build / Train"]
    JS["JumpStart\n(pre-trained FMs)"]
    BIA["Built-in\nAlgorithms"]
    BYOC["BYOC / Frameworks\n(TF, PyTorch, SKLearn)"]
    AP["Autopilot\n(AutoML)"]
  end
  subgraph Tune["Tune"]
    AMT["Auto Model\nTuning (AMT)"]
    DBG["Debugger\n(rules + profiler)"]
    EXP["Experiments\n(track + compare)"]
  end
  subgraph Deploy["Deploy"]
    RT2["Real-time\nEndpoint"]
    SRV["Serverless\nInference"]
    ASNC["Async\nInference"]
    BT2["Batch\nTransform"]
  end
  subgraph Gov["Govern"]
    MR["Model Registry\n(versions)"]
    MM["Model Monitor\n(drift)"]
    CL["Clarify\n(bias/SHAP)"]
    MC["Model Cards\n(docs)"]
    PL["Pipelines\n(CI/CD)"]
  end
  Prep --> Build --> Tune --> Deploy --> Gov
      

Worth memorizing

File vs Pipe vs FastFile: File = download everything first (default); Pipe = stream RecordIO (fast); FastFile = POSIX-style streaming, any format
XGBoost instance choice: CPU training is memory-bound → ml.m5 beats ml.c5; older XGBoost versions are CPU-only; GPU training needs version 1.2+ with tree_method=gpu_hist
Spot training: EnableManagedSpotTraining=True → up to 90% savings; set MaxWaitTimeInSeconds ≥ MaxRuntimeInSeconds and checkpoint to S3 so interrupted jobs can resume
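Pipe mode and RecordIO: Pipe mode is built for Protobuf RecordIO; CSV support in Pipe mode varies by algorithm, so check before relying on it; data streams from S3, so the dataset does not have to fit on the training volume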
BYOC port 8080: Custom inference containers must respond on port 8080 and answer /ping in <2 s
Model artifacts = tar.gz: SageMaker expects model.tar.gz (not zip), stored in S3 at output_path
iam:PassRole: Any user calling create-model must have iam:PassRole on the SageMaker execution role
Production Variants: One endpoint, multiple model versions with traffic weights → A/B tests, canary rollouts
Serverless = scale-to-zero: No idle cost; expect a cold start after idle periods (length depends on model and container size); set MemorySizeInMB (1024–6144, in 1 GB steps) + MaxConcurrency
SageMaker vs AI services: SageMaker = custom ML. Rekognition/Comprehend/Forecast/etc. = pre-built, no training needed
delete-endpoint: Real-time endpoints are billed for every hour they run, even when idle; always delete them after testing
CSV no header: Built-in algorithms expect CSVs without a header row; first column = label in supervised tasks
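
Two of the cards above, model.tar.gz and header-less CSV, can be reproduced locally with the standard library. This is a minimal sketch, assuming the model files live in one directory and labels are supplied separately from features; `package_model` and `write_training_csv` are hypothetical helper names, and the upload to S3 is left out.

```python
import csv
import tarfile
from pathlib import Path


def package_model(model_dir: Path, out_path: Path) -> Path:
    """Pack every file under model_dir into a gzip-compressed tarball."""
    with tarfile.open(out_path, "w:gz") as tar:
        for path in sorted(model_dir.rglob("*")):
            if path.is_file():
                # Relative arcname puts the files at the archive root
                # instead of under the local directory name.
                tar.add(path, arcname=path.relative_to(model_dir).as_posix())
    return out_path


def write_training_csv(labels, features, out_path: Path) -> Path:
    """Write one row per example: label first, then features, no header."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for label, row in zip(labels, features):
            writer.writerow([label, *row])
    return out_path
```

Write the archive outside `model_dir`, otherwise the half-written tarball can be picked up as one of its own members.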