How to Deploy ML Models Using Vertex AI: Step-by-Step
The gap between a trained ML model and a production prediction API is wider than most tutorials admit. I've watched teams spend weeks debugging why their perfectly accurate scikit-learn model produces garbage results in production — wrong input preprocessing, mismatched feature encoders, unbounded memory growth under concurrent load. Vertex AI solves the infrastructure side of this problem comprehensively, but you still need to package your model correctly.
This guide takes you from a trained model file on your local machine to a production REST endpoint on Vertex AI that auto-scales, health-checks itself, and logs every prediction for audit. I'll use a scikit-learn classification model as the example, but the pattern applies identically to TensorFlow, PyTorch, and XGBoost.
Prerequisites
# Install Google Cloud SDK (macOS)
brew install google-cloud-sdk

# Authenticate
gcloud auth login
gcloud auth application-default login

# Set your project
gcloud config set project YOUR_PROJECT_ID

# Enable required APIs
gcloud services enable aiplatform.googleapis.com
gcloud services enable artifactregistry.googleapis.com
gcloud services enable cloudbuild.googleapis.com

# Install Python SDK
pip install google-cloud-aiplatform scikit-learn cloudpickle
Step 1: Train and Serialize Your Model
Start with a simple scikit-learn pipeline that includes preprocessing. A critical mistake is serializing only the model and not the feature transformer — your production endpoint needs to apply the exact same transformations to raw input that training applied.
# train.py
import pandas as pd
import cloudpickle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv("customer_churn.csv")
X = df.drop("churned", axis=1)
y = df["churned"]
numeric_features = ["tenure_months", "monthly_charges", "total_charges"]
categorical_features = ["contract_type", "payment_method"]
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
# IMPORTANT: Pipeline includes preprocessor + model together
model_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(n_estimators=200, max_depth=5)),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model_pipeline.fit(X_train, y_train)
print(f"Test Accuracy: {model_pipeline.score(X_test, y_test):.4f}")
# Serialize the full pipeline (not just the classifier!)
with open("model.pkl", "wb") as f:
    cloudpickle.dump(model_pipeline, f)
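Before uploading, it's worth a quick smoke test that the serialized artifact reproduces the trained pipeline's predictions exactly. A minimal sketch with a toy numeric pipeline stands in for the churn model here (plain pickle behaves the same as cloudpickle for this round-trip check; the data and estimators are illustrative):

```python
import pickle

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the churn pipeline: two numeric features, binary label
X = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 20.0], [9.0, 10.0]])
y = np.array([1, 1, 0, 0])

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Round-trip through pickle, the same way the serving container will load the artifact
restored = pickle.loads(pickle.dumps(pipe))

# The restored pipeline must reproduce the original predictions bit-for-bit
assert (restored.predict(X) == pipe.predict(X)).all()
```

If this assertion fails after a round-trip, the deployed endpoint will fail the same way — better to catch it before the upload.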
Step 2: Upload Model Artifact to Cloud Storage
# Create a GCS bucket for model artifacts
gsutil mb -l us-central1 gs://my-project-model-artifacts

# Upload the model pickle file
gsutil cp model.pkl gs://my-project-model-artifacts/churn-model/v1/model.pkl

# Verify the upload
gsutil ls gs://my-project-model-artifacts/churn-model/v1/
Step 3: Register Model in Vertex AI Model Registry
Vertex AI has pre-built serving containers for common frameworks. For scikit-learn, you specify the framework version and point to the GCS artifact path — no custom Docker image required.
from google.cloud import aiplatform
aiplatform.init(project="my-gcp-project", location="us-central1")
# Upload model to Vertex AI Model Registry
# Vertex's pre-built sklearn container handles loading model.pkl automatically
model = aiplatform.Model.upload(
    display_name="customer-churn-classifier",
    artifact_uri="gs://my-project-model-artifacts/churn-model/v1/",
    # Pre-built sklearn serving container (no Docker needed)
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
    # HTTP routes the container exposes for predictions and health checks
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
)
print(f"Model registered: {model.resource_name}")
# Output: projects/12345/locations/us-central1/models/987654321
Step 4: Create and Deploy to an Endpoint
An Endpoint is the managed REST API surface. You create one endpoint per model use case, then deploy model versions to it. This separation allows you to update model versions without changing the endpoint URL your clients call.
# Create the endpoint
endpoint = aiplatform.Endpoint.create(
    display_name="churn-prediction-endpoint",
    description="Production endpoint for customer churn prediction",
)
# Deploy the model to the endpoint
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="churn-v1-deployment",
    # Machine type for serving
    machine_type="n1-standard-2",  # 2 vCPUs, 7.5 GB RAM
    # Auto-scaling configuration
    min_replica_count=1,   # Always keep at least 1 instance warm
    max_replica_count=10,  # Scale up to 10 under load
    # Traffic allocation (useful for A/B testing)
    traffic_percentage=100,  # 100% of traffic to this model version
)
print(f"Endpoint deployed: {endpoint.resource_name}")
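When a v2 model is ready later, you redeploy to the same endpoint with a traffic split rather than 100% — a canary rollout. A small sketch of the split computation, assuming Vertex's convention that `traffic_split` maps deployed-model IDs to integer percentages summing to 100 and that the key "0" refers to the model currently being deployed (the `deploy` call at the end is hypothetical usage, not runnable here):

```python
def canary_split(current_id: str, new_id: str, canary_pct: int) -> dict:
    """Build a Vertex-style traffic_split dict: deployed-model ID -> percent."""
    if not 0 <= canary_pct <= 100:
        raise ValueError("canary_pct must be between 0 and 100")
    return {current_id: 100 - canary_pct, new_id: canary_pct}

# "0" stands for the model being deployed in this call; the other ID
# comes from endpoint.list_models() on the live endpoint
split = canary_split("1234567890", "0", 10)

# Hypothetical usage:
# model_v2.deploy(endpoint=endpoint, traffic_split=split, machine_type="n1-standard-2")
```

Because the endpoint URL never changes, clients are unaware of the rollout; you shift percentages toward v2 as confidence grows.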
Step 5: Send Prediction Requests
# Making predictions via Python SDK
instances = [
    {
        "tenure_months": 24,
        "monthly_charges": 79.95,
        "total_charges": 1918.80,
        "contract_type": "Month-to-month",
        "payment_method": "Electronic check",
    },
    {
        "tenure_months": 60,
        "monthly_charges": 45.00,
        "total_charges": 2700.00,
        "contract_type": "Two year",
        "payment_method": "Bank transfer",
    },
]
response = endpoint.predict(instances=instances)
print(response.predictions)
# Output: [1, 0] -- customer 1 likely to churn, customer 2 unlikely
# Or via REST API (for any language/client):
# POST https://us-central1-aiplatform.googleapis.com/v1/{endpoint}:predict
# Authorization: Bearer $(gcloud auth print-access-token)
# Body: {"instances": [...]}
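For clients without the SDK, the REST call is plain JSON over HTTPS. A minimal stdlib-only sketch — the project and endpoint IDs below are placeholders, and the access token is assumed to come from `gcloud auth print-access-token` or a service account:

```python
import json
import urllib.request

# Placeholder IDs: substitute your own project number and endpoint ID
ENDPOINT_URL = (
    "https://us-central1-aiplatform.googleapis.com/v1/"
    "projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID:predict"
)

def build_body(instances) -> bytes:
    """Vertex's predict route expects {"instances": [...]} as the JSON body."""
    return json.dumps({"instances": instances}).encode("utf-8")

def predict(instances, access_token):
    req = urllib.request.Request(
        ENDPOINT_URL,
        data=build_body(instances),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]

body = build_body([{"tenure_months": 24, "monthly_charges": 79.95}])
```

The same request body works from curl, Go, Java, or any HTTP client — only the auth header changes per environment.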
Step 6: Enable Model Monitoring
After deployment, enable drift monitoring to get alerted when incoming request distributions diverge from training data:
from google.cloud.aiplatform import model_monitoring

# Flag a feature if its live distribution drifts past the threshold
# (measured as Jensen-Shannon divergence for numerical features)
drift_config = model_monitoring.DriftDetectionConfig(
    drift_thresholds={
        "tenure_months": 0.2,
        "monthly_charges": 0.2,
    }
)
objective_config = model_monitoring.ObjectiveConfig(
    drift_detection_config=drift_config,
)

monitoring_job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="churn-model-monitoring",
    endpoint=endpoint,
    # Sample 10% of live traffic for monitoring analysis
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.1),
    # Check every 6 hours (interval is specified in hours)
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=6),
    objective_configs=objective_config,
)
Conclusion
The six steps above take you from a local model file to a production REST endpoint with auto-scaling and drift monitoring in approximately two hours of work. The most important lesson is to always serialize the complete preprocessing pipeline alongside the model — this single decision eliminates the majority of training-serving skew bugs that plague production ML systems.