How to Deploy ML Models Using Vertex AI: Step-by-Step
The gap between a trained ML model and a production prediction API is wider than most tutorials admit. I've watched teams spend weeks debugging why their perfectly accurate scikit-learn model produces garbage results in production — wrong input preprocessing, mismatched feature encoders, unbounded memory growth under concurrent load. Vertex AI solves the infrastructure side of this problem comprehensively, but you still need to package your model correctly.
This guide takes you from a trained model file on your local machine to a production REST endpoint on Vertex AI that auto-scales, health-checks itself, and logs every prediction for audit. I'll use a scikit-learn classification model as the example, but the pattern applies identically to TensorFlow, PyTorch, and XGBoost.
Prerequisites
# Install Google Cloud SDK (macOS)
brew install google-cloud-sdk

# Authenticate
gcloud auth login
gcloud auth application-default login

# Set your project
gcloud config set project YOUR_PROJECT_ID

# Enable required APIs
gcloud services enable aiplatform.googleapis.com
gcloud services enable artifactregistry.googleapis.com
gcloud services enable cloudbuild.googleapis.com

# Install Python SDK
pip install google-cloud-aiplatform scikit-learn cloudpickle
Step 1: Train and Serialize Your Model
Start with a simple scikit-learn pipeline that includes preprocessing. A critical mistake is serializing only the model and not the feature transformer — your production endpoint needs to apply the exact same transformations to raw input that training applied.
# train.py
import pandas as pd
import cloudpickle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv("customer_churn.csv")
X = df.drop("churned", axis=1)
y = df["churned"]
numeric_features = ["tenure_months", "monthly_charges", "total_charges"]
categorical_features = ["contract_type", "payment_method"]
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
# IMPORTANT: Pipeline includes preprocessor + model together
model_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(n_estimators=200, max_depth=5)),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model_pipeline.fit(X_train, y_train)
print(f"Test Accuracy: {model_pipeline.score(X_test, y_test):.4f}")
# Serialize the full pipeline (not just the classifier!)
with open("model.pkl", "wb") as f:
    cloudpickle.dump(model_pipeline, f)
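Before uploading, it's worth a quick smoke test that the serialized artifact reproduces the trained pipeline's predictions exactly. A minimal sketch with a toy numeric pipeline stands in for the churn model here (plain pickle behaves the same as cloudpickle for this round-trip check; the data and estimators are illustrative):

```python
import pickle

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the churn pipeline: two numeric features, binary label
X = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 20.0], [9.0, 10.0]])
y = np.array([1, 1, 0, 0])

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Round-trip through pickle, the same way the serving container will load the artifact
restored = pickle.loads(pickle.dumps(pipe))

# The restored pipeline must reproduce the original predictions bit-for-bit
assert (restored.predict(X) == pipe.predict(X)).all()
```

If this assertion fails after a round-trip, the deployed endpoint will fail the same way — better to catch it before the upload.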
Step 2: Upload Model Artifact to Cloud Storage
# Create a GCS bucket for model artifacts
gsutil mb -l us-central1 gs://my-project-model-artifacts

# Upload the model pickle file
gsutil cp model.pkl gs://my-project-model-artifacts/churn-model/v1/model.pkl

# Verify the upload
gsutil ls gs://my-project-model-artifacts/churn-model/v1/
Step 3: Register Model in Vertex AI Model Registry
Vertex AI has pre-built serving containers for common frameworks. For scikit-learn, you specify the framework version and point to the GCS artifact path — no custom Docker image required.
from google.cloud import aiplatform
aiplatform.init(project="my-gcp-project", location="us-central1")
# Upload model to Vertex AI Model Registry
# Vertex's pre-built sklearn container handles loading model.pkl automatically
model = aiplatform.Model.upload(
    display_name="customer-churn-classifier",
    artifact_uri="gs://my-project-model-artifacts/churn-model/v1/",
    # Pre-built sklearn serving container (no Docker needed)
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
    # HTTP routes the container exposes for predictions and health checks
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
)
print(f"Model registered: {model.resource_name}")
# Output: projects/12345/locations/us-central1/models/987654321
Step 4: Create and Deploy to an Endpoint
An Endpoint is the managed REST API surface. You create one endpoint per model use case, then deploy model versions to it. This separation allows you to update model versions without changing the endpoint URL your clients call.
# Create the endpoint
endpoint = aiplatform.Endpoint.create(
    display_name="churn-prediction-endpoint",
    description="Production endpoint for customer churn prediction",
)
# Deploy the model to the endpoint
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="churn-v1-deployment",
    # Machine type for serving
    machine_type="n1-standard-2",  # 2 vCPUs, 7.5 GB RAM
    # Auto-scaling configuration
    min_replica_count=1,   # Always keep at least 1 instance warm
    max_replica_count=10,  # Scale up to 10 under load
    # Traffic allocation (useful for A/B testing)
    traffic_percentage=100,  # 100% of traffic to this model version
)
print(f"Endpoint deployed: {endpoint.resource_name}")
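When a v2 model is ready later, you redeploy to the same endpoint with a traffic split rather than 100% — a canary rollout. A small sketch of the split computation, assuming Vertex's convention that `traffic_split` maps deployed-model IDs to integer percentages summing to 100 and that the key "0" refers to the model currently being deployed (the `deploy` call at the end is hypothetical usage, not runnable here):

```python
def canary_split(current_id: str, new_id: str, canary_pct: int) -> dict:
    """Build a Vertex-style traffic_split dict: deployed-model ID -> percent."""
    if not 0 <= canary_pct <= 100:
        raise ValueError("canary_pct must be between 0 and 100")
    return {current_id: 100 - canary_pct, new_id: canary_pct}

# "0" stands for the model being deployed in this call; the other ID
# comes from endpoint.list_models() on the live endpoint
split = canary_split("1234567890", "0", 10)

# Hypothetical usage:
# model_v2.deploy(endpoint=endpoint, traffic_split=split, machine_type="n1-standard-2")
```

Because the endpoint URL never changes, clients are unaware of the rollout; you shift percentages toward v2 as confidence grows.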
Step 5: Send Prediction Requests
# Making predictions via Python SDK
instances = [
    {
        "tenure_months": 24,
        "monthly_charges": 79.95,
        "total_charges": 1918.80,
        "contract_type": "Month-to-month",
        "payment_method": "Electronic check",
    },
    {
        "tenure_months": 60,
        "monthly_charges": 45.00,
        "total_charges": 2700.00,
        "contract_type": "Two year",
        "payment_method": "Bank transfer",
    },
]
response = endpoint.predict(instances=instances)
print(response.predictions)
# Output: [1, 0] -- customer 1 likely to churn, customer 2 unlikely
# Or via REST API (for any language/client):
# POST https://us-central1-aiplatform.googleapis.com/v1/{endpoint}:predict
# Authorization: Bearer $(gcloud auth print-access-token)
# Body: {"instances": [...]}
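For clients without the SDK, the REST call is plain JSON over HTTPS. A minimal stdlib-only sketch — the project and endpoint IDs below are placeholders, and the access token is assumed to come from `gcloud auth print-access-token` or a service account:

```python
import json
import urllib.request

# Placeholder IDs: substitute your own project number and endpoint ID
ENDPOINT_URL = (
    "https://us-central1-aiplatform.googleapis.com/v1/"
    "projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID:predict"
)

def build_body(instances) -> bytes:
    """Vertex's predict route expects {"instances": [...]} as the JSON body."""
    return json.dumps({"instances": instances}).encode("utf-8")

def predict(instances, access_token):
    req = urllib.request.Request(
        ENDPOINT_URL,
        data=build_body(instances),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]

body = build_body([{"tenure_months": 24, "monthly_charges": 79.95}])
```

The same request body works from curl, Go, Java, or any HTTP client — only the auth header changes per environment.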
Step 6: Enable Model Monitoring
After deployment, enable drift monitoring to get alerted when incoming request distributions diverge from training data:
from google.cloud.aiplatform import model_monitoring

# Flag a feature if its live distribution drifts past the threshold
# (measured as Jensen-Shannon divergence for numerical features)
drift_config = model_monitoring.DriftDetectionConfig(
    drift_thresholds={
        "tenure_months": 0.2,
        "monthly_charges": 0.2,
    }
)
objective_config = model_monitoring.ObjectiveConfig(
    drift_detection_config=drift_config,
)

monitoring_job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="churn-model-monitoring",
    endpoint=endpoint,
    # Sample 10% of live traffic for monitoring analysis
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.1),
    # Check every 6 hours (interval is specified in hours)
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=6),
    objective_configs=objective_config,
)
Conclusion
The six steps above take you from a local model file to a production REST endpoint with auto-scaling and drift monitoring in approximately two hours of work. The most important lesson is to always serialize the complete preprocessing pipeline alongside the model — this single decision eliminates the majority of training-serving skew bugs that plague production ML systems.