MLflow: No More Magic Numbers
Dec 30, 2025 • 18 min read
"What learning rate did we use for the model we shipped on Tuesday?" If your answer is "Let me check Slack" or "I think it's in this Jupyter notebook somewhere," you are doing MLOps wrong. MLflow is the open-source platform that turns model training from an ad-hoc experiment into a reproducible, auditable engineering process — tracking every parameter, metric, and artifact from every training run in a searchable database.
1. Setup
pip install mlflow
# Start the MLflow UI server
mlflow ui --port 5000
# Access at http://localhost:5000 — shows all experiments and runs
# Or use a remote tracking server (for team use)
mlflow server \
    --backend-store-uri postgresql://user:pass@localhost/mlflow \
    --default-artifact-root s3://your-bucket/mlflow-artifacts \
    --host 0.0.0.0 --port 5000
# Connect to remote server
import mlflow
mlflow.set_tracking_uri("http://your-mlflow-server:5000")
2. Autologging: Zero-Code Experiment Tracking
MLflow's autologging integrates with PyTorch, Hugging Face Transformers, scikit-learn, XGBoost, and more. Enable it with a single line:
import mlflow
import mlflow.transformers
from transformers import TrainingArguments, Trainer
# Enable autologging — captures everything automatically
mlflow.transformers.autolog(log_every_n_steps=10)
# Set the experiment name (groups related runs together)
mlflow.set_experiment("sentiment-classifier-v3")
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)
with mlflow.start_run(run_name="roberta-base-3ep-lr2e5"):
    trainer = Trainer(model=model, args=training_args, ...)
    trainer.train()
# MLflow automatically logs:
# - Params: learning_rate=2e-5, epochs=3, batch_size=16, weight_decay=0.01...
# - Metrics: train_loss, eval_loss, eval_accuracy (per step and epoch)
# - Artifacts: model.safetensors, tokenizer.json, config.json, training_args.json
3. Manual Logging for Custom Metrics
import mlflow
import matplotlib.pyplot as plt
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

with mlflow.start_run(run_name="rag-pipeline-v2"):
    # Log hyperparameters
    mlflow.log_params({
        "chunk_size": 512,
        "chunk_overlap": 50,
        "embedding_model": "text-embedding-3-small",
        "top_k": 5,
        "llm_model": "gpt-4o",
        "temperature": 0.1,
    })
    # Get RAG evaluation metrics (test_dataset already contains the questions,
    # retrieved contexts, and generated answers from your pipeline)
    eval_results = evaluate(test_dataset, metrics=[faithfulness, answer_relevancy])
    # Log custom metrics
    mlflow.log_metrics({
        "faithfulness": eval_results["faithfulness"],
        "answer_relevancy": eval_results["answer_relevancy"],
        "avg_latency_ms": avg_latency,
        "cost_per_query_usd": total_cost / num_queries,
        "p95_latency_ms": p95_latency,
    })
    # Log artifacts: confusion matrix, evaluation plots, sample predictions
    mlflow.log_artifact("evaluation_report.html")
    mlflow.log_artifact("confusion_matrix.png")
    # Log a custom figure
    fig, ax = plt.subplots()
    ax.plot(training_losses)
    mlflow.log_figure(fig, "training_curve.png")
4. The Model Registry: Staging to Production
The MLflow Model Registry is a versioned catalog of your models with lifecycle stages. This replaces the "deploy the model Dave trained on Wednesday" pattern with a structured promotion workflow:
from mlflow import MlflowClient
client = MlflowClient()
# Register a model from a completed run
run_id = "abc123def456" # From mlflow.active_run().info.run_id
model_uri = f"runs:/{run_id}/model"
result = mlflow.register_model(
    model_uri=model_uri,
    name="SentimentClassifier",
    tags={"team": "nlp", "use_case": "product-reviews"},
)
print(f"Model version: {result.version}") # e.g., 42
# Add description and stage transition
client.update_model_version(
    name="SentimentClassifier",
    version=42,
    description="RoBERTa-base fine-tuned on 50k product reviews. F1=0.94",
)
# Promote to Staging (requires passing eval gates first)
client.transition_model_version_stage(
    name="SentimentClassifier",
    version=42,
    stage="Staging",
    archive_existing_versions=False,
)
# After QA sign-off, promote to Production
client.transition_model_version_stage(
    name="SentimentClassifier",
    version=42,
    stage="Production",
    archive_existing_versions=True,  # Auto-archives previous Production version
)
# Load Production model in your serving code (always loads latest Production version)
model = mlflow.pyfunc.load_model(model_uri="models:/SentimentClassifier/Production")
prediction = model.predict(["This product is amazing!"])
5. Querying Experiments Programmatically
# Find the best performing run across all experiments
best_runs = mlflow.search_runs(
    experiment_names=["sentiment-classifier-v3"],
    filter_string="metrics.eval_f1 > 0.90 AND params.learning_rate = '2e-5'",
    order_by=["metrics.eval_f1 DESC"],
    max_results=5,
)
print(best_runs[["run_id", "params.learning_rate", "metrics.eval_f1", "start_time"]])
# Compare two specific runs
run_a = mlflow.get_run("run_id_a")
run_b = mlflow.get_run("run_id_b")
print(f"Run A F1: {run_a.data.metrics['eval_f1']:.4f}")
print(f"Run B F1: {run_b.data.metrics['eval_f1']:.4f}")
print(f"Diff: {run_b.data.metrics['eval_f1'] - run_a.data.metrics['eval_f1']:+.4f}")
6. MLflow with LLM Evaluation (MLflow 2.8+)
import mlflow
import mlflow.openai
import openai

# MLflow 2.8+ has built-in LLM evaluation metrics
with mlflow.start_run():
    # Log the OpenAI model as an MLflow model so mlflow.evaluate can load it
    logged_model = mlflow.openai.log_model(
        model="gpt-4o",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[{"role": "user", "content": "{question}"}],
    )
    eval_results = mlflow.evaluate(
        model=logged_model.model_uri,
        data=test_df,  # DataFrame with 'question' and 'ground_truth' columns
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",  # exact_match, toxicity, and readability metrics
    )
    print(eval_results.metrics)  # Logged automatically to MLflow
Frequently Asked Questions
MLflow vs Weights & Biases vs Comet — which should I use?
MLflow is the best choice if: you need a fully open-source self-hosted solution, you're in a regulated industry (data can't leave your infrastructure), or you use Databricks (MLflow is deeply integrated). Weights & Biases has the best UI and visualization features and is popular for research. Comet is strong for enterprise teams needing compliance reporting. For most teams fine-tuning models for production, MLflow's model registry and artifact tracking are the key differentiators.
How do I prevent MLflow from logging too much data and slowing training?
Set log_every_n_steps=100 in the autolog call to reduce metric logging frequency. Use log_models=False during exploration and only log models for promising runs. Artifact storage (model weights) is the main cost driver — configure S3 lifecycle policies to archive artifacts older than 90 days.
Conclusion
MLflow transforms ML training from a black-box art into an auditable engineering discipline. The model registry's staging → production promotion workflow is particularly valuable for teams: it enforces a review gate before any model hits production and makes rollbacks as simple as transitioning the previous version back to Production. Once the registry gives you a definitive answer to "which run produced the production model?", you'll never go back to ad-hoc model management.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.