
MLflow in 2026: Still the Most Practical MLOps Platform, Now Going All-In on LLMs

📦 mlflow/mlflow
Language: Python
Stars: 25,260
Trend: Rising
View on GitHub →


MLflow has been quietly sitting at 25k+ stars for a while, but what caught my attention recently is the pace of commits. The repo is pushing multiple production-relevant changes daily — Gemini autolog fixes, OpenTelemetry trace improvements, performance fixes for single-tenant installs, session tracking for traces. This isn't a project coasting on its reputation. Something is actively being built here.

I've used MLflow on and off since 2019, mostly for experiment tracking on traditional ML projects. Coming back to it now, I had to recalibrate. The project has expanded significantly into LLM territory, and I wanted to figure out whether that expansion is genuine or just marketing copy chasing the AI hype cycle.

What MLflow Actually Does

At its core, MLflow is a platform for tracking, evaluating, and deploying ML artifacts — models, experiments, metrics, parameters. That part hasn't changed. What has changed is that the team has bolted on a serious LLMOps layer: tracing and autologging for LLM calls, an evaluation framework, a prompt registry, and an AI gateway.

The autolog feature is where this gets practically useful. You call mlflow.openai.autolog() before your code runs, and MLflow intercepts and logs every LLM call — inputs, outputs, latency, token counts, model parameters. Same thing works for LangChain, LlamaIndex, Anthropic, Gemini, and 60+ other integrations. That's not a small thing. Instrumenting LLM calls manually is tedious and error-prone.

Why This Matters Right Now

The honest answer is: most teams building LLM applications are flying blind. They have no systematic way to compare prompt versions, no way to catch quality regressions before deployment, and no visibility into what their agents are actually doing in production. Tools like LangSmith exist, but they're proprietary and tied to the LangChain ecosystem. OpenTelemetry gives you the plumbing but not the UI or the evaluation layer.

MLflow fills that gap in a way that's framework-agnostic and self-hostable. That last part matters a lot for enterprise teams who can't send production traces to a third-party SaaS. You can run the entire stack on your own infrastructure.

The timing also aligns with a broader shift in how ML teams are organized. Teams that used MLflow for classical ML are now being asked to also own LLM pipelines. Having one platform that handles both reduces the cognitive overhead and the number of tools you're managing.

Features Worth Calling Out

1. Autolog with OpenTelemetry integration

The fact that MLflow's tracing is built on OpenTelemetry is underrated. It means you're not locked into MLflow's proprietary format. If you later want to route traces to Jaeger, Datadog, or any other OTel-compatible backend, you can. Most LLM observability tools make you bet on their format. MLflow doesn't force that.
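In practice, that routing is driven by the standard OpenTelemetry environment variables rather than anything MLflow-specific. A sketch, with a placeholder endpoint that you'd swap for your own collector:

```shell
# Standard OTel env vars; when set, MLflow exports traces to this
# OTLP collector instead of (or alongside) the MLflow tracking store.
# The endpoint below is a placeholder -- point it at your own
# Jaeger, Datadog, or other OTLP-compatible agent.
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://localhost:4317"
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL="grpc"
```

Because these are spec-level OTel variables, the same configuration works if you later swap MLflow's tracer out for any other OTel-instrumented library.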

2. LLM Evaluation with custom judges

The evaluation framework lets you run systematic evals against a dataset, score outputs using LLM-as-judge, and track those scores over time as experiments. You can define custom scorers in Python. This is the kind of thing that separates teams that actually know if their model is regressing from teams that are guessing. The recent v3.11.1 release added automatic issue detection — you select traces in the UI and it surfaces potential problems. That's genuinely useful for debugging agent behavior.
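The shape of a custom scorer is roughly: a plain Python callable that takes an output (and whatever expectations you need) and returns a score. A toy deterministic example of that shape — illustrative only, not MLflow's exact scorer-registration API:

```python
def keyword_coverage(output: str, required_keywords: list[str]) -> float:
    """Toy deterministic scorer: fraction of required keywords present.

    Illustrates the shape of a custom scorer -- a plain Python callable
    returning a score -- not MLflow's exact registration API. An
    LLM-as-judge scorer has the same shape, with the heuristic replaced
    by a model call.
    """
    if not required_keywords:
        return 1.0
    found = sum(1 for kw in required_keywords if kw.lower() in output.lower())
    return found / len(required_keywords)

score = keyword_coverage(
    "MLflow logs traces and evaluation scores.",
    ["traces", "scores", "latency"],
)  # 2 of 3 keywords present
```

The value of wiring this into an eval framework rather than a one-off script is that the scores get tracked over time as experiments, so regressions are visible.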

3. Prompt Registry

Versioning prompts sounds simple but most teams don't do it. MLflow's prompt registry gives you lineage — which prompt version was used with which model version to produce which output. When something breaks in production, this is the kind of information that cuts debugging time from hours to minutes.
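The lineage idea — tying each output back to an exact prompt version and model — can be pictured as a record like this. This is an illustrative data model, not MLflow's internal schema, and all the field values are made up:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationRecord:
    """Illustrative lineage record -- not MLflow's actual schema.

    The point of a prompt registry is that every production output can
    be traced back to the exact prompt version and model that produced
    it, via a record like this attached to each trace.
    """
    prompt_name: str
    prompt_version: int
    model: str
    trace_id: str

record = GenerationRecord(
    prompt_name="support-triage",   # hypothetical prompt name
    prompt_version=7,
    model="gpt-4o-2024-08-06",
    trace_id="tr-123",              # hypothetical trace id
)
```

When an output goes wrong, you filter traces by prompt version and immediately know whether the regression arrived with a prompt change or a model change.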

4. Self-hostable AI Gateway

The AI Gateway is an OpenAI-compatible proxy that sits in front of your LLM providers. You configure it once with your API keys, and your application code talks to the gateway. This centralizes credential management, lets you implement rate limiting and fallbacks, and gives you a single place to do A/B testing between models. For teams with multiple developers all making direct API calls with their own keys, this is a significant operational improvement.
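A minimal gateway config might look like the following. The field names match recent MLflow documentation, but the schema (and the CLI that consumes it) has shifted between releases, so treat this as a sketch and check your version's docs:

```yaml
# Hypothetical config.yaml for the MLflow AI Gateway.
endpoints:
  - name: chat
    endpoint_type: llm/v1/chat
    model:
      provider: openai
      name: gpt-4o
      config:
        openai_api_key: $OPENAI_API_KEY
```

Application code then points its OpenAI-compatible client at the gateway's URL instead of the provider directly, which is what makes swapping or A/B-testing models a config change rather than a code change.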

5. Traditional ML tracking is still solid

Don't overlook the original use case. Experiment tracking, model registry, deployment integrations with SageMaker, Azure ML, Kubernetes — that stuff works well and has years of production hardening behind it. If you're running both classical ML and LLM workloads, having one platform for both is a real advantage.

Who Should Use This

Use MLflow if:
- You're already using it for ML experiment tracking and now need LLM observability — the migration path is low friction
- You need a self-hosted solution because of data privacy or compliance requirements
- You're working across multiple LLM providers and frameworks and don't want to be locked into one vendor's observability tool
- You have a team that spans both traditional ML and LLM work and you want to reduce tool sprawl
- You're building on Databricks — MLflow is deeply integrated there and you get a lot for free

Skip it or look elsewhere if:
- You're a solo developer building a simple LLM app and don't need the operational overhead of running a server
- You're 100% in the LangChain ecosystem and LangSmith's tighter integration is worth the lock-in to you
- You need real-time production monitoring with alerting — MLflow's monitoring story is still more batch/eval oriented than continuous production observability
- You want a fully managed SaaS with zero infrastructure — MLflow Cloud exists but the open source version requires you to manage your own deployment

Concerns and Limitations

I'll be direct about the things that gave me pause.

Databricks dependency creep. MLflow was created at Databricks, and the company's fingerprints are increasingly visible. The databricks-sdk is now a hard dependency in the base package. Several recent commits reference Databricks-specific features. The project is still Apache-2.0 and genuinely open source, but the strategic direction is clearly oriented toward making Databricks the preferred deployment target. If you're not on Databricks, you'll occasionally hit features that work better or only work there.

2008 open issues. That number is high. Some of it is expected for a project this size, but when I dug in, there are real bugs and feature requests that have been sitting open for a long time. The team is clearly prioritizing LLM features right now, and some of the traditional ML functionality feels like it's in maintenance mode.

The dependency footprint is heavy. Look at the dependency list: Flask, FastAPI, pandas, numpy, scikit-learn, scipy, pyarrow, SQLAlchemy, matplotlib, Docker, and more — all in the base package. This is a kitchen-sink install. For production deployments where you want a lean container, this is a problem. There's an mlflow-skinny package that strips out some of the heavier dependencies, but the documentation on what you lose is not great.
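For clients that only need to log runs and traces, the skinny install is the starting point. What exactly it excludes varies by release, so verify against your version's setup metadata:

```shell
# Full install: tracking client plus server, UI, and all framework deps.
pip install mlflow

# Lean install: keeps the tracking/logging APIs, drops the UI, server,
# and heavier dependencies. Exact exclusions vary by release.
pip install mlflow-skinny
```

A common pattern is mlflow-skinny in production containers and the full package only where someone actually runs the UI or server.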

Copilot as a top contributor. I noticed that the GitHub Copilot bot has 987 commits, making it the second most active contributor. I'm not ideologically opposed to AI-assisted development, but when a bot is responsible for nearly 1000 commits on a production platform, I want to see rigorous review processes. The quality of those contributions is hard to audit from the outside.

The LLM features are newer and rougher. The autolog integrations work well for the happy path, but edge cases surface bugs. The Gemini autolog image test fix in the recent commits is a good example — these integrations require ongoing maintenance as provider APIs change. If you're using a less popular LLM provider, expect to hit issues.

Verdict

MLflow is worth adopting, with clear eyes about what you're getting. For traditional ML experiment tracking and model management, it's still the most practical open source option available. The feature set is mature, the community is large, and the documentation is solid.

For LLM observability and evaluation, it's a legitimate choice — especially if self-hosting matters to you or you're already in the MLflow ecosystem. The OpenTelemetry foundation is the right architectural decision, and the evaluation framework is more capable than most alternatives. But expect some rough edges, and go in knowing that Databricks is steering the roadmap.

If you're starting fresh on a greenfield LLM project with no existing MLflow investment, I'd evaluate it alongside LangSmith (if you're LangChain-heavy) and Langfuse (if you want a lighter-weight open source option). But if you have an existing ML team that's expanding into LLMs, MLflow is the path of least resistance to a unified observability and evaluation platform.

Bottom line: It's not perfect, but it's the most complete open source option in this space right now, and the development velocity suggests it's getting better fast.



View mlflow/mlflow on GitHub →
Tags: mlops, llmops, observability, machine-learning, open-source