
Ray Is the Distributed Python Runtime You Probably Need (But Should Approach Carefully)

📦 ray-project/ray
Language: Python
Stars: 42,029
Trend: Rising

Ray has been gaining steady traction — 42,000+ stars, consistently active commits, and a trend that's still rising in 2026. That's not hype momentum anymore; that's infrastructure adoption. When a tool hits this kind of sustained growth in the ML infrastructure space, it's worth taking seriously. But "worth taking seriously" and "drop it into your stack without hesitation" are two very different things.

I've spent time in the codebase, tracked the commit history, and used Ray across a few different workloads. Here's my honest take.

What Ray Actually Is

Forget the "AI compute engine" tagline for a second. At its core, Ray is a distributed task and actor framework for Python. You decorate a function with @ray.remote, call it with .remote(), and Ray handles scheduling that work across a cluster. That's the primitive. Everything else — the training library (Train), the hyperparameter tuner (Tune), the serving layer (Serve), the data pipeline toolkit (Data), RLlib — is built on top of that primitive.

This architecture matters because it means Ray isn't just a point solution. You're not adopting a hyperparameter tuner. You're adopting a distributed runtime that happens to have a hyperparameter tuner on top of it. That's both its biggest strength and the source of most of its complexity.

The mental model is: tasks are stateless remote functions, actors are stateful remote objects, and the object store is a shared memory layer that lets results flow between them without unnecessary serialization round-trips. Once that clicks, the rest of the API makes sense.

Why This Matters Right Now

The ML infrastructure landscape has a real problem: there's a massive gap between "runs on my GPU workstation" and "runs on 100 nodes in a cloud cluster." For years, the options were either write your own distributed glue code, use something heavyweight like Spark (which was never designed for ML workloads), or lock yourself into a cloud provider's managed training service.

Ray fills that gap in a way that doesn't require you to rewrite your Python. The @ray.remote decorator is genuinely low-friction. You can parallelize a for loop across a cluster with maybe 10 lines of changes to existing code. For teams that are hitting single-node limits but aren't ready to invest in a full MLOps platform, that's a compelling proposition.

The timing is also right because LLM inference and serving have become first-class workloads. The recent commits reflect this — there's active work on token-based request routing in Serve, KubeRay integration improvements, and Kafka support in Data. The team is clearly responding to where the industry is going.

Key Features Worth Knowing About

Ray Core (Tasks + Actors): This is the part I trust most. It's been around since 2016, has serious academic research behind it (the ownership paper from NSDI is worth reading), and the abstractions are genuinely well-designed. If you need to parallelize embarrassingly parallel Python work or build stateful distributed services, this is solid.

Ray Serve: The serving layer has matured significantly. The recent addition of a central capacity queue for token-based request routing shows the team is thinking seriously about LLM serving patterns — things like handling variable-length token sequences fairly across concurrent requests. If you're serving models and need more control than a basic REST wrapper, Serve is worth evaluating.

Ray Tune: Probably the most mature of the higher-level libraries. Hyperparameter search with distributed execution, support for most major search algorithms, and good integration with PyTorch and TensorFlow. If you're doing serious HPO work, this is one of the better open-source options available.

Ray Data: I'm more cautious here. It's actively developed, but the recent patch release (2.54.1) disabled the hanging-issue detector outright because it was causing the scheduling loop to block and "severely degrade pipeline performance." That's the kind of thing that makes me want to test Data pipelines thoroughly before putting them in production.

KubeRay: The Kubernetes operator for Ray has been getting consistent attention. The recent addition of a standalone KubeRay IPPR (in-place pod replacement) provider is a sign that production Kubernetes deployments are a genuine priority. If you're running on K8s, this is the right way to deploy Ray.

Who Should Use This

You should use Ray if:

- You have Python ML workloads that are hitting single-node limits and you want to scale without rewriting everything
- You're building LLM serving infrastructure and need more than a simple FastAPI wrapper
- You need distributed hyperparameter search and want something that integrates cleanly with PyTorch
- Your team is already Kubernetes-native and wants a managed way to run distributed ML jobs
- You're doing reinforcement learning at scale — RLlib is one of the most complete open-source RL libraries available

You should probably not use Ray if:

- You need a simple job queue — Celery or even basic multiprocessing will have lower operational overhead
- You're early-stage and your biggest problem isn't scale — the complexity cost is real
- You want a fully managed, zero-ops experience — Ray is infrastructure, not a service (though Anyscale offers that if you want to pay for it)
- Your workload is primarily data engineering rather than ML — Spark or DuckDB might be a better fit depending on the scale

Concerns and Limitations

Let me be direct about the things that give me pause.

3,583 open issues. That's a lot. Some of it is expected for a project this large and widely used, but it also means you will hit edge cases. The deflaking commits in the recent history — multiple PRs specifically about making tests less flaky — suggest there are real reliability rough edges, particularly in Train and Serve. Before you commit to Ray for a production workload, I'd strongly recommend running your specific use case through a load test.

The surface area is enormous. Ray Core, Data, Train, Tune, Serve, RLlib, Workflows — each of these is essentially a separate product. The documentation is good, but navigating the ecosystem is genuinely confusing, especially when you're trying to figure out which combination of libraries you actually need. I've seen teams adopt Ray for one thing and then slowly accumulate dependencies on five other Ray libraries without fully understanding the operational implications.

Anyscale's involvement is a double-edged sword. Anyscale funds most of the core development, and the top contributors are Anyscale employees. That's why the project moves fast and has good support. But it also means the open-source version sometimes feels like it's driving you toward the managed service. The "Get started for free" badge in the README links to Anyscale's platform. That's fine — it's a legitimate business model — but you should go in with eyes open about where the commercial incentives lie.

Debugging distributed applications is still hard. The Ray Dashboard has improved a lot, and the distributed debugger is genuinely useful. But when something goes wrong in a multi-node Ray cluster, the debugging experience is still meaningfully harder than debugging a single-process application. If your team doesn't have experience with distributed systems debugging, budget time for that learning curve.

Python version floor. Requires Python 3.9+. This is reasonable in 2026 but worth noting if you're maintaining older environments.

Verdict

Ray is one of the most important pieces of open-source ML infrastructure available right now. The core distributed runtime is genuinely well-engineered, the ecosystem is broad, and the development velocity is high. For teams that are scaling Python ML workloads, it's probably the most practical choice available that doesn't require vendor lock-in.

But "most practical choice" comes with caveats. Start with Ray Core and one specific higher-level library that solves your immediate problem. Don't try to adopt the whole ecosystem at once. Test your specific workload thoroughly — the open issue count and the recent stability fixes in Data and Serve tell me that production reliability varies by use case. And make sure your team has at least one person who's comfortable thinking about distributed systems, because when Ray breaks, it breaks in distributed ways.

If I were starting a new ML platform project today that needed to scale beyond a single node, I'd use Ray. But I'd be deliberate about which parts I used, and I'd invest time upfront in understanding the operational model before I was under production pressure.

The 42,000 stars aren't wrong. Just go in knowing what you're getting into.

Repository: https://github.com/ray-project/ray

Tags: distributed-systems, machine-learning, python, mlops, llm-serving