Many modern applications run inside containers, which package software and its dependencies into a portable, consistent execution environment. Unlike virtual machines, containers share the host OS kernel, which makes them lightweight and efficient.
Managing hundreds or thousands of containers by hand is impractical. Kubernetes automates container deployment, scaling, and orchestration across clusters of machines. Its smallest deployable unit is the Pod, which groups one or more containers that work together and share network and storage resources.
However, Kubernetes does not support live pod migration, that is, relocating a running Pod from one node to another without downtime. This is a major limitation for stateful workloads such as AI inference, databases, and real-time analytics.
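Currently, when a Pod must be moved because of node failure, autoscaling, or resource rebalancing, Kubernetes follows a terminate-and-recreate model: the existing Pod is terminated and a replacement is scheduled on another node, where its containers start from scratch.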
For stateful applications, this results in downtime, performance degradation, and potential data loss. Live migration would:
✔ Preserve application state, network connections, and execution progress.
✔ Enable smoother autoscaling and resource optimization.
✔ Improve fault tolerance without service interruption.
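To make the difference concrete, here is a toy simulation of the two models. It is only a sketch: `ToyPod`, `terminate_and_recreate`, and `live_migrate` are hypothetical names invented for this illustration, not Kubernetes APIs, and the "checkpoint" is just a copy of a Python dictionary standing in for a real checkpoint image.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyPod:
    """Hypothetical stand-in for a Pod; not a Kubernetes API object."""
    name: str
    node: str
    # In-memory state that is lost when the process is killed.
    state: dict = field(default_factory=lambda: {"requests_served": 0})

    def serve(self, n: int) -> None:
        self.state["requests_served"] += n


def terminate_and_recreate(pod: ToyPod, target_node: str) -> ToyPod:
    """Model today's behaviour: the old Pod is deleted and a fresh one starts."""
    return ToyPod(name=pod.name, node=target_node)  # state starts from scratch


def live_migrate(pod: ToyPod, target_node: str) -> ToyPod:
    """Model the desired behaviour: checkpoint the state, restore it elsewhere."""
    checkpoint = copy.deepcopy(pod.state)  # stands in for a checkpoint image
    return ToyPod(name=pod.name, node=target_node, state=checkpoint)


pod = ToyPod(name="inference-0", node="node-a")
pod.serve(120)

recreated = terminate_and_recreate(pod, "node-b")
migrated = live_migrate(pod, "node-b")

print(recreated.state["requests_served"])  # 0: progress was lost
print(migrated.state["requests_served"])   # 120: progress was preserved
```

A real migration also has to carry over network connections and open files, which this toy model ignores.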
runc and CRIU capture the container state (memory, CPU, open connections). No Kubernetes-native solution for live pod migration exists today.
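As a rough illustration of the container-level building block, the sketch below assembles `runc checkpoint` and `runc restore` command lines. The helper names, the container ID `demo`, and the paths are assumptions made up for this example. The script only prints the commands; actually running them typically requires root privileges, an existing runc container, and CRIU installed on the node. Copying the checkpoint image to another node is not shown.

```python
import shlex
from typing import List


def checkpoint_cmd(container_id: str, image_dir: str,
                   leave_running: bool = False,
                   tcp_established: bool = True) -> List[str]:
    """Build a `runc checkpoint` command (runc delegates the dump to CRIU)."""
    cmd = ["runc", "checkpoint", "--image-path", image_dir]
    if tcp_established:
        cmd.append("--tcp-established")  # include established TCP connections
    if leave_running:
        cmd.append("--leave-running")    # keep the container alive after the dump
    cmd.append(container_id)
    return cmd


def restore_cmd(container_id: str, image_dir: str, bundle_dir: str,
                tcp_established: bool = True) -> List[str]:
    """Build a `runc restore` command that resumes from a checkpoint image."""
    cmd = ["runc", "restore", "--image-path", image_dir,
           "--bundle", bundle_dir, "--detach"]
    if tcp_established:
        cmd.append("--tcp-established")
    cmd.append(container_id)
    return cmd


# Hypothetical container ID and paths, printed rather than executed.
print(shlex.join(checkpoint_cmd("demo", "/tmp/ckpt")))
print(shlex.join(restore_cmd("demo", "/tmp/ckpt", "/tmp/bundle")))
```

These commands operate on a single container on a single node. Coordinating them across the containers of a Pod, across nodes, and with the Kubernetes control plane is the missing piece.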
Our research aims to close this gap.
We analyzed Kubernetes Enhancement Proposals (KEPs) to understand how new features integrate with the ecosystem.
A proof-of-concept live migration system that:
The design document for one of the PoCs is here: https://docs.google.com/document/d/1n4tEj2LaNzL7lkq6jqTy4O-3v2dTfrn0lIdWDaeifnM/edit?tab=t.0#heading=h.nxlv0abv8hql
We are currently improving:
Join the discussion and contribute to the project! 🚀