Sailor: Automated ML Training System for the Cloud

PyTorch DeepSpeed GKE Docker

The Challenge

Training large language models in the cloud is expensive and complex. Many existing systems suffer from slow fault-tolerance mechanisms, and it is difficult to find optimal resource configurations and to adapt to node failures, especially when using cheaper preemptible instances.

Solution: Sailor System

Sailor introduces a comprehensive system for automating ML training in the cloud with two main innovations: elastic fault-tolerant mechanisms and an optimal resource planner.

Sailor system architecture: multi-cluster coordination with a master controller

1. Elastic Fault-Tolerant Mechanism

We developed a novel in-CPU checkpointing mechanism that enables fast recovery from failures and preemptions. Unlike traditional approaches that write checkpoints to disk or remote storage, our method keeps checkpoint state in host CPU memory, so recovery becomes a fast in-memory operation rather than a slow storage read.
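As a rough illustration of the idea (not Sailor's actual API; class and method names below are hypothetical), checkpoints can be held as in-memory snapshots so that restarting after a preemption avoids any disk or remote-storage round trip. In a real system the `save` step would be a GPU-to-pinned-host-memory copy; here it is simulated with a deep copy of a plain state dict:

```python
import copy

class InMemoryCheckpointer:
    """Hypothetical sketch: keep the latest checkpoint in CPU memory."""

    def __init__(self):
        self._snapshot = None
        self._step = -1

    def save(self, step, state):
        # In a real system: GPU -> pinned host-memory copy.
        # Simulated here with a deep copy of the state dict.
        self._snapshot = copy.deepcopy(state)
        self._step = step

    def restore(self):
        # Recovery is a memory copy, not a storage read.
        if self._snapshot is None:
            raise RuntimeError("no checkpoint available")
        return self._step, copy.deepcopy(self._snapshot)

# Simulated training loop, checkpointing every step.
ckpt = InMemoryCheckpointer()
state = {"weights": [0.0, 0.0], "optimizer": {"lr": 0.1}}
for step in range(5):
    state["weights"] = [w + 0.1 for w in state["weights"]]
    ckpt.save(step, state)

# After a simulated failure, resume from the last snapshot.
last_step, restored = ckpt.restore()
print(last_step, restored["weights"])
```

The deep copy matters: restoring must hand back an independent copy so a failing worker cannot corrupt the snapshot itself.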

2. Optimal Resource Planner

Our optimization algorithm intelligently navigates the cloud resource search space to find the most cost-effective training configurations.
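The core search idea can be sketched as follows. This is an illustrative toy, not Sailor's actual planner: the instance names, prices, throughputs, and the flat 90% scaling-efficiency model are all made up for the example.

```python
# Hypothetical catalog: name -> (hourly price in $, per-GPU throughput in samples/s).
INSTANCES = {
    "a100-spot":      (1.10, 400),
    "a100-on-demand": (3.70, 400),
    "v100-spot":      (0.70, 180),
}

def plan(target_throughput, max_gpus=64):
    """Return the cheapest (instance, gpu_count, hourly_cost) meeting the target."""
    best = None
    for name, (price, tput) in INSTANCES.items():
        for n in range(1, max_gpus + 1):
            # Toy scaling model: flat 90% efficiency when aggregating GPUs.
            effective = tput * n * 0.9
            if effective < target_throughput:
                continue
            cost = price * n  # $ per hour for n GPUs of this type
            if best is None or cost < best[2]:
                best = (name, n, cost)
            break  # smallest feasible n is cheapest for this type
    return best

config = plan(target_throughput=2000)
print(config)
```

A real planner would search a much larger space (instance types, zones, parallelization strategies) and would have to model preemption risk, but the structure — enumerate feasible configurations, cost each one, keep the minimum — is the same.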

Results & Impact

Our evaluation demonstrates significant improvements over existing approaches.

Real-World Impact

Sailor enables researchers and organizations to train large language models at a fraction of the cost, making advanced ML capabilities more accessible. By efficiently leveraging preemptible VMs, which can be 60-90% cheaper than on-demand instances, the system democratizes access to large-scale ML training.
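To make the savings concrete, here is a back-of-the-envelope comparison using hypothetical prices (the 70% discount sits inside the 60–90% range quoted above):

```python
# Hypothetical prices for illustration only.
on_demand_hourly = 3.70   # $ per GPU-hour, on demand
spot_discount = 0.70      # 70% cheaper, within the 60-90% range
gpu_hours = 10_000        # size of a hypothetical training run

on_demand_cost = on_demand_hourly * gpu_hours
spot_cost = on_demand_cost * (1 - spot_discount)
print(on_demand_cost, spot_cost)
```

Under these assumptions the same run drops from about $37,000 to about $11,100, provided the system can ride out preemptions without losing progress, which is exactly what the fault-tolerant checkpointing enables.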

Open Source

This work has been continued and expanded by the Efficient Architectures and Systems Lab at ETH Zürich. The project is now open source and available on GitHub:

View on GitHub