When it comes to distributed AI training, I've noticed that people in the web2 AI community like to label it a "false proposition." Their reasoning: computing power can be aggregated, but effective distributed collaboration carries prohibitive bandwidth costs. @0G_labs recently published the DiLoCoX paper, which seems aimed squarely at this problem. Let's go through it in detail:
1) First, why is distributed training considered a "false proposition"? The core contradiction is simple: you want to replace 100 A100s with 100 cheap GPUs, seemingly saving 90% of the hardware cost. But to keep those 100 GPUs synchronized during training, each epoch requires exchanging terabytes of gradient data over the network.
Traditional solutions call for something like 100Gbps of dedicated interconnect, and a data-center-grade 100Gbps network can cost hundreds of thousands of dollars per month. In the end, the money you save on GPUs gets spent entirely on bandwidth, and sometimes you end up paying even more. By this logic, the cost hasn't been eliminated, only shifted from hardware to bandwidth, so the problem remains unsolved. That is the crux of the "false proposition" argument.
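To make that concrete, here is a back-of-envelope comparison with purely illustrative monthly prices (my own assumptions, not figures from any vendor or from 0G):

```python
# Back-of-envelope with made-up monthly prices -- all figures are illustrative assumptions.
a100_cluster   = 100 * 4_000   # 100 rented A100s (hypothetical $/GPU/month)
cheap_cluster  = 100 * 400     # 100 consumer GPUs, ~90% cheaper (hypothetical)
leased_100gbps = 400_000       # data-center-grade 100Gbps line (hypothetical)

centralized = a100_cluster                     # co-located; fast interconnect comes with it
distributed = cheap_cluster + leased_100gbps   # cheap GPUs, very expensive pipes

print("GPU savings:    ", a100_cluster - cheap_cluster)   # 360,000
print("Bandwidth bill: ", leased_100gbps)                  # 400,000
print("Net extra cost: ", distributed - centralized)       # +40,000, i.e. worse off
```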
2) 0G's DiLoCoX paper attracted attention because the team claims to have trained a 107B-parameter model over a 1Gbps network (ordinary office bandwidth), with a 357x speedup over the conventional AllReduce approach. That number is genuinely startling: 1Gbps is 100 times less bandwidth than 100Gbps, yet training speed supposedly improved 357-fold?
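Before digging into the techniques, it helps to see why a speedup can exceed the raw 100x bandwidth gap at all: if the baseline is completely communication-bound, then synchronizing less often, compressing what you do send, and hiding that communication behind compute all multiply together. A toy timing model with assumed numbers (not measurements from the paper):

```python
# Toy timing model with assumed numbers -- NOT the paper's measurements.
compute_per_step = 1.0      # arbitrary unit of local compute time per step
comm_per_step    = 500.0    # assumed: per-step AllReduce on 1Gbps is badly comm-bound

sync_every  = 100           # assumed: synchronize once per 100 local steps
compression = 10            # assumed: 10x gradient compression on sync traffic
comm_per_sync = comm_per_step / compression

# Baseline: every step pays full communication, nothing is overlapped.
allreduce_time = sync_every * (compute_per_step + comm_per_step)
# Relaxed scheme: rare, compressed syncs whose communication hides behind compute.
relaxed_time = max(sync_every * compute_per_step, comm_per_sync)

print("speedup ~", round(allreduce_time / relaxed_time))  # far beyond 100x under these assumptions
```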
How did they achieve this? A brief study revealed that this approach incorporates four optimizations:
Pipeline Parallelism: splits model processing into segments so each node only handles one stage;
Dual Optimizer Policy: a two-optimizer scheme that reduces how often workers must synchronize (sketched below);
One-Step-Delay Overlap: lets communication and computation proceed in parallel instead of waiting on each other;
Adaptive Gradient Compression: intelligently compresses gradients before they hit the network.
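To make the "dual optimizer + infrequent synchronization" idea concrete, here is a minimal DiLoCo-style simulation. This is my own sketch under stated assumptions (the worker count, local-step count, and learning rates are made up), not DiLoCoX's actual code:

```python
# Minimal DiLoCo-style sketch: each worker runs H local "inner" steps, then an
# "outer" optimizer applies the averaged parameter delta as a pseudo-gradient.
# Only the outer step needs the network. All hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
K, dim = 4, 10                         # 4 simulated workers, 10 toy parameters
target = rng.normal(size=dim)          # toy objective: minimize ||w - target||^2

global_w = np.zeros(dim)
inner_lr, outer_lr, H = 0.1, 0.7, 20   # illustrative assumptions

for outer_step in range(10):
    local_ws = [global_w.copy() for _ in range(K)]
    # Inner optimizer: H purely local steps per worker, zero communication.
    for k in range(K):
        for _ in range(H):
            grad = 2 * (local_ws[k] - target) + rng.normal(scale=0.1, size=dim)
            local_ws[k] -= inner_lr * grad
    # Outer optimizer: average each worker's delta and apply it once.
    pseudo_grad = np.mean([global_w - w for w in local_ws], axis=0)
    global_w -= outer_lr * pseudo_grad  # the only step that touches the network

print("distance to optimum:", float(np.linalg.norm(global_w - target)))
```

In a real deployment the averaging would be a rare, compressed collective operation across machines rather than a Python loop; the point is that the network is touched once per H local steps instead of on every step.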
In simple terms, the original requirement of "real-time strong synchronization" has been changed to "asynchronous weak synchronization," and "full data transmission" has been replaced with "compressed incremental transmission."
An analogy: the traditional approach is like a 100-person real-time video conference where everyone's every move has to be live-streamed to everyone else. DiLoCoX is more like each participant recording locally and only sending keyframes and diffs. Communication traffic drops by roughly 100x while over 99% of the useful information is preserved.
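"Only sending keyframes and changes" maps naturally onto gradient sparsification. Below is a generic top-k compressor, a common scheme in the literature; I'm not claiming it is DiLoCoX's exact algorithm:

```python
# Generic top-k gradient sparsifier: ship only the largest-magnitude entries.
import numpy as np

def topk_compress(grad: np.ndarray, ratio: float = 0.01):
    """Keep the top `ratio` fraction by magnitude; return (indices, values) to transmit."""
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def topk_decompress(idx, vals, size):
    """Rebuild a dense (mostly zero) gradient on the receiving side."""
    out = np.zeros(size)
    out[idx] = vals
    return out

grad = np.random.default_rng(1).normal(size=1_000_000)
idx, vals = topk_compress(grad, ratio=0.01)        # ~100x less data on the wire
restored = topk_decompress(idx, vals, grad.size)
print(f"transmitted {len(vals):,} of {grad.size:,} values")
```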
Why is this feasible? I think the key is their grasp of a defining characteristic of AI training: fault tolerance. Training a model isn't like transferring money, where not even a single digit can be off. A slightly imprecise gradient update or a slightly delayed synchronization has minimal impact on the final model's convergence.
DiLoCoX exploits this fault tolerance to gain orders of magnitude in efficiency at an acceptable loss in accuracy. That's classic engineering thinking: not chasing perfection, but chasing the best cost-effectiveness.
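A standard trick for keeping that accuracy loss "acceptable" is error feedback: whatever the compressor drops this step is remembered locally and re-injected next step, so small errors stay bounded instead of compounding. A minimal sketch of the generic technique (not necessarily how DiLoCoX implements it):

```python
# Error feedback: remember what compression dropped and add it back next time.
# Generic technique for bounding compression error; parameters are illustrative.
import numpy as np

class FeedbackCompressor:
    def __init__(self, size: int, ratio: float = 0.05):
        self.residual = np.zeros(size)   # locally stored compression error
        self.ratio = ratio

    def compress(self, grad: np.ndarray) -> np.ndarray:
        corrected = grad + self.residual                  # re-inject past error
        k = max(1, int(corrected.size * self.ratio))
        idx = np.argpartition(np.abs(corrected), -k)[-k:]
        sent = np.zeros_like(corrected)
        sent[idx] = corrected[idx]                        # what actually goes on the wire
        self.residual = corrected - sent                  # remember what was dropped
        return sent

comp = FeedbackCompressor(size=1_000)
sparse_update = comp.compress(np.random.default_rng(2).normal(size=1_000))
```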
3) But solving the bandwidth problem alone isn't enough; 0G's ambitions are clearly bigger, and their overall architecture makes that obvious: a storage layer that claims to undercut Filecoin at $10/TB, and a DA (data availability) layer purpose-built for AI with gigabyte-per-second-class throughput.
The claimed 100-fold reduction in storage cost comes, frankly, from optimizing specifically for AI training workloads. For example, the terabytes of checkpoints and logs generated during training are usually only needed for a few days and don't require strict "permanent storage."
So a pragmatic "tiered storage" approach is used, providing the appropriate service level only where it's needed: hot data is fast to read and write but pricier; cold data is cheaper but slower; temporary data is deleted once it's no longer needed and is the cheapest of all.
This kind of differentiated pricing hits AI training's cost pain points squarely.
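As a rough illustration of what such a tiering policy could look like in code (entirely my own sketch; the tier names, TTL threshold, and access-frequency cutoff are assumptions, not 0G's actual API or pricing):

```python
# Hypothetical tiering policy: route AI-training artifacts to hot / cold / temporary
# storage by lifetime and access pattern. Thresholds are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    HOT = "hot"     # fast read/write, highest $/TB
    COLD = "cold"   # cheap, slower retrieval
    TEMP = "temp"   # deleted after its TTL, cheapest of all

@dataclass
class Artifact:
    name: str
    ttl_days: int          # how long it must survive
    reads_per_day: float   # expected access frequency

def choose_tier(a: Artifact) -> Tier:
    if a.ttl_days <= 7:         # checkpoints/logs that only matter for days
        return Tier.TEMP
    if a.reads_per_day >= 1:    # actively used datasets
        return Tier.HOT
    return Tier.COLD            # archives, old snapshots

print(choose_tier(Artifact("epoch12.ckpt", ttl_days=3, reads_per_day=0.2)))  # Tier.TEMP
```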
It's clear that 0G has deliberately re-engineered compute, storage, and data flow around the demands of AI training. Even the consensus mechanism has been tuned for AI: they run a modified version of CometBFT, claiming 2,500+ TPS and sub-second finality, specifically adapted to the asynchronous nature of AI workloads.
In other words, 0G isn't simply patching an existing blockchain to support AI; it has designed "AI-native" infrastructure from the ground up. Whether it can ultimately win application-level commercial validation under competition from traditional AI remains to be seen, but this differentiated angle of attack is worth learning from.