Storage and File Systems for Artificial Intelligence
As AI models move from gigabyte-scale to terabyte-scale parameters, the “I/O wall” has become a genuine bottleneck. Standard file systems often buckle under the pressure of millions of small files (like images for computer vision) or the massive sequential throughput required for checkpointing.
Below is a review of the requirements, designs, and challenges of storage and file systems for AI, along with emerging research directions such as data versioning and hardware-software co-design.
Introduction
The explosion of Deep Learning (DL) has shifted the storage paradigm. Unlike traditional High-Performance Computing (HPC), AI workloads are characterized by:
- Massive Datasets: Petabyte-scale collections of unstructured data.
- Random Access Patterns: Shuffling data during training to improve model generalization.
- High Concurrency: Thousands of GPU workers requesting data simultaneously.
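The random-access point above is worth making concrete. A minimal sketch (names are illustrative, not from any framework): most dataloaders reshuffle the sample order every epoch, which turns what could be one sequential scan into millions of small random reads.

```python
import random

def epoch_indices(num_samples: int, seed: int) -> list:
    """Return a per-epoch shuffled read order, as DL dataloaders typically do."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    return order

# Each epoch produces a different random permutation; in a real loader,
# every index becomes an open/read of a separate sample or file.
for epoch in range(2):
    print("epoch", epoch, "read order", epoch_indices(num_samples=8, seed=epoch))
```

Because the permutation changes each epoch, the storage system cannot rely on sequential readahead, which is exactly the access pattern that stresses metadata and small-read paths.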
Requirements for AI Storage
To sustain peak GPU utilization, an AI-optimized storage system must meet these criteria:
|Requirement|Description|
|-----------|-----------|
|High Throughput|Must feed data fast enough to prevent GPU starvation.|
|Low Latency|Critical for metadata operations (opening/closing millions of small files).|
|Scalability|Linear performance growth as storage capacity increases.|
|POSIX Compliance|Compatibility with standard deep learning frameworks (PyTorch, TensorFlow).|
|Data Locality|Minimizing data movement by bringing computation closer to storage.|
Architecture and Design Patterns
Modern AI storage is moving away from traditional NAS (Network Attached Storage) toward more distributed and tiered architectures.
Distributed Parallel File Systems (PFS)
Systems like Lustre and GPFS (Spectrum Scale) provide high-speed concurrent access to shared data. They separate metadata from data to allow parallel paths for I/O.
Data Tiering and Caching
Because NVMe drives are expensive, many designs use a “Hot/Cold” strategy:
- Hot Tier: Local NVMe or burst buffers for active training sets.
- Cold Tier: Object storage (S3/Azure Blob) for long-term archival.
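The hot/cold strategy can be sketched in a few lines. This is a minimal illustration, not a production cache: `cold_fetch` is a hypothetical stand-in for an object-store GET (e.g. S3), and the hot tier is modeled as a local directory.

```python
import os

def tiered_read(name: str, hot_dir: str, cold_fetch) -> bytes:
    """Read from the hot tier (local NVMe cache); on a miss, hydrate from cold storage.

    cold_fetch(name) -> bytes stands in for an object-store GET.
    """
    hot_path = os.path.join(hot_dir, name)
    if os.path.exists(hot_path):          # cache hit: local NVMe latency
        with open(hot_path, "rb") as f:
            return f.read()
    data = cold_fetch(name)               # cache miss: object-store latency
    os.makedirs(hot_dir, exist_ok=True)
    with open(hot_path, "wb") as f:       # hydrate for subsequent epochs
        f.write(data)
    return data
```

The key property is that the cold tier is touched once per object; every later epoch reads from the hot tier at local-flash speed. Real systems add eviction, sharding, and consistency checks on top of this skeleton.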
Data Pre-fetching and Shuffling
Software-defined storage layers now include “smart” pre-fetching. By predicting which mini-batch the GPU will need next, the system can hide I/O latency by loading data into RAM before it is requested.
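The latency-hiding idea can be shown with a minimal sketch, assuming a slow `load_fn` that performs the actual I/O (the names here are illustrative): a background thread keeps a bounded buffer of batches ahead of the consumer, so loading overlaps with compute.

```python
import queue
import threading

def prefetching_loader(load_fn, indices, depth: int = 4):
    """Yield load_fn(i) for each index, with a background thread that
    stays up to `depth` items ahead so I/O overlaps with compute."""
    buf = queue.Queue(maxsize=depth)
    _END = object()

    def producer():
        for i in indices:
            buf.put(load_fn(i))   # blocks when the buffer is full (backpressure)
        buf.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _END:
            return
        yield item
```

The bounded queue is the important design choice: it caps RAM usage while still letting the producer run ahead whenever the GPU is busy.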
Key Challenges
Despite advancements, several “pain points” remain in the AI lifecycle:
- The Small File Problem: Most file systems are optimized for large, sequential reads. Loading millions of 10KB images causes metadata bottlenecks.
- Checkpointing Overhead: Large models (LLMs) save “checkpoints” frequently. This writes hundreds of gigabytes to disk at once, often freezing training for several minutes.
- Data Silos: Data often sits in object storage, but training requires a file interface, leading to “hydration” delays.
- Consistency vs. Performance: Maintaining strict POSIX consistency often slows down distributed training.
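One common mitigation for checkpointing overhead is to decouple the snapshot from the disk write. A hedged sketch (this is one simple approach, not any framework's actual implementation): training stalls only for an in-memory copy of the state, while a background thread persists it.

```python
import pickle
import threading

def checkpoint_async(state: dict, path: str) -> threading.Thread:
    """Snapshot the training state in memory, then write it to disk in a
    background thread so the training loop stalls only for the fast copy."""
    snapshot = dict(state)  # shallow copy; values are assumed immutable here

    def writer():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    t = threading.Thread(target=writer)
    t.start()
    return t  # caller joins before the next checkpoint to bound memory use
```

Real systems refine this with sharded writes (one file per rank), CPU-offloaded tensor copies, and joining the writer thread before the next checkpoint so only one snapshot is in flight.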
Emerging Trends
Data Versioning and Lineage
As models are retrained, tracking which version of a dataset produced which model is vital for reproducibility. Tools like DVC (Data Version Control) and LakeFS treat data like code.
Storage for Vector Databases
With the rise of RAG (Retrieval-Augmented Generation), storing and querying high-dimensional vectors (using databases like Milvus or Pinecone) is a new storage frontier that requires specialized indexing (HNSW, IVF).
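To see why specialized indexes are needed, consider the naive baseline they replace. The sketch below is exact brute-force search by cosine similarity, costing O(N·d) per query; HNSW and IVF exist precisely to avoid this linear scan over billions of vectors.

```python
import math

def top_k(query, vectors, k=2):
    """Exact nearest-neighbour search by cosine similarity.

    Scans every stored vector per query -- the cost that ANN indexes
    (HNSW, IVF) are designed to avoid at scale.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    scored = sorted(enumerate(vectors), key=lambda iv: cos(query, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```

From a storage perspective, the interesting shift is that the index itself (graph edges for HNSW, cluster centroids for IVF) becomes a large persistent structure with its own read patterns, distinct from the raw vectors.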
Near-Data Processing (NDP)
Instead of moving data to the GPU for simple preprocessing (like resizing images), “SmartSSDs” perform these tasks directly on the storage controller to save PCIe bandwidth.
Case Study: Industry Implementations
Meta: Disaggregated Storage at Scale
Meta’s storage strategy is a prime example of disaggregated storage—separating compute from storage to scale each independently.
- Tectonic (The Backbone): Meta replaced HDFS with Tectonic, an exabyte-scale distributed file system. Unlike traditional systems, Tectonic uses a layered metadata approach, storing metadata in a distributed key-value store (ZippyDB). This allows Meta to handle billions of small files without hitting the “Metadata Wall.”
- Tectonic-Shift (The Flash Tier): To combat the latency of HDDs during LLM training, Meta introduced a flash-based caching tier. This tier uses novel cache policies to “absorb” the massive I/O bursts during data loading, reducing power consumption by ~29% while increasing throughput by up to 3x.
- Parallel NFS (Hammerspace): For their 24k GPU clusters (used for Llama 3), Meta co-developed a parallel NFS solution. This allows researchers to perform interactive debugging—where code changes are instantly visible across thousands of nodes—without sacrificing the throughput needed for exabyte-scale data loading.
OpenAI: Co-designed Supercomputing
OpenAI’s approach is characterized by deep integration with Microsoft Azure and a focus on predictable scaling.
- Azure AI Supercomputer: OpenAI and Azure co-designed a purpose-built supercomputer. Their storage architecture relies heavily on Blob Storage for the “Data Lake”, which is then “hydrated” into high-performance local NVMe caches or managed Lustre/GPFS instances during training runs.
- Ray Orchestration: OpenAI uses Ray to coordinate data movement and training across thousands of GPUs. Ray’s distributed object store allows for zero-copy data sharing between tasks on the same node, significantly reducing the memory overhead typically associated with moving large tensors.
- Predictable I/O: A key focus for OpenAI (as noted in the GPT-4 technical report) is ensuring that the software stack behaves predictably at scale. They prioritize “asynchronous, zero-copy data movement” to ensure that the I/O path never stalls the compute kernels.
DeepSeek: The Fire-Flyer File System (3FS)
While Meta and OpenAI often rely on established or proprietary cloud-native storage, DeepSeek developed 3FS from the ground up to exploit the performance of modern NVMe SSDs and RDMA (Remote Direct Memory Access) networks.
Disaggregated Architecture
3FS employs a disaggregated storage model, decoupling compute nodes from storage nodes. This allows thousands of SSDs across a distributed network to aggregate their bandwidth, providing a “locality-oblivious” shared storage layer. In production, DeepSeek has demonstrated aggregate read throughput exceeding 6.6 TiB/s.
Technical Innovations
- CRAQ (Chain Replication with Apportioned Queries): To maintain strong consistency without sacrificing read performance, 3FS uses the CRAQ algorithm. While writes propagate along a chain of storage nodes, any node in the chain can serve a read request, effectively doubling or tripling the available read bandwidth compared to traditional replication.
- Stateless Metadata with FoundationDB: 3FS offloads metadata management to a transactional key-value store (typically FoundationDB). This removes the “Metadata Bottleneck” common in traditional POSIX systems, allowing for rapid file creation and lookups even when dealing with billions of small objects.
- KVCache for Inference: Beyond training, 3FS includes specialized optimizations for KVCache. By providing a high-throughput, flash-based alternative to expensive DRAM caching, 3FS enables larger context windows and higher concurrency during model inference at a fraction of the cost.
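The CRAQ read-path idea can be illustrated with a toy model (a deliberately simplified sketch of the published algorithm, not 3FS code): writes propagate head to tail and are then committed, while any replica can serve reads, falling back to the tail's committed version only when it holds an uncommitted write.

```python
class CraqNode:
    """Toy CRAQ replica: holds a committed version plus an optional
    dirty (not-yet-committed) version."""
    def __init__(self):
        self.committed = None      # (version, value)
        self.dirty = None          # (version, value) awaiting commit

    def apply_write(self, version, value):
        self.dirty = (version, value)

    def commit(self, version):
        if self.dirty and self.dirty[0] == version:
            self.committed, self.dirty = self.dirty, None

    def read(self, tail):
        # A clean replica answers locally; a dirty one asks the tail
        # which version is committed (the "apportioned query").
        if self.dirty is None:
            return self.committed
        return tail.committed

def chain_write(chain, version, value):
    for node in chain:             # propagate head -> tail
        node.apply_write(version, value)
    for node in chain:             # commit after the tail has the write
        node.commit(version)
```

Because every clean replica serves reads locally, read bandwidth scales with chain length instead of being funneled through the tail, which is the property the bullet above credits for the multiplied read throughput.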
DeepSeek’s 3FS is not merely theoretical; it has posted strong benchmark and production results:
- GraySort Performance: 3FS demonstrated the ability to sort 110 TiB of data in roughly 30 minutes.
- Dataloader Efficiency: By enabling high-speed random access, 3FS eliminates the need for complex data pre-fetching or shuffling logic in the training code, simplifying the ML engineer’s workflow.
Specialized AI File Systems
The emergence of 3FS represents a shift toward specialized AI file systems. Unlike general-purpose distributed systems (like HDFS or Ceph), 3FS is uncompromisingly designed for the “Read-Heavy, Random-Access” nature of AI training and the “High-Throughput, Massive-Capacity” needs of LLM inference.
Comparative Analysis of Architectures
|Feature|Meta (Tectonic/Shift)|OpenAI (Azure/Ray)|DeepSeek (3FS)|
|-------|---------------------|-------------------|--------------|
|Primary Goal|Efficiency & Global Scale|Predictability & Iteration Speed|Raw Throughput & Strong Consistency|
|Metadata Strategy|Distributed Key-Value (ZippyDB)|Centrally Managed (Azure Infrastructure)|Stateless KV (FoundationDB)|
|Data Access|FUSE-based API + Parallel NFS|Ray Object Store + Blob Storage|Native RDMA + FUSE Interface|
|Caching Layer|Tectonic-Shift (Flash Tier)|GPU-Local RAM & NVMe Buffers|SSD-based KVCache for Inference|
|Open Source|High (Contributes to OCP/PyTorch)|Low (Proprietary stack/Azure exclusive)|High (Fully Open Sourced 3FS)|
|Feature|Meta (Tectonic)|OpenAI (Azure/Ray)|DeepSeek (3FS)|
|-------|---------------|------------------|--------------|
|Consistency|Eventual/Tunable|Managed Cloud-Consistent|Strong (via CRAQ)|
|Network Fabric|Standard Ethernet/IP|InfiniBand / Azure Fabric|RDMA (InfiniBand/RoCE)|
|Small File Ops|Layered Metadata|Object Storage Hydration|Stateless KV Metadata|
|Key Advantage|Global Namespace Scale|Rapid Iteration & Ease|Raw I/O Throughput/Efficiency|
Discussion: The Convergence of Storage and Compute
The case studies suggest that for state-of-the-art AI, storage is no longer a “passive” repository. It has become an active part of the training loop.
- Metadata is the Bottleneck: Meta and DeepSeek have both moved away from traditional tree-based file systems in favor of database-backed metadata (ZippyDB and FoundationDB, respectively).
- Tiering is Mandatory: No single storage medium (HDD, SSD, or RAM) can cost-effectively satisfy both the capacity and speed requirements of AI.
- FUSE is Evolving: While FUSE (Filesystem in Userspace) was once considered too slow, Meta’s optimizations show it can be viable if backed by a high-performance distributed backend.
Conclusion
The future of AI storage lies in the convergence of Object Storage scale with File System performance. Solving the I/O bottleneck is no longer just a hardware problem, but a coordination challenge between the data loader, the network fabric, and the underlying storage media.