cppio

Storage and File Systems for Artificial Intelligence

As AI models move from gigabyte-scale to terabyte-scale parameters, the “I/O wall” has become a genuine bottleneck. Standard file systems often buckle under the pressure of millions of small files (like images for computer vision) or the massive sequential throughput required for checkpointing.

Below is a review of the requirements, designs, and challenges of storage and file systems for AI, along with emerging research directions such as data versioning and hardware-software co-design.

Introduction

The explosion of Deep Learning (DL) has shifted the storage paradigm. Unlike traditional High-Performance Computing (HPC), AI workloads are characterized by:

- Read-heavy, highly random access patterns during training epochs.
- Millions of small files (e.g., images for computer vision) that stress metadata services.
- Bursty, massive sequential writes for model checkpointing.
- A hard requirement to keep expensive GPUs continuously fed with data.

Requirements for AI Storage

To sustain peak GPU utilization, an AI-optimized storage system must meet these criteria:

|Requirement|Description|
|---|---|
|High Throughput|Must feed data fast enough to prevent GPU starvation.|
|Low Latency|Critical for metadata operations (opening/closing millions of small files).|
|Scalability|Linear performance growth as storage capacity increases.|
|POSIX Compliance|Compatibility with standard deep learning frameworks (PyTorch, TensorFlow).|
|Data Locality|Minimizing data movement by bringing computation closer to storage.|

Architecture and Design Patterns

Modern AI storage is moving away from traditional NAS (Network Attached Storage) toward more distributed and tiered architectures.

Distributed Parallel File Systems (PFS)

Systems like Lustre and GPFS (Spectrum Scale) provide high-speed concurrent access to shared data. They separate metadata from data to allow parallel paths for I/O.
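The core idea behind the parallel I/O paths can be sketched as striping: a file is split into fixed-size chunks assigned round-robin to storage targets, so clients can fetch all chunks concurrently. The sketch below is a toy illustration of that layout, not Lustre's actual implementation; the names `Stripe`, `stripe_file`, and `reassemble` are invented for this example.

```python
from dataclasses import dataclass

@dataclass
class Stripe:
    target: int   # index of the object storage target (OST) holding this chunk
    offset: int   # byte offset of the chunk within the original file
    data: bytes

def stripe_file(data: bytes, num_targets: int, stripe_size: int) -> list:
    """Split a byte string into fixed-size stripes assigned round-robin to targets."""
    stripes = []
    for i, off in enumerate(range(0, len(data), stripe_size)):
        stripes.append(Stripe(target=i % num_targets,
                              offset=off,
                              data=data[off:off + stripe_size]))
    return stripes

def reassemble(stripes: list) -> bytes:
    """Clients fetch stripes from all targets in parallel, then reorder by offset."""
    return b"".join(s.data for s in sorted(stripes, key=lambda s: s.offset))
```

With N targets, peak read bandwidth for a large file approaches N times a single target's bandwidth, which is why metadata (the stripe map) is kept on a separate server from the data path.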

Data Tiering and Caching

Because NVMe drives are expensive, many designs use a "Hot/Cold" tiering strategy:

- Hot tier: NVMe/flash holds the actively accessed working set and caches.
- Cold tier: HDD or object storage holds the bulk of the dataset and archived checkpoints.
- Promotion/demotion policies (often recency-based) move data between tiers as access patterns shift.
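A minimal sketch of such a policy, assuming a simple LRU demotion rule (the class `TieredStore` is illustrative, not any production system's API):

```python
from collections import OrderedDict

class TieredStore:
    """Toy two-tier store: a bounded 'hot' NVMe-like tier with LRU demotion
    to an unbounded 'cold' tier, and promotion back to hot on access."""

    def __init__(self, hot_capacity: int):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # key -> value, in LRU order (oldest first)
        self.cold = {}

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)  # mark as most recently used
        self._demote()

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)   # refresh recency on a hot hit
            return self.hot[key]
        value = self.cold.pop(key)      # cold hit: promote back to the hot tier
        self.put(key, value)
        return value

    def _demote(self):
        while len(self.hot) > self.hot_capacity:
            k, v = self.hot.popitem(last=False)  # evict least recently used
            self.cold[k] = v
```

Real systems add asynchronous migration and admission filters, but the promote-on-access / demote-on-pressure cycle is the same.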

Data Pre-fetching and Shuffling

Software-defined storage layers now include “smart” pre-fetching. By predicting which mini-batch the GPU will need next, the system can hide I/O latency by loading data into RAM before it is requested.
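The latency-hiding idea can be sketched with a bounded queue: a background thread loads batches ahead of the consumer, so I/O for batch i+1 overlaps with compute on batch i. This is a toy sketch (the function `prefetching_loader` and its parameters are invented for illustration), not any framework's real loader:

```python
import queue
import threading

def prefetching_loader(load_batch, num_batches, depth=2):
    """Generator that hides I/O latency by loading up to `depth` batches
    ahead on a background thread while the consumer (the GPU step) runs.
    `load_batch` is any callable mapping a batch index to its data."""
    buf = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def producer():
        for i in range(num_batches):
            buf.put(load_batch(i))   # blocks when the look-ahead buffer is full
        buf.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is SENTINEL:
            return
        yield item
```

The `depth` parameter bounds memory use: too small and the GPU stalls on slow reads, too large and RAM fills with batches that are not needed yet.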

Key Challenges

Despite advancements, several “pain points” remain in the AI lifecycle:

Data Versioning and Lineage

As models are retrained, tracking which version of a dataset produced which model is vital for reproducibility. Tools like DVC (Data Version Control) and LakeFS treat data like code.
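The "data like code" idea boils down to content addressing: a dataset's version ID is derived from the bytes of its files, so any change produces a new ID. A toy sketch in that spirit (this is not DVC's actual on-disk format; `dataset_version` is an invented name):

```python
import hashlib

def dataset_version(file_contents: dict) -> str:
    """Derive a deterministic version ID from a mapping of file path -> bytes.
    Sorting paths makes the ID independent of insertion order, so the same
    dataset always hashes to the same version."""
    digest = hashlib.sha256()
    for path in sorted(file_contents):
        digest.update(path.encode())
        digest.update(hashlib.sha256(file_contents[path]).digest())
    return digest.hexdigest()[:12]
```

Storing this short ID alongside a model checkpoint is enough to answer "which data trained this model?" reproducibly.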

Storage for Vector Databases

With the rise of RAG (Retrieval-Augmented Generation), storing and querying high-dimensional vectors (using databases like Milvus or Pinecone) is a new storage frontier that requires specialized indexing (HNSW, IVF).
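To see why specialized indexing matters, consider the IVF structure: vectors are bucketed under their nearest coarse centroid, and a query scans only the `nprobe` closest buckets instead of the whole collection. The sketch below shows the data structure only; real systems like FAISS or Milvus use trained centroids, quantization, and SIMD scans (class and method names here are invented for illustration):

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class IVFIndex:
    """Toy IVF (inverted file) index over a fixed set of coarse centroids."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = {i: [] for i in range(len(centroids))}  # inverted lists

    def add(self, vec):
        # Bucket the vector under its nearest centroid.
        i = min(range(len(self.centroids)),
                key=lambda c: l2(vec, self.centroids[c]))
        self.lists[i].append(vec)

    def search(self, query, k=1, nprobe=1):
        # Scan only the nprobe buckets whose centroids are closest to the query.
        probe = sorted(range(len(self.centroids)),
                       key=lambda c: l2(query, self.centroids[c]))[:nprobe]
        candidates = [v for c in probe for v in self.lists[c]]
        return sorted(candidates, key=lambda v: l2(query, v))[:k]
```

The storage implication: inverted lists are large, sequentially scanned blobs, while HNSW graphs are pointer-chasing structures, so the two index families stress storage media very differently.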

Near-Data Processing (NDP)

Instead of moving data to the GPU for simple preprocessing (like resizing images), “SmartSSDs” perform these tasks directly on the storage controller to save PCIe bandwidth.
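The bandwidth argument is simple arithmetic: a raw 4K RGB frame is roughly 25 MB, while a typical 224x224 CNN input is about 150 KB, so resizing on the device cuts PCIe traffic by more than two orders of magnitude (the function name below is invented for this back-of-envelope calculation):

```python
def pcie_bytes(width, height, channels=3):
    """Bytes transferred over PCIe for one uncompressed 8-bit image."""
    return width * height * channels

full = pcie_bytes(3840, 2160)    # raw 4K frame: ~24.9 MB
resized = pcie_bytes(224, 224)   # typical CNN input: ~150 KB
savings = full / resized         # ~165x less data crossing the bus
```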

Case Study: Industry Implementations

Meta: The Tectonic Ecosystem

Meta’s storage strategy is a prime example of disaggregated storage—separating compute from storage to scale each independently.

OpenAI: Co-designed Supercomputing

OpenAI’s approach is characterized by deep integration with Microsoft Azure and a focus on predictable scaling.

DeepSeek: The Fire-Flyer File System (3FS)

While Meta and OpenAI often rely on established or proprietary cloud-native storage, DeepSeek developed 3FS from the ground up to exploit the performance of modern NVMe SSDs and RDMA (Remote Direct Memory Access) networks.

Disaggregated Architecture

3FS employs a disaggregated storage model, decoupling compute nodes from storage nodes. This allows thousands of SSDs across a distributed network to aggregate their bandwidth, providing a “locality-oblivious” shared storage layer. In production, DeepSeek has demonstrated aggregate read throughput exceeding 6.6 TiB/s.
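The appeal of bandwidth aggregation is that it scales roughly linearly with the number of SSDs. A back-of-envelope sketch, using hypothetical per-device numbers (not DeepSeek's published hardware configuration) that happen to land near the cited figure:

```python
def aggregate_read_tib_s(nodes, ssds_per_node, gib_per_ssd):
    """Idealized aggregate read bandwidth of a disaggregated storage cluster,
    assuming the network fabric is not the bottleneck (hypothetical inputs)."""
    return nodes * ssds_per_node * gib_per_ssd / 1024  # GiB/s -> TiB/s

# e.g. 180 nodes x 16 NVMe SSDs x 2.4 GiB/s sustained reads each:
peak = aggregate_read_tib_s(180, 16, 2.4)   # ~6.75 TiB/s
```

In practice, fabric oversubscription and replication traffic shave this down, which is why RDMA and careful network design are central to the 3FS approach.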

Technical Innovations

Key design choices in 3FS include:

- Strong consistency via CRAQ (Chain Replication with Apportioned Queries), which allows reads from any replica in the chain without sacrificing correctness.
- A stateless metadata service backed by the transactional key-value store FoundationDB, sidestepping the metadata bottleneck of small-file workloads.
- Native RDMA data paths (InfiniBand/RoCE) alongside a FUSE interface for compatibility with existing tooling.
- An SSD-based KVCache that offloads inference-time key-value caches from GPU memory.

Performance Benchmarks

3FS is not just theoretical: the 6.6 TiB/s aggregate read throughput cited above was reported from a production training cluster, sustained alongside live training traffic.

Specialized AI file systems

The emergence of 3FS represents a shift toward specialized AI file systems. Unlike general-purpose distributed systems (like HDFS or Ceph), 3FS is uncompromisingly designed for the “Read-Heavy, Random-Access” nature of AI training and the “High-Throughput, Massive-Capacity” needs of LLM inference.

Comparative Analysis of Architectures

|Feature|Meta (Tectonic/Shift)|OpenAI (Azure/Ray)|DeepSeek (3FS)|
|---|---|---|---|
|Primary Goal|Efficiency & Global Scale|Predictability & Iteration Speed|Raw Throughput & Strong Consistency|
|Metadata Strategy|Distributed Key-Value (ZippyDB)|Centrally Managed (Azure Infrastructure)|Stateless KV (FoundationDB)|
|Data Access|FUSE-based API + Parallel NFS|Ray Object Store + Blob Storage|Native RDMA + FUSE Interface|
|Caching Layer|Tectonic-Shift (Flash Tier)|GPU-Local RAM & NVMe Buffers|SSD-based KVCache for Inference|
|Open Source|High (Contributes to OCP/PyTorch)|Low (Proprietary stack/Azure exclusive)|High (Fully Open Sourced 3FS)|

Comparative Analysis: Consistency and Networking

|Feature|Meta (Tectonic)|OpenAI (Azure/Ray)|DeepSeek (3FS)|
|---|---|---|---|
|Consistency|Eventual/Tunable|Managed Cloud-Consistent|Strong (via CRAQ)|
|Network Fabric|Standard Ethernet/IP|InfiniBand / Azure Fabric|RDMA (InfiniBand/RoCE)|
|Small File Ops|Layered Metadata|Object Storage Hydration|Stateless KV Metadata|
|Key Advantage|Global Namespace Scale|Rapid Iteration & Ease|Raw I/O Throughput/Efficiency|
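The "Strong (via CRAQ)" entry is worth unpacking: in CRAQ, writes propagate down a replica chain and commit at the tail, but every node tracks clean vs. dirty versions, so reads can be served from any replica rather than only the tail. A heavily simplified, synchronous sketch of that read path (class names invented; real CRAQ is asynchronous and versioned):

```python
class CRAQNode:
    """Toy CRAQ replica: tracks committed ('clean') and in-flight ('dirty')
    values per key."""
    def __init__(self):
        self.clean = {}
        self.dirty = {}

class CRAQChain:
    def __init__(self, length=3):
        self.nodes = [CRAQNode() for _ in range(length)]

    def write(self, key, value):
        # Propagate head -> tail, marking the version dirty along the way.
        for node in self.nodes[:-1]:
            node.dirty[key] = value
        # The tail commits the write...
        self.nodes[-1].clean[key] = value
        # ...and acks back up the chain, letting replicas mark it clean.
        for node in self.nodes[:-1]:
            node.clean[key] = node.dirty.pop(key)

    def read(self, key, node_index=0):
        node = self.nodes[node_index]
        if key in node.dirty:                      # uncommitted version seen:
            return self.nodes[-1].clean.get(key)   # ask the tail what is committed
        return node.clean[key]
```

Spreading reads across the whole chain, instead of funneling them through the tail as plain chain replication does, is what lets a read-heavy training workload keep strong consistency without a throughput cliff.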

Discussion: The Convergence of Storage and Compute

The case studies suggest that for state-of-the-art AI, storage is no longer a “passive” repository. It has become an active part of the training loop.

Conclusion

The future of AI storage lies in the convergence of Object Storage scale with File System performance. Solving the I/O bottleneck is no longer just a hardware problem, but a coordination challenge between the data loader, the network fabric, and the underlying storage media.