By Petros Koutoupis, VDURA
With all the excitement around artificial intelligence and machine learning, it's easy to lose sight of which high-performance computing storage requirements are essential to deliver real, transformative value to your organization.
When evaluating a data storage solution, one of the most common performance metrics is input/output operations per second (IOPS). It has long been the standard for measuring storage performance, and depending on the workload, a system's IOPS can be critical.
In practice, when a vendor advertises IOPS, they're really showcasing how many discontiguous 4 KiB reads or writes the system can handle under the worst-case scenario of fully random I/O. Measuring storage performance by IOPS is only meaningful if the workloads are IOPS-intensive (e.g., databases, virtualized environments, or web servers). But as we move into the era of AI, the question remains: does IOPS still matter?
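To make that worst-case pattern concrete, here is a toy random-read probe in Python. It is a sketch under stated assumptions: the test file path is a placeholder you must pre-create, the reads may be served from the OS page cache (real benchmarks such as fio bypass it with direct I/O), and a single thread will not saturate a modern device.

```python
import os
import random
import time

# Toy random-read probe. Assumptions: "testfile.bin" already exists and is
# large (e.g., 1 GiB); reads may hit the page cache, so real measurements
# should use a dedicated benchmark tool with direct I/O instead.
PATH = "testfile.bin"   # placeholder path
BLOCK = 4096            # 4 KiB, the block size vendors typically quote
DURATION = 5.0          # seconds to sample

blocks = os.path.getsize(PATH) // BLOCK
fd = os.open(PATH, os.O_RDONLY)

ops = 0
start = time.monotonic()
while time.monotonic() - start < DURATION:
    # One discontiguous 4 KiB read at a random aligned offset.
    os.pread(fd, BLOCK, random.randrange(blocks) * BLOCK)
    ops += 1
os.close(fd)

elapsed = time.monotonic() - start
print(f"{ops / elapsed:,.0f} IOPS ({ops * BLOCK / elapsed / 1e6:.1f} MB/s)")
```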
A Breakdown of Your Standard AI Workload
AI workloads run across the full data lifecycle, and each stage puts its own spin on GPU compute (with CPUs supporting orchestration and preprocessing), storage, and data management resources. Here are some of the most common types you'll come across when building and rolling out AI solutions.
AI workflows (source: VDURA)
Data Ingestion & Preprocessing
During this stage, raw data is collected from sources such as databases, social media platforms, IoT devices, and APIs, then fed into AI pipelines to prepare it for analysis. Before that analysis can happen, however, the data must be cleaned: removing inconsistencies and corrupt or irrelevant entries, filling in missing values, and aligning formats (such as timestamps or units of measurement), among other tasks.
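As a sketch of what this cleanup can look like in practice, the following pandas snippet performs a few of the tasks just described. The file names and column names ("ts", "sensor_id", "value") are hypothetical stand-ins for whatever the raw feed actually provides.

```python
import pandas as pd

# Hypothetical raw feed; "ts", "sensor_id", and "value" are placeholder columns.
raw = pd.read_csv("raw_events.csv")

clean = (
    raw.drop_duplicates()                # remove repeated records
       .dropna(subset=["sensor_id"])     # discard rows missing a key field
       .assign(
           # Align timestamps to a single timezone-aware format.
           ts=lambda df: pd.to_datetime(df["ts"], utc=True, errors="coerce"),
           # Fill missing numeric values with a simple median imputation.
           value=lambda df: df["value"].fillna(df["value"].median()),
       )
       .dropna(subset=["ts"])            # drop rows whose timestamps failed to parse
)

# Columnar output for the training pipeline (requires pyarrow or fastparquet).
clean.to_parquet("clean_events.parquet")
```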
Model Training
After the data is prepped, it's time for the most demanding phase: training. Here, large language models (LLMs) are built by processing data to spot patterns and relationships that drive accurate predictions. This stage leans heavily on high-performance GPUs, with frequent checkpoints to storage so training can quickly recover from hardware or job failures. In many cases, some degree of fine-tuning or similar adjustments may also be part of the process.
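A minimal sketch of that checkpointing pattern, assuming a PyTorch training loop: `model`, `optimizer`, and `step` come from the surrounding job, and the path is a placeholder. Every participating node writing its state at the same moment is precisely what creates the bursty storage demand discussed later.

```python
import torch

def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    # All ranks (or a designated writer, depending on strategy) flush state
    # at once, producing a large, bursty, throughput-bound write.
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    # Resume after a hardware or job failure from the last saved state.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```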
Fine-Tuning
Model training typically involves building a foundation model from scratch on large datasets to capture broad, general knowledge. Fine-tuning then refines this pre-trained model for a specific task or domain using smaller, specialized datasets, improving its performance.
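One common fine-tuning pattern is sketched below under assumptions: freeze the pre-trained backbone so its broad knowledge stays intact, and train only a small task-specific head. The `backbone` module, feature dimension, and class count are all placeholders.

```python
import torch.nn as nn

def build_finetune_model(backbone: nn.Module, feat_dim: int = 768, num_classes: int = 10):
    # Freeze the pre-trained weights; only the new head will be updated.
    for p in backbone.parameters():
        p.requires_grad = False
    head = nn.Linear(feat_dim, num_classes)  # small task-specific layer
    return nn.Sequential(backbone, head)
```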
Model Inference
Once trained, the AI model can make predictions on new, rather than historical, data by applying the patterns it has learned to generate actionable outputs. For example, if you show the model a picture of a dog it has never seen before, it will predict: "That is a dog."
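As an illustration, here is that dog-photo prediction with an off-the-shelf classifier standing in for whatever model was actually trained; the image path is a placeholder.

```python
import torch
from PIL import Image
from torchvision import models

# Off-the-shelf ImageNet classifier as a stand-in for a custom-trained model.
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()                                   # inference mode: no dropout/BN updates

# Preprocess a never-before-seen image with the pipeline the model expects.
img = weights.transforms()(Image.open("new_dog_photo.jpg")).unsqueeze(0)

with torch.no_grad():                          # no gradients needed for prediction
    idx = model(img).argmax(dim=1).item()
print(weights.meta["categories"][idx])         # e.g., a dog breed label
```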
How High-Performance File Storage is Affected
An HPC parallel file system breaks data into chunks and distributes them across multiple networked storage servers. This allows many compute nodes to access the data concurrently at high speeds. As a result, this architecture has become essential for data-intensive workloads, including AI.
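The placement logic can be sketched in a few lines. This toy round-robin layout (the chunk size and server count are made-up values) ignores the redundancy and metadata handling that real parallel file systems add, but it shows why a large sequential transfer engages every server at once.

```python
CHUNK = 1 << 20   # 1 MiB stripe unit (illustrative)
SERVERS = 8       # number of storage targets (illustrative)

def locate(offset: int) -> tuple[int, int]:
    """Map a file byte offset to (server index, offset within that server's object)."""
    chunk_idx = offset // CHUNK
    server = chunk_idx % SERVERS                             # round-robin placement
    local = (chunk_idx // SERVERS) * CHUNK + offset % CHUNK  # position on that server
    return server, local

# Ten consecutive 1 MiB chunks land on all eight servers before wrapping,
# so sequential bandwidth aggregates across every one of them:
print([locate(i * CHUNK)[0] for i in range(10)])  # -> [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```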
During the data ingestion phase, raw data comes from many sources, and parallel file systems may play a limited role. Their importance increases during preprocessing and model training, where high-throughput systems are needed to quickly load and transform large datasets. This reduces the time required to prepare datasets for both training and inference.
Checkpointing during model training periodically saves the current state of the model to protect against progress loss due to interruptions. This process requires all nodes to save the model's state simultaneously, demanding high peak storage throughput to keep checkpointing time minimal. Insufficient storage performance during checkpointing can lengthen training times and increase the risk of data loss.
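A back-of-the-envelope calculation shows why. The figures below are illustrative assumptions, not measurements:

```python
state_tb = 2.0        # total model + optimizer state to dump, in TB (assumption)
target_seconds = 60   # acceptable training pause per checkpoint (assumption)

required_gbps = state_tb * 1000 / target_seconds
print(f"~{required_gbps:.0f} GB/s of aggregate write throughput needed")
# -> ~33 GB/s, arriving as one synchronized burst from every compute node.
```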
It's evident that AI workloads are driven by throughput, not IOPS. Training large models requires streaming massive sequential datasets, often gigabytes to terabytes in size, into GPUs. The real bottleneck is aggregate bandwidth (GB/s or TB/s), rather than handling millions of small, random I/O operations per second. Inefficient storage can create bottlenecks, leaving GPUs and other processors idle, slowing training, and driving up costs.
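Some quick arithmetic makes the gap concrete: even a headline-grabbing IOPS figure translates into modest bandwidth at 4 KiB per operation.

```python
iops = 1_000_000   # an impressive-sounding spec (illustrative)
block = 4096       # 4 KiB per random operation

print(f"{iops * block / 1e9:.1f} GB/s from 1M 4 KiB IOPS")  # ~4.1 GB/s
# A training job streaming multi-TB datasets may need tens of GB/s,
# delivered as large sequential transfers, not small random ones.
```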
Requirements based solely on IOPS can significantly inflate the storage budget or rule out the most suitable architectures. Parallel file systems, on the other hand, excel in throughput and scalability. To meet specific IOPS targets, production file systems are often over-engineered, adding cost or unnecessary capabilities, rather than being designed for maximum throughput.
Conclusion
AI workloads demand high-throughput storage rather than high IOPS. While IOPS has long been a standard metric, modern AI (particularly during data preprocessing, model training, and checkpointing) relies on moving massive sequential datasets efficiently to keep GPUs and compute nodes fully utilized. Parallel file systems provide the necessary scalability and bandwidth to handle these workloads effectively, while focusing solely on IOPS can lead to over-engineered, costly solutions that don't optimize training performance. For AI at scale, throughput and aggregate bandwidth are the true drivers of productivity and cost efficiency.
Author: Petros Koutoupis has spent more than 20 years in the data storage industry, working for companies that include Xyratex, Cleversafe/IBM, Seagate, Cray/HPE and, now, AI and HPC data platform company VDURA.