Project Description
GPFS is the foundation of a growing number of scale-out systems, including high-performance computing (HPC), network file servers (NFS), Hadoop, and more. In these scalable systems, the storage subsystem can be configured using Network Shared Disk (NSD) nodes, each with direct-attached storage arrays, as an alternative to SAN-based back-end storage. Although the NSD architecture scales smoothly, there are also several opportunities for enhancement:
- Since GPFS stripes user data across all of the cluster's NSD nodes for optimal performance, if any NSD's attached storage controller is slow (for instance, because of a RAID rebuild), the performance of the entire cluster file system can be commensurately slow.
- Traditional storage controllers are costly compared to JBOD arrays.
- There is no end-to-end data checksum protection from a disk to an application.
To address these issues, the Perseus project is designing an advanced-RAID, software-based storage controller that runs within the GPFS NSD layer with several important features:
- Using an internal virtualization layer, Perseus evenly distributes or "declusters" GPFS user data and spare space across a large number of disks in a direct-attached JBOD storage array.
- Perseus protects against two or three simultaneous faults using either mirroring or a Reed-Solomon erasure code.
- Perseus provides an end-to-end checksum to protect against non-signaled data loss in storage devices.
We've labeled these composite RAID architectural features "RAID-D2" or "RAID-D3," short for "RAID declustered, 2/3-fault-tolerant."
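To make the declustering idea concrete, the following is a minimal sketch, not Perseus's actual placement algorithm. It assumes a simple pseudo-random placement (the function names, the stripe width, and the disk counts are all hypothetical) and compares how many disks take part in rebuilding one failed disk under fixed RAID groups versus a declustered layout. As a simplification, it counts every surviving strip of an affected stripe as one rebuild read.

```python
import random
from collections import Counter

def declustered_layout(num_disks, stripe_width, num_stripes, seed=0):
    """Place each stripe on `stripe_width` distinct disks chosen pseudo-randomly.

    Assumption: random placement stands in for a real declustering scheme.
    """
    rng = random.Random(seed)
    return [rng.sample(range(num_disks), stripe_width) for _ in range(num_stripes)]

def conventional_layout(num_disks, stripe_width, num_stripes):
    """Partition disks into fixed groups; each stripe lives wholly in one group."""
    groups = [
        list(range(start, start + stripe_width))
        for start in range(0, num_disks - stripe_width + 1, stripe_width)
    ]
    return [groups[s % len(groups)] for s in range(num_stripes)]

def rebuild_reads(layout, failed_disk):
    """Count the strips each surviving disk must read to rebuild `failed_disk`."""
    load = Counter()
    for stripe in layout:
        if failed_disk in stripe:
            for disk in stripe:
                if disk != failed_disk:
                    load[disk] += 1
    return load

if __name__ == "__main__":
    conv = rebuild_reads(conventional_layout(40, 8, 4000), failed_disk=0)
    decl = rebuild_reads(declustered_layout(40, 8, 4000), failed_disk=0)
    print("conventional:", len(conv), "disks busy, max reads", max(conv.values()))
    print("declustered: ", len(decl), "disks busy, max reads", max(decl.values()))
```

With these illustrative parameters, the fixed-group layout concentrates all rebuild reads on the 7 other disks in the failed disk's group, while the declustered layout spreads a similar total across nearly every surviving disk, so each disk does far less rebuild work.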
These capabilities offer several key benefits:
- Reduction of storage subsystem hardware costs by 20-40%.
- High file system throughput even during storage rebuilds after disk failures.
- Extremely high system reliability, even with two or three simultaneous storage faults.
- Capability to defer disk maintenance, reducing reliability, availability, and serviceability (RAS) costs.
- Protection from undetected lost storage subsystem writes (possibly due to disk firmware bugs).
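The last benefit can be illustrated with a small sketch of how a checksum combined with a version number can expose a lost write. This is one common approach, shown under stated assumptions, and is not a description of Perseus's on-disk format: the trailer layout, the `seal` and `verify` names, and the use of CRC32 are all hypothetical. The point is that a lost write leaves an old block on disk whose checksum is still valid, so the reader also needs an independently tracked expected version.

```python
import struct
import zlib

# Hypothetical trailer: 8-byte big-endian version followed by a 4-byte CRC32.
TRAILER = struct.Struct(">QI")

def seal(payload: bytes, version: int) -> bytes:
    """Append a version number and a checksum covering payload and version."""
    crc = zlib.crc32(payload + struct.pack(">Q", version))
    return payload + TRAILER.pack(version, crc)

def verify(block: bytes, expected_version: int) -> str:
    """Classify a block read from disk as 'ok', 'corrupt', or 'stale'."""
    payload, trailer = block[:-TRAILER.size], block[-TRAILER.size:]
    version, crc = TRAILER.unpack(trailer)
    if zlib.crc32(payload + struct.pack(">Q", version)) != crc:
        return "corrupt"   # the bits changed after the block was sealed
    if version != expected_version:
        return "stale"     # valid checksum but old contents: a lost write
    return "ok"

if __name__ == "__main__":
    old_block = seal(b"old contents", version=1)
    new_block = seal(b"new contents", version=2)
    print(verify(new_block, expected_version=2))  # the write reached the disk
    print(verify(old_block, expected_version=2))  # the disk dropped the write
```

In this sketch the expected version would have to be stored somewhere other than the data block itself (for example, in separate metadata), since a checksum alone cannot distinguish an intact old block from the intended new one.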
Ongoing research includes how best to design and implement:
- Dynamic placement of declustered data, parity and spare space in a declustered array.
- In-memory metadata structures.
- Checksum disk layout and algorithm for tracking lost writes.
- Reed-Solomon parity calculations on various microprocessors.
- Recovery process after node failure.
- Procedures to handle disk replacement and policies for deferred maintenance.
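As a rough illustration of the kind of parity arithmetic involved, here is a minimal sketch of a two-fault-tolerant P+Q code over GF(2^8), a well-known special case of Reed-Solomon coding. It is not Perseus's implementation: the field polynomial (0x11d), the generator (2), and all function names are assumptions chosen for illustration, and a production version would use table-driven or vectorized arithmetic tuned to the processor.

```python
def _build_tables():
    """Build exponent and logarithm tables for GF(2^8) with polynomial 0x11d."""
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        x <<= 1
        if x & 0x100:
            x ^= 0x11D
    for i in range(255, 512):          # duplicate so gf_mul needs no modulo
        exp[i] = exp[i - 255]
    return exp, log

EXP, LOG = _build_tables()

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    """Divide in GF(2^8); b must be nonzero."""
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]

def encode_pq(strips):
    """Return (P, Q) for equal-length data strips: P = XOR, Q = sum of g^i * D_i."""
    p, q = bytearray(len(strips[0])), bytearray(len(strips[0]))
    for i, strip in enumerate(strips):
        for j, byte in enumerate(strip):
            p[j] ^= byte
            q[j] ^= gf_mul(EXP[i], byte)
    return bytes(p), bytes(q)

def recover_two(strips, x, y, p, q):
    """Rebuild lost data strips x and y (x != y) from the survivors, P, and Q."""
    pxy, qxy = bytearray(p), bytearray(q)
    for i, strip in enumerate(strips):
        if i in (x, y):
            continue                   # lost strips contribute nothing
        for j, byte in enumerate(strip):
            pxy[j] ^= byte
            qxy[j] ^= gf_mul(EXP[i], byte)
    # Now pxy = Dx ^ Dy and qxy = g^x*Dx ^ g^y*Dy, so solve for Dx, then Dy.
    denom = EXP[x] ^ EXP[y]
    dx = bytes(gf_div(qxy[j] ^ gf_mul(EXP[y], pxy[j]), denom) for j in range(len(p)))
    dy = bytes(pxy[j] ^ dx[j] for j in range(len(p)))
    return dx, dy

if __name__ == "__main__":
    data = [b"GPFS", b"NSD!", b"RAID", b"JBOD"]
    p, q = encode_pq(data)
    survivors = [data[0], None, data[2], None]   # strips 1 and 3 are lost
    print(recover_two(survivors, 1, 3, p, q))
```

The sketch covers only the case where two data strips are lost; losing P or Q, or tolerating a third fault, needs additional cases or a wider code.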
People
- Ralph A Becker-Szendy
- Bruce Cassidy
- Veera Deenadhayalan
- Robert Garner
- Scott Guthridge
- Bryan Henderson
- Ronald Mak
- Jim Wyllie
- GPFS Development and Test team members in Poughkeepsie and China
Product Impact
Perseus and GPFS will be used in the High Productivity Computing Systems (HPCS) supercomputer called PERCS (Productive, Easy-to-use, Reliable Computing System), which is based on high-end IBM POWER7 servers and a high-performance interconnect fabric. A PERCS system is slated to be installed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign (UIUC). Perseus is also being evaluated for other potential products.
