In 1999, Almaden researchers broke the world record for sorting a terabyte
of data on an RS/6000 SP. The researchers sorted one terabyte of 100-byte
records in 17 minutes, 37 seconds. This is almost a factor of 3 faster
than the previous record holder, a cluster of Compaq NT servers at
Sandia Labs. Sorting speed has been used for a number of years as a
measure of computer system I/O and communication performance. The
Almaden sort program, SPsort, sustained nearly 1.9GB/s of I/O to and
from the GPFS global file system, 5.6GB/s of
interprocessor communication across the SP switch, and about 1.9GB/s
to scratch files on local disks during its execution.
Background
In 1985, an article in
Datamation Magazine (A Measure of Transaction
Processing Power, by Anon. et al.) proposed a sort of one million
records of 100 bytes, each with random 10-byte keys, as a useful
benchmark of computer system I/O performance. The benchmark's ground rules
are that all input must start on disk, all output must end on disk,
and the overhead of starting the program and creating the output
files must be included in the benchmark time. Because the current
record for this benchmark is about one second, new benchmarks were
established to stress ever larger computing systems. "MinuteSort"
measures how much data can be sorted in one minute, and "PennySort"
measures how much data can be sorted for one cent. At the high end is
Terabyte Sort. A number of Terabyte Sort records have been reported
recently. Almaden's SPsort improves substantially upon the best
of these.
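
To make the record format concrete, the following is a minimal Python sketch of the benchmark's data layout and a sort on it. It assumes the 10-byte key occupies the first 10 bytes of each 100-byte record and fills the remaining 90 bytes with a constant; the function names are illustrative and are not taken from SPsort. It sorts in memory and is not meant to suggest how a terabyte is sorted.

```python
import os

RECORD_SIZE = 100  # bytes per record, per the benchmark definition
KEY_SIZE = 10      # assumption: the key is the leading 10 bytes of the record


def make_records(count):
    """Return `count` records: a random 10-byte key followed by filler payload."""
    payload = b"x" * (RECORD_SIZE - KEY_SIZE)  # filler content is an assumption
    return [os.urandom(KEY_SIZE) + payload for _ in range(count)]


def sort_records(records):
    """Sort records by key bytes, entirely in memory (illustration only)."""
    return sorted(records, key=lambda rec: rec[:KEY_SIZE])


def is_sorted(records):
    """Check that keys appear in non-decreasing byte order."""
    return all(a[:KEY_SIZE] <= b[:KEY_SIZE] for a, b in zip(records, records[1:]))


# Number of 100-byte records in one terabyte, taking 1 TB as 10**12 bytes.
RECORDS_PER_TERABYTE = 10**12 // RECORD_SIZE
```

Under the decimal-terabyte assumption, a Terabyte Sort handles ten billion such records, ten thousand times as many as the original one-million-record benchmark.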
Hardware and software
SPsort was run on an RS/6000 SP with 488 nodes. Each node contains four
332MHz 604 processors, 1.5GB of RAM, and a 9GB SCSI disk. The nodes
communicate with one another through the high-speed SP switch with a
bi-directional link bandwidth to each node of 150MB/s. Global storage
consists of 6 TB of disk in the form of 336 RAID arrays attached
to 56 of the nodes. In addition to the 56 disk servers, 400 of the SP
nodes ran the sort program.
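
The configuration figures above imply a few aggregate numbers that are easy to check. The short Python sketch below derives them; the variable names are illustrative, and the per-node data share assumes that the terabyte (taken as 1000 GB) is divided evenly among the 400 sort nodes, which the text does not state.

```python
NODES_TOTAL = 488        # nodes in the RS/6000 SP
CPUS_PER_NODE = 4        # 332MHz 604 processors per node
DISK_SERVERS = 56        # nodes with RAID arrays attached
RAID_ARRAYS = 336
SORT_NODES = 400         # nodes that ran the sort program
RAM_PER_NODE_GB = 1.5
DATA_GB = 1000           # assumption: 1 TB taken as 1000 GB

total_cpus = NODES_TOTAL * CPUS_PER_NODE          # processors in the whole machine
arrays_per_server = RAID_ARRAYS // DISK_SERVERS   # RAID arrays per disk server
sort_ram_gb = SORT_NODES * RAM_PER_NODE_GB        # combined RAM of the sort nodes
data_per_sort_node_gb = DATA_GB / SORT_NODES      # assumption: even split of the data

print(total_cpus, arrays_per_server, sort_ram_gb, data_per_sort_node_gb)
```

The combined memory of the sort nodes comes to 600 GB, less than the terabyte being sorted, which is consistent with the program's use of scratch files on local disks. An even share of 2.5 GB per node would fit on each node's 9GB disk.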
The sort's input and output data were stored in the GPFS parallel file system.
All 336 RAID devices were configured as a single mountable file system.
GPFS stripes files across all these devices, allowing the machine's
aggregate bandwidth of over 2.5GB/s to be brought to bear on a single
file when necessary. The sort benchmark program averaged 1.89GB/s
through GPFS during its execution, although the peak rates were
significantly higher. Correspondingly high rates of MPI communication through
the switch and of access to local disks were sustained during the sort.
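
The 1.89GB/s average can be reproduced from the elapsed time. The Python sketch below assumes that one terabyte means 10**12 bytes and that the input is read once from GPFS and the output written once to it, so that two terabytes pass through the file system in total. The function name is illustrative.

```python
def average_gpfs_rate_gb_per_s(data_bytes, minutes, seconds):
    """Average GPFS throughput in GB/s, counting one read and one write of the data."""
    elapsed_s = minutes * 60 + seconds   # 17 min 37 s is 1057 s
    bytes_moved = 2 * data_bytes         # assumption: input read once, output written once
    return bytes_moved / elapsed_s / 1e9


rate = average_gpfs_rate_gb_per_s(10**12, 17, 37)
print(f"{rate:.2f} GB/s")
```

Under these assumptions the result rounds to the 1.89GB/s quoted above.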
SPsort is a custom sort program optimized for Terabyte Sort. It uses
standard SP and AIX services: X/Open-compliant file system access through
GPFS, MPI message passing between nodes, POSIX pthreads, and the
SP Parallel Environment to initiate and control a sort job running
on many nodes.
IBM Almaden Research - File Systems