The Five-minute Rule 20 Years Later and How Flash Memory Changes the Rules
ABSTRACT
In 1987, Jim Gray and Gianfranco Putzolu published their now-famous five-minute rule1 for trading off memory and I/O capacity. Their calculation compares the cost of holding a record (or page) permanently in memory with the cost of performing disk I/O each time the record (or page) is accessed, using appropriate fractions of prices for RAM chips and disk drives. The name of their rule refers to the break-even interval between accesses. If a record (or page) is accessed more often, it should be kept in memory; otherwise, it should remain on disk and read when needed.
Based on then-current prices and performance characteristics of Tandem equipment, Gray and Putzolu found that the price of RAM to hold a 1-KB record was about equal to the (fractional) price of a disk drive required to access such a record every 400 seconds, which they rounded to five minutes. The break-even interval is about inversely proportional to the record size. Gray and Putzolu gave one hour for 100-byte records and two minutes for 4-KB pages.
The five-minute rule was reviewed and renewed 10 years later in 1997.2 Lots of prices and performance parameters had changed (e.g., the price of RAM had tumbled from $5,000 to $15 per megabyte). Nonetheless, the break-even interval for 4-KB pages was still around five minutes. The first purpose of this article is to review the five-minute rule after another 10 years.
Of course, both previous papers acknowledged that prices and performance vary among technologies and devices at any point in time (e.g., RAM for mainframes versus minicomputers, SCSI versus IDE disks, etc.). Therefore, interested readers are invited to reevaluate the appropriate formulas for their environments and equipment. The values used here (in table 1) are meant to be typical for 2007 technologies rather than universally accurate.
In addition to quantitative changes in prices and performance, qualitative changes already under way will affect the software and hardware architectures of servers and, in particular, database systems. Database software will change radically with the advent of new technologies: virtualization with hardware and software support, as well as higher utilization goals for physical machines; many-core processors and transactional memory supported both in programming environments and hardware; 3 deployment in containers housing thousands of processors and many terabytes of data;4 and flash memory that fills the gap between traditional RAM and traditional rotating disks.
Flash memory falls between traditional RAM and persistent mass storage based on rotating disks in terms of acquisition cost, access latency, transfer bandwidth, spatial density, power consumption, and cooling costs.5 Table 1 and some derived metrics in table 2 illustrate this point.
Given that the number of CPU instructions possible during the time required for one disk I/O has steadily increased, an intermediate memory in the storage hierarchy is desirable. Flash memory seems to be a highly probable candidate, as has been observed many times by now.
Many architecture details remain to be worked out. For example, in the hardware architecture, will flash memorybe accessible via a DIMM slot, a SATA (serial ATA) disk interface, or yet another hardware interface? Given the effort and delay in defining a new hardware interface, adaptations of existing interfaces are likely.
A major question is whether flash memory is considered a special part of main memory or a special part of persistent storage. Asked differently: if a system includes 1-GB traditional RAM, 8-GB flash memory, and a 250-GB traditional disk, does the software treat it as 250 GB of persistent storage and a 9-GB buffer pool, or as 258 GB of persistent storage and a 1-GB buffer pool? The second purpose of this article is to answer this question and, in fact, to argue for different answers in file systems and database systems.
Many design decisions depend on the answer to this question. For example, if flash memory is part of the buffer pool, pages must be considered “dirty” if their contents differ from the equivalent page in persistent storage. Synchronizing the file system or checkpointing a database must force disk writes in those cases. If flash memory is part of persistent storage, these write operations are not required.
Designers of operating systems and file systems will want to use flash memory as an extended buffer pool (extended RAM), whereas database systems will benefit from flash memory as an extended disk (extended persistent storage). Multiple aspects of file systems and database systems consistently favor these two designs.
Moreover, the characteristics of flash memory suggest some substantial differences in the management of B-tree pages and their allocation. Beyond optimization of page sizes, B-trees can use different units of I/O for flash memory and disks. Presenting the case for this design is the third purpose of this article.
Assumptions
Forward-looking research always relies on many assumptions. This section lists the assumptions that led to the conclusions put forth in this article. Some of the assumptions are fairly basic, whereas others are more speculative.
One assumption is that file systems and database systems assign flash memory to a level between RAM and the disk drives. Both software systems favor pages with some probability that they will be touched in the future but not with sufficient probability to warrant keeping them in RAM. The estimation and administration of such probabilities follow the usual lines (e.g., LRU, or least recently used).
We assume that the administration of such information employs data structures in RAM, even for pages whose contents have been removed from RAM to flash memory. For example, the LRU chain in a file system’s buffer pool might cover both RAM and flash memory, or there might be two separate LRU chains. A page is loaded into RAM and inserted at the head of the first chain when it is needed by an application. When it reaches the tail of the first chain, the page is moved to flash memory and its descriptor to the head of the second LRU chain. When it reaches the tail of the second chain, the page is moved to disk and removed from the LRU chain. Other replacement algorithms would work mutatis mutandis.
Such fine-grained LRU replacement of individual pages is in contrast to assigning entire files, directories, tables, or databases to different storage units. It seems that page replacement is the appropriate granularity in buffer pools. Moreover, proven methods exist to load and replace buffer-pool contents entirely automatically, without assistance from tuning tools and without directives by users or administrators. An extended buffer pool in flash memory should exploit the same methods as a traditional buffer pool. For truly comparable and competitive performance and administration costs, a similar approach seems advisable when flash memory is used as an extended disk.
File systems
The research for this article assumed a fairly traditional file system. Many file systems differ in one way or another from this model, but most still generally adhere to it.
Each file is a large byte stream. Files are often read in their entirety, their contents manipulated in memory, and the entire file replaced if it is updated at all. Archiving, version retention, hierarchical storage management, data movement using removable media, etc. all seem to follow this model as well.
Based on this model, space allocation on disk attempts to use contiguous disk blocks for each file. Metadata is limited to directories, a few standard tags such as a creation time, and data structures for space management.
Consistency of these on-disk data structures is achieved by careful write ordering, fairly quick write-back of updated data blocks, and expensive file-system checks after any less-than-perfect shutdown or media removal. In other words, we assume the absence of transactional guarantees and transactional logging, at least for file contents. If log-based recovery is supported for file contents such as individual pages or records within pages, then a number of the arguments presented here need to be revisited.
Database systems
We assume fairly traditional database systems with B-tree indexes as the workhorse storage structure. Similar tree structures capture not only traditional clustered and nonclustered indexes but also bitmap indexes, columnar storage, contents indexes, XML indexes, catalogs (metadata), and allocation data structures.
With respect to transactional guarantees, we assume traditional write-ahead logging of both contents changes (such as inserting or deleting a record) and structural changes (such as splitting B-tree nodes). Efficient log-based recovery after failures is enabled by checkpoints that force dirty data from the buffer pool to persistent storage.
Other assumptions include variations such as “second chance” or fuzzy checkpoints. In addition, nonlogged (allocation-only logged) execution is permitted for some operations such as index creation. These operations require appropriate write ordering and a “force” buffer pool policy.6