Read and write hundreds of gigabytes per second
Summary: The Network File System (NFS) is a stalwart component of most modern local area networks (LANs). But NFS is inadequate for the demanding input- and output-intensive applications commonly found in high-performance computing—or, at least it was. The newest revision of the NFS standard includes Parallel NFS (pNFS), a parallelized implementation of file sharing that multiplies transfer rates by orders of magnitude. Here's a primer. [Note: The article has been updated with regard to vendor involvement in the origin and development of pNFS —Ed.]
Date: 26 Nov 2008 (Published 04 Nov 2008)
Level: Intermediate
PDF: A4 and Letter (161KB | 11 pages)
Through NFS, which consists of server and client software and protocols running among them, a computer can share its physical file system with many other computers connected to the same network. NFS masks the implementation and type of the server's file system. To applications running on an NFS client, the shared file system appears as if it's local, native storage.
Figure 1 illustrates a common deployment of NFS within a network of heterogeneous operating systems, including Linux®, Mac OS X, and Windows®, all of which support the NFS standard. (NFS is the sole file system standard supported by the Internet Engineering Task Force.)
In Figure 1, the Linux machine is the NFS server; it shares or exports (in NFS parlance) one or more of its physical, attached file systems. The Mac OS X and Windows machines are NFS clients. Each consumes, or mounts, the shared file system. Indeed, mounting an NFS file system yields the same result as mounting a local drive partition—when mounted, applications simply read and write files, subject to access control, oblivious to the machinations required to persist data.
In the case of a file system shared through NFS, Read and Write operations—represented by the blue shadow—travel through the client (in this case, the Windows machine) to the server, which ultimately fulfills requests to retrieve or persist data or alter file metadata, such as permissions and last-modified time.
NFS is quite capable, as evidenced by its widespread use as Network Attached Storage (NAS). It runs over both Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) and is (relatively) easy to administer. Furthermore, NFS version 4, the most recent, ratified version of the standard, improves security, furthers interoperability between Windows and UNIX®-like systems, and provides better exclusivity through lock leases. (NFSv4 was ratified in 2003.) NFS infrastructure is also inexpensive, because it typically runs well on common Ethernet hardware. NFS suits most problem domains.
However, one domain not traditionally well served by NFS is high-performance computing (HPC), where data files are very large, sometimes huge, and the number of NFS clients can reach into the thousands. (Think of a compute cluster or grid composed of thousands of commodity computing nodes.) Here, NFS is a liability, because the limits of the NFS server—be it bandwidth, storage capacity, or processor speed—throttle the overall performance of the computation. NFS is a bottleneck.
Or, at least it was.
The next revision of NFS, version 4.1, includes an extension called Parallel NFS (pNFS) that combines the advantages of stock NFS with the massive transfer rates proffered by parallelized input and output (I/O). Using pNFS, file systems are shared from server to clients as before, but data does not pass through the NFS server. Instead, client systems and the data storage system connect directly, providing numerous parallelized, high-speed data paths for massive data transfers. After a bit of initialization and handshaking, the pNFS server is left "out of the loop," and it no longer hinders transfer rates.
Figure 2 shows a pNFS configuration. At the top are the nodes of a compute cluster, such as a large pool of inexpensive, Linux-powered blades. At the left is the NFSv4.1 server. (For this discussion, let's just call it a pNFS server.) At the bottom is a large parallel file system.
Like NFS, the pNFS server exports file systems and retains and maintains the canonical metadata describing each and every file in the data store. As with NFS, a pNFS client—here, a node in a cluster—mounts the server's exported file systems. Like NFS, each node treats the file system as if it were local and physically attached. Changes to metadata propagate through the network back to the pNFS server. Unlike NFS, however, a Read or Write of data managed with pNFS is a direct operation between a node and the storage system itself, pictured at the bottom in Figure 2. The pNFS server is removed from data transactions, giving pNFS a definite performance advantage.
Thus, pNFS retains all the niceties and conveniences of NFS and improves performance and scalability. The number of clients can be expanded to provide more computing power, while the size of the storage system can expand with little impact on client configuration. All you need to do is keep the pNFS catalog and storage system in sync.
So, how does it work? As shown in Figure 3, pNFS is implemented as acollection of three protocols.
The pNFS protocol transfers file metadata (formally known as a layout) between the pNFS server and a client node. You can think of a layout as a map, describing how a file is distributed across the data store, such as how it is striped across multiple spindles. Additionally, a layout contains permissions and other file attributes. With metadata captured in a layout and persisted in the pNFS server, the storage system simply performs I/O.
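To make the idea of a layout concrete, the sketch below models it as a small data structure for a simply striped file. This is purely illustrative: the field names, the `StripeSegment` class, and the `device_for_offset` helper are hypothetical and do not correspond to the NFSv4.1 wire format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StripeSegment:
    """One stripe of a file on one data server (hypothetical fields)."""
    device_id: int   # opaque ID of the data server holding this stripe
    offset: int      # byte offset of the stripe within the file
    length: int      # number of bytes in this stripe

@dataclass
class Layout:
    """Illustrative layout: the map of how one file spans the data store."""
    file_handle: bytes
    stripe_unit: int              # bytes written per device before rotating
    segments: List[StripeSegment]
    mode: str                     # access granted: "READ" or "RW"

def device_for_offset(layout: Layout, offset: int) -> int:
    """Return the device ID that holds the given byte offset."""
    for seg in layout.segments:
        if seg.offset <= offset < seg.offset + seg.length:
            return seg.device_id
    raise ValueError("offset not covered by this layout")
```

With a map like this in hand, a client can compute for itself which data server to contact for any byte range, which is exactly what lets I/O bypass the metadata server.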
The storage access protocol specifies how a client accesses data from the data store. As you might guess, each storage access protocol defines its own form of layout, because the access protocol and the organization of the data must be concordant.
The control protocol synchronizes state between the metadata server and the data servers. Synchronization, such as reorganizing files on media, is hidden from clients. Further, the control protocol is not specified in NFSv4.1; it can take many forms, allowing vendors the flexibility to compete on performance, cost, and features.
Given those protocols, you can follow the client-access process. More specifically, a Read operation is a series of protocol operations:

1. The client sends a LOOKUP+OPEN request to the pNFS server. The server returns a file handle and state information.
2. The client requests a layout from the server with a LAYOUTGET command. The server returns the file layout.
3. The client issues a READ request to the storage devices, which initiates multiple Read operations in parallel.
4. When the client is finished reading, it relinquishes the layout with a LAYOUTRETURN.
5. If the layout changes while in use, the server issues a CB_LAYOUTRECALL to indicate that the layout is no longer valid and must be purged and/or refetched.

A Write operation is similar, except that the client must issue a LAYOUTCOMMIT before LAYOUTRETURN to "publish" the changes to the file to the pNFS server.
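The call sequence above can be sketched in code. The toy classes below stand in for the metadata server and the data servers purely to show the order of operations; none of the names correspond to a real pNFS client API, and the two-segment layout is made up.

```python
class ToyMetadataServer:
    """Toy stand-in for the pNFS (metadata) server; illustrative only."""
    def lookup_open(self, path):
        return "fh:" + path, "state-1"           # file handle and state
    def layout_get(self, handle):
        # Trivial layout: the file is striped across devices 0 and 1
        return [{"device": 0, "offset": 0,    "length": 4096},
                {"device": 1, "offset": 4096, "length": 4096}]
    def layout_commit(self, handle):
        pass                                      # publish metadata changes
    def layout_return(self, handle):
        pass                                      # client gives up the layout

class ToyDataServer:
    """Toy stand-in for one node of the parallel data store."""
    def __init__(self):
        self.blocks = {}
    def read(self, seg):
        return self.blocks.get(seg["offset"], b"")
    def write(self, seg, data):
        self.blocks[seg["offset"]] = data

def pnfs_read(mds, data_servers, path, trace):
    handle, state = mds.lookup_open(path);  trace.append("LOOKUP+OPEN")
    layout = mds.layout_get(handle);        trace.append("LAYOUTGET")
    # Reads go straight to the data servers, bypassing the metadata server
    chunks = [data_servers[s["device"]].read(s) for s in layout]
    trace.append("READ")
    mds.layout_return(handle);              trace.append("LAYOUTRETURN")
    return b"".join(chunks)

def pnfs_write(mds, data_servers, path, payloads, trace):
    handle, state = mds.lookup_open(path);  trace.append("LOOKUP+OPEN")
    layout = mds.layout_get(handle);        trace.append("LAYOUTGET")
    for seg, data in zip(layout, payloads):
        data_servers[seg["device"]].write(seg, data)
    trace.append("WRITE")
    mds.layout_commit(handle);              trace.append("LAYOUTCOMMIT")   # before return
    mds.layout_return(handle);              trace.append("LAYOUTRETURN")
```

Note how the metadata server appears only at the start (handle and layout) and the end (commit and return) of each operation, while the bulk data moves directly between the client and the data servers.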
Layouts can be cached in each client, further enhancing speed, and a client can voluntarily relinquish a layout from the server if it's no longer of use. A server can also restrict the byte range of a Write layout to avoid quota limits or to reduce allocation overhead, among other reasons.
To prevent stale caches, the metadata server recalls layouts that have become inaccurate. Following a recall, every affected client must cease I/O and either fetch the layout anew or access the file through plain NFS. Recalls are mandatory before the server attempts any file administration, such as migration or re-striping.
It's location, location, location
As mentioned above, each storage access protocol defines a type of layout, and new access protocols and layouts can be added freely. To bootstrap the use of pNFS, the vendors and researchers shaping pNFS have already defined three storage techniques: file, block, and object stores.
No matter the type of layout, pNFS uses a common scheme to refer to servers. Instead of a hostname or volume name, servers are referred to by a unique ID. This ID is mapped to the access protocol-specific server reference.
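That indirection can be pictured as a lookup table keyed by layout type and device ID. The sketch below is an assumption-laden illustration: the table, the function name, and the addresses are all made up, though the address forms (an NFS data server for file layouts, an iSCSI target for block layouts, an object-store target for object layouts) reflect the three storage techniques named above.

```python
# Hypothetical device table: clients see only opaque device IDs; each
# (layout type, ID) pair resolves to a protocol-specific server reference.
DEVICE_TABLE = {
    ("file",   7): "nfs://dataserver-7.example.com",   # file layout: NFS data server
    ("block",  7): "iqn.2008-11.com.example:target7",  # block layout: iSCSI target
    ("object", 7): "osd://object-store-7",             # object layout: OSD target
}

def resolve_device(layout_type: str, device_id: int) -> str:
    """Map an opaque device ID to its access protocol-specific reference."""
    return DEVICE_TABLE[(layout_type, device_id)]
```

Because clients hold only the opaque IDs, the storage system can be re-addressed or re-provisioned by updating the mapping, without rewriting every cached layout.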
Which of these storage techniques is better? The answer is, "It depends."Budget, speed, scale, simplicity, and other factors are all part of theequation.
Before you break out your checkbook, let's look at the state of pNFS.
As of this writing in November 2008, the draft Request for Comments (RFC) for NFSv4.1 is entering "last call," a two-month period set aside to collect and consider comments before the RFC is published and opened to industry-wide scrutiny. When published, the formal RFC review period often lasts a year.
In addition to providing broad exposure, the draft proposed standard captured in the RFC lays a firm foundation for actual product development. As only minor changes to the standard are expected during the forthcoming review period, vendors can design and build workable, marketable solutions now. Products from multiple vendors will be available sometime next year.
In the immediate term, open source prototype implementations of pNFS on Linux are available from a git repository located at the University of Michigan (see Resources for a link). IBM, Panasas, NetApp, and the University of Michigan Center for Information Technology Integration (CITI) are leading the development of NFSv4.1 and pNFS for Linux.
The potential for pNFS as an open-source parallel file system client is enormous. The fastest supercomputer in the world (as ranked by the Top500 survey) and the first computer to reach a petaflop uses the parallel file system built by Panasas (a supporter of the pNFS object-based implementation). (A petaflop is one thousand trillion floating-point operations per second.) Dubbed Roadrunner, located at the Los Alamos National Laboratory and pictured in Figure 4, the gargantuan system has 12,960 processors, runs Linux, and is the first supercomputer to be constructed using heterogeneous processor types. Both AMD Opteron X64 processors and IBM's Cell Broadband Engine™ drive computation. In 2006, Roadrunner demonstrated a peak 1.6 gigabytes-per-second transfer rate using an early version of Panasas's parallel file system. In 2008, the Roadrunner parallel storage system can sustain hundreds of gigabytes per second. In comparison, traditional NFS typically peaks at hundreds of megabytes per second.
The entire NFSv4.1 standard and pNFS are substantive improvements to the NFS standard and represent the most radical changes made to a twenty-something-year-old technology that originated with Sun Microsystems' Bill Joy in the 1980s. Five years in development, NFSv4.1 and pNFS now (or imminently) stand ready to provide super-storage speeds to super-computing machines.
We have seen the future, and it is parallel storage.
Martin Streicher is a freelance Ruby on Rails developer and the former Editor-in-Chief of Linux Magazine. Martin holds a Master of Science degree in computer science from Purdue University and has programmed UNIX-like systems since 1986. He collects art and toys.