Last updated: 14 Jun 2009
Despite huge increases in disk capacity, storage has not kept pace with advances in computing. A storage area network (SAN) virtualizes storage and removes dependencies on direct-attached storage (DAS), such as disks cabled to one computer, enabling backups within the SAN, seamless addition of new disks, and other features. Usually, there are multiple connections to the SAN for both fault tolerance and performance. SANs such as Fibre Channel and iSCSI provide block-level access to storage, where blocks of low-level disk sectors are read and written, as opposed to networked filesystems, which provide file-level access to storage.
However, SANs are expensive and complex to manage. Computing power, RAM, serial ATA (SATA) disks, and Ethernet connections and switches are inexpensive, and many people are familiar with the details of managing them. Can the DAS disks of inexpensive computers be reliably aggregated to appear as one big, fast storage area to clients?
With open source storage clustering (e.g., GlusterFS and Lustre) or parallel filesystems (e.g., PVFS), the answer appears to be "yes"... at least for Unix servers and clients. Windows clients might be possible with some kind of user-mode file system (e.g., Dokan), but no such user-mode filesystem mechanism is considered reliable enough yet.
A storage node is a computer that participates as a member of a storage cluster, and generally contributes its DAS to the cluster for use. A storage node may also be a metadata server (one that knows details about files and directories), if required by the clustering technology.
A Virtual Computing Laboratory (VCL) is a collection of real or virtual machines that are partially or fully controlled by a remote user. For simplicity, a real machine in a VCL is called a "compute node", though it has its own local storage and does not need to focus on maximizing computation.
The VCL compute nodes need to have fast access to terabytes of data, in chunks that range from CD/DVD image size (0.7GB to 8.5GB) to virtual machine (VM) image size (usually more than 2GB, but typically about 6GB for student use). ISO images from CD/DVDs need to be present on disk because there is no way for the user to insert media; they must also be read-only. Pre-defined virtual machines, such as those intended for coursework that does not require the student to create the VM and install the operating system, are also read-only and need to be copied from the storage nodes to the compute nodes. However, modified VMs would need to be written back to the storage nodes.
Tens of gigabytes of data could be transferred by each student using one VCL compute node, primarily from reading one or more ISO images or VMs, as well as from writing any modified VMs. In addition, there could be tens of students using the compute nodes. Consequently, one huge storage server would become a bottleneck for the clients. A storage cluster is designed to be distributed and scalable, which helps it avoid or reduce bottlenecks.
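As a rough illustration, a back-of-envelope sketch makes the bottleneck concrete. The per-student figures below are assumptions drawn from the ranges above, not measurements:

```python
# Back-of-envelope estimate: aggregate demand vs. one gigabit-attached server.
# All figures are assumptions taken from the ranges discussed in the text.

GB = 1000 ** 3                     # decimal gigabytes, as disk vendors count them

students         = 30              # "tens of students" -- assumed value
data_per_student = 20 * GB         # "tens of gigabytes" per student -- assumed value
window_seconds   = 15 * 60         # assume the reads cluster into a 15-minute start-of-lab window

aggregate_rate = students * data_per_student / window_seconds
single_server  = 80 * 1000 ** 2    # ~80 MB/sec usable on one gigabit link

print("aggregate demand : %.0f MB/sec" % (aggregate_rate / 1e6))   # ~667 MB/sec
print("one server offers: %.0f MB/sec" % (single_server / 1e6))    #   80 MB/sec
print("servers needed   : %.1f" % (aggregate_rate / single_server))
```

Under these assumed numbers, roughly eight gigabit-attached servers would be needed to keep up, which is why the data must be spread across several storage nodes.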
How do the clients (compute nodes) communicate with the storage cluster? A gigabit, Ethernet-based local area network (LAN) provides the connections. Gigabit Ethernet cards and switches are inexpensive and can reach speeds in excess of 80MB/sec. SATA II hard disks can reach sequential read speeds nearing 200MB/sec and sequential write speeds of around 100MB/sec -- if only one thread is accessing the disk at a time.
Multiple threads might represent more than one compute node trying to read/write a storage node's physical disk. Adding threads dramatically reduces the I/O rate, but RAID 0 and RAID 10 can improve it again. A review of the Adaptec 1430AS SATA II disk controller card using benchmarks indicated that one might be able to achieve about 70MB/sec read or write with four active threads -- but that brand of controller card had some reliability issues and reportedly did not support disks larger than 1TB.
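A crude way to observe the effect of added threads is a concurrent sequential-read test. The sketch below is only illustrative: the file paths are placeholders, the files must be larger than RAM (or the page cache flushed) for the numbers to mean anything, and purpose-built tools such as bonnie++ or iozone are better benchmarks.

```python
# Minimal sketch: measure aggregate read throughput as reader threads are added.
# Each thread should read a different large file so the disk must seek between them.
import threading, time

CHUNK = 8 * 1024 * 1024   # 8MB reads

def reader(path, totals, idx):
    read = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                break
            read += len(buf)
    totals[idx] = read

def run(paths):
    totals = [0] * len(paths)
    threads = [threading.Thread(target=reader, args=(p, totals, i))
               for i, p in enumerate(paths)]
    start = time.time()
    for t in threads: t.start()
    for t in threads: t.join()
    elapsed = time.time() - start
    print("%d threads: %.1f MB/sec aggregate" % (len(paths), sum(totals) / elapsed / 1e6))

# Hypothetical files on the same physical disk or RAID set:
# run(["/data/iso/image1.iso"])
# run(["/data/iso/image1.iso", "/data/iso/image2.iso",
#      "/data/iso/image3.iso", "/data/iso/image4.iso"])
```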
A storage node is an inexpensive 64-bit computer. Note that 64-bit CPUs are required for modern filesystems such as ZFS.
In our environment, the computer consists of the following components:
The onboard NIC is a VT6130 with jumbo frame support, and the SATA RAID controller is a VT8237S.
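Jumbo frames only help if every interface and switch port on the path carries a large MTU. A small check like the sketch below (Linux sysfs path; the interface name and 9000-byte MTU are assumptions) can catch nodes left at the default 1500 bytes:

```python
# Minimal sketch: warn if a network interface is not configured for jumbo frames.
# Assumes Linux, where the current MTU is exposed at /sys/class/net/<iface>/mtu.
def check_jumbo(iface="eth0", wanted=9000):
    with open("/sys/class/net/%s/mtu" % iface) as f:
        mtu = int(f.read().strip())
    if mtu < wanted:
        print("%s: MTU is %d; jumbo frames (MTU %d) not enabled" % (iface, mtu, wanted))
    else:
        print("%s: MTU %d ok" % (iface, mtu))

check_jumbo("eth0")
```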
Each 500GB disk is configured into two partitions with large block sizes (8KB or higher): a read-only partition and a read/write partition. For the read-only partition, striping improves read performance. For the read/write partition, striping improves read/write performance, while mirroring makes a copy to protect against drive failure.
The initial number of storage nodes (excluding the backup node) is six, with an additional hot spare. The intent is to keep the ratio of compute nodes to storage nodes at no greater than five to one, due to the massive amounts of data transfer that will take place. Originally, fifteen storage nodes were planned, with no backup and no spare nodes, which would have provided a two-to-one ratio. Performance testing will determine which ratios higher than two to one are acceptable.
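The ratio arithmetic is simple; the sketch below assumes the 30 compute nodes implied by the figures above (fifteen storage nodes at two to one):

```python
# How many storage nodes are needed to stay at or below a target compute:storage ratio.
import math

def storage_nodes_needed(compute_nodes, max_ratio):
    """Smallest number of storage nodes keeping compute:storage <= max_ratio."""
    return math.ceil(compute_nodes / max_ratio)

compute_nodes = 30   # assumed, consistent with 15 storage nodes at 2:1
print("at 5:1 ->", storage_nodes_needed(compute_nodes, 5), "storage nodes")  # 6
print("at 2:1 ->", storage_nodes_needed(compute_nodes, 2), "storage nodes")  # 15
```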
The contents of the read-only partition of the storage nodes are created on one node by lab staff or faculty and automatically replicated to all other nodes.
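One straightforward way to carry out that replication is to push the master copy to each node with rsync. The sketch below is only illustrative: the host names and paths are hypothetical, and the clustering software may provide its own replication mechanism instead.

```python
# Sketch: replicate the read-only partition from the node where staff create the
# content (assumed to be the local machine) to the other storage nodes.
# rsync -a preserves permissions and times; --delete removes files that were
# removed from the master copy. Host names and paths are hypothetical.
import subprocess

STORAGE_NODES = ["storage2", "storage3", "storage4", "storage5", "storage6"]
RO_PATH = "/export/ro/"     # trailing slash: copy the contents, not the directory itself

def replicate():
    for node in STORAGE_NODES:
        cmd = ["rsync", "-a", "--delete", RO_PATH, "%s:%s" % (node, RO_PATH)]
        subprocess.check_call(cmd)

if __name__ == "__main__":
    replicate()
```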
The software consists of:
There is one storage node that is specially designated as a backup node. It has an eSATA PCI-e card that connects to one of two 1.5TB drives in an external eSATA enclosure. Data is replicated, as needed, on a daily basis from all non-backup nodes to the external drive. Once a week the external drive is removed and placed in an offsite location, and the other external drive takes its place.
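A daily job on the backup node might look like the sketch below, which pulls from each non-backup node into a date-stamped directory on the external drive. The host names and paths are hypothetical, and rsync's --link-dest option is just one common way to keep space-efficient daily snapshots.

```python
# Sketch of the daily backup pass on the backup node. Hosts and paths are hypothetical.
# --link-dest hard-links files unchanged since the previous snapshot, so each daily
# directory looks complete but only new or changed data consumes space.
import os, subprocess, datetime

NODES    = ["storage1", "storage2", "storage3", "storage4", "storage5", "storage6"]
EXTERNAL = "/mnt/esata"                    # mount point of the external 1.5TB drive
today    = datetime.date.today().isoformat()

def backup():
    for node in NODES:
        dest = os.path.join(EXTERNAL, node, today)
        prev = os.path.join(EXTERNAL, node, "latest")
        os.makedirs(dest, exist_ok=True)
        cmd = ["rsync", "-a"]
        if os.path.exists(prev):           # link against yesterday's snapshot if present
            cmd += ["--link-dest", prev]
        cmd += ["%s:/export/rw/" % node, dest]
        subprocess.check_call(cmd)
        # repoint the "latest" symlink at today's snapshot
        tmp = prev + ".new"
        os.symlink(dest, tmp)
        os.replace(tmp, prev)

if __name__ == "__main__":
    backup()
```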
Once offsite, the data are considered to be for disaster-recovery purposes only, because there aren't enough drives with enough capacity to handle the n x 300GB maximum load. However, most of the time the load will be much less than that; half of the maximum is what the initial configuration of storage nodes covers. Over time, we envision one backup node and two drives for every four storage nodes (of 300GB of r/w storage apiece).
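The capacity arithmetic behind that statement, using the figures given above:

```python
# Capacity check for one 1.5TB external backup drive (figures from the text).
nodes          = 6        # initial storage nodes, excluding backup and spare
rw_per_node_gb = 300      # maximum read/write data per node
external_gb    = 1500     # one external eSATA drive

maximum  = nodes * rw_per_node_gb   # 1800GB: does not fit on one drive
typical  = maximum // 2             #  900GB: the expected usual load does fit
per_four = 4 * rw_per_node_gb       # 1200GB: one drive per four nodes also fits

print(maximum > external_gb, typical <= external_gb, per_four <= external_gb)
```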
If a storage node fails for any reason other than a disk failure (which is handled by RAID 1), ideally the data local to that node would be available from the backup node to copy to a spare node or to distribute across the remaining nodes. Realistically, the backup node may not have all of that node's most recent data on its current external disk (or anywhere), so the recovery would be unreliable.
One possible way to address the unreliability caused by the offline status of the other external disk is to exploit the Network Direct Attached Storage (NDAS) capability of the enclosure: place the "offline" drive on the network in another room in the same building (NDAS does not route) and remotely access the "offline" drive to get its data. Another way might be to exploit the full capacity of the backup node's two internal 500GB drives to hold all or most of the external drive's data, either by copying it just before the drive is removed for its weekly offlining or by replicating the data as it is backed up, using some kind of algorithm to decide what to keep and what to discard to make room for new data (see the sketch below).
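That keep-or-discard decision could be as simple as evicting the oldest snapshots until the new data fits. Below is a minimal sketch of such a policy; the directory layout (one directory per snapshot) and the internal space budget are assumptions, not part of the design above.

```python
# Minimal sketch of a keep-or-discard policy for mirroring the external drive's
# data onto the backup node's internal disks: delete the oldest snapshot
# directories until the incoming data fits within the space budget.
import os, shutil

BUDGET_BYTES = 900 * 1000 ** 3      # assumed internal space set aside for copies

def dir_size(path):
    """Total size in bytes of all files under a directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def make_room(cache_dir, incoming_bytes, budget=BUDGET_BYTES):
    # One subdirectory per snapshot, ordered oldest first by modification time.
    snapshots = sorted(
        (os.path.join(cache_dir, d) for d in os.listdir(cache_dir)),
        key=os.path.getmtime)
    used = sum(dir_size(s) for s in snapshots)
    while snapshots and used + incoming_bytes > budget:
        victim = snapshots.pop(0)
        used -= dir_size(victim)
        shutil.rmtree(victim)           # discard the oldest snapshot
```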