April 4, 2011

VMware, HPC clusters and clustered file systems

Lately I’ve been involved in setting up an HPC cluster. HPC clusters predate VMware solutions by quite a bit, and the software used typically had its initial versions released 15-20 years ago for non-x86 architectures. I’ve found that HPC clusters actually share quite a bit of functionality with what we see in vCenter. They use a shared file system, you have OS templates that you deploy, and you can manage all of your nodes from a single GUI. In addition, they have a batch job scheduler that places processing tasks onto the nodes; the analogy in the vCenter world is DRS load balancing VMs at power-on.
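To make the scheduler analogy a bit more concrete, here is a minimal sketch of greedy "least loaded node" placement. This is not how DRS or any particular batch scheduler actually works; the function name `place_job`, the node names and the abstract integer "cost" are all made up for illustration.

```python
from typing import Dict


def place_job(load: Dict[str, int], job_cost: int) -> str:
    """Place a job on the currently least-loaded node.

    `load` maps a (hypothetical) node name to its current load in
    arbitrary units. Ties are broken by node name so the result is
    deterministic. The chosen node's load is increased by `job_cost`.
    """
    if not load:
        raise ValueError("no nodes available")
    # Sort key is (current load, name): lowest load wins, name breaks ties.
    node = min(load, key=lambda name: (load[name], name))
    load[node] += job_cost
    return node


# Three idle nodes and four jobs of different (assumed) cost.
cluster_load = {"node1": 0, "node2": 0, "node3": 0}
placements = [place_job(cluster_load, cost) for cost in (4, 2, 2, 1)]
print(placements)
print(cluster_load)
```

A real scheduler would also look at memory, queue priorities and reservations, but the basic idea of picking a target at submission time is the same as initial VM placement at power-on.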



In most VMware environments the shared storage is a SAN or NAS array, but in this cluster world we have disk nodes with local disk controllers and local disks. Each of these nodes holds a stripe of a file system that is shared out over InfiniBand. I believe the most common HPC file system is Lustre, but the one used in our setup was the Fraunhofer Parallel Cluster File System, normally shortened to FhGFS (wonder why it’s not shortened to FHCFS...?).



Each disk node had 25 SAS disks delivered from an MDS600 shelf and a P700m disk controller with 512MB RAM. Since the file system is striped over these nodes, you get the combined performance of all the nodes when doing disk operations. As we had four nodes with local disks, we ended up with a 50TB file system that was shared among all processing nodes. When reading a file, all four disk nodes give you their slices of that file. And this all happens over InfiniBand, which is a low-latency, high-speed network.
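The striping idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: simple round-robin placement and a 1 MiB chunk size. I have not checked how FhGFS actually lays out its chunks, and the function names are made up.

```python
from typing import List, Tuple

MIB = 1024 * 1024


def stripe_layout(file_size: int, chunk_size: int,
                  num_nodes: int) -> List[Tuple[int, int, int]]:
    """Split a file into chunks and assign them round-robin to disk nodes.

    Returns a list of (node_index, file_offset, length) tuples, one per chunk.
    """
    if file_size < 0 or chunk_size <= 0 or num_nodes <= 0:
        raise ValueError("sizes and node count must be positive")
    layout = []
    offset = 0
    chunk_index = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        layout.append((chunk_index % num_nodes, offset, length))
        offset += length
        chunk_index += 1
    return layout


def bytes_per_node(layout: List[Tuple[int, int, int]],
                   num_nodes: int) -> List[int]:
    """Sum up how many bytes of the file each node has to serve."""
    totals = [0] * num_nodes
    for node, _offset, length in layout:
        totals[node] += length
    return totals


# A 10 MiB file striped in (assumed) 1 MiB chunks over four disk nodes.
example = stripe_layout(10 * MIB, 1 * MIB, 4)
print(bytes_per_node(example, 4))
```

Every node ends up holding roughly a quarter of the file, which is why a read can pull data from all four disk nodes in parallel.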

The data sets that I have seen so far in this HPC environment remind me of what we see in a VM environment. There is typically a set of config files (XML text files) and some larger binary files. The binary files I’ve seen so far vary in size from a few hundred megabytes to a few hundred gigabytes.

In a VMware environment shared storage has been key to getting all the basic cross-host functionality administered from vCenter working. A dedicated storage system can bring you much more functionality and scalability than a traditional system with direct attached storage (DAS). Using the local disks for VMFS is possible in a VMware environment as well if you’re using third-party tools like LeftHand VSA, StorMagic SvSAN or similar. Doing it that way does, however, add an extra layer that can affect performance.



Traditional HPC file systems use InfiniBand, but now that 10GbE is being introduced in more and more networks, we also see 10GbE being used for such file systems.

Nowadays we also see solid state disks being introduced. In addition, Fusion-io is coming with even faster flash technology. It will easily outperform any traditional SAN/NAS and is faster than anything else I’ve seen out there, but it is still very expensive. The price of solid state disks has dropped a bit during the past couple of years, and it will probably drop further as they go more mainstream.


Will we ever see the clustered file system VMFS getting functionality similar to what we’re currently seeing in HPC clustered file systems? A posting by Duncan Epping about possible new features of future ESX versions points to a VMware Labs page called “Lithium: Virtual Machine Storage for the Cloud”, and if you read that PDF whitepaper it shows that VMware has actually developed a clustered file system quite similar to what is being used today in HPC clusters.

We don’t know if this functionality is ready for the next ESX version, or if it’s coming at all. But it shows that such functionality is something they have been evaluating for quite some time, and it could arrive sooner, later or never. VMware is still a company owned by the storage company EMC, and who knows what EMC thinks of such functionality?



A modern SAN does, however, provide much more functionality than a traditional SAN, and a clustered file system will in reality be a JBOD stretched over several nodes. EMC will not lose all its market share because of such a file system, but for some customers I’m sure it will be a viable alternative to a low-end SAN/NAS.
