Genomics, Bioinformatics, and Computer Hardware
So I failed in my quest to write something here, in this space, every day for 30 days. Part of the excuse is that I was at my parents’ for the Thanksgiving weekend and busy with lots of things, which made it difficult, but I could have written something; the truth is it can be hard to produce something meaningful every day. So today I want to write a little about something that is at least potentially useful, if not meaningful, to some people: the design I put together at my employer for our compute and data storage cluster for our genomics work.
First, a little about what we are doing. We aren’t a genomics centre and we aren’t a research lab; this is a Molecular Diagnostics lab within the healthcare system. We will be handling data from two Illumina MiSeqs running targeted panels (TruSight Tumor 15 and TruSight Myeloid) for solid tumour and myeloid workups. We want turnaround time to be as short as possible, especially since our volume and the number of samples we can fit on each run mean we will be batching tests. We cover a large geographic region, but we aren’t exactly a massive population centre, so chances are we will be doing one run a week of each panel, at least to start. That may increase over time, especially on the solid tumour side as more and more tumour types become eligible for testing.

Data needs to come off the machine to me for bioinformatics processing, and a summary of the findings goes to the Director, who signs out the molecular reports to the pathologists or requesting clinicians. We also need to keep the raw data and other information around for a long time to comply with standards in Pathology and Laboratory Medicine. We had a relatively small budget and modest needs, but we wanted a platform that is powerful, useful for research alongside the clinical work, able to accommodate future needs, and easy to grow.
There are lots of vendors in this space, and as I discovered while spending a few months mostly gathering quotes and specs from them, they fall along a massive price gradient. Since we are purchasing through a government institution we were constrained a little by factors outside our control, and in a few cases it was also just easier, if moderately more expensive, to go with vendors we already had relationships with. For the compute cluster we ended up going with the Dell FX2 system, in the four-node configuration with IO aggregator modules, which means that communication between those four nodes doesn’t need to go through a switch but happens over the midplane, which is fantastic. Each node has two 6-core Xeon 2620 CPUs and 128 GB of RAM. Each also has a few TB of onboard storage, which is linked together into a shared GlusterFS file system; that has made things very convenient.
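For anyone curious what pooling the node-local disks into one shared namespace looks like in practice, here is a rough sketch of a Gluster volume spanning four nodes, wrapped in a small Python script for convenience. The hostnames, brick path, volume name, and mount point are hypothetical, not our actual configuration.

```python
#!/usr/bin/env python3
"""Rough sketch: pool four nodes' local storage into one Gluster volume.
Hostnames, brick path, volume name, and mount point are all hypothetical."""

import subprocess

NODES = ["fx2-node1", "fx2-node2", "fx2-node3", "fx2-node4"]  # hypothetical hostnames
BRICK = "/data/brick1/shared"   # hypothetical brick directory present on each node
VOLUME = "shared"               # hypothetical volume name

def run(cmd):
    """Echo and run a command, raising if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# From the first node, pull the other three into the trusted storage pool.
for node in NODES[1:]:
    run(["gluster", "peer", "probe", node])

# Create a plain distributed volume with one brick per node, then start it.
bricks = [f"{node}:{BRICK}" for node in NODES]
run(["gluster", "volume", "create", VOLUME, "transport", "tcp", *bricks])
run(["gluster", "volume", "start", VOLUME])

# Mount the volume so it looks like one big local directory on this node.
run(["mount", "-t", "glusterfs", f"{NODES[0]}:/{VOLUME}", "/mnt/shared"])
```

In practice you would think harder about replica counts and mount options, but the point is how little plumbing it takes to make four nodes' worth of disks behave like one file system.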
For data storage we have three units from 45 Drives. Each of these is pretty capable in its own right, with Xeon 2620 CPUs and a decent amount of RAM, and each can hold 45 4 TB hard drives. One node is configured for research purposes, where I am testing BTRFS, and the other two run ZFS. The ZFS pools use RAID-Z3 redundancy, which is roughly equivalent to a theoretical triple-parity RAID 7, so we have plenty of storage space with lots of redundancy. At the moment each node is only a third full, with 15 drives, and that still gives us something like 48 TB of usable space on the BTRFS system and 42 TB free on the ZFS systems.

Overall the design fit within a fairly midrange budget (in the scheme of things) and is very capable, flexible, and easy to grow. We can triple our storage capacity simply by adding more hard drives, and adding new storage nodes to the network is quite simple. For ease of use everything is relatively independent of everything else right now, but a distributed storage system could be configured in a pretty straightforward manner. The compute cluster can easily accommodate an HPC environment like GridEngine or Parasol, but I am testing out Apache Mesos and associated bits and pieces of that ecosystem; again, I went this direction because of the inherent flexibility of the system. I am also running a replicated Cassandra cluster on the compute nodes for live storage of anonymized sequencing results, to make annotation, retrieval, and report generation easier.
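To give a sense of what that Cassandra layer could look like, here is a hypothetical sketch using the Python driver: a keyspace replicated across the compute nodes and a table of anonymized variant calls keyed by run and sample, so results for a report can be pulled back with a single query. The contact points, keyspace, table, and column names are invented for the example, not our actual schema.

```python
"""Hypothetical sketch of a replicated store of anonymized variant calls.
Contact points, keyspace, table, and column names are made up for illustration."""

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["fx2-node1", "fx2-node2", "fx2-node3"])  # hypothetical contact points
session = cluster.connect()

# Replicate the keyspace across three nodes of the compute cluster.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS variants
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# One row per variant call, partitioned by run and clustered by sample/gene/variant.
session.execute("""
    CREATE TABLE IF NOT EXISTS variants.calls (
        run_id     text,
        sample_id  text,
        gene       text,
        hgvs       text,
        vaf        double,
        depth      int,
        PRIMARY KEY ((run_id), sample_id, gene, hgvs)
    )
""")

# Pull everything for one anonymized sample on a given run, e.g. for a report.
rows = session.execute(
    "SELECT gene, hgvs, vaf, depth FROM variants.calls "
    "WHERE run_id = %s AND sample_id = %s",
    ("RUN_2015_11_30", "SAMPLE_ANON_001"),
)
for row in rows:
    print(row.gene, row.hgvs, row.vaf, row.depth)

cluster.shutdown()
```

And for the curious, the usable-space figures above are roughly what simple arithmetic predicts: "marketing" terabytes per drive, minus the drives given over to parity. The parity assumptions here (triple parity for RAID-Z3, double on the BTRFS box) are mine, and real pools lose a bit more to metadata and reservations.

```python
"""Back-of-the-envelope check on usable space per 15-drive node.
Parity assumptions are mine; real filesystems lose a bit more to
metadata, reservations, and allocation overhead."""

TIB = 2**40
DRIVE_BYTES = 4 * 10**12      # a marketed "4 TB" drive
DRIVES_PER_NODE = 15

def usable_tib(parity_drives: int) -> float:
    """Capacity left after setting aside whole drives for parity."""
    return (DRIVES_PER_NODE - parity_drives) * DRIVE_BYTES / TIB

print(f"RAID-Z3 node (3 parity drives): ~{usable_tib(3):.0f} TiB")  # ~44 TiB
print(f"BTRFS node (2 parity drives):   ~{usable_tib(2):.0f} TiB")  # ~47 TiB
```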
All in all it is a flexible system that I think fills a midrange niche. This setup could easily be scaled up, but also down: a four-node Dell FX2 system is something an individual lab or core group could purchase, for instance, and the 45-drive storage systems are relatively inexpensive. Their design stems from 45 Drives’ work for the cloud backup company Backblaze, for whom they manufactured the enclosures and helped optimize the design of the first few models of the open-design Storage Pods. In future articles I’ll dive deeper into some of the specifics of what we are working with, the analysis pipeline, and other details.