Rise of NVMe Storage

Rajesh Tamhane

Published: July 22, 2020

The biggest explosion in the history of the universe

Forty kilometers to the north of Pune in western India lies the Giant Metrewave Radio Telescope (GMRT), staring into the sky in multiple frequency bands. It's not just one telescope but an array of thirty 45-meter-wide parabolic radio antennae. Scientists at the National Centre for Radio Astrophysics in Pune and around the world peer through this metal looking glass, searching for the secrets of the universe. How do galaxies form? What makes pulsars pulse? How exactly do supernovae explode? And closer to home, they look to the sun to understand nano-solar winds, among a myriad of other questions.

On a hot August day in 2018, GMRT spotted something – something instrumental in the discovery of the farthest galaxy known to humans. And, more recently, on another hot day in February 2020, GMRT was used to observe one of the biggest explosions in the history of the universe - the Ophiuchus Supercluster explosion.

Giant Metrewave Radio Telescope (GMRT), Pune

Looking through metal

How do scientists 'look' through GMRT? It begins with the radio antennae 'listening' for specific radio-frequency bands, from 50 MHz to 1390 MHz. Each antenna provides two analog outputs that are 180 degrees out of phase. The signal goes through an analog-to-digital converter, and the digitized samples are streamed as UDP packets to a storage device. At a clock frequency of 800 MHz, a Field Programmable Gate Array (FPGA) or a programmable CPU streams the output UDP packets at a rate of 1600 MBps, and at 1000 MHz the data rate is 1900 MBps. A single hour of observation generates a data volume of about 7.2 TB.
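The hourly data volume follows directly from the stream rate. The sketch below is a back-of-the-envelope check, assuming decimal units (1 TB = 1,000,000 MB); the function name `hourly_volume_tb` is made up for illustration. Under that assumption, 7.2 TB per hour corresponds to a sustained 2000 MBps, while 1900 MBps works out to roughly 6.84 TB, so the 7.2 TB figure reads best as a rounded upper bound.

```python
# Convert a sustained data rate into the volume produced per hour of observation.
# Assumption: decimal units, i.e. 1 TB = 1,000,000 MB.

def hourly_volume_tb(rate_mbps: float, hours: float = 1.0) -> float:
    """Return terabytes written by a stream of `rate_mbps` megabytes per second."""
    seconds = hours * 3600
    return rate_mbps * seconds / 1_000_000

# 1600 and 1900 MBps are the rates quoted above; 2000 MBps is added for comparison.
for rate in (1600, 1900, 2000):
    print(f"{rate} MBps -> {hourly_volume_tb(rate):.2f} TB per hour")
```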

This data is then written to disk on a Dell PowerEdge T620 equipped with dual Xeon processors, 64 GB of RAM, 2x dual 10G Ethernet adapters and 17x 6 TB SAS HDDs configured as a single RAID 0 volume. The 17 SAS HDDs are there just to meet the write data rate of 1.9 GBps.
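To see why so many spindles are needed, it helps to recall how RAID 0 works: consecutive stripes of the logical volume are placed on consecutive disks, so a sequential write is spread evenly across all of them. The following is a minimal sketch of that mapping, not the controller's actual implementation; the 64 KiB stripe size and the function names are assumptions chosen for illustration.

```python
from typing import Tuple

STRIPE_SIZE = 64 * 1024  # hypothetical stripe size; the real array's setting is not given


def raid0_locate(offset: int, n_disks: int, stripe_size: int = STRIPE_SIZE) -> Tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    stripe_index, within = divmod(offset, stripe_size)
    disk = stripe_index % n_disks                      # stripes rotate across the disks
    disk_offset = (stripe_index // n_disks) * stripe_size + within
    return disk, disk_offset


def per_disk_rate_mbps(total_mbps: float, n_disks: int) -> float:
    """Ideal per-disk share of a sequential write, ignoring any overhead."""
    return total_mbps / n_disks


print(raid0_locate(0, 17))                 # first stripe lands on disk 0
print(raid0_locate(STRIPE_SIZE, 17))       # next stripe lands on disk 1
print(f"{per_disk_rate_mbps(1900, 17):.1f} MBps per HDD")
```

Under this idealized model, each of the 17 HDDs has to sustain a little over 110 MBps continuously to keep up with the 1.9 GBps stream.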

Scientists use this data to run their analytical algorithms and 'look' at the sky through metal.

Overcoming data indigestion

While ingesting data at that rate (1.9 GBps) through 17 magnetic disks with moving parts worked, it caused trouble, from drive failures to packet loss.

This article is the story of how we used commodity hardware and a new type of storage to meet GMRT’s data velocity challenge.

A brief walk through memory and CPU lanes

I built my first PC in 1994. Bill Clinton was the US President and Michael Jordan had yet to win his fourth NBA Finals. My PC, however, boasted a 486DX4 Intel CPU running at 100 MHz, 16 MB of RAM and 250 MB of space on the hard disk drive. At the time, this was the state of the art in personal computing. Today, we carry far more compute power in our mobile phones. Until recently, Moore’s law has kept up its prediction and the number of transistors on chips has been doubling every couple of years. The Dell T620 at GMRT has 2 Xeon processors that run at 3 GHz and are equipped with 16 cores. That is a 30x increase in CPU clock speed and an even larger increase in performance. The clock speeds of commodity CPUs are approaching 5 GHz and those of memory have already crossed 4000 MHz.

During this period, however, the data transfer speeds of storage devices have only increased from 133 MBps to 600 MBps. That is a mere 4.5x increase in over 2 decades.

The express lane

A new type of flash storage called Non-Volatile Memory Express (NVMe) is closing the chasm that existed between memory and storage speeds. At 3 GBps, it is 5x faster than a SATA SSD and 25x faster than traditional HDDs. To be precise, NVMe is a protocol used by new-generation NAND-based storage devices. It runs over the PCIe bus, and that is one of the reasons it's blazingly fast.

So, when we designed our storage pods for the data intensive computing cluster, NVMes were the go-to choice for storage. We put together several pods using the AMD Ryzen CPU and chose to use NVMe flash storage as a part of a cluster. When we ran the FIO tests, our benchmarks resulted in random read-write throughputs of 3 GBps on a single consumer grade NVMe. That is 3 gigabytes per second. Combining 3 of the NVMes into a single disk using RAID 0 resulted in transfer speeds of close to 10 GBps.
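The exact FIO parameters used for these benchmarks are not recorded here, so the following is only a sketch of what a mixed random read-write job might look like. The block size, queue depth, job count, file size and target path are all assumptions chosen for illustration. The helper builds the command line without running it; point it at a scratch file rather than a raw device, since a write test against a device destroys its contents.

```python
import shlex
from typing import List


def fio_randrw_command(target: str, runtime_s: int = 60) -> List[str]:
    """Build an fio argument list for a mixed random read/write test (not executed here)."""
    return [
        "fio",
        "--name=nvme-randrw",
        f"--filename={target}",   # hypothetical scratch file on the NVMe volume
        "--rw=randrw",            # mixed random reads and writes
        "--bs=128k",              # assumed block size
        "--ioengine=libaio",      # Linux asynchronous I/O engine
        "--iodepth=32",           # assumed queue depth per job
        "--numjobs=4",            # assumed number of parallel jobs
        "--direct=1",             # bypass the page cache
        "--size=10G",             # assumed file size per job
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",      # aggregate results across jobs
    ]


cmd = fio_randrw_command("/mnt/nvme/fio-testfile")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell on the storage pod, or passed to `subprocess.run` once the target path has been checked.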

Rubber meets the road

We took the findings to the National Centre for Radio Astrophysics (NCRA) in Pune, and they offered to let us test the storage pod at their observatory.

We ran 4 tests to determine whether the NVMes could match the write performance of the existing HDD-based storage configuration. These tests are a more realistic representation of the real-world write performance of NVMes than the FIO benchmarks: the FIO tests wrote data from memory to the disk, while in these tests the data was written from the network interface card to a ring buffer and then copied to the disk.
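The ring buffer is what decouples the network from the disk: packets keep arriving at a fixed rate, and the writer must drain them before the buffer wraps around. The sketch below is a simplified model of that arrangement, not the acquisition software used at GMRT. The class, the `simulate` helper and the per-tick rates are hypothetical and use arbitrary units.

```python
class RingBuffer:
    """Fixed-capacity FIFO; incoming packets are dropped when the buffer is full."""

    def __init__(self, capacity: int):
        self._slots = [None] * capacity
        self._head = 0       # index of the oldest buffered packet
        self._count = 0      # number of packets currently buffered
        self.dropped = 0     # packets rejected because the buffer was full

    def push(self, packet: bytes) -> bool:
        if self._count == len(self._slots):
            self.dropped += 1
            return False
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = packet
        self._count += 1
        return True

    def pop(self):
        if self._count == 0:
            return None
        packet = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return packet


def simulate(arrivals_per_tick: int, writes_per_tick: int, ticks: int, capacity: int) -> int:
    """Return how many packets are dropped for fixed arrival and drain rates."""
    ring = RingBuffer(capacity)
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):   # network side fills the buffer
            ring.push(b"x")
        for _ in range(writes_per_tick):     # disk side drains it
            ring.pop()
    return ring.dropped


print("drain keeps up:", simulate(16, 16, 1000, 64), "dropped")
print("drain too slow:", simulate(17, 16, 1000, 64), "dropped")
```

The point of the model is that a buffer only absorbs bursts. Once the arrival rate exceeds the drain rate for long enough, any finite buffer fills and packets are lost.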

The results showed that a single NVMe was able to sustain a write speed of 1.6 GBps without filling up the buffer, but started to drop packets when the data rate was increased to 1.7 GBps. In a RAID 0 configuration with 3 NVMes, no bottlenecks were observed even at 1.9 GBps.

Lessons learned

NVMes let you create low to medium density storage using commodity hardware at a very attractive price point. This can be useful in computing applications that are read-write intensive and deal with large file sizes. By choosing an NVMe SSD with the appropriate TBW (Total Bytes Written), a data acquisition system could be built with both lower cost and lower power consumption. (Side note: the NVMes used for the test were rated at 10W.)
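Endurance matters because an acquisition system writes continuously. As a rough illustration, the sketch below estimates how many hours of sustained writing it takes to consume a drive's TBW rating when the stream is striped evenly across several drives. The 600 TB rating is a hypothetical figure, not the rating of the drives we tested, and the calculation ignores write amplification.

```python
def hours_until_tbw(tbw_terabytes: float, write_rate_mbps: float, n_drives: int = 1) -> float:
    """Hours of continuous writing before each drive reaches its TBW rating.

    Assumes decimal units (1 TB = 1,000,000 MB), an even split of the stream
    across `n_drives`, and no write amplification.
    """
    per_drive_mbps = write_rate_mbps / n_drives
    seconds = tbw_terabytes * 1_000_000 / per_drive_mbps
    return seconds / 3600


HYPOTHETICAL_TBW = 600  # terabytes; illustrative only

single = hours_until_tbw(HYPOTHETICAL_TBW, 1600)
striped = hours_until_tbw(HYPOTHETICAL_TBW, 1900, n_drives=3)
print(f"single drive at 1600 MBps: {single:.0f} hours")
print(f"3-drive stripe at 1900 MBps: {striped:.0f} hours")
```

Plugging in the real TBW rating and the expected observing hours per year gives a first estimate of how often the drives would need replacing.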

The CPU architecture is important when designing storage nodes with NVMes. The number of PCIe lanes limits the storage node's density. Hardware RAID can free up CPU cycles by offloading some of the work involved in managing the RAID volume.
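The lane budget can be estimated with simple arithmetic. NVMe SSDs typically use a PCIe x4 link, so the number of drives a node can host at full bandwidth is the number of lanes left over after other devices, divided by four. The lane counts in the sketch below are hypothetical and should be replaced with the figures for the actual CPU and motherboard.

```python
def max_nvme_drives(total_lanes: int, reserved_lanes: int, lanes_per_drive: int = 4) -> int:
    """Number of NVMe drives that fit in the lanes left after other devices."""
    usable = max(total_lanes - reserved_lanes, 0)
    return usable // lanes_per_drive


# Hypothetical platforms: (CPU lanes available, lanes reserved for NICs, GPU, etc.)
for name, total, reserved in [("desktop-class", 24, 8), ("server-class", 128, 16)]:
    print(f"{name}: up to {max_nvme_drives(total, reserved)} NVMe drives")
```

PCIe switches can raise the drive count, but the drives then share upstream bandwidth, so this estimate is better read as the limit for full-speed attachment.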

Points to ponder over

Persistent storage is closing the gap with volatile memory. With the arrival of NVDIMM (Non-Volatile Dual Inline Memory Modules), this boundary may completely disappear. We are already seeing high-bandwidth memory impact the performance of data intensive applications. Data intensive applications, databases, data structures and algorithms have all factored in the latency that has existed with persistent storage. How will these disappearing boundaries affect the way our algorithms are written and the way our database engines are designed?

Acknowledgements

I would like to thank Dr. Yashwant Gupta, Director, National Centre for Radio Astrophysics, whose encouragement and keen questioning helped us examine our assumptions and explore further. This work has been a team effort with heavy lifting from Saurabh Mookherjee and Swapnil Khandekar. Saurabh’s deep systems experience has been crucial in architecting the compute cluster, while Swapnil’s ability to navigate across systems and code made trivial work of some of the hardest problems. My colleagues, Chhaya Yadav and Prasanna Pendse, have been instrumental in getting this article into shape. And finally, thanks to Harshal Hayatnagarkar, who introduced us to the 4th paradigm of computing and started us on this journey.
