Advances in machine learning, and in neural networks in particular, have made it possible to process ever-larger volumes of stored data. The traditional approach has been to move the data to the algorithm, but does it really make sense to move a massive dataset (up to 1 PB) to an algorithm that may occupy only a few tens of megabytes? The idea of processing data closer to where it is stored is therefore attracting a lot of attention. This article examines the theory and practice of computational storage and shows how a computational storage processor (CSP) can provide hardware acceleration and performance gains for many compute-intensive tasks without placing a significant overhead on the host processor.
The Rise of Large Datasets
The use of neural network algorithms in automotive, industrial, security, and consumer applications has increased significantly in recent years. Edge-based IoT sensors have traditionally processed only small volumes of data and hence used algorithms that occupy little code space. However, as the processing capability of microcontrollers grows and their power consumption falls, the use of machine learning algorithms in edge applications has begun to grow exponentially. Convolutional neural networks are used for vision processing and object detection in industrial and automotive applications. For example, vision processing systems can detect whether labels are correctly attached to bottles on a high-speed industrial production line, and they are also suited to more complicated tasks such as sorting objects by type, condition, and size. The requirement for real-time, multi-object classification and identification in automotive applications leverages neural networks even more fully. Beyond these markets, neural networks are also widely used in scientific research, for example to process the large volumes of data collected from remote sensing satellites and from arrays of earthquake sensors distributed around the globe.
In the majority of applications, machine learning is used to increase the probability of correctly observing and classifying objects. However, training algorithms for this purpose requires large datasets (up to a petabyte), which are challenging to move, process, and store.
The popularity of NAND flash-based storage has grown considerably in recent years, and it is no longer confined to high-end storage applications: commodity solid-state storage is replacing magnetic disk drives in laptop and desktop computers. The proliferation of solid-state storage, in combination with the rise of the non-volatile memory express (NVMe) protocol (enabling higher bandwidth, lower latency, and higher storage density) and the increased data rates made possible by PCIe connectivity, provides us with the opportunity to rethink how we use storage and computing resources.
The traditional approach shown in Figure 1 moves data between the compute and storage planes. Compute resources are used for data transfer, processing, compression, and decompression, along with a multitude of other system-related tasks. In combination, these place a heavy load on available resources.
The computational storage architecture shown in Figure 2 is a more efficient approach. It uses a hardware accelerator (typically housed on an FPGA) to perform the more computationally intensive tasks. By placing NVMe flash storage adjacent to the hardware accelerator and connecting the two directly, the CPU is no longer required to move data from its storage location to the point of processing, significantly reducing its operational overhead. As shown in Figure 3, the FPGA performs the role of a computational storage processor (CSP), relieving the CPU of computationally intensive tasks such as compression, encryption, or neural network inferencing.
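As a rough illustration of why this matters, the sketch below models the PCIe link time needed to ship a petabyte-scale dataset to the host versus returning only compact results from a CSP. All figures here (the ~7.8 GB/s usable bandwidth of an assumed PCIe Gen4 x4 link and the 1 GB result size) are illustrative assumptions, not vendor measurements.

```python
# Back-of-the-envelope model (illustrative assumptions, not vendor figures):
# compare moving a large dataset to the host for processing versus
# processing it in place with a computational storage processor (CSP).

def transfer_time_s(data_bytes, link_bytes_per_s):
    """Time to move data_bytes over a link of the given sustained bandwidth."""
    return data_bytes / link_bytes_per_s

PCIE_GEN4_X4 = 7.8e9      # ~7.8 GB/s usable on a PCIe Gen4 x4 link (assumed)
DATASET = 1e15            # 1 PB dataset, as discussed in the article
RESULTS = 1e9             # 1 GB of processed results returned (assumed)

# Traditional architecture: the full dataset crosses the link to the host.
traditional_s = transfer_time_s(DATASET, PCIE_GEN4_X4)

# Computational storage: only the (much smaller) results cross the link.
csp_s = transfer_time_s(RESULTS, PCIE_GEN4_X4)

print(f"traditional: {traditional_s / 3600:.1f} h of link time")
print(f"CSP:         {csp_s:.2f} s of link time")
```

Even before accounting for host CPU cycles, the data movement alone differs by the ratio of dataset size to result size, which is the core argument for processing close to storage.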
The BittWare IA-220-U2 FPGA-based Computational Storage Processor
An example of a computational storage processor is the IA-220-U2 from BittWare. The IA-220-U2 features an Intel Agilex FPGA with up to 1.4 M logic elements, up to 16 GB of DDR4 memory, and a PCIe Gen4 x4 host interface. The DDR4 SDRAM transfers data at rates of up to 2,400 MT/s. The card uses an SFF-8639-compliant 2.5-inch U.2 form factor with a convection-cooled heatsink and is designed to be incorporated into a U.2 NVMe storage array as shown in Figure 4.
It typically consumes 20 W and supports hot swapping. The IA-220-U2 has an on-board NVMe-MI-compliant SMBus controller, an SMBus FPGA flash control function, and SMBus access to on-board voltage and temperature monitoring sensors, making it suitable for enterprise IT and datacentre applications. The functional block diagram and key features of the BittWare IA-220-U2 are illustrated in Figure 5.
The IA-220-U2 is designed to perform a variety of acceleration tasks in high-volume applications, including neural network inferencing, compression, encryption, hashing, image search, database sorting, and deduplication.
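The peak bandwidths behind the interface figures quoted above can be worked out directly. In the sketch below, the 2,400 MT/s DDR4 rate comes from the board description, while the 64-bit DDR4 bus width and the four-lane (x4) PCIe Gen4 link are assumptions made for illustration.

```python
# Peak-bandwidth arithmetic for the interfaces described above.
# Assumptions: 64-bit (8-byte) DDR4 bus width, PCIe Gen4 x4 link.

DDR4_MT_S = 2400e6            # DDR4 transfer rate: 2,400 MT/s
DDR4_BUS_BYTES = 8            # assumed 64-bit (8-byte) memory interface
ddr4_peak = DDR4_MT_S * DDR4_BUS_BYTES            # bytes per second

PCIE_GT_S = 16e9              # PCIe Gen4: 16 GT/s per lane
LANES = 4                     # assumed x4 link
ENCODING = 128 / 130          # 128b/130b line-encoding overhead
pcie_peak = PCIE_GT_S * LANES * ENCODING / 8      # bytes per second

print(f"DDR4 peak:         {ddr4_peak / 1e9:.1f} GB/s")   # 19.2 GB/s
print(f"PCIe Gen4 x4 peak: {pcie_peak / 1e9:.2f} GB/s")   # 7.88 GB/s
```

These raw numbers show why the local DDR4 buffer can comfortably feed an accelerator that would otherwise be throttled by the narrower PCIe path to the host.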
CSP Implementation using the BittWare IA-220-U2
The BittWare IA-220-U2 can be delivered as a pre-configured solution using Eideticom's NoLoad IP. Alternatively, it can be user-programmed for custom applications.
BittWare supports custom development by providing an SDK that includes PCIe drivers, board monitoring utilities, and board libraries. FPGA application development can be performed using Intel's Quartus Prime Pro and high-level synthesis toolchains and design flows.
Eideticom's NoLoad IP provides a pre-configured, plug-and-play solution with an integrated software stack based on the BittWare U.2 module. It also provides a set of hardware-accelerated computational storage services (CSSs), highlighted in orange in Figure 6.
Figure 7 shows the software components of the NoLoad IP, which include a kernel-space stacked file system and an NVMe driver that use the NoLoad CSSs, together with libnoload, an application-oriented user-space library.
The offloading capabilities of Eideticom's CPU-agnostic NoLoad solution improve quality of service (QoS) by up to 40x and combine the advantages of a lower cost of ownership with reduced power consumption.
Offloading Compute Intensive Tasks Accelerates Throughput
Using an NVMe-based computational storage architecture provides better performance and uses less power in large data-processing applications. This architecture reduces the need to transfer data from the point of storage to a processor (and back) by instead using an FPGA-based computational storage processor to perform compute-intensive tasks. Storing data close to the point of processing on NVMe NAND flash arrays saves energy while also reducing latency and the required bandwidth.
Source: Mouser Electronics