MYRTLE’S RECURRENT NEURAL NETWORK ACCELERATOR HANDLES 4000 SIMULTANEOUS SPEECH-TO-TEXT TRANSCRIPTIONS WITH JUST ONE FPGA, OUTPERFORMS GPU IN TOPS, LATENCY, AND EFFICIENCY

A speech-to-text (STT) transcription application running on Myrtle’s scalable inference engine, which is based on the company’s MAU Accelerator cores, was one of the four accelerated workloads discussed in the recent blog about HPE’s addition of an Intel® FPGA Programmable Acceleration Card (PAC) D5005 option to its ProLiant DL380 Gen10 server. (See “HPE now delivering ProLiant DL380 Gen10 Servers with optional Intel® D5005 Programmable Accelerator Card based on Intel Stratix® 10 FPGAs.”) This STT workload can process more than 4000 voice channels in real time when accelerated by just the one Intel Stratix 10 FPGA on the Intel FPGA PAC D5005.

Myrtle’s scalable inference engine handles the high-performance STT workload with a grid of MAU Accelerator cores instantiated in the FPGA. The MAU Accelerator cores are optimized for the Intel FPGA PAC D5005. Myrtle has published a nine-page paper titled “Exploiting Unstructured Sparsity on Next-Generation Datacenter Hardware” that discusses the details of its MAU Accelerator. An Intel Solution Brief titled “Accelerating Speech Workloads for the Data Center,” which you’ll find under the AI tab on the Intel FPGA Acceleration Hub Web Page, provides additional details.

Rather than repeat the many technical details of Myrtle’s STT solution in this blog, let’s skip right to the Solution Brief’s conclusions:

• A Mozilla DeepSpeech workload running on an Intel Stratix 10 FPGA using the Myrtle STT implementation with sparsity and quantization optimizations delivered 54 TOPS, which was slightly better than the throughput of an NVIDIA Tesla V100 GPU running code that was optimized for throughput. At the same time, the FPGA-based MAU Accelerator was 6X more efficient than the GPU when measured in TOPS per watt. The Myrtle STT implementation has a latency of 0.343 milliseconds per 1 second of speech input while the GPU has a latency of 126 milliseconds, a 365X difference. These performance improvements came at the cost of slightly lower accuracy for the Myrtle STT implementation because of the sparsity and quantization optimizations, but the reduction is just 0.23 percent.

• Optimizing the GPU solution for latency instead of throughput reduces the GPU’s latency to 10.1 milliseconds, which is 29X slower than the 0.343 milliseconds achieved by the FPGA-accelerated Myrtle solution. However, the GPU’s latency reduction comes at a cost: its throughput drops from 53.37 to 1.12 TOPS, which is 48X lower than the throughput of the FPGA-accelerated Myrtle STT solution. (Note that the FPGA-accelerated Myrtle STT solution delivers high throughput and low latency at the same time.)
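
As a quick sanity check on the arithmetic in the two bullets above, the following Python sketch recomputes the comparison ratios from the figures quoted in this blog. The constant and function names are illustrative only and do not come from Myrtle or Intel. Because the quoted inputs are rounded, the computed ratios may differ slightly from the multiples given in the Solution Brief.

```python
# Figures as quoted in this blog (rounded values from the Intel Solution Brief).
FPGA_LATENCY_MS = 0.343                   # Myrtle STT, per 1 second of speech
GPU_LATENCY_MS_THROUGHPUT_MODE = 126.0    # GPU code optimized for throughput
GPU_LATENCY_MS_LATENCY_MODE = 10.1        # GPU code optimized for latency
FPGA_TOPS = 54.0
GPU_TOPS_THROUGHPUT_MODE = 53.37
GPU_TOPS_LATENCY_MODE = 1.12


def ratio(numerator: float, denominator: float) -> float:
    """Return how many times larger the numerator is than the denominator."""
    return numerator / denominator


def summarize() -> dict:
    """Recompute the comparison multiples from the quoted figures."""
    return {
        "latency advantage vs throughput-optimized GPU":
            ratio(GPU_LATENCY_MS_THROUGHPUT_MODE, FPGA_LATENCY_MS),
        "latency advantage vs latency-optimized GPU":
            ratio(GPU_LATENCY_MS_LATENCY_MODE, FPGA_LATENCY_MS),
        "throughput advantage vs throughput-optimized GPU":
            ratio(FPGA_TOPS, GPU_TOPS_THROUGHPUT_MODE),
        "throughput advantage vs latency-optimized GPU":
            ratio(FPGA_TOPS, GPU_TOPS_LATENCY_MODE),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}X")
```

With these rounded inputs, the latency-optimized comparisons come out near the quoted 29X and 48X. The throughput-optimized latency ratio comes out near 367X rather than the quoted 365X, presumably because the brief worked from unrounded measurements.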

If these results interest you, be sure to click over to Myrtle’s paper and download the related Intel Solution Brief for more details.

Legal Notice and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel does not control or audit third-party data. You should review this content, consult other sources, and confirm whether referenced data are accurate.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel, the Intel logo, Intel Xeon, Intel Arria, and Intel eASIC are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.