A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto
Copyright © 2011 by Martin Labrecque
Overlay Architectures for FPGA-Based Software Packet Processing
Martin Labrecque, Doctor of Philosophy, Graduate Department of Electrical and Computer Engineering, University of Toronto, 2011

Packet processing is the enabling technology of networked information systems such as the Internet and is usually performed with fixed-function, custom-made ASICs. As communication protocols evolve rapidly, there is increasing interest in adapting features of the processing over time and, since software is the preferred way of expressing complex computation, we are interested in finding a platform to execute packet processing software with the best possible throughput. Because FPGAs are widely used in network equipment and can implement processors, we are motivated to investigate executing software directly on the FPGAs. Off-the-shelf soft processors on FPGA fabric are currently geared towards performing embedded sequential tasks; in contrast, network processing is most often inherently parallel between packet flows, if not between individual packets. Our goal is to allow multiple threads of execution in an FPGA to reach a higher aggregate throughput than commercially available shared-memory soft multiprocessors via improvements to the underlying soft processor architecture. We study a number of processor pipeline organizations to identify which ones can scale to a larger number of execution threads and find that tuning multithreaded pipelines can provide compact cores with high throughput. We then perform a design space exploration of multicore soft systems, compare single-threaded and multithreaded designs to identify scalability limits, and develop processor architectures that allow threads to execute with as few architectural stalls as possible, in particular through instruction replay and static hazard detection mechanisms. To further reduce wait times, we allow threads to execute speculatively by leveraging transactional memory.
Our multithreaded multiprocessor, along with our compilation and simulation framework, makes the FPGA easy to use for an average programmer, who can write an application as a single thread of computation with coarse-grained synchronization around shared data structures. Compared with multithreaded processors using lock-based synchronization, we measure up to 57% additional throughput with the use of transactional-memory-based synchronization. Given our applications, gigabit interfaces and 125 MHz system clock rate, our results suggest that soft processors can process packets in software at high throughput and low latency, while capitalizing on the FPGAs already available in network equipment.
First, I would like to thank my advisor, Gregory Steffan, for his constructive comments throughout this project and especially for his patience in showing me how to improve my writing.
I also acknowledge my labmates and groupmates for their camaraderie. I also thank the UTKC, my sempais and Tominaga Sensei for introducing me to the ‘basics’.
Special thanks go to my relatives for their encouragement and support. Most of the credit goes to my parents, who are coaches and unconditional fans of mine. An important mention goes to my brother for helping me move in Toronto. My beloved Mayrose also deserves a special place on this page for making the sun shine even on rainy days.
Finally, the financial support from the “Fonds québécois de la recherche sur la nature et les technologies” made this work possible.
List of Figures
List of Tables
Glossary
1 Introduction
1.1 FPGAs in Packet Processing
1.2 Soft Processors in Packet...