Redesigning Database Systems in Light of
CPU Cache Prefetching
Shimin Chen
CMU-CS-05-192
December 2005
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Thesis Committee:
Anastassia Ailamaki, Co-Chair
Todd C. Mowry, Co-Chair
Christos Faloutsos
Phillip B. Gibbons
David J. DeWitt, University of Wisconsin at Madison
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy.
Copyright © 2005 Shimin Chen
This research was sponsored by the National Science Foundation (NSF) under grant nos. CCR-0205544 and CCR-0085938,
the National Aeronautics and Space Administration (NASA) under grant no. NAG2-1230, and the Microsoft Corporation
through a generous fellowship. The views and conclusions contained in this document are those of the author and should
not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S.
government or any other entity.
Keywords: Cache Prefetching, Database Systems, CPU Cache Performance, Data Locality Optimizations, B+-Trees, Hash Joins.
To my parents Guanhu and Aili, my wife Qin, and my daughter Ada.
Abstract
Computer systems have enjoyed an exponential growth in processor speed for the past 20 years, while
main memory speed has improved only moderately. Today a cache miss to main memory takes hundreds
of processor cycles. Recent studies have demonstrated that in commercial databases, 50% or more of
execution time is often wasted stalling on cache misses to main memory. In light of this problem, a number of
recent studies have focused on reducing the number of cache misses incurred by database algorithms. In this thesis, we
investigate a different approach: reducing the impact of cache misses through a technique called cache
prefetching. Since prefetching for sequential array accesses has been well studied, we are interested in
studying non-contiguous access patterns found in two classes of database algorithms: the B+-Tree index
algorithm and the hash join algorithm. We re-examine their designs with cache prefetching in mind, and
combine prefetching and data locality optimizations to achieve good cache performance.
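To make the mechanism concrete, the following is a minimal sketch (not code from this thesis) of how a software prefetch can be issued from C through the GCC-style __builtin_prefetch intrinsic; the pointer-array traversal and the prefetch distance of 8 are illustrative assumptions only.

    #include <stddef.h>

    #define PREFETCH_DISTANCE 8   /* illustrative tuning knob */

    /* Sum records reached through an array of pointers (a non-contiguous
     * access pattern). Prefetching a record several iterations ahead lets
     * its cache miss overlap with the work on the current records. */
    long sum_indirect(const long *const *ptrs, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DISTANCE < n)
                __builtin_prefetch(ptrs[i + PREFETCH_DISTANCE], 0, 3);
            sum += *ptrs[i];
        }
        return sum;
    }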
For B+-Trees, we first propose and evaluate a novel main memory index structure, Prefetching B+-
Trees, which uses prefetching to accelerate two major access patterns of B+-Tree indices: searches and
range scans. We then apply our findings in the development of a novel index structure, Fractal Prefetch-
ing B+-Trees, that optimizes index operations both for CPU cache performance and for disk performance
in commercial database systems by intelligently embedding cache-optimized trees into disk pages.
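The core idea behind the wider nodes can be sketched as follows (the node layout and constants are hypothetical, not the implementation evaluated in this thesis): before searching a node that spans several cache lines, prefetch all of its lines so that their misses are serviced in parallel rather than one at a time.

    #include <stddef.h>

    #define CACHE_LINE_SIZE 64

    typedef struct btree_node {
        int nkeys;
        int keys[127];                      /* node spans multiple cache lines */
        struct btree_node *children[128];
    } btree_node;

    /* Prefetch every cache line of a wide node before searching it, so the
     * lines are fetched concurrently instead of serially on demand. */
    static inline void prefetch_node(const btree_node *node)
    {
        const char *p = (const char *)node;
        for (size_t off = 0; off < sizeof(btree_node); off += CACHE_LINE_SIZE)
            __builtin_prefetch(p + off, 0, 3);
    }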
For hash joins, we first exploit cache prefetching separately for the I/O partition phase and the join
phase of the algorithm. We propose and evaluate two techniques, Group Prefetching and Software-
Pipelined Prefetching, that exploit inter-tuple parallelism to overlap cache misses across the processing
of multiple tuples. Then we present a novel algorithm, Inspector Joins, that exploits the free information
obtained from one pass of the hash join algorithm to improve the performance of a later pass. This new
algorithm addresses the memory bandwidth sharing problem in shared-bus multiprocessor systems.
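As a sketch of the inter-tuple parallelism idea (the hash-table layout, hash function, and group size below are illustrative assumptions, not the thesis implementation), group prefetching restructures the probe loop so that the bucket accesses of a whole group of tuples are prefetched before any of them is visited:

    #include <stddef.h>

    #define GROUP_SIZE 16   /* illustrative group size */

    typedef struct bucket { long key; long payload; struct bucket *next; } bucket;

    /* Probe a chained hash table for a batch of keys. Stage 1 hashes every
     * tuple in the group and prefetches its bucket slot; stage 2 then walks
     * the chains, so the cache misses of different tuples overlap. (The full
     * technique uses one stage per dependent miss along the chain.) */
    void probe_group(bucket **table, size_t mask, const long *keys, size_t n,
                     void (*emit)(long key, long payload))
    {
        for (size_t base = 0; base < n; base += GROUP_SIZE) {
            size_t end = base + GROUP_SIZE < n ? base + GROUP_SIZE : n;
            size_t slot[GROUP_SIZE];

            for (size_t i = base; i < end; i++) {          /* stage 1 */
                slot[i - base] = (size_t)keys[i] & mask;   /* toy hash */
                __builtin_prefetch(&table[slot[i - base]], 0, 3);
            }
            for (size_t i = base; i < end; i++)            /* stage 2 */
                for (bucket *b = table[slot[i - base]]; b != NULL; b = b->next)
                    if (b->key == keys[i])
                        emit(b->key, b->payload);
        }
    }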
We compare our techniques against state-of-the-art cache-friendly algorithms for B+-Trees and hash
joins through both simulation studies and real machine experiments. Our experimental results demon-
strate the dramatic performance benefits of our cache-prefetching-enabled techniques.
Acknowledgments
I must thank my advisors, Todd Mowry and Anastassia Ailamaki, for the tremendous time, energy, and
wisdom they invested in my Ph.D. education. Todd and Natassa taught me everything from choosing
research topics, to performing high-quality studies, to writing papers and giving talks. Their guidance
and support throughout the years have been invaluable.
I am indebted to Phillip Gibbons for collaborating on my entire thesis research, for participating
in our weekly discussions, and for contributing his knowledge, sharpness, and effort to all four of the
publications that form the basis of this thesis.
I would like to thank the other members of my Ph.D. thesis committee, David DeWitt and Christos
Faloutsos, for their thoughtful comments and invaluable suggestions that have improved the quality of
the experimental results and the completeness of this thesis.
I thank Gary Valentin for collaborating on the fpB+-Tree paper. Gary implemented jump pointer
array prefetching in IBM DB2 and collected DB2 results on B+-Tree range scan I/O performance for
the paper. I am grateful to Bruce Lindsay for giving a CMU DB Seminar talk on the DB2 hash join
implementation, which inspired my hash join work. My thesis work also benefitted from insightful
discussions with David Lomet and Per-Åke Larson.
I would like to acknowledge Angela Brown, Kun Gao, John Griffin, Stratos Papadomanolakis, Jiri
Schindler, Steve Schlosser, and Minglong Shao, for helping me set up machines for running real-machine
experiments in my publications and in this thesis; Chris Colohan and Gregory Steffan, for helping me un-
derstand the internals of the “cello” simulator for my simulation studies; members of the CMU database
group and STAMPede group, for giving comments on my practice talks; staff of CMU SCS facilities,
for allowing me to reboot machines in the machine room after working hours; and Sharon Burks, for her
administrative help in arranging my thesis proposal and defense talks.
My summer internships expanded my research experience. I would like to thank David Lomet and
Phillip Gibbons for mentoring me during my internships. I also want to thank all the people in IBM
Toronto Lab, Microsoft Research Redmond, and Intel Research Pittsburgh, for making my summers
productive and enjoyable.
I thank my friends in Pittsburgh, Toronto, and Seattle for their help in my CMU graduate life and in
my summer internships, and for adding a lot of fun to my life.
Finally, I must express my deepest gratitude to my family. I owe a great deal to my parents, Guanhu
and Aili, who gave me life, endless love, and constant encouragement. My sister, Yan, and brother-
in-law, Gang, have always been helpful and supportive. I am deeply indebted to my dear wife, Qin, for sharing
every moment of my CMU graduate life with me. Without her love, patience, encouragement, and
support, it would have been impossible for me to complete this six-and-a-half-year journey. Last but not least,
my one-year-old daughter, Ada, motivated me to finish this thesis with her sweet voice saying “baba”.
Contents
Abstract v
Acknowledgments vii
Contents ix
List of Figures xv
List of Tables xxi
1 Introduction 1
1.1 Can We Simply Adapt Memory-to-Disk Techniques? . . . . . . . . . . . . . . . . . . . 2
1.2 Cache Optimization Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Reducing the Number of Cache Misses . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Reducing the Impact of Cache Misses . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Our Approach: Redesigning Database Systems in Light of CPU Cache Prefetching . . . 9
1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Related Work on B+-Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Related Work on Hash Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Exploiting Cache Prefetching for Main Memory B+-Trees 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Previous Work on Improving the Cache Performance of Indices . . . . . . . . . 18
2.1.2 Our Approach: Prefetching B+-Trees . . . . . . . . . . . . . . . . . . . . . 19
2.2 Index Searches: Using Prefetching to Create Wider Nodes . . . . . . . . . . . . . . . . 20
2.2.1 Modifications to the B+-Tree Algorithm . . . . . . . . . . . . . . . . . . . 22
2.2.2 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Index Scans: Prefetching Ahead Using Jump-Pointer Arrays . . . . . . . . . . . . . . . 27
2.3.1 Solving the Pointer-Chasing Problem . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2 Implementing Jump-Pointer Arrays to Support Efficient Updates . . . . . . . . . 30
2.3.3 Prefetching Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.4 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.5 Internal Jump-Pointer Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.1 Itanium 2 Machine Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.2 Simulation Machine Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.3 B+-Trees Studied and Implementation Details . . . . . . . . . . . . . . . . 41
2.4.4 A Simple Cache Prefetching Experiment: Measuring Memory Bandwidth on the
Itanium 2 Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.5 Search Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.6 Range Scan Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.4.7 Update Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.4.8 Operations on Mature Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.4.9 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.4.10 Cache Performance Breakdowns . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.4.11 Impact of Larger Memory Latency . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.5 Discussion and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69