Locality

Program performance on modern hardware is characterized by \emph{locality of reference}: accessing data that is close in address space to recently accessed data is faster than accessing data at a random location. This is due to many architectural features, including caches, prefetching, virtual address translation, and the physical properties of a hard disk drive; modeling all the components that constitute the performance of a modern machine is infeasible, especially for general algorithm design purposes. What if one could prove an algorithm asymptotically optimal on all systems that reward locality of reference, no matter how that reward manifests itself within reasonable limits? We show that this is possible: excluding some pathological cases, cache-oblivious algorithms that are asymptotically optimal in the ideal-cache model are asymptotically optimal in any reasonable setting that rewards locality of reference. This is surprising because the cache-oblivious framework envisions a particular architectural model, involving blocked memory transfers into a multi-level hierarchy of caches of varying sizes, and was not designed to directly model locality-of-reference-correlated performance.


Introduction
Modeling the memory access time of modern computers is an important area of research at the intersection of theoretical computer science, algorithm engineering, and practical aspects of computing. Modern computers are extremely complicated, with numerous components that try to reduce the access time of elements in memory. Consequently, the access time varies by orders of magnitude depending on whether favorable conditions are met. Algorithm design choices strongly influence whether those favorable conditions are reached, which necessitates good theoretical models of the memory structures of modern computers.
The main direction of existing theoretical work has been the modeling of the memory hierarchy. This line of work was initiated by the Disk Access Model (DAM) [3], also known as the I/O model or the External Memory (EM) model. The DAM assumes a (fast) memory of size M and a disk (i.e., slow memory) of infinite size. The disk stores the input in blocks of size B and can be read or written via input-output (I/O) operations, where each I/O transfers one block of data at unit cost. The analysis then typically considers only the number of I/Os (reads or writes), ignoring any other computational cost. The justification is that, since a disk is so much slower than internal memory, minimizing the number of block transfers and ignoring everything else is a good model of runtime. However, this is not always a realistic assumption. For example, when the DAM is used to model cached memory (modeling cache misses as I/O operations, M as the size of the cache, and B as the size of the cache lines), the cost of a cache miss relative to the arithmetic operations is much smaller. Aside from this, the DAM has additional limitations: e.g., it ignores the fact that accessing adjacent blocks on a disk is in practice much faster than accessing two random blocks [13], and it models only two levels of memory.
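To make the DAM accounting concrete, here is a minimal sketch (our own illustration, with hypothetical helper names, not from the paper) that counts block transfers for a sequential scan and for an arbitrary access sequence when only a single block fits in memory:

```python
from math import ceil

def scan_ios(n: int, B: int) -> int:
    """I/Os needed to scan n contiguous elements with B elements per block:
    the scan touches ceil(n / B) consecutive blocks, one I/O each."""
    return ceil(n / B)

def single_block_ios(accesses: list, B: int) -> int:
    """I/Os for an access sequence when only one block fits in memory:
    a transfer is charged whenever the next access leaves the resident block."""
    ios, resident = 0, None
    for a in accesses:
        blk = a // B                # index of the block containing address a
        if blk != resident:
            ios += 1                # block not resident: one unit-cost I/O
            resident = blk
    return ios
```

In the DAM, scanning is thus about B times cheaper per element than jumping between distant blocks, which is why I/O-efficient algorithms strive for sequential access patterns.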
Modeling more than two levels of the memory hierarchy is rather challenging. The big problem is that very precise models (e.g., the ones defining individual parameters for each level of the memory hierarchy [14]) are often too complicated, making it hopelessly difficult to design and analyze algorithms. Other approaches, such as the hierarchical memory model (HMM) [1,2], model memory with variable access costs by assuming that the cost to access a memory address x is a non-decreasing function, f(x), of the address itself. However, this does not accurately represent modern caches.

Figure 1: On the right, a hierarchical model of modern computers, with multiple levels of permanent storage, lookup tables for virtual memory, and CPU caches; from top to bottom, the levels become smaller but faster. On the left, a sample access time is shown (not to scale), assuming each level contains a contiguous portion of the elements from the level above it. The access time increases as we move away from the elements located in the fastest memory, as a complicated function of the distance: for caches, it increases dramatically when we hit an element stored in the next (slower) level; for mechanical devices (tapes, hard disks), it increases rather smoothly due to the mechanical processes involved.
The most successful attempt at analyzing cache misses in multi-level cache hierarchies is probably the cache-oblivious framework [9]. It sidesteps the complexity of modeling memory hierarchies by avoiding it altogether: the algorithms are designed "obliviously to M and B", i.e., in the classical (single-level) RAM model. If such an algorithm, analyzed in the two-level DAM with the best cache-management policy (also known as the ideal-cache model), happens to be efficient in terms of cache misses with respect to the two levels of memory, then it is also efficient for all levels of a multi-level memory hierarchy. Moreover, it has been shown [12] that such algorithms are also efficient under many reasonable cache-management policies, e.g., the least-recently-used (LRU) policy typically implemented in hardware. However, up to now, cache-oblivious algorithms have not been shown to optimize anything beyond cache misses.
Capturing locality of reference. In a real hardware, cache utilization is only one aspect that affects program runtime. For example, Jurkiewicz and Mehlhorn [10] show that the time it takes to perform address translations for virtual memory noticeably affects the runtime of programs on real hardware. Figure 1 illustrates the complicated ways various hardware features affect the memory access time. It shows a memory hierarchy, where cheaper and larger but slower memories are placed at the top. The functions that describe the access times of the components may have different behaviors, e.g., for a magnetic tape the access time is basically a linear function of the physical distance, whereas for hard disks it is a more complicated function.
Locality of reference is a fundamental principle of computing that heavily impacts both hardware and algorithm design [13]. In his widely cited article [8], Peter Denning provides an overview of the history of the concept and how it became very popular in almost all aspects of computing. To quote him, "The locality was adopted as an idea almost immediately by operating systems, database, and hardware architects." DAM algorithms and cache-oblivious algorithms try to capture spatial locality and temporal locality, the two fundamental components of the notion of "locality of reference", from an algorithmic point of view. However, given the complexity of modern hardware, the main question is whether other, possibly more complex hardware features can be modeled simply enough to facilitate the design and analysis of algorithms. The approach of the HMM model [1,2], which models the cost of an access as a function of the memory address, is one way to keep the modeling complexity at bay. For instance, the authors of [1,2] show that if the cost of accessing address x is log(x), then sorting n elements can be done in O(n log n log log n) time. But is log(x) the correct function for modern (and ever-changing) hardware?
1.1 Our Results. We propose to pick up where the previous attempts have left off by following a holistic approach. We present the locality of reference (LoR) model, a computational model that looks at memory in a new way: the cost of a memory access is based on the proximity from prior accesses via what we call a locality function.
More specifically, we consider the machine as having an infinite memory with a linear address space, i.e., memory cells are numbered with the set of natural numbers. Let E = (e_1, ..., e_{|E|}) be the sequence of memory addresses accessed by an algorithm A while running on a given input. A simple locality function ℓ can define the cost of accessing address e_i as a function of the distance from the address e_{i−1} of the preceding access, for example, log(|e_i − e_{i−1}|), |e_i − e_{i−1}|, or any other arbitrary function of |e_i − e_{i−1}|.
A specific locality function can capture the (complicated) cost of accessing the data on the hardware running the algorithm. For example, in the classical random access memory (RAM) model, executing the sequence E simply takes O(|E|) time, so setting ℓ to a constant function captures the RAM model. As we show later, by setting ℓ to a logarithmic function, we can model the cost of the TLB in virtual memory translation.
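A small sketch of this per-access accounting (our own illustration; the names are ours): the same access sequence is charged differently under a constant and a logarithmic locality function.

```python
import math

def lor_cost(E, ell):
    """Memoryless LoR cost: sum of ell(|e_i - e_{i-1}|) over consecutive accesses."""
    return sum(ell(abs(b - a)) for a, b in zip(E, E[1:]))

E = [0, 1, 2, 3, 100]                                # mostly sequential, one far jump
ram_cost = lor_cost(E, lambda d: 1)                  # constant ell: the RAM model
tlb_cost = lor_cost(E, lambda d: math.log2(d + 1))   # logarithmic ell: far jumps cost more
```

Under the constant function every access costs the same, so the total is O(|E|); under the logarithmic function the single far jump dominates the three sequential steps.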
The goal of this paper is not to define locality functions for various models of computation. Instead, our results show that cache-oblivious algorithms go beyond minimizing the number of cache misses. In particular, we show that optimal cache-oblivious algorithms are locality-of-reference optimal, meaning, they are asymptotically optimal with respect to any choice of locality function, subject to some mild constraints (needed to ensure that these functions reward locality of reference). In Section 3 we present our result in a simplified setting, in which we focus on the algorithms that do not benefit from large cache sizes. In Section 4 we generalize the result to more general algorithms that do utilize the full cache.
1.2 Example application: optimality of van Emde Boas layout. We now demonstrate an example application of our results, which will appear as Theorem 3.1: namely, that the van Emde Boas layout (a layout for implicit static search trees that is optimal for cache-oblivious searching [9]) is also optimal for address translation on modern virtual memory architectures. The cost of address translation is non-negligible and has been observed to impact the performance of fundamental algorithms such as sorting and permuting in practice [10].
Let us review modern virtual memory design. Consider a machine that uses U bits for addressing memory. Virtual memory is implemented as a trie of degree 2^b, for some parameter b, where the translation process translates b bits at a time, starting from the most significant bits of the address. In the worst case, one translation necessitates U/b lookups, but this cost is often much lower thanks to TLB caching. In particular, if two addresses e_i and e_{i−1} share their kb most significant bits, then after translating the first address, the first k steps of the second translation are cached in the TLB. As a result, the cost function associated with virtual memory translation is essentially ℓ(|e_i − e_{i−1}|) = log_b(|e_i − e_{i−1}|). This function is clearly non-negative, non-decreasing, and concave, thus satisfying the requirements of Theorem 3.1.
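The following sketch (our illustration; U, b, and the function name are assumptions, not the paper's) counts how many trie levels a translation must walk when the levels covering the shared top bits are already cached:

```python
def tlb_walk_length(prev: int, cur: int, U: int, b: int) -> int:
    """Trie levels re-walked to translate `cur` right after translating `prev`,
    assuming the levels covering shared most-significant bits are TLB-cached."""
    total_levels = -(-U // b)            # ceil(U / b) levels in the trie
    diff = prev ^ cur
    if diff == 0:
        return 0                         # identical address: fully cached
    shared_top = U - diff.bit_length()   # number of shared most-significant bits
    return total_levels - shared_top // b  # uncached levels still to walk
```

For addresses at distance d, roughly log2(d) low bits differ, so the walk length behaves like ceil(log2(d)/b), i.e., logarithmically in the distance, matching the log_b-type cost quoted above.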
The van Emde Boas layout of an implicit complete binary search tree (for brevity, the vEB tree) is defined as follows. Given a complete binary search tree on n vertices of height h = log n, let T_0 be the top subtree of height h/2 and let T_1, ..., T_{√n} be the subtrees rooted at the children of the leaves of T_0. The vEB tree is then defined by placing the subtree T_0 in a contiguous portion of an array, immediately followed by T_1, T_2, ..., T_{√n}, with each subtree T_i laid out recursively. Search on a vEB tree is known to incur O(log_B n) cache misses, where B is the block size of the DAM model [9].
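The recursive layout can be generated directly. The sketch below (ours, not the paper's code) emits the heap indices (root = 1) of a complete tree of height h in vEB order; giving the top tree height h/2 rounded down is one common convention.

```python
def veb_order(h: int) -> list:
    """Heap indices (root = 1) of a complete binary tree of height h,
    listed in van Emde Boas layout order."""
    if h == 1:
        return [1]
    top_h = h // 2                  # height of the top subtree T0
    bot_h = h - top_h               # height of the bottom subtrees T1, T2, ...
    out = veb_order(top_h)          # T0 first, itself laid out recursively
    sub = veb_order(bot_h)          # layout of one bottom subtree, root = 1
    for leaf in range(1 << (top_h - 1), 1 << top_h):  # leaves of T0, left to right
        for r in (2 * leaf, 2 * leaf + 1):            # roots of the bottom subtrees
            for v in sub:
                d = v.bit_length() - 1                # depth of v inside its subtree
                out.append((r << d) | (v - (1 << d))) # v's index in the full tree
    return out
```

For h = 3 this yields [1, 2, 4, 5, 3, 6, 7]: the top tree (node 1) followed by the two height-2 bottom trees, each stored contiguously.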
Consider a machine equipped with virtual memory with parameters U = O(log n) and b = O(1). The root-to-leaf traversal of a vEB tree recursively traverses T_0, jumps by at most n − √n addresses from a leaf of T_0 to the root of the appropriate subtree T_i, and recursively traverses T_i. Thus, the cost Q^LoR_ℓ(n) of the traversal using the above locality function for virtual memory can be computed using the recurrence Q^LoR_ℓ(n) = 2Q^LoR_ℓ(√n) + O(log n), which solves to Q^LoR_ℓ(n) = O(log n log log n). Theorem 3.1 then implies that this is the asymptotically optimal LoR cost for virtual address translation with the above function ℓ.

Related work.
The closest work to our presentation here is the Hierarchical Memory Model [1]. In this model, accessing memory location x takes time f(x). This was extended to a blocked version in which accessing k consecutive memory locations takes time f(x) + k [2]. In particular, the case f(x) = log x was studied, and optimality was obtained for a number of problems. This model, through its use of the memory cost function f, bears a number of similarities to ours, and it is meant to represent a multi-level cache where the user manually controls the movement of data from slow to fast memory. However, while it captures temporal locality well, even the blocked version does not fully capture the idea of spatio-temporal locality of reference, where an access is fast because it was close to something accessed recently. Another model that proposed analyzing algorithm performance on a multi-level memory hierarchy is the Uniform Memory Hierarchy (UMH) model [4]. The UMH model is a multi-level variation of the DAM that simplifies analysis by assuming that the block size increases by a fixed factor at each level and that there is a uniform ratio between block and memory size at each level of the hierarchy. Unfortunately, this assumption is quite restrictive and does not hold in practice on modern hardware.

Preliminaries
2.1 Models of computation. Let P be a problem, I_P = {I_1, I_2, ...} be the set of valid instances (input sequences) for which problem P can be solved, and I^P_n = {I ∈ I_P : |I| = n} be the subset of instances of P with input size n. Let E(A, I) = (e_1, e_2, ...) be the sequence of accesses (reads and writes) to memory locations that arises from executing algorithm A on instance I, and let A_P = {A_1, A_2, ...} denote the set of all algorithms that correctly solve P, i.e., generate a correct output for every instance in I_P.
The general ideal-cache and LRU-cache models incorporate the memory size M and block size B when computing the cost of an execution sequence. The M/B blocks stored in internal memory make up the working set, and we define W^M_{M,B}(E, i) to be the working set after the i-th access of the execution sequence E in model M. In the ideal-cache model (and the DAM model), the evictions from the working set W^opt_{M,B} are selected such that the total cost of executing E is minimized [9], while in the LRU-cache model the evictions from the working set W^lru_{M,B} always remove the least recently used block. A more rigorous and formal definition of the working set, cache replacement policies, and the cache-oblivious and LRU costs is included in the full version of the paper. In Sections 3 and 4 we also define the cost Q^LoR_ℓ(E(A, I)) of an algorithm on instance I in our memoryless and general Locality of Reference (LoR) models with locality function ℓ.
Similarly, we define W^M_x(P, A, n) = max_{I ∈ I^P_n} Q^M_x(E(A, I)) as the worst-case cost of algorithm A on problem instances of size n for problem P in model M with parameters x.
Analysis of both cache-oblivious and cache-aware algorithms relies on additional constraints defining the relationship between the cache parameters M and B. For example, the analysis of cache-aware sorting [3] assumes M ≥ 2B, while the analyses of cache-oblivious sorting [9] and cache-oblivious sparse matrix dense vector (SpMV) multiplication [6] assume, respectively, M ≥ B^2 and M ≥ B^{1+ε} (the so-called tall-cache assumption). Therefore, let M_P(B) denote the set of all values of M (as a function of B) that satisfy such constraints for problem P.
To prove our results, we need to rigorously define algorithm optimality. Otherwise, as there are multiple parameters involved, the order of the quantifiers would be unclear and ambiguous.
Definition 2.1. A cache-oblivious algorithm A for problem P is asymptotically CO-optimal in the ideal-cache model with cache parameters B and M ∈ M_P(B) iff W^co_{M,B}(P, A, n) = O(min_{A' ∈ A_P} W^co_{M,B}(P, A', n)).

Implementing the cache replacement policy of the ideal-cache model requires knowledge of what an algorithm will do in the future. Instead, modern hardware caches implement an approximation of the LRU-cache model, each time evicting the block that was accessed least recently. While it is often easier to analyze the cache misses of an algorithm in the ideal-cache model, in this work we are able to work directly in the LRU-cache model.

Definition 2.2. A cache-oblivious algorithm A for problem P is asymptotically LRU-optimal in the LRU-cache model with cache parameters B and M ∈ M_P(B) iff W^lru_{M,B}(P, A, n) = O(min_{A' ∈ A_P} W^lru_{M,B}(P, A', n)).

For completeness, however, we also prove our results in the ideal-cache model, by utilizing the following well-known resource-augmentation result of Sleator and Tarjan [12] (which also applies to other reasonable cache-replacement policies): any LRU-optimal algorithm in the LRU-cache model with cache parameters B and M is CO-optimal in the ideal-cache model with cache parameters B and 2M.
The equivalence in optimality between the ideal-cache and LRU-cache models relies on the cache augmentation, and says nothing about asymptotic equivalence for the same M . However, this is not an issue for a large class of natural problems, which can be solved using memory-smooth cache-oblivious algorithms.

Definition 2.3. A cache-oblivious algorithm A is memory-smooth iff increasing the memory size M by a constant factor does not asymptotically change its execution cost, that is, Q^co_{M,B}(E(A, I)) = Θ(Q^co_{2M,B}(E(A, I))), where the Θ-notation is with respect to the size of instance I.
Finally, we define algorithm optimality in our LoR model.

Definition 2.4. Let L be a class of locality functions. Algorithm A for problem P is asymptotically LoR-optimal with respect to L iff, for every ℓ ∈ L, W^LoR_ℓ(P, A, n) = O(min_{A' ∈ A_P} W^LoR_ℓ(P, A', n)).

The notion of an LoR-optimal algorithm with respect to all possible functions would be very powerful, as such an algorithm would be asymptotically optimal on any computing device that rewards locality of reference. In this paper we come very close to achieving such optimality, requiring only a natural set of restrictions on the functions in L.

B-stable problems
To show the equivalence between LoR-optimal and CO-optimal algorithms, we must avoid pathological problems with worst-case behavior that varies dramatically with different instances of the problem for different block sizes.
We say that a problem is B-stable if, for any algorithm A that solves P, there is some "worst-case" instance I_w ∈ I^P_n that, for every B, has CO cost asymptotically no less than the optimal worst-case cost for that B over all instances. Formally:

Definition 2.5. Problem P is B-stable if, for any algorithm A ∈ A_P that solves P, there exists an instance I_w ∈ I^P_n such that, for every B, Q^co_B(E(A, I_w)) = Ω(min_{A' ∈ A_P} W^co_B(P, A', n)).

Intuitively, for any algorithm that solves a B-stable problem, there must be a single instance that, for all block sizes, has cost no less than the asymptotically worst-case optimal cost. A lemma implied by the definition of CO-optimality (Definition 2.1) then states that every algorithm must have an instance on which, for every B, it performs no better asymptotically than the CO-optimal algorithm. In the full version of the paper we prove the following lemma, which shows the existence of non-B-stable problems for which our main result (Theorem 3.1) does not hold; this justifies our classification and exclusion of these pathological cases.
There exists a problem P which is not B-stable and which has a CO-optimal algorithm which is not LoR-optimal.
Proof. Shifting the execution sequence may cause accesses that were in the same block to fall into two neighboring blocks, and accesses that were in two neighboring blocks to fall into the same block. Thus, the cost may grow or shrink by a factor of at most two.

Memoryless algorithms
We begin with the simplest case, namely, the memoryless cache model (MCM), where the internal memory is just a single block, i.e., M_P(B) = {B}. Note that in this case there is no need to differentiate between the LRU-cache model and the ideal-cache model, because the working sets in both cache models, after accessing e_i ∈ E(A, I), consist of the single block that contains e_i. Thus, the cost of algorithm A on instance I in the MCM becomes the number of accesses that fall outside the block of the preceding access, i.e., Q^co_B(E(A, I)) = |{i : {e_i}_B ≠ {e_{i−1}}_B}|. This cost rewards spatial locality. Hence, in the LoR model it is natural to define the locality function ℓ to measure the cost of executing the sequence E(A, I) as a function of the spatial distance |e_i − e_{i−1}| between accesses: Q^LoR_ℓ(E(A, I)) = Σ_{i=2}^{|E|} ℓ(|e_i − e_{i−1}|). Let L denote the set of all non-negative, non-decreasing, concave functions ℓ : N → R. Even though L encompasses a wide range of (arbitrarily complicated) functions, we will show that any cache-oblivious algorithm is LoR-optimal with respect to L if and only if it is CO-optimal in the MCM.
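A quick numeric sanity check (our own sketch; we take ℓ_B(d) = min(d/B, 1) as the per-block locality function, an assumption on our part) that the MCM cost and the LoR cost agree up to a constant factor on a sequential scan:

```python
def mcm_misses(E, B):
    """MCM cost: an access is a miss iff it falls outside the block of the
    previous access (the first access is charged as a miss)."""
    return 1 + sum(1 for a, b in zip(E, E[1:]) if a // B != b // B)

def lor_cost(E, ell):
    """Memoryless LoR cost: sum of ell(|e_i - e_{i-1}|)."""
    return sum(ell(abs(b - a)) for a, b in zip(E, E[1:]))

B = 8
ell_B = lambda d: min(d / B, 1)     # assumed form: fraction of a block, capped at 1
scan = list(range(64))              # touches 8 blocks; LoR charges 1/8 per step
```

On the scan, the MCM charges one miss per block touched while the LoR model charges 1/B per unit step, so the two totals differ only by a constant factor.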

Main result. Let ℓ_B(x) = min(x/B, 1): a jump of distance x costs the fraction of a block it spans, capped at the cost of a full cache miss. We begin by proving our result for this specific locality function ℓ_B ∈ L and generalize it to any ℓ ∈ L later.
Corollary 3.1. If a cache-oblivious algorithm A for problem P is asymptotically LoR-optimal with respect to L, then it is asymptotically CO-optimal in the MCM.
Proof. Since A is LoR-optimal, it is within a constant factor of optimal for all locality functions in L, including ℓ_B. The corollary then follows from Lemmas 3.1 and 2.4.
We now show that any locality function ℓ ∈ L can be represented as a linear combination of N functions ℓ_{β_1}, ..., ℓ_{β_N}: for every locality function ℓ ∈ L there exist non-negative constants α_1, α_2, ..., α_N and β_1, β_2, ..., β_N such that ℓ(x) = Θ(Σ_{i=1}^N α_i ℓ_{β_i}(x)). Setting β_i = i and choosing α_i proportional to the i-th second difference of ℓ, the sum telescopes to ℓ(x) for any integer x (the full derivation can be found in the full version of the paper). Since ℓ is non-negative and concave, all values of α_i and β_i are non-negative.
This yields the following lemma: for every locality function ℓ ∈ L there exists a set of non-negative constants α_1, α_2, ..., α_n and β_1, β_2, ..., β_n such that, for any execution sequence E, Q^LoR_ℓ(E) = Θ(Σ_{i=1}^n α_i Q^LoR_{ℓ_{β_i}}(E)).

Theorem 3.1. Let L be the set of all non-negative, non-decreasing, concave functions ℓ : N → R. Any cache-oblivious algorithm A that solves a B-stable problem P is LoR-optimal with respect to L if and only if it is CO-optimal in the memoryless cache model.
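The decomposition can be checked numerically. In this sketch (ours; again assuming ℓ_B(x) = min(x/B, 1)), the coefficients α_i = iγ_i and β_i = i are built from the second differences γ_i of ℓ, and the resulting combination reproduces a concave ℓ up to a small truncation error:

```python
import math

def ell_B(x, B):
    return min(x / B, 1.0)          # assumed per-block locality function

def decompose(ell, N):
    """alpha_i = i * gamma_i and beta_i = i, where gamma_i is the i-th
    second difference of ell (non-negative because ell is concave)."""
    gamma = [2 * ell(i) - ell(i - 1) - ell(i + 1) for i in range(1, N + 1)]
    alpha = [i * g for i, g in enumerate(gamma, start=1)]
    beta = list(range(1, N + 1))
    return alpha, beta

ell = lambda x: math.log2(x + 1)    # a concave, non-decreasing test function
alpha, beta = decompose(ell, 100_000)
combo = lambda x: sum(a * ell_B(x, b) for a, b in zip(alpha, beta))
# combo(x) tracks ell(x) closely for x far below the truncation point N
```

The approximation error comes only from truncating the sum at N, so it vanishes as N grows relative to x.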
Proof. The first direction follows from Corollary 3.1. To prove the other direction, consider the CO-optimal algorithm A_CO-Opt and an arbitrary algorithm A that solves P. Since P is B-stable, by Definition 2.5 there is a single instance I_w whose CO cost, for every B, is asymptotically no less than the worst-case cost W_B of A. Since, by Lemma 3.1, for any B the smoothed CO cost is equivalent to the LoR cost with the corresponding ℓ_B function, the resulting inequality holds for all B, and thus for all linear combinations of various ℓ_B. For any locality function ℓ in the set of valid locality functions L, consider the constants α_1, α_2, ..., α_n and β_1, β_2, ..., β_n given by Lemma 3.1; we use the β's as the B values and the α's as the coefficients in the linear combination. I_w is a single instance of P, so it cannot have a greater total cost than the single instance that maximizes the cost, and moving the max outside the summation can only decrease the left-hand side of the inequality. Using Corollary 3.1, the resulting bound relates the worst-case LoR cost of A_CO-Opt to that of A. Since A is an arbitrary algorithm solving P, this applies to all A ∈ A_P. By the definition of the worst-case LoR cost and Definition 2.4, A_CO-Opt is asymptotically LoR-optimal.

General models for algorithms with memory
In the previous section, we considered execution sequences that did not utilize more than one block of memory, so locality to anything other than the previously accessed memory location was irrelevant. We now generalize the model and apply it to execution sequences without this restriction. This requires that we consider the size and contents of internal memory when computing the cost of an access.

General LoR model
To capture the concept of the working set for algorithms that use internal memory, we define bidimensional locality functions that compute the LoR cost based on two dimensions: distance and time. A bidimensional locality function ℓ(d, δ) represents the cost of a jump from a source element s to a target element t, where d and δ are, respectively, the spatial and temporal distance between s and t. This captures the concept of the working set by using "time" to determine whether the source element is in memory or not: if the source is temporally close (was accessed recently) and spatially close to the target t, the resulting locality cost of the jump is small.

Details of the bidimensional locality functions. Let L_B denote the set of bidimensional locality functions that we consider. An element of this set is a function of the form ℓ(d, δ) = max(f(d), g(δ)), where f(d) is non-negative, non-decreasing, and concave, while g(δ) is a 0-1 threshold function, i.e., g(δ) = 0 for δ ≤ x and g(δ) = 1 for δ > x, for some threshold value x. For any k ≤ i, the bidimensional locality cost of a jump from source element e_k to target e_i in the sequence E is ℓ(|e_i − e_k|, t(E, i) − t(E, k)), where t(E, i) is the time of the i-th access. For simplicity of notation, we define δ_t(E, k, i) = t(E, i) − t(E, k) to be the temporal distance between the i-th access (e_i) and the k-th access (e_k). Intuitively, we can think of δ_t(E, k, i) as the time elapsed from access e_k to the "present" when accessing e_i. In addition, we require that the functions cannot be more "sensitive" to temporal locality than to spatial locality, i.e., for any locality function ℓ(d, δ) = max(f(d), g(δ)), we have ∀x [f(x) ≥ g(x)]. This corresponds to the tall-cache assumption M ≥ B^2, which is typically used in the analysis of cache-oblivious algorithms [7,11]. Therefore, we restrict the machine parameters M and B to all values B ≥ 1 and M_P(B) = {M : M ≥ B^2}.
A more in-depth discussion of the tall-cache assumption and how it relates to the LoR model can be found in the full version of the paper. We base our definition of time on the amount of change that occurs in the working set. For example, if an access causes a block of B elements to be evicted, we say that time increases by 1. Thus, time depends on the locality function ℓ, and we define the time of the i-th access of E, for the given locality function ℓ, to be t_ℓ(E, i) = Σ_{k=1}^{i−1} Q^LoR_ℓ(E, k). That is, the time of access e_i ∈ E is simply the sum of the costs of all accesses prior to e_i in the sequence E. We note that the time after the last access of E is the total LoR cost (i.e., Q^LoR_ℓ(E) = t_ℓ(E, |E| + 1)).
Unlike the memoryless LoR cost, we cannot simply compute the cost of access e_i using the distance from the previous access e_{i−1}, since any of the prior accesses may be in the working set when accessing e_i. Furthermore, since we no longer consider only non-decreasing execution sequences, when accessing e_i there may be accesses on both the left and the right that could be in the same block as e_i. Therefore, computing Q^LoR_ℓ(E, i) using the locality function from a single source is insufficient to capture the idea of the working set; a detailed example showing why this is the case is included in the full version of the paper. We define the general LoR cost of access e_i ∈ E as the minimum of the bidimensional locality costs of the jumps from e_L and from e_R, the previous accesses with the minimum locality function cost to the left and to the right of e_i, respectively. Intuitively, the LoR cost of access e_i ∈ E is computed from the minimum-cost jumps from both the left side and the right side of e_i. We note that this generalizes the LoR cost definition of the memoryless setting (Section 3), as the locality function from source e_R always evaluates to 1 for non-decreasing access sequences. This formulation has the added benefit of letting us easily visualize an execution sequence in a graphical representation, illustrated in Figure 2. We consider the series of accesses in execution sequence E as points in a 2-dimensional plane: the point representing access e_i is plotted with x and y coordinates corresponding to the spatial position e_i and the temporal position t(E, i), respectively. The cost of access e_i is then computed from the LoR cost with sources e_L and e_R. We can visually determine which previous accesses correspond to e_L and e_R: if a previous access is outside the gray region (i.e., δ > M/B or d > B), the cost is 1; otherwise, it is simply d/B.
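One way to make this concrete is the following sketch (our own reading of the definition, not the paper's exact formula: we charge each access the cheapest jump over all earlier accesses, with time advancing by the cost paid):

```python
def general_lor_cost(E, f, g):
    """Sketch of the general LoR cost: access e_i pays the cheapest
    max(f(spatial distance), g(temporal distance)) over earlier accesses,
    where the time of an access is the total cost paid before it."""
    costs, times, t = [], [], 0.0
    for i, e in enumerate(E):
        if i == 0:
            c = 1.0                       # first access: a full miss
        else:
            c = min(max(f(abs(e - E[k])), g(t - times[k])) for k in range(i))
        costs.append(c)
        times.append(t)                   # t(E, i): time at this access
        t += c                            # time advances by the cost paid
    return sum(costs)

B, M = 8, 64
f = lambda d: min(d / B, 1.0)               # spatial part, as in the gray region
g = lambda dt: 0.0 if dt <= M / B else 1.0  # temporal 0-1 threshold at M/B
```

On a sequential scan this charges 1 for the first access and 1/B for each subsequent step, since the previous access is both spatially and temporally close.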

Equivalence to cache-oblivious cost
We note that this includes the cases where e_L and/or e_R do not exist, since we set e_L = −∞ and/or e_R = ∞, respectively, in such cases. Since both e_L and e_R are within distance B of e_i, the cache-oblivious cost of the access is equal to its LoR cost. Since the two costs are equivalent for any access e_i ∈ E, they are equivalent for any execution sequence E. Finally, since the cache-oblivious cost is computed assuming ideal cache replacement, and LRU cache replacement with twice the memory is 2-competitive with the ideal cache [12], the LoR cost is also asymptotically equivalent to the LRU cost. We can also prove a similar asymptotic equivalence between the LoR and ideal-cache models for the same M if we consider memory-smooth algorithms.

Main result
We now extend our result to any bidimensional locality function ℓ ∈ L_B.

Theorem 4.3. A cache-oblivious algorithm A for a B-stable problem P is LRU-optimal if and only if it is LoR-optimal with respect to L_B, where L_B is the set of all functions of the form ℓ(d, δ) = max(f(d), g(δ)) such that g(δ) is a 0-1 threshold function, f(x) ≥ g(x) for all x ≥ 0, and f is a non-negative, non-decreasing, concave function.
Proof. If algorithm A_LoR is LoR-optimal for all bidimensional locality functions, then it is optimal for the locality functions ℓ_{M,B}, for any M and B. By Theorem 4.2, it follows that A_LoR is LRU-optimal for any M and B.
To prove that LRU-optimal algorithms are also LoR-optimal, consider a problem P and an algorithm A_LRU that solves P with optimal LRU cost. By the definition of the worst-case cost W, and since P is B-stable, there is some instance I_w ∈ I^P_n for each A whose cost is asymptotically no less than the worst-case optimal cost. For a given bidimensional locality function ℓ, we define ℓ_k to be the k-th ℓ_{M,B} function that we use to represent it, i.e., ℓ_k(d, δ) = max(α_k ℓ_{β_k}(d), g(δ)). Instance I_w cannot result in greater cost than the instance that maximizes the total cost, and moving the max outside of the summation can only decrease the left-hand side of the inequality. The argument in the proof of Corollary 3.1 then applies, and, using our definition of the worst-case LoR cost, it follows that any LRU-optimal algorithm is also LoR-optimal.
A memory-smooth cache-oblivious algorithm A for a B-stable problem P is CO-optimal if and only if A is LoR-optimal with respect to L_B, where L_B is the set of all functions of the form ℓ(d, δ) = max(f(d), g(δ)) such that g(δ) is a 0-1 threshold function, f(x) ≥ g(x) for all x ≥ 0, and f is a non-negative, non-decreasing, concave function.
Proof. Since the cache-oblivious model assumes ideal cache replacement, for any execution sequence E, the CO cost is at most the LRU cost. Since algorithm A is memory-smooth, for any execution sequence E generated by A, Q^co_{M,B}(E) = Θ(Q^co_{2M,B}(E)). Since the LRU cost and the CO cost are therefore asymptotically equivalent for every execution sequence generated by A, A is asymptotically LRU-optimal if and only if it is asymptotically CO-optimal, and, by Theorem 4.3, A is LoR-optimal if and only if it is CO-optimal.
Despite the increasing complexity of modern hardware architectures, the goal of many design and optimization principles remains the same: maximize locality of reference. Even many of the optimization techniques used by modern compilers, such as branch prediction or loop unrolling [13], can be seen as methods of increasing spatial and/or temporal locality. As we demonstrated in this work, cache-oblivious algorithms do just that, suggesting that the performance benefits of such algorithms extend beyond what was originally envisioned.
That is to say, although we have introduced a new way to model computation via locality functions, we are not advocating algorithm design and analysis using locality functions. Instead, through our transformations, we have shown that creating the best possible algorithms in the existing cache-oblivious models is the right way to design algorithms not just for a multi-level cache, but for any locality-of-reference-rewarding system. One can thus conclude that the cache-oblivious model is better than we thought it was.

A Proof of the linear-combination lemma

Proof. Let γ_i = (ℓ(i) − ℓ(i − 1)) − (ℓ(i + 1) − ℓ(i)), and let α_i = iγ_i and β_i = i. Since ℓ is non-negative and concave, all γ_i values are non-negative and, consequently, all α_i and β_i values are also non-negative. Substituting these constants into Σ_i α_i ℓ_{β_i}(x) and simplifying each of the resulting terms, the sum telescopes to ℓ(x).

B Formal definitions of cache-oblivious and LRU cost
Analysis of cache-oblivious algorithms assumes M to be the size of internal memory, with M/B blocks being stored in internal memory at a given time, which we call the working set. The working set is made up of blocks of contiguous memory, each containing B elements. For a given block size B, we enumerate the blocks of memory by defining the block containing element e as {e}_B (the ⌊e/B⌋-th block). Formally, we define the working set after the i-th access of execution sequence E on a system with memory size M, block size B, and cache replacement policy P (formally defined below) as W^P_{M,B}(E, i). For simplicity of notation, we refer to the working set after the i-th access simply as W_i when the other parameters (M, B, P, and E) are unambiguous.
When we access an element e_i, if the block containing e_i is in the working set (i.e., {e_i}_B ∈ W_{i−1}), it is a cache hit and, in the cache-oblivious model, it has a cost of 0. However, if {e_i}_B is not in the working set, it is a cache miss, resulting in a cost of 1. On a cache miss, the accessed block {e_i}_B is loaded into memory, replacing an existing block, which is determined by the cache replacement policy. We define a general cache replacement policy as a function that selects the block of the working set to evict when a cache miss occurs, i.e., for memory size M and block size B, P_{M,B}(E, W, i) = {e_k}_B, where W is the working set, e_i and e_k are the i-th and k-th accesses in sequence E, respectively, k < i, and {e_k}_B ∈ W. For a given cache replacement policy and execution sequence E, we define the working set after access i ∈ E on a cache miss as W_i = (W_{i−1} ∖ {P_{M,B}(E, W_{i−1}, i)}) ∪ {{e_i}_B}, where P_{M,B}(E, W_{i−1}, i) defines the block to be evicted and {e_i}_B is the new block being added to the working set; on a cache hit, W_i = W_{i−1}.
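The working-set evolution just described can be phrased as a small simulator parameterized by an arbitrary replacement policy (an illustrative sketch; the function names and the policy signature are ours, not the paper's notation):

```python
def working_set_trace(E, M, B, policy):
    """Evolve the working set over access sequence E and count cache misses.

    policy(W, blocks, i) must return the resident block of W to evict
    when a miss occurs at position i (blocks is the full block sequence).
    """
    cap = M // B                      # at most M/B blocks are resident
    blocks = [e // B for e in E]      # {e}_B is the floor(e/B)-th block
    W, misses = set(), 0
    for i, b in enumerate(blocks):
        if b not in W:                # cache miss: cost 1
            misses += 1
            if len(W) == cap:
                W.remove(policy(W, blocks, i))
            W.add(b)                  # load the accessed block
        # on a cache hit (cost 0) the working set is unchanged
    return misses

def lru_policy(W, blocks, i):
    """Evict the resident block whose most recent access is oldest."""
    last = {}
    for j in range(i):
        if blocks[j] in W:
            last[blocks[j]] = j
    return min(W, key=last.get)
```

For example, working_set_trace([0, 1, 2, 3], M=2, B=1, policy=lru_policy) incurs a miss on every access, since only two blocks fit in memory at a time.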
Since a cache miss results in a cost of 1 and a cache hit has cost 0, the total cost of execution sequence E is simply the total number of cache misses incurred over E. For this work, we focus on the following cache replacement policies: the ideal cache replacement policy, which, for internal memory size M and block size B, minimizes the number of evictions (and cache misses) over execution sequence E and is equivalent to Belady's algorithm [5], which evicts the block {e_k}_B that is accessed the farthest in the future among all blocks in W; and the least-recently-used (LRU) policy, which evicts the block of W whose most recent access lies farthest in the past. We define W^opt_{M,B}(E, i) and W^lru_{M,B}(E, i) as the working sets after the i-th access of sequence E when using the ideal and LRU cache replacement policies, respectively. Thus, the cache-oblivious cost (using the ideal cache replacement policy) of performing the i-th access on a system with memory size M and block size B is Q^co_{M,B}(E, i), and the total cost for the entire execution sequence E is Q^co_{M,B}(E) = Σ_i Q^co_{M,B}(E, i). We similarly define the cost with the LRU cache replacement policy for a single access e_i and a total execution sequence E as Q^lru_{M,B}(E, i) and Q^lru_{M,B}(E), respectively.
Theorem B.1. For any execution sequence E, memory size M, and block size B, the number of cache misses using the LRU cache replacement policy with a memory twice the size (2M) is 2-competitive with the number of cache misses using the ideal cache replacement policy, i.e., Q^lru_{2M,B}(E) ≤ 2 · Q^co_{M,B}(E).

Proof. It follows from the work of Sleator and Tarjan [12].
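Theorem B.1 is easy to check empirically. The sketch below implements both LRU and Belady's farthest-in-future eviction directly from the definitions above (illustrative code of our own; the parameter choices are arbitrary):

```python
import random

def misses(E, M, B, ideal=False):
    """Cache misses over E; evicts LRU, or the farthest next use if ideal."""
    cap = M // B
    blocks = [e // B for e in E]
    W, last, count = set(), {}, 0
    for i, b in enumerate(blocks):
        if b not in W:
            count += 1
            if len(W) == cap:
                if ideal:
                    # Belady: evict the block accessed farthest in the future
                    def next_use(c):
                        for j in range(i + 1, len(blocks)):
                            if blocks[j] == c:
                                return j
                        return float('inf')   # never used again
                    W.remove(max(W, key=next_use))
                else:
                    # LRU: evict the least recently used block
                    W.remove(min(W, key=last.get))
            W.add(b)
        last[b] = i
    return count

random.seed(7)
E = [random.randrange(200) for _ in range(3000)]
M, B = 32, 4
# Theorem B.1: Q^lru_{2M,B}(E) <= 2 * Q^co_{M,B}(E)
assert misses(E, 2 * M, B) <= 2 * misses(E, M, B, ideal=True)
```

The second assertion holds for any sequence, by the theorem; random sequences merely exercise it.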

C On the tall cache assumption
Cache-oblivious algorithms are analyzed for memory size M and block size B, and the tall cache assumption simply states that M ≥ B². This assumption is required by many cache-obliviously optimal algorithms because they need at least B blocks to fit in internal memory at a time. It has been proven that without the tall cache assumption one cannot achieve cache-oblivious optimality for several fundamental problems, including matrix transposition [11] and comparison-based sorting [7]. Thus, we consider how this assumption is reflected in the LoR model, and whether we can gain insight into the underlying need for the tall cache assumption.
Recall that our class of bidimensional locality functions is of the form ℓ(d, δ) = max(f(d), g(δ)), where f is subadditive and g is a 0-1 threshold function. In Section 4.2 we define the locality function that corresponds to a memory system with memory size M and block size B to be ℓ_{M,B}(d, δ) = max(min(1, d/B), g(δ)), with the threshold of g at M/B. The restriction that f(x) ≥ g(x) for all x ≥ 0 implies that ℓ cannot be more "sensitive" to temporal locality than spatial locality. That is, the LoR cost when spatial and temporal distance are equal will be computed from the spatial distance (i.e., ℓ(d, δ) = f(d) if d ≥ δ). Additionally, this implies that ℓ(x, x) is subadditive. Intuitively, this tells us that, with the tall cache assumption, any algorithm that balances spatial and temporal locality of reference will not have performance limited by temporal locality. Many cache-obliviously optimal algorithms aim to balance spatial and temporal locality, thus requiring the tall cache assumption to achieve optimality.
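The connection can be made explicit with a short calculation (a sketch, under the assumption that f(d) = min(1, d/B) and that the 0-1 threshold of g sits at M/B): the requirement f(x) ≥ g(x) for all x ≥ 0 is exactly the tall cache assumption.

```latex
\begin{align*}
f(x) \ge g(x)\ \text{for all } x \ge 0
  &\iff \min\!\Big(1, \frac{x}{B}\Big) = 1 \ \text{whenever } x > \frac{M}{B} \\
  &\iff x \ge B \ \text{whenever } x > \frac{M}{B}
   \iff \frac{M}{B} \ge B
   \iff M \ge B^{2}.
\end{align*}
```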

D A single LoR source does not represent the working set
In this section, we show that computing the general LoR cost using only a single source (with the minimum cost) is insufficient to represent the working set. Specifically, we show the potential discrepancy between such a formulation of LoR cost and the smooth LRU cost. Informally, we exhibit two execution sequences with identical sequences of distances to the closest previously accessed object, yet all the accesses in the first sequence lie within one block, while in the second, (1/2) log B blocks are accessed. This shows that the distance to the closest previous access by itself cannot characterize the runtime.
We formally define this single-source definition of the LoR cost of accessing e_i as Q^LoR_ℓ(E, i) = min_{k=1}^{i−1} ℓ(|e_i − e_k|, δ_t(E, k, i)). To show the discrepancy between this formulation and the LRU cost, we consider the specific locality function ℓ_{M,B} that corresponds to the LRU cost for a specific memory size M and block size B. Thus, the single-source cost formulation does not generalize the LRU cost, while using two sources does (as we prove in Lemma 4.2).

E Necessity of B-stability
The following lemma shows that Theorem 3.1 would not hold if the restriction to B-stable problems were removed.
Lemma E.1. There exists a problem P that is not B-stable and that admits a CO-optimal algorithm that is not LoR-optimal.
Proof. Here we demonstrate a toy problem that meets the requirements of the lemma while also illustrating the unnaturalness of such problems. It has two candidate algorithms: one that has the same runtime on every instance, and a second that, for each instance, runs asymptotically faster than the first for some values of B and asymptotically slower for others. Thus, for each B the worst-case time of the first algorithm is better than that of the second, but there is no single bad instance for the second algorithm. Consider a problem P and a set A of two cache-oblivious algorithms A_1 and A_2. The problem, given an n, has a set of n instances I^n = {I^n_1, I^n_2, . . ., I^n_n}. The runtimes of the two algorithms are given as follows:

Q^co_{M,B}(E(A_1, I^n_i)) = Θ( min( (n log n log log n) / log i, (i · n log n log log n) / (B log i) ) )

Q^co_{M,B}(E(A_2, I^n_i)) = Θ( (n log n log log log n) / log B )

These runtimes can be realized through an appropriately twisted problem definition that forces an algorithm, for each instance, to read all elements in one of two sets of memory locations in order to be considered a valid algorithm. In particular, our problem admits two algorithms, one of which, A_2, can solve any instance by performing n log n log log log n reads in memory generated by n log log log n searches in a van Emde Boas search structure, and the other, A_1, by reading at memory locations generated by an arithmetic progression, where the step and number of locations depend on the instance.
Accessing k memory locations evenly spaced σ apart takes Θ(1 + min(k, kσ/B)) I/Os in the ideal-cache model; thus the desired runtime of algorithm A_1 on instance I^n_i can be forced by having A_1 read (n log n log log n)/log i memory locations evenly spaced i apart. What are the worst-case runtimes of these algorithms?
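The Θ(1 + min(k, kσ/B)) bound for strided access can be sanity-checked by counting distinct blocks directly (an illustrative sketch of our own; blocks are assumed aligned at multiples of B):

```python
def strided_ios(k, sigma, B):
    """Distinct B-element blocks touched by k accesses spaced sigma apart."""
    return len({(i * sigma) // B for i in range(k)})

# When sigma >= B, every access lands in a fresh block: Theta(k) I/Os.
assert strided_ios(100, 16, 8) == 100
# When sigma < B, consecutive accesses share blocks: Theta(k * sigma / B) I/Os.
assert strided_ios(100, 1, 8) == 13   # positions 0..99 span 13 blocks of size 8
```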
Thus the CO-optimal A_2 has a LoR runtime (with ℓ(d) = log d) of Θ(n log n log log n log log log n), which is a Θ(log log log n) factor worse than the non-CO-optimal A_1, whose LoR runtime is Θ(n log n log log n). Since A_2 is not optimal for one locality function, it cannot be optimal for all valid locality functions.
What made this problem not B-stable? It was the fact that every instance was constructed to be faster for one algorithm for some values of B and slower for others than the optimal worst-case algorithm. In this example, A_1 ran instance I^n_i slower than A_2 for B close to i and faster than A_2 for B far from i. However, it is far from natural to have an instance in effect encode faster-than-worst-case performance on selected values of B. In a standard data structure query, such as "what is the predecessor of a given item in an ordered set," the query item itself has nothing that, combined with the problem definition, allows a query to encode a preference for fast execution for certain values of B in a non-optimal algorithm. We note that this is very different from algorithms that may "hard-code" some instances and make them fast; this poses no problem with regards to B-stability, as it makes the instance fast for all values of B.