Recently, someone said to me that we need to “speed up that function”.
After some clarification, we agreed that we need to minimize the number of times
this function is called, since each call is very time-consuming.
After identifying the call sites, we came to the conclusion that major refactorings
are too expensive.
So, what can you do if you have a bottleneck function in your call stack and you cannot change the code at the call sites? By bottleneck, I mean a function that is time-consuming, space-consuming, or both.
For the upcoming discussion, I will use this gist for the code snippets.
The idea of memoization
In many cases, to save any resources in a software system, one has to trade off
Speed <-> Space
One such trade-off mechanism is called memoization. To illustrate what a memoizer does, consider this function:
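The gist's original snippet is not reproduced here; as a stand-in, imagine an expensive function like this one. The (int, string) signature is inferred from the calls `fooM(14, "bla")` used below; the body only simulates cost:

```cpp
#include <chrono>
#include <string>
#include <thread>

// A stand-in for the expensive function discussed in this post. The
// signature matches the calls fooM(14, "bla") used below; the sleep
// only simulates the real work.
std::string foo(int i, const std::string& s)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(5)); // simulate cost
    return s + "-" + std::to_string(i);
}
```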
A memoizer takes foo as input and returns a function with the same signature, which we will call fooM to indicate its memoized nature, and for which we will derive an implementation. How should fooM behave, compared to foo? From the discussion above, we require a speed-up (and possibly less memory usage) compared to calling foo. What about storing the calculated result of a call to fooM somehow, so that the next time fooM is called with the exact same arguments, we can return the cached result directly?
The next drawing depicts this idea, with the red color indicating an expensive call and the green color indicating a cheap call, in which a cached result was returned:
Starting with the root, the call to
fooM(14, "bla") is expensive, since it is the first call and no cached value can possibly exist. But when calling
fooM(14, "bla") again, then we can take the cached value from the previous call.
Looking at the left tree, what will happen when calling fooM(16, "bla")? The arguments are not the same as before (16 != 14), so we have to calculate the result, which is expensive. From now on, the result for fooM(16, "bla") is also available in the cache, like the one for fooM(14, "bla"). The exact same reasoning holds for the right tree, where the second argument differs ("blubb" != "bla").
At the end of this call tree, we end up having three cached values for these arguments:
From what we have seen so far, we can observe some points which might be helpful when implementing a memoizer:
- A memoizer is a higher-order function. It takes a function (like foo) and returns a function (like fooM) with the same signature.
- A memoized function keeps a function-local cache, since clients would not be amused if we polluted their space with global caching entities.
- A memoized function is a function with state. This means that if we memoize a function with side effects, the memoized version behaves differently. We will come back to this important point later.
- There should not be any restriction on the kind of functions we pass to the memoizer. This means we can pass free functions, lambdas, pointers to members, function objects (functors), etc. We will, however, forbid certain return and argument types. This will also be shown later.
Finally, to get an idea on how a client will use the memoizer:
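A sketch of that client-facing usage. The memoize developed in this post is fully generic; this stub is fixed to one signature just to make the call shape concrete:

```cpp
#include <map>
#include <string>
#include <tuple>

// Illustrative stand-in for the expensive function.
std::string foo(int i, const std::string& s) { return s + "-" + std::to_string(i); }

// Illustrative stub of memoize, fixed to foo's signature; the generic
// version is developed step by step in this post.
template <typename Fn>
auto memoize(Fn fn)
{
    return [fn, cache = std::map<std::tuple<int, std::string>, std::string>{}]
           (int i, const std::string& s) mutable {
        auto [it, missed] = cache.try_emplace(std::make_tuple(i, s));
        if (missed)
            it->second = fn(i, s); // first call with these arguments: compute
        return it->second;         // otherwise: served from the cache
    };
}

// Client code then reads:
//   auto fooM = memoize(foo);
//   fooM(14, "bla"); // expensive: computed and cached
//   fooM(14, "bla"); // cheap: cache hit
```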
The discussion so far focused on the client side and requirements of a memoizer.
Let us now start with the actual implementation.
First, we focus on how to design the cache and how the body of fooM should do its job (we will later examine how to create fooM in the first place):
The algorithm above represents a typical way of implementing a memoizer cache,
and you can view examples of it here and here.
In order for this post to be somewhat unique, we need to do something special 😉
Let’s imagine our much-valued project architect enters our office and says we need a special kind of cache for the memoizer. He argues that we have costly functions which are called several times in a row with the same arguments; after that, these arguments are never used again.
For example, if we know in advance that
fooM(14, "bla") will be called three times in a row, and after that never again with these arguments, then it does not make sense to keep this result in the cache. The cache would simply grow without a valid reason.
To deal with this kind of calling behavior, we could try to implement a so-called least recently used (LRU) cache. This kind of cache simply removes the oldest element when a new one is to be stored. That way, the memory footprint of the cache is bounded by its capacity, which is specified at compile time. Below is the algorithm for the LRU cache:
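The eviction idea can be sketched like this. The gist keeps iterators in a queue; this sketch keeps keys, which is simpler to show, and fixes the key/value types:

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <tuple>

using Key = std::tuple<int, std::string>;

// Sketch of the capacity-bounded cache: when full, the oldest entry is
// evicted before the new one is stored. Names are illustrative.
struct BoundedCache
{
    std::size_t capacity;
    std::map<Key, std::string> values;
    std::deque<Key> insertionOrder; // oldest key at the front

    void put(const Key& key, std::string value)
    {
        if (values.size() == capacity) {          // capacity reached:
            values.erase(insertionOrder.front()); // evict the oldest entry
            insertionOrder.pop_front();
        }
        values.emplace(key, std::move(value));
        insertionOrder.push_back(key);
    }
};
```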
The client of an LRU cache needs to specify the cache's capacity at compile time. One way of doing it would be:
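One hypothetical shape: the capacity travels as a non-type template parameter, so it is fixed at compile time. memoizeWithLRU is the convenience name this post introduces later; the body here is only a placeholder:

```cpp
#include <cstddef>

// Sketch: the capacity is a compile-time constant. The real wrapper is
// developed later in this post; this placeholder only shows the shape.
template <std::size_t Capacity, typename Fn>
auto memoizeWithLRU(Fn fn)
{
    static_assert(Capacity > 0, "an LRU cache needs a non-zero capacity");
    // ... wrap fn with an LRU cache of Capacity entries ...
    return fn; // placeholder so that the sketch compiles
}

// usage: auto fooM = memoizeWithLRU<4>(foo);
```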
Our architect is OK with this approach and says it would be great if we could support both cache types. This means we need to come up with a kind of plugin design, into which we can later introduce even more complex caches.
But for now, let us concentrate only on these two cache types. For both of them we can use a simple std::map as the storage container. The key type then represents the arguments to the memoized function, and the value type is the return type.
We can use a
std::tuple for storing the argument types, so for our
fooM function, the declaration of the
std::map container would be:
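For our foo(int, const std::string&) returning std::string, the arguments pack into a std::tuple, which serves as the map's key type:

```cpp
#include <map>
#include <string>
#include <tuple>

// The arguments of foo become the key, its result becomes the value.
using FooArguments = std::tuple<int, std::string>;
using FooCache     = std::map<FooArguments, std::string>;
```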
We will later see how to do this with any function in a generic way. First, the two cache types need names and representations:
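A possible shape for the two cache tags (the gist's definitions may differ in detail); LRUCache carries the user-chosen capacity as a C++17 auto non-type template parameter:

```cpp
// Tag types selecting the cache behavior at compile time.
struct MapCache {};          // the default, unbounded std::map-based cache

template <auto CacheCapacity> // C++17: auto non-type template parameter
struct LRUCache {};          // capacity fixed at compile time by the user
```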
MapCache stands for the default cache. The LRUCache type depends on the cache capacity, which is provided by the user. The auto specifier for CacheCapacity is a new C++17 feature and simplifies the usage of non-type template parameters a bit.
LRUCache is the type interface for the user, but for the implementation we need to think of a way to implement the least recently used mechanism.
We could use a
std::queue for this purpose, where we store the latest cache value iterator at the end of the queue.
That way, we just need to pop the front element each time we want to get rid of the oldest value. Below is the emplaceInCache function for this idea:
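A sketch of the two emplaceInCache overloads described here and below. Types are fixed, and the queue stores keys instead of iterators to keep the sketch short; the gist's version is generic:

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <tuple>

using Key     = std::tuple<int, std::string>;
using Storage = std::map<Key, std::string>;

struct DefaultCache { Storage values; };

struct BoundedCache
{
    std::size_t capacity;
    Storage values;
    std::queue<Key> order; // oldest key at the front
};

// default cache: C++17 try_emplace leaves an existing entry untouched
void emplaceInCache(DefaultCache& cache, const Key& key, const std::string& value)
{
    cache.values.try_emplace(key, value);
}

// bounded cache: evict the oldest entry first when the capacity is reached
void emplaceInCache(BoundedCache& cache, const Key& key, const std::string& value)
{
    if (cache.values.size() == cache.capacity) {
        cache.values.erase(cache.order.front());
        cache.order.pop();
    }
    cache.values.try_emplace(key, value);
    cache.order.push(key);
}
```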
The emplaceInCache overload for the LRUCache checks whether the cache capacity is reached.
If so, it deletes the first entry from the queue and then adds the new key-value pair into the cache. This is done with the other overload in
line 41, where the C++17 try_emplace is used.
This overload is also used for the default cache type.
We can now also define the find functions for the two caches. With overloading we provide two implementations, where the one for the LRUCache is implemented in terms of the one for the default cache:
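A sketch of these overloads: both caches look values up in their underlying std::map, so the overloads share one helper, and a null pointer signals a cache miss. Names and fixed types are illustrative:

```cpp
#include <map>
#include <string>
#include <tuple>

using Key     = std::tuple<int, std::string>;
using Storage = std::map<Key, std::string>;

struct DefaultCache { Storage values; };
struct BoundedCache { Storage values; }; // LRU bookkeeping omitted here

// shared lookup; nullptr signals a cache miss
const std::string* findInStorage(const Storage& values, const Key& key)
{
    const auto it = values.find(key);
    return it != values.end() ? &it->second : nullptr;
}

const std::string* findInCache(const DefaultCache& cache, const Key& key)
{
    return findInStorage(cache.values, key);
}

const std::string* findInCache(const BoundedCache& cache, const Key& key)
{
    return findInStorage(cache.values, key); // same lookup, different cache type
}
```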
Implementation of the memoized function
With the cache logic in place, we can now implement the broader memoizer functionality.
As shown before, the memoizer takes a callable and returns a function with the same signature. So, without further ado here is the code for creating the memoized version from its original:
The createMemoizer function is parameterized with the Callable and with the Cache type. The purpose of the CacheReturn type as a template parameter will be explained later. For now, it just serves as the return type for the memoized function.
As you can see, we return a small lambda in which the callable and the cache are stored in its closure via the captures. The lambda takes the arguments from the user call and wraps them into a std::tuple in line 75. Then, in line 76, the cache is asked whether there is already a value for these arguments. If yes, we return the value directly; otherwise we call applyFunction, which calculates the value, stores it in the cache, and returns it:
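A sketch of applyFunction: compute the value, store it via the cache's emplace, and hand the freshly cached value back. The gist's version works on its generic cache types; this one works on a std::map directly:

```cpp
#include <map>
#include <string>
#include <tuple>

// Compute, store, and return the freshly cached value.
template <typename Cache, typename Callable, typename Key, typename... Args>
const auto& applyFunction(Cache& cache, Callable& callable, const Key& key,
                          const Args&... args)
{
    const auto it = cache.emplace(key, callable(args...)).first;
    return it->second; // the value now lives in the cache
}
```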
The code snippets presented so far sketch the overall design and algorithms. As is often the case, we were not aware of some nasty details during the design phase.
These details come to our mind most often while writing tests or playing with the code from
a client side perspective.
Some (nasty) details
I owe you the actual client API implementation of our memoizer. You may remember that the client calls auto fooM = memoize(foo) instead of createMemoizer, which would not work without explicit template parameters anyway. So let’s think about this API function first.
How about a little quiz?
Why are the following functions not good candidates for being memoized?
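The quiz candidates, reconstructed from the discussion that follows; the names and bodies are assumptions, and the gist's versions may differ:

```cpp
#include <memory>

void first(int i);                  // (1) no return value to cache

int second();                       // (2) no arguments at all

int& third(int i);                  // (3) returns a reference to non-const

std::unique_ptr<int> fourth(int i); // (4) move-only return type

int someGlobalInt = 0;
int fifth(int i)                    // (5) reads and writes a global:
{                                   //     a function with side effects
    someGlobalInt += i;
    return someGlobalInt;
}
```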
Here is my take on it:
- The first function cannot be memoized, since its return value is void. We cannot cache something which does not exist.
- The second function has a return value, but it lacks arguments. So it is a constant function and always returns the same result. Hence, it does not make sense to cache a value which will be returned for all calls.
- The third function returns a reference to non-const. This function is error-prone anyway, since returning a reference to a local will lead to undefined behavior. But more important for our memoizer is the fact that returning a non-const cache reference is a ticking time bomb, since the client could potentially change the cached value! This would lead to an inconsistent view of the cache for all other clients requesting this value. So third must not be memoized.
- The fourth function returns a unique_ptr, which is a type that is not copy-constructible. That means we cannot store a copy of it in the cache. Therefore, I vote against memoizing fourth. What is your take on this? Please drop a comment.
- The fifth function reads someGlobalInt and changes it in its body, so it is a function with side effects. My advice is not to memoize such functions, since that will likely change the semantics of your software: returning a cached value skips the side effects.
These rules can be enforced by the type system for the first four functions:
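A sketch of those compile-time checks. In the gist they operate on traits like callable_arguments_t; here the return and argument types are passed in directly to keep the sketch self-contained:

```cpp
#include <type_traits>

// Enforce the memoization rules at compile time with readable messages.
template <typename ReturnType, typename... Args>
void assertMemoizable()
{
    static_assert(!std::is_void_v<ReturnType>,
                  "void functions cannot be memoized - there is nothing to cache");
    static_assert(sizeof...(Args) > 0,
                  "functions without arguments are constants - memoizing is pointless");
    static_assert(!(std::is_reference_v<ReturnType> &&
                    !std::is_const_v<std::remove_reference_t<ReturnType>>),
                  "returning a reference to non-const would expose the cache");
    static_assert(std::is_copy_constructible_v<std::decay_t<ReturnType>>,
                  "the return type must be copyable to be stored in the cache");
}
```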
I won’t torture you with boring template juggling, but the lines around line 110 are worth a second look. Helpers like callable_arguments_t are there to derive the return and argument types from a callable. If you are interested in the details, jump over to this blog post. Overall, these asserts help clients catch errors early on, with readable messages.
There is another detail which might bite us if not considered carefully.
The default cache type uses a std::map under the hood, and the lifetime of its values is bound to the lifetime of its memoized function. That means we can safely return a reference-to-const, since these references are always valid.
But what about the
LRUCache? Here, the lifetime of values is not bound to the lifetime of the memoized function, since we can delete values from the cache. If we were to return a reference to clients here, then these would at some point become dangling. That is the reason we can only return copies of the return type.
To detect the correct return type for the two caches we use these templates:
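A sketch of this return-type adaption, with MapCache/LRUCache as introduced above. The primary template serves the default cache, the partial specialization the LRU cache:

```cpp
#include <type_traits>

struct MapCache {};
template <auto CacheCapacity> struct LRUCache {};

// default cache: entries live as long as the memoized function,
// so a reference-to-const is safe to hand out
template <typename Cache, typename Result>
struct result_adaption
{
    using type = const std::remove_reference_t<Result>&;
};

// LRU cache: entries can be evicted, so hand out a copy instead
template <auto N, typename Result>
struct result_adaption<LRUCache<N>, Result>
{
    using type = std::remove_cv_t<std::remove_reference_t<Result>>;
};

template <typename Cache, typename Result>
using result_adaption_t = typename result_adaption<Cache, Result>::type;

static_assert(std::is_same_v<result_adaption_t<MapCache, int>, const int&>);
static_assert(std::is_same_v<result_adaption_t<LRUCache<4>, const int&>, int>);
```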
The primary template result_adaption is for the default cache type, where we always return a reference to const. The specialization is for the LRUCache and removes all const and reference qualifiers with std::remove_cv_t<std::remove_reference_t<T>>, such that only a copy is returned.
In line 125, we obtain the correct return type with result_adaption_t and pass it the cache type and the original return type of the non-memoized function.
The API functions
We’ve come a long way and can now finally define the missing API functions, which we will expose to our clients:
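A minimal working sketch of the two API functions, fixed to an (int, std::string) signature for brevity. The gist's version is fully generic and additionally runs the static_asserts and the return-type adaption discussed above:

```cpp
#include <map>
#include <string>
#include <tuple>
#include <utility>

struct MapCache {};
template <auto CacheCapacity> struct LRUCache {};

// nodiscard: discarding the memoized function would be pointless overhead.
template <typename CacheType = MapCache, typename Callable>
[[nodiscard]] auto memoize(Callable callable)
{
    // (static_asserts and return-type adaption elided in this sketch)
    return [callable,
            cache = std::map<std::tuple<int, std::string>, std::string>{}]
           (int i, const std::string& s) mutable -> const std::string& {
        const auto key = std::make_tuple(i, s);
        if (const auto hit = cache.find(key); hit != cache.end())
            return hit->second;
        return cache.emplace(key, callable(i, s)).first->second;
    };
}

// convenience wrapper: spares the client the longer memoize<LRUCache<n>>
template <auto CacheCapacity, typename Callable>
[[nodiscard]] auto memoizeWithLRU(Callable callable)
{
    return memoize<LRUCache<CacheCapacity>>(std::move(callable));
}
```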
So, we have two functions: the generic memoize and the convenience wrapper memoizeWithLRU. The latter is there to relieve the client of writing the longer memoize<LRUCache<n>>; it just forwards the callable to memoize. The nodiscard attribute is new in C++17, and it instructs the compiler to issue a warning in cases where our clients discard the return value of memoize. Such a call as a plain statement is not useful and just creates unnecessary overhead.
We already discussed most of what is inside memoize. First, we assert that the callable is permitted to be memoized at all. Then we adapt the return type for the different caches.
Finally, we create the cache and call
createMemoizer which will build the memoized function.
Before finishing this post, let us come back to the problem from the beginning. We needed a way to speed up a function without refactoring the call sites. Below is an example situation:
This is the situation our architect was referring to. We have three call sites, among them one in C.cpp, and they all call the expensive foo with the same arguments. A static analysis has shown that foo will never be called again with "bla". This means we could potentially save two expensive function executions and instead rely on our memoizer with the LRU cache.
Let’s say foo is declared in a header and defined in Lib.cpp. Then we can change the definition of foo so that it delegates to a memoized version.
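A sketch of the delegation trick, collapsed into one file. In the post, the declaration lives in a header, the definition in Lib.cpp, and fooM is created with memoizeWithLRU; here a static map stands in for the cache, and the counter only makes the saving observable:

```cpp
#include <map>
#include <string>
#include <tuple>

std::string foo(int i, const std::string& s);   // declaration: unchanged

namespace                                        // local to Lib.cpp
{
    int expensiveCalls = 0;                      // observability only

    std::string fooExpensive(int i, const std::string& s)
    {
        ++expensiveCalls;                        // stands in for the real work
        return s + std::to_string(i);
    }
}

std::string foo(int i, const std::string& s)     // definition: delegates
{
    static std::map<std::tuple<int, std::string>, std::string> cache;
    const auto key = std::make_tuple(i, s);
    if (const auto hit = cache.find(key); hit != cache.end())
        return hit->second;                      // cached: skip the real work
    return cache.emplace(key, fooExpensive(i, s)).first->second;
}
```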
That way, we do not touch any of the call sites, and we delegate all calls to foo to the memoized fooM, which is local to Lib.cpp.
Congratulations, you’ve made it that far 😉
With this presentation you will hopefully have more items in your toolbox to address some performance issues in your code base.
It is not a silver bullet, though: it is essential to identify the right use cases and the right callables. If you apply memoization blindly, you run the risk of either having a cache that grows too much, or of changing the semantics of your program when memoizing callables with side effects.
My hope is that I presented the concepts in a clear manner and you could draw your own conclusions or even come up with your own caching idea.