Recently, someone said to me that we need to “speed up that function”.
After some clarification, we agreed that we need to minimize the number of times
this function is called, since each call is very time-consuming.
After identifying the call sites, we came to the conclusion that major refactorings
are too expensive.
So, what can you do if you have a bottleneck function in your call stack and you cannot change the code at the call sites? By bottleneck, I mean a function that is time-consuming, space-consuming, or both.
For the upcoming discussion, I will use this gist for the code snippets.
The idea of memoization
In many cases, to save any resources in a software system, one has to trade off
Speed <-> Space
One such trade-off mechanism is called memoization. To illustrate what a memoizer does, consider this function:
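The gist's original snippet is not reproduced here; as a stand-in, imagine an expensive function like this one. The (int, string) signature is inferred from the calls `fooM(14, "bla")` used below; the body only simulates cost:

```cpp
#include <chrono>
#include <string>
#include <thread>

// A stand-in for the expensive function discussed in this post. The
// signature matches the calls fooM(14, "bla") used below; the sleep
// only simulates the real work.
std::string foo(int i, const std::string& s)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(5)); // simulate cost
    return s + "-" + std::to_string(i);
}
```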
A memoizer takes foo as input and returns a function with the same signature, which we will call fooM to indicate its memoized nature, and for which we will derive an implementation. How should fooM behave, compared to foo? From the discussion above, we require a speed-up (and possibly less memory usage) compared to calling foo. What about storing the calculated result of a call to fooM somehow, so that the next time fooM is called with the exact same arguments, we can return the cached result directly?
The next drawing depicts this idea, with the red color indicating an expensive call and the green color indicating a cheap call, in which a cached result was returned:
Starting with the root, the call to
fooM(14, "bla") is expensive, since it is the first call and no cached value can possibly exist. But when calling
fooM(14, "bla") again, then we can take the cached value from the previous call.
Looking at the left tree, what will happen when calling fooM(16, "bla")? The arguments are not the same as before (16 != 14), so we have to calculate the result, which is expensive. From now on, the result for fooM(16, "bla") is also available in the cache, like the one for fooM(14, "bla"). The exact same reasoning holds for the right tree, where the second argument differs ("blubb" != "bla").
At the end of this call tree, we end up having three cached values for these arguments:
From what we have seen so far, we can observe some points which might be helpful when implementing a memoizer:
- A memoizer is a higher-order function. It takes a function (like foo) and returns a function (like fooM) with the same signature.
- A memoized function keeps a function-local cache, since clients would not be amused if we polluted their space with global caching entities.
- A memoized function is a function with state. This means that if we memoize a function with side effects, the memoized version behaves differently. We will come back to this important point later.
- There should not be any restriction on the kind of functions we pass to the memoizer. This means we can pass free functions, lambdas, pointers to members, function objects (functors), etc. We will, however, forbid certain return and argument types. This will also be shown later.
Finally, to get an idea on how a client will use the memoizer:
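A sketch of that client-facing usage. The memoize developed in this post is fully generic; this stub is fixed to one signature just to make the call shape concrete:

```cpp
#include <map>
#include <string>
#include <tuple>

// Illustrative stand-in for the expensive function.
std::string foo(int i, const std::string& s) { return s + "-" + std::to_string(i); }

// Illustrative stub of memoize, fixed to foo's signature; the generic
// version is developed step by step in this post.
template <typename Fn>
auto memoize(Fn fn)
{
    return [fn, cache = std::map<std::tuple<int, std::string>, std::string>{}]
           (int i, const std::string& s) mutable {
        auto [it, missed] = cache.try_emplace(std::make_tuple(i, s));
        if (missed)
            it->second = fn(i, s); // first call with these arguments: compute
        return it->second;         // otherwise: served from the cache
    };
}

// Client code then reads:
//   auto fooM = memoize(foo);
//   fooM(14, "bla"); // expensive: computed and cached
//   fooM(14, "bla"); // cheap: cache hit
```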
The discussion so far focused on the client side and requirements of a memoizer.
Let us now start with the actual implementation.
First, we focus on how to design the cache and how the body of fooM should do its job (we will later examine how to create fooM in the first place):
The algorithm above represents a typical way of implementing a memoizer cache,
and you can view examples of it here and here.
In order for this post to be somewhat unique, we need to do something special 😉
Let’s imagine our much-valued project architect enters our office and says we need a special kind of cache for the memoizer. He argues that we have costly functions which are called several times in a row with the same arguments; after that, these arguments are never used again.
For example, if we know in advance that
fooM(14, "bla") will be called three times in a row, and after that never again with these arguments, then it does not make sense to keep this result in the cache. The cache would simply grow without a valid reason.
To deal with this kind of calling behavior, we could try to implement a so-called least recently used (LRU) cache. This kind of cache simply removes the oldest element when a new one is to be stored. That way, the memory footprint of the cache is bounded by its capacity, which is specified at compile time. Below is the algorithm for the LRU cache:
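The eviction idea can be sketched like this. The gist keeps iterators in a queue; this sketch keeps keys, which is simpler to show, and fixes the key/value types:

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <tuple>

using Key = std::tuple<int, std::string>;

// Sketch of the capacity-bounded cache: when full, the oldest entry is
// evicted before the new one is stored. Names are illustrative.
struct BoundedCache
{
    std::size_t capacity;
    std::map<Key, std::string> values;
    std::deque<Key> insertionOrder; // oldest key at the front

    void put(const Key& key, std::string value)
    {
        if (values.size() == capacity) {          // capacity reached:
            values.erase(insertionOrder.front()); // evict the oldest entry
            insertionOrder.pop_front();
        }
        values.emplace(key, std::move(value));
        insertionOrder.push_back(key);
    }
};
```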
The client of an LRU cache needs to specify the cache's capacity at compile time. One way of doing it would be:
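One hypothetical shape: the capacity travels as a non-type template parameter, so it is fixed at compile time. memoizeWithLRU is the convenience name this post introduces later; the body here is only a placeholder:

```cpp
#include <cstddef>

// Sketch: the capacity is a compile-time constant. The real wrapper is
// developed later in this post; this placeholder only shows the shape.
template <std::size_t Capacity, typename Fn>
auto memoizeWithLRU(Fn fn)
{
    static_assert(Capacity > 0, "an LRU cache needs a non-zero capacity");
    // ... wrap fn with an LRU cache of Capacity entries ...
    return fn; // placeholder so that the sketch compiles
}

// usage: auto fooM = memoizeWithLRU<4>(foo);
```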
Our architect is OK with this approach and says it would be great if we could support both cache types. This means we need to come up with a kind of plugin design, into which we can later introduce even more complex caches.
But for now, let us concentrate only on these two cache types. For both of them we can use a simple std::map as the storage container. The key type then represents the arguments to the memoized function, and the value type is the return type.
We can use a
std::tuple for storing the argument types, so for our
fooM function, the declaration of the
std::map container would be:
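For our foo(int, const std::string&) returning std::string, the arguments pack into a std::tuple, which serves as the map's key type:

```cpp
#include <map>
#include <string>
#include <tuple>

// The arguments of foo become the key, its result becomes the value.
using FooArguments = std::tuple<int, std::string>;
using FooCache     = std::map<FooArguments, std::string>;
```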
We will later see how to do this with any function in a generic way. First, the two cache types need names and representations:
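A possible shape for the two cache tags (the gist's definitions may differ in detail); LRUCache carries the user-chosen capacity as a C++17 auto non-type template parameter:

```cpp
// Tag types selecting the cache behavior at compile time.
struct MapCache {};          // the default, unbounded std::map-based cache

template <auto CacheCapacity> // C++17: auto non-type template parameter
struct LRUCache {};          // capacity fixed at compile time by the user
```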
MapCache stands for the default cache. The LRUCache type depends on the cache capacity, which is provided by the user. The auto specifier for CacheCapacity is a new C++17 feature and simplifies the usage of non-type template parameters a bit.
LRUCache is the type interface for the user, but for the implementation we need to think of a way to implement the least recently used mechanism.
We could use a
std::queue for this purpose, where we store the latest cache value iterator at the end of the queue.
That way, we just need to pop the front element each time we want to get rid of the oldest value. Below is the emplaceInCache function for this idea:
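A sketch of the two emplaceInCache overloads described here and below. Types are fixed, and the queue stores keys instead of iterators to keep the sketch short; the gist's version is generic:

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <tuple>

using Key     = std::tuple<int, std::string>;
using Storage = std::map<Key, std::string>;

struct DefaultCache { Storage values; };

struct BoundedCache
{
    std::size_t capacity;
    Storage values;
    std::queue<Key> order; // oldest key at the front
};

// default cache: C++17 try_emplace leaves an existing entry untouched
void emplaceInCache(DefaultCache& cache, const Key& key, const std::string& value)
{
    cache.values.try_emplace(key, value);
}

// bounded cache: evict the oldest entry first when the capacity is reached
void emplaceInCache(BoundedCache& cache, const Key& key, const std::string& value)
{
    if (cache.values.size() == cache.capacity) {
        cache.values.erase(cache.order.front());
        cache.order.pop();
    }
    cache.values.try_emplace(key, value);
    cache.order.push(key);
}
```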
The emplaceInCache overload for the LRUCache checks whether the cache capacity is reached.
If so, it deletes the first entry from the queue and then adds the new key-value pair into the cache. This is done with the other overload in
line 41, where the C++17 try_emplace is used.
This overload is also used for the default cache type.
We can now also define the find functions for the two caches. With overloading we provide two implementations, where the one for the LRUCache is implemented in terms of the one for the default cache:
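A sketch of these overloads: both caches look values up in their underlying std::map, so the overloads share one helper, and a null pointer signals a cache miss. Names and fixed types are illustrative:

```cpp
#include <map>
#include <string>
#include <tuple>

using Key     = std::tuple<int, std::string>;
using Storage = std::map<Key, std::string>;

struct DefaultCache { Storage values; };
struct BoundedCache { Storage values; }; // LRU bookkeeping omitted here

// shared lookup; nullptr signals a cache miss
const std::string* findInStorage(const Storage& values, const Key& key)
{
    const auto it = values.find(key);
    return it != values.end() ? &it->second : nullptr;
}

const std::string* findInCache(const DefaultCache& cache, const Key& key)
{
    return findInStorage(cache.values, key);
}

const std::string* findInCache(const BoundedCache& cache, const Key& key)
{
    return findInStorage(cache.values, key); // same lookup, different cache type
}
```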
Implementation of the memoized function
With the cache logic in place, we can now implement the broader memoizer functionality.
As shown before, the memoizer takes a callable and returns a function with the same signature. So, without further ado here is the code for creating the memoized version from its original:
The createMemoizer function is parameterized with the Callable and with the Cache type. The purpose of the CacheReturn type as a template parameter will be explained later. For now, it just serves as the return type for the memoized function.
As you can see, we return a small lambda in which the callable and the cache are stored in its closure via the captures. The lambda takes the arguments from the user call and wraps them into a std::tuple in line 75. Then, in line 76, the cache is asked whether there is already a value for these arguments. If yes, we return the value directly; otherwise we call applyFunction, which calculates the value, stores it in the cache, and returns it:
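A sketch of applyFunction: compute the value, store it via the cache's emplace, and hand the freshly cached value back. The gist's version works on its generic cache types; this one works on a std::map directly:

```cpp
#include <map>
#include <string>
#include <tuple>

// Compute, store, and return the freshly cached value.
template <typename Cache, typename Callable, typename Key, typename... Args>
const auto& applyFunction(Cache& cache, Callable& callable, const Key& key,
                          const Args&... args)
{
    const auto it = cache.emplace(key, callable(args...)).first;
    return it->second; // the value now lives in the cache
}
```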
The code snippets presented so far sketch the overall design and algorithms. As is often the case, we were not aware of some nasty details during the design phase.
These details come to our mind most often while writing tests or playing with the code from
a client side perspective.
Some (nasty) details
I owe you the actual client API implementation of our memoizer. You may remember that the client calls auto fooM = memoize(foo) instead of createMemoizer, which would not work without explicit template parameters anyway. So let’s think about this API function first.
How about a little quiz?
Why are the following functions not good candidates for being memoized?
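The quiz candidates, reconstructed from the discussion that follows; the names and bodies are assumptions, and the gist's versions may differ:

```cpp
#include <memory>

void first(int i);                  // (1) no return value to cache

int second();                       // (2) no arguments at all

int& third(int i);                  // (3) returns a reference to non-const

std::unique_ptr<int> fourth(int i); // (4) move-only return type

int someGlobalInt = 0;
int fifth(int i)                    // (5) reads and writes a global:
{                                   //     a function with side effects
    someGlobalInt += i;
    return someGlobalInt;
}
```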
Here is my take on it:
- The first function cannot be memoized, since its return value is void. We cannot cache something which does not exist.
- The second function has a return value, but it lacks arguments. So it is a constant function and always returns the same result. Hence, it does not make sense to cache a value which will be returned for all calls.
- The third function returns a reference to non-const. This function is error-prone anyway, since returning a reference to a local will lead to undefined behavior. But more important for our memoizer is the fact that returning a non-const cache reference is a ticking time bomb, since the client could potentially change the cached value! This would lead to an inconsistent view of the cache for all other clients requesting this value. So third must not be memoized.
- The fourth function returns a unique_ptr, which is a type that is not copy-constructible. That means we cannot store a copy of it in the cache. Therefore, I vote against memoizing fourth. What is your take on this? Please drop a comment.
- The fifth function reads someGlobalInt and changes it in its body, so it is a function with side effects. My advice is not to memoize such functions, since that will likely change the semantics of your software: returning a cached value skips the side effects.
These rules can be enforced by the type system for the first four functions:
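A sketch of those compile-time checks. In the gist they operate on traits like callable_arguments_t; here the return and argument types are passed in directly to keep the sketch self-contained:

```cpp
#include <type_traits>

// Enforce the memoization rules at compile time with readable messages.
template <typename ReturnType, typename... Args>
void assertMemoizable()
{
    static_assert(!std::is_void_v<ReturnType>,
                  "void functions cannot be memoized - there is nothing to cache");
    static_assert(sizeof...(Args) > 0,
                  "functions without arguments are constants - memoizing is pointless");
    static_assert(!(std::is_reference_v<ReturnType> &&
                    !std::is_const_v<std::remove_reference_t<ReturnType>>),
                  "returning a reference to non-const would expose the cache");
    static_assert(std::is_copy_constructible_v<std::decay_t<ReturnType>>,
                  "the return type must be copyable to be stored in the cache");
}
```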
I won’t torture you with boring template juggling, but the lines around line 110 are worth a second look. Helpers like callable_arguments_t are there to derive the return and argument types from a callable. If you are interested in the details, jump over to this blog post. Overall, these asserts help clients catch errors early on, with readable messages.
There is another detail which might bite us if not considered carefully.
The default cache type uses a std::map under the hood, and the lifetime of its values is bound to the lifetime of its memoized function. That means we can safely return a reference-to-const, since these references are always valid.
But what about the
LRUCache? Here, the lifetime of values is not bound to the lifetime of the memoized function, since we can delete values from the cache. If we were to return a reference to clients here, then these would at some point become dangling. That is the reason we can only return copies of the return type.
To detect the correct return type for the two caches we use these templates:
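A sketch of this return-type adaption, with MapCache/LRUCache as introduced above. The primary template serves the default cache, the partial specialization the LRU cache:

```cpp
#include <type_traits>

struct MapCache {};
template <auto CacheCapacity> struct LRUCache {};

// default cache: entries live as long as the memoized function,
// so a reference-to-const is safe to hand out
template <typename Cache, typename Result>
struct result_adaption
{
    using type = const std::remove_reference_t<Result>&;
};

// LRU cache: entries can be evicted, so hand out a copy instead
template <auto N, typename Result>
struct result_adaption<LRUCache<N>, Result>
{
    using type = std::remove_cv_t<std::remove_reference_t<Result>>;
};

template <typename Cache, typename Result>
using result_adaption_t = typename result_adaption<Cache, Result>::type;

static_assert(std::is_same_v<result_adaption_t<MapCache, int>, const int&>);
static_assert(std::is_same_v<result_adaption_t<LRUCache<4>, const int&>, int>);
```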
The primary template result_adaption is for the default cache type, where we always return a reference to const. The specialization is for the LRUCache and removes all const and reference qualifiers with std::remove_cv_t<std::remove_reference_t<T>>, such that only a copy is returned.
In line 125, we obtain the correct return type with result_adaption_t and pass it the cache type and the original return type of the non-memoized function.
The API functions
We’ve come a long way and can now finally define the missing API functions, which we will expose to our clients:
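A minimal working sketch of the two API functions, fixed to an (int, std::string) signature for brevity. The gist's version is fully generic and additionally runs the static_asserts and the return-type adaption discussed above:

```cpp
#include <map>
#include <string>
#include <tuple>
#include <utility>

struct MapCache {};
template <auto CacheCapacity> struct LRUCache {};

// nodiscard: discarding the memoized function would be pointless overhead.
template <typename CacheType = MapCache, typename Callable>
[[nodiscard]] auto memoize(Callable callable)
{
    // (static_asserts and return-type adaption elided in this sketch)
    return [callable,
            cache = std::map<std::tuple<int, std::string>, std::string>{}]
           (int i, const std::string& s) mutable -> const std::string& {
        const auto key = std::make_tuple(i, s);
        if (const auto hit = cache.find(key); hit != cache.end())
            return hit->second;
        return cache.emplace(key, callable(i, s)).first->second;
    };
}

// convenience wrapper: spares the client the longer memoize<LRUCache<n>>
template <auto CacheCapacity, typename Callable>
[[nodiscard]] auto memoizeWithLRU(Callable callable)
{
    return memoize<LRUCache<CacheCapacity>>(std::move(callable));
}
```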
So, we have two functions: the generic memoize and the convenience wrapper memoizeWithLRU. The latter is there to relieve the client of writing the longer memoize<LRUCache<n>>; it just forwards the callable to memoize. The nodiscard attribute is new in C++17, and it instructs the compiler to issue a warning in cases where our clients discard the return value of memoize. Such a call as a plain statement is not useful and just creates unnecessary overhead.
We already discussed most of what is inside memoize. First, we assert that the callable is permitted to be memoized at all. Then we adapt the return type for the different caches.
Finally, we create the cache and call
createMemoizer which will build the memoized function.
Before finishing this post, let us come back to the problem from the beginning. We needed a way to speed up a function without refactoring the call sites. Below is an example situation:
This is the situation our architect was referring to. We have three call sites, among them one in C.cpp, and they all call the expensive foo with the same arguments. A static analysis has shown that foo will never be called again with "bla". This means we could potentially save two expensive function executions and instead rely on our memoizer with the LRU cache.
Let’s say foo is declared in a header and defined in Lib.cpp. Then we can change the definition of foo so that it delegates to a memoized version.
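A sketch of the delegation trick, collapsed into one file. In the post, the declaration lives in a header, the definition in Lib.cpp, and fooM is created with memoizeWithLRU; here a static map stands in for the cache, and the counter only makes the saving observable:

```cpp
#include <map>
#include <string>
#include <tuple>

std::string foo(int i, const std::string& s);   // declaration: unchanged

namespace                                        // local to Lib.cpp
{
    int expensiveCalls = 0;                      // observability only

    std::string fooExpensive(int i, const std::string& s)
    {
        ++expensiveCalls;                        // stands in for the real work
        return s + std::to_string(i);
    }
}

std::string foo(int i, const std::string& s)     // definition: delegates
{
    static std::map<std::tuple<int, std::string>, std::string> cache;
    const auto key = std::make_tuple(i, s);
    if (const auto hit = cache.find(key); hit != cache.end())
        return hit->second;                      // cached: skip the real work
    return cache.emplace(key, fooExpensive(i, s)).first->second;
}
```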
That way, we do not touch any of the call sites, and we delegate all calls to foo to the memoized fooM, which is local to Lib.cpp.
Congratulations, you’ve made it that far 😉
With this presentation you will hopefully have more items in your toolbox to address some performance issues in your code base.
It is not a silver bullet, though: it is essential to identify the right use cases and the right callables. If you apply memoization blindly, you run the risk of either having a cache that grows too much, or of changing the semantics of your program when memoizing callables with side effects.
My hope is that I presented the concepts in a clear manner and you could draw your own conclusions or even come up with your own caching idea.