
Optimizing Shared Data

April 9, 2020.

The first rule of optimization is “don’t do it”. However, once you’ve made your code as clear as it can be, written all your unit tests and verified that the actual product works, you’re sometimes left with underwhelming performance.

When faced with a sluggish program, first you measure the slowness, then you home in on the problem, and finally you look for a solution. In this article, I’ll explain how I saw, measured, located and fixed the problem.

Reality check: I won’t be presenting the kind of amazing, low-level, well-researched, mathematically-proven technique of the likes of Daniel Lemire. For an example of that, see: “How fast can you parse JSON?”.

In this article, I won’t be counting CPU cycles. I’ll simply talk about the kind of simple optimizations an average software programmer can do as part of their routine work.

Seeing the Problem

The program I was working on was a filter for large text files organized as a tree, where the indentation of each line in the file represented the nesting level. And by large files, I mean files routinely holding 100 MB to 1 GB of log data. This program was written in C++.

The first, simple approach was to read the text into a wstring vector (std::vector<std::wstring>). This starting point was based on the well-known principles of choosing a container in C++. The first principle is:

Always use a std::vector.

(The second principle is “use a sorted vector” and the third is “are you really sure a vector won’t do?”)

It became obvious, when testing with such large files, that both reading and filtering such a large vector of strings was slow going. When it takes nothing more than a stopwatch to tell that the performance just isn’t there, you’re allowed to peek under the hood.
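For illustration, the naive version amounted to something like this. (This is a minimal sketch of my reconstruction; the name ReadAllLines is mine, not from the project.)

#include <fstream>
#include <string>
#include <vector>

// Naive approach: every line is copied into its own std::wstring,
// so each line costs a separate allocation, and the vector
// reallocates as it grows.
std::vector<std::wstring> ReadAllLines(const std::string& filename)
    {
    std::vector<std::wstring> lines;
    std::wifstream file(filename);
    std::wstring line;
    while (std::getline(file, line))
        lines.push_back(line);
    return lines;
    }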

Measuring the Problem

Fortunately, taking that peek is very easy in Visual Studio, as even the Community edition comes with a very decent CPU profiler. Under the “Debug” menu, the entry “Performance Profiler...” (shortcut: Alt+F2) opens a window with your default project target already selected for profiling. From there, just click the “Start” button and your program will run under the CPU profiler. Perform the slow operation, then quit the program. Measurement done!

Here is what the profiler window looks like while it is recording your program activities:

[Figure: the profiler window while recording]

Locating the Problem

The profiler will analyze the data and present you with a list of locations where your program is spending most of its time. By default, this list is organized by total CPU time, as follows:

[Figure: the profiler’s report, sorted by total CPU time]

While this gives you a bird’s eye view, it also means you see a lot of irrelevant data. You can easily find the problem area by looking for big jumps in the percentage of time spent. In this case, it turns out that a lot of time is spent copying text strings and allocating memory. The memory allocations come from the text strings and from resizing the vector they are kept in. There is also a significant amount of time spent in file I/O when reading the text. Now that we have a few clues, we’re in a good position to figure out how to solve the performance problem.

Borrowing a Few Tricks

A good starting point when optimizing is knowing what tricks others have used. One obvious thing to remember is that filtering doesn’t change the text; it only changes what is kept or not. That means we can borrow a few tips from functional languages:

  • Constants can be shared.
  • Avoid moving data.
  • Delegate memory management.

The trick is to make the text shareable, to take it from disk as directly as possible, and to delegate the management of the memory it resides in. So, we will do the following:

  1. Constants can be shared: read the data from disk in large buffers of non-modifiable data.
  2. Avoid moving data: scan these buffers for text lines and keep pointers directly into the buffer.
  3. Avoid moving data: share these buffers and text lines in the result of filtering.
  4. Delegate memory management: share these buffers when making copies of the text.

In short, we’ve gone from the code:

struct VectorOfWStringTextHolder
    {
    using Line = std::wstring;

    std::vector<Line> Lines;
    };

To the code:

struct VectorOfSharedBuffersTextHolder
    {
    using Buffer = std::vector<wchar_t>;
    using BufferPtr = std::shared_ptr<Buffer>;
    using Buffers = std::vector<BufferPtr>;

    Buffers TextBuffers;

    using Line = wchar_t *;

    std::vector<Line> Lines;
    };

The idea here is to use a vector of shared buffers so that the addresses of the buffers never change, and so that they can easily be shared between multiple instances.
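To make this concrete, here is a rough sketch of how such buffers could be filled and scanned, using the struct above. (The function is illustrative and skips details the real code must handle, such as a line straddling two buffers.)

#include <istream>
#include <memory>

// Sketch: fill fixed-size shared buffers from a stream and record
// a pointer to the start of each line. The buffers are never
// resized, so the Line pointers stay valid, and the shared_ptr
// lets filtered copies share the same memory. Handling of a line
// that straddles two buffers is omitted for brevity.
void ReadIntoBuffers(std::wistream& stream, VectorOfSharedBuffersTextHolder& holder)
    {
    constexpr size_t BufferSize = 1024 * 1024;
    while (stream)
        {
        auto buffer = std::make_shared<VectorOfSharedBuffersTextHolder::Buffer>(BufferSize);
        stream.read(buffer->data(), std::streamsize(buffer->size()));
        const size_t count = size_t(stream.gcount());
        if (count == 0)
            break;
        holder.TextBuffers.push_back(buffer);

        // Scan in place: each line is a pointer into the buffer,
        // with the newline replaced by a null terminator.
        wchar_t* lineStart = buffer->data();
        wchar_t* const bufferEnd = buffer->data() + count;
        for (wchar_t* p = lineStart; p != bufferEnd; ++p)
            {
            if (*p == L'\n')
                {
                *p = L'\0';
                holder.Lines.push_back(lineStart);
                lineStart = p + 1;
                }
            }
        }
    }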

Of course, these changes have made the code more complex. However, this complexity can be hidden from view, as the buffers can be kept private while only a vector of lines is exposed publicly. In this instance, the code was simplified to make sure the benchmarks measured what we wanted to measure.
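As a sketch of that encapsulation (the class name and members are illustrative, not the project’s actual code):

#include <memory>
#include <vector>

class FilteredText
    {
    public:
        using Line = const wchar_t*;

        // Callers only ever see the lines; the shared buffers that
        // own the text stay hidden.
        const std::vector<Line>& GetLines() const { return myLines; }

    private:
        using Buffer = std::vector<wchar_t>;
        using BufferPtr = std::shared_ptr<Buffer>;

        std::vector<BufferPtr> myBuffers;
        std::vector<Line> myLines;
    };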

The Results

Yes, you did read “benchmarks”! After all, there is no point in optimizing if you cannot verify the improvement. Usually, you would simply reuse the CPU profiler and see the new results. However, for the purposes of this article, I went further and extracted the gist of the old and new code and wrote a small benchmark program that loads a large file and makes partial copies of the text lines to mimic the filtering.

The advantage of writing an explicit benchmark is that you can put the timing measurements exactly where you want them, in order to measure only what you intend to. With CPU profiling, the whole program is measured and it is harder to extract the relevant data.
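A minimal version of such a targeted measurement can be written with std::chrono. The helper below is a sketch, and FilterLines is a hypothetical stand-in for the filtering step:

#include <chrono>

// Time only the code under test; setup and file I/O stay outside
// the measured region.
template <class Func>
double MeasureSeconds(Func&& operation)
    {
    const auto start = std::chrono::steady_clock::now();
    operation();
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    }

// Usage, with a hypothetical filtering function:
// const double seconds = MeasureSeconds([&] { FilterLines(holder); });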

Here are the results. (I excluded the I/O times, but reading in large buffers and parsing the buffers directly gave a similar gain.)

Data format                      Time
Vector of strings                13.28 s
Shared buffers of shared lines    1.2 s

That is more than a tenfold improvement in speed. In fact, the gain was even larger on bigger files, since re-allocating a very large vector is very expensive.

The Code

Below, I have provided the code for the various benchmarks, as well as another approach using std::deque. I also benchmarked a tree structure that I haven’t mentioned in this article. That change did not improve performance, but it avoided an unrelated problem: hitting the recursion depth limit and overflowing the stack when destroying deeply linked structures of shared pointers.

You can find the code on GitHub.

Note that the real use case for this was my text tree filtering project, also available on GitHub. The optimization is done in the TextTree, which holds its text data through a TextHolder. It also uses a similar trick to keep stable addresses for its tree nodes, exploiting the (often-overlooked) fact that a std::deque does not invalidate pointers to existing elements when appended to. The code is here.
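As a standalone illustration of that deque property (not code from the project):

#include <cassert>
#include <deque>

int main()
    {
    std::deque<int> nodes;
    nodes.push_back(1);
    const int* first = &nodes.front();

    // Appending to a deque never relocates existing elements, so
    // this pointer stays valid; with a vector it could dangle.
    for (int i = 0; i < 1000000; ++i)
        nodes.push_back(i);

    assert(first == &nodes.front());
    return 0;
    }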