Julien Jorge's Personal Website

Effortless Performance Improvements in C++

Wed Feb 22, 2023

C++ is a great language for writing performant programs by default, that’s why we love it. Or is it? In this series of blog posts we are going to explore the processing-time cost of typical C++ habits by implementing a non-trivial task in idiomatic C++. Then we will benchmark this program and improve its performance.

Optimization is a never-ending quest that can lead to huge sacrifices in terms of code readability. Readability is relative: code of a kind you see every day quickly becomes readable to you. So here by readability I mean “how hard it is for an experienced programmer to grasp the intent from the code”.

In these blog posts we will restrict ourselves to transformations that do not diminish the expressiveness of the existing interfaces, and we will stay within the scope of the standard library. We will see that even within these limits one can easily reduce processing times by a factor of two, and maybe three with some compromises.

As Donald Knuth put it in his famous article Structured Programming with go to Statements:

In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering.

[A good programmer] will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.

Consequently, every code sample you will see in the following parts will go through a benchmark, measuring the processing time for multiple input sizes. The performance of every code modification will then be compared with the previous implementation. All benchmarks are executed on my laptop, which has an Intel i7-8665U CPU and 8 GB of RAM; its caches are 256 KB of L1, 1 MB of L2, and 8 MB of L3. CPU scaling and turbo are disabled. All programs are compiled with GCC 12, at optimization level O3, and with LTO. The host system is Ubuntu 22.10.
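The benchmark harness itself is not the topic of this post, but as a rough idea of what a single measurement looks like, here is a minimal sketch using std::chrono. The service function is the one presented later in this post; the measure function name and the output format are made up for the illustration.

#include <chrono>
#include <cstdio>
#include <string>

std::string service(const std::string& in); // The function under test.

void measure(const std::string& input)
{
  const auto start = std::chrono::steady_clock::now();
  const std::string output = service(input);
  const auto end = std::chrono::steady_clock::now();

  // Elapsed wall-clock time for this single call.
  const auto elapsed
    = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

  std::printf
    ("%zu bytes in, %zu bytes out, %lld us.\n", input.size(), output.size(),
     static_cast<long long>(elapsed.count()));
}

The actual harness in the repository runs each input size many times and aggregates the results, but the idea is the same.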

Finally, a very important prerequisite before doing any optimization is to have working tests. This is often left implicit but it cannot be ignored: there’s no point in having a fast but broken algorithm. Even though we will not talk about the tests in the following sections, be assured that the actual implementation has many of them. You can check for yourself by browsing the repository of this benchmark.

The program

Our application will be a string processing service with the following features.

The input is a multi-line string made of records, one record per line. The output is a multi-line string made of processed records. The records and their associated actions are:

  • set:<key>:<value>: assign <value> to <key>, no output is emitted.
  • value:<key>: output a line with format value:<v> where <v> is the value previously assigned to <key>, or an empty string if there is none.
  • params:<csv>: output params:<r> where <r> is the string obtained by doing the following substitutions in <csv>:
    • ${param} -> 0
    • ${tag} -> 123
    • ${id} -> foobar
    • ${line_count} -> the number of lines emitted in the output so far.

If a record does not match any of the above rules then it is transferred verbatim to the output.
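To make these rules concrete, here is a small example; the keys and values are made up for the illustration. Given the following input:

set:color:blue
value:color
value:size
params:id=${id},n=${line_count}
hello:world

The service produces the following output:

value:blue
value:
params:id=foobar,n=2
hello:world

The "set" record emits nothing, the unknown key "size" yields an empty value, the "params" record sees two lines already emitted when it is processed, and the last record matches no rule so it is copied verbatim.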

We want to stay with very simple idiomatic C++ so we will start with the following code layout:

std::string service(const std::string& in)
{
  std::string result;
  std::uint64_t line_count = 0;
  std::istringstream iss(in);
  std::string line;

  while (std::getline(iss, line))
    {
      std::vector<std::string> tokens = tokenize(line);

      if (tokens[0] == "params") { /* ... */ }
      else if (tokens[0] == "set") { /* ... */ }
      else if (tokens[0] == "value") { /* ... */ }
      else
        {
          result += concatenate_tokens(tokens) + '\n';
          ++line_count;
        }
    }

  return result;
}

The entry point is a service function receiving the string to process and returning the resulting string. We turn the input string into an std::istringstream such that we can call std::getline to process the input line by line. Each line of the input is split on colons by a call to tokenize, which we have yet to define. Then we check the string in the first token and process the line accordingly. If the first token is unknown we restore the line in the result by concatenating the tokens, with colons, via a call to concatenate_tokens, which also has yet to be defined.

The tokenize function is implemented as follows. Again, very basic C++.

std::vector<std::string> tokenize(const std::string& s)
{
  std::vector<std::string> result;

  std::string::size_type from = 0;
  std::string::size_type colon = s.find(':');

  while (colon != std::string::npos)
    {
      result.push_back(s.substr(from, colon - from));
      from = colon + 1;
      colon = s.find(':', from);
    }

  result.push_back(s.substr(from));

  return result;
}

The function takes a string as its input. The scan begins on the first character of the string, then the input is searched for a colon. If one is found, the part between the place where the scan began and this colon is appended to a vector. The scan then moves to the character following the colon and another colon is searched for. These steps are repeated until no colon is found. Finally, the part between the last colon, excluded, and the end of the string is appended to the vector. The vector is returned at the end of the function.
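As a quick illustration of its behavior, here is a small usage sketch; it is not part of the program itself.

// A record containing colons is split into one token per part.
const std::vector<std::string> tokens = tokenize("set:color:blue");
// tokens == {"set", "color", "blue"}

// A line without any colon produces a single token.
const std::vector<std::string> rest = tokenize("hello world");
// rest == {"hello world"}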

Now that we have the tokens from the input string, we can implement the processing of the records. For a record of type "set" we proceed as follows:

std::map<std::string, std::string> entries;

/* … */

else if (tokens[0] == "set")
  entries[tokens[1]] = tokens[2];

/* … */

If the record is of type "set" we store in an std::map the value tokens[2], assigned to the key tokens[1].

For a record of type "value" we will use the map defined above:

/* … */

else if (tokens[0] == "value")
  {
    if (entries.find(tokens[1]) != entries.end())
      result += "value:" + entries[tokens[1]] + '\n';
    else
      result += "value:\n";

    ++line_count;
  }

/* … */

If the record is of type "value" then we check the previously inserted key-value pairs. If the key is known, then we append in the result a line made of the prefix “value:” followed by the value associated with the provided key. Otherwise, if the key is unknown, a line with a single “value:” is inserted. In both cases the counter of lines emitted in the output is incremented by one.

The last kind of records, of type "params", is processed as follows:

/* … */

if (tokens[0] == "params")
  {
    replace(tokens[1], "${param}", "0");
    replace(tokens[1], "${tag}", "123");
    replace(tokens[1], "${id}", "foobar");
    replace
      (tokens[1], "${line_count}", std::to_string(line_count).c_str());

    result += concatenate_tokens(tokens) + '\n';
    ++line_count;
  }

/* … */

If the record is of type "params" then the replacements are done in the second token with a replace function that we have yet to define. For the ${line_count} case, the counter of lines emitted in the output is converted into an std::string using a utility function provided by the standard library. Then we get the C-string from this std::string for the replacement. Once all replacements are done the tokens are concatenated with colons and the resulting string is appended as a new line in the output. Finally, the counter of lines emitted in the output is incremented by one.

We have implemented the processing of all record types, now we have to give the detail of the remaining utility functions. The first one is replace:

void replace(std::string& s, const char* from, const char* to)
{
  std::size_t pos = 0;

  while ((pos = s.find(from, pos)) != std::string::npos)
    {
      s.replace(pos, strlen(from), to);
      pos += strlen(to);
    }
}

The replace function receives an std::string by reference as well as two C-strings. The function places a cursor on the first character of the std::string, then searches for the first C-string, the token, from there. If the token is found, it is replaced in-place in the std::string. Then the cursor is moved to the first character following the replacement and the process is repeated.
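For instance, with the ${line_count} substitution from the "params" processing above, the function behaves like this (a usage sketch, the values are made up):

std::string csv = "n=${line_count},m=${line_count}";
replace(csv, "${line_count}", "42");
// csv == "n=42,m=42": every occurrence is replaced, and the search
// resumes just after the inserted text.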

Finally, the last utility function is concatenate_tokens which turns a vector of strings into a single string whose parts are the items from the vector separated by colons:

std::string concatenate_tokens(const std::vector<std::string>& tokens)
{
  std::string result;

  for (const std::string& token : tokens)
    if (result.empty())
      result = token;
    else
      result += ":" + token;

  return result;
}

The concatenate_tokens function loops over all strings from the vector. If the resulting string is empty it means that this is the first item in the vector, so we just copy it into the resulting string. Otherwise, we have already stored something in the resulting string, so we append a colon and then the item from the vector. The function returns the resulting string once all items from the vector have been processed.
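In other words, it undoes what tokenize did, which is exactly what we need to restore an unrecognized line verbatim. A small usage sketch:

const std::vector<std::string> tokens = tokenize("hello:world");
// tokens == {"hello", "world"}

const std::string line = concatenate_tokens(tokens);
// line == "hello:world", i.e. the original record.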

Is it efficient?

So what do you think, is it an efficient implementation? I expect some of you to be grinding your teeth at the code above. “Why did you <insert random bad code here>?”, you may be wondering. Interestingly, even though I would have naturally written things differently, the inefficiencies were not always where I expected. We’ll go into it in the next posts. For now, here are some measurements of this algorithm.

Input lines             Processing time   Throughput
10 (~200B)              108 µs            1 MB/s
100 (~2KB)              336 µs            5 MB/s
1000 (~20KB)            1 ms              14 MB/s
10’000 (~200KB)         13 ms             15 MB/s
100’000 (~2MB)          72 ms             29 MB/s
1’000’000 (~22MB)       812 ms            26 MB/s
10’000’000 (~225MB)     9 s               23 MB/s
100’000’000 (~2.6GB)    2 minutes         20 MB/s

Note that the measurements here are those of the main service function, where the input has already been copied into a string in memory. There is no IO in these metrics.

All in all, everything seems relatively fast: waiting 10 seconds to get a result is not so long. For a human. Depending on how frequently the program is run, it may or may not feel fast enough. If the program is a single piece in an otherwise long chain of services, or if it is executed repeatedly, we may want it to be a bit faster.

Another way to look at the efficiency of a program is to look at its throughput. While the interpretation of the processing time may be subjective, a throughput can be objectively compared. For example, we can compare the throughput we got on the largest file with the data rate of copying the largest input into another file. It is not the same processing, but it gives us a baseline for how fast such a file can be scanned.

Case in point, on the machine used for the above measurements, copying the largest input file with rsync --progress shows a data rate of 500 MB/s. If I copy it to another computer on my local network, it goes at 120 MB/s. Now if the target is a remote GCP instance, the transfer goes at 23 MB/s. Our implementation is as fast as sending a large file to a distant computer.

Looking at the throughput, it is also clear that the program peaks around a few megabytes of input, after which the throughput goes down. Here we are clearly hitting memory limits: all inputs up to 100’000 rows fit in the CPU caches, while the larger inputs need to access the (slower) RAM.

Note that this peak is not visible when considering the processing time only.

Anyway, our goal is not to get the most efficient implementation. Instead we want to understand where the time is spent, see if we can easily get better performance, and try to get better coding practices by default.

During this series of blog posts we are going to use the perf tool to extract the top bottlenecks in our code. By default the tool profiles the number of cycles spent in the program. Measurements will be done on the input with 10 million rows. Here is where the initial implementation spends its time:

  • 27% in std::map<std::string, std::string>::find.
    • Under which the top offender is std::string::operator<.
  • 16% in std::map<std::string, std::string>::operator[].
    • Again with std::string::operator< as the top offender.
  • 13% in tokenize.
    • Mostly in std::vector<std::string>::push_back.
  • 5% in replace.
    • With std::string::find as the top offender.
  • 5% in std::getline.

Optimizing a profile like this one is not obvious. Everything happens in code from the standard library, and I am not going to tweak its code.

The code from the standard library is as efficient as it can be, so here we must consider a very disappointing possibility: we are using it wrong. Let’s see if we can do something about it in part 2.