C++ is a great language for writing performant programs by default; that’s why we love it. Or is it? In this series of blog posts we are going to explore the processing-time implications of typical idiomatic C++, by implementing a non-trivial task the way it is usually done in C++. Then we will benchmark this program and improve its performance.
Optimization is a never-ending quest that can lead to huge sacrifices in terms of code readability. Readability is relative: when it is the kind of code you see every day, it becomes quite readable to you. So here, by readability, I mean “how hard it is for an experienced programmer to grasp the intent of the code”.
In these blog posts we will restrict ourselves to transformations that do not diminish the expressiveness of the existing interfaces, and we will stay within the scope of the standard library. We will see that even within these limits one can easily reduce processing times by a factor of two, and maybe three with some compromises.
As Donald Knuth put it in his famous article *Structured Programming with go to Statements*:

> In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering.

> [A good programmer] will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.
Consequently, every code sample you will see in the following parts will go through a benchmark measuring the processing time for multiple input sizes, and the performance of every code modification will be compared with the previous implementation. All benchmarks are executed on my laptop, which has an Intel i7-8665U CPU and 8 GB of RAM; the L1, L2, and L3 caches are 256 KB, 1 MB, and 8 MB respectively. CPU frequency scaling and turbo are disabled. All programs are compiled with GCC 12, at optimization level O3 and with LTO. The host system is Ubuntu 22.10.
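As a rough illustration of how such timings can be obtained with the standard library alone, here is a minimal sketch based on `std::chrono`. It is not the actual benchmark harness used for these posts; the `benchmark` function and the way the input is prepared are assumptions.

```c++
#include <chrono>
#include <iostream>
#include <string>

std::string service(const std::string& in); // the function under test

// Minimal sketch: time a single call to service() on an in-memory input.
// The actual benchmark repeats the measurement and aggregates the results.
void benchmark(const std::string& input)
{
  const auto start = std::chrono::steady_clock::now();
  const std::string output = service(input);
  const auto stop = std::chrono::steady_clock::now();

  const auto elapsed =
      std::chrono::duration_cast<std::chrono::microseconds>(stop - start);

  std::cout << input.size() << " bytes processed in " << elapsed.count()
            << " µs, producing " << output.size() << " bytes.\n";
}
```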
Finally, a very important prerequisite before doing any optimization is to have working tests. This is often left implicit but it cannot be ignored: there’s no point in having a fast but broken algorithm. Even though we will not talk about the tests in the following sections, be assured that the actual implementation has many of them. You can check for yourself by browsing the repository of this benchmark.
The program
Our application will be a string processing service with the following features.
The input is a multi-line string made of records, one record per line. The output is a multi-line string made of processed records. The records and their associated actions are:
- `set:<key>:<value>`: assign `<value>` to `<key>`; no output is emitted.
- `value:<key>`: output a line with format `value:<v>`, where `<v>` is the value previously assigned to `<key>`, or an empty string if there is none.
- `params:<csv>`: output `params:<r>`, where `<r>` is the string obtained by doing the following substitutions in `<csv>`:
  - `${param}` -> `0`
  - `${tag}` -> `123`
  - `${id}` -> `foobar`
  - `${line_count}` -> the number of lines emitted in the output so far.
If a record does not match any of the above rules then it is transferred verbatim to the output.
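For illustration, here is a small input derived from these rules, together with the output the service should produce for it:

```
set:name:Alice
value:name
value:unknown
params:a,${tag},${line_count}
other:record
```

```
value:Alice
value:
params:a,123,2
other:record
```

The `set` record emits nothing, the unknown key yields an empty value, `${line_count}` expands to 2 because two lines have been emitted at that point, and the last record is copied verbatim.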
We want to stay with very simple idiomatic C++ so we will start with the following code layout:
```c++
std::string service(const std::string& in)
{
  std::string result;
  std::uint64_t line_count = 0;

  std::istringstream iss(in);
  std::string line;

  // Process the input line by line, dispatching on the record type.
  while (std::getline(iss, line))
  {
    std::vector<std::string> tokens = tokenize(line);

    if (tokens[0] == "params") { /* ... */ }
    else if (tokens[0] == "set") { /* ... */ }
    else if (tokens[0] == "value") { /* ... */ }
    else
    {
      // Unknown record: restore it verbatim in the output.
      result += concatenate_tokens(tokens) + '\n';
      ++line_count;
    }
  }

  return result;
}
```
The entry point is a `service` function receiving the string to process and returning the resulting string. We turn the input string into an `std::istringstream` so that we can call `std::getline` to process the input line by line. Each line of the input is split on colons by a call to `tokenize`, which we have yet to define. Then we check the string in the first token and process the line accordingly. If the first token is unknown, we restore the line in the result by concatenating the tokens, with colons, via a call to `concatenate_tokens`, which also has yet to be defined.
The `tokenize` function is implemented as follows. Again, very basic C++.
```c++
std::vector<std::string> tokenize(const std::string& s)
{
  std::vector<std::string> result;

  std::string::size_type from = 0;
  std::string::size_type colon = s.find(':');

  while (colon != std::string::npos)
  {
    result.push_back(s.substr(from, colon - from));
    from = colon + 1;
    colon = s.find(':', from);
  }

  result.push_back(s.substr(from));

  return result;
}
```
The function takes a string as its input. The scan begins on the first character of the string, then a colon is searched for in the input. If one is found, the part between the place where the scan began and this colon is inserted at the end of a vector. Then the process moves to the character following the colon and another colon is searched for. These steps are repeated until no colon is found. Finally, the part between the last colon, excluded, and the end of the string is appended to the vector. The vector is returned at the end of the function.
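For example, under this definition (a hypothetical usage, not part of the service itself):

```c++
// Splitting a record on colons.
std::vector<std::string> tokens = tokenize("set:name:Alice");
// tokens == {"set", "name", "Alice"}

// A string without any colon yields a single token.
std::vector<std::string> single = tokenize("no colon here");
// single == {"no colon here"}
```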
Now that we have the tokens from the input string, we can implement the processing of the records. A record of type `"set"` is processed as follows:
```c++
std::map<std::string, std::string> entries;

/* ... */
else if (tokens[0] == "set")
  entries[tokens[1]] = tokens[2];
/* ... */
```
If the record is of type `"set"` we store the value `tokens[2]` in an `std::map`, under the key `tokens[1]`.
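In terms of the map, two successive `set` records on the same key amount to the following (hypothetical key and values):

```c++
// Effect of processing "set:name:Alice" and then "set:name:Bob":
entries["name"] = "Alice"; // inserts the pair ("name", "Alice")
entries["name"] = "Bob";   // overwrites the value of the existing key
```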
For a record of type `"value"` we use the map defined above:
```c++
/* ... */
else if (tokens[0] == "value")
{
  if (entries.find(tokens[1]) != entries.end())
    result += "value:" + entries[tokens[1]] + '\n';
  else
    result += "value:\n";

  ++line_count;
}
/* ... */
```
If the record is of type `"value"` then we check the previously inserted key-value pairs. If the key is known, we append to the result a line made of the prefix `value:` followed by the value associated with the provided key. Otherwise, if the key is unknown, a line with a single `value:` is inserted. In both cases the counter of lines emitted in the output is incremented by one.
The last kind of record, of type `"params"`, is processed as follows:
```c++
/* ... */
if (tokens[0] == "params")
{
  replace(tokens[1], "${param}", "0");
  replace(tokens[1], "${tag}", "123");
  replace(tokens[1], "${id}", "foobar");
  replace(tokens[1], "${line_count}", std::to_string(line_count).c_str());

  result += concatenate_tokens(tokens) + '\n';
  ++line_count;
}
/* ... */
```
If the record is of type `"params"` then the replacements are done in the second token with a `replace` function that we have yet to define. For the `${line_count}` case, the counter of lines emitted in the output is converted into an `std::string` using a utility function provided by the standard library, then we get the C-string from this `std::string` for the replacement. Once all replacements are done, the tokens are concatenated with colons and the resulting string is appended as a new line in the output. Finally, the counter of lines emitted in the output is incremented by one.
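Assuming `replace(s, from, to)` substitutes every occurrence of `from` in `s` with `to`, as described below, here is a hypothetical walk-through of a `params` record when two lines have already been emitted:

```c++
std::string csv = "a,${tag},${line_count}";

replace(csv, "${param}", "0");      // no match, csv is unchanged
replace(csv, "${tag}", "123");      // csv == "a,123,${line_count}"
replace(csv, "${id}", "foobar");    // no match, csv is unchanged
replace(csv, "${line_count}", "2"); // csv == "a,123,2"
```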
We have implemented the processing of all record types; now we have to give the details of the remaining utility functions. The first one is `replace`:
```c++
void replace(std::string& s, const char* from, const char* to)
{
  std::size_t pos = 0;

  while ((pos = s.find(from, pos)) != std::string::npos)
  {
    s.replace(pos, strlen(from), to);
    pos += strlen(to);
  }
}
```
The `replace` function receives an `std::string` by reference as well as two C-strings. The function places a cursor on the first character of the `std::string`, then searches for the first C-string, the token, from there. If the token is found, it is replaced in place in the `std::string`. Then the cursor is moved to the first character following the replacement and the process is repeated.
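Since the cursor jumps over the inserted text, the scan never re-examines a replacement, which also guarantees termination when `to` contains `from`. A hypothetical usage:

```c++
std::string s = "${id} and ${id} again";
replace(s, "${id}", "foobar");
// s == "foobar and foobar again"
```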
Finally, the last utility function is `concatenate_tokens`, which turns a vector of strings into a single string whose parts are the items from the vector separated by colons:
```c++
std::string concatenate_tokens(const std::vector<std::string>& tokens)
{
  std::string result;
  bool first = true;

  for (const std::string& token : tokens)
  {
    if (first)
    {
      result = token;
      first = false;
    }
    else
      result += ":" + token;
  }

  return result;
}
```
The `concatenate_tokens` function loops over all strings from the vector. The first item is copied as-is into the resulting string; every subsequent item is appended after a colon. A boolean flag tracks whether we are on the first item, so that an empty first token is still handled correctly. The function returns the resulting string once all items from the vector are processed.
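As a hypothetical sanity check, note that for a single line `concatenate_tokens` undoes `tokenize`:

```c++
#include <cassert>
#include <string>

void test_round_trip()
{
  // Concatenating the tokens of a line restores the original line.
  const std::string line = "params:a,b,c";
  assert(concatenate_tokens(tokenize(line)) == line);
}
```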
Is it efficient?
So what do you think, is it an efficient implementation? I expect some of you to be grinding your teeth at the code above. “Why did you <insert random bad code here>?”, you may be wondering. Interestingly, even though I would have naturally written things differently, the inefficiencies were not always where I expected. We’ll go into it in the next posts. For now, here are some measurements of this algorithm.
| Input lines | Processing time | Throughput |
|---|---|---|
| 10 (~200 B) | 108 µs | 1 MB/s |
| 100 (~2 KB) | 336 µs | 5 MB/s |
| 1’000 (~20 KB) | 1 ms | 14 MB/s |
| 10’000 (~200 KB) | 13 ms | 15 MB/s |
| 100’000 (~2 MB) | 72 ms | 29 MB/s |
| 1’000’000 (~22 MB) | 812 ms | 26 MB/s |
| 10’000’000 (~225 MB) | 9 s | 23 MB/s |
| 100’000’000 (~2.6 GB) | 2 minutes | 20 MB/s |
Note that the measurements here are those of the main `service` function, where the input has already been copied into a string in memory. There is no IO in these metrics.
All in all, everything seems relatively fast: waiting 10 seconds to get a result is not so long, for a human. Still, depending on how frequently the program is used it may or may not feel fast, and if the program is a single piece in an otherwise long chain of services, or if it is executed repeatedly, we may want it to be a bit faster.
Another way to look at the efficiency of a program is its throughput. While the interpretation of a processing time may be subjective, a throughput can be objectively compared. For example, we can compare the throughput we got on the largest input with the data rate of copying that input into another file. It is not the same processing, but it gives us a baseline for how fast such a file can be scanned.
Case in point: on the machine used for the above measurements, copying the largest input file with `rsync --progress` shows a data rate of 500 MB/s. If I copy it to another computer on my local network, it goes at 120 MB/s. Now if the target is a remote GCP instance, the transfer goes at 23 MB/s. Our implementation is as fast as sending a large file to a distant computer.
Looking at the throughput, it is also clear that the program peaks around a few megabytes of input, after which the throughput goes down. Here we are clearly hitting memory limits: the inputs up to 100'000 rows (~2 MB) fit in the CPU caches, while the larger ones must be fetched from the (slower) RAM.
Note that this peak is not visible when considering the processing time only.
Anyway, our goal is not to get the most efficient implementation. Instead we want to understand where the time is spent, see if we can easily get better performance, and try to get better coding practices by default.
During this series of blog posts we are going to use the `perf` tool to extract the top bottlenecks of our code. By default the tool profiles the number of cycles spent in the program. Measurements will be done on the input with 10 million rows.
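A profile of this kind can be collected along the following lines; the binary name is hypothetical, and the exact invocation may differ from the one used for these posts:

```sh
# Record cycle samples with call graphs, then browse the report.
perf record -g ./benchmark
perf report
```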
Here is where the initial implementation spends its time:

- 27% in `std::map<std::string, std::string>::find`.
  - Under which the top offender is `std::string::operator<`.
- 16% in `std::map<std::string, std::string>::operator[]`.
  - Again with `std::string::operator<` as the top offender.
- 13% in `tokenize`.
  - Mostly in `std::vector<std::string>::push_back`.
- 5% in `replace`.
  - With `std::string::find` as the top offender.
- 5% in `std::getline`.
Optimizing a profile like this one is not obvious: everything happens in code from the standard library, and I am not going to tweak that code. The code from the standard library is as efficient as it can be, so we must consider a very disappointing possibility: we are using it wrong. Let’s see if we can do something about it in part 2.