Julien Jorge's Personal Website

Code Review of toml++

Sun Sep 20, 2020

Here’s a new game: pick a random software library and review it.

A long time ago Mark Gillard posted in r/cpp a link to his C++ library to handle TOML, with a comment saying “As always, feedback welcome!”.

Hey, good thing! I don’t know anything about TOML but I know a few things about C++ and software design!

Moreover, I like doing code reviews; both as a reviewer and as a reviewee. Code reviews are great for the reader to learn new things, to think, to empathize (a.k.a “wtf is going on in this file?”). And for the reviewee it allows to get feedback, improve, and think.

Thinking is great. When you know that someone will check your code you think twice about its design.

So let’s see what it is about.

Mark has accepted to react to the points listed in this article. His comments are prefixed by MG. Thank you very much Mark for accepting to play this game :)

TOML, the language

TOML stands for “Tom’s Obvious Minimal Language”. It is a language for configuration files, similar to the popular .ini files you often find on Windows, but with an actual specification.

It is mostly made of your typical key = value format, where the value can be an integer, a float, a string, an array, a table… or a date (with time).

At first I was quite surprised to see a date in the basic value types, and actually I still I don’t get it. In my opinion it is already too high level for a basic type. I mean, if date is supported by default, why not other things like an IP, a URI, or the days of week? So I reached to Tom Preston-Werner, the author of the TOML specification, to get more insight on this design decision, and here is what he answered:

Literally.

Well, ok, Tom has certainly more interesting things to do. Hopefully Mark redirected me to this TOML issue which is exactly my point. Feel free to read it to have more details about the usefulness of this feature.

Now that we know a bit more about TOML, let’s review the implementation of the toml++ library.

The toml++ library

toml++ is a header-only C++17 library to parse TOML files, manipulate their content, and serialize them. It can work with and without exceptions, handles UTF-8 and can also serialize to Json. It has no dependency except the STL. It can also optionally be used as a single-header library.

I’m going to discuss the coding style, the architecture, and various implementation details. Obviously, since I have just discovered TOML with this exercise I may miss some points. Also the review concerns a subset of the library as I did not read everything in details, mostly because I have only one life (as far as I can tell).

The less controversial (·_·): the coding style

Overall the code is quite clear and easy to read. Nevertheless, some details make it harder than necessary to enter the project.

Header extensions

.h or .hpp, you should pick one. If I understand correctly, the author choose to use .h for actual headers and .hpp for implementations. I guess that he went this way for the single-header stuff. I would suggest to change the extensions for something less ambiguous.

In my opinion C++ headers should use the .hpp extension. Classic, unambiguous, widely accepted. For the implementation I would suggest to move the files in include/toml++/impl/ and/or to change the extension to either .cpp or .ipp.

MG: I prefer .h, personally. The only reason I use .hpp at all in the library is to indicate which is a public include, and which is ‘implementation’, though I agree that simply moving the implementation headers would be better. It’s something I plan to fix in a future break-all-the-things v3.0.

Tabs versus spaces

The author picked tabs and the alignment is wrong, as expected ;) I know, I know, some tools will parse the .editorconfig file and display the code correctly. Unfortunately not all tools do it, and actually most text editors or viewer do not display the file correctly.

As examples of some occurrences of weirdly formatted code: in toml_instantiations.hpp, check the backslashes at the end of the lines of #define TOML_INSTANTIATE(name, T). Also, in toml_preprocessor.h everything is askew.

If you’re not convinced yet, you can try in other tools too. See for example how it is displayed when reviewing a commit in GitHub.

It’s also unreadable when I display the file or a diff with git in my terminal.

My suggestion: use spaces for consistent formatting among tools, break the lines at 80 columns for safe visualization (seriously, I can’t even see the end of the lines in the above capture, and the window takes the full screen).

MG: There’s no right answer here; I prefer tabs because you can override the tab width, and personally favour a tab width of 4. The rest of your points are just making excuses for shitty tooling.

The github example shows a particularly shit piece of tooling. You can improve it using the Refined Github extension. The same code from my browser:

MG (about the 80 columns): It’s 2020. I have a widescreen monitor, as does just about every other computer user on the planet. The ’80 columns’ meme needs to die a fiery death.

(I’m not a complete savage, though. I’ve kept within 120 columns throughout the library.)

Overall design

Here I’ll talk about the library’s design and architecture without going into the details.

Configurability

There are many preprocessor definitions to configure the code, for very few benefit in my opinion. Just with TOML_IMPLEMENTATION, TOML_HEADER_ONLY and TOML_INTELLISENSE, there’s already 8 combinations of how the files are parsed. And we haven’t passed the preprecessor step yet. My suggestion: reduce the combinations of features. In general, the use of many features means a hard to test program. Keep it to the minimum for more robust software.

MG: TOML_INTELLISENSE is used internally and not intended to be user-configurable. The only configuration points offered to users are the ones specified on the ‘Configuration’ page.

Regarding configurability in general: Other header-only libraries I’ve used in the past have leaned heavily into configurability (e.g. the stb libs), and it’s a quality that I’ve always appreciated. You’re right that it makes testing a bit harder, but the tests actually do take this into account by building many permutations. It was pretty easy to automate.

File naming

Prefixing file names with toml_ is noise. The files are already in a toml++ folder, there should be no ambiguity in the include directives neither this way.

MG: I agree. This was a poor decision made early and now I’m stuck with it. Another thing I plan to fix in v3.0.

Single-header library

The header takes too much time to parse:

# echo '#include <toml.hpp>' \
    | time -p g++ -x c++ -c - -I. -std=c++17
real 1.34
user 1.29
sys 0.05

It’s not surprising that parsing a single-header library takes a long time. Yet, it feels to me like a huge cost to pay for the service.

In my humble opinion, the single-header hype is a design for lazy users, and not the good kind of lazy. If I can understand this decision in order to ease the adoption of the library, I think nevertheless that it hurts its users in the not so long run, as well as the C++ community who will follow the lead and produce single-header libraries too. Thus increasing again and again the compilation time of the projects everywhere.

By way of comparison, since TOML and Json have similar yet not identical scope, I have counted how many times I included json/value.h from JsonCpp in a project of mine. There were 69 occurrences. If I used toml++ instead, it would have been a whole minute and a half just dedicated to parsing the header library.

My suggestion: split the library in multiple headers. One concept in one header, one header for one concept. No noise, low coupling. Put the implementation in .cpp files and provide a static library.

MG: The multi-file version of the library is already structured this way. (the single-file version is generated from the multi-file version using a python script).

Users can already include more specific toml++ headers now if they wish, though I don’t believe it makes much sense to do so. The library serves a highly-specialized purpose; there’d be little benefit to encouraging this. One exception to this would be providing a lightweight header just for forward declarations; the closest thing to this currently is toml_common.h.

Regarding your example of including json/value.h 69 times, I’d suggest this is just poor structure. Include it once. A single source of truth for third-party includes is always better.

Toml++ is likely to always be header-only. Users can control where the implementation gets compiled, though; see the documentation for TOML_HEADER_ONLY.

Testing

It’s good to find tests in the repository. After running them I see that all tests requiring a non installed locale fail:

$ ninja test
Ok: 48
Expected Fail: 0
Fail: 96
Unexpected Pass: 0
Skipped: 0
Timeout: 0

My suggestion: skip the tests if the locale is not installed.

MG: This would be unwise; letting tests of a text-based library ‘pass’ without testing locales with different number formatting etc. is just inviting carelessness. Setup requirements for testing are documented in CONTRIBUTING.md, so tests that fail because a contributor hasn’t followed fairly simple setup instructions are what I would happily deem ‘user error’.

See this issue for background on why I feel this way.

The details

Here are a few notes and questions that came while I was reading the code. Sometimes I found answers to my questions but I still write them for reference. Maybe you’ll have the same questions when reading this library.

In toml.h, macros are undefined only if TOML_UNDEF_MACROS is defined. Why is it optional? I would have thought that it was good practice to clean up the defines, at least those not intended to be defined by the user code.

MG: Just as with TOML_INTELLISENSE above, TOML_UNDEF_MACROS is used internally and not intended to be user-configurable. It’s automatically set to 1 if not already defined:

#ifndef TOML_UNDEF_MACROS
   #define TOML_UNDEF_MACROS 1
#endif

So, it already behaves the way you want it to. The only reason it’s optional is so that I can use the internal warning management macros and some function attributes in the unit tests without having to reinvent those wheels or copy and paste a bunch of stuff.

In toml_date_time.h, about struct date:

  • I’m not sure that pack() is more interesting than using  std::make_tuple(year, month, day) or comparing the fields one by one. (Actually it is more interesting. It makes more instructions but less branches: https://godbolt.org/z/vfTrvW)
  • Why operator== and operator!= compare the members instead of using pack() like the ordering operators? (Because it makes less instructions.)
  • I think & should by used instead of && in order to avoid branches in operator==. Or maybe it has no impact, I’m not sure.

Similar comments apply to struct time.

About struct time_offset:

  • What is the benefit of initializing the members to zero in time_offset::time_offset() instead of using the default implementation which would do nothing? I would prefer to let the user set the fields to zero if he needs to.

MG: Constexpr. Uninitialized fundamental types are illegal in constexpr, and since that class has a non-default constructor, I have to explicitly provide the default one too. Simply having it be = default wouldn’t suffice either as it wouldn’t be valid constexpr (because the field would be uninitialized), so here we are.

  • In time_offset::time_offset(int8_t, int8_t), there’s a useless cast to int16_t. The type of h * 60 + m is int and the result will be cast implicitly when constructing minutes.

MG: If you compile with high warning levels, implementing it as you suggest causes a narrowing warning, hence the explicit cast. In general I prefer to be explicit and not pass compiler warning burden onto users. (I probably should have just made the parameters int in the first place…)

About struct date_time,

  • Same comment than for time_offset about initializing the members to zero.
  • There’s some inconsistency: the implementation of operator{!=,>,>=(date_time, date_time) is interesting. Why not use this implementation for date, time, time_offset?

MG: Not sure what you mean here. If you mean “why do I implement some of these operators in terms of others in date_time, but implement them all explicitly in time and date?”, then the answer is “ time and date are sufficiently simple that there isn’t really any difference in maintainability or code-size either way; they’re all one-liners”.

In toml_common.h:

  • using string_map = std::map<…>. I must ask: why not std::unordered_map? The string_map is used only in toml_table.h and the order is not important here. I guess it is of interest for display, in print(const table&) from toml_default_formatter.h. If it is the only reason I would suggest to use an std::unordered_map and do the sorting in the print() function only.

MG: By snipping out the full typedef, you’ve snipped out the answer to your question. The full typedef is:

template <typename T>
using string_map = std::map<std::string, T, std::less<>>;

The std::less<> part enables heterogeneous lookup. In C++20 this can also be done with unordered_map, but not in C++17.

In toml_node.{h,hpp}:

  • the copy constructor and the assignment actually do not copy, and there’s a comment “this is not an error”. I don’t get why.

MG: The full comment provides an indication (though could be more detailed):

// does not copy source information
// this is not an error

The only state stored in the base class is the source_region, which doesn’t make sense to copy (see this discussion for more info).

  • node::operator node_view<>(), I don’t get why the operator is declared here instead of switching node_view::node_view(viewed_type* node) to public visibility. Since the latter exists we may as well make it public and drop the operator from node.

MG: You must have reviewed an older version of the library; that constructor is indeed public in the current version.

  • There’s two representations of the type of a value: the values in the node_type enum and the subclasses of node.

MG: Correct. The former is to avoid RTTI (lots of people aren’t fond of it, evidently).

  • Having a class with both virtual functions and type testing functions (e.g. is_table()) is bad OOP: a root class should know nothing about its subclasses.

MG: It’s only bad OOP in an open system. The set of meaningful subclasses is finite, small, and the library implements them all. There’s no user-extensibility here.

  • Instead of relying on inheritance to represent the value types I’m quite sure we could have used a union and no inheritance.

MG: Indeed, retrospectively this was a poor design decision and is something I’d change in a future v3.0 (I note as much here).

In toml_node_view.h:

  • I’m not sure the helpers in node_view are worth 600 lines, compared to simply using a node*.

MG: A node view is not at all similar to a node*. I suggest you read the node_view documentation and revisit this point.

In toml_utf8.h:

  • Most functions could be implemented using is_match(). For example:
bool is_ascii_whitespace(codepoint)
{
  return is_match(codepoint, U'\t', U' ');
}

MG: I don’t think anything would be gained by doing so. Debug performance matters; I’d much prefer a simpler call stack than a slightly simpler function body, if the function itself is trivially simple either way.

  • The name of is_unicode_whitespace() is misleading since it does not test U'\t' and U' ', which are valid unicode characters too.

MG: True; it was originally called is_non_ascii_whitespace but that’s a bit of a mouthful, despite being more precise. “Unicode” here means “Unicode, excluding the ASCII range”. Same applies for all the is_unicode_XXXX functions.

  • Same problem for is_unicode_line_break().
  • What is the point of the template argument and the if constexpr in hex_to_dec()? Since the function calls itself recursively by casting the argument to std::uint_least32_t if T is not already this type, then why not just declare it as hex_to_dec(std::uint_least32_t)? The only reason I could see is that it allows to silence a warning about loss of precision in a call of the later with, say, an std::uint_64_t, but why would we want to hide the loss of precision to the caller?

MG: Yeah, that is weird. I don’t remember exactly how that function came to be implemented that way, though I suspect you’re right in that I was just suppressing warnings. It’s called in a few places in both the parser and formatters with char, char32_t, utf8_codepoint and uint_least32_t, so it’s likely there was some compiler spam I got frustrated with.

  • I’m skipping is_unicode_letter(), because… meh. It’s not that it’s problematic, just that it’s hard to read :) I have no better implementation to suggest, though.

MG: That function is machine-generated and not meant to be human-readable. It’s literally write-only code. I wish that weren’t the case, but when it comes to ‘proper’ unicode handling there isn’t really any alternative; at some point in any compliant unicode stack you’re going to find machine-generated code, massive lookup tables, or both. I’d rather just use something provided by the language, but alas, C++…

  • utf8_decoder::state_table: there seems to be an attempt at formatting the table but it looks like it did not work.

MG: That decoder is based on this: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/. Specifically, the 2010 ‘optimized’ version about two-thirds down the page; the formatting comes directly from the state table. I’m not going to pretend to fully understand the implementation of the decoder, but the author thought it meaningful to split the state table into four ‘blocks’, who am I to disagree?

  • utf8_decoder::has_code_point(): the statement return state == uint_least32_t{} feels like a very convoluted way to test against zero. Why not return state == 0?

MG: On the contrary; state == uint_least32_t{} is uint_least32_t == uint_least32_t, whereas state == 0 is uint_least32_t == int. Some compilers are smart about this and treat the 0 literal in a sane way, some compilers are dicks about it and emit a signed/unsigned comparison warning. The more explicit version handles the latter.

In toml_utf8_streams.h:

  • In utf8_byte_stream::utf8_byte_stream(sv), the loop header for (size_t i = actual_len; i --> 0_sz;) is cryptic and confusing. Use the more classical form for (size_t i = actual_len - 1; i != 0_sz;) {--i; …} so even a beginner can understand the loop’s range without thinking twice.

MG: Your proposed implementation is incorrect; it skips the first value (godbolt), and overflows when the starting length is already 0 (godbolt). I agree that the way I’ve written it is a bit confusing at first, but it’s correct in all cases.

  • same function, in if (source[i] != Char{}) // not '\0': drop the comment and actually write zero: if (source[i] != 0).

MG: See above. Using Char{} is always exactly correct, anything else invites compiler warning spam.

  • if (source.length() != actual_len) // not '\0': the comment makes no sense.

MG: Yup. I think this is a copy-paste leftover from some refactoring I did in this area.

  • Calls to source.length() can be saved by copying actual_len in another variable, like const size_t announced_len(actual_len), then comparing both values in the first outer if and also by using actual_len in the last if.

MG: Yeah, true.

With these changes the function would look like:

explicit constexpr utf8_byte_stream(std::basic_string_view<Char> sv) noexcept
  : source{ sv }
{
  // trim trailing nulls
  size_t actual_len = source.length();

  const size_t announced_len = actual_len;

  for (size_t i = actual_len; i != 0_sz;)
  {
    --i
    if (source[i] != '\0')
    {
      actual_len = i + 1_sz;
      break;
    }
  }

  if (announced_len != actual_len)
    source = source.substr(0_sz, actual_len);

  // skip bom
  if (actual_len >= 3_sz
      && memcmp(utf8_byte_order_mark.data(), source.data(), 3_sz)
         == 0)
    position += 3_sz;
}

toml_parser.h:

  • parse_result’s move constructor: the line parse_error{std::move(res).error()} makes no sense to me. We cast res to a parse_result&& using std::move, then we call error() on it and create a new parse_error with the result. I was expecting parse_error{std::move(res.error())}. (Actually the std::move(res).error() will call the rvalue overload, i.e. the one that returns a parse_error&&, so it makes sense, but I don’t see any difference with std::move(res.error()) then).

MG: Yup, you’re right; given that parse_error has rvalue overloads, there’s no difference between parse_error{ std::move(res).error() } and parse_error{ std::move(res.error()) }. It’s entirely a stylistic choice.

  • parse_file(std::basic_string_view<Char>): the author provides a way to support different character types in the string view but also don’t take into account the second type parameter of std::basic_string_view<CharT, Traits>.

MG: True, though this is by design; limiting string view inputs to the ‘canonical’ ones reduces the maintenance surface area of the library; I don’t want to have to care about implementations with custom char traits, and people using them are likely to have more specialized needs anyway.

  • There’s the same problem where requiring an std::basic_istream<Char>.

MG: As above. Additionally, the stream-based machinery being templated is perhaps a bit misleading; it’s a trick to delay instantiation so I can avoid publicly including the various stream headers when TOML_HEADER_ONLY == 0. I realize this is, ahem, hack as fuck. It’s something I’ll revisit in a future v3.0.

Conclusion

My overall feeling of the toml++ library is that it is a quite well engineered tool with a little bias of trying to handle too many use cases : single-header, multiple-headers, support for wchar on Windows, template character type, IntelliSense, support for “large” files, exceptions, etc. It’s hard to please everybody.

According to the library’s website the header-only feature is optional but unless I missed the way to build a static library I think it is not: the single header feature is optional but the alternative is still a header-only version, just with multiple headers.

MG: This refers to the ability to control where the implementation is compiled when TOML_HEADER_ONLY == 0. Doing so transforms it into a more traditional split header/implementation model.

I really appreciate the use of recent C++ features and there seems to be a real intent to make an easy-to-use library: there’s documentation, code samples, and the code is globally clean (except when the indentation fails or and the lines never break). As we can see from Mark’s comments it is a well thought library developed with care.

Since I have never used TOML files I cannot tell you if this library does the job or not, but from what I’ve read on the web it seems that its users are very satisfied. If you did try it or any other TOML library, feel free to share your thoughts in the comments. Also, if you think of another small open source library that would need a review, feel free to share !