PIMPL idiom in C++

Thu, Aug 13, 2020

Developers working on a “software library” in C++ (or any other native language) should follow a set of critically important rules, or their customers will soon run into trouble using their product. Examples of these rules are Semantic Versioning, good API design and keeping backward compatibility. The latter has many aspects and deserves its own detailed post. In this post, I will discuss one of the most widely used techniques in C++, which helps developers keep backward compatibility at the binary (ABI) level.

What is Backward Compatibility?

Backward Compatibility is a promise made by the creator of an interface. Briefly speaking, it means that the newer version of a product (which exposes an interface) is still usable by all users of the older versions. The term has been widely used to describe properties of various systems, especially in telecommunications and computing. As an example, consider the Universal Serial Bus protocol. A host that implements USB 3.0 will work properly with older gadgets that implement USB 2.0 or 1.1.

Speaking of native libraries written in C++, there are several aspects of backward compatibility. The two most commonly considered are source compatibility (at the API level) and binary compatibility (at the ABI level).

On the source level, BC usually means following proper versioning rules. For example, software that links against version 1.2.3 of a library should also be able to link against version 1.2.4 or 1.3.0 of it. According to the semantic versioning rules, as long as the MAJOR version number (the first number) has not changed, backward compatibility is preserved. API compatibility guarantees that the client source does not need any modification to use the compatible new version. It does not mean, however, that the binary artefacts are interchangeable: one may or may not be able to use the new library as a drop-in replacement for the older version. That brings us to ABI compatibility. If ABI compatibility is broken, then clients need to re-compile and link against the new version, even though they have not changed a single line of their code.
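The versioning rule described here can be sketched as a small check. This is only an illustration: the version struct and the api_compatible function are hypothetical helpers, not part of any library, and the sketch assumes a MAJOR version of 1 or higher.

```cpp
// Hypothetical helper type, for illustration only.
struct version {
    unsigned major_number;
    unsigned minor_number;
    unsigned patch_number;
};

// Returns true if a client written against `built` can expect API
// compatibility from `installed` under semantic versioning rules:
// the MAJOR number must match, and the installed version must not be
// older than the one the client was written against.
inline bool api_compatible(const version& built, const version& installed) {
    if (built.major_number != installed.major_number)
        return false;
    if (installed.minor_number != built.minor_number)
        return installed.minor_number > built.minor_number;
    return installed.patch_number >= built.patch_number;
}
```

For example, api_compatible({1, 2, 3}, {1, 3, 0}) holds, while api_compatible({1, 2, 3}, {2, 0, 0}) does not. Note that this says nothing about ABI compatibility.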

A binary compatibility promise guarantees that a newer version of a library is a drop-in replacement for the older one, and that the client code does not need to be re-compiled. This kind of BC is of great importance, because there are many scenarios in which the lack of ABI BC causes serious trouble. Consider security updates, for example. Let’s assume a vulnerability is found in your library. Naturally, you fix the bug as soon as possible, keeping the API intact, and then deploy your changes. If you fail to keep the ABI compatible with the previous version, you’ll be in big trouble: the entire code base that links against your library has to be re-compiled. Most probably you don’t own all of the client code, so you will have to ask your customers to re-compile their code just to adopt your tiny little fix! But if you keep the ABI intact, then the update is merely a matter of replacing a binary at the customer site (replacing a DLL file on Windows, for example).

Methods of Keeping ABI Backwards Compatible

There are several ways to make it easier to keep BC promises at the ABI level. An important one is to produce Position Independent Code. Calls into the library then go through lookup tables (the GOT and PLT), and function addresses are resolved by symbol name at runtime before they are called. This way, if one re-orders the non-virtual member functions of a class or adds new ones, backward compatibility is kept, despite the fact that the actual addresses have changed. (Virtual functions are different: re-ordering them or inserting new ones changes the vtable layout.) On Linux-like systems, passing the -fPIC flag to the compiler produces position independent code. That is especially useful when building shared libraries. In fact, almost all shared libraries in common repositories do this.

Using -fPIC is not the whole story, though. There exist situations in which you will have to modify the memory layout of an existing class. One example is adding a new member variable. That changes the size of the class, and if you add it to the beginning of the members list, it also changes the offsets of all existing members.

PIMPL Idiom

Pointer to Implementation (pimpl) is a programming technique that helps developers to preserve ABI across versions. Using pimpl enables developers to keep ABI BC in a wide variety of scenarios. That means that, using this method, you may add as many new member variables as you need and the ABI will remain intact.

In order to use the pimpl idiom, you must put all of the member variables of the class that could be subject to change in future versions inside a non-API class / struct, then point to an instance of it on the heap (using either a raw or a smart pointer).

A Simple Example

Let’s have a look at a simple example to see how pimpl can help to preserve BC. This example shows what the problem is with breaking BC and how to fix it using the pimpl idiom. To do so, I will introduce a very simple class named person which keeps the first name and last name of an individual, and does nothing else. For the sake of simplicity I have removed many details like the symbol exporter #define, modifiers, etc. So our little class looks like this:

class LIBFOO_API person {
public:
  person(const std::string& name, const std::string& last);
  ~person() = default;
  std::string name() const;
  std::string last() const;
private:
  std::string m_name;
  std::string m_last;
};

Like any common shared library in Linux, I am going to compile it with gcc, adding -fPIC:

g++ -DLIBFOO_EXPORT -shared -fPIC -fvisibility=hidden -o libfoo.so ./libfoo.cpp
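The LIBFOO_API macro and the -DLIBFOO_EXPORT flag belong to the symbol-export details that were left out of the listing. One common way to define such a macro is sketched below. This is an assumption about how libfoo might do it, not the original definition; exported_example is a hypothetical class that only shows where the macro goes.

```cpp
// Hypothetical export macro for libfoo (an assumed definition).
// With -fvisibility=hidden, only symbols marked with LIBFOO_API are
// exported from the shared library.
#if defined(_WIN32)
#  if defined(LIBFOO_EXPORT)
#    define LIBFOO_API __declspec(dllexport)  // building the library
#  else
#    define LIBFOO_API __declspec(dllimport)  // using the library
#  endif
#elif defined(__GNUC__)
#  define LIBFOO_API __attribute__((visibility("default")))
#else
#  define LIBFOO_API
#endif

// Hypothetical class, used only to show the macro in place.
class LIBFOO_API exported_example {
public:
    int value() const { return 42; }
};
```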

To demonstrate how BC works, I am also going to need a simple program that uses this library. Let me write it like this:

#include <iostream>
#include "libfoo.hpp"

int main(int argc, char* argv[]) {
  person people[3] {{"Dexter", "Fortescue"},
                    {"Armando", "Dippet"},
                    {"Albus", "Dumbledore"}};
  for (int i = 0; i < 3; ++i)
    std::cout << "Hello " << people[i].name() << "!\n";

  return 0;
}

This program can be considered the client’s code, which uses our library. To compile and link it, the customer may invoke this:

g++ -o program ./main.cpp -L. -lfoo

The output the user expects from the program looks like this:

$ ./program 
Hello Dexter!
Hello Armando!
Hello Albus!
$ echo $?
0

Everything looks fine. Let’s assume this is version 2.1.4 of libfoo, and in the next version we are required to keep the age of the individuals. So we are going to modify the person class and add a member variable accordingly. This is not a breaking change at the API level, so the new version number will be 2.2.0:

class LIBFOO_API person {
public:
  person(const std::string& name, const std::string& last);
  person(const std::string& name, 
         const std::string& last, 
         const uint16_t age);
  ~person() = default;
  uint16_t age() const;
  std::string name() const;
  std::string last() const;
private:
  uint16_t m_age;
  std::string m_name;
  std::string m_last;
};

So I am going to compile the library as before. The client code has no information about the change and has no way to know about the age variable. The part of the API that the client has been aware of has not changed at all. All functions, the constructor and the members are, from an API point of view, the same as before. So we would expect the code to run properly with no change. Sadly, that’s not the case. Now if I try to run the client’s software, it will crash:

free(): invalid pointer
Aborted (core dumped)
$ echo $?
134

To see why this happens, we must take a look at the memory layout of the objects in use. First let’s see the initial version of the library (2.1.4):

 0 | class person
 0 |   class std::__cxx11::basic_string<char> m_name
32 |   class std::__cxx11::basic_string<char> m_last
   | [sizeof=64, dsize=64, align=8,
   |  nvsize=64, nvalign=8]

After adding age, we can observe how the memory layout has changed. The offsets of m_name and m_last now differ, and so does the size of the class:

 0 | class person
 0 |   uint16_t m_age
 8 |   class std::__cxx11::basic_string<char> m_name
40 |   class std::__cxx11::basic_string<char> m_last
   | [sizeof=72, dsize=72, align=8,
   |  nvsize=72, nvalign=8]

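But why does the crash happen? The client code never touches a member variable directly, and the library is position-independent code, so one might expect the client to work with the new version. The reason for this behaviour is the ABI break caused by the changed size and layout of the class. The client was compiled against the old header, so it reserves 64 bytes per person on its stack, while the new library’s constructor initialises 72 bytes per object. The stack on the client side is therefore corrupted. In addition, the destructor is defaulted in the header, so the copy compiled into the client still destroys m_name and m_last at their old offsets, which no longer hold valid strings. That is most likely where the invalid free() comes from. There are other examples in which the client’s code segfaults directly instead. There are even more complicated situations in which there is absolutely no change to the API, yet the ABI breaks.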
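The mismatch can be made visible with a small sketch. The two structs below are hypothetical stand-ins that only mirror the data members of person in versions 2.1.4 and 2.2.0, and name_offset is a helper introduced here for illustration. The exact numbers depend on the platform and standard library, so the sketch only compares them.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for the data members of person in version 2.1.4.
struct person_v214 {
    std::string m_name;
    std::string m_last;
};

// Stand-in for the data members of person in version 2.2.0.
struct person_v220 {
    std::uint16_t m_age;
    std::string m_name;
    std::string m_last;
};

// Distance in bytes from the start of an object to its m_name member.
template <typename T>
std::ptrdiff_t name_offset() {
    T obj{};
    return reinterpret_cast<const char*>(&obj.m_name) -
           reinterpret_cast<const char*>(&obj);
}

// A client built against 2.1.4 reserves sizeof(person_v214) bytes per
// object, but the 2.2.0 library works with sizeof(person_v220) bytes.
static_assert(sizeof(person_v220) > sizeof(person_v214),
              "adding a member grows the class");
```

Here name_offset<person_v214>() is 0, while name_offset<person_v220>() is greater than 0, which matches the layout dumps above.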

Fix ABI breaks using PIMPL

In order to implement the PIMPL idiom, we’ll need to change the person class like this:

class LIBFOO_API person {
public:
  person(const std::string& name, const std::string& last);
  ~person();  // defined in the source file; must delete m_impl
  std::string name() const;
  std::string last() const;
private:
  struct details {
      details(const std::string& name, const std::string& last);
      std::string m_name;
      std::string m_last;
  };
  details* m_impl;
};

You can use smart pointers like std::unique_ptr instead of raw pointers. You may also move the definition of details somewhere else, such as a non-API header or the beginning of the source file. That would provide a stronger level of encapsulation and separation of implementation. Note that details is already outside the public API, since it has private access level. If you apply both of the aforementioned changes, the final class would look like this:

class LIBFOO_API person {
public:
  person(const std::string& name, const std::string& last);
  ~person();  // defined in the source file, where details is a complete type
  std::string name() const;
  std::string last() const;
private:
  struct details;
  std::unique_ptr<details> m_impl;
};

Let’s go ahead and compile our new PIMPL-ready library and also the client’s code. Now we have a version of the library which provides an ABI that is resilient to changes. The memory layout now looks like the dump below. We can observe that the layout includes only a pointer to the implementation. There is no trace of any data whatsoever.

 0 | class person
 0 |   struct person::details * m_impl
   | [sizeof=8, dsize=8, align=8,
   |  nvsize=8, nvalign=8]

Obviously, we must also modify the implementation. Now all functions need to go through an extra level of indirection to access the underlying data. For example, name() would look like this:

std::string person::name() const {
    return m_impl->m_name;
}

The client code then compiles and links against libfoo just like before.
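Putting the pieces together, a minimal single-file sketch of the PIMPL version could look like the following. It assumes the std::unique_ptr variant, leaves out LIBFOO_API, and disables copying to stay short. The split into header and source parts is only indicated by comments.

```cpp
#include <memory>
#include <string>

// ---- public header part (what clients see) ----
class person {
public:
    person(const std::string& name, const std::string& last);
    ~person();  // declared only; defined where details is complete
    person(const person&) = delete;             // copying is left out
    person& operator=(const person&) = delete;  // of this sketch
    std::string name() const;
    std::string last() const;
private:
    struct details;                   // forward declaration only
    std::unique_ptr<details> m_impl;  // the only data member
};

// ---- source file part (hidden from clients) ----
struct person::details {
    details(const std::string& name, const std::string& last)
        : m_name(name), m_last(last) {}
    std::string m_name;
    std::string m_last;
};

person::person(const std::string& name, const std::string& last)
    : m_impl(new details(name, last)) {}

// Defaulted here, because unique_ptr needs the complete type of
// details in order to delete it.
person::~person() = default;

std::string person::name() const { return m_impl->m_name; }
std::string person::last() const { return m_impl->m_last; }
```

The destructor is defined in the source part on purpose: if it were defaulted inside the class, client code would have to destroy a std::unique_ptr to an incomplete type, which does not compile.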

Now, let’s say we need to deploy a new version of libfoo containing age. We will need to add a variable to the details class, then update our API accordingly. The code for the new version of the person class will look like this (the additions to the public API are marked with a comment):

class LIBFOO_API person {
public:
  person(const std::string& name, const std::string& last);
  ~person();  // defined in the source file; must delete m_impl
  std::string name() const;
  std::string last() const;
  // New members
  person(const std::string& name, 
         const std::string& last, 
         const uint16_t age);
  uint16_t age() const;
private:
  struct details {
      details(const std::string& name, 
              const std::string& last, 
              const uint16_t age);
      uint16_t m_age;
      std::string m_name;
      std::string m_last;
  };
  details* m_impl;
};

Note that there is no need to keep the API of details compatible, which is why its old two-argument constructor was simply replaced rather than kept alongside the new one. That is because details is not part of libfoo’s public API, so we can break things as we like. Now if we look at the memory layout of the new version of person, we can observe that it is exactly the same as the previous one. So the client’s program can use this new library without a re-compile. They just need to replace the old libfoo binary artefact with the new one. We already did exactly that by re-compiling the library (which replaces libfoo.so).
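The claim about the unchanged layout can be checked at compile time. In the sketch below, the two versions of the class are placed in hypothetical namespaces named before and after so that they can be compared in one file. Only declarations are needed, because the size of person does not depend on what details contains.

```cpp
#include <cstdint>
#include <string>

namespace before {  // PIMPL version without age
class person {
public:
    person(const std::string& name, const std::string& last);
    ~person();
    std::string name() const;
    std::string last() const;
private:
    struct details;   // would hold m_name and m_last
    details* m_impl;
};
}  // namespace before

namespace after {  // PIMPL version with age
class person {
public:
    person(const std::string& name, const std::string& last);
    person(const std::string& name, const std::string& last,
           std::uint16_t age);
    ~person();
    std::uint16_t age() const;
    std::string name() const;
    std::string last() const;
private:
    struct details;   // would hold m_age, m_name and m_last
    details* m_impl;
};
}  // namespace after

// Adding age changed details, not person: size and alignment match.
static_assert(sizeof(before::person) == sizeof(after::person),
              "size of person is unchanged");
static_assert(alignof(before::person) == alignof(after::person),
              "alignment of person is unchanged");
```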

Pros and Cons

Besides making ABI BC easier, using the PIMPL idiom is beneficial in some other ways. Amongst the pros of PIMPL-enabled classes are:

Interface Segregation: PIMPL provides better encapsulation of data, since it hides the entire implementation from the public API. Using PIMPL one can even hide some dependencies, which is useful for the client.
Compile Time Improvement: Since the main class hides its implementation details, its header does not need to #include the headers those details depend on. Therefore the client’s code is spared from having unnecessary details in its domain. The size of a PIMPL-enabled header file after pre-processing can be significantly reduced by forward declaring types, resulting in better compile times.

There is no such thing as a cost-free abstraction, right? PIMPL adds another layer of indirection. So let’s see what the cost is:

Development Effort: Using PIMPL in most cases requires a non-default destructor, copy and move constructors and their corresponding assignment operators. In terms of development effort, PIMPL adds a cost.
Performance: As mentioned before, PIMPL adds a level of indirection. However, since the pointer to details has a lifetime bound to the actual instance’s lifetime, compilers can often reduce the cost of de-referencing it. Using PIMPL also moves data away from its creation point (onto heap memory, actually), which adds an allocation per object and probably results in less cache-friendly code.
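The development effort mentioned above can be sketched as follows. This is one possible way to write the special member functions for a PIMPL class, assuming deep-copy semantics and the std::unique_ptr variant. It is a sketch, not the only valid design.

```cpp
#include <memory>
#include <string>
#include <utility>

class person {
public:
    person(const std::string& name, const std::string& last);
    ~person();
    person(const person& other);             // deep copy
    person& operator=(const person& other);  // deep copy
    person(person&& other) noexcept;
    person& operator=(person&& other) noexcept;
    std::string name() const;
private:
    struct details;
    std::unique_ptr<details> m_impl;  // null after being moved from
};

// Everything below would live in the source file.
struct person::details {
    std::string m_name;
    std::string m_last;
};

person::person(const std::string& name, const std::string& last)
    : m_impl(new details{name, last}) {}

person::~person() = default;

// Copying duplicates the hidden details object, so the two person
// instances never share state. A moved-from source yields a null copy.
person::person(const person& other)
    : m_impl(other.m_impl ? new details(*other.m_impl) : nullptr) {}

person& person::operator=(const person& other) {
    if (this != &other)
        m_impl.reset(other.m_impl ? new details(*other.m_impl) : nullptr);
    return *this;
}

// Moving only transfers the pointer.
person::person(person&&) noexcept = default;
person& person::operator=(person&&) noexcept = default;

// Must not be called on a moved-from object in this sketch.
std::string person::name() const { return m_impl->m_name; }
```

All of these functions are defined in the source file, so none of them leaks the layout of details into client code.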

Final Thoughts

You must note that adding PIMPL does not resolve your library’s problems automagically! You cannot apply PIMPL to a non-PIMPL class without actually breaking the ABI. Also, there are situations in which you have no other way than breaking the ABI to provide a feature or fix a bug.

If your library has a huge user base, then PIMPL can be a saviour. Otherwise, it may not even be worth the effort, for example, to keep the ABI intact for an in-house tool or a very specific library for your teammates in your company. Some developers add an empty raw pointer (a void* or a pointer to a forward-declared, non-existing type) just in case. That is often considered good practice.

A final point is that there is no reason to hide all members in the implementation details. You can always have members which are not subject to change alongside a pointer to the implementation details.