The original motivation to revisit my Boggle project back in January was that I wanted to experiment with mixing Python and C++. This had the potential to be a best-of-both-worlds combination: the developer productivity and ergonomics of Python and the performance of C++ for the parts where it really matters.
(For an explainer on the Boggle project, check out my announcement post or Ollie Roeder’s Financial Times article about it.)
Boggle is all about finding performance optimizations. I pretty quickly settled on a high-level strategy: write most of the code in Python, prototype ideas there, and port only the performance-critical pieces to C++.
Overall this worked extremely well, with a few interesting exceptions. This blog post walks through the details. I hope it will give you a sense for whether this is a combination you’d like to use and, if you do, help you avoid learning some of these lessons the hard way.
TL;DR: pybind11 is great; it lets you mix Python and C++ with a minimum of fuss. But performance in Python is only loosely correlated with performance in C++. A refactor that gets a 5x speedup in Python may be neutral or negative in C++.
C++ is not a new language, nor is it fashionable in 2025. I’ve dabbled in Rust, but C++ was my primary language from roughly 2006–2011. It’s familiar to me, and learning a new low-level language wasn’t my goal for this project. Plus, I had lots of Boggle code from 2009 that was already written in C++, and I wanted to take advantage of that.
Why pybind11? I tried Cython, but I was able to get a toy project up and running more easily with pybind11. Cython seems more geared towards gradually adding C++ bits to a pure Python project, whereas I wanted to bring existing C++ code into a Python project.
There’s also nanobind, a newer tool from the same developer. This is more opinionated than pybind11 and claims to be faster. One of those opinions is that you should set up a C++ build system (such as CMake or Bazel), and I didn’t want to do that.
pybind11 is a descendant of Boost.Python. Despite the name, it works just fine with newer versions of the standard (I used `-std=c++20`).
pybind11 automates the process of creating Python wrappers for your C++ functions using C++ template metaprogramming. In particular, it binds all the STL containers to their Python equivalents. Your C++ function returns a `set<pair<string, vector<int>>>`? No problem. You’ll get a `set[tuple[str, list[int]]]` in Python.
For example, here’s an (abridged) version of the C++ Boggler class, which takes a Trie (Prefix Tree) and a Boggle board and finds all the words on it:
```cpp
// trie.h
class Trie {
 public:
  Trie();
  ~Trie();
  static unique_ptr<Trie> CreateFromFile(const char* filename);

 private:
  // ...
};
```
```cpp
// boggler.h
class Boggler {
 public:
  Boggler(Trie* t);

  // Find the sum of the scores of all words on this board.
  int Score(const char* lets);

  // Find the paths to all words on this board (for the web UI).
  vector<vector<int>> FindWords(const string& lets);

 private:
  // ...
};
```
(This 2007 post explains how to use a Trie to find all the words on a Boggle board.)
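The trie search itself is simple enough to sketch in pure Python. This is an illustrative reimplementation of the idea, not the project's actual code, and it ignores real-world details like Boggle's "Qu" cube:

```python
from typing import Dict, Iterator, Set

# Standard Boggle scoring: 3-4 letters = 1 point, 5 = 2, 6 = 3, 7 = 5, 8+ = 11.
SCORES = {3: 1, 4: 1, 5: 2, 6: 3, 7: 5, 8: 11}

def make_trie(words) -> Dict:
    """Build a trie as nested dicts; "$" marks the end of a word."""
    root: Dict = {}
    for word in words:
        node = root
        for c in word:
            node = node.setdefault(c, {})
        node["$"] = True
    return root

def neighbors(i: int) -> Iterator[int]:
    """Yield the cells adjacent to cell i on a 4x4 board (row-major)."""
    x, y = i % 4, i // 4
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if (dx or dy) and 0 <= nx < 4 and 0 <= ny < 4:
                yield ny * 4 + nx

def score(board: str, trie: Dict) -> int:
    """Sum the scores of all distinct words on a 16-letter board."""
    found: Set[str] = set()

    def dfs(i: int, node: Dict, word: str, used: Set[int]) -> None:
        child = node.get(board[i])
        if child is None:
            return  # no word in the trie continues with this letter
        word += board[i]
        if "$" in child and len(word) >= 3:
            found.add(word)
        used.add(i)
        for j in neighbors(i):
            if j not in used:
                dfs(j, child, word, used)
        used.remove(i)

    for i in range(16):
        dfs(i, trie, "", set())
    return sum(SCORES[min(len(w), 8)] for w in found)
```

Walking the trie and the board in lockstep is what makes this fast: the search abandons any path that isn't a prefix of some dictionary word.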
Here’s the pybind11 wrapper code:
```cpp
// cpp_boggle.cc
PYBIND11_MODULE(cpp_boggle, m) {
  py::class_<Trie>(m, "Trie")
      .def_static("create_from_file", &Trie::CreateFromFile);
  py::class_<Boggler>(m, "Boggler")
      .def(py::init<Trie*>())
      .def("score", &Boggler::Score)
      .def("find_words", &Boggler::FindWords);
}
```
To build a C extension, you run your C++ compiler with the appropriate include paths, which the pybind11 package will print for you:

```shell
$ c++ -std=c++20 -fPIC -shared -O3 -undefined dynamic_lookup \
    $(poetry run python -m pybind11 --includes) \
    cpp_boggle.cc trie.cc boggler.cc \
    -o ../cpp_boggle$(python3-config --extension-suffix)
```
(Why `-undefined dynamic_lookup`? See this issue; it seems to be a macOS thing.)
This produces a build artifact that includes your system architecture and Python version:
```shell
$ ls -l cpp_boggle.cpython-31*
-rwxr-xr-x 1 danvk staff 1.0M Aug 27 10:01 cpp_boggle.cpython-313-darwin.so
```
Here’s a first gotcha: make sure you use the same Python version when you build your C++ and execute your Python. If you switch Python versions in one terminal but not another, you can wind up in a confusing situation where you build a new version of your C++ but execute an old one.
Finally, here’s the Python code:
```python
from cpp_boggle import Trie, Boggler

t = Trie.create_from_file("wordlists/enable2k.txt")
b = Boggler(t)
print(b.score("abcdefghijklmnop"))  # 18
print(b.score("perslatgsineters"))  # 3625
```
A few things to note here:

- The `score` function is 20-30x faster than the equivalent in pure Python. This speedup is typical.
- pybind11 converts between Python `str` and C++ `const char*`. It’s also happy to convert the STL types `const string&` and `vector<vector<int>>`.
- `CreateFromFile` returns a `unique_ptr`, a C++ smart pointer. When you return a `unique_ptr` (or a raw pointer), pybind11 will have Python’s garbage collector track it. When `t` goes out of scope (in Python), it will be destroyed. This is usually what you want but, if it’s not, you can disable this with a `return_value_policy::reference` annotation in the pybind11 wrapper:
```cpp
// trie.h
Trie* FindWord(const char* wd);

// cpp_boggle.cc
py::class_<Trie>(m, "Trie")
    .def_static("create_from_file", &Trie::CreateFromFile)
    .def("find_word", &Trie::FindWord, py::return_value_policy::reference);
```
When I got segfaults from Python, a missing `return_value_policy` was usually the culprit.
The fear with mixed solutions is always that they’ll be leaky abstractions. In this case, that would mean that you’d need to be proficient with Python, C++, and pybind11, rather than just Python and C++.
That fear didn’t materialize. Once I understood the return value policy, I could mostly just write Python and C++ and forget that pybind11 was in the middle. If I added a class or method, I’d need to add a wrapper, but this was pretty mechanical. I never ran into a performance issue that was due to pybind11.
I really liked being able to write all my code that wasn’t performance sensitive in Python. This included CLI wrappers, data serialization and unit tests. Using pytest to write unit tests for my C++ code was particularly nice.
In the end, you are still writing C++, of course. You still have to worry about segfaults, alignment and the layout of your structs. This is the cost of that 20-30x speedup. But at least you only pay it for the small part of your code that’s truly performance critical.
For an open-ended project like this, you wind up exploring lots of ideas that don’t pan out. In the end, at least 90% of my ideas for speeding up BoggleMax either didn’t work or eventually got replaced with something better. If you can explore an idea and reject it in Python, without paying the cost of implementing it in C++, then you save a lot of time and mental energy. One thing I remembered from working on this in 2009 was that it was a real bummer to sink a week or two of spare time into implementing an algorithm in C++, only to not have it work out. I wanted to avoid that.
This worked as I hoped some of the time. Prototyping an idea in Python was a good way to hash out the details and learn about unforeseen complications. And having a reference Python implementation made porting to C++ much easier. I could use the same tests to ensure that the two implementations matched. LLMs are very good at porting code between languages, so I often didn’t need to do this myself.
Where this didn’t work at all was in prototyping performance optimizations. You optimize for the system you develop in. If you make a change and your code gets slower, you’ll reject it. If it makes it faster, you’ll keep it. Unfortunately, changes that make your Python code faster may or may not have that effect on the C++ version. This effect is not subtle. I found a few 5x speedups in Python that wound up being completely neutral in C++.
Why does this happen? At a high level, running pure Python code is slow. Anything you can do to shift the bottleneck to C/C++ code is going to be a win. Porting your code to C++ is an obvious way to do that, but there are more subtle ways.
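As a toy illustration of this (my example, not from the project): both functions below are "pure Python," but the second pushes the hot loop into the interpreter's C implementations of `sum` and `range`:

```python
import timeit

def total_loop(n: int) -> int:
    """Sum 0..n-1 with an explicit Python-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total

def total_builtin(n: int) -> int:
    """Same result, but the iteration happens inside C code."""
    return sum(range(n))

# Same answer; the builtin version is typically several times faster
# because the hot loop never executes Python bytecode.
if __name__ == "__main__":
    print(timeit.timeit(lambda: total_loop(10_000), number=100))
    print(timeit.timeit(lambda: total_builtin(10_000), number=100))
```

Translated to C++, both versions are just a loop, so the "optimization" evaporates.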
For example, I found that memoizing a particular function in Python was a big win. But memoizing the same function in C++ had no effect. Looking at a CPU profile revealed why. By memoizing the Python function, you shift more of the computation to the hash
builtin, which is written in C. So memoization is a sneaky way of migrating your work from Python to C.
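To sketch the effect (again an illustration, not the post's actual function): memoizing with `functools.lru_cache` replaces repeated Python-level recursion with a C-level hash table lookup:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_paths(x: int, y: int) -> int:
    """Monotone grid paths: a deliberately expensive recursive function,
    standing in for whatever you'd actually memoize."""
    if x == 0 or y == 0:
        return 1
    return count_paths(x - 1, y) + count_paths(x, y - 1)

# The first call with given arguments runs Python bytecode; repeated
# calls are answered from lru_cache's hash table, which runs in C.
```

In C++, both the memoized and unmemoized versions run compiled code, so the cache lookup has to beat recomputation on its own merits.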
Another subtle effect: I developed on my M2 MacBook, but I did the big runs on Intel chips on Google Cloud. These chips have different memory bandwidth, cache sizes and branch prediction behaviors. If I’d developed on Intel, I might have made different choices.
Sometimes an idea would be easy to implement in Python, then I’d port it to C++ and realize it depended on garbage collection in a way that I hadn’t thought about. Memory management matters. Using an arena wound up being a huge performance win, but there’s no way to prototype that in Python.
In the end, this meant that it was still a lot of work to build out an idea in C++ before I found out whether it was a performance win. Still, having the reference Python implementation did make it easier to find bugs and keep my sanity along the way.
Overall, combining C++ and Python worked very well. For the most part, I was able to get the best of both worlds: the ergonomics and developer productivity of Python for 90% of my code, and the speed of C++ for the 10% where it mattered.
If you’re comfortable with C++ and Python, I’d highly recommend using pybind11. (If you like CMake, you might try nanobind.) Be aware of its behavior when you return a pointer. But otherwise, it really just works.
Prototyping in Python and then porting to C++ was more of a mixed bag. It did help to hammer out the details of an idea and get a correct reference implementation. But it didn’t provide useful guidance on what would be a performance win. C++ isn’t always exactly 20-30x faster than Python. Speedups in Python are only loosely correlated with speedups in C++. If performance is the goal, you’ll still have to get a complete C++ implementation before you know if you have a win.
Please leave comments! It's what makes writing worthwhile.