Last Friday we encountered a bug in our codebase whose origin was of such a peculiar nature, and whose eventual resolution proved so instructive, that I thought I would record it here.
First, some brief background. EndlessOS is built for two different computer architectures: x86 and ARM. With a few notable exceptions, personal computers generally run x86, while tablets and smartphones run ARM. In an effort to bring down the cost of PCs (ARM chips tend to be cheaper), Endless has released a desktop computer running ARM. Since the chipsets are designed differently, compilation – that is, the process by which we translate source code to machine code – must happen separately for each architecture. Our OS, with all its accompanying apps and tools, must be built twice and, of course, tested twice. However, application code, living as it does in the cushy land of user space, usually does not exhibit different behaviour when running on ARM versus running on x86. In fact, only if you are working close to the processor, with bit manipulation for instance, should you even care what chip architecture you are running on.
It was, therefore, to our surprise (and, indeed, horror) that we discovered, late in a release cycle, that a number of our apps worked on the x86 machines but not on the ARM machines. The QA team reported that upon opening one of these problematic apps, the user would never be able to access a subset page. Without going into too much detail, suffice it to say that the data in our apps is organized into ‘sets’ and ‘subsets’; clicking on the title of a subset ought to take you to the subset page.
Our apps are built with a homegrown framework we have at Endless called the Knowledge Library. The framework provides a number of components, or ‘Modules’, which you can use to rapidly build complex, data-driven applications. Apps are built in Facebook’s Flux paradigm, complete with a dispatcher, history store, and of course, views in the form of GtkWidgets. I knew (because we had built it) that the module responsible for transitioning between pages in an app was the Pager. Examining the Pager code, we found the line responsible for displaying the subset page:
if (SetMap.get_parent_set(item.model) && this._subset_page) {
    this._show_page_if_present(this._subset_page);
}
If the subset page wasn’t showing up, it meant that one of the two expressions in the if condition was evaluating to false when it shouldn’t. The second condition, this._subset_page, simply checks to see if you have a subset page at all in your app, which I knew we did. The first condition queries the SetMap – an in-memory graph which maps relationships between set models in the database – and asks if the parent set of the model in question is defined. In essence, this is equivalent to asking if the model has a parent set: if it does, then we know that the model is a subset, and that therefore it is appropriate to show the subset page. We suspected, then, that something was up with the contents of this SetMap. Specifically, we guessed, based on knowing how SetMap was written, that it had no data in it at all, and was just mindlessly returning null for any and all set queries we gave it.
We decided to take a look at where SetMap was getting its data from, and found this function:
initialize_set_map: function (cb) {
    Eknc.Engine.get_default().query(Eknc.QueryObject.new_from_props({
        limit: -1,
        tags_match_all: ['EknSetObject'],
    }), null, (engine, res) => {
        let results;
        try {
            results = engine.query_finish(res);
        } catch (e) {
            logError(e, 'Failed to load sets from database');
            return;
        }
        SetMap.init_map_with_models(results.models);
        this._window.make_ready(cb);
    });
}
The relevant context here is that the Engine retrieves data from our database according to the parameters of the query it is given. The query takes the form of a JSON object, and, in this case, is simply:
{
    limit: -1,
    tags_match_all: ['EknSetObject'],
}
What we’re saying here is fetch me content tagged as ‘EknSetObject’, and don’t impose a limit. Now, if our hypothesis was correct, and the SetMap was empty, that meant that this query was returning nothing. However, we knew that we were seeing sets on the home page, and the query which returned those sets was remarkably similar:
{
    limit: limit,
    tags_match_all: ['EknSetObject'],
    sort: Eknc.QueryObjectSort.SEQUENCE_NUMBER,
}
Both queries were asking for sets, but one was getting some and the other was getting none. What was different between them? Well, one difference we can see here is that the home page query requests its results to be sorted by the SEQUENCE_NUMBER key. Our Xapian backend database allows for sorting by predefined keys, and the SEQUENCE_NUMBER key describes the order in which models were added to the database. Having the results in that order is important when showing them on the home page, but in the case of a SetMap, where data isn’t going to be arranged linearly anyway, sorting is obviously irrelevant, so no sort field is included. In any case, whether data is sorted or not shouldn’t affect the quantity of said data that is returned, so we dismissed that as being the source of the problem.
What about that limit field though? Limits certainly do affect the quantity of things returned, and that -1 numeral seemed particularly suspect. I knew we had been using -1 as a special value to indicate ‘no limit’, but couldn’t quite recall how that worked. Just as a test, we changed the limit to a positive value, 10, and ran the app again. The bug was gone! The limit field was indeed the culprit, but our investigation had only just begun. We certainly couldn’t leave this value at 10: remember, we wanted all sets returned, not just 10. Furthermore, this discovery seemed only to raise more questions. We had been using -1 to denote ‘no limit’ for many releases, so why was it only becoming a problem now? If -1 was a problem for the query to handle, why did it return no results instead of throwing an error? And, most troublingly of all, why were we only seeing this bug on the ARM machines?
We dug more into the use of this limit field. Along with the rest of the query, the limit field eventually gets serialized into a JSON string and sent over an HTTP bridge to the local Xapian database server. Once on the server, which, unlike the app, is written in C, it gets parsed with the following line:
limit = (guint) g_ascii_strtod (str, NULL);
Here we were parsing the limit string as a double (the g_ascii_strtod() function converts strings to doubles) and then casting that double to an unsigned integer. Several things seemed awry at first glance: why were we parsing this limit field as a double, when a limit is always a whole number? And, having parsed it as a double, why then cast it to an unsigned integer? Moreover, if it is meant to be an unsigned integer, how is -1 a valid input?
We pulled this line of code out into its own source file and ran it to see what happens when str is "-1":
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <glib.h>
#include <stdint.h>

int
main (int argc, char *argv[])
{
  unsigned int limit = g_ascii_strtod ("-1", NULL);

  fprintf (stderr, "limit := %u\n", limit);
  return EXIT_SUCCESS;
}
Running on my Intel machine, this program printed out 4294967295, more commonly known as 2^32 - 1: the maximum value expressible in a 32-bit unsigned integer. Aha! So here is where the -1 made good on its promise to be a ‘special value’ for fetching all sets. By casting it from a double to an unsigned int, we were, in effect, exploiting the two’s complement method of storing integers to ‘transform’ the same binary number (in this case, 32 consecutive 1s) from a ‘-1’ to UINT_MAX. Setting the limit to such a large value was effectively the same as saying ‘impose no limit’, and that is how we would get all sets returned.
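The ‘same bits, two readings’ idea can be checked in isolation. The following is a minimal sketch, not code from our tree: the helper name bits_as_unsigned is made up for illustration, and it uses the fixed-width types from stdint.h so that the two’s complement layout is guaranteed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret the 32 bits of a signed integer as an
 * unsigned one without changing any of them. int32_t is required to use
 * two's complement, so -1 is stored as 32 consecutive 1 bits. */
uint32_t
bits_as_unsigned (int32_t value)
{
  uint32_t result;

  /* memcpy copies the raw bytes; no numeric conversion takes place */
  memcpy (&result, &value, sizeof result);
  return result;
}
```

Calling bits_as_unsigned (-1) yields UINT32_MAX, while small non-negative values come back unchanged.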
We ran the same code snippet on ARM, and lo and behold, it printed out 0. We knew now that it was this line suffering from architecture dependent behaviour. But why? Why didn’t the ARM machine also exhibit the wraparound trick? It turns out that, in the C programming language, converting a negative floating point value to an unsigned integer, far from being a clever trick, is in fact undefined behaviour. (Arithmetic on unsigned integers is defined to wrap around, but converting a double whose value the target type cannot represent is not.) I had always vaguely been aware of undefined behaviour, but had not, until last week, quite appreciated the import of that term. What I had understood about out-of-range conversions, and other undefined behaviours, was that they were inadvisable, bad practice, perhaps even that they would trigger an error in certain cases. The truth is much more sinister than that, for the C spec is written not so much for you, the C programmer, as it is for the compiler. How operations behave and what constitutes valid syntax are built into the compiler. What, then, does ‘undefined behaviour’ mean? The comp.lang.c FAQ summarizes the standard’s position:
Anything at all can happen; the Standard imposes no requirements. The program may fail to compile, or it may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.
This blog post by a CS professor at the University of Utah provides an excellent overview of undefined behaviour in C and C++, but the bottom line is that once you invoke undefined behaviour, all bets are off: the entire program becomes meaningless. As long as they adhere to the explicit rules in the spec, compilers have enormous freedom. We violated those rules the moment we converted a negative double to an unsigned integer. At that point the compiler could do whatever it pleased and still claim to be a valid C compiler. It could have refused to compile; it could have compiled and then promptly crashed; it could have compiled and then printed out the complete works of Charles Dickens to stdout. All of these are valid executions of the program because the program has now lost all meaning.
One might be tempted to think: ‘but surely only that single line was invalid. Even if it does do something bizarre when it reaches that line, shouldn’t the rest of the code up to that point function normally?’ This objection (which I myself raised) has both a theoretical and a practical answer. Theoretically, the answer is simply no, that’s not how computer languages work: once you include undefined behaviour, the whole program is said to be meaningless. Furthermore, practically speaking, the notion of ‘reaching that line’ itself betrays a misunderstanding of how the compiler works: a compiler is at liberty to rearrange lines of code (which it does indeed do to optimize performance) provided such rearrangements will not have observable side effects on the execution of the program. Of course, a real world compiler probably won’t rearrange things such that the offending conversion screws up the rest of the program, just as a real world compiler probably won’t print out the complete works of Charles Dickens when you try to divide by zero, but the point is that you can’t assume it won’t do those things, and there are lots of things it could do that lie in between that and normal execution. It could, for example, when it sees an out-of-range conversion, choose to assign a zero value to the unsigned int. This is what apparently did happen on our ARM machines, and the most likely reason is that ARM’s floating point conversion instruction saturates: a value too small for the unsigned target is clamped to zero. On x86-64, compilers typically convert the double to a wider signed integer and keep the low 32 bits, which is how -1 came out as all 1s there. In any case, when we consider the infinite domain of options that our illegitimate operation gave the compiler license for, returning zero now actually seems not unreasonable.
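To make the architecture difference concrete, here is a minimal sketch of two well-defined strategies that a compiler and processor could plausibly use for this conversion. The function names are hypothetical, and the mapping to hardware (saturation on ARM, conversion through a wider signed integer on x86-64) reflects typical behaviour as we understand it, not anything the C standard promises.

```c
#include <stdint.h>

/* Saturating strategy: out-of-range inputs are clamped to the nearest
 * representable value, so every negative input becomes 0. */
uint32_t
convert_saturating (double value)
{
  if (!(value > 0.0))            /* negatives, zero and NaN */
    return 0;
  if (value >= 4294967295.0)     /* at or above UINT32_MAX */
    return UINT32_MAX;
  return (uint32_t) value;       /* in range, so well defined */
}

/* Widen-then-truncate strategy: convert to a 64-bit signed integer,
 * then keep only the low 32 bits. This sketch assumes the input fits
 * in an int64_t; anything larger would itself be undefined. */
uint32_t
convert_via_int64 (double value)
{
  int64_t wide = (int64_t) value;   /* -1.0 becomes -1 */

  return (uint32_t) wide;           /* reduced modulo 2^32 */
}
```

Given -1.0, the first function returns 0 and the second returns UINT32_MAX, which matches the two results we observed; for in-range inputs such as 10.0 they agree.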
As is often the case, once we understood the problem, the patch itself was quite simple. Given that we were already late into the release cycle, the aim was to keep the diff on this patch small, so we simply added a check for the negative value:
/* str may contain a negative value to mean "all matching results"; since
 * casting a negative floating point value into an unsigned integer is
 * undefined behavior, we need to perform some level of validation first
 */
{
  double val = g_ascii_strtod (str, NULL);

  /* Allow negative values to mean "all results" */
  if (val < 0)
    limit = G_MAXUINT;
  else
    limit = CLAMP (val, 0, G_MAXUINT);
}
This will ensure that the limit is never cast directly from a negative value to an unsigned int, and that its value always lives between 0 and G_MAXUINT.
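For anyone who wants to try the patched logic without GLib, here is a minimal standalone sketch. The function name parse_limit is hypothetical, strtod stands in for g_ascii_strtod (unlike the GLib function, it is locale dependent), and treating NaN as 0 is an assumption of this sketch rather than something the real patch does.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical standalone version of the validation in the patch. */
unsigned int
parse_limit (const char *str)
{
  double val = strtod (str, NULL);

  /* NaN compares unequal to itself; mapping it to 0 is an assumption */
  if (val != val)
    return 0;

  /* Allow negative values to mean "all results" */
  if (val < 0)
    return UINT_MAX;

  /* Clamp anything too large for an unsigned int */
  if (val >= (double) UINT_MAX)
    return UINT_MAX;

  /* val is now within [0, UINT_MAX), so the conversion is well defined */
  return (unsigned int) val;
}
```

With this in place, "-1" and "1e20" both map to UINT_MAX, while "10" maps to 10 on every architecture.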
As any programmer will tell you, every intense debugging session ends with the desperate phrase, “but how did this ever work?”. The same was true for us, and what still bothered me was why we had never seen this error before, since that xapian-bridge code hadn’t been changed for several release cycles. What had changed this time round was that the part of the codebase that actually handled these queries client-side had recently been ported from JavaScript to C. Grepping through that codebase for any sign of the ‘limit’ field, we came across this line:
g_autofree gchar *limit_string = g_strdup_printf ("%d", limit);
The limit field had always had a GObject type of unsigned int, yet here we were treating it as a signed int. Back when this codebase was in JavaScript this line didn’t even exist, because why explicitly convert each property in an object to a string when we can just call JSON.stringify() directly? In C that serialization had to be done manually, and in doing so we mischaracterized an unsigned integer as a signed one. That’s why -1 was even getting sent to the Xapian server in the first place. It never happened under JavaScript because JavaScript isn’t as low-level as C and hence never needed to ask us whether this was a signed integer or not: it ‘just knew’.
Patching this code up was a mere one character change (switch %d to %u). Last but not least, we fixed the original query to request GLib.MAXUINT32 as its limit. Even though our server now gracefully handled the -1 value, it had always been bad practice to rely upon it. The fact that I found that line confusing when I first delved into the code seemed justification enough to clean it up.
So that was it: three separate bugs, working in tandem to produce this baffling, processor dependent result. I found the entire saga illuminating not only as regards computer architecture but also as regards the power of pair – or in this case quartet – programming. Each step along the way went faster because someone on our team had domain-specific knowledge that could be shared with the rest of the group. Here’s to more collaborative coding!