Monthly Archives: December 2016

Dual problems in dual boot

Background

At Endless we distribute an installer that enables users to run both our own OS and Windows side by side, in a setup commonly known as dual-boot. Dual-boot allows users to try out our OS on their own hardware, without giving up the Windows system with which they are familiar or which they need to retain in order to run some Microsoft applications. A dual-boot enabled computer will, after powering on, display a simple menu screen, at which point the user can select which OS to run.

A grub menu screen showing two boot options: Endless OS and Windows

Most dual-boot installations work by overwriting the bootloader. You can think of the bootloader as the prima causa of all the software running on your computer. Every other piece of software that gets run (including the OS itself) must be loaded and executed by some other piece of software. The chain of causation stops at the bootloader, our unmoved mover. It sits right at the beginning of a hard disk, in the boot sector, and our hardware is physically configured to run whatever binary code is in that boot sector on startup.

What the Endless installer does is write a program called GRUB, the GRand Unified Bootloader, into that boot sector. As you might guess from its grandiose title, GRUB is a powerful and flexible tool that combines in one program the ability to boot a variety of free operating systems, including Endless OS. One of the operating systems it does not know how to boot directly, however, is Windows, so to get into Windows we use a process called chainloading. Before we write GRUB into the boot sector, we first copy out what was already there (namely, the Windows bootloader) and save it on the Windows partition. When a user chooses to boot into Windows, GRUB hands control over to the saved Windows bootloader (this hand-off is the chainloading step) and the Windows boot sequence can begin.
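
The save-then-overwrite step can be sketched in a few lines of Python. This is only an illustration working on an in-memory copy of the first sector, not the installer's actual code: the function name `swap_boot_code` is hypothetical, and the sketch assumes the common MBR layout in which the first 440 bytes hold bootstrap code and everything after them (disk signature, partition table, 0x55AA signature) must be preserved.

```python
SECTOR_SIZE = 512
BOOT_CODE_SIZE = 440  # bootstrap code area; bytes 440-445 hold the disk signature


def swap_boot_code(sector: bytes, new_boot_code: bytes) -> tuple[bytes, bytes]:
    """Return (backup_of_old_sector, patched_sector).

    Hypothetical helper: only the bootstrap code area is replaced. The disk
    signature, the partition table and the trailing 0x55AA signature are
    copied over unchanged from the original sector.
    """
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly one 512-byte sector")
    if len(new_boot_code) > BOOT_CODE_SIZE:
        raise ValueError("boot code does not fit in the bootstrap area")

    backup = bytes(sector)  # full copy of the old sector, kept for chainloading
    padded = new_boot_code.ljust(BOOT_CODE_SIZE, b"\x00")
    patched = padded + sector[BOOT_CODE_SIZE:]
    return backup, patched
```

The important property is that the backup is taken before anything is written, so the original bootloader can always be handed control later or restored.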

Immediately after the bootstrap code, inside that same first sector, comes the partition table, where information about the size and location of each of the disk’s partitions is stored. A classic MBR partition table has room for four primary partitions; more are possible only through an extended partition or a different partitioning scheme such as GPT.

The structure of a Master Boot Record (MBR) showing the bootstrap code and the subsequent partition table entries. (Wikipedia)
The layout of a disk drive, showing the boot record at the start.
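
To make the layout concrete, here is a small Python sketch that reads the four partition entries out of a classic MBR sector. It assumes the standard MBR layout (entries of 16 bytes each starting at offset 446, and the 0x55AA signature in the last two bytes); the names `PartitionEntry` and `parse_partition_table` are made up for this example.

```python
import struct
from typing import NamedTuple


class PartitionEntry(NamedTuple):
    bootable: bool      # status byte is 0x80 for the active partition
    type_id: int        # partition type, e.g. 0x07 for NTFS
    first_lba: int      # first sector of the partition
    sector_count: int   # size of the partition in sectors


def parse_partition_table(sector: bytes) -> list[PartitionEntry]:
    """Decode the non-empty entries of a classic MBR partition table."""
    if len(sector) != 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")

    entries = []
    for index in range(4):
        start = 446 + 16 * index
        raw = sector[start:start + 16]
        # status, 3 CHS bytes (skipped), type, 3 CHS bytes (skipped),
        # then two little-endian 32-bit integers
        status, type_id, first_lba, count = struct.unpack("<B3xB3xII", raw)
        if type_id == 0:  # type 0 marks an unused slot
            continue
        entries.append(PartitionEntry(status == 0x80, type_id, first_lba, count))
    return entries
```

Feeding it the first 512 bytes of a disk image returns one entry per used slot, which is enough to see where each partition starts and how large it is.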

Unlike some other Linux distros, such as Ubuntu, Endless does not create a separate partition on which to place its OS image. Rather, we stick the image into the C: drive of Windows itself, extending it to provide adequate space for the OS to work with, and hiding it with ACLs so that the user does not accidentally delete it while back on Windows! In fact, what we put in C: is an entire endless directory, which contains the endless.img image file, as well as a GRUB subdirectory with the grub.cfg file required to complete GRUB’s boot process. To identify the C: drive, GRUB looks for the partition that contains the \endless\endless.img file. Once the partition is found, it can retrieve the grub.cfg file and complete the boot process. To boot into Endless, we launch the kernel, passing it these extra parameters:

endless.image.device=[id for partition holding endless.img] endless.image.path=/endless/endless.img
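
The lookup logic can be modelled in Python as follows. This is a sketch of the idea only, comparable in spirit to GRUB's `search --file` command; it is not GRUB code, and whether Endless uses that exact command is an assumption. The partition ids passed in are placeholders, since the real id format is not shown here.

```python
from pathlib import Path
from typing import Optional

IMAGE_RELATIVE_PATH = Path("endless") / "endless.img"


def find_image_partition(mounted: dict[str, Path]) -> Optional[str]:
    """Return the id of the first partition whose root holds endless/endless.img.

    `mounted` maps a (placeholder) partition id to the directory where that
    partition's filesystem is visible.
    """
    for part_id, root in mounted.items():
        if (root / IMAGE_RELATIVE_PATH).is_file():
            return part_id
    return None  # no partition carries the image: nothing to boot from


def kernel_arguments(part_id: str) -> str:
    """Build the two extra kernel parameters for the partition that was found."""
    return (
        f"endless.image.device={part_id} "
        f"endless.image.path=/{IMAGE_RELATIVE_PATH.as_posix()}"
    )
```

Note the `None` case: if the image file is missing from every partition, the search has nothing to return.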

The disk layout on a computer with Ubuntu and Windows in dual boot.

The disk layout of a computer running EndlessOS and Windows in dual-boot.

As one might have surmised, because we are modifying the part of your computer that actually boots the rest of the OS, the stakes are rather high. In contrast to a bug in application code, which might – at worst – crash the application but leave the rest of your system intact, a bug here could render your entire computer unbootable. You could turn it on, but instead of an OS you would just get a dreary GRUB error screen. That is exactly what happened to a user of ours here in Brazil.

The Problem

A user reported that he had completed the installation process, but, upon reboot, had hit a black screen with a cryptic GRUB error message.

This error screen appeared after a user installed Endless and rebooted.

By examining the log file of another computer which had hit the same problem, we were able to figure out what went wrong. What appears to have happened is the following: our user completed the installation process successfully, but then, before exiting the installer, he closed his laptop lid, which puts the computer ‘to sleep’, or, more technically, causes Windows to send running applications a WM_POWERBROADCAST message carrying the PBT_APMSUSPEND event. Closing the laptop was of course a perfectly reasonable thing to do, but due to a bug in our installer code, this message caused the installer application to move into an error state, and when our user reopened his laptop lid later in the day, he saw the installer on an error page:

 22:06:48 - Analytics: response code 200
 22:20:17 - Received WM_POWERBROADCAST with WPARAM 0xA LPARAM 0x0
 01:40:39 - Received WM_POWERBROADCAST with WPARAM 0xA LPARAM 0x0
 01:40:40 - Received WM_POWERBROADCAST with WPARAM 0x4 LPARAM 0x0
 01:40:40 - Received PBT_APMSUSPEND so canceling the operation.
 01:40:40 - EndlessUsbToolDlg.cpp:4174 CEndlessUsbToolDlg::CancelRunningOperation
 01:40:40 - DownloadManager.cpp:439 DownloadManager::ClearExtraDownloadJobs
 01:40:40 - CEndlessUsbToolDlg::CheckInternetConnectionThread cancel requested
 01:40:40 - DownloadManager.cpp:452 Error calling EnumJobs. (GLE=[0])
 01:40:40 - EndlessUsbToolDlg.cpp:4155 CEndlessUsbToolDlg::EnableHibernate
 01:40:40 - EndlessUsbToolDlg.cpp:1606 CEndlessUsbToolDlg::ChangePage
 01:40:40 - ChangePage requested from ThankYouPage to ErrorPage
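
The WPARAM values in the log are the power event codes defined in the Windows SDK. As a reading aid, the following Python sketch maps them back to their names; the helper `decode_power_events` is invented for this post and only recognises the handful of codes listed in the table.

```python
import re

# Power event codes as defined in the Windows SDK headers.
POWER_EVENTS = {
    0x4: "PBT_APMSUSPEND",
    0x7: "PBT_APMRESUMESUSPEND",
    0xA: "PBT_APMPOWERSTATUSCHANGE",
    0x12: "PBT_APMRESUMEAUTOMATIC",
}

LOG_PATTERN = re.compile(
    r"Received WM_POWERBROADCAST with WPARAM (0x[0-9A-Fa-f]+)"
)


def decode_power_events(log_lines):
    """Return the names of the power events found in installer log lines."""
    events = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match:
            code = int(match.group(1), 16)
            events.append(POWER_EVENTS.get(code, f"unknown (0x{code:X})"))
    return events
```

Read this way, the two 0xA entries are ordinary power status changes, and it is the 0x4 entry at 01:40:40 that corresponds to the suspend which sent the installer to its error page.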

The spurious error page prompts our user to “Retry” the installation, which he dutifully does, despite the fact that nothing was actually ‘wrong’ in the first place. The installer at this point stumbles upon a much more pernicious bug: it fails to check whether Endless is already installed before starting to install it again. It plows right ahead, deleting our C:\endless directory and recreating it (unnecessary, but not by itself problematic), and then goes to install GRUB. When it looks in the boot sector for the Windows MBR to copy out, it sees that the bootloader there is not in fact a Windows one. It’s something else, something strange and unexpected that it doesn’t know how to deal with, so it aborts the installation process and deletes the C:\endless directory it had just created:

 08:02:50 - EndlessUsbToolDlg.cpp:5174 CEndlessUsbToolDlg::WriteMBRAndSBRToWinDrive
 08:02:50 - EndlessUsbToolDlg.cpp:5320 CEndlessUsbToolDlg::GetPhysicalFromDriveLetter
 08:02:50 - Opened drive \\.\PHYSICALDRIVE0 for write access
 08:02:50 - EndlessUsbToolDlg.cpp:5134 CEndlessUsbToolDlg::IsWindowsMBR
 08:02:50 - C:\ has a non-Windows MBR
 08:02:50 - Error: no Windows MBR detected, unsupported configuration.
 08:02:50 - EndlessUsbToolDlg.cpp:4993 Error on WriteMBRAndSBRToWinDrive (GLE=[87])
 08:02:50 - SetupDualBoot exited with error.
 08:02:50 - EndlessUsbToolDlg.cpp:4698 CEndlessUsbToolDlg::RemoveNonEmptyDirectory
 08:02:51 - Removing directory 'C:\endless\' result=0

The irony of course is that this ‘strange’ bootloader that our installer refuses to deal with is our GRUB installation that we put there in the first place! The computer is now, although still running, a dead man walking. Once you turn it off, it will never boot again, because it is stuck in limbo: it has a GRUB bootloader, but without the requisite C:\endless\endless.img file in place, GRUB can’t find the C: partition, and, hence, can’t find the rest of the code needed to complete the GRUB initialization process and boot into either Endless or Windows.

The Fix

As is often true with nasty, edge-case bugs like this one, the real work is in finding and reproducing it (and in that we were lucky to have the log file of a similarly affected machine). Once the nature of the problem was known, fixing it was quite straightforward. We cannot boot the machine from its internal disks, but we can boot from a USB drive. Once running on a ‘live’ Endless USB, we can do the repair work needed to get the system back into a coherent state. Concretely, we recreate the deleted C:\endless\ directory, and inside of it we put a blank endless.img file (so GRUB can find the partition) and a grub\ subdirectory (so GRUB can complete its execution). We can then power off the machine and remove the USB drive. Upon reboot, GRUB will now execute to completion and, since endless.img is just a blank file, will boot directly into Windows. From there we can run the uninstaller to clean everything out, then download a new version of the installer and do a proper install.

It is these two items, the endless.img file and the grub directory, that must be manually restored to the Windows partition in order for the computer to be bootable again.
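
For illustration, the manual repair could be scripted roughly like this in Python. It is a sketch under assumptions rather than the procedure we actually ran: `windows_root` stands for wherever the Windows partition is mounted on the live system, and `grub_source` is a hypothetical directory holding a known-good copy of the grub files (including grub.cfg).

```python
import shutil
from pathlib import Path


def restore_boot_files(windows_root: Path, grub_source: Path) -> Path:
    """Recreate endless/ with a blank endless.img and a grub/ subdirectory.

    Both arguments are assumptions of this sketch: the mount point of the
    Windows partition and a directory containing the grub files to copy.
    """
    endless_dir = windows_root / "endless"
    endless_dir.mkdir(exist_ok=True)

    # A blank file is enough for GRUB to recognise the partition again.
    (endless_dir / "endless.img").touch()

    # Put the grub files back so GRUB can finish its own start-up.
    shutil.copytree(grub_source, endless_dir / "grub", dirs_exist_ok=True)
    return endless_dir
```

On a live USB session this would be called with the mounted Windows partition as the first argument; `dirs_exist_ok` needs Python 3.8 or newer.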

The new version of the installer has two important fixes to ensure this error does not happen again. Firstly, it will not show a spurious error page when the application receives the suspend signal; secondly, the error code path now includes a check for whether Endless is already installed in C:\endless before embarking on the install process all over again. Had that check been in place, the installer would never have deleted endless.img the second time round, thus avoiding that dreadful ‘half-way’ state. This problem was, therefore, one of those curious oddities in software development where you have two bugs layered on top of each other: a glaring, catastrophic one hidden in the code path of a subtle, marginal one.
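
The shape of the two fixes can be sketched as a pair of small guards. This is a simplified Python model, not the installer's C++ code: the page names echo the ThankYouPage and ErrorPage seen in the log, while `Page`, `page_after_suspend` and `may_start_install` are hypothetical names, and treating a suspend during a running install as an error is an assumption of the sketch.

```python
from enum import Enum, auto


class Page(Enum):
    INSTALLING = auto()
    THANK_YOU = auto()
    ERROR = auto()


def page_after_suspend(current: Page) -> Page:
    """Fix 1: a suspend only interrupts an install that is still running.

    Once the installer has reached the thank-you page there is nothing left
    to cancel, so the page must not change.
    """
    if current is Page.INSTALLING:
        return Page.ERROR
    return current


def may_start_install(endless_already_installed: bool) -> bool:
    """Fix 2: refuse to (re)install on top of an existing installation."""
    return not endless_already_installed
```

Either guard on its own would have broken the chain of events described above: the first keeps the user on the thank-you page, and the second stops a retry before it touches C:\endless or the boot sector.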