• Patch Lady – what is a race condition?

    This topic came up the other day after the issue with the unbootable devices occurred in January.  Kirsty recommended that I blog in a bit more detail about what was going on.

    The issue is covered in KB4075150, which discusses how to recover from the problem caused by the January and February updates.  As a reminder, in January on some (but not all) machines we saw folks hit with an “inaccessible boot device” error, and the only way they could recover was to pull off the January updates.  Then in February we saw devices lose the use of any USB-connected item, including mice and keyboards.

    The underlying cause is described in the KB:

    This issue occurs in the unlikely event, due to a race condition, that the Windows Update servicing stack incorrectly skips installing the newer version of some critical drivers in the cumulative update and uninstalls the currently active drivers during maintenance.

    So what exactly is a race condition?  A hint is documented in KB317723, which specifically talks about race conditions and deadlocks as seen in Visual Studio.  Race conditions and deadlocks can actually occur in any software; they are not unique to any one platform or any one piece of software.  In our case we saw this race condition in the updating process.

    A race condition occurs when two threads access a shared variable at the same time. The first thread reads the variable, and the second thread reads the same value from the variable. Then the first thread and second thread perform their operations on the value, and they race to see which thread can write the value last to the shared variable. The value of the thread that writes its value last is preserved, because the thread is writing over the value that the previous thread wrote.
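    The description above can be sketched in a few lines of Python.  This is only a minimal illustration, not anything taken from the KB: the function names and the shared counter are made up, and a barrier is used to force both threads to read before either one writes, so the lost update shows up every time instead of only on an unlucky run.

```python
import threading


def run_unsynchronized():
    """Two threads read the same value, then each writes back value + 1."""
    shared = {"value": 0}
    both_have_read = threading.Barrier(2)  # hypothetical helper to force the bad interleaving

    def worker():
        local = shared["value"]        # both threads read 0
        both_have_read.wait()          # neither writes until both have read
        shared["value"] = local + 1    # the last writer overwrites the first

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["value"]


def run_with_lock():
    """Same work, but the read and the write happen as one locked step."""
    shared = {"value": 0}
    lock = threading.Lock()

    def worker():
        with lock:                     # only one thread at a time gets in here
            local = shared["value"]
            shared["value"] = local + 1

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["value"]


print(run_unsynchronized())  # 1: one of the two increments was lost
print(run_with_lock())       # 2: both increments survived
```

    In real software nothing forces the bad ordering the way the barrier does here, which is exactly why a race condition hits some machines and not others.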

    In the 1709 updating process, as the component based servicing was attempting to update the acpi.sys file, it removed the old file and attempted to replace it with the good file.  But because of (well, basically) a bug in the process, it didn’t complete the task properly: the proper acpi.sys file was never reinstalled, and your system failed to boot.  The acpi.sys file is a key and core component of the boot process.  If it isn’t in place, your system doesn’t boot.

    A race condition becomes a bug when events do not happen in the order the programmer intended.
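    To picture why the order of the steps matters when a file is being replaced, here is a rough sketch in Python.  It is not how the Windows servicing stack is implemented (I have no view into that code); the function names, the file names and the fail_midway flag are all made up for the illustration.  It compares “remove the old file, then install the new one” with “stage the new file, then swap it in” when something goes wrong in between.

```python
import os
import tempfile


def risky_update(path, new_bytes, fail_midway=False):
    """Remove the old file first, then write the new one."""
    os.remove(path)
    if fail_midway:
        # Anything that stops us here leaves no file at all.
        raise RuntimeError("interrupted after removal")
    with open(path, "wb") as f:
        f.write(new_bytes)


def safer_update(path, new_bytes, fail_midway=False):
    """Write the new file next to the old one, then swap it into place."""
    staged = path + ".new"
    with open(staged, "wb") as f:
        f.write(new_bytes)
    if fail_midway:
        # Anything that stops us here still leaves the old file in place.
        os.remove(staged)
        raise RuntimeError("interrupted before swap")
    os.replace(staged, path)  # swaps the staged file over the old one


def demo():
    """Interrupt both strategies midway and report whether a file survives."""
    results = {}
    with tempfile.TemporaryDirectory() as folder:
        for name, update in (("risky", risky_update), ("safer", safer_update)):
            path = os.path.join(folder, name + ".sys")
            with open(path, "wb") as f:
                f.write(b"old driver")
            try:
                update(path, b"new driver", fail_midway=True)
            except RuntimeError:
                pass
            results[name] = os.path.exists(path)
    return results


print(demo())  # {'risky': False, 'safer': True}
```

    With the first approach an interruption leaves nothing behind, which is the kind of gap that matters a great deal when the file in question is needed to boot.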

    I’d argue there was another failure here: that of Windows 10 to be self-healing.  I have had several credible reports from folks that they got one shot to run the DISM commands to remove the faulty update and fix their machines.  If they tried various commands they found on the web, especially in the early days of this issue, they ended up harming the component based servicing stack more than helping.

    The most savvy folks on the web had a backup image of their Windows 10 and therefore were able to roll back.  This is why I still stress so much with Windows 10 to have a backup.  Some folks I’ve talked with feel that the days of client backup are over.  All of your data should be in OneDrive for Business, you should have your account tied to a Microsoft account, you should be signed up for an Office 365 or Microsoft 365 account such that no matter what device you log into your profile roams with you.  But when your device doesn’t boot… AT ALL … you need options to get your system back up and running as fast as you can.  And one shouldn’t overlook the “don’t move my icons on my desktop” that I see expressed by many a computer user as they migrate from one platform to another.  Any end user is way more comfortable seeing their desktop back to the way it was.  Getting your tools on your computer back to how you like it takes time.  We are creatures of habit.

    So bottom line:

    1. I’m not seeing this as widespread as before (thank goodness) but there are still a few tweets and forum posts that have me keeping an eye out to see if we are really out of the woods.
    2. It did not occur 100% of the time.  While I can introduce you to consultants that had 30 machines out of 100 in a network nailed, I can also introduce you to consultants who went… “what issue?  Mine are fine.”  This is what made it so frustrating for me: I could never point to any single thing as a root cause.  Normally one can point to SOMETHING… something third party, something antivirus related (case in point, the recent failures to install the March Windows 7 update on some machines with Bitdefender installed were traced to that a/v), but in this case there wasn’t any smoking gun to point to.  “Some were lucky, some were not” as the root cause is, for me, an unacceptable diagnosis when it comes to operating systems.
    3. I only saw this on 1709 versions of Windows 10.
    4. It just reinforces to me that you HAVE to have a backup, and that you should make sure you have downloaded the media and have a bootable flash drive ready.  The best way to fix a Windows PC is to have another Windows PC handy.

    Ensuring you have media and a backup means you are ready for anything… and if nothing happens, well, all the better.  It gives you peace of mind that you can deal with anything.