
Wikipedia:Reference desk/Archives/Computing/2024 July 20

Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


July 20


CrowdStrike outage - C-00000291*.sys file contents


I'm curious as to what exactly caused the current CrowdStrike outage. According to this article,[1] the root cause is a single file located in C:\Windows\System32\drivers\CrowdStrike\ and named C-00000291*.sys. Even though it has the .sys extension, it is not a kernel driver. I don't have CrowdStrike on my PC, and I'm just wondering: is this file a text file, and if so, has anyone compared the broken version with the fixed version and figured out the delta? A Quest For Knowledge (talk) 21:51, 20 July 2024 (UTC)[reply]

No, this is the Computing ref desk of Wikipedia; we hold ourselves aloof from such minor niggles as the proximate cause of bringing down half the world's IT systems because some CrowdShite CrowdStrike techie couldn't be arsed to check whether their patch worked before releasing it into the wild (and onto half-asleep IT teams across the globe). It's a channel .sys file used to update the Falcon Sensor software, not an AV update, and it seems to have caused a page fault; try The Register, e.g.[2] It will be a BitLocker problem for many, since Win 10 removed the ability to boot directly into Safe Mode. Where are your 48-digit recovery keys?[3] On-prem, I hope, in a locked safe to which there are two keys, since one has gone on holiday with the IT department's boss. MinorProphet (talk) 16:59, 21 July 2024 (UTC)[reply]
It is not a single file; there are different files whose names match the pattern.[4]  --Lambiam 20:57, 21 July 2024 (UTC)[reply]
Apparently the error was not in the patch itself. A logic error in an existing driver, CrowdStrike’s Endpoint Detection and Response (EDR) driver, running in kernel mode, had until recently gone unnoticed. It was triggered when parsing the content of the patched file, a configuration file. (For more detail, see [5], but the reporting on where the logic error was is a bit confused.)  --Lambiam 21:15, 21 July 2024 (UTC)[reply]
As is Microsoft's commitment to OS security.[6] MinorProphet (talk) 23:34, 21 July 2024 (UTC)[reply]
There were at least two procedural problems involved, in addition to the code error. The first procedural error was by CrowdStrike; the second was by the user community, with assistance from Microsoft. The error by CrowdStrike was inadequate pre-deployment testing. They should have had a small number of target systems on site running applications that resembled some of the operation-critical applications of their customer base. They should have deployed the patch to those target systems and verified that they came up cleanly. Obviously, they did not do that. They had some sort of system testing prior to deployment, but not a simulation of actual deployment to target systems.
Second, if I unplug my desktop computer from the UPS and then plug it back in, it comes up to a blue screen that is not exactly a blue screen of death. I will call it a blue screen of birth. It says that Windows did not load properly the last time; it really means that Windows did not shut down properly, but that is a minor detail. It then gives me the option of restarting my PC normally, or of choosing among various recovery options. Either the customers didn't have a USB drive or CD-ROM to boot from, or the startup sequence had been altered to assume a normal restart automatically. Something was wrong with the restart procedures of the customer base. Robert McClenon (talk) 02:47, 25 July 2024 (UTC)[reply]
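The staged-release discipline described above (deploy to a handful of representative machines, verify they come back up, only then ship widely) can be sketched in a few lines of Python. Everything here is invented for illustration: the host names, the update dictionary, and the idea that a single profile string determines whether a host survives are assumptions, not anything CrowdStrike actually runs. It shows the shape of the process, not a real deployment system.

```python
def staged_rollout(update, canaries, fleet, deploy, healthy):
    """Deploy `update` to the canary hosts first; promote it to the full
    fleet only if every canary comes back healthy afterwards.

    `deploy(host, update)` applies the update to one host;
    `healthy(host)` reports whether that host rebooted cleanly.
    """
    for host in canaries:
        deploy(host, update)
        if not healthy(host):
            # Halt before the update ever touches the wider fleet.
            return f"halted: canary {host} failed after update"
    for host in fleet:
        deploy(host, update)
    return "promoted to full fleet"


# Simulated fleet: the bad update bricks any host matching its "breaks" profile.
profiles = {"c1": "win11", "c2": "win10-legacy", "h1": "win11", "h2": "win11"}
state = {}

def deploy(host, update):
    state[host] = "bricked" if profiles[host] == update["breaks"] else "ok"

def healthy(host):
    return state.get(host) == "ok"

bad_update = {"name": "Channel File 291", "breaks": "win10-legacy"}
print(staged_rollout(bad_update, ["c1", "c2"], ["h1", "h2"], deploy, healthy))
# halted: canary c2 failed after update -- h1 and h2 were never touched
```

The point of the sketch is that the blast radius of a bad update is capped at the canary pool; in the simulated run, hosts h1 and h2 never receive the update at all.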
An Australian firm overcame the restart mess: How a cheap barcode scanner helped fix CrowdStrike'd Windows PCs in a flash, from El Reg. MinorProphet (talk) 12:14, 25 July 2024 (UTC)[reply]
CrowdStrike has issued a "Preliminary Post Incident Review". According to this report, it was a bit more complicated (and, in my eyes, looking even worse for the company) than the story I linked to before. It was the result of four failures coming together.
1. An instance of an "InterProcessCommunication Template" was deployed whose content was invalid. This file was not itself a program, but contained data that was meant to be used to detect a certain attack.
2. The file was validated before deployment by an automatic process. The validating program contained a logic error, because of which it failed to detect that the content of the template instance was not valid.
3. As noted above by Robert McClenon, the deployment process of the company did not include a preliminary live test of the effect of deploying the file in a controlled environment, which would have made the issue manifest before worldwide deployment. This is strange; such live tests are standard across the industry. Omitting them and relying instead on static content validation for something so safety-critical and meant to be used in kernel mode is indefensible.
4. The program interpreting the content did not itself attempt to validate the content independently. This is the most forgivable part; unless also developed independently, it would likely have suffered from the same error as the validator of point 2. But as a result of interpreting the invalid content it executed an out-of-bounds memory read, triggering an exception while in kernel mode and resulting unavoidably in the blue screen of death. This is hard to forgive; the interpreter should have been hardened to do a bounds check before performing a potentially out-of-bounds operation.
--Lambiam 18:41, 25 July 2024 (UTC)[reply]
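The driver's source is not public, so the following is only a toy Python sketch of the general failure mode being discussed: an interpreter that trusts a count field in a content file, versus one that bounds-checks before every read. The file layout (a 4-byte record count followed by 8-byte records) and all names are invented for illustration; `struct.error` here plays the role of the kernel-mode out-of-bounds read.

```python
import struct

def parse_trusting(blob: bytes):
    """Toy parser that trusts the record count in the header.
    Invented layout: 4-byte little-endian count, then `count` 8-byte records."""
    (count,) = struct.unpack_from("<I", blob, 0)
    records = []
    for i in range(count):
        # struct.error here is the analogue of the driver's out-of-bounds
        # read: we never checked that the blob is actually long enough.
        records.append(struct.unpack_from("<Q", blob, 4 + 8 * i)[0])
    return records

def parse_hardened(blob: bytes):
    """Same layout, but every read is bounds-checked before it happens."""
    if len(blob) < 4:
        raise ValueError("truncated header")
    (count,) = struct.unpack_from("<I", blob, 0)
    needed = 4 + 8 * count
    if len(blob) < needed:
        raise ValueError(f"header promises {count} records "
                         f"({needed} bytes), only {len(blob)} present")
    return [struct.unpack_from("<Q", blob, 4 + 8 * i)[0]
            for i in range(count)]

good = struct.pack("<I", 2) + struct.pack("<QQ", 10, 20)
bad = struct.pack("<I", 5) + struct.pack("<Q", 10)  # claims 5 records, has 1

print(parse_trusting(good))    # [10, 20]
print(parse_hardened(good))    # [10, 20]
try:
    parse_trusting(bad)        # blows up mid-read, like the BSOD
except struct.error as e:
    print("trusting parser crashed:", e)
try:
    parse_hardened(bad)
except ValueError as e:
    print("hardened parser rejected input:", e)
```

The hardened version turns malformed content into an ordinary rejected-input error instead of a crash mid-read, which is the kind of defensive check point 4 above says was missing.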
This article, "The months and days before and after CrowdStrike's fatal Friday", is subtitled "In the short term, they're going to have to do a lot of groveling". Understatement or what, especially since the CEO of CrowdStrike was the CTO of McAfee when an update borked millions of PCs worldwide in 2010. Useless managers? They promote themselves upstairs. The update was only in operation for 78 minutes; many millions more PCs escaped. Plenty of links within the article for b/g info, and the comments are fairly negatory and forceful. As I understand it, M$ was forced to allow kernel access to 3rd-party suppliers since it couldn't be trusted to keep tabs on the security of the monopoly of its own OS. 'Page fault in nonpaged area' (0x50) has existed as a concept since bounds-checking was a compulsory module of Coding 101. MinorProphet (talk) 21:59, 25 July 2024 (UTC)[reply]