Like anything else in society, IT has fads. Lean, agile, DevOps—fads come and go, usually accompanied by a sacred text or two that very few people who claim to adopt a given fad actually read. The result is often a misunderstanding of the vocabulary used by those who developed the concepts in the first place. A great example is the confusion surrounding the term “fail fast.”
Fail fast doesn’t mean what most people seem to think it means, which leads to a great deal of angst whenever it’s discussed. That angst is amplified when the topic comes up among Canadian businesses, which are very risk-averse, as an article in the Financial Post suggests.
What it really means to fail fast
To fail fast means that one accepts the reality that all IT systems—be they hardware or software—fail and prefers that they do so quickly and cleanly. But failure isn’t binary. There are tons of different ways that these systems can fail, including partial failures, drawn-out and difficult-to-detect failures, and a pattern of failure followed by a return to normalcy that’s often called “flapping.” Those who practice failing fast prefer that something outright fail if it’s going to fail at all. If your malfunctioning IT simply falls over and dies, then that failure can be detected quickly and responded to quickly.
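In software, the difference between failing cleanly and failing slowly can be sketched in a few lines. The example below is only an illustration: the names `ConfigError`, `parse_port`, and `parse_port_lenient` are hypothetical and do not come from any particular product. The first function stops the moment its input is wrong, while the second hides the problem behind a default and lets it surface later, somewhere harder to diagnose.

```python
class ConfigError(Exception):
    """Raised as soon as a configuration problem is found (hypothetical name)."""


def parse_port(raw: str) -> int:
    """Fail fast: return a valid TCP port or stop immediately with a clear error."""
    try:
        port = int(raw)
    except ValueError:
        raise ConfigError(f"port is not a number: {raw!r}") from None
    if not 1 <= port <= 65535:
        raise ConfigError(f"port out of range: {port}")
    return port


def parse_port_lenient(raw: str, default: int = 8080) -> int:
    """Fail slowly: swallow the error and carry on with a default value."""
    try:
        return int(raw)
    except ValueError:
        # The typo in the config file is now invisible to everyone.
        return default
```

With a mistyped value such as `"44e"`, `parse_port` raises `ConfigError` at startup, where monitoring can see it, while `parse_port_lenient` quietly starts the service on port 8080.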
The whole point of failing fast is to recover from failure quickly. It’s not enough to design your IT to fall over at the slightest hint something is off-kilter. To implement failing fast as it was intended means that monitoring processes exist to detect these failures and—more importantly—that procedures exist to remedy the situation quickly and efficiently. In IT, we remedy most failures by getting a computer to solve the problem for us. Acceptance of failure doesn’t mean systems administrators or developers are standing around, hat in hand, mourning a downed IT system.
Failing fast is typically coupled with an obsession for automated remediation. The web server fails a stability check. This failure is detected, so a backup web server is brought online while a script modifies DNS so the world can find the new server, all without human intervention. That’s fail fast.
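The web server scenario above might look roughly like this in code. This is a simplified sketch under stated assumptions, not a production failover tool: the `Server` class, the `stability_check` and `remediate` functions, and the plain dictionary standing in for DNS are all hypothetical, and the addresses come from the range reserved for documentation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Server:
    name: str
    address: str
    healthy: bool = True


def stability_check(server: Server) -> bool:
    """Stand-in for a real probe such as an HTTP request with a timeout."""
    return server.healthy


def remediate(
    hostname: str,
    primary: Server,
    backup: Server,
    dns: Dict[str, str],
    check: Callable[[Server], bool] = stability_check,
) -> str:
    """Point hostname at the backup if the primary fails its check.

    Returns the address the hostname resolves to afterwards.
    """
    if check(primary):
        return dns[hostname]  # nothing to do
    if not check(backup):
        # No automated fix is available, so fail loudly and involve a human.
        raise RuntimeError(f"{primary.name} and {backup.name} are both down")
    dns[hostname] = backup.address  # the "script modifies DNS" step
    return dns[hostname]


dns_table = {"www.example.com": "192.0.2.10"}
web1 = Server("web-1", "192.0.2.10", healthy=False)  # simulated failure
web2 = Server("web-2", "192.0.2.20")
new_address = remediate("www.example.com", web1, web2, dns_table)
```

In a real deployment the check would run on a schedule and the dictionary update would be a call to a DNS provider, but the shape is the same: detect, switch over, and only involve a person when automation runs out of options.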
Let’s get it straight: IT is different
So what’s the problem? A core misunderstanding of how IT does or even can operate. Business owners often hold the irrational expectation that properly implemented IT can be built in such a way that it never fails. They then think anyone who talks about failing fast is embracing failure, seeking a scapegoat, or point-blank not doing their job. Business owners demand their IT teams design a web server that can’t fail. IT practitioners know this is impossible. Business owners demand that IT teams design a network that can’t be hacked. IT practitioners know this is impossible.
Most business owners—especially Canadian business owners, who tend to operate relatively small organizations—aren’t used to running up against the fundamental laws of physics or economics. Our business leaders are used to being able to executive-order things into occurring. Building that skyscraper in six months isn’t impossible, and, in an emergency, staff can be motivated to make it happen.
But IT is different. It operates on the edge of human knowledge and has to deal with extremes. Some things simply can’t be done. Consider security: An attacker only has to succeed once in order to compromise a network. They have the leisure of trying an unlimited number of times, often in automated fashion. A defender only has to make one mistake, and the attacker wins. Because the defender can only deploy software and hardware that currently exists on the market—and all those products have unknown flaws—attackers will eventually win, no matter how hard the defender tries.
Taking accompanied leaps
Failing fast would have the defender detect breaches and respond to them as they occur. This is rational and pragmatic. It also offers a methodology for engaging in change. If the changes made are small and incremental, their impacts can be detected quickly and the changes either accepted or quickly discarded.
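The accept-or-discard approach to small changes can be sketched as follows. The function name `apply_change`, the settings dictionary, and the health rule are hypothetical stand-ins chosen for illustration. A real system would measure error rates or latency after a deployment rather than inspect a dictionary.

```python
import copy
from typing import Callable, Dict, Tuple

Settings = Dict[str, int]


def apply_change(
    current: Settings,
    change: Settings,
    is_healthy: Callable[[Settings], bool],
) -> Tuple[Settings, bool]:
    """Try one small change; keep it if the check passes, otherwise discard it."""
    candidate = copy.deepcopy(current)
    candidate.update(change)
    if is_healthy(candidate):
        return candidate, True  # change accepted
    return current, False  # change discarded, previous settings kept


def healthy(settings: Settings) -> bool:
    """Hypothetical health rule: at least one worker and a positive timeout."""
    return settings["workers"] >= 1 and settings["timeout_s"] > 0


settings = {"workers": 4, "timeout_s": 30}
settings, accepted_first = apply_change(settings, {"workers": 8}, healthy)
settings, accepted_second = apply_change(settings, {"timeout_s": 0}, healthy)
```

Because each change touches only one setting, a rejected change identifies its own cause, and the system never leaves its last known-good state.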
Failing fast is an important concept in IT and even has a role to play in other areas of business. But it has to be accompanied by the processes, procedures, and automation that make sure businesses don’t suffer undue outages. In our increasingly tech-dependent world, our hands are tied when it comes to alternatives.