michael werneburg

the economics of IT systems reliability

the problem

Business leaders understand that we live in an uncertain world. Matters outside their control can affect their ability to meet their objectives. While few businesses are defined by their IT systems, many today are delivered through technology. Whether they are external systems like cloud platforms or internal legacy systems, all IT systems are prone to variable reliability: they are occasionally slow or out of service, sometimes marred by unexpected impacts from system changes, and of course vulnerable to major disasters.

Making an investment in reliable systems can be difficult if the leadership can't put a price on the value of that reliability. Reliability is an unvoiced expectation rather than a voiced one: when business leaders ask technologists to provide a service, they won't specify the target reliability, and might not even ask about it. They assume that reliability is built into the solution they have bought or built.

Technologists know that this isn't so. Today's systems have many hidden complexities. They are expected to do more than ever, to do it faster than ever, to interoperate with more counter-party systems, and to be more available than in the past.

Look at web-based solutions. In the early nineties, call centers operated mostly inside the 9-to-5 business day, there were few corporate websites, and any websites that existed might be little more than a brochure. But the network was established, and between email and databases and electronic forms, business was finally digitizing. By the late nineties, having a website was a given, and its presence introduced pressure for communication outside of normal business hours. The website was visible world-wide and was always on; potential clients expected to get a response from someone with that sort of presence, no matter their own hours. We built more and more interactive functionality into websites, and they stopped being brochures and instead became applications in and of themselves. By the middle of the next decade, mobile platforms had transformed into application-driven computing platforms, and business was always-on and could be entirely digital.

While we may take the complexities that deliver all this functionality for granted, we see with every cloud outage, every ransomware attack, and every flubbed software update how precarious these things can be.

Provisioning reliability is expensive, with cost increasing in a non-linear fashion with each incremental improvement to reliability. For instance, clustered systems and support teams in two geographies are more expensive than a simple data replication strategy. The former offers very quick recovery, while the latter offers a lengthy and risky recovery effort. In exchange, the former may cost a huge multiple of the cost of a standalone system without reliability infrastructure.

When reliability is assumed and its costs prohibitive, how do we demonstrate the value of reliability?

comparing cost curves

I've found that two factors are required.

First, you must compare the cost of business interruption with the cost of reliability efforts. Below I show two non-linear curves: the cost of not doing business either climbs immediately from the onset of the outage (pink curve) or starts slow and then builds (blue curve). The truth for any given business may lie in between; it may be linear, or it may even be stepped in nature.

two curves showing cost of business interruption
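
To make the two shapes concrete, here is a minimal Python sketch. The functional forms (a saturating exponential for the steep curve, a shifted logistic for the gradual one), the function names, and every number are assumptions chosen purely for illustration; they are not figures from any real business.

```python
import math

# All names and numbers below are illustrative assumptions.
CEILING = 1_000_000.0  # hypothetical total loss once the damage is done


def steep_cost(hours, tau=12.0):
    """Steep (pink) shape: losses climb immediately, then level off."""
    return CEILING * (1.0 - math.exp(-hours / tau))


def gradual_cost(hours, midpoint=72.0, width=12.0):
    """Gradual (blue) shape: losses start slow, then build."""
    raw = 1.0 / (1.0 + math.exp(-(hours - midpoint) / width))
    at_zero = 1.0 / (1.0 + math.exp(midpoint / width))
    # Shift and rescale so the curve is 0 at the onset and approaches CEILING.
    return CEILING * (raw - at_zero) / (1.0 - at_zero)


# Compare the two shapes at a few outage durations.
for hours in (1, 8, 24, 72, 168):
    print(f"{hours:>4} h  steep={steep_cost(hours):>12,.0f}  "
          f"gradual={gradual_cost(hours):>12,.0f}")
```

A linear or stepped cost, as mentioned above, could be swapped in by replacing either function; the point is only that the shape of the curve, not just its ceiling, differs from one business to the next.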

example cost curves

A company with a gradual curve for cost of business interruption might be a business-to-business financial firm. If it can't do business for a day, it loses its expected revenues for that day, but suffers only modest loss in terms of reputation with its clients. It is unlikely to attract a lot of overhead in terms of regulatory reporting or fines. It almost certainly won't have problems funding its business operations or investments or cash outflows. But as hours turn into days, more and more counter-parties will become aware of the outage, and the company will start to suffer those secondary problems. And if it should go longer, the loss of credibility can be a long-term problem. (The same can happen if outages are smaller but frequent.)

Eventually, of course, if the company can't conduct its affairs it will go out of business.

A company with a steep curve has immediate impact from its loss of ability to function, but then suffers less impact as the days drag on: the damage is largely done. This was the fate of many retail and hospitality businesses during the pandemic. But to go back to the example above, it's also possible for our financial firm to have this sort of impact from an outage. If the loss of reputation and the impact from regulatory penalties are felt early in an outage, it could be that several days or even some weeks of outage worsen the problem but only incrementally.

pricing an outage

It may not be possible for business leaders to put a price on these many costs of not doing business, but I've found that even walking through the possibilities tends to focus attention on the nature of the problem. It also tends to drive people to be very risk-averse. For the discussion below, I'll choose the steep (pink) curve to depict the cost of business interruption.

The cost of reliability curve starts small but climbs rapidly. Because the outage curve is based on an event of unknown likelihood, a factor must be applied to represent that likelihood. This factor is also non-linear, but it can be assumed to be folded into the cost of business interruption curve. At some point, be it hours or days or weeks, these two curves intersect. This intersection indicates the optimal expenditure.

cost of business interruption versus cost of remediation
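
The crossing point can be found numerically. The sketch below is one hedged way to do it in Python, assuming both curves are plotted against the same time axis in hours. The curve shapes, the constant likelihood factor (the text notes the real factor is non-linear), the function names, and all numbers are hypothetical stand-ins, not the author's model.

```python
import math

# Every name and number here is an illustrative assumption.


def interruption_cost(hours, ceiling=1_000_000.0, tau=12.0):
    """Steep curve: cost of not doing business, front-loaded."""
    return ceiling * (1.0 - math.exp(-hours / tau))


def expected_interruption_cost(hours, likelihood=0.05):
    """Interruption cost weighted by an assumed, constant likelihood factor."""
    return likelihood * interruption_cost(hours)


def reliability_cost(hours, base=500.0, growth=0.05):
    """Cost of reliability: starts small but climbs rapidly."""
    return base * (math.exp(growth * hours) - 1.0)


def crossing(lo, hi, tol=1e-6):
    """Bisection search for the hour at which the two curves intersect."""
    def gap(h):
        return expected_interruption_cost(h) - reliability_cost(h)

    if gap(lo) * gap(hi) > 0:
        raise ValueError("the curves do not cross inside this interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gap(lo) * gap(mid) <= 0:
            hi = mid  # the sign change lies in the lower half
        else:
            lo = mid  # the sign change lies in the upper half
    return (lo + hi) / 2.0


# Search from one hour to thirty days; both curves are zero at hour 0,
# so the search starts just after the onset to skip that trivial crossing.
hours = crossing(1.0, 720.0)
print(f"curves intersect at about {hours:.1f} hours, "
      f"at a cost of about {reliability_cost(hours):,.0f}")
```

Bisection is used here only because it is simple and needs nothing beyond the standard library; any root-finding method would serve once real curves have been estimated.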