Michael Tsai on Southwest Airlines and Technical Debt

This is a really interesting take on how technical debt directly contributed to Southwest’s terrible week.

From Zeynep Tufekci’s article in the NYT:

It’s been an open secret within Southwest for some time, and a shameful one, that the company desperately needed to modernize its scheduling systems. Software shortcomings contributed to previous, smaller-scale meltdowns, and Southwest unions had repeatedly warned about the software. Without more government regulation and oversight and greater accountability, we may see more fiascos like this one, which most likely stranded hundreds of thousands of Southwest passengers — perhaps more than a million — over Christmas week. And not just for a single company, as the problem is widespread across many industries.

This problem — relying on older or deficient software that needs updating — is known as incurring technical debt, meaning there is a gap between what the software needs to be and what it is. While aging code is a common cause of technical debt in older companies — such as with airlines, which started automating early — it can also be found in newer systems, because software can be written in a rapid and shoddy way, rather than in a more resilient manner that makes it more dependable and easier to fix or expand. As you might expect, quicker is cheaper.

I think what’s so insidious about technical debt is that there’s not always an obvious, clear measurement showing that you’ve taken on too much. Every software system has some amount of technical debt, but it’s a challenge, even for experienced software engineers, to gauge the level of debt. It’s not like the physical world, where you can see if something is crumbling or has some other type of defect.

In some cases, software’s technical debt may be obvious, for example if it’s running on hardware or software that is no longer available or updated by its developers. COBOL running on a mainframe is a good example.

However, as Tufekci’s article states, even newer software can suffer from technical debt. Conversely, it is possible for old software to be well maintained and have an acceptable amount of technical debt. For example, many modern operating systems (such as Linux) are decades old yet remain well maintained and cutting edge.

It’s likely that Southwest runs on programming languages and operating systems that are actively updated, and that the debt lies in the software that they’ve written themselves and that runs their business. This is probably the situation that a lot of companies in the 10-20 year old range are in, and Southwest would not be unique in having technical debt that has gone out of control.

It’s an industry problem. We lack the tools and techniques to gauge tech debt in quantifiable ways. How much technical debt is too much, and how much impact it may have on a business, boils down to opinion. Different engineers will disagree with each other on this topic. There are no standard industry benchmarks or regulations to turn to, unlike in other fields of engineering.

Perhaps our field will develop such standards, but my hunch is that software is different enough from other engineering disciplines that we may never be able to. I’d like to be wrong about this.

Does anyone know of productive cases of measuring tech debt? Or systematic ways to keep it within a maintainable limit?