The Software Engineering Game

CRUFT: An alternative to the Technical Debt metaphor

Technical Debt is a term that means different things to different people. Ward Cunningham created the metaphor as a way to explain to his non-technical boss why the team was refactoring their code. But get six programmers in a room and ask them what it means, and you'll get a dozen definitions. The phrase can mean just about anything, from useless features, to bugs, to unnecessary code. So, practically, it means "code I don't like".

We, as programmers, can be more precise when we discuss code. For the last few years, whenever I've found myself reaching for this metaphor, I've tried to think a little harder about what I'm actually trying to say. We don't need a financial metaphor intended for non-technical people to discuss the tradeoffs we make in software. Instead, I'd suggest we use this:

CRUFT

CRUFT represents five dimensions of technical debt:

  • Complexity
  • Risks
  • Uses
  • Feedback
  • Team

Each of these dimensions can be measured, often using simple and objective metrics that are familiar to anyone who has worked in software.

Complexity

Complexity makes software hard to change

Complexity comes in many forms, but "lines of code" is a pretty reasonable measurement. This is especially true if you use automated formatting and linting tools to keep your code consistent. The more code we have, the more there is to read, test, document, deploy, etc. This increases the cost of change. The less code we have, the better.
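
If you want to put a number on it, a short script will do. Here's a minimal sketch in Python; the "src" directory and the ".py" suffix are placeholders for your own layout:

    from pathlib import Path

    def count_lines(root: str, suffix: str = ".py") -> int:
        """Count non-blank lines in every file under root with the given suffix."""
        total = 0
        for path in Path(root).rglob(f"*{suffix}"):
            text = path.read_text(errors="ignore")
            total += sum(1 for line in text.splitlines() if line.strip())
        return total

    print(count_lines("src"))  # placeholder source directory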

Here's a thought experiment: Let's say that you have a couple of programmers, Alice and Bob, working on the same project. By accident, both of them try to solve the same problem on the same day. Bob sits down and writes 1000 lines of code that solve the problem beautifully. The code is well written, well tested, and its deployment and operation are well documented. Alice heads off to the park, where she thinks about the problem and feeds the pigeons. At the end of the day, Alice wanders back to the office, deletes 3 lines of code...and the problem is fixed. Which of these solutions is better?

If you want to be more nuanced, you can separate complexity into necessary and unnecessary complexity. Complex problems sometimes require complex solutions, but oftentimes we make our solutions more complex than they need to be. Or sometimes the problem changes, and solutions that were previously complex become simple. This is where refactoring can help. But all complexity is bad, even the necessary kind. The more of it you have, the harder it will be to make changes.

Risks

Risks represent unwanted behavior

Risks fall into two categories, known and unknown (and perhaps a third we could call Rumsfeldian). To measure your known risks, look in your issue backlog. You probably have some categorization that makes sense here (e.g. "bug", "incident"). I try to close issues for risks that I don't expect to see again, knowing that I can re-open them if I'm wrong about that. That way, the open "incident" issues represent risks that we are living with but, for one reason or another, don't want to address right now. For example, let's say we had an outage because the disk on one of our servers filled up with logs. The immediate remediation of that risk was to delete the old logs, but until we set up something to automatically rotate the log files, we still have this risk.
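
If your backlog lives in GitHub, counting those open risks is a single API call. Here's a rough sketch using GitHub's REST API; the repository name and the "incident" label are assumptions you'd swap for your own:

    import requests

    def open_risk_count(owner: str, repo: str, label: str = "incident") -> int:
        """Count open issues carrying the given risk label via GitHub's REST API."""
        url = f"https://api.github.com/repos/{owner}/{repo}/issues"
        resp = requests.get(url, params={"labels": label, "state": "open", "per_page": 100})
        resp.raise_for_status()
        # The issues endpoint also returns pull requests, so filter those out.
        # A real version would also follow pagination for more than 100 results.
        return sum(1 for issue in resp.json() if "pull_request" not in issue)

    print(open_risk_count("example-org", "example-repo"))  # hypothetical repository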

Unknown risks are harder to measure, but thinking about their potential impact more precisely can give you more flexibility in how to manage them. I like to think about the Net Present Cost of a risk: how much is this going to cost me (in time or money), when, and with what probability? These are "what if" scenarios that are important to consider...but it's equally important to avoid analysis paralysis. A ship is safe in harbor, but that's not what ships are built for. Any venture requires risk, and one reason we build feedback into our software is to help protect us against risks.
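
To make that concrete, here's one way you might sketch Net Present Cost as a function. The 5% discount rate and the example numbers are illustrative assumptions, not recommendations:

    def net_present_cost(probability: float, cost: float, years: float,
                         discount_rate: float = 0.05) -> float:
        """Expected cost of a risk, discounted back to today's terms.

        probability    chance the risk materializes (0.0 to 1.0)
        cost           what it costs if it does (time or money)
        years          how far in the future we expect it to land
        discount_rate  assumed annual discount rate
        """
        return probability * cost / (1 + discount_rate) ** years

    # A 30% chance of a $50,000 outage two years out is worth about $13,600 today:
    print(net_present_cost(0.3, 50_000, 2))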

When I'm reviewing a pull request, I can frame my feedback in terms of the known and possible unknown risks. I will occasionally add new issues to the backlog when approving a PR to track the new risks that we're taking on when it gets merged. I can do that because I'm the one who will bear the brunt of the cost if that risk turns against us. I will get paged in the middle of the night. I will have to answer to my boss if something goes wrong. Separating the assessment and repercussions of risks leads to imbalance and poor decision making, so I try to make sure those responsibilities stay with the same people.

Uses

Uses represent wanted behavior

This is why we build software in the first place: It does things that hopefully create value (usually $$$). However, uses aren't necessarily valuable. Sometimes, we create uses that are valuable, but that value fades over time. Sometimes, we're just wrong about what's valuable, and what we create doesn't result in the value we hoped for. In these situations, we can consider removing some uses to reduce complexity. This is a common definition of technical debt, one in which a system has accumulated a lot of functionality over time, but some of it is no longer used. The added complexity makes new uses harder to add, and so the system is unable to adapt to new requirements because it's spending too much of its "complexity budget" meeting old requirements that are no longer needed.

I think the best way to measure uses is to look at your passing tests. Each automated test in your system represents one thing that your software can do. Again, whether these behaviors are valuable is another question. Often, to see the value of the system, you can't look at the behavior in isolation...you have to look at the interactions and workflows that they enable. But if you want to measure uses (perhaps to track the change over time), just counting your passing tests is a pretty good way to do it.
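
Here's a crude sketch of that count, assuming pytest-style naming and a "tests" directory; your test runner's own collection step (e.g. pytest's --collect-only) will be more accurate:

    import re
    from pathlib import Path

    TEST_DEF = re.compile(r"^\s*def test_\w+", re.MULTILINE)

    def count_tests(root: str = "tests") -> int:
        """Roughly count test functions by scanning for pytest-style names."""
        return sum(len(TEST_DEF.findall(p.read_text(errors="ignore")))
                   for p in Path(root).rglob("test_*.py"))

    print(count_tests())  # assumes pytest conventions and a "tests" directory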

Feedback

Feedback represents how fast we can learn

Feedback is essential, and fast feedback allows us to make progress quickly. Oftentimes we need to go through a certain number of iteration cycles to achieve a result, so the total time to finish the task is a function of how long each cycle takes. How many times have you been trying to diagnose a problem with an automated build, committing and pushing changes over and over again, hoping that this time the problem will be fixed? If you need 10 cycles to fix the problem, and your build takes an hour, then this problem will take all day to fix. If the build takes a minute, you can fix it while you wait for coffee to brew.

Feedback takes many, many forms, but the three that I think about most are Observability, Automated Testing, and Value Discovery. Each of these has its own sub-topics, and I won't go into them here, other than to provide some basic definitions of what I mean:

  • Observability - Logging, metrics, tracing, latency. How do you know what running software is doing?
  • Automated Testing - Unit/integration/acceptance tests, CI/CD pipelines. How do you know if a change to your software will have the intended effects?
  • Value Discovery - A/B testing, retrospectives, customer satisfaction, usage metrics. How do you know which uses are valuable?

How you measure feedback depends on what it is, but all feedback can be thought of as information over time. When something happens, how quickly will you know about it? For example, one thing I keep a close eye on is automated test cycle time. I like this a lot because it's easy to measure, objective, and actionable. When a suite of unit tests gets too slow (more than a few seconds), it means the programmers working on that code can no longer stay in the flow because the cycle time has gotten too long. That's an indication to me that we've got too much complexity in a single repository, and we need to take steps to split it up.
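
Tracking this can be as simple as timing the suite and appending the result to a file. A minimal sketch, assuming a pytest-based suite:

    import subprocess
    import time
    from datetime import datetime, timezone

    def record_cycle_time(logfile: str = "test_times.csv") -> float:
        """Run the test suite and append (timestamp, seconds) so the trend is visible."""
        start = time.monotonic()
        subprocess.run(["pytest", "-q"])  # assumes a pytest-based suite
        elapsed = time.monotonic() - start
        with open(logfile, "a") as f:
            f.write(f"{datetime.now(timezone.utc).isoformat()},{elapsed:.2f}\n")
        return elapsed

    print(f"cycle time: {record_cycle_time():.1f}s")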

Team

The team represents the people who are capable of supporting the software

Software engineering teams can only be so big while remaining effective. Amazon's two-pizza teams are an example of this harsh reality. We can't simply add more people to a software effort and expect all of them to be able to work together effectively.

However, our ability to manage complexity, mitigate risks, create new uses, and interpret and respond to feedback will be a function of the number of software engineers who are capable of doing those things for a particular software system. This is sometimes different from the number of people who are organized into a "team". Different factors at different companies might create team structures where only a small percentage of a "team" is actually capable of performing these tasks.

One way to separate these two groups is to ask which people could leave and make supporting the software impossible (aka the Bus Factor). The size of this set of people allows us to quantify the support that we can provide. In my experience, 3-5 people is optimal. 2 is lean but can be effective. With 6 to 8 people, you start to spend a lot of time on communication overhead; if you don't, the support starts to fracture into smaller groups, where only certain people can do certain things. Personally, I've never seen 9 or more people all maintain the same software effectively.
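
Commit history gives you a rough proxy for this set of people. Here's a sketch that counts recent commit authors via git log; committing isn't the same as being able to support the system, so treat the output as a conversation starter:

    import subprocess
    from collections import Counter

    def recent_committers(since: str = "1 year ago") -> Counter:
        """Approximate who can support the code by counting recent commit authors."""
        log = subprocess.run(
            ["git", "log", f"--since={since}", "--format=%ae"],
            capture_output=True, text=True, check=True,
        ).stdout
        return Counter(line for line in log.splitlines() if line)

    for author, commits in recent_committers().most_common():
        print(f"{commits:5d}  {author}")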

When you only have 1 person who can maintain a particular piece of software, you introduce "key person risk", which acts as a multiplier on all your other risks. If that person departs, the number of capable supporters drops to zero and you wind up with abandonware. That is a difficult situation to recover from.

Minimizing CRUFT

Given these five dimensions, you can treat technical debt as a portfolio optimization problem: increase feedback and team support, minimize complexity and risks, and maximize the value derived from uses. Once you start thinking of technical debt in this way, you can start making tradeoffs in design discussions, pull request reviews, and everyday collaboration with your fellow programmers. Here are some common tradeoffs (a small scorecard sketch follows the list):

  • Increasing complexity to add new uses
  • Increasing risks to add uses (implementing only the "happy path" to get something done quickly, without handling edge cases)
  • Decreasing complexity to increase team support
  • Decreasing uses to reduce risks and/or complexity
  • Increasing feedback to reduce risk
  • Increasing uses to increase feedback (through more user activity)
  • Increasing the team size to reduce key person risk
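
Here's the hypothetical scorecard mentioned above, pulling the earlier metrics into one place. The fields and thresholds are assumptions for illustration, not rules:

    from dataclasses import dataclass

    @dataclass
    class CruftScore:
        """A hypothetical scorecard for one system, using the metrics above."""
        lines_of_code: int          # Complexity: lower is better
        open_incidents: int         # Risks: known risks we're living with
        passing_tests: int          # Uses: behaviors the system provides
        test_cycle_seconds: float   # Feedback: lower is better
        supporters: int             # Team: people who can support it

        def flags(self) -> list[str]:
            """Surface tradeoffs worth discussing; the thresholds are assumptions."""
            warnings = []
            if self.test_cycle_seconds > 10:
                warnings.append("slow feedback: consider splitting the code up")
            if self.supporters < 2:
                warnings.append("key person risk: grow the set of supporters")
            if self.lines_of_code > 200 * max(self.passing_tests, 1):
                warnings.append("high complexity per use: look for code to delete")
            return warnings

    print(CruftScore(120_000, 4, 350, 45.0, 1).flags())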
