Software Engineering

How To Risk-Manage Your Way Through Almost Anything

Or, why does everything always suck so bad?

One of the most frequent (and most serious) frustrations I encounter among direct reports and peers is expressed as some variation of “why don’t we ever have time to do things right?” What makes this ironic is that I don’t think any executive ever got up one day and decided they wanted their company to build a steaming pile. Ok – there is considerable evidence to the contrary, but work with me. Call me crazy, but when leaders set a goal for their organization, I’m betting they pretty much just assume that it will be executed according to best practices (there is a subtle twist here that we will get to). To state it more concretely – I think they actually believe the code should work. Silly of them, I know, but there you are. Furthermore, I think they actually believe it should be possible to improve the capabilities of the software over time in such a way that every change doesn’t place the enterprise in existential peril. They might even think that there is no reason for them to say these things explicitly because they cannot imagine why anyone would do it any other way. They might, heaven forfend, think it is your job. I can hear the “but”. Hang in there with me a little longer. I will address where the leadership team needs to do more. But first, we need to look at the problems closer to home.

We need to properly locate the source for a big part of the problem – us. This is the “us” that is not “them”. If you cannot get past this, the rest probably isn’t going to do much good. We start the process by identifying what is under our control and what we must do about it. What I want to propose is not a fool-proof method. It is a framework for thinking about a whole class of vexing problems in terms of risk management. It’s a pretty good lever. I wish it were easy. It is not, as you will see. But it’s not particularly complicated, and at least you can be doing the correct hard thing.

Setting Up The Problem

There is a frequent scenario where a stakeholder wants something bad – and they are willing to get it bad if they can get it soon. In their calculus (right or wrong) almost anything next month is critically more important than something good six months from now. This doesn’t have to be a problem. But it almost always ends up being a problem. Why is this so? First, coders don’t like building lashups. Hah! What a joke. Coders love building lashups. It’s fun. But they know the limitations of those lashups, and they know success will bring expectations for said lashup that it can in no way, shape, or form hope to meet. Here’s the kicker. Even if everyone claims to understand these limitations, they tend to be forgotten. What unfolds is a classic failure to identify risk and deal with it cleanly and transparently. And the first line of defense is with the person most proximate to the risk.

If you were to ask people why they built something that didn’t fulfill basic requirements that all production software should meet, you would invariably get a response in some way asserting that “we tried but weren’t allowed to”. I am not saying this does not, in fact, happen sometimes. But I want to deflate this bubble, not completely, but significantly. I’m setting aside egregious failures of good faith for now, not because they don’t happen, but because I think the real McCoy is fairly rare. The more usual form of “soft” bad faith is more relevant, and I think we often tacitly encourage it. Let’s imagine a situation in which an organization has decided that it needs something, and it needs it badly. The setup has the classic hallmarks of a failure – too much, too soon, with too little. For most of us, this does not require imagination – just memory. What are some of the ways in which we respond to this?

  • Inexperience – we don’t know how. We see the apparent futility of the situation but do not know what technical solutions to offer. A variation is that we have a pretty good understanding of how to avoid a trainwreck at a technical level, but we don’t know how to effect a change of direction in our stakeholders. In either case, it all seems above our pay grade. We are tempted to feel it is best to do what we are told and start looking around for the exit – a fairly common response that explains a lot about our industry.
  • Disinterest – it is ok with us if they want it bad and get it bad. This could be because we have become jaded – tragically, all too common. Or it could be that we just do not like the often-boring work required to produce quality software. Building bad software is relatively fun and easy (in the early stages). And we have been handed the perfect cover: “We were told to do it this way.” Probably the most frequent example of this is around automated testing (or any testing for that matter). When faced with a schedule constraint, this programmer will ask, “Does that include automated tests?” I have almost given up hope that what I am hearing is an honest inquiry. Rather, what I am almost invariably hearing is the eager anticipation of a programmer who knows they are about to get out of doing something they did not want to do anyway. Full points for asking the question – that was the beginning of the right conversation to have – but it was a feint. You can almost hear the sigh of relief when the answer comes back, “We will have to come back and do that later.” I could be even more harsh here, and I hope it will become clear why.
  • Disagreement – everyone says they are interested in building quality software, but we do not agree about what that looks like. This is perhaps the messiest category, and more difficult to tackle from a risk management standpoint. The reason is simply that there really can be legitimate differences of opinion about what quality and best practice look like. When a champion arises with a strategy that seems to align with the urgent need of the moment, and you believe that strategy is fraught with peril, it can feel almost impossible to defuse the bomb. And what is worse – you might be wrong, and they might be right.

We can apply a risk management approach to each of these and see how things could be different. First, what do I mean by risk management? Typically, people think about things like identifying all the material risks you can think of, the probability they would occur, the cost of the risks if realized, doing some math to model various “what if” scenarios, and putting it all into a big document. All of that is fine, and sometimes useful, but not where I am going. What I’m envisioning is something more foundational, and it has to do with how we think about and respond to risk. We need to start by identifying the actors, and there are only two: Stakeholders and Experts. Stakeholder is a term we use a lot and probably think we understand. But in this context, it only means one thing. It is the person who has the legitimate responsibility to decide whether to consume a risk. Consuming a risk means accepting responsibility for the consequences if the risk is realized when you are legitimately in a position to bear those consequences (it is your money to lose). Expert in this context is simply the one who has line of sight awareness of the risk. It is not an absolute claim of expertise in a domain – it may simply be a claim to some information about the risk. The role of the Expert is to discharge what they know to the Stakeholder – full stop. It is tempting to say, “it really is that simple.” If it were not for all the weird and wonderful ways these two roles can interact it would be that simple. But the fact that there can be a great deal of complexity in how this works in practice does not negate the value of the essential insight about the nature of the problem we are trying to solve. At the end of the day, we are looking to have a clear handoff between people who have (or can get) information and people who are in a legitimate position to make decisions based on that information. And what we will find is that a great deal of the dysfunction and frustration we face is due to a failure to identify who is in what role and to execute the responsibilities of those roles. We obviously want to be successful in whatever endeavor we are engaged in – that is right and proper. But even if we fail, a proper handling of risk can be the difference between a “clean” failure and a “bitter” failure. A clean failure is one where we give it our best shot and it does not work out. A bitter failure is one where we feel betrayed or let down.
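As an aside, the “big document” style of risk management mentioned above can be sketched in a few lines. Here is a minimal, purely illustrative toy – every entry and number is invented – useful mainly as a contrast to the role-based framing this piece is after:

```python
# A toy "classic" risk register: probability x impact, sorted by expected cost.
# Every entry and number here is hypothetical, purely for illustration.

risks = [
    # (description, probability of occurring, cost if realized)
    ("Vendor API deprecated mid-project",  0.30,   250_000),
    ("Schema migration corrupts records",  0.05, 1_000_000),
    ("Key engineer leaves before handoff", 0.15,   400_000),
]

# Rank by expected cost (probability x impact), highest first.
for desc, prob, cost in sorted(risks, key=lambda r: r[1] * r[2], reverse=True):
    print(f"{desc}: expected cost ${prob * cost:,.0f}")
```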

A Deeper Dive

Inexperience

I am primarily focused on inexperience in the Expert role. A great deal has been written about skill development of one sort or another. But on any given day you know the things you know, and, critically, you do not know the things you do not know. But for risk management purposes the critical thing (and sometimes the hardest thing) is to be explicit about what you do not know. By conveying this to the Stakeholder, far from letting them down, you are giving them the opportunity to evaluate their options. By failing to do so you are setting them up to be blindsided.

This is fraught with difficulty because it may seem like an admission of incompetence – and it may even be. But not knowing something is not usually the same thing as incompetence. Everyone is someplace in their career and has seen what they have seen. The idea that someone “should” know something is in the realm of the imponderables. For the topic at hand, it is essential to understand that not knowing something is not nearly as dangerous or irresponsible as failing to assert that you do not know something in a timely fashion. We are not concerned here with the risks you do not even realize exist because you are too inexperienced to perceive them (this falls more under the treatment of “disagreement”). We are concerned about areas in which you understand that you are inexperienced and do not have the depth of knowledge to speak to a particular risk. As an aside, you can frame this in a way that makes it clear you are prepared to do everything you can to develop a point of view that may be more useful to the Stakeholder. But you are letting them know that this may not be possible in a timely way.

How does this help in the situation where you have technical insight and line of sight on risks, but little experience in how to deal with a leadership team that does not seem to acknowledge the risk? The good news is that this is not primarily an exercise in how to be persuasive. It is not about winning a technical or a business argument. It is about identifying the Stakeholder, articulating the risk as you understand it, and placing that information squarely in the Stakeholder’s court to explicitly address. The goal is to force an explicit decision about what to do with the risk. It may not be the decision you believe is correct, but it needs to be an explicit decision. You should not be pugnacious or arrogant about it. It should be obvious to everyone that your concern is genuinely for the success of the endeavor (and this needs to actually be true), and that you are prepared to salute and abide by the decision made (with few exceptions). But you are insisting that a decision be made by the one who is in a position to consume the risk. This also requires courage because it is often not popular. Stakeholders, unfortunately, would often prefer to abdicate this responsibility – they may hope that you do not want to grow a backbone any more than they do and will quietly let you carry the burden of the risk. While there are times when the juice is not worth the squeeze (pick your battles), broadly speaking, failure in this makes you an enabler of organizational pathologies.

It should be clear by now that the most important currencies for this transaction are courage and integrity. The good news is that those attributes are no less available to people with little experience, and no more available to people with a great deal of experience. What seems like a tangle of hard problems becomes less problematic when you are careful not to try shouldering problems you are in no position to resolve.

Disinterest

The important thing to understand is that you are subsidizing a “soft” bad faith transaction because you are allowing conditions to unfold that keep the right conversations from occurring. You are doing so under the cover that your hands are tied. If you are quietly abandoning best practice to meet a project constraint, the first question I would ask is just how much commitment you have to best practice. I would assert that abandoning best practice is not on the table. It should be baked into your estimates, not an optional extra that can be negotiated. The risk discussion with the Stakeholder should not be about abandoning best practice, but about decoupling external constraints, opportunities for scope reduction, phasing, and resource allocation. You may not feel that these mitigations are on the table. They are always on the table. Stakeholders are usually very familiar with how to manage puts and takes of this kind. Your job is to give them information about where seams exist in the work breakdown that can be exploited either in terms of scope, schedule, or parallel effort. Your job is also to not hand them a free lunch that is not free. There is a legitimate discussion about when to introduce tech debt, but as we will discuss shortly, in most cases you cannot afford tech debt. The essence of the failure here is one of integrity. The solution is a frank disclosure of risk as early as possible with a Stakeholder who is in a position to consume that risk, and to work with them to find options that maintain the integrity of the deliverable.

Disagreement

Courage – check. Integrity – check. Experience – check. This is where things get a lot more complicated. What if we disagree about what quality looks like? If you disagree with the Stakeholder about what quality looks like, and they are not abdicating but are stepping up to consume the risk, well… count your blessings – there are worse places to be. This is a persuasion problem, not a risk management problem. There is still a strong risk management angle involved that can be used in the persuading. You have the same fundamental responsibility to articulate the risk you have line of sight on. It is the Stakeholder’s responsibility to articulate what level of quality is appropriate to meet their objectives. In the happiest of worlds this is merely a discussion about the relative merits of two technical approaches. You may feel passionately about it, but at some point, the Stakeholder has the legitimate responsibility to choose, and you have the legitimate responsibility to salute and do your level best to keep the risks you foresee from compromising the effort. If the disagreement is with peers, it’s less cut and dried, but articulating risks that competing approaches would need to address is still a good framing for discussions like this. It gives you an objective good to champion, so you can focus on principles and be flexible about method. Again – these are great problems to have if this is where the discussion is centered. But very frequently there will be (or should be) an exploration of what level of technical debt will be acceptable. This is so pervasive in all the scenarios I’ve mentioned that I think it deserves a treatment of its own.

Technical Debt

The question on the table is, when is it appropriate to take on technical debt? The first point I would make is that you should not underestimate the impact of merely having the conversation. One of the reasons I value a risk management framework is that it places a premium on the disinfecting benefits of sunshine. The presence of technical debt resulting from a discussion and decision always beats technical debt that just oozes in through the cracks.

From a risk management perspective, accepting technical debt usually comes down to another question: when will the technical debt be extinguished? The simple reality is that technical debt, like financial debt, can be a huge lever to resolve an otherwise over-constrained problem space. But unlike financial debt, technical debt is frequently opaque to the Stakeholder. The primary responsibility of the Expert is to expose it to a degree that allows informed decision making. One of the most difficult problems in engineering is to articulate the cost of decisions – in particular the costs that will only be realized sometime after the initial development of the software (the tech debt equivalent of balloon payments). I have often thought that if engineers could project the true cost of technical debt in financial terms, we would find ourselves with the tables being turned and engineers would be explicitly forbidden from introducing tech debt (which would create another set of problems). In one sense this problem is simple. It is your job as the Expert to provide an understanding of the costs to the Stakeholder. We do that all the time when it comes to upfront costs. And historically we suck pretty bad at it. But there are many tools available to you and this is not a discussion about estimating – at least not that kind. It is the submerged costs that we have a responsibility to arm the Stakeholder to grapple with. Fortunately, while a quantitative way of capturing that can be daunting, a qualitative approach is usually good enough. As an aside, the difficulty of determining the future cost of tech debt quantitatively is one of the reasons we need to be far more skeptical of taking it on than we usually are. We aren’t normally in the habit of writing blank checks – except in the technology world.
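To make the balloon-payment analogy concrete, here is a minimal sketch with entirely hypothetical numbers – the point is the shape of the curve, not the specific figures. A shortcut buys a fixed savings up front, while the cost to pay the debt off compounds with every release built on top of it:

```python
# Hypothetical numbers: a shortcut saves 2 weeks today, but adds friction
# that compounds as later work is layered on top of the debt.
savings_weeks = 2.0   # one-time benefit of taking the shortcut
drag_rate     = 0.15  # assumed extra friction per release (compounding)
base_fix      = 2.0   # weeks to do it right today

for release in (1, 4, 8, 12):
    # Cost to extinguish the debt after N more releases have built on it.
    balloon = base_fix * (1 + drag_rate) ** release
    print(f"after {release:2d} releases: ~{balloon:4.1f} weeks to pay off "
          f"(vs. {savings_weeks:.0f} weeks saved up front)")
```

Under these made-up assumptions the “loan” that saved two weeks costs roughly ten weeks to retire a dozen releases later – which is exactly the kind of submerged cost the Stakeholder never sees unless the Expert surfaces it.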

The following assertions should be uncontroversial if you have worked on software engineering projects for any length of time. Whether they are or not – I am asserting them.

  1. Over time the incremental cost of change of software increases unless you actively expend resources to contain it.
  2. The rate of increase is profoundly affected by decisions made early in the life cycle of the software.
  3. Decreasing the incremental cost of change becomes more costly the longer you wait to attempt to do so. At some point the cost to reverse the trend exceeds the cost to replace the software (the sketch after this list illustrates all three assertions).
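These assertions are qualitative, but a toy model – all parameters invented for illustration – shows their shape: the cost of change compounds at a rate fixed by early decisions, and the cost of reversing the trend eventually crosses the cost of replacement:

```python
# Toy model of the three assertions. Every number is invented;
# only the shape of the curves is the point.

REPLACE_COST = 200.0  # flat cost to rewrite the system (assertion 3)

def cost_of_change(release, growth):
    """Assertion 1: incremental cost of change rises over time.
    Assertion 2: 'growth' is fixed by early life-cycle decisions."""
    return 1.0 * (1 + growth) ** release

def remediation_cost(release, growth):
    """Assertion 3: containing the trend costs more the longer you wait.
    Modeled crudely as the drag accumulated over all prior releases."""
    return sum(cost_of_change(r, growth) - 1.0 for r in range(release))

for growth in (0.05, 0.25):  # disciplined vs. debt-heavy early decisions
    crossover = next(r for r in range(1, 500)
                     if remediation_cost(r, growth) > REPLACE_COST)
    print(f"growth {growth:.0%}: remediation exceeds replacement "
          f"around release {crossover}")
```

With these particular numbers, the debt-heavy system crosses the replace-it threshold in roughly a third of the time the disciplined one does – the curves differ in steepness, not in kind.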

At a qualitative level these assertions give us a framework for thinking about the risk being injected into different types of endeavors. Back to the question – is it appropriate to take on technical debt? Consider these scenarios:

  1. An experiment or proof of concept that is not operationally deployed or is effectively quarantined to contain risk. In most cases accepting tech debt is not only acceptable but probably the only responsible and timely approach. But there is a caveat. The Stakeholder should understand that the entire effort is a sunk cost. The cost to operationalize the experiment is as much as, or more than, the cost to scrap it.
  2. A startup effort that has an exit criterion of meaningful revenue generation, but non-scalable operational deployment. Tech debt may be acceptable. If the domain has a low regulatory burden, and Stakeholders are willing to accept that the technology stack will need to be re-built to achieve scalability beyond a certain point, this could be a responsible risk to consume. But also consider – are Stakeholders willing to disclose this to a potential new Stakeholder (say, in an acquisition)? The question almost answers itself. You might object that startups are inherently over-constrained, and it is not realistic to avoid technical debt in this context. But, as an Expert, I would want to understand the exit criteria. And, as a Stakeholder, I would want to understand if what I am getting will withstand the scrutiny I expect in a successful exit. The world does not owe your business endeavor a successful outcome. Assuming good faith as a baseline criterion for success, tech debt should not be taken for granted as a recipe for success in a startup.
  3. Same setup but with scalable operational deployment as an exit criterion. The only mitigating factor is that operational revenue is not currently on the line, only future revenue. Limited kinds of tech debt might be acceptable. Tech debt can impact the cost of enhancing the software in future without necessarily impacting the ability of the system to scale operationally. But these concerns are often entangled, so this is probably wishful thinking. I don’t believe this scenario differs meaningfully from a system in steady-state operation except in how long you have to extinguish the debt. The immediate risk of introducing tech debt is lower but the long-term impact is every bit as serious.
  4. Building and deploying a change to an operational enterprise stack that generates material revenue dependent on high availability. Try to imagine a conversation in which a programmer or team says the following to a Stakeholder: “We are making a change. We do not know if it is safe. We do not know what impact it will have on the system. We do not know if we can roll it back quickly.” It’s hard to imagine this conversation taking place.
    • Alternate reality number 1 – reasonably healthy tech stack: The Stakeholder says, “absolutely not”. Very uncomfortable conversations happen. Schedule commitments probably slip. But the technology stack probably gets incrementally more healthy. It’s hard to imagine how this scenario even emerges in a healthy organization, and you will think, correctly, that it seems contrived. A healthy organization shouldn’t provide the kind of soil where this kind of problem grows.
    • Alternate reality number 2 – unhealthy tech stack: The Stakeholder says, “Sure, go for it!”. If you are like me, your brain just exploded. And. Yet. This. Happens. Or, at least, changes this risky do happen. But in all likelihood, all the way from the inception of the technology stack, this conversation did not happen. Ever. And yet Experts will swear that they were just trying to meet unrealistic constraints and Stakeholders will swear they always asked for best practices to be followed. And they tacitly covered for one another because each was getting what they wanted most in the near term. This should make you squirm with second-hand embarrassment.

Should you find yourself in that last scenario, a reasonable question is, “Is there any safe way forward?” No. You might think that a safe way forward would be to “stop the line”. But there are risks associated with that as well. There is a way forward that gets safer over time. The Stakeholder must accept that they are in a very risky position and be willing and able to commit to a forward path that will not make it worse, will incrementally make it better, and will be expensive – in terms of new investment, lost opportunity, or both. The Experts must be willing and able to commit to explicit disclosure of risks, and a plan to incrementally make it better. Is there enough money to dig out of this hole? Maybe not. The problem with tech debt is that while the upside is almost always very finite, the eventual downside is often nearly unbounded in the context of your venture. I do not consider this to be “too pessimistic”. But I acknowledge that this is a dire prognosis. A dire prognosis is not improved by pretending it is otherwise.

A Deeper Look at Experts and Stakeholders

One possible objection to this overall approach might be a fear that it would breed micromanagement to deal with all the risk discussions. If it does, there is probably a failure to understand who the Stakeholder properly is. It’s worth thinking through this more.

Expert and Stakeholder are relative roles. Stakeholders typically have a higher-level Stakeholder they interact with as an Expert. Likewise, an individual contributor functioning as an Expert will normally encounter risks that they can legitimately take responsibility for and consume. They are effectively their own Stakeholder in many situations. The question you should be asking yourself is: if this goes wrong, are you able to absorb the consequences? Or are you creating a situation that risks blindsiding your Stakeholders? We do this with every line of code we write, so it’s not a foreign concept. We routinely experiment and refactor, consuming the risks of the decisions we make. Sometimes we work a few extra hours to make that happen. At some point we may uncover technical risk that clearly falls outside the estimating assumptions baked into our commitments. That is the time to engage in the Stakeholder conversation.

Breaking this down even a little more – for each of the two roles there is a ditch on either side of the road. On the one side you can claim authority you do not have. On the other side you can abdicate responsibility that is properly yours.

  • Expert – You fail to communicate a serious emergent risk and effectively preclude the Stakeholder from bringing solutions to the table.
  • Expert – You abdicate responsibility even for risks that you would be able to consume. Effectively, you are forcing the Stakeholder to do your job for you.
  • Stakeholder – You refuse to give latitude to the Experts to exercise agency over their own work.
  • Stakeholder – You abdicate downward, forcing Experts to take on responsibility for risks that they are not truly in a position to consume. At best this creates confusion. At worst you are setting them up for failure and refusing to consume risks that legitimately belong to you, or that you should, in turn, be escalating. It does not create conditions where risks get daylighted and consumed. This should not be confused with mentoring, where you explicitly encourage the Expert to take on a risk, accepting that you might have to consume the risk if it is realized.

Concluding Thoughts

Will this address every situation? No. But I think you would be surprised by how many seemingly intractable situations can be untangled into something much less pathological. Are there situations where someone is acting in genuine bad faith that will defeat your attempt to act in good faith? Yes. But people often alter course in the face of the bright disinfecting sunshine of a frank risk conversation. At the end of the day, you can only be responsible for what you can be responsible for. But at least you can make sure you are taking responsibility for that.