When production goes down, what's the first thing you investigate? Maybe you check Sentry for new runtime exceptions. Next, you might check logs or traces for where these exceptions originate. Eventually, you might find yourself on GitHub clicking the "Closed" filter on the pull request page to see the most recently merged code and the latest production deployment.
I did this every time production broke and I was responsible for responding to the incident. I've seen incidents ranging from dropped production databases because rake db:drop was run in the wrong terminal window to a Spinnaker instance redeploying three-month-old code because Redis restarted.
Our changes are the common catalyst for production incidents, but why?
The Life of a Codebase
Code, and the infrastructure that runs that code, are constantly changing. Nothing can stop us when we're bushy-tailed and bright-eyed building our first features in a wide open field. And that's (sort of) right. Short of writing just lousy code, new codebases that have very few components are inherently more reliable—small projects Just Work™.
Just look at rubular.com – a long-standing website I've used countless times to write regex statements through trial and error (I hate regex). In the 13 years I've written Ruby, this website has not changed (other than the underlying Ruby version).
But only some codebases can be like Rubular. The codebases that pay the bills are under constant scrutiny from those who rely on what they provide. Customers, those damn paying customers, are the most prominent reason why codebases change. We write new code to compete, but that new code often takes on a life of its own.
Writing code is like playing Jenga, kinda.
Jenga, the excellent game of wood blocks and gravity, is how I think of an evolving codebase. The tower starts as a perfectly stable grid of wood girders and excited players. The first player to go rarely has rhyme or reason for their initial move. The tower is so stable that any piece can go nearly anywhere and won't fall. But as the game progresses, the tower becomes taller and more unstable. Each player gets increasingly nervous, hoping they're not the one to make it ultimately collapse, ending the game.
Each turn in Jenga is the same as when you merge a pull request on GitHub.
There's one minor (oh so minor) difference between my Jenga analogy and writing code, though. When we write code, we're also minting new wood blocks and attempting to slot them into place. As engineering and product teams, we're given a choice on where we think we should place our next block. Do we want more features or more stability? Should we fortify any load-bearing tech debt we have?
If we're making a new feature, the block is likely getting added to the top. If we're refactoring an old feature, we could slot our girder into the middle section. If we're tackling overall tower reliability, we could slot a tiny steel brick at the bottom, bringing the center of gravity down and creating more stability.
Jenga, but with a twist.
As a Jenga master code carpenter, you can play different blocks, and they change the dynamic of the game you're playing. I reckon you have three main types of blocks to play each turn:
Wood block: Creating and deploying a new feature. These blocks can be moved or added each turn to add height to the tower.
Toil block: These blocks must touch at least three other blocks, and can't be moved for at least 25 turns after being placed.
Process block: These blocks can be added anywhere on the tower but can't be moved for at least 50 turns. Also, at least two-thirds of players must agree to the location of the process block.
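If it helps to see those three rules written down, here is a rough Python sketch of them. Nothing here comes from a real library or rulebook: the names Block, LOCK_TURNS, can_move, and can_place are made up for illustration, and the sketch assumes "can't be moved for at least N turns" means a block frees up once N turns have passed since it was placed.

```python
from dataclasses import dataclass

# Lock-in periods in turns, taken from the rules above.
# Wood blocks have no lock-in, so they can be moved right away.
LOCK_TURNS = {"wood": 0, "toil": 25, "process": 50}


@dataclass
class Block:
    kind: str        # "wood", "toil", or "process"
    placed_on: int   # turn number when the block was placed


def can_move(block: Block, current_turn: int) -> bool:
    """Assumption: a block may move once its lock-in period has fully elapsed."""
    return current_turn - block.placed_on >= LOCK_TURNS[block.kind]


def can_place(kind: str, touching: int = 0, votes_for: int = 0, players: int = 1) -> bool:
    """Check the placement rule for each block type."""
    if kind == "toil":
        # Toil blocks must touch at least three other blocks.
        return touching >= 3
    if kind == "process":
        # At least two-thirds of players must agree (integer math avoids rounding).
        return 3 * votes_for >= 2 * players
    # Wood blocks can go anywhere.
    return True
```

For example, can_move(Block("toil", placed_on=10), current_turn=20) comes back False, because only 10 of the 25 lock-in turns have passed. The long lock-in on process blocks is the point: once one is placed, the team lives with it for a while.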
New features and code aren't the only things that stabilize your Jenga tower. Your decisions to go down a few floors and refactor girders of choices past make a considerable difference in which towers can continue to grow and which crumble and fall. Similarly, your process is a part of your change failure rate. It should accommodate the height of your tower and then some, but not reach too far beyond the height you're trying to build to. Process blocks should be "long-term ephemeral" (is that a thing? I think it can be) because they can't move quickly.
Fighting gravity
That production change that took down the service wasn't the sole cause, but it was the catalyst. The mixture of changes that preceded the outage was already fighting gravity; the latest change is just the one that took the blame. The order of the changes we made, where we put our blocks (and where we didn't), and maybe someone barely bumping the table as they got up to grab another beer all gave gravity its inevitable win.
But guess what? The tumbling tower of production is precisely what needed to happen to highlight where your weak spots were. Eventually, you get good enough at playing Jenga through constant trial and error to play like this.