Lesson #1: The Most Insidious Bugs Hide In Between
Imagine a scenario. You are writing a new feature. Let's say that it's a REST API. You cover it in unit tests. You test it manually. It all works great. You have your mocks, stubs, spies, or whatever the terminology is within your language/runtime of choice. You test the feature locally and in DEV or an ephemeral development environment.
It all works great. You're happy. The stakeholders are happy. Sprint velocity is holding up well.
It is now time to integrate with another system, micro-service, or whatever the case may be.
The integration is done, and it fails miserably. A bunch of errors come up, which you fix one by one. But after every error you fix, another one appears, and then another.
Now, the other service, the one that your system is calling, was most likely also thoroughly tested - or so that other team thought.
The problem is that although both your service and that other service have been tested on their own, they have not been tested together.
You might say, "Well, if the call fails, it means that my service wasn't tested all that well after all."
You might also retort that "if the API specification is 100% correct, as it should be, and there are no bugs in the services, then everything has to work. If it doesn't, it means that the specification/requirements were incorrect."
You would be right on both counts. The problem is that the reality of software engineering is messy. Requirements and specifications often lack detail, and teams make assumptions that aren't always true. In fact, they rarely are.
What happens then is that you write the code based on your assumptions and understanding, but the other team, the one writing the other system, has different assumptions and a different understanding.
When you integrate the two systems, this is where the true test lies. In fact, this is not limited to systems. It also holds true for different components, or modules, or classes within the same codebase or application.
The rule is that until you have integrated the separate components, whatever they are - you just don't know if things work the way they should. That is why the most insidious and hard-to-find bugs are found in the relationships between entities and not so much within the entities themselves.
Luckily, there is a solution. Integrate early and do it often. Create integration tests and End-to-End tests that can test the entire flow of the application. Use them sparingly, but each such test is worth dozens of unit tests since it tests an entire flow.
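To make that concrete, here is a minimal sketch of what such a test might look like, assuming JUnit 5 and a hypothetical downstream "inventory" service reachable in a shared test environment. The endpoint, payload, and names below are illustrative, not taken from any real system:

```java
// Minimal integration-style test sketch: the request crosses the real service
// boundary instead of a mock, so contract mismatches surface here.
import org.junit.jupiter.api.Test;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import static org.junit.jupiter.api.Assertions.assertEquals;

class InventoryIntegrationTest {

    // Hypothetical endpoint of the other team's service in a shared test environment.
    private static final String INVENTORY_URL =
            System.getProperty("inventory.url", "http://localhost:8081");

    @Test
    void reservingStockAgainstTheRealServiceSucceeds() throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Call the real downstream API the same way our service would in production.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(INVENTORY_URL + "/reservations"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"sku\":\"SKU-123\",\"quantity\":2}"))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // If the two teams' assumptions about the contract diverge,
        // this is where the mismatch shows up - before production does it for us.
        assertEquals(201, response.statusCode());
    }
}
```

The value is not in the assertion itself; it's that the request travels across the boundary where the two teams' assumptions meet.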
The other thing you can and need to do is ensure that contract specifications between systems are well defined and well communicated. This is not always up to you. However, as a software engineer, you can communicate the importance of this to those who are responsible for these requirements - product owners, managers, or business analysts.
Otherwise, you will continue writing code under your own assumptions and with your own interpretation while others have theirs.
Bugs will continue hiding in between.
Lesson #2: You Are Only as Agile as Your Least Agile Link
Much has been said about agile software development, Agile™, Scrum, and everything in between. There is no shortage of opinions from software engineers and others on the topic. Many of these opinions are critical of the Agile process and methodology.
This lesson though is not about whether Agile methodologies are right or wrong. Instead, this is an observation about one key aspect that makes or breaks the success of Agile in an organization.
My first acquaintance with Agile methodologies came somewhere around the midpoint of my career. One large organization that I was working for at the time started switching from waterfall to Agile. My team was one of the teams that switched to the Scrum version of Agile within that organization. We started working in sprints, assigned points to stories, and our managers began to vigilantly track team velocity.
There was one problem.
My team did not work in isolation. We worked within an ecosystem of products, teams, systems, and programmes. Many of those were dependent on my team while others were dependencies for us. It was those dependencies that, as I started noticing at one point, were the cause of significant impediments to our delivery.
The problem was that even though we had biweekly sprints and were delivering functionality at a fairly rapid and consistent pace - our counterparts were not.
When we had to design an API that called System XYZ, ran some business logic, and then returned a response to the customer - guess what happened. The team working on System XYZ was not ready. Not even remotely close! That team was still in Waterfall mode - waiting for months on end to complete their functionality.
So while our API was the "star of the show", with everyone waiting for it to be ready, it couldn't really work without the other system being ready. This resulted in shattered expectations at high levels, lots of misunderstanding, and significant fallout.
Now, you might say - "Why, this isn't agile at all! True Agile takes these things into consideration. You can at least mock the other system!" You're right. But this was early in Agile adoption across the industry, and many teams didn't yet have these "lessons learned".
I've also got news for you. Many teams and organizations are not aware of this and other such basic "agile guidelines" to this day! Even if they are aware to some degree, they still implement Agile processes in a way that is counterproductive and achieves the opposite of the speed and agility that was intended.
So what can you do as a "mere software engineer"? After all, you don't make the requirements and you are not in the role of setting deadlines and making promises to upper management. Not in most large corporations, anyway.
The one thing you can do is communicate these concerns early and often. System dependencies can be resolved if everyone understands the reality of delivering that product. If your Product and Business stakeholders have awareness and understanding of these dependencies, measures can be put in place to account for, address, and mitigate them.
There is nothing stopping you from saying to your Product Owner or Manager -
"Hey, that System XYZ, doesn't look like they are going to be ready in time. What we can do is create a fake response from them and use it instead of the real system when the client calls us. That way, we demonstrate our functionality without waiting for them. We just need your help in creating awareness around this and communicating it to the project stakeholders."
The sooner that's done, the better.
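For illustration, the "fake response" can be as small as the sketch below. The names (SystemXyzClient, QuoteResponse) are made up for this example, not taken from any real project:

```java
// Hypothetical client contract for System XYZ (illustrative names).
interface SystemXyzClient {
    QuoteResponse getQuote(String customerId);
}

// Plain record matching the response shape agreed in the draft API specification.
record QuoteResponse(String customerId, double amount, String quoteId) {}

// Stand-in used in demos and tests until the real System XYZ is ready.
class FakeSystemXyzClient implements SystemXyzClient {
    @Override
    public QuoteResponse getQuote(String customerId) {
        // Canned response based on the agreed contract, not on the real system.
        return new QuoteResponse(customerId, 42.00, "FAKE-QUOTE-001");
    }
}
```

Because the fake and the real HTTP client implement the same interface, swapping one for the other becomes a wiring or configuration decision rather than a code change, and the demo to stakeholders doesn't have to wait for System XYZ.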
Lesson #3: At Least 50% of Bugs Are Due to Caching
Caching in memory. Caching in a separate product. Caching in the browser. DNS caching. OS level caching. Distributed caching. You get the idea…
One time, I was asked to fix a particularly elusive bug that was causing the intermittent purging of data from an application. I remember that investigating the bug was pretty frustrating because the issue didn't seem to be reproducible. Sometimes things worked the way they should have, and then suddenly, some of the data being processed was gone. Just like that. So instead of the data being saved to the database, there were a bunch of cryptic errors written to the logs.
It took me and another developer hours to finally figure out that there was a data structure in the code that was being used as a cache. However, instead of caching locally within one thread, it was implemented as a Singleton, so the cache was application-wide.
That was issue number one.
The other issue was that data was being incorrectly deleted from that data structure under certain conditions. When that happened, you can deduce what came next…
That's right. It was a Singleton, with state shared between all application threads, after all.
Data got wiped across the entire application on that node.
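To be clear, the snippet below is not the actual code from that incident - just a stripped-down illustration of the same class of bug: a static, Singleton-style map used as a cache, shared by every thread, where one misplaced clear() wipes the data for everyone:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RequestCache {
    // Intended as a per-request scratch cache, but "static" makes it application-wide.
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    static void put(String key, String value) {
        CACHE.put(key, value);
    }

    static String get(String key) {
        return CACHE.get(key);
    }

    static void reset() {
        // Meant to clean up "this request's" data, but because the map is shared,
        // it silently throws away entries that other threads are still relying on.
        CACHE.clear();
    }
}
```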
This was one example of what can go wrong. Now, you might say - "Wait, this is just because the cache was at the wrong level and was also being incorrectly purged."
You would be right. But, the point is that there are many different issues that can be caused by misusing a cache, misunderstanding how it works, or just plain not being aware that something is cached. This was just one example.
Many bugs and unexpected outcomes - especially those that are hard to find or those hard to reproduce - will be due to various types of caching.
What makes these often non-trivial to fix is that there are so many places where a value can be cached. Moreover, the reproducibility of the issue depends on the invalidation and eviction strategies of the particular layer where things are cached.
To make matters even more complicated, things can get cached on several layers, and so good luck with figuring that one out.
There were a number of notorious cases in the industry where large-scale production meltdowns were caused directly or indirectly by mistakes in how caching was handled.
Don't believe me? Here is one such case that happened back in '22 with Slack.
Have a read; it's a great overview of how they found the issue, analyzed it, and addressed it. It also demonstrates the complexity involved in many such cache-related issues.
So what can you do? What can you as a software engineer do to mitigate and minimize the likelihood of such bugs ever occurring?
Well, for starters, just be aware. Know that when there is a bug that seems to be sporadic in nature, one of the first questions to ask is - "what's being cached and where?"
Second, have proper logging and debug statements - especially around places where you explicitly know there is caching involved.
Third, be very careful with components that share state. Cover them with enough tests to ensure that success, failure, and edge cases are accounted for, understood, and well tested.
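As a rough illustration of the second point in particular, even a couple of debug lines around cache reads and invalidations make "what's being cached and where" answerable from the logs. The sketch below assumes SLF4J and a simple in-memory map; the names are made up:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class PriceCache {
    private static final Logger log = LoggerFactory.getLogger(PriceCache.class);

    private final Map<String, Double> cache = new ConcurrentHashMap<>();

    double priceFor(String sku, Function<String, Double> loader) {
        Double cached = cache.get(sku);
        if (cached != null) {
            // This is the line that answers "was this value served from cache?"
            // during an investigation.
            log.debug("cache HIT for sku={}", sku);
            return cached;
        }
        log.debug("cache MISS for sku={}, loading from source", sku);
        double loaded = loader.apply(sku);
        cache.put(sku, loaded);
        return loaded;
    }

    void invalidate(String sku) {
        log.debug("cache INVALIDATE for sku={}", sku);
        cache.remove(sku);
    }
}
```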
Also, learn how caching works and the various caching strategies. I wrote a deep dive exploring this very topic where I talk about various caching strategies, eviction patterns, cache invalidation and other things you need to know when making use of a cache. Check out that article here.
Lastly, as a side note, there is a reason the Singleton pattern has fallen out of favour in recent years, and this is it. Use it judiciously, or maybe don't use it at all. Be careful about making things "static" (Java) or otherwise shared between different components. That obviously doesn't mean that you cannot share state and data. It just means that you have to be aware of where it is happening and put the appropriate guardrails and observability in place.
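One way to put such guardrails in place, sketched below with illustrative names only: instead of a static map that any class can reach, make the cache an ordinary object with a hard size bound and hand it explicitly to the components that need it, so the sharing is visible in the application wiring:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BoundedCache<K, V> {
    private final Map<K, V> entries;

    BoundedCache(int maxEntries) {
        // LinkedHashMap in access order gives a basic LRU eviction policy,
        // so the cache cannot grow without limit.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized V get(K key) {
        return entries.get(key);
    }

    synchronized void put(K key, V value) {
        entries.put(key, value);
    }
}

class ProductService {
    private final BoundedCache<String, String> descriptions;

    // The shared cache arrives through the constructor, so everything that can
    // touch it is visible in the wiring - no hidden global state.
    ProductService(BoundedCache<String, String> descriptions) {
        this.descriptions = descriptions;
    }

    String describe(String sku) {
        String cached = descriptions.get(sku);
        return cached != null ? cached : "(not cached yet)";
    }
}
```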
No wonder caching is the second hardest problem in computer science (right after the naming of variables).
It ain't easy.
Enjoyed this article and looking to level up further in your career as a software engineer or software/solutions architect?
I wrote two guides that can help with that:
🚀 If you want to become a senior software engineer, check out "Unlocking Your Software Engineering Career"
🚀 If you want to either become a software/solutions architect or get better in that role, check out "Unlocking the Career of Software Architect"