New and mature teams alike need to deploy applications to production. You would think that doing so is fairly straightforward - at least within the realm of large, well-established organizations.
It's reasonable to assume that such organizations and the teams that work within them have a set of guidelines, best practices, and standards that they can reference when building and deploying an application.
In reality, the process of building applications - especially large-scale, "enterprise-level" distributed applications - is a road full of discovery, challenges, surprises, and hiccups.
As new projects come up, they often go through the same steps and challenges that previous projects had to go through.
This, of course, is redundant, error-prone, and inefficient.
Moreover, this lack of a unified standard for building production-grade systems introduces significant risks into these applications. These aren't trivial risks, either. Imagine that your system doesn't have proper security measures or an audit-log in place. That means the application is at a high risk of being compromised, and if that happens, it will be extremely difficult to know what actually happened and how.
This, of course, results in erosion of customer trust, revenue loss, legal ramifications, and a plethora of other consequences - all detrimental to the company's success.
Over the past twenty years of building such systems, I've developed a set of guidelines and standards. These guidelines harden an application and help make it "production-ready". I now teach and help organizations implement these guidelines as part of my consulting practice.
In this article, I want to share some of these crucial guidelines in the hope that they will help your organization set a new or existing application up for success.
For each such guideline, I will also call out which architectural characteristic is being represented by that guideline. For more on architectural characteristics, visit my Software Architecture Trade-Offs series.
It's worth noting that although most of these guidelines can be applied to a system of any size, complexity, and degree of centralization - they are really geared towards large distributed applications within an enterprise.
Please note - this is far from being an exhaustive list. However, it does constitute some of the topmost items that I consistently find organizations miss when building such applications.
These items are a requirement for most distributed, enterprise-grade applications - especially those serving external clients.
Let's dive in.
Auditing / Audit-Log
Architectural Characteristic: Auditability
What's an Audit-Log and Why Do We Need One?
Your application, I would assume, will get a lot of traffic. There will be things happening: users invoking actions, performing operations, and, in general, triggering things to happen.
Imagine that you've built an order management system, and a user placed an order and paid for it. But the order never went through. The user never received whatever it is they paid for. Now they are contacting your customer support to find out what happened.
You need to figure that out. How do you do that?
Well, you need to know a few things. You need to know what happened. When it happened. Who it happened to. Also - why it happened. That's the very least of what you need to know.
There needs to be visibility into the original user action and the metadata related to it.
You might say that the application is logging all of that data. But that's not enough. First, you need to be sure that this data - and I mean all of it - is really logged everywhere it's appropriate.
Second, you need to know where and how to look for it.
Third - you need the ability to pull it out quickly and efficiently, and ideally also have an automated process to do it for you.
That's where an audit-log comes in.
An Audit-Log is a construct that records all user-triggered actions with enough metadata for you to know exactly what has happened. The audit mechanism can take different forms, and the audit-log itself - i.e., the log records - can go to a variety of places: application logs, a database, an external system, and so forth.
We'll explore the details around this next.
What Do We Put In An Audit Record?
The information in an audit record will of course depend on the business context of your application to an extent. However, at the very least, you will record the following things:
Timestamp when the user action took place
System Timestamp of when the record is generated
The action/operation that has happened
The user who triggered this action
Success or Failure of the action (if available)
Any metadata or contextual information
So on a code level, this would look along the lines of the following (in JSON format):
{
  "tsOccurred": "1717359421",
  "tsCreated": "1717359425",
  "action": "ORDER_SUBMIT",
  "userId": "tPV1wI5JoLzPqJGLOxCiE",
  "result": "SUCCESS",
  "metadata": {
    "param1": "abc"
  }
}
Now, this of course is represented as JSON but you might have this as plain text in the logs, or in another format. It doesn't matter. The idea is the same.
The most crucial aspect of the audit log is that its records need to be identified with ease. Hence, if you are putting log records as part of your regular application log, you need to make sure that these logs can be easily identified and separated from other operational logs.
One way of doing so is by creating a special log prefix for audit records such as "__AUDITRECORD__" so they can be easily found.
You may also consider sending these records to a different log than your operational log.
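To make this concrete, here's a minimal sketch (in Python) of what emitting such a prefixed audit record could look like. The logger setup, helper name, and prefix are illustrative assumptions, not a prescribed implementation:

import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("audit")

AUDIT_PREFIX = "__AUDITRECORD__"  # distinctive marker so audit records are easy to find

def emit_audit_record(action, user_id, result, metadata, ts_occurred):
    record = {
        "tsOccurred": str(ts_occurred),
        "tsCreated": str(int(time.time())),  # system timestamp at record creation
        "action": action,
        "userId": user_id,
        "result": result,
        "metadata": metadata,
    }
    # The prefix makes audit records trivially separable from operational logs
    logger.info("%s %s", AUDIT_PREFIX, json.dumps(record))

# Example usage
emit_audit_record("ORDER_SUBMIT", "tPV1wI5JoLzPqJGLOxCiE", "SUCCESS",
                  {"param1": "abc"}, 1717359421)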
What Is The Audit-Log Used For?
Records of user activity can be, and are, used for the following use cases:
Security audits
Compliance standards such as PCI-DSS
Client dispute resolution
Troubleshooting of a production application
Data analytics
Knowing how to get an application ready for production is a key requirement of the software architect's role.
So is the understanding of architectural characteristics and trade-offs. I dive into all of these in my guide - Unlocking the Career of Software Architect
Traceability
Architectural Characteristic: Observability
The idea of tracing is not anything new. What this concept means is that you can "trace" a request as it moves between different services. This is especially important in a microservice environment, where a request can move between different internal services before it is fulfilled.
Traces are also part of the Observability paradigm (along with logs and metrics). We will discuss Observability in another installment.
The way that tracing is commonly done is by inserting a persistent unique identifier when the first request is created. The request is then sent where it needs to go, and that identifier is included in its payload. If the receiving service needs to propagate the request further, the next request will include the same identifier.
These identifiers are typically called a "Request ID", "Tracing ID", or "Correlation ID". As the flow is executed across multiple services, we can log each step along the way and use that correlation ID to pinpoint the entire lifecycle of the request.
In a synchronous API call, you would usually pass this ID as a custom HTTP header that is part of your request metadata.
It's important to note that the concept of traces or tracing is not limited to synchronous requests, though. Asynchronous requests also need to be traced. In an asynchronous application, the request ID might be inserted into the metadata of the message or event as it is published or sent to its consumer.
Note that correlation IDs can also be generated and inserted into the request context by different monitoring and observability tools. Hence, you do not always have to implement these as part of your application logic.
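As an illustration, here's a minimal sketch of propagating a correlation ID over a synchronous HTTP call (Python, using the requests library). The header name "X-Correlation-ID" and the downstream URL are assumptions - conventions vary across organizations and tools:

import uuid
import requests

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name; conventions vary

def handle_incoming_request(headers):
    # Reuse the caller's correlation ID if present; otherwise start a new trace
    correlation_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    # ... do local work, logging correlation_id with every log line ...
    # Propagate the same ID to the next service in the chain
    return requests.get(
        "https://inventory.internal.example.com/stock",  # hypothetical downstream service
        headers={CORRELATION_HEADER: correlation_id},
        timeout=5,
    )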
API Management
Architectural Characteristic: Security, Observability, Scalability, Resilience
If your application exposes any kind of API - and it most likely does - you need mechanisms in place to manage and protect API traffic. You need things like authentication, payload validation, request management, monitoring, URL rewriting, and more. All of these are required for all of your APIs, so it makes sense to handle them in a central manner - i.e., with an API management solution.
Now, API management products and services provide a lot more than what I mentioned above, but one item is particularly important - request management.
Request management is the concept of limiting the inflow of API requests into your application. This is done to prevent your application from being overwhelmed and ensure that it is only tasked to process what it can handle.
This is done in a number of ways, and there are several concepts and techniques that are important to understand when talking about request management.
First, we need to understand the concept of TPS or Transactions Per Second. This is a metric of how many transactions (requests) your application/API can process within a second. Note that we are using "transactions" as a synonym for "requests" here. This is different from transactions in the context of database transaction processing and the related concepts of ACID (atomicity, consistency, isolation, durability).
It's also important to note that TPS is often used as a proxy for concurrent requests. It's a very similar concept but the two concepts are not the same. Request concurrency refers to how many requests can be processed from beginning to end at the same time by a system. TPS refers to how many requests can be processed within a second.
This, for all intents and purposes, is almost the same, because a second is a small enough timeframe. The difference, however, is that you might have 100 requests being processed in a second, but since each request can take, say, 50ms to process, not all of them may be processed concurrently within that second.
This, in many cases, is an immaterial difference, unless we are talking about a system that processes very large volumes of traffic within a sub-second timeframe.
We should be aware of this difference but for the purposes of our discussion here, we will only focus on TPS as the number of requests that can be processed within a second.
There are a number of mechanisms by which an API management solution manages requests.
Throttling
Throttling and Rate Limiting are often confused, and the two terms are used interchangeably. They overlap, and different products will document one, the other, or both - sometimes using them to describe the same thing.
The idea behind throttling is to limit the number of requests the system can receive within a timeframe on a global level.
For example, you might state that your API can support 1000 transactions per minute - globally across all clients.
Rate Limiting
The difference between rate limiting and throttling - typically - is that rate limiting policies are applied per user/client.
In other words, your API can support, for example, 20 requests per second for a particular client.
Throttling is sometimes described as being a type of Rate Limiting.
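To make the distinction concrete, here's a minimal token-bucket sketch in Python. Keying the buckets by client ID gives you per-client rate limiting; a single shared bucket would instead give you global throttling. The rate and capacity numbers are arbitrary assumptions:

import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # tokens replenished per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client => rate limiting; one shared bucket => throttling
buckets = defaultdict(lambda: TokenBucket(rate_per_sec=20, capacity=20))

def accept_request(client_id):
    return buckets[client_id].allow()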
Load Shedding
Much like rate limiting and throttling, load shedding is a technique to ensure that a system does not become overwhelmed. It refers to the idea of dropping, or "shedding", requests based on predefined rules.
So for example, you might want to drop requests after a certain TPS is reached for that API. Alternatively - requests can be dropped if some underlying resource metrics exceed a threshold - such as CPU or memory.
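A load-shedding rule can be as simple as rejecting requests once a resource threshold is crossed. Here's a minimal sketch in Python, using the Unix-only os.getloadavg() as a crude CPU proxy; the threshold value is an arbitrary assumption:

import os

LOAD_THRESHOLD = 8.0  # assumed threshold; tune to the host's core count

def should_shed_request():
    # 1-minute load average as a rough proxy for CPU saturation (Unix only)
    one_min_load, _, _ = os.getloadavg()
    return one_min_load > LOAD_THRESHOLD

def handle_request(request):
    if should_shed_request():
        return {"status": 503, "body": "Service overloaded, try again later"}
    # ... normal processing ...
    return {"status": 200, "body": "OK"}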
Retry Mechanisms
Architectural Characteristic: Reliability, Fault Tolerance, Resilience
Your production-grade system will most likely communicate with other systems. It will also make use of internal communication between its own components.
What happens when there is a failure in that communication? Imagine that your application provides a service to the user. The user first logs in and is then able to perform a bunch of actions - say, on their user profile. The user makes a change that propagates from one service to another. But there is a network error, and the request never reaches that other service.
Now, in the case of a synchronous request, where there is an end-to-end feedback loop, you present the user with an error message. However, what about asynchronous requests where no such feedback loop exists?
In such a failure scenario, you have two options - error out or retry. Even if you retry, of course, you might still end up with an error eventually. That said, in the case of retries, you will at least have made a reasonable attempt at bringing the action to successful completion.
The point of a retry is that you are re-attempting to deliver a request or an action that failed the first time around. Now, a retry can be either synchronous or asynchronous. However, you would usually not put extensive retrying (if any) on synchronous requests - especially those with direct user interaction.
This is because retrying makes such a user interaction longer, and so it is typically favorable to fail-fast and notify the user as opposed to trying again. That being said, you need to weigh the trade-off between the importance of request completion and user experience. This trade-off will depend on the context and is not always obvious.
Asynchronous requests are where you would have more extensive retrying because these typically happen in the background and do not involve direct user interaction.
At its core, the concept of "retries" is simple to understand - i.e., repeat an action a number of times until it either succeeds or fails after a pre-set number of attempts.
In practice, however, there is a lot of nuance and things you must consider when implementing retries.
These are the concepts to be aware of:
Exponential Back-Off
The idea behind retries, as we said, is to reattempt delivery of a message, query, or command. The simplest way to do that is to have a configurable time frame after which the message will be redelivered - for example, every 30 seconds. Now, of course, you would put some reasonable limit on that so that you don't retry indefinitely and flood your system with requests that potentially may never complete.
Let's say that you've configured a maximum of 5 attempts spaced out at 30 seconds each.
The problem is that, because you are redelivering all of these messages at the same fixed intervals, there is a chance that the destination system becomes overwhelmed and fails due to that. This happens all the time with systems that process large volumes of information. The reason is that the destination system does not get a chance to recover, since it continues to receive the same volume of requests time after time.
One way to mitigate this is to introduce "exponential back-off", which works by spreading our retry requests over an exponentially increasing span of time. Here's an example of how it would work:
First retry - delivered after 30 seconds from the original failed request.
Second retry - delivered after 60 seconds from the first retry.
Third retry - delivered after 120 seconds from the second retry.
And so on…
The point being that if the volume of your requests overwhelmed the destination application, retrying that same volume again and again wouldn't solve the problem. It would only exacerbate things.
Hence, we space retries out over a time frame that gets progressively wider, giving the destination room to recover.
Jitter
Sometimes, introducing exponential back-off is not enough and you need true randomness. That's where the concept of Jitter comes along.
When introducing jitter to a retry mechanism, we are introducing a level of randomness and unpredictability. This helps spread out requests further over time and increases the likelihood that the destination will have enough capacity to process these requests.
For example, the first retry might happen after 5 seconds. The second retry after 17 seconds. The third - after 3 seconds, and so on.
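Putting the two together, here's a minimal sketch of a retry loop with exponential back-off and "full jitter" in Python. The base delay, cap, and attempt count are arbitrary assumptions:

import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=30.0, max_delay=600.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted; let the caller (or a DLQ) take over
            # Exponential back-off: 30s, 60s, 120s, ... capped at max_delay
            backoff = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: pick a random delay in [0, backoff] to spread retries out
            time.sleep(random.uniform(0, backoff))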
Dead Letter Queue (DLQ)
DLQs are a key concept in messaging and event streaming. With retries, you would typically have an asynchronous mechanism to re-attempt the delivery of requests that failed the first time around.
These requests will usually be in the form of messages transmitted via a type of Service Bus, Message Broker, or Streaming platform.
But, what happens when all attempts to re-deliver these messages are exhausted? This is where a DLQ comes into play.
A DLQ is where you send messages that could not be delivered - even with a retry mechanism. These messages can then be inspected or acted on via both automated and manual means.
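In code, the hand-off can be as simple as a catch-all after the retry loop. This sketch uses an in-memory queue as a stand-in for whatever dead-letter destination your broker provides:

import queue

dead_letter_queue = queue.Queue()  # stand-in for a real broker's DLQ

def process_with_dlq(message, handler, max_attempts=5):
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(message)
            return
        except Exception as exc:
            last_error = str(exc)
    # All delivery attempts exhausted: park the message for inspection
    dead_letter_queue.put({"message": message, "error": last_error,
                           "attempts": max_attempts})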
Back Pressure
Back pressure is not a technique for retries but is rather a mechanism for request management that will often cause retries to happen.
The idea behind it is that a consumer application cannot keep up with the influx of data from a producer. Back pressure is the downstream consumer's way of signaling to the upstream producer that it needs to slow down.
Hence, back pressure is a flow control mechanism used in distributed systems to ensure stability and to prevent overloading components. This is done by managing the rate at which data is produced and consumed.
There are multiple ways of achieving this. One way is via bidirectional or duplex protocols such as HTTP/2 or WebSockets. Another is by including an indication to slow down within the response headers of an HTTP response.
The key concept here is that there is a feedback loop where the producer in effect becomes "aware" that it needs to slow down with sending data to that particular consumer.
Now, the relationship between back-pressure and retries is that when the consumer application communicates back to the producer to slow down, the producer may need to retry requests that were already in-flight and rejected.
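For example, one simple HTTP-level feedback loop is the consumer returning 429 (Too Many Requests) with a Retry-After header, which the producer honors before retrying. Here's a minimal sketch in Python with the requests library; the consumer URL is hypothetical, and Retry-After is assumed to be expressed in seconds:

import time
import requests

def send_with_backpressure(payload, max_attempts=3):
    url = "https://consumer.internal.example.com/ingest"  # hypothetical consumer
    for _ in range(max_attempts):
        response = requests.post(url, json=payload, timeout=5)
        if response.status_code != 429:
            return response
        # Consumer signaled back pressure: honor the requested pause, then retry
        retry_after = float(response.headers.get("Retry-After", "1"))
        time.sleep(retry_after)
    raise RuntimeError("Consumer still applying back pressure after retries")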
Circuit Breaker
When you send requests to a system and they constantly fail (with, say, an HTTP 504 - Gateway Timeout), it often does not make sense to continue sending them and overwhelming the target application even more. This is because we don't really know when the target system will be up again.
The circuit breaker is a mechanism that "keeps tabs" on the target system by limiting the calls made to it. While the target system is down, the circuit breaker has the ability to allow only a fraction of requests to go through in order to gauge whether the system is healthy.
It can also "ping" the target system on a timed basis regardless of whether there are requests coming in.
The concept of circuit breaking is tied to the concept of retries because you may want to queue the requests that were throttled and not allowed to proceed. These requests can then be re-attempted later when the destination system becomes healthy again and is capable of processing these requests.
You will not always want to do that, as is the case with synchronous flows. However, with asynchronous flows, using a circuit breaker together with retries is a fairly common practice.
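Here's a minimal sketch of the circuit breaker idea in Python. Real implementations - such as those found in resilience libraries - add richer failure accounting and configuration; the thresholds below are arbitrary assumptions:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # how long to stay open before a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("Circuit open: skipping call to unhealthy target")
            # Half-open: let one trial request through to gauge health
            self.opened_at = None
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result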
Disaster Recovery (DR) and High Availability (HA)
Architectural Characteristic: Availability, Resilience, Fault Tolerance
The concept of a disaster recovery plan incorporates many different aspects of deploying and running an application in production. It is one part of a larger Business Continuity Planning (BCP) exercise.
Notably, not all systems need an explicit DR plan. That said, most applications that run in "production" and service users do need some degree of DR planning. At the very least, you need to understand what happens if the system goes down and is not able to service users.
Now, DR does deserve its own article, if not a book. However, we will go through some of the key items that you will most likely need to keep in mind.
Database Replication, Backups, and Snapshots
Your application is most likely only as good as the data it serves. That's why one of the most important items to safeguard above all else within your application is your data.
Just like your personal data, your application's data needs to be backed up. The simplest form of backup is a point-in-time backup, or a snapshot of your data at a particular time.
Most, if not all, databases provide a mechanism to perform this kind of backup, and many provide further capabilities (such as incremental backups).
Whatever the case may be, you need to ensure that data is backed up on a regular basis and that the procedure for restoring it is performed and tested routinely.
There is also the concept of an RPO (Recovery Point Objective) that you would communicate to your clients. The RPO is the extent of data loss you can tolerate before restoring operations. For example, an RPO of 15 minutes means that after the system goes down, it may lose up to 15 minutes' worth of data being sent to or processed by it. This window runs from the point of the system going down up to its being restored to service.
The difference between database replication and backup is that replication happens constantly, syncing data from your main database to a replica. A backup, on the other hand, happens at specific times, so any data that changed between backups can potentially be lost.
Infrastructure Redundancy
Many cloud services come with their own redundancy and DR capabilities built-in. In other words, things like message brokers, databases, distributed caches, identity management, and other cloud services have business continuity built into them by the cloud vendor.
So the cloud provider will, in part, ensure that these services are up and running even if some of the underlying hardware and virtualization environment running those services go down.
There are of course a number of caveats there because in some cases, these services may be operational only within a particular zone but not across regions. In other words, some services will continue operating if one datacenter within a cloud zone is down, but will not operate if the entire region goes down. Other services can continue operating if a whole region goes down, but these are typically costlier and require more thorough planning, configuration, and deployment structures.
Also, the cloud provider has its own SLAs (service level agreements) and SLOs (service level objectives) which are important to understand and be aware of when using cloud services.
The key point here is that cloud vendors do provide a level of redundancy and DR capability that makes it easier for your application when it comes to ensuring business continuity.
However, each cloud vendor and each service within the cloud vendor is different. Hence, you need to understand the extent to which these capabilities are provided. Also, for any private cloud and on-premise environments, you still need to tackle disaster recovery entirely on your own.
In addition, you also need the ability to recreate your infrastructure in a new environment if need be. This is where IaC (Infrastructure as Code) comes in. IaC is the ability to represent your system's footprint in the cloud as a set of scripts that, when run, can create the entire environment from scratch. In non-cloud environments, this is typically done via custom scripts.
Active-Active vs Active-Passive HA
When preparing for a DR scenario, there are two main modes of running a system. The first, and more complex and expensive of the two, is Active-Active High Availability. That means that if the infrastructure where your system is deployed goes down unexpectedly, there is another full instance of it that can continue serving clients at full scale.
Typically, this is done as a multi-datacenter or a multi-region deployment. What it really means is that your entire infrastructure is mirrored. An Active-Active HA operation wouldn't usually be set up just for the sake of DR. It would be used in the course of normal operations.
So for example, if your system is deployed in the region of West US and serving customers from the West Coast, you might have it also deployed in East US and serving the East Coast in the course of BAU (business as usual).
If something were to happen in West US, though, and the system went down, users from the West Coast would continue to receive uninterrupted (or almost uninterrupted) service from the mirror on the East Coast.
Running Active-Active HA for the entirety of your infrastructure is of course highly complex and costly. It is not meant to be the case for every system out there but is typically a consideration for very large scale applications that serve large volumes of traffic to users across geographies.
Active-Passive HA, on the other hand, is a mode of operation where all or some of your infrastructure is deployed in another datacenter or region in "stand-by" mode. In other words, should something happen to the main (active) infrastructure, the secondary (passive) infrastructure would act as a fallback to serve client traffic.
Active-Passive HA is less complex and less costly than Active-Active HA but does require careful planning and diligent synchronization of the Active and the Passive deployments.
You also need to decide on the extent to which the Active deployment is mirrored in the Passive deployment. For instance, you may need to mirror not only the infrastructure itself but also the data. That means you would need to continuously replicate data from your Active storage to the storage in your Passive deployment.
Summary
In this article, we went through some of the key aspects of getting a system ready for production. Much of this is applicable to enterprise systems that serve a large and public user base. However, even smaller systems or MVP applications can benefit from at least some of these guidelines.
Things like the ability to audit user requests, traceability, and API management are key for most systems out there. Disaster recovery, although mostly applicable to large scale systems, is still something you at least want to consider for smaller applications. It may not be to the same extent, but you should still have awareness of the implications when the application is down. Even if you don't need to act on it, it's useful to understand the risks involved.
It's important to note that the items we've discussed here are by no means an exhaustive list. There is a lot more that goes into making an enterprise application ready for production both from a technical and organizational standpoint.
Observability, production support processes, dispute resolution, automated QA, CI/CD, failover - there is a lot that goes into making an application ready for production use. These are all topics for future articles. Not all of these apply to the same extent to startups or MVP applications. However, it helps to understand all of them so that your team knows what to focus on next, how to set the application up for success, and how to mitigate risks.
Knowing how to guide teams in building robust, hardened, production-grade applications is an important capability that sets apart great software architects.
For a lot more details, deep-dives, and drill-down into the nuts & bolts of the software architect's role, check out my guide - Unlocking the Career of Software Architect.