In the first article of this series, we focused on what architectural tradeoffs are and why they are important.
In the second article, we established why there has to be a conscious decision-making process when it comes to your software architecture.
In this third piece, we will sum things up by talking about concrete architectural characteristics and why it is key to understand which of these characteristics your system needs to prioritize. To do that, we need to understand what architectural characteristics are and how choosing to focus on one characteristic can sometimes force us to trade off another.
Architectural characteristics are specific aspects of your system that have to do with its overall capabilities. Mark Richards, in his excellent lecture on identifying software architecture characteristics, also refers to these as architecture "-ilities". You will see why in a second.
Below are some of these characteristics. This is by no means an exhaustive list - only some of the ones that you will most commonly encounter while designing and building applications, and the ones that I have most commonly seen across both my full-time career and consulting practice.
What I've attempted to do here is to identify the characteristic, indicate when it is most likely to be of importance, what we are trading off by focusing on it, and what some of the ways to target that characteristic are. Granted, all of this is very high level and there is vastly more to be said on each of these particular points.
Architectural Characteristics
Characteristic: Auditability
What is it? Your system does stuff. Events flow through. Users initiate actions and cause things to happen. In many cases and for many reasons, you'd want to have a clear record of these things happening. Namely, you'd want to record that a particular action was initiated by a particular user at a particular time, along with the specifics of that action.
When is it important?
Regulated industries such as Finance
For compliance with internal and external standards (ISO, PCI, SOC, SOX)
For security reasons so that you can identify and investigate unauthorized activity
What are We Trading Off
Complexity of the system grows
Testing becomes more complicated, as you will want to verify that audit events are properly recorded throughout all of your system(s)
Some ways to implement
Log audit events separately from your application logs. They can go into a database, a separate log stream, a third-party log aggregation system, or another type of durable cloud storage (such as S3 in AWS).
Have clear policies for data retention and eviction (financial data often needs to be retained for several years; 7 years is a commonly cited figure, but the exact requirement depends on the regulation and jurisdiction).
Be mindful of cost - most of this data can be moved into a less frequently accessed (and therefore less costly) storage tier (cold storage), as it will rarely be needed and its retrieval is typically not expected to be instantaneous.
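To make the first point a bit more concrete, here is a minimal sketch of a dedicated audit logger that is kept apart from the application logs. It only uses Python's standard library; the logger name, the event fields, and the sample action are my own illustrative assumptions, not a prescribed schema.

```python
import json
import logging
from datetime import datetime, timezone

# Dedicated logger so audit events can be routed independently of application logs.
audit_logger = logging.getLogger("audit")
audit_logger.setLevel(logging.INFO)
audit_logger.propagate = False  # keep audit events out of the application log handlers


def build_audit_event(actor: str, action: str, details: dict) -> dict:
    """Build a structured record of who did what, and when (UTC)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "details": details,
    }


def record_audit_event(actor: str, action: str, details: dict) -> dict:
    """Serialize the event as one JSON line and hand it to the audit logger."""
    event = build_audit_event(actor, action, details)
    audit_logger.info(json.dumps(event, sort_keys=True))
    return event


if __name__ == "__main__":
    # Assumption: a stream handler stands in for the real destination
    # (database, separate log stream, or object storage).
    audit_logger.addHandler(logging.StreamHandler())
    record_audit_event("user-42", "transfer.initiated", {"amount": "100.00", "currency": "USD"})
```

Because the audit logger has its own handlers and does not propagate, its retention and storage tier can be managed separately from everything else the application logs.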
Characteristic: Observability
What is it? There is often confusion around the difference between Monitoring and Observability. Indeed, there is a lot of overlap between the two concepts and they are often used interchangeably. One common way to distinguish between the two is this: Observability refers to the capability of identifying potential problems within the system before they happen, while Monitoring is the ability to see what goes on within the system(s) through logs, metrics, traces, alerts, and dashboards. Monitoring is one component that enables Observability.
For the purpose of this article, I am referring to both monitoring and observability under the Observability umbrella.
When is it important?
Any practical system needs a level of observability. The more complex the system is, and the more it is being used, the more sophisticated the observability mechanism should be.
What are We Trading Off
Observability increases the complexity of a system.
Depending on the level of observability and the mechanisms deployed, it can also affect application performance.
Cost - whether you adopt third-party/off-the-shelf solutions or build in-house, observability costs money and effort
Scalability - the observability system needs to scale with the business applications. This is a given but often-missed item - especially with observability systems deployed in-house/on-premises.
Some ways to implement
Using cloud-native or third party APM (Application Performance Monitoring) and Observability tools from vendors such as Datadog, New Relic, Dynatrace, Honeycomb, etc
Ensuring that your application is logging important information
For distributed systems (like microservices) - using traces to maintain line of sight for requests that span multiple systems
Using AIOps tools (generally available as part of above mentioned third party solutions) to discover and predict application behavior
Using open standards such as OpenTelemetry
Ensuring that alerts, dashboards, and reports are configured based on exposed metrics, logs, and traces. In other words, ensure there are meaningful artifacts that can be monitored and acted on in real time.
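As a rough illustration of logging important information and keeping line of sight across a request, the sketch below wraps an operation so that its latency and errors are recorded and every log line carries a request ID. It uses only the standard library; the in-memory dictionaries stand in for a real metrics backend, and all names (`observed`, `load_profile`, `handle_request`) are hypothetical. In practice you would more likely reach for OpenTelemetry or a vendor agent than roll this yourself.

```python
import logging
import time
import uuid
from collections import defaultdict
from contextvars import ContextVar
from functools import wraps

logger = logging.getLogger("app")

# Correlation ID for the request currently being handled (hypothetical convention).
request_id = ContextVar("request_id", default="-")

# In-memory stand-ins for a metrics backend.
durations = defaultdict(list)    # operation name -> list of durations in seconds
error_counts = defaultdict(int)  # operation name -> number of failed calls


def observed(operation: str):
    """Decorator that records latency, counts errors, and logs with the request ID."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                error_counts[operation] += 1
                logger.exception("op=%s request_id=%s failed", operation, request_id.get())
                raise
            finally:
                elapsed = time.perf_counter() - start
                durations[operation].append(elapsed)
                logger.info("op=%s request_id=%s duration_ms=%.1f",
                            operation, request_id.get(), elapsed * 1000)
        return wrapper
    return decorator


@observed("load_profile")
def load_profile(user_id: int) -> dict:
    return {"id": user_id}


def handle_request(user_id: int) -> dict:
    # Assign one ID per incoming request so every log line can be correlated.
    request_id.set(uuid.uuid4().hex)
    return load_profile(user_id)
```

Alerts and dashboards then become queries over these artifacts, for example the error count per operation or a latency percentile computed from the recorded durations.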
Characteristic: Scalability/Elasticity
What is it?
The ability of the application to support increasing traffic/larger number of requests or users. How well does your system adjust from serving 10 transactions per second (TPS) to serving 1000 TPS? What happens when the system grows from a user base of 10,000 users to 1 million users?
Scalability is used interchangeably with the concept of Elasticity in some cases while in other cases, the two denote slightly different aspects of the system. To be exact, Scalability refers to a general ability of the system as a whole to accommodate ever-growing loads over a long span of time. Elasticity refers to the ability of the system to accommodate fluctuating load/traffic during a specific timeframe.
To simplify things, I am combining the two concepts here.
When is it important?
If your application is expected to grow as time passes, you need to be thinking about scalability. This applies to most systems that are deployed in a production environment to an external user base.
The only time that you most likely don't need to worry about scalability is if you know that system usage will be capped at a known limit (for example - an internal company application). This includes production applications where the user base is known to be very small and where the requirements of the application do not warrant its growth.
What are We Trading Off
Complexity of the system grows
Deployability becomes more difficult. It's much easier to deal with deploying one container than 1000 - even if you are using a container orchestration framework.
The more scalable the system is, the harder it is to maintain Observability because there are more components to observe, track, and investigate.
Testability might be more challenging, as the testing use cases grow to cover the distributed nature of the application.
Some ways to implement
Using container orchestration systems such as Kubernetes to dynamically adjust the level of compute needed (number of containers)
Leveraging serverless services in Cloud environments where scaling, for the most part, is the responsibility of the Cloud itself
Developing your application in such a way that it lends itself well to scaling out. For "single-threaded" runtimes such as Node, that means not blocking the event loop for longer than necessary. For multi-threaded applications, this means leveraging concurrency in an optimal way (avoiding deadlocks, using non-blocking synchronization, etc.)
Developing and deploying using the stateless paradigm as much as possible where maintaining and passing state is only done when necessary.
Identifying bottlenecks by running load, stress, and traffic tests frequently.
Favouring scaling out over scaling up
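To give a feel for what "dynamically adjusting the level of compute" means, here is a small sketch of the ratio that the Kubernetes Horizontal Pod Autoscaler documents for its scaling decision: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The function name, parameters, and the min/max clamp defaults are my own, and the real controller also applies a tolerance and stabilization behavior that I am leaving out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Return ceil(current_replicas * current_metric / target_metric), clamped to bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    # Clamp so a metric spike cannot scale past the configured ceiling,
    # and a quiet period cannot scale below the configured floor.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 3 replicas at 90% average CPU against a 60% target: scale out.
    print(desired_replicas(3, 90.0, 60.0))
    # 4 replicas at 30% against a 60% target: scale in.
    print(desired_replicas(4, 30.0, 60.0))
```

The same ratio covers both directions, which is why it is a useful mental model for elasticity: load above the target adds replicas and load below it removes them.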
Characteristic: Responsiveness
What is it?
The responsiveness of an application is its ability to provide some kind of a response in a reasonable timeframe to the caller - even if the application is under significant load. In other words, the application is able to handle and act upon user actions without the user experiencing delays. This goes for both frontend and backend applications.
When is it important?
When the application is user facing - i.e., there is an actual user waiting for something to happen.
Whenever your use case does not have much tolerance for the application being non-responsive. For example, someone using a bank's website to apply for a loan will probably be more tolerant of delays (due to lack of choice) than someone using a service to get real-time stock quotes for day trading.
What are We Trading Off
Often, responsiveness is directly related to scalability. The more scalable the system is, the more likely it is to be responsive. However, the application also has to be architected in a way to ensure responsiveness.
Complexity of your application's internal design grows.
Testability becomes more complex to cover potential use cases that can limit responsiveness.
In some cases, we will need to trade off responsiveness for correctness. For example, consider the case of an attempt to get data from service A and service B, but service B is not reachable. You may want to return data from service A and not wait indefinitely for service B as that will compromise the responsiveness of the system. What happens is that you return data quickly (from service A) but it isn't the entirety of a response because service B data wasn't present.
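The service A / service B scenario above can be sketched roughly as follows. This is a minimal illustration using Python's asyncio, where each downstream call gets a time budget and the response is assembled from whatever arrived in time. The function names, the 0.5 second default, and the `complete` flag are assumptions of mine, and `service_a` / `service_b` are stand-ins for real network calls.

```python
import asyncio


async def fetch_with_timeout(coro, timeout_s: float):
    """Return the coroutine's result, or None if it fails or exceeds the timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return None


async def get_dashboard(fetch_a, fetch_b, timeout_s: float = 0.5) -> dict:
    """Query both services concurrently and return whatever arrived in time."""
    a, b = await asyncio.gather(
        fetch_with_timeout(fetch_a(), timeout_s),
        fetch_with_timeout(fetch_b(), timeout_s),
    )
    # "complete" tells the caller whether this is a full or a partial response.
    return {"a": a, "b": b, "complete": a is not None and b is not None}


# Hypothetical stand-ins for the two downstream services.
async def service_a():
    return {"balance": 100}


async def service_b():
    await asyncio.sleep(5)  # simulates an unreachable or very slow service
    return {"offers": []}


if __name__ == "__main__":
    print(asyncio.run(get_dashboard(service_a, service_b, timeout_s=0.1)))
```

Flagging the response as incomplete matters: it keeps the tradeoff visible to the caller instead of silently presenting partial data as the whole answer.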
Some ways to implement
Graceful degradation
Circuit breaking
Asynchronous processing
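Of the three, circuit breaking is probably the least self-explanatory, so here is a bare-bones sketch of the idea: after a number of consecutive failures the breaker "opens" and rejects calls immediately instead of making the caller wait on a dependency that is known to be down, and after a cool-off period it lets one trial call through. The class, its thresholds, and the injectable clock are all illustrative assumptions; production libraries handle concurrency and more nuanced half-open logic.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-off elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and fall back to a cached or reduced response, which is where circuit breaking and graceful degradation meet.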
Characteristic: Fault Tolerance
What is it?
The ability of a system to continue functioning when one or more of its components fail.
Fault Tolerance is also related to application responsiveness. The difference is that responsiveness is more about ensuring end-user experience, whereas fault tolerance focuses on the system being able to continue operating and producing results when only a subset of its components is working.
When is it important?
Mission critical systems.
What are We Trading Off
Much like with responsiveness, we may be trading off correctness of system output for its overall ability to operate.
Some ways to implement
Graceful degradation
Standby replicas (for both compute and storage)
Active/Active or Active/Passive deployment
Architecting the system in such a way that disruption to its key components is minimized
Monitoring for faults
Using BASE semantics (basically available, soft state, eventually consistent) instead of ACID transactions where the use case allows it
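As a simplified picture of an Active/Passive setup with a standby replica, the sketch below tries the primary first and falls back to the standby when the primary is unreachable. The replica functions are hypothetical stand-ins for real clients, and I am treating only `ConnectionError` as a fault worth failing over on, which is an assumption you would tune for your own stack.

```python
from typing import Callable, Sequence


class AllReplicasFailedError(Exception):
    """Raised when neither the primary nor any standby could serve the request."""


def read_with_failover(key: str, replicas: Sequence[Callable[[str], str]]) -> str:
    """Try each replica in priority order and return the first successful read."""
    errors = []
    for read in replicas:
        try:
            return read(key)
        except ConnectionError as exc:
            errors.append(exc)  # remember the fault and try the next replica
    raise AllReplicasFailedError(f"{len(errors)} replica(s) failed for key {key!r}")


# Hypothetical replicas: the primary is down, the standby is healthy.
def primary(key: str) -> str:
    raise ConnectionError("primary unreachable")


def standby(key: str) -> str:
    return f"value-for-{key}"


if __name__ == "__main__":
    print(read_with_failover("account-1", [primary, standby]))
```

Note the correctness tradeoff mentioned earlier: a standby that replicates asynchronously may serve slightly stale data, so the system keeps operating at the cost of a possibly outdated answer.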
Characteristic: Extensibility
What is it?
The ability to easily add new functionality and new integrations to your system.
When is it important?
When we know the application might grow and pivot in unexpected ways.
When there is a requirement for adding new components
What are We Trading Off
When extensibility is a requirement we're going after, significant planning needs to be invested ahead of time in architecting the system in such a way that it lends itself more easily to being extended.
Some ways to implement
Modular software - creating decoupled, highly-cohesive, encapsulated services that aren't dependent on each other (both in terms of code and architecture)
Using open standards and best practices. For example, leveraging REST or a popular message queue framework to connect microservices.
Using clear API contracts
Decoupling data stores/databases so that there are different data stores for different domain data
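At the code level, modularity and clear contracts can look something like the sketch below: callers depend on a small contract, and new functionality is added by registering another implementation rather than by editing existing code. The `Notifier` contract and the email/SMS channels are made-up examples of mine.

```python
from typing import Protocol


class Notifier(Protocol):
    """Contract every notification channel must satisfy (hypothetical example)."""

    def send(self, recipient: str, message: str) -> str: ...


_registry: dict = {}


def register(channel: str, notifier: Notifier) -> None:
    """Add a new channel without modifying existing callers."""
    _registry[channel] = notifier


def notify(channel: str, recipient: str, message: str) -> str:
    """Dispatch through the contract; this function knows nothing about concrete channels."""
    if channel not in _registry:
        raise KeyError(f"no notifier registered for channel {channel!r}")
    return _registry[channel].send(recipient, message)


class EmailNotifier:
    def send(self, recipient: str, message: str) -> str:
        return f"email to {recipient}: {message}"


class SmsNotifier:  # added later; notify() did not have to change
    def send(self, recipient: str, message: str) -> str:
        return f"sms to {recipient}: {message}"


register("email", EmailNotifier())
register("sms", SmsNotifier())
```

The upfront cost is visible even in this toy: someone had to decide what the contract is before the second channel existed, which is exactly the planning investment described above.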
Characteristic: Testability
What is it?
The ability of the system to lend itself easily to being both manually and automatically tested in its entirety as well as by components.
When is it important?
When we want high confidence that testing exercises are covering as many of the application's flows as possible
What are We Trading Off
System complexity grows. Often, especially with distributed systems and asynchronous flows, testing every flow is difficult if not impossible.
The system needs to be designed and architected in such a way as to allow components to be tested in isolation (this goes both for code and infrastructure).
Some ways to implement
Testable (i.e., modular) code (code that lends itself to unit testing and integration testing)
Modularity where application functionality can be mocked
Ensuring that as much of the testing as possible can be automated
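Here is a small sketch of what "functionality that can be mocked" tends to mean in practice: the business logic receives its dependency from the outside, so a test can substitute a mock and run in isolation. `PriceService` and its rate provider are hypothetical names; the mocking itself uses `unittest.mock` from the standard library.

```python
from unittest.mock import Mock


class PriceService:
    """Business logic that depends on an injected rate provider (hypothetical names)."""

    def __init__(self, rate_provider):
        self._rate_provider = rate_provider

    def convert(self, amount: float, currency: str) -> float:
        rate = self._rate_provider.get_rate(currency)
        return round(amount * rate, 2)


def test_convert_uses_injected_rate():
    # The mock replaces the real provider, so no network or database is needed.
    provider = Mock()
    provider.get_rate.return_value = 1.25
    service = PriceService(provider)

    assert service.convert(10.0, "EUR") == 12.5
    provider.get_rate.assert_called_once_with("EUR")


if __name__ == "__main__":
    test_convert_uses_injected_rate()
```

Had `PriceService` created its own provider internally, the same test would need a live rate service, which is the kind of coupling that makes automation painful.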
Characteristic: Performance
What is it?
Performance is closely related to scalability and elasticity and there is much overlap between these characteristics. The difference is that, typically, scalability and elasticity refer to the system's capacity to successfully adjust to growing loads. Performance, on the other hand, refers to the ability of the system to process all of that load within a reasonable timeframe.
When is it important?
Most user-facing systems need to be performant enough to give a positive user experience. What "performant enough" means really depends on the context of the application. For example, for most websites, a response that takes more than a few seconds is typically considered a bad user experience (though it varies depending on what that website is doing).
What are We Trading Off
Ensuring performance requires that the system is architected in such a way as to anticipate (and remove) bottlenecks. It also requires careful optimization where it matters. This means dedicating enough effort for designing the system in this way, ongoing monitoring, and identifying problem areas.
Some ways to implement
Non-blocking I/O
Asynchronous programming and architecture (messaging, event-streaming)
Decoupling systems
Optimizing long running processes and CPU intensive operations
Optimizing application code
Choosing technologies with low network overhead (i.e., those that communicate over protocols lower down the stack than HTTP)
Leveraging cloud-native services with known performance guarantees (as long as these services are used as prescribed)
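To show why non-blocking I/O and asynchronous programming sit at the top of this list, the sketch below issues three independent I/O calls first one after another and then concurrently. `asyncio.sleep` stands in for network latency, and the call names and helper functions are assumptions for illustration only.

```python
import asyncio
import time


async def fetch(name: str, delay_s: float) -> str:
    """Hypothetical I/O call; asyncio.sleep stands in for network latency."""
    await asyncio.sleep(delay_s)
    return name


async def fetch_sequentially(delay_s: float) -> list:
    # Each call waits for the previous one, so latencies add up.
    return [
        await fetch("users", delay_s),
        await fetch("orders", delay_s),
        await fetch("prices", delay_s),
    ]


async def fetch_concurrently(delay_s: float) -> list:
    # gather() runs the awaitables concurrently and preserves argument order.
    return await asyncio.gather(
        fetch("users", delay_s),
        fetch("orders", delay_s),
        fetch("prices", delay_s),
    )


def timed(coro_fn, delay_s: float):
    """Run the coroutine function to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(delay_s))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, sequential_s = timed(fetch_sequentially, 0.2)
    _, concurrent_s = timed(fetch_concurrently, 0.2)
    print(f"sequential: {sequential_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

The sequential version takes roughly the sum of the three latencies, while the concurrent one takes roughly the longest single latency. This only helps when the calls are independent and I/O bound; CPU-intensive work needs a different treatment.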