Designing Data-Intensive Apps Notes

vignesh dharuman
3 min read · Feb 7, 2024

Chapter 1: Reliable, Scalable and Maintainable Applications

I am reading Designing Data-Intensive Applications by Martin Kleppmann. This book has a lot of useful information that will help every software development team make better, well-understood trade-offs during their system architecting journey. In this series of articles I will try to publish the key takeaways [caution!! limited to my knowledge :) ] from each chapter.

[Please do read the complete book, as it has a lot of useful information put clearly in easily understandable terms. Use this article as a refresher.]

We call an application data-intensive if dealing with its data is the primary challenge. By dealing with data we mean the quantity of data, the complexity of data, or the speed at which the data changes. When building such applications, the most important aspects we need to consider are reliability, scalability and maintainability.

Reliability: Reliability means the system's ability to keep working correctly even when things go wrong. Things can go wrong because of hardware faults (which we can handle with redundancy) or software faults (which we can reduce through proper testing and monitoring).

Scalability: Scalability means the system's ability to cope with increased load. To understand the issues that might arise as load grows, we first need to identify the key load parameters of the system. For example, these might be requests per second for a web application, the ratio of reads to writes for a database, the number of simultaneous users in a chat application, the hit rate on a cache, etc.
Performance: Once the load parameters are identified, we need to investigate what happens to the system's components when these parameters increase, and how much we need to increase the resources of each component to keep performance at the required level under the higher load. The metrics used to evaluate the performance of a system vary with the nature of the system:
i) for a batch processing system, we care about throughput, which determines the total time the system takes to process a job.
ii) for online systems, it is the response time. Response time cannot be treated as a single number, as we get a slightly different response time each time we process a request. Hence it is better to use percentiles when measuring response time. We sort the measured response times from fastest to slowest; the median is the halfway point and is called the 50th percentile, meaning half of the requests were served faster than it. To see how bad the slowest requests are, we look at higher percentiles such as the 95th, 99th and 99.9th, which are the response times within which 95%, 99% and 99.9% of requests complete.
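To make the percentile idea concrete, here is a minimal sketch using the nearest-rank method. The function name `percentile` and the sample response times are made up for illustration; real monitoring tools may use a different interpolation method and will give slightly different numbers.

```python
import math


def percentile(response_times_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(response_times_ms)  # fastest to slowest
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]


# Hypothetical response times in milliseconds; one request is very slow.
samples = [120, 95, 110, 105, 2300, 130, 100, 98, 115, 125]

median = percentile(samples, 50)
p95 = percentile(samples, 95)
print(f"median={median}ms p95={p95}ms")
```

With these made-up samples the median stays close to the typical request, while the 95th percentile is dominated by the single slow request. That is why the higher percentiles are useful for spotting outliers.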

Percentiles are often used in Service Level Objectives (SLOs) and Service Level Agreements (SLAs), which are contracts that define the expected performance and availability of a service. For example, an SLA might state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (which means 99% of the measured requests should be served within 1 s), and that the service should be up at least 99.9% of the time (which means the service can have a maximum downtime of roughly 8.8 hours per year).
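The arithmetic behind the availability figure can be sketched as follows. The thresholds in `meets_sla` are the example values from the paragraph above, not values from any real contract, and both function names are my own for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Maximum downtime that still satisfies the availability target."""
    return period_hours * (1 - availability_percent / 100)


def meets_sla(median_ms, p99_ms, availability_percent):
    """Check measurements against the example SLA: median under 200 ms,
    99th percentile under 1 s, availability of at least 99.9%."""
    return median_ms < 200 and p99_ms < 1000 and availability_percent >= 99.9


print(f"{allowed_downtime_hours(99.9):.2f} hours of downtime allowed per year")
print(meets_sla(median_ms=150, p99_ms=800, availability_percent=99.95))
```

For a 99.9% target this works out to 8760 × 0.001 = 8.76 hours per year.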

Once we understand the impact of load on performance, we need to find ways of coping with the load. The common options are scaling up, or vertical scaling (where we move to a more powerful machine), and scaling out, or horizontal scaling (where we add more instances and distribute the load across them). Distributing stateless services across multiple machines is fairly straightforward, but running a stateful data system in a distributed setup adds a lot of complexity. Hence it is commonly recommended to keep a stateful system on a single node until the cost of scaling up becomes too high or high availability is required.
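As a rough illustration of why stateless services are easy to scale out, here is a minimal round-robin sketch. Round-robin is just one simple distribution strategy that I picked for the example, and the instance and request names are hypothetical. Since no instance holds any state, any request can go to any instance.

```python
import itertools


def assign_round_robin(requests, instances):
    """Pair each request with the next instance in a repeating rotation."""
    if not instances:
        raise ValueError("need at least one instance")
    rotation = itertools.cycle(instances)
    return [(request, next(rotation)) for request in requests]


# Hypothetical stateless service instances and incoming requests.
instances = ["app-1", "app-2", "app-3"]
requests = ["req-a", "req-b", "req-c", "req-d"]

for request, instance in assign_round_robin(requests, instances):
    print(f"{request} -> {instance}")
```

Coping with more load here only means adding another name to `instances`. A stateful system cannot be handled this simply, because each request must reach the node that actually holds the data it needs.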

Maintainability: For better maintainability, we need to consider three design principles:
* operability :- make it easy for the operations team to keep the system running.
* simplicity :- make the system easy to understand by removing as much complexity as possible, using good abstractions and not introducing complexity that is not required.
* evolvability :- make it easy to introduce changes to the system in the future.

Thanks for your time in reading through; I hope it was useful. That is it for this article, see you in the next chapter.

Happy Learning and Coding.
