Book Review: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (2017) by Martin Kleppmann

Camilo Matajira

Author’s purpose

The author’s purpose with this book is to “… explore how to create applications and systems that are reliable, scalable and maintainable” (p.673), specifically within the “diverse and changing landscape of technologies for processing and storing data” (p.7).

Concerning the main audience: for the author, the book is especially relevant for software engineers who need to decide on the architecture of the systems they work on (p.8). Concerning methodology: the book summarizes the most important ideas from many different sources and provides references to all the relevant literature (p.11).

Main themes of the book

As stated before, the main theme of the book is how to create applications and systems that are reliable, scalable, and maintainable (p.673); this is its all-encompassing theme. At a lower level, I identified the following sub-themes:

  1. For any given problem there are several solutions, each with different pros, cons, and trade-offs. (p.674)
  2. There is no such thing as a totally fault-tolerant system. You can anticipate certain faults and create a system robust to those faults, but it is impossible to create a system that is robust to all possible faults. As Kleppmann puts it: “If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space – good luck getting that budget item approved.” (p.23)
  3. You don’t have to adopt every fad. For example, a database does not need to be distributed, and vertical scaling is not necessarily bad: as with point 1, there are always trade-offs. Kleppmann points out: “A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can’t avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.” (p.38)
  4. It is better to have raw data available early than organized data available late. According to Kleppmann, “In practice, it appears that simply making data available quickly – even if it is in a quirky, difficult-to-use, raw format – is often more valuable than trying to decide on the ideal data model upfront.” (p.573)

Uniqueness of the book

What makes the book unique is that it condenses into a single volume the main ideas concerning data systems: it is, in effect, a textbook on data. On top of that, the author covers all these topics while keeping a highly rigorous style: the book is supported by more than 800 references.

Perhaps one of the things I found most valuable is that the author is willing to discuss current, trending technologies. Many authors avoid talking about specific technologies because the landscape changes very fast, and hence their books would quickly become outdated. However, it is hard to read that kind of book without falling asleep. Instead, Kleppmann discusses all the data buzzwords and the type of problem each tries to solve. He acknowledges that some technologies (for example, Hadoop) are beginning to fade away, but naming them is still preferable to stating principles without reference to any actual technology.

Comparing the book with the best in the field (software engineering)

Although I am not deeply immersed in the data field, it is possible to draw a comparison between this book and Uncle Bob’s ‘Clean Architecture’. The comparison is apt because both are software engineering books, and both claim to be about architecture.

For Bob Martin, one of the key architectural distinctions one must make is to differentiate policy from details. The policy is the most important part of your code; it is also known as the business rules or the entities. The business rules are the procedures that make or save the company money. Everything else is a detail and is secondary. For Bob Martin, databases are a detail.

“So let me be clear: I am not talking about the data model. The structure you give to the data within your application is highly significant to the architecture of your system. But the database is not the data model. The database is a piece of software. The database is a utility that provides access to the data. From the architecture’s point of view, that utility is irrelevant because it’s a low-level detail – a mechanism. And a good architect does not allow low-level mechanisms to pollute the system architecture.” (Bob Martin, Clean Architecture, p.278)

The question is whether, in some circumstances, the data architecture could be part of the policy, or whether it should always be considered a detail. I would say that for Martin Kleppmann, in data-intensive applications, the data architecture is part of the policy and business rules, or at least very close to them: “We call an application data-intensive if data is its primary challenge – the quantity of data, the complexity of data, or the speed at which it is changing – as opposed to compute-intensive, where CPU cycles are the bottleneck” (p.7). In data-intensive applications, the design of the data systems is close to the business rules, either because the business rules need the data or because external requirements demand high availability of the service; in either case, the data-intensive application is generating or saving money, and hence its data architecture could be part of the policy.

Recommendation

I recommend this book for senior system administrators, database administrators, and software engineers/data engineers. I recommend it for seniors because the book is most useful for people who will make architectural decisions. People with a more junior profile might find a more practical book more useful. However, a junior practitioner could still profit from Part III, ‘Derived Data’, in which Kleppmann discusses the tools for processing data in batch and streaming modes. These chapters cover a great deal of the ‘buzzwords of big data’.

For senior system administrators interested in data, this book provides a picture of the data ecosystem and helps them understand where tools like Hadoop, Spark, Kafka, Elasticsearch, OLTP and OLAP databases, and data warehouses fit, and in which situations these tools are useful. Senior system administrators may also find useful the sections on replication, partitioning, and network failures in distributed systems.

For senior database administrators and data engineers, the book is a mandatory read. In my opinion, this book is close to what the “UNIX and Linux System Administration Handbook” is for Linux system administrators. For senior software developers who work with data, this book could also be useful when designing the architecture of their applications.
