Book Review: The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win

Summary

The Phoenix Project is the top-rated book for DevOps.
It is a novel set in IT that promotes DevOps practices (Lean Manufacturing, the Theory of Constraints, and the Three Ways) in businesses.

The main story follows Bill Palmer, an IT operations middle manager who is promoted to VP of IT Operations at the company he works for. His new role involves supervising around 150 people. Bill’s major assignment is to get a 20-million-dollar project called Phoenix deployed into production. The company’s survival depends on launching Phoenix, but the project is already behind schedule and over budget.

Despite Bill’s warnings, pressure from higher management forces the deployment of Phoenix. The launch is a disaster: not only is Phoenix compromised, but so is the software on all of the company’s point-of-sale systems. This disaster makes Bill reflect and seek the advice of Erik (a prospective board member who becomes his mentor).

Perhaps the greatest wisdom Erik provides is that IT is no different from a manufacturing plant and is therefore subject to the same laws and principles. This comes as a shock to several characters in the novel because, to them, technology work is performed by highly skilled knowledge workers and so differs from physical or assembly-line operations. To them, technology is an endeavor of the mind, and hence it resembles artisan work. Erik shows that the mess the company is in results from neglecting more than 40 years of wisdom from scientifically grounded management movements such as the Theory of Constraints, Lean Production (or the Toyota Production System), and Total Quality Management.

Following Erik’s advice, Bill changes the way IT Operations works, including its relationship with the Development team. This allows the two teams to quickly deploy a new project (Project Unicorn) that is very successful and returns the company to profitability. The success gets Bill promoted to CIO and starts a grooming process for him to become the company’s next COO.

Main takeaways

Several breakthroughs allowed the IT Operations department to change. I would like to focus on the first two. First, the notion that IT should be treated like a factory. To increase the throughput of the IT department, the first problem to solve is work in process (WIP). Having many unfinished projects in flight at the same time means that nothing is delivered on time. In the end, the work that gets done is the work of whoever can shout the loudest and arm-wrestle the IT workers. To get out of this situation, the manager needs to first stop taking on new projects (to the extent possible), protect the workers from outside interruptions (especially the constraint, or bottleneck, resource), cancel WIP that does not add value to the company, and then start adding projects at the rhythm of the bottleneck.

Second, a bad scenario for an IT department is to live in constant incident/firefighting mode, because firefighting is not “work” at all and does not add value to the company. To escape this situation, the book suggests starting by tracking all the changes made in production. For a given week, IT operations people should list all the changes they need to make to the systems (for example, opening a firewall port or modifying the schema of a database). These changes must then be classified as ‘high-risk’, ‘middle-risk’, or ‘low-risk’.

High-risk changes are changes to systems that are very fragile. The example the book gives is third-party software from a vendor that no longer exists, so there is no possibility of support. Modifying this software is error-prone, and if it goes down it is very complicated to bring back up. High-risk changes must be scheduled by higher IT management; everyone involved should be informed of the change and give their approval (ideally the success rate and expected downtime should be provided as well); and there should be contingency plans in case something goes wrong. “You know, like having firefighters and ambulances lined up in the runway, ready to spray safety foam when the airplane lands in flames”.

Low-risk changes are changes that people make all the time and that have been done successfully many times. These changes do not need approval or scheduling, but they still need to be submitted. For middle-risk changes, the idea is that the submitter has the responsibility and accountability to consult and get approval from the people potentially affected. Afterwards, the request is analyzed and scheduled by higher IT management. Unauthorized changes are not permitted, and neither are undisclosed changes during an outage. On top of this, the book suggests practicing incident calls every two weeks. For each exercise, the timetable of the scheduled changes must be at hand.
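
The change rules described above can be sketched as a small triage routine. This is only an illustration under my own assumptions: the names Risk, ChangeRequest, and required_steps are hypothetical and do not come from the book, and the risk level given to each example change is a guess made for the demo.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low-risk"
    MIDDLE = "middle-risk"
    HIGH = "high-risk"


@dataclass
class ChangeRequest:
    description: str
    risk: Risk


def required_steps(change: ChangeRequest) -> list[str]:
    """Return the steps a change goes through before reaching production."""
    # Every change is submitted, whatever its risk level.
    steps = ["submit the change"]
    if change.risk is Risk.MIDDLE:
        steps += [
            "submitter gets approval from the people potentially affected",
            "higher IT management schedules the change",
        ]
    elif change.risk is Risk.HIGH:
        steps += [
            "inform everyone involved and get their approval",
            "prepare a contingency plan",
            "higher IT management schedules the change",
        ]
    return steps


# Hypothetical changes; the risk levels here are assumptions for the demo.
for change in [
    ChangeRequest("open a firewall port", Risk.LOW),
    ChangeRequest("modify a database schema", Risk.MIDDLE),
    ChangeRequest("patch the unsupported vendor software", Risk.HIGH),
]:
    print(change.risk.value, "->", required_steps(change))
```

The point of the sketch is that low-risk changes stay cheap (submit and go), while the overhead of approvals and contingency planning is reserved for the few changes that carry most of the risk.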

Conclusion

I would recommend this book first to technology managers. The wisdom in it helps them handle the complexity of managing a technology team. The second group I would recommend it to is developers and operations people (System Administrators, Network Engineers, Database Administrators, etc.). I think everyone in development and operations can relate in some way to the characters and the situations in the novel. Reading this book will help you better understand your managers’ struggles while also providing you with plenty of ideas to implement in your company, something that will groom you for leadership later on.

PS: Here are my major extracts and lessons from the book:

Throughout the book, the lessons that I learned (from any of the characters) are the following:

What is expected of the Infrastructure / Operations department: It is expected to keep the lights on. Infrastructure should be like the toilet: you use it and you never worry about it not working.
It is very dangerous when a developer rushes code into production just before going on vacation.
There should be a system in which people communicate the changes they are going to perform in production. It would be even better if the system allows authorization from senior engineers and scheduling of changes.
It is a bad idea to make infrastructure rush at the end of a project.
A non-priority project rushed into production generated the bug that caused a lot of trouble for the company.
A company should not depend on a single person who is the only one that knows and understands the whole system.
There should be a system to track the projects a team is working on. It should not happen that no one knows how many projects (WIP) a team is working on at the same time. Hiring smart people and tasking them with areas of responsibility is not a good enough solution.
In the company, 75% of the staff time went into solving incidents (firefighting). This was caused by lots of WIP inside the system.
Part of that high wait time and high WIP was due to different executives going directly to their favorite IT person to ask for a favor or to press for something to get done.
Work In Process (WIP) is the silent killer. Accumulating WIP means that one would be accumulating inventory (work to do) in the bottleneck resource. This leads to not delivering on time.
As a manager, you should not simply accept new work without taking into account the availability of all the work centers.
20% of the changes to production generate 80% of the risk.
How to handle a severity 1 outage? No one touches anything, and the record of changes to production is reviewed. Actions are discussed before they are implemented. (The last thing you want to do is make things worse and complicate establishing the root cause.)
The guy who had all the information about the company had two problems. One, he worked on everything and pleased everyone (being their personal Geek Squad) at the expense of the most important project of the company. Two, he probably viewed his knowledge as a sort of power, since it put him in a position where he was virtually impossible to replace.
Infrastructure / Operations works like a factory.
Be very wary when higher management forces you to pass the point of no return in a migration.
There are 4 types of work:
One) Business projects
Two) Internal IT projects
Three) Changes (to production)
Four) Firefighting / Unplanned Work / Anti-work.
Unplanned work is the most destructive type of work. It is not really “work” at all. The others are what you planned on doing; unplanned work is what prevents you from doing it. It always takes time from your goals. That is why it’s so important to know where your unplanned work is coming from. Unplanned work kills your ability to do planned work. So you must always do whatever it takes to eradicate it.
Any improvement not made at the constraint is just an illusion.
To improve output you need to: One) Identify the constraint. Two) Exploit the constraint [make sure that the constraint is not allowed to waste any time. Ever. It should never be waiting on any other resources for anything, and it should always be working on the highest priority commitment the IT operations organization has made to the enterprise]. Three) Subordinate everything else to the constraint: set the tempo of the flow according to the constraint.
Remember, it goes beyond reducing WIP. Being able to take needless work out of the system is more important than being able to put more work into the system. To do that you need to know what matters to the achievement of the business objectives, whether it’s projects, operations, strategy, compliance with laws and regulations, security, or whatever.
Technical Debt. It comes from taking shortcuts, which may make sense in the short term. But like financial debt, the compounding interest costs grow over time. If an organization doesn’t pay down its technical debt, every calorie in the organization can be spent just paying interest, in the form of unplanned work.
Unplanned work is not free, quite the opposite. It’s very expensive because unplanned work comes at the expense of … planned work.
Unplanned work has another side effect. When you spend all your time firefighting there’s little time or energy left for planning. When all you do is react, there’s not enough time to do the hard mental work of figuring out whether you can accept new work. So more projects are crammed onto the plate, with fewer cycles available to each one, which means more bad multitasking, more escalations from poor code, and more shortcuts.
Every work center is made up of four things: the machine, the man, the method, and the measure. We should standardize the constraint’s work so that other people can execute it. Get those steps documented to enforce some level of consistency and quality, as well. This standardization is the “bill of materials” for the work in IT Operations.
Improving daily work is even more important than doing daily work.
Mike Rother: it almost doesn’t matter what you improve, as long as you’re improving something. Because if you are not improving, entropy guarantees that you are actually getting worse.
The wait time for a given resource is the percentage of time that the resource is busy divided by the percentage of time that it is idle. So if a resource is fifty percent utilized, the wait time is 50/50, or 1 unit. If the resource is ninety percent utilized, the wait time is 90/10, or nine times longer. And if the resource is ninety-nine percent utilized? Then the wait time is 99/1, or ninety-nine times longer.
Do not be like John (the security manager), who loads people with work that has no value to the company.
If it’s not on the kanban board it won’t get done. And more importantly, if it is on the kanban board, it will get done quickly. You’d be amazed at how fast work is getting completed because we’re limiting the work in process. Based on our experiment so far, I think we’re going to be able to predict lead times for work and get faster throughput.
Everyone needs idle time or slack time. If no one has slack time, WIP gets stuck in the system. Or more specifically stuck in queues, just waiting.
The First Way: you must gain a true understanding of the business system that IT operates in. You must leave the realm of IT to discover where the business relies on IT to achieve its goals.
The flow of work should go in one direction only: forward. Create a system of work in IT that does that. Remember, the goal is single-piece flow.
On the manufacturing floor, whenever we see work go backward, that’s rework. When that happens, you can bet that the amount of documentation and information flow is going to be pretty poor, which means nothing is reproducible and that it’s going to get worse over time as we try to go faster. They call this ‘non-value-add’ activity or ‘waste’.
Until code is in production, no value is actually being generated because it’s merely WIP stuck in the system.
In ten years, I’m certain every COO worth their salt will have come from IT. Any COO who doesn’t intimately understand the IT systems that actually run their business is just an empty suit, relying on someone else to do their job.
A dysfunctional marriage assumes that the business and IT are two separate entities. IT should be embedded either in business operations or in the business itself. Voila! There you go. No tension. No marriage, and maybe no IT department either.
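
As a footnote to the wait-time lesson in the list above (busy percentage divided by idle percentage), here is a minimal sketch that works the rule of thumb out for a few utilization levels. The function name relative_wait_time is my own, and the 95% case is an extra data point I added; the formula itself is just the one quoted in the list.

```python
def relative_wait_time(percent_busy: float) -> float:
    """Wait time as busy % divided by idle %, per the rule of thumb above."""
    if not 0 <= percent_busy < 100:
        raise ValueError("percent_busy must be at least 0 and below 100")
    return percent_busy / (100 - percent_busy)


for busy in (50, 90, 95, 99):
    print(f"{busy}% busy -> {relative_wait_time(busy):g} units of wait time")
```

The jump from 9 units at ninety percent to 99 units at ninety-nine percent is why the book insists that everyone needs some slack time: the last few points of utilization are paid for with very long queues.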