Lean Web Operations — Planning for the Unpredictable (IV)

Published 2017-11-28 by Jochen Lillich

Our business is managed web hosting, so keeping our web operations work in flow is essential. In this article series, I’m sharing what we’ve learned in that regard.

Part 1: How we got ourselves into trouble
Part 2: How we learned to keep work in flow
Part 3: How we learned to make quick decisions
Part 4: How we learned to manage interruptions
Part 5: How we’re doing today

How we learned to manage interruptions

Over the previous parts of this series, I’ve explained how many of our efforts didn’t lead to results because we lacked focus and direction. There’s a well-known time management method called the Eisenhower Matrix, a simple way of deciding on which tasks you should focus. This matrix has four quadrants between the axes of “Importance” and “Urgency”. Here’s a short introduction:

The tasks in which you should invest a minimum of your attention (ideally none at all) are the ones in the “not important, not urgent” quadrant, such as “trying out that new text editor”. The most noise comes from the “urgent but not important” quadrant. Many phone calls and Slack conversations fall into this category; you feel like you can’t afford to ignore them, even though you have priority projects where your efforts would create so much more value.

That leaves the two other quadrants high on the “importance” axis. And that’s where our main problem was. The fact that there are not one but two quadrants of important work created a daily dilemma for everyone on our team.

Nobody likes to be rushed, and planning things prevents them from becoming urgent. So we consciously put as many of our projects as possible into the “important but not urgent” quadrant. Regardless, from the depths of the “important and urgent” quadrant kept emerging what we call “unplanned work”:

* applying security updates
* managing incidents
* resolving support requests
* provisioning customer orders

Since quality is one of our core business values, there can be no doubt that these tasks need to get done quickly. However, this unplanned work always competes for our time against the projects we need to get done. In practice, we found it impossibly hard to find a balance here; everyone’s scales tended to tip in favour of the unplanned.

We thought that by “load-balancing” these interruptions between team members, their impact on each would be minimal. We made it a team effort to quickly respond to new support tickets, roll out fresh security updates etc. I realised too late that in reality, this approach was forcing everyone into constant firefighting mode. In other words, by sharing the pain, we were multiplying the pain. That had a terrible effect on everyone’s productivity and job satisfaction.

It was this realisation that led me to make the changes I’m describing in this blog series. Here’s what I wrote in my journal in November 2016:

“It looks as if we have an accountability problem. I realised this when I went through the list of open ops projects. There are so many projects, many of them untouched for more than a year. I’m sure that this is a source of frustration and a feeling of being overwhelmed. This list keeps telling us “We don’t get anything finished around here.” That’s not the truth, because we do a lot of stuff outside of projects, which doesn’t help to improve our accountability score either.”

In my defence, I did try to hold people accountable. The problem was that every time we discussed a delayed project, I got presented with a litany of reasons why other work had once again sucked up all our capacity. Support requests were taking whole days to resolve, software updates didn’t go smoothly, and incidents ripped everyone out of their assigned work. There also were meetings, family emergencies, vacations and sick days.

These were all legitimate reasons that we wouldn’t be able to eliminate, so our first impulse was to at least keep them to a minimum. For example, we started to draw a more definite line between “support request” and “custom change request”. While our DevOps strategy allows for making changes to our managed hosting platform for individual customers, we can’t afford to handle them as part of our free tech support. The only place for them is in a project that gets ranked, planned and executed just like our own. And on top of that, it also needs to get billed. However, while that was proper business behaviour, it didn’t change our situation much; we only moved one type of interruption onto the pile of “projects that never get finished”.

My resignation grew, and I started to tell myself that in web operations, you’re simply doomed to work in this way until your company is big enough to have a separate IT projects team. If it doesn’t implode first.

What I didn’t see back then was that “big enough” actually starts at two people. Obviously, at that size, you won’t yet have a formal projects team. However, we’re not talking about organisational structure here but about focusing on what’s important, and the operations team at Spotify found out that even a single person can make a difference. They introduced a role called “sweeper” whose only task is to take care of all the daily interruptions and distractions, thus shielding the rest of the team so they can do planned work.

We started to experiment with this approach and adapted it to our situation. The “Libero”, as we named the role, rotates through our ops team in two-week shifts. Since the Libero is not supposed to take part in projects, everyone still gets to deal with urgent tasks, just not at the same time as planned work. Now it’s ten days of reactive work followed by weeks (!) of focused work.
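To make the rotation concrete, here is a minimal sketch of how two-week Libero shifts could be laid out on a calendar. The function name, the team members and the start date are all made up for illustration; this is not the tooling we actually use.

```python
from datetime import date, timedelta
from itertools import cycle, islice

SHIFT_LENGTH = timedelta(weeks=2)  # two-week shifts, as described above


def libero_schedule(team, first_shift_start, shifts):
    """Return a list of (start, end, name) tuples for consecutive shifts.

    `team` is cycled through in order, so everyone takes the role in turn.
    """
    schedule = []
    for index, name in enumerate(islice(cycle(team), shifts)):
        start = first_shift_start + index * SHIFT_LENGTH
        end = start + SHIFT_LENGTH - timedelta(days=1)  # inclusive last day
        schedule.append((start, end, name))
    return schedule


if __name__ == "__main__":
    team = ["Alex", "Blake", "Casey"]  # hypothetical team members
    for start, end, name in libero_schedule(team, date(2017, 1, 2), 4):
        print(f"{start} to {end}: {name}")
```

With the hypothetical team of three used here, each person would get four weeks of project work between two Libero shifts; a larger team stretches that focused stretch further.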

The introduction of the Libero role changed the game for us. Suddenly, we have uninterrupted time to focus on the things that we need to get done to be successful.

At first, we overcompensated and expected the Libero to catch every bullet for the team. That quickly turned out to be unsustainable, and we established a few ground rules:

The Libero is still a member of the ops team. They’re encouraged to ask for support, for example, if a task requires specialist knowledge or when unplanned work starts piling up.
Not all unplanned work is urgent. Custom changes disguised as support requests, for example, get passed to the ops team and converted into projects.
The Libero needs regular breaks to recharge. The remaining team members support this by taking the baton temporarily, for example during the Libero’s lunch break.
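The first two ground rules amount to a small triage decision, which could be sketched roughly as follows. The class names, the flags on a request and the backlog threshold are assumptions invented for this example, not a description of our ticket system. The third rule (breaks) is about people rather than tickets and is not modelled here.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LIBERO = "handled by the Libero"
    TEAM_HELP = "Libero asks the team for support"
    PROJECT = "passed to the ops team and converted into a project"


@dataclass
class Request:
    title: str
    is_custom_change: bool = False   # custom change disguised as support request
    needs_specialist: bool = False   # requires specialist knowledge


MAX_OPEN_ITEMS = 5  # hypothetical threshold for "unplanned work piling up"


def route(request, open_items):
    """Decide who deals with an incoming piece of unplanned work."""
    if request.is_custom_change:
        # Not all unplanned work is urgent: this becomes planned work.
        return Route.PROJECT
    if request.needs_specialist or open_items > MAX_OPEN_ITEMS:
        # The Libero is still part of the team and may ask for help.
        return Route.TEAM_HELP
    return Route.LIBERO


if __name__ == "__main__":
    print(route(Request("Reset SFTP password"), open_items=2).value)
    print(route(Request("Add custom PHP module", is_custom_change=True), open_items=2).value)
```

The order of the checks is a design choice in this sketch: custom changes are filtered out first, so they never count towards the Libero's workload, however busy the day is.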

When things go smoothly (which they increasingly do, I’m happy to add), there isn’t even enough unplanned work to occupy a person full-time. We took this as an opportunity to apply another one of our core values: freedom. We allow the Libero to spend slow periods as they see fit. There might be a conference recording they’ve wanted to watch, or a book they’d like to finish. We don’t mind either if they decide to do the laundry or watch Netflix.

After a few months, our perception of the role changed. The Libero isn’t just the poor soul who has to serve as the ops team’s lightning rod. The role is the reason why we’re now able to keep outages and support response times short and still deliver product improvements. The Libero is the key to keeping ourselves sane and our customers happy, the latter being confirmed by the feedback we’re getting.

The introduction of the Libero role was the third significant change we’ve made to improve our web operations work. In the final part of my series, I’m going to summarise these changes and explain how we measure their success.