Unicorns and Rainbows

My employer has been in the process of reorganizing for several months now. A couple of weeks ago, as part of this reorganization, I was moved to a “new” team. In actuality, this team is simply a small subset of the people I already worked with. It was an All-Star team of three. Our mission was … whatever we saw fit, related to everything in our domain of skill, interest, and concern. Mostly we would direct our efforts at release engineering, build engineering, systems architecture, and deep-investigation-style troubleshooting. Stuff we already did anyway, albeit much less formally.

When I first learned of the new team (and the team reorg plan in general), I was in a bad state of burnout. So bad, in fact, that I couldn’t even get excited about the new team. I just didn’t care. However, when the three of us got into a room with our PO for the first time, the energy was palpable. We are like-minded engineering types, and I have the utmost admiration for my teammates’ skill, professionalism, and opinions. I felt much more positive after that first meeting.

The new teams have been in effect for two weeks now. At the time the new teams formed, management placed a moratorium on all releases until we could, collectively, address some serious crashes and performance shortcomings our CMS had been experiencing.

The first week, my team addressed the single biggest performance and reliability pain point within our CMS: the database. It was a problem that had been plaguing us for months and was getting worse as time went on. We were experiencing a horrible cascade of knock-on effects that would eventually crush our database with 1000% load. It wasn’t so much that the database itself was the problem, but once we hit a tipping point, it severely hampered recovery efforts. We needed to do everything we could to eliminate the database as a mode of failure. We spent the week systematically reorganizing our database layer so that our web presentation layer used a load-balanced, horizontally scalable, read-only database connection. At the same time, since we had some momentum, we analyzed queries and discovered several columns we could index to improve matters further. These two changes produced a dramatic shift in performance and reliability: we saw a 40% reduction in database traffic, a 30% reduction in query times, and withstood an “event” that normally would have brought us to our knees for 2–3 hours, recovering in less than 20 minutes. Finally, just for good measure (and because we had a free morning), we fixed a problem where our application had mysteriously stopped delivering video through our CDN. That bug made streaming video a really bad experience, cost us extra money in bandwidth charges, and, we suspected, contributed to the severity of certain transient events that could cascade across the datacenter, wreak havoc, and prove difficult to recover from.
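For the curious, the shape of that database change looks roughly like the sketch below. It’s a minimal illustration only, assuming a SQLAlchemy-style data layer; the hostnames, table, and column names are placeholders, not our actual stack or schema.

```python
# Minimal sketch of a read/write split, assuming SQLAlchemy.
# Hostnames, tables, and columns are hypothetical placeholders.
from sqlalchemy import create_engine, text

# Writes still go to the primary.
write_engine = create_engine("postgresql://cms@db-primary/cms")

# Presentation-layer reads go to a load-balanced, read-only endpoint
# that fronts a horizontally scalable pool of replicas.
read_engine = create_engine("postgresql://cms@db-read-lb/cms")

def fetch_article(article_id):
    # Presentation-layer queries never touch the primary.
    with read_engine.connect() as conn:
        return conn.execute(
            text("SELECT * FROM articles WHERE id = :id"),
            {"id": article_id},
        ).fetchone()

def add_missing_indexes():
    # Indexes surfaced by the query-analysis pass; the column here is
    # illustrative, not one of the ones we actually indexed.
    with write_engine.begin() as conn:
        conn.execute(text(
            "CREATE INDEX IF NOT EXISTS idx_articles_published_at "
            "ON articles (published_at)"
        ))
```

The property that matters is that the presentation layer can only ever read from the replica pool, so a traffic spike there can’t pile extra load onto the primary.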

The second week, on the strength of the prior week’s success, we attempted a release. It had been three weeks since the last release, and a lot of unrelated changes were going out together, so we considered it a high-risk release and planned accordingly. The short story is the release failed anyway. We noted anomalous behavior during the release and aborted. During the post-mortem, we developed a hypothesis about what happened, and from it my team directed its attention at the source of errors during the release: memcache. After poring over documentation, wading through memcache client code, discussing, and debating, we felt we had a solution. We decided to attempt the release again. At the last minute we made the decision to drastically simplify our memcache setup. That single decision was pivotal. Quite by accident, we fixed a serious problem we’d been experiencing: roughly 70% of our log volume was directly attributable to memcache error messages, and after our change the memcache contribution approached 0%. An entire class of other errors in the logs ALSO disappeared with this change, which we didn’t expect. These were things we could not attribute to memcache and which superficially seemed entirely unrelated. It still nags at me that we accidentally solved such a massive set of problems, but it was a huge win. The release was successful the second time, and we had the opportunity to test our new setup against our nightmare recovery scenario. The system hardly broke a sweat. Plus, we can now expand capacity quickly and non-disruptively.
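I won’t reproduce our actual client configuration here, so treat the sketch below as purely illustrative, using pymemcache with hypothetical server names, timeouts, and keys. One way such a simplification can look is a single flat, consistently hashed pool with fail-fast timeouts in place of a more elaborate arrangement.

```python
# Illustrative only: one common shape a simplified memcache setup can take.
# Uses pymemcache; hosts, timeouts, and keys are hypothetical.
from pymemcache.client.hash import HashClient

# A single flat pool of memcached nodes with consistent hashing, instead
# of a more elaborate multi-tier client configuration.
cache = HashClient(
    [("cache-01", 11211), ("cache-02", 11211), ("cache-03", 11211)],
    connect_timeout=0.25,   # fail fast so a bad node can't stall requests
    timeout=0.25,
    ignore_exc=True,        # treat cache errors as misses, not page errors
)

def get_page_fragment(key, build):
    # Standard cache-aside: read, rebuild on miss, write back with a TTL.
    value = cache.get(key)
    if value is None:
        value = build()
        cache.set(key, value, expire=300)
    return value
```

Fewer moving parts between the application and the cache means fewer ways for a flaky node to surface as application errors, which lines up with the memcache noise in our logs dropping to nearly nothing.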

Yet, this is the uninteresting part of the story. Yes, yes, it’s fun to tell tales of success beyond your wildest expectations. It’s the thing great conference talks are made of. Those are fun. So why is it the uninteresting part? Because it was our *process* that paved the way, and ultimately that’s what produced these results.

But that’s a discussion for another post.
