[personal profile] jennyst
As many of you are already aware, the most recent AO3 deploy did not go as smoothly as we hoped, and we've had issues with some previous major releases too. The big items are all fixed now, but it reminded me that I know a few places (both work projects at my day job and Dreamwidth) where we deal with similar issues. Here are a few ideas I've been thinking about around the principles of managing incidents on an IT service.

Sometimes, when a technical group is trying to deal with a major problem or a code release that's gone wrong, management and task prioritisation become an issue. You have everyone putting out little fires with buckets, when actually it needs someone to go, "Wait, guys, this is a pretty big building and it's all on fire. I'm ringing the fire service - they have trucks with big hoses." But to do that, you have to have one person let go of a bucket in order to pick up the phone.

The general part



At my day job, to support a live IT service, there are usually several levels of support: 1st line support, 2nd line support, 3rd line support, and incident management. For a major code release, there are also a deployment manager and a support team for that release, who may or may not be the same people as the 2nd or 3rd line team.

There are three main ways a problem can be discovered and handled in this model:

  1. Sometimes a user reports it to a helpline or support form, which goes to 1st line (i.e. the support committee). They give a friendly helpful response, and pass on details of the bug/issue to 2nd line if it's working hours, or incident management if it's out of hours.


  2. Sometimes an automated tracker spots a problem. Dreamwidth, like many places, has a server monitoring tool tied to their official chat. Theirs is called Nagios, and says, "HEY MARK, HEY MARK, THE SERVER IS DOWN!" like a toddler trying to get your attention. In other places, your friendly systems person's customised script goes, "Email alert: that key bit of the system is running out of memory" (there's a rough sketch of that kind of check after this list). Either way, that goes to 2nd line, and if it's critical, copies incident management automatically.


  3. Sometimes 2nd line are browsing a server, doing their day job and checking stuff while they're at it, and spot an issue. If it's likely to affect the live service, they tell incident management (IM).
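
To make the second of those concrete: below is a very rough sketch, in Python, of the sort of home-grown check a friendly systems person might run from cron. It is not AO3's or Dreamwidth's actual monitoring - the thresholds, addresses, and mail setup are all invented for illustration - but it shows the shape of the idea: warn 2nd line when memory gets low, and automatically copy incident management when it crosses the critical threshold.

#!/usr/bin/env python3
# A toy memory-alert script - a rough sketch only, not AO3's or Dreamwidth's
# actual monitoring. The thresholds, addresses, and mail setup are invented.

import smtplib
from email.message import EmailMessage

WARN_THRESHOLD_MB = 512       # below this, warn 2nd line
CRITICAL_THRESHOLD_MB = 128   # below this, also copy incident management

SECOND_LINE = "2ndline@example.org"            # hypothetical addresses
INCIDENT_MANAGEMENT = "incidents@example.org"
SENDER = "monitoring@example.org"


def available_memory_mb():
    """Read available memory in MB from /proc/meminfo (Linux only)."""
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024   # the value is in kB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")


def send_alert(free_mb, critical):
    """Email 2nd line; copy incident management if the situation is critical."""
    msg = EmailMessage()
    msg["Subject"] = "Email alert: key system is running out of memory (%d MB free)" % free_mb
    msg["From"] = SENDER
    msg["To"] = SECOND_LINE
    if critical:
        msg["Cc"] = INCIDENT_MANAGEMENT
    msg.set_content("Free memory is down to %d MB. Please investigate." % free_mb)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    free = available_memory_mb()
    if free < WARN_THRESHOLD_MB:
        send_alert(free, critical=(free < CRITICAL_THRESHOLD_MB))

In real life you'd lean on something battle-tested like Nagios rather than a one-off script, but the routing principle is the same: everything goes to 2nd line, and IM only gets copied when it matters.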


Whichever way it starts, you now have two teams talking to each other - IM and 2nd line. Incident management's job is to co-ordinate, prioritise and make the tricky decisions. 2nd line do the actual fixing.

IM are the people who sometimes go, "Actually it's not a big deal, 1st line, go tell the users to stop whining," and 1st line make it all tactful and then tell the users how to work around it. Sometimes IM go, "Hey, this is a big problem, it may be 3am but we need to ring 2nd line NOW."

If 2nd line spot the issue, IM are the people who go, "Hey, maybe we should warn 1st line about this, since they're about to get a ton of angry phone calls that the system is down." And IM do the phoning, leaving 2nd line alone to get on with fixing it.

Sometimes 2nd line look at it and go, "The server is up and running, it's all telling me it's okay, but it's not working - there must be a bug in the code." And they send it to 3rd line. 3rd line are the coders - 2nd line are the sysadmins. In smaller organisations, 2nd line and 3rd line may be a combined team with a mixture of skills.

After a big release, the deployment manager will also be there straight afterwards, talking to 1st line via Incident Management and talking directly to 2nd line, monitoring things proactively. If a problem is found, they'll throw it straight to the release team to investigate the bug, but at the same time, the deployment manager will work with IM to decide on a plan of action. The coders can carry on coding, because the bug needs to be fixed anyway, while the deployment manager and IM make the big decision of whether to roll back the release now or live with it until the fix is ready. In a company, that involves talking to the business owner, who says how much of a commercial impact the bug is having, as well as assessing the impact on users.

The AO3 part



In the OTW, we don't really have any formal Incident Management. In theory, AD&T chairs do some of it for the AO3, but at the moment, they're too busy doing all the 3rd line stuff as well. And part of the point is that one person can't do both at the same time.

In a crisis, 3rd line have to concentrate on figuring out where the bug is and why and how to fix it. IM concentrate on telling 1st line to do admin posts, buying cake for 2nd line and coffee for 3rd line, and insulating them from each other so the crucial information gets through but everyone has somewhere to rant that's free from people yelling at them.

IM (or the deployment manager for a big deployment, where that manager works closely with IM and does some of this) know everyone well enough that they can go, "Jane Bloggs from 3rd line has just added Pro Plus to her Red Bull, I'd better tell 2nd line that it's going to be at least 6 hours so they can go and get some rest." They're the ones who can say to themselves, "Sue Jones from 1st line has now eaten at least 5 chocolate bars and is gazing longingly at the vodka bottle, I'd better warn 3rd line that the users are now 'really upset', not just 'a little bit upset', and see if there's anything we can do to get them another helper."

I would love to get someone from AD&T officially as the IM-type person, or ideally a couple of people in different timezones, but it has to be someone who's unlikely to be coding in a crisis, and at the moment, that's not true of the chair. It also doesn't need to be the chair - anyone can do it, so long as they know what type of decisions need to be approved by the chair. Sometimes rolling back a release is so obvious that IM can take that decision themselves, and sometimes the cost vs. benefit is not so clear, and the AD&T chair needs to make the decision with input from Support or Testers or Coders.

We could also have a discussion at some point about how 2nd line and 3rd line work is split between AD&T and Systems. We have an advantage where several of our senior people are familiar with both types of work, allowing them to analyse the root cause of a problem more effectively - e.g. Sidra is both Systems co-chair and a senior coder - but that also means that people can end up trying to do two jobs at once in a crisis.

Next AD&T meeting, we'll be discussing the last deploy and what we can learn for the future. Having seen some of the discussions, both internally and externally, I’m hopeful. We’ve got a lot of good processes in place already, so long as we continue to follow them, and we have people around with the expertise to advise us where we can improve things further.

Date: 2011-11-24 10:49 am (UTC)
From: [personal profile] zero_pixel_count
*blinks*

Thank you for posting this! I actually hadn't realised how much I needed a primer on what the various IT teams are supposed to be doing for my day-job. (I am now being gently mocked by my partner for not having known this stuff)

Date: 2011-11-24 12:57 pm (UTC)
From: [personal profile] samjohnsson
Yeah, there was a whole lot of "what the heck is actually broke and who do we tell, because throwing it into OTWCoders feels like shouting into the Center Ring." For that meeting, if she's not already planning on being there, you might want to wrangle some feedback from Matty.

In theory, could Incident Management be handled by the Release Manager - sie who merges the git branches, so that sie is most familiar with what code changed the most and what's most likely to break spectacularly? (I know we don't have one of those either, per se, but I'm dreaming!)

Date: 2011-11-24 01:05 pm (UTC)
From: (Anonymous)
Oh, this is very interesting to read. Lots of people wearing more than one hat can be tough but workable under normal circumstances, but in times of crisis/incidents, it's... not such a good idea. Your firefighting metaphor is very apt!

The one thing I'd like to add to your run-down of tasks is the role of Comms (sorry, but how could I not talk about that!). In the case of a major deploy, it'd also be a good idea to get a number of people together some time before the deploy happens to do some comms strategy preparation. This group, facilitated by a comms person and with input from 1st, 2nd and 3rd line as well as the IM or DM, then runs through a number of scenarios (from the deploy going as planned to the worst-case scenario) and identifies stakeholders and basic messages for each.

You'll then have a communications strategy both for the run-up to the deploy (where expectation management of users is an issue), and for after the deploy.

Of course, how much time you spend on each scenario is dependent on a number of factors, like the risk: how likely is it that this or that scenario happens? And if it does happen, how big is the impact? High likelihood + big impact = major preparation required; small chance and/or low impact means you don't need to work out every detail beforehand. It's still useful to have a rough idea of what to communicate with whom, should the scenario happen after all.

(Coincidentally, just earlier this week at work I ran a strategy session about a change to the system that's connected to our monthly salary payments. The chance of anything going so badly wrong that nobody will get paid in January is considered very, very small. But the impact would be huge! So we did spend some time talking about that. And while the project manager turned white and gasped for air when I brought it up, afterward he was appreciative of us having paid attention to it.)

(scribblesinink)

Date: 2011-11-24 02:39 pm (UTC)
From: [personal profile] unjapanologist
Thank you, this was really interesting! I'd love to read more about how the AO3 is handling/will change its handling of incident management, if you have any developments to talk about. (Am probably going to help with comms/support for a project and am eager to soak up knowledge)

Date: 2011-11-24 11:49 pm (UTC)
From: [personal profile] blueraccoon
So I am an incident manager by profession (my official title at my new job (squee!) is Ops Engineer - Triage Lead) and have spent about five years in the role for an e-commerce website (we sold travel and are known for our horrible jingle DOT COMMMM). So that's where I'm coming from on this.

My experience as IM is fairly similar to what you're describing, but I have a few other ideas and things that weren't in your post.

1. When we did releases at my old job, it was a highly organized process, and we channeled everything through the Change Management team and the Release Managers (hereafter known as CM and RelMan for short). As time for release grew closer, all bugs were looked at during triage and assigned priority by the devs and by RelMan, and changes were noted by the CM team. I think that the OTW is hitting the point where it really does need a formal CM and an official RelMan process. These people can be wearing other hats - obviously our pool of resources is finite, and one can argue that RelMan is more important, but I think we'll need both as the OTW continues to grow.

2. When we had a major release, we had a conference call going and all teams on the phone (or on site, but I recognize that's unfeasible for you), and the call was driven by the RelMan quarterback. The quarterback, who I think for your purposes would be the Deployment Manager, was the one giving updates or asking for them, identifying problems, and so on. (We didn't have a separate DM.)

3. RelMan ran the release until one of three things happened: (a) the release went smoothly, everything was fine, and we were back to BAU; (b) we uncovered a major Pri 1 bug that needed fixing NOWNOWNOWNOW; or (c) the release was running way late due to X and we needed to fix X before we could finish up. Until one of those happens, IM's role is, and should be, passive - we were required to be at releases, or on the call, but there was nothing for us to do.

Once IM gets engaged, things go a little differently. Usually what we did was inform the bridge that this was now being treated as an incident, instead of a release, and that we were focusing all resources on (let's say we're in situation b) fixing the bug that's breaking mobile browsing of the site. Pretty similar to what you said above: we would be the ones asking for updates from the dev team (who I guess would be your 3rd line), we would be the ones notifying our VPs and execs (in this case probably the Board members and committee chairs), and we would notify our operations center (1st line), who would then notify our contact center so they were aware when they started getting phone calls or emails. In my experience, IM does not and never has contacted the end user directly, mostly because IM is focused on getting things fixed and doesn't, honestly, have the time or the skills to deal with the end user.

So I think that for OTW/AO3, what you need are multiple roles. You definitely need a release manager, someone to quarterback and keep track of things and send out notifications (e.g. "It's ten o'clock and we've reached these milestones, we are green (on schedule), and our next update will be sent at eleven"). Your RelMan should not be one of your existing support staff playing a role in the release. RelMan needs to have the objectivity to step back and say, "Look, I know you're passionate about the code, but it's not working and we don't have time to fix it. Roll it back, you can debug next week."

You should have an IM, partly so the RelMan can catch a break if things go FUBAR, or at least a second RelMan; either is okay if people are wearing multiple hats. The thing is, you have to make it very clear when you're switching from treating something as a "release" to treating it as an "incident", which is my argument in favor of IM + RelMan.

And lastly, what I think you need and don't have is a proper post-mortem process. At my last job, this was run by IM, although we handed off to problem management, and in a lot of places I think it's run solely by problem management.

For those who aren't familiar with the post-mortem process, it's essentially a post-incident (or post-release) autopsy. The IM or person leading the incident lays out what happened, what went wrong, and what was done to fix it, and then the various teams involved discuss root cause, what we can do better next time, and what we did right this time, and identify corrective actions and steps to prevent it happening in the future. Without a post-mortem you are in danger of the same thing happening again, because the one person who remembers that last year the server ran out of memory while running ten thousand processes is no longer with the OTW, or is sick, or something, and nothing's written down, so no one knows and boom, the server falls over again.

We used to have post-mortems for every release, honestly, which I think is a good thing; the more you understand about what happened, the better off you'll be in future.

Whew. I think I've rambled enough, but if I've said anything that doesn't make sense or that you disagree with, let me know. Note that my LJ/DW access is going to be sketchy until Monday, but you can reach me via email, or ping [personal profile] sanders for my cell phone if it's really urgent. :)

Date: 2011-11-25 11:31 am (UTC)
From: (Anonymous)
"we need to start being more aware of what they do and ensure that the work is done, even if it's by a few people wearing several hats."

Yeah, the problem is perhaps not so much people wearing a number of hats, but people wearing hats that carry potentially conflicting roles or responsibilities. Like [personal profile] blueraccoon pointed out: the RelMan shouldn't be part of the staff playing a role in the release, because they need to have a certain objectivity. Same thing with Board members who are Chairs or Staffers and as such are hierarchically answerable to themselves.

I think if the org pays more attention to how the roles get divided, it's still possible for one person to wear more than one hat. Not only in "software release land", but in the day to day business of the org as well.

Posts like these make me happy.

(scribblesinink)

Date: 2011-11-26 05:40 am (UTC)
From: [personal profile] azurelunatic
I have aspirations to become much better at facilitating various communication between groups, and perhaps eventually becoming some form of project manager somewhere, so this is very amazingly fascinating to me.
