An argument against out of hours system maintenance

Late night again…

“You have no problem with out of hours work, right?”

It’s that question during the interview that causes a knot in your stomach. You hope, even pray, that what they mean is the occasional bit of emergency response work or a crunch period at a critical part of a project, both of which you have no issue with. But after you accept that job you find out that the work is routine system maintenance.

I’m going to put forward the argument that in this, dare I say it, golden age of automation, continuous deployment, commodity hardware, virtualisation, micro-services and other cloud-related fluff, there are very few situations in which this work needs to be done out of hours. If anything, this model incurs additional risk.

(To any prospective employers out there: don’t take this as a refusal to do such work; this piece is simply about thinking about the issue differently.)

  1. It trashes work-life balance – Let’s get the selfish one out of the way. Out of hours system maintenance has a direct impact on the work-life balance of those performing the work. Contrary to what mainstream society may think, working in IT doesn’t exclude you from having a “life”, which may or may not include things such as hobbies, playing sports or having a family. It’s hard to be a good, attentive parent or spouse when you have to regularly stay up all night focusing on work. Lastly, depending on the hours involved and your specific circumstances, it could be unhealthy.
  2. It increases risk due to humans being humans – By the time you have commuted home, had dinner and settled down for the time window approved for your work, it may be 7 or 8pm (or even later). At this point, you could easily have been up 12 or more hours. By the time you finish the work, you may have been up 18 hours or more. Numerous studies have shown that the incidence of mistakes and errors increases the longer we are awake. The scenario also means that you are likely not to get a full night of sleep before work the next day, which can lead to fatigue and, again, an increased rate of mistakes and errors in your work. This presents risk to your employer.
  3. It increases risk due to the increased number and complexity of the objects involved – When you’re in the office working, you have a certain set of objects involved in your work: your PC, the network equipment it connects to, the servers you’re working on and so forth. Generally you have some amount of redundancy with these: if your PC fails, you jump on another. The moment you move to out of hours work, chances are very high you’re doing the work at home and are introducing a range of new objects that don’t have the redundancy of your corporate infrastructure. Your router certainly won’t be in a redundant configuration, and your link to work is almost certainly going to be a single residential connection with no failover or service guarantee. All these create single points of failure and additional areas of risk that can prevent you from completing or even starting the work.
  4. It increases risk due to the lack of a support ecosystem – During the day at work, you have a range of resources you can call on if things go wrong or you’re stuck. You can consult your fellow team members, people in other teams, and external support options such as vendors. Even your manager is part of this ecosystem, marshalling resources to assist you if you’re stuck. Once you’re outside business hours, the options you can call upon diminish, and the longer you’re working on the problem, the fewer options you have. How many of your vendors or peers will pick up the phone at 2am to help you?
  5. Your systems should be architected to allow maintenance during the day – In the “good old days”, things were a lot more tightly coupled and singular. You might have had only one server for an application or service, and designing systems to allow redundancy was hard. Today it is very easy to design systems that are highly available and fully redundant. So if they are, and you can take 25% or 50% of the systems involved in that service offline, why not do the maintenance in business hours?
  6. We have the tools and the technology to make it happen – As my cluster of buzzwords at the start suggested, we have a large array of tools and technology to dance around the holes created if we take something offline for maintenance. At the storage layer, we have replicated storage; at the server hardware layer, we have virtualisation that allows us to move services around; and so on. Consider the fact that higher up the stack, there are companies like Facebook that perform “continuous deployments” twice a day. ING Direct Australia, a bank that manages $51 billion in assets, performs such deployments twice a week, without the customer even noticing it’s happened. Update tools such as Microsoft’s Software Update Services now have cluster-aware updating, which allows you to easily perform updates on part of a cluster, leaving the service available. On the Linux side of things, features such as live kernel patching reduce the need to reboot, thus decreasing the need to perform the patching out of hours. Lastly, the propagation of automation tools over the last few years means there are more options for the work to be completely automated and scheduled. Worst case, a human has to initiate the automation workflow and then walk away.
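The rolling, automated workflow points 5 and 6 describe can be sketched in a few lines: drain one node of a redundant pool, patch it, verify it is healthy, re-enable it, and only then move to the next. This is a minimal illustrative sketch, not any particular tool; the node names and helper functions are assumptions standing in for your real load balancer and patching APIs.

```python
# Illustrative sketch of a rolling-maintenance workflow over a redundant
# pool of nodes. The drain/patch/health_check helpers are hypothetical
# stand-ins for load-balancer and configuration-management calls.

def drain(node, pool):
    """Remove a node from the active pool so it stops receiving traffic."""
    pool.remove(node)

def patch(node, patched):
    """Apply updates to the node (simulated here as marking it patched)."""
    patched.add(node)

def health_check(node, patched):
    """Confirm the node came back healthy after patching."""
    return node in patched

def rolling_maintenance(nodes):
    pool = set(nodes)       # nodes currently serving traffic
    patched = set()
    for node in nodes:
        drain(node, pool)   # only one node is ever out of the pool,
        patch(node, patched)  # so the service stays available throughout
        if not health_check(node, patched):
            raise RuntimeError(f"{node} failed post-patch check; halting")
        pool.add(node)      # re-enable only after it passes verification
    return patched

if __name__ == "__main__":
    done = rolling_maintenance(["web01", "web02", "web03", "web04"])
    print(sorted(done))
```

Because at most one node is drained at any moment, the pool never drops below N−1 members, which is exactly why this kind of work can run during business hours on a properly redundant service.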

Do I expect out of hours system maintenance to disappear completely in the near future? No. But I do feel that like a lot of the mundane tasks IT professionals have performed in the past, it will eventually be converted into automation workflows or eliminated in other ways.  Until that happens, it might be worth considering how it can be done so these points are taken into consideration.
