Taking gear out of service

This is a story about good practice and "don'ts" in sysadmin work, inspired by Bryan's (@bdha) blog entry about the first law of systems administration. Recently I got some used gear for my lab. The story of this gear, a set of FC switches, reminded me of a lesson about what can go wrong, and of why I was once told to do things a certain way when administering systems in production. I have been bitten by similar experiences myself, though never ones that hurt this much.

Expect the unexpected (say hello to Murphy)

Once in the ol' days my current boss (and former sysadmin) told us: "If you migrate a service and power off the old server or some other important piece of gear, do NOT disassemble it, throw it out, or give it away too soon. You never know whether you forgot a tiny detail, and you might be glad to have the old box back up and running at short notice."

Today we have virtualization, configuration management, and revision control, which can help you track config changes and migrations. Nonetheless, a migration can still be very dangerous.

What he also told me was: "Plan for the unexpected; you really never know what is going to happen in a migration if the system is complex. Also plan your time and resources, and don't squeeze too much into that time window." (You might need a little bit of sleep, too.)

Reality...

Now came day X, when the customer (me) wanted to pick up his used gear. I was kindly asked for an additional week because of their migration; fine. When I finally got there, I realized that they were actually in the process of migrating the SAN traffic to the remaining fabric switches. And guess what? Boom, that is when the "unexpected situation" happened, even with highly available virtualization, clustering, and a multipathed, multi-controller FC SAN. I had to come back a couple of hours later, after they had fixed their production environment. I was only grumpy because I had to wait longer, but they were quite exhausted after this exercise...

Conclusion

Doing massive changes on your critical production environment without any time reserve is not sane. I may sometimes try to rush into changes, but not this way. Now I have had the chance to experience such a situation as an outsider. Lots of sysadmin wisdom gets outdated quickly, but some doesn't, and this rule seems to be part of the wisdom that lasts. Stick to it: by not following it, you hurt not only yourself.
