The fuck-ups
-
(Thanks to s3gunzel for the suggestion!)
It can be difficult as a system administrator.
We all make mistakes, some worse than others! Post your mistake stories below, and how you overcame them. And if you’re worrying about a mistake, remember that we’ve all done it - it’s how you own up to it and move past it that makes you a better admin.
-
katos said in The fuck-ups:
Thanks to s3gunzel for the suggestion!
Seriously - I suggested it for a reason. Got one from today.
A few months ago - we had to migrate domain names from a legacy business that had been bought out so we could renew the domain names. There were about ten domain names, and it was a reasonably complex process. Every time we'd contact the registrar, someone new (and presumably green) took the ticket and bounced it back saying "Oh we don't have that domain name" and I would reply saying "Read the ticket, we've been through this already. Make a note."
Anyway, we got through it after a couple of MONTHS! And then it turns out I didn't migrate all the domains properly, and three expired.
Renewed them today. Feel a bit stupid.
-
Oh boy do I have one hell of a story for this one... This happened not even 4 months into my current job while my boss was still an actual IT person at the time (kinda).
So to preface this, something important to understand is that my boss, who had been the IT guy for nearly a decade at this point, was primarily a developer, meaning he had little time to deal with IT infrastructure, hence why I was hired to assist him. Additionally, prior to me joining, he had been upgrading VM hosts with complete resets, so things were kind of in disarray and not all the configs followed best practices.
After a couple of months in the "seeing if we trust you" phase, they decided that they did, and thus assigned me my first ever project. A really basic one (or what should have been), and straightforward at that. The project was to enable online archives for the on-prem Exchange users so that we could migrate to Exchange Online (some main mailboxes were too big).
Pretty easy. I read through the documentation, take note of some PowerShell commands to script this up so that we can automate it, take note of the additional disk space requirements, etc. So far so good. I present my findings to the IT director, and he says "double check the disk space". I check the VM disk space and everything looks good, more than 200 GB free on each VHD storing the databases. I confirm with the IT director and get the go-ahead to proceed with enabling it for a test group (the lead dev, himself, myself and the general manager for one of the divisions). Enable them, no problems, and check to make sure the archiving job will run overnight. Go home.
Next morning, I went into the office and email was completely broken. And the Hyper-V state of the on-prem Exchange VM was "saved-critical", meaning something major had happened to it. The IT director gets in and we immediately begin digging in to solve the issue, and find that the physical drive (RAID array) holding the VHDs is out of space. I explain how I checked disk space again the day prior, and the IT director proceeds to inform me "the virtual drives were overprovisioned because of the VM migrations we were doing, now we have to clear room and fix this". In the end we solved it by creating a new RAID array with more storage, moving the VHDs, and then spinning up the VM. But then we had to recover two corrupt Exchange DBs. The first was super easy to repair; the second we had to restore from backup (which took 3 days because of 100 Mb/s switches).
All in all, it was a major learning experience. The first lesson: double check both virtual and physical disk space for VMs when doing things that might use a lot of space, even temporarily. Second, don't put both Exchange DBs on the same RAID array, and for sure don't overprovision the physical disk space. And finally, make sure that backup systems have at least 1 Gb/s connections, preferably 10 Gb/s.
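For anyone who wants to sanity-check the first and last lessons with numbers, here is a rough Python sketch. It is not what we ran at the time; `VirtualDisk`, `overcommit_gb`, and `transfer_hours` are made-up names, and every figure in the example is hypothetical rather than taken from the incident. The idea is simply to compare how much the dynamically expanding disks could still grow against what the host volume actually has free, and to estimate the best-case time to move a backup over a given link.

```python
from dataclasses import dataclass


@dataclass
class VirtualDisk:
    name: str          # hypothetical label for the VHD file
    current_gb: float  # space the VHD file occupies on the host volume right now
    max_gb: float      # size the dynamically expanding disk is allowed to reach


def overcommit_gb(disks, host_free_gb):
    """GB the host volume would be short if every disk grew to its maximum.

    Returns 0 when all potential growth fits in the free space.
    """
    potential_growth = sum(max(d.max_gb - d.current_gb, 0) for d in disks)
    return max(potential_growth - host_free_gb, 0)


def transfer_hours(size_gb, link_mbps, efficiency=1.0):
    """Best-case hours to move size_gb (decimal GB) over a link_mbps link.

    efficiency is an assumed fraction of line rate actually achieved.
    """
    bits = size_gb * 8 * 1000**3
    seconds = bits / (link_mbps * 1000**2 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical numbers: each guest reports 200 GB free inside its disk,
    # but the host array underneath only has 150 GB left.
    disks = [
        VirtualDisk("db1.vhdx", current_gb=300, max_gb=500),
        VirtualDisk("db2.vhdx", current_gb=300, max_gb=500),
    ]
    print("Shortfall (GB):", overcommit_gb(disks, host_free_gb=150))

    # Hypothetical 1 TB restore at line rate on 100 Mb/s versus 1 Gb/s.
    print("Hours at 100 Mb/s:", round(transfer_hours(1000, 100), 1))
    print("Hours at 1 Gb/s:", round(transfer_hours(1000, 1000), 1))
```

A non-zero shortfall means the guests can look healthy from the inside while the host is one big write away from running out of space, which is exactly the trap described above. The transfer estimate is a floor at line rate; real restores are slower, so treat it as the best case.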