Our apologies to all our users who were affected by the outages in the past month. We understand the frustration and hope we are now in a better position to avoid similar situations.
This is not an excuse but a postmortem.
In the past month we had an influx of users which blew up our servers. It took us a while to get back on our feet due to corona, WFH, parenting and limited resources. You might find it interesting to see what was happening behind the scenes during this hectic month. So here it is.
1 – Our first outage was when the server ran out of disk space. The culprit? Stupid logs! To add insult to injury, none of them came from our app! They were just the default server logs. To be fair, I am not a server guy. Any decent server developer would have known to disable these logs in the first place. Anyhow, I cleaned the logs. Then I made my next mistake. I thought everything was back on track.
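For anyone else who is not a server guy, here is a minimal sketch of the kind of check that would have caught the full disk early. It is not what we ran. It only uses the Python standard library, and the function names, plus any directory you point it at (for example `/var/log`), are assumptions for illustration.

```python
import os
import shutil


def disk_used_fraction(path="/"):
    """Return the used share of the filesystem holding `path` (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def largest_files(root, top_n=5):
    """Return up to `top_n` (size_in_bytes, path) pairs under `root`, biggest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:top_n]
```

Calling `largest_files("/var/log")` when `disk_used_fraction("/")` creeps above some limit you pick (say 0.8) would point at the log files eating the disk. The sketch only reports; it does not delete anything.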
2 – Once the system was fully operational, it started eating CPU like there was no tomorrow! The CPU was stuck at 100% and people started to drop out of games. I started a journey to optimize the server code for those precious CPU cycles. I am not a server guy. It took me 2 days to bring CPU usage down just enough to keep it running. Then I made my next mistake. I thought everything was almost on track.
3 – The next day the servers went down again! I am not a server guy. It took me hours to find the proper logs. Nothing fancy. Just an “Out of memory error” thingy. Well, it is a bit awkward. When you hit all the limits (disk, CPU and memory) it is time to move everything to a bigger house. Then I made my next mistake. I thought it was going to be easy.
4 – As a wise man once said, back up your data! I am not wise but I tried to be, once. I even picked a low-traffic time. That glorious moment when I started the backup process. Voila. The server went down again. It couldn’t handle both the traffic and the backup process. It is darkest before dawn, I guess? I mean sure, I could get a load balancer and 2 reserve servers and 2 database servers and try to understand the networking around them… But in the end, this thing is supposed to be fun and I have a full-time job and a full-time family. And more importantly, I am not a server guy. So I decided to make another mistake.
5 – I knew at this point our users must have noticed the recurring outages. So let’s take things a bit slower. I rented a better machine and spent another restless weekend installing apps, deploying server code and trying to figure out when I made the first mistake. Was it when I decided to become a computer science major? Or when I thought it was all going to be fun?
Finally, after a couple of days of pulling my hair out, the new server was up and running. It took another day to figure out how to redirect the traffic from the old server to the new one and voila! All the requests were handled by the new server. CPU usage: check. Memory usage: check. Disk space: check. I am not a server guy. But I even installed a bunch of monitoring systems just in case! Then I made the next mistake. I thought the new server was too good.
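The three checks above (CPU, memory, disk) can be sketched in a few lines. This is not the monitoring we installed, just an illustration under stated assumptions: a Linux machine (it reads `/proc/meminfo` and uses `os.getloadavg`), and made-up names and limits (`parse_meminfo`, `find_alerts`, `read_system`).

```python
import os
import shutil


def parse_meminfo(text):
    """Return the fraction of memory in use from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total


def find_alerts(readings, limits):
    """Return the sorted names of readings that exceed their limit."""
    return sorted(name for name, value in readings.items() if value > limits[name])


def read_system():
    """Collect current readings (Linux only; an assumption of this sketch)."""
    load_1min, _, _ = os.getloadavg()
    disk = shutil.disk_usage("/")
    with open("/proc/meminfo") as f:
        memory = parse_meminfo(f.read())
    return {
        "cpu": load_1min / (os.cpu_count() or 1),  # load per core, a rough proxy
        "memory": memory,
        "disk": disk.used / disk.total,
    }
```

Something like `find_alerts(read_system(), {"cpu": 0.9, "memory": 0.9, "disk": 0.9})` run every minute would flag whichever resource is about to run out. The 0.9 limits are placeholders, not tuned values.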
6 – The CPU usage was lower than on the old server. But hey, it has twice as many cores. Who am I to judge?
I guess server guys could have figured out what these charts meant! But hey! I am not a server guy. It took me a couple of days to figure out that the CPU usage was actually way lower than it should have been. Simply because our iOS users could not connect! Easy to fix (ouch!). Suddenly, the CPU usage went through the roof again. Yoohoo! … Wait! What? This was supposed to be the good server… Did I make another mistake?
7 – It took another 2 days of skimming through docs to figure out that the new server (like the old server) was not set up properly. It was using too many threads and running out of all the resources it had for absolutely no reason. I made the changes and suddenly everything started to fall into place. Am I making another mistake?
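The actual fix was a setting in our server, which I will not pretend to reproduce here. The general idea, capping the number of worker threads instead of letting them grow without limit, can be sketched like this. The function name `run_with_cap` and the default of two threads per core are assumptions for illustration, not our real configuration.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_cap(tasks, max_workers=None):
    """Run callables on a fixed-size pool; return (results, peak concurrency)."""
    if max_workers is None:
        # Hypothetical heuristic: two threads per core.
        max_workers = 2 * (os.cpu_count() or 1)

    lock = threading.Lock()
    active = 0
    peak = 0

    def tracked(task):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return task()
        finally:
            with lock:
                active -= 1

    # The pool never starts more than max_workers threads; extra tasks wait in a queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(tracked, tasks))
    return results, peak
```

With a cap in place, a burst of requests queues up instead of spawning a thread each, so the machine slows down gracefully rather than running out of everything at once.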
Conclusion: We had many sleepless nights and many outages that could have been avoided by just setting the proper options on the old server and calling it a day. Who would have thought? I guess I am not a server guy!