The hectic month

Our apologies to all our users who got impacted by the outages in the past month. We understand the frustration and hope we are in a better position to avoid similar situations.

This is not an excuse but a postmortem.

In the past month we had an influx of users which exploded our servers. It took us a while to get back on our feet due to corona, WFH, parenting and limited resources. You might find it interesting to see what was happening behind closed doors during this hectic month. So here it is.

1 – Our first outage was when the server ran out of disk space. The culprit? Stupid logs! To add insult to injury, none of that came from our app! It was just the default server logs. To be fair, I am not a server guy. Any decent server developer would have known to disable these logs in the first place. Anyhow, I cleaned the logs. Then I made my next mistake. I thought everything was back on track.
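For what it is worth, even a tiny check like the one below could have warned me before the disk filled up. This is only a sketch, not what we actually run: the function names, the `/` path and the 90% threshold are all made up for illustration.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    if usage.total == 0:
        # Guard against odd filesystems that report no capacity.
        return 0.0
    return 100.0 * usage.used / usage.total


def disk_is_nearly_full(path="/", threshold=90.0):
    """True when usage is at or above `threshold` percent (hypothetical limit)."""
    return disk_usage_percent(path) >= threshold


if __name__ == "__main__":
    if disk_is_nearly_full("/"):
        print("Disk is almost full, go clean those logs!")
    else:
        print(f"Disk usage is fine: {disk_usage_percent('/'):.1f}%")
```

Run something like this from a scheduled job and have it send a message instead of printing, and the first outage becomes a warning rather than a surprise.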

2 – Once the system was fully operational, it started eating CPU like there is no tomorrow! The CPU was stuck at 100% and people started to drop out of games. I started a journey to optimize the server code for those precious CPU cycles. I am not a server guy. It took me 2 days to bring down CPU usage just enough to keep it running. Then I made my next mistake. I thought everything was almost on track.

3 – The next day the servers went down again! I am not a server guy. It took me hours to find the proper logs. Nothing fancy. Just an “Out of memory error” thingy. Well, it is a bit awkward. When you hit all the marks (disk, CPU and memory) it is time to migrate everything to a bigger house. Then I made my next mistake. I thought it was going to be easy.

4 – As a wise man once said, back up your data! I am not wise but I tried to be, once. I even picked a low-traffic time. That glorious moment when I started the backup process. Voila. The server went down again. It couldn’t handle both the traffic and the backup process. It is darkest before dawn, I guess? I mean sure, I could get a load balancer and 2 reserve servers and 2 database servers and try to understand the networking around them… But in the end, this thing is supposed to be fun and I have a full-time job and a full-time family. And more importantly, I am not a server guy. So I decided to make another mistake.

5 – I know at this point our users must have noticed the recurring outages. So let’s take things a bit slower. I rented a better machine, spent another restless weekend installing apps and deploying server code and trying to figure out when I made the first mistake. When I decided to become a computer major? Or when I thought it was all going to be fun?
Finally, after a couple of days of pulling my hair out, the new server was up and running. Another day to figure out how to redirect the traffic from the old server to the new server and voila! All the requests were handled by the new server. CPU usage, check. Memory usage, check. Disk space, check. I am not a server guy. But I even installed a bunch of monitoring systems just in case! Then I made the next mistake. I thought the new server was too good.

6 – The CPU usage was lower than on the old server. But hey, it has twice as many cores. Who am I to judge?
I guess server guys could have figured out what these charts meant! But, hey! I am not a server guy. It took me a couple of days to figure out the CPU usage was actually way lower than what it should have been. Simply because our iOS users could not connect! Easy to fix (ouch!). Suddenly, the CPU usage jumped through the roof again. Yoohoo! … Wait! What? This was supposed to be the good server… Did I make another mistake?

7 – Another 2 days of skimming through docs to figure out that the new server (like the old server) was not set up properly. It was using too many threads and was running out of all the resources it had for absolutely no reason. I made the changes and suddenly everything started to fall into place. Am I making another mistake?
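The idea behind that fix, roughly: cap the number of worker threads instead of letting the server spawn as many as it likes. Here is a minimal sketch of the principle, not our actual server config. `pick_worker_count`, `handle_request`, the 2-threads-per-core rule and the cap of 32 are all assumptions I picked for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_worker_count(cpu_count=None, per_core=2, hard_cap=32):
    """Choose a bounded thread count from the number of cores (hypothetical rule)."""
    cores = cpu_count or os.cpu_count() or 1
    return max(1, min(cores * per_core, hard_cap))


def handle_request(request_id):
    """Stand-in for real request handling."""
    return f"handled {request_id}"


def serve(request_ids, workers):
    """Process requests with at most `workers` threads alive at any time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps results in the same order as the inputs.
        return list(pool.map(handle_request, request_ids))


if __name__ == "__main__":
    workers = pick_worker_count()
    print(f"Using {workers} worker threads")
    print(serve(range(5), workers))
```

With a fixed pool, extra requests wait in a queue instead of each grabbing a new thread, so a traffic spike slows things down rather than eating every resource on the machine.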

Conclusion: We had many sleepless nights and many outages that could have been avoided by just setting the proper options on the old server and calling it a day. Who would have thought? I guess I am not a server guy!


Overwhelmed

I’m not used to this. I used to see each challenge as an opportunity, but now I’m sitting here totally confused, not sure what to do next…

It all started very simply. “Let’s add Android,” she said. “Should be easy,” I replied. “After all, we can reuse the icons!” You should know that we are all developers; we don’t have a single drop of designer blood in our veins. Writing a thousand lines of code feels like a breeze compared to the storm of designing a 16×16 icon! I have personally spent hours and days trying to create a “simple” icon that we later decided to throw away since it reflected my lack of artistic vision (you are welcome, by the way). So, when I thought about porting the iOS apps to Android, it didn’t bother me that I don’t know Android! It was just a relief that we wouldn’t need to design those damn icons again! Even later on, when I created a mental list of the challenges that we needed to tackle, I didn’t think about coding for Android. The title that was blinking in my head was “Damn, we need our own servers!”.

Game score, user ranking, matchmaking, friend invite, achievements, … All these services are available on Apple’s platform for free as long as you stick with their servers. But if one day, god forbid, you want to dip your toe in the ocean of multi-platform games, you are fucked. You would need to create all these services yourself. Some of them are easy to implement, some of them are not. But regardless of difficulty, any sort of multi-platform communication should go through your own servers.

I don’t know how much experience you have with backend development, but you don’t have to be a genius to realize servers need constant maintenance. You probably don’t want to be the one who needs to wake up at 3 a.m. to restart the server! Thankfully, it is not that expensive to piggyback on the infrastructure of one of the technology giants. The price? You have to learn their APIs and work with their limitations. Limitations that are waiting for you in the deepest and darkest corners of their documentation.

Google App Engine. One of the miracles of our time. “Deploy your code with a simple click of a button”. And of course the famous Bigtable. This magical land of data. Sounds exciting. Right? So, you roll up your sleeves, shake your dusty mind and start working. It truly is a drug, the nostalgic feel of writing servlets and listening to Hotel California… It’s all fun and games until the day you realize you have built the whole app around sockets but the platform doesn’t support socket connections! “Such a lovely place, such a lovely face.” You are so deep at this point that you can’t jump to another platform. “You can check out any time you like, but you can never leave!”

The options are limited to server ping and push notifications. The first one is expensive for users, the second one is lame. It’s like when you are driving at 90mph facing a cliff, realizing your brakes are not working. Neither option leads to your favorite ice cream shop!
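To give a feel for why the server ping option costs users so much: the client has to keep asking “anything new?” whether or not there is an answer. One common way to soften that is to slow down while nothing is happening. The snippet below is just a sketch of that idea, not our client code; `next_poll_delay` and the 1 to 60 second range are assumptions for illustration.

```python
def next_poll_delay(current_delay, got_update, min_delay=1.0, max_delay=60.0):
    """Return how many seconds to wait before the next ping.

    Reset to `min_delay` as soon as the server had something new;
    otherwise double the wait, up to `max_delay`.
    """
    if got_update:
        return min_delay
    return min(max(current_delay, min_delay) * 2, max_delay)


if __name__ == "__main__":
    delay = 1.0
    # Three empty answers from the server, then an update arrives.
    for got_update in [False, False, False, True]:
        delay = next_poll_delay(delay, got_update)
        print(delay)  # prints 2.0, 4.0, 8.0, then 1.0
```

It saves battery and data during quiet periods, but the trade-off is obvious: the longer the wait, the later the player hears about their opponent’s move.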

So, you go with the most rational option and start reading more documents. How to set up the certificates and send push notifications to Apple devices using Google’s platform. Not that hard actually, and after a couple of weekends you make it work. Very reliable. The caveat? You can’t send push notifications to users without their permission. My guilty pleasure has always been to block the games that request push notifications. It’s against my ethics to expect otherwise from our users!

Push notification, unfortunately, is not the only issue we are facing. Let’s go back to the list of services we lost by setting up our own server. You know how easy it is to invite a friend to a game? Well, it is not that easy anymore. Since Apple doesn’t give us the list of a user’s friends, we need to get it from another source. But from whom? Luckily this is a no-brainer. Facebook has a wonderful SDK for the iOS platform. So, you spend a couple more weekends understanding their nice APIs and creating a prototype, only to realize the information you can get isn’t really useful! If your users really really trust you and give you all the permissions you asked for, then you can get a short list of their friends who not only installed your app but also gave you all these permissions! In practice, this means an empty list for most of our users.

At this point I’m just not sure if the blueberries even care for a cozy and warm waffle. Maybe we should cool it off a bit. Maybe a cold blueberry smoothie is good enough!
