-- Artur, we have a problem -- those were the first words I heard over the phone -- we have grown from 1k to 100k online users in a few days, and we expect to hit 1M online users within 2 weeks and probably many more when we go live.
In most cases, such growth would be called a great success. In fact it is a great success; you just need to make sure you can sustain it by not disappointing users with an unreliable service. First impressions are what count...
The next day I was on an intercontinental flight to the office of a startup created by a group of students and recent graduates. Extremely friendly people and, more importantly, passionate about what they do. For them, it wasn't about money; it was about the fun and excitement of creating something new that would be used by millions of users. I was excited too. I like challenges, and this assignment certainly looked like one.
Thanks to the timezone difference, and despite the freezing temperature, I showed up at the office the same evening, straight from the plane. It was already dark outside, but the small office was pleasantly lit. There were 8 or 10 desks in no particular order, and cables were all over the place. It certainly looked like a place where people focus on their work rather than on tidying up. They welcomed me with smiles and some degree of mistrust. I knew that mix of smile and mistrust. It wasn't the first time I had been called in like a fireman to put out a wildfire. Actually, I really was a fireman some time ago....
-- What's up? -- I asked and looked around. They weren't all new faces to me. I had already spent 10 days with them a few months earlier, helping with the code migration to Tigase. -- Can you give me some details before I start?
-- We had fewer than 1k online users a few days ago, just our friends and other people who wanted to help us test during development -- they started to describe the situation. -- Then, last Tuesday, we switched to open beta. Interest exceeded our expectations. We now have almost 1M registered accounts and up to 100k online users. Our servers are on their knees. We want to go live in the next 2 weeks, but we are afraid we cannot handle the load.
-- Can we? -- they added, with question marks on their faces.
-- Of course we can. -- I replied without hesitation. -- It just needs some work.
I had no doubt that Tigase could handle 1M online users and many more. However, no software works for large installations out of the box. It always needs some custom optimization. In fact, every single large installation we were involved in was very different from the others in terms of traffic shape and user behavior. This is why we implemented a clustering strategy framework: it allows such customizations and lets you apply different clustering logic (a strategy) adjusted to the use case.
Before we could start any work, I needed to know exactly what was going on inside Tigase on that installation and where the bottlenecks were. This was necessary to prepare a plan and prioritize the work. I collected server statistics and talked to the team about the custom code they had created. It was well after midnight when we decided to take a break and continue the next day.
With my head full of new information and system performance metrics, I drove to the hotel. It was dark and very cold, although there was no snow yet. The drive was pleasant, with no traffic, so I could enjoy it while still thinking about how to approach the problem.
It turned out they had lots of custom code embedded in Tigase, unoptimized, which did not even use the Tigase API. Instead, they had just modified Tigase code all over the place. Additionally, every message and most IQ packets went through the database, because they had implemented a system to guarantee packet delivery to a user. 0% of messages lost on their system: very nice, but it created an additional performance challenge. On top of this, they attempted to deliver messages even if the XMPP client was not running on the mobile device. They combined push technologies for iOS, Android, BB and... SMS if there was no other option.
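To make the delivery guarantee concrete, here is a minimal sketch of a "persist first, then try each channel in order" scheme. It is not the team's code and does not use the Tigase API: `DeliveryChannel`, `FixedChannel` and `FallbackDelivery` are hypothetical names, and an in-memory list stands in for the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical delivery channel (XMPP session, push service, SMS gateway...). */
interface DeliveryChannel {
    String name();
    /** Returns true if the channel accepted the message for this user. */
    boolean tryDeliver(String userId, String message);
}

/** Test double: a channel that is either always available or never available. */
class FixedChannel implements DeliveryChannel {
    private final String name;
    private final boolean available;

    FixedChannel(String name, boolean available) {
        this.name = name;
        this.available = available;
    }

    public String name() { return name; }

    public boolean tryDeliver(String userId, String message) { return available; }
}

/** Tries channels in priority order and keeps undelivered messages for a retry. */
class FallbackDelivery {
    private final List<DeliveryChannel> channels;
    // Assumption: stands in for the database table holding messages until delivery.
    private final List<String> pending = new ArrayList<>();

    FallbackDelivery(List<DeliveryChannel> channels) {
        this.channels = new ArrayList<>(channels);
    }

    /** Returns the name of the channel that accepted the message, if any. */
    Optional<String> deliver(String userId, String message) {
        String record = userId + ":" + message;
        pending.add(record); // store before any delivery attempt, so nothing is lost
        for (DeliveryChannel channel : channels) {
            if (channel.tryDeliver(userId, message)) {
                pending.remove(record); // accepted: no longer pending
                return Optional.of(channel.name());
            }
        }
        return Optional.empty(); // all channels failed: message stays pending
    }

    int pendingCount() { return pending.size(); }
}
```

With channels ordered as XMPP, push, SMS, a message for a user whose client is not running skips the first channel and goes out through the next one that accepts it. The write before every delivery attempt is what makes the scheme safe, and it is also why every message costs a database round trip.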
The next day I woke up early, at about 6AM, with a plan ready. The folks had told me they came to the office at about 9AM, so I had a few hours to prepare. I reviewed the Tigase code and made notes on which API should be used for each feature. I also looked at possible changes and improvements in the Tigase core that would make it easier to integrate with the client code.
-- What do you think? -- I was asked at 9AM in the office. -- Can we do it? Can we do it in 2 weeks?
-- I have no doubt we can do this. -- I replied without hesitation. -- However, we cannot do everything in 2 weeks, so we have to prioritize to make sure the critical stuff is ready in 2 weeks and the overall code is prepared so that the remaining elements are easy to add.
The plan was:
- Write a custom clustering strategy and run Tigase in cluster mode.
- Extract the custom code out of the Tigase core into plugins and components.
- Put the DB-intensive tasks (the whole QoS system) in a component and deploy it as several external components to distribute the load.
- Implement all the "push" code that interacts with iOS, Android, etc. as components, so that they too can be deployed as external components if necessary.
- Optimize the slow code based on the system metrics.
The team was eager to work on the code, and they were all very dedicated. My main role was to teach them the Tigase API, the Tigase architecture and the overall XMPP concept, and to instruct and help them design the best way to implement certain features. I also did code reviews to make sure their code integrated correctly with the Tigase core. And when time allowed, I did some coding myself.
The results exceeded our expectations. Everything was ready (although not fully polished) in 10 days! By that time the number of online users had grown to 300k, which gave us a good load as a testing ground. They decided to go live on the 12th day....
Indeed, within about 6 hours of publishing information about the new service, user registrations skyrocketed. In fact, the registration rate was so high that we started to have issues with DB performance. We quickly put in a fix to throttle the number of user registration requests to an acceptable level. We soon had several million new user accounts in the DB and about 500k online users.
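The story does not record how the registration throttle was implemented. One common way to cap a request rate is a token bucket, sketched below under that assumption. `RegistrationThrottle` and its parameters are hypothetical names, not part of Tigase, and the current time is passed in explicitly so the behavior is easy to check.

```java
/**
 * Token bucket: allows bursts of up to `capacity` requests and refills
 * at `refillPerSecond` tokens per second. Hypothetical sketch, not Tigase code.
 */
class RegistrationThrottle {
    private final double capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    RegistrationThrottle(int capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;      // start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if a registration request arriving at nowNanos may proceed. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        if (elapsedSeconds > 0) {
            // Add the tokens earned since the last call, never exceeding capacity.
            tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
            lastRefillNanos = nowNanos;
        }
        if (tokens >= 1.0) {
            tokens -= 1.0;           // spend one token for this request
            return true;
        }
        return false;                // over the limit: reject or ask the client to retry
    }
}
```

In production code the caller would pass `System.nanoTime()` and answer rejected requests with an error asking the client to retry later. The refill rate would be set to whatever the database can sustain, which keeps a registration spike from starving the rest of the system.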
The main benefit of the hard work of the previous days was that we could now easily scale the whole installation up and down simply by adding or removing machines. The load was well distributed. Tigase cluster nodes could be brought up without touching the rest of the system, and we could add more external components to deal with DB-intensive tasks, as well as more external components for pushing messages. The code was refactored and prepared so that new features were easy to add. I also taught them how to read and understand Tigase metrics to detect bottlenecks and slow code, so they knew what to optimize and correct.
The whole project was a great success. A few years later the system still works great and serves a few hundred million users, and I am really happy to see Tigase software hard at work.