Thread created on 22:02:46 - 25/10/22 (3 months ago)
I thought I'd provide a quick informal announcement regarding today's events and our plans in the near future, not only to provide you with some transparency, but also to help me personally wrap my head around the situation we've found ourselves in.
Firstly, some insights into the unprecedented activity we're seeing...
- We're seeing an incredible 2,000 in-game attacks per minute despite today's issues, on track to double our previous peak. All of those attacks come with a huge amount of related activity, which is obviously pushing our limits.
- We've just blown past 40,000 daily active players for the first time, which is a significant milestone I've been looking forward to for many months. It's a shame it's tainted by the problems we're facing.
- Torn's total requests per 15 minutes increased from 2 million to 7 million. Such a massive influx wasn't anticipated. Previous events have seen large increases, but never as much as 3.5x. Not only are there more users online than ever before, their activity is also vastly increased.
- The queries per second on a single database server increased from an average of 7,500 to 50,000. This database server likely only survived because of the bottleneck caused by our web servers.
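For a rough sense of scale, the figures above can be turned into growth multiples and per-second rates. This is only a back-of-envelope sketch using the numbers quoted in this post; the helper names (`growth_factor`, `per_second`) are illustrative and not part of any Torn tooling.

```python
def growth_factor(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


def per_second(count: float, window_seconds: float) -> float:
    """Convert a count observed over a time window into a per-second rate."""
    return count / window_seconds


# Total requests per 15 minutes: 2 million -> 7 million.
requests_factor = growth_factor(2_000_000, 7_000_000)

# Queries per second on a single database server: 7,500 -> 50,000.
qps_factor = growth_factor(7_500, 50_000)

# 7 million requests per 15-minute window, expressed per second.
requests_per_second = per_second(7_000_000, 15 * 60)

# 2,000 in-game attacks per minute, expressed per second.
attacks_per_second = per_second(2_000, 60)

print(f"Requests grew {requests_factor:.1f}x")
print(f"Database QPS grew {qps_factor:.1f}x")
print(f"About {requests_per_second:,.0f} requests per second")
print(f"About {attacks_per_second:.0f} attacks per second")
```

Note that the database load grew by a larger multiple than the request count, which fits the observation that per-user activity increased along with the number of users online.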
Now, on to some notes on our current situation and future plans...
- As proven by Halloweek '22, we need to design our infrastructure in a way that allows for horizontal scaling during periods of heightened activity. Scaling is currently a very manual and unreliable process that works fine during normal times, but fails us during large events.
- Fortunately, at least, we seem to be handling things better than during the Valentine's Day event in 2020; we've come a long way since then. That said, the onslaught of 'Out of available ports' errors, which our pods are surfacing as incredibly frustrating 550 pages, persists even 12 hours later.
- In an attempt to resolve this issue, assuming it continues into tomorrow, we'll try to get some extra web servers running on the cloud to handle this traffic. This may be challenging, but should be theoretically possible since we've already prepared a lot of the groundwork to make this happen.
- While we did see a brief DDoS attack last night, it was quickly mitigated and did not contribute to any issues faced today. All issues encountered today were a result of the current design being unable to handle an amount of traffic that grows with each passing event.
- Our current server infrastructure has slowly metastasized over the years into an undocumented mess; attempting to make quick fixes or improvements only seems to make things worse. Our current hardware isn’t in a position to be fully re-purposed to provide additional cycles without significantly degrading performance.
- This has sparked our initiative to try moving additional services into the cloud. Throughout the past year, we’ve brought on additional talent into the Torn team and we are recognizing how disorganized things have become over time. We’re in an excellent position to utilize our growing talent and continue to improve the infrastructure in Torn. This is a process - a carefully planned and tested process, not a quick fix that will just get us through the day.
- The current infrastructure design doesn’t allow for horizontal scaling in Torn. The deployment practices are dated and don’t provide consistent methods for backend load balancing. These processes are being re-designed from the ground up by newer infrastructure staff so we can continue to grow and stamp out major event issues.
- We’re currently working on adding database servers as part of the cloud migration testing to spread load during large events. I expect that during major events, we will need to prepare at least double the resources we currently have available.
- Our lease at our data center ends next year. If we want to remain self-hosted, we'll need to move racks anyway (with significant downtime), as the core PSUs are reaching their end of life. Now is the perfect time to attempt a venture to the cloud.
- The first portion of the migration will involve moving our entire development environment into the cloud. Successfully migrating our dev stations will allow the team to have a phenomenally more streamlined approach to their workflow, in addition to easier and faster testing of features and bug fixes. In response to the drastic amount of activity we experienced today, we will also be testing some temporary production servers in the cloud to try to take the load off of the in-house servers for the time being.
- An eventual full migration next year may well come with outages, lag, and likely a whole host of new problems, while we find our footing, and maybe even afterwards. The upside is that we'll have tremendously more awareness and control over everything.
- Realistically, it would be irresponsible to spend donators' money on new hardware that could be made redundant within months by cloud operations. In retrospect, I should have pushed for some kind of temporary solution to our known constraints.
- I am so desperately fed up with the sleepless nights, the surges of anxiety, the sense of impending doom, and all of the health issues that come with that. There is no need for this to be so difficult. No matter what happens, I am fully determined to have a whole new, vastly improved, and simplified infrastructure next year, in whatever form that takes.
I apologise for our incompetence, and very much hope the issues will subside, or that our team is able to find an effective solution as quickly as possible.
Last edited by Chedburn on 22:09:29 - 25/10/22