Hi everyone. I’m Davz, and I am responsible for the TORN infrastructure. Managing this – and trying to eliminate problems – takes a substantial amount of my time, and as we have recently made some changes to our hardware I thought you may be interested in the infrastructure we run. This post covers the main components of our infrastructure, the primary open source software that we use, and the high level design.
When you hit www.torn.com you reach one of our two powerful frontends. These servers run haproxy to manage connections to the application servers, stunnel to terminate SSL connections to torn.com, and also act as Python application servers for some site features such as Chat.
The vast majority of the TORN code is written in PHP, executed by a patched version of PHP running behind nginx. There are approximately 10 web servers running this configuration, and haproxy on the frontends uses a cookie to send each client back to the same backend. In the case of a backend failure, clients are directed to a working webserver – although their session will be reset.
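As a rough illustration of how this sort of cookie-based stickiness is configured in haproxy (the server names, addresses, and cookie name below are made up for the example – this is not our actual config):

```
backend php_servers
    balance roundrobin
    # Insert a cookie naming the chosen server; subsequent requests
    # carrying that cookie are routed back to the same server.
    cookie SERVERID insert indirect nocache
    server web01 10.0.0.11:80 cookie web01 check
    server web02 10.0.0.12:80 cookie web02 check
    # If a server fails its health check, clients fall through to
    # another server and receive a new cookie (resetting the session).
```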
There is a problem that we have been unable to conclusively track down which causes clients to have this cookie reset from time to time (generally when refreshing the same page a lot). Some work has gone into fixing this – various attempts have been made at a shared session store, all of which have failed under specific conditions – and we are waiting on new hardware to have another go.
Much of the data used by TORN is stored in a set of MySQL databases. The data is ‘sharded’ by type across multiple pools, each of which runs MySQL multi master replication. We are using MMM to manage the VIPs. Backups are taken using xtrabackup and are stored on a local NFS server (Nexenta). The transaction log is recovered, then the backups are compressed, encrypted and sent to Amazon S3. A separate job fires at regular intervals and launches an EC2 virtual machine in the same region to download, decrypt and decompress these backups – and import them into a MySQL server to make sure there are no errors. We store backup metadata (such as confirmation that a backup has been verified) in Amazon SimpleDB, and have a simple dashboard that shows verified backups and their sizes (plotted on a graph to detect anomalies).
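The compress → encrypt → upload → verify → record-metadata flow can be sketched roughly as follows. This is an illustrative outline, not our actual tooling: the XOR "cipher" stands in for real encryption, and the S3 upload and SimpleDB write are replaced by a local metadata dict.

```python
import gzip
import hashlib

def compress(data: bytes) -> bytes:
    return gzip.compress(data)

def encrypt(data: bytes, key: bytes) -> bytes:
    # Placeholder only: a real pipeline would use a proper cipher.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

decrypt = encrypt  # XOR is symmetric

def verify(original: bytes, blob: bytes, key: bytes) -> dict:
    # The verification job downloads, decrypts and decompresses the
    # backup, then confirms it matches; the resulting metadata is the
    # sort of record we keep (verified flag, size for anomaly graphs).
    restored = gzip.decompress(decrypt(blob, key))
    return {
        "verified": restored == original,
        "size": len(blob),
        "sha256": hashlib.sha256(blob).hexdigest(),
    }

backup = b"-- MySQL dump (stand-in for an xtrabackup archive) --"
key = b"secret"
blob = encrypt(compress(backup), key)   # what would be sent to S3
meta = verify(backup, blob, key)        # what would be stored in SimpleDB
```

The point of the separate verification machine is that a backup is only trusted once it has actually been restored; the metadata record is what the dashboard reads.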
Some data makes more sense to store in a NoSQL format, and we manage a highly available MongoDB replica set for it. This includes several hundred gigabytes of archived personal stats data. This data is backed up and verified in the same way.
We use Amazon Web Services for hosting content that needs to be shared across all servers (e.g. uploaded user images) and some critical redundant services (e.g. DNS).
All configuration is managed with puppet; we have a custom repository for software that we patch (our base OS is CentOS 6) and use the Fedora Cobbler project for provisioning. We use virtual machines for development environments and a small number of management machines.
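As a sketch of what this looks like in practice (the class and resource names here are invented for illustration, not taken from our manifests), a puppet-managed webserver role might be as simple as:

```
# Illustrative only -- module and resource names are made up.
class torn::webserver {
  package { 'nginx':
    ensure => installed,
  }
  service { 'nginx':
    ensure  => running,
    enable  => true,
    require => Package['nginx'],
  }
}
```

Keeping every machine's state in manifests like this is what lets Cobbler-provisioned hosts converge to a working configuration without manual setup.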
We store code in a hosted git repository which gives all developers suitable access.
We use a hosted third party monitoring tool both for device monitoring (e.g. graphing CPU usage) and service monitoring (e.g. checking that www.torn.com is still working). We also use a hosted log system which is connected to alerting.