> The question: how does OpenTTD’s infrastructure look, or even: why is it so complex, is a rather complicated question to answer in a few words. [...]
Was waiting for them to get to the "why is it so complex" part, but after all the details of Cloudflare Pages, Cloudflare R2, Cloudflare Workers, Cloudflare Access, EC2 instances, multiple CDNs, hosted Redis, Nomad, Pulumi, a web of proxies and APIs and front doors, a dozen microservices, and an IaC repo to make sense of all of this, it came down to:
> In total, we store over 150GiB of data, transfer over 6TiB of data monthly, have more than 10M requests a month, and serve thousands of unique visitors every week.
Basically my MacBook Pro from 2019 could host all their infra and data and serve the entire load (~3 RPS) with room to spare for my day-to-day work.
For anyone else who is reading the post looking to get inspired – ignore everything they did and start small. A single web service to handle all business logic, hosted on two rented VPS instances which split traffic. Data stored in MySQL or Postgres with regular backups. Start scaling only when the load from this setup overwhelms you (and I can guarantee it won't in 99.9% of cases, including the one in this post).
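For what it's worth, the "two VPS instances which split traffic" part is often nothing more exotic than DNS round-robin. A sketch of the zone entries, using placeholder documentation addresses (203.0.113.0/24) rather than anything real:

```
; Hypothetical zone fragment: round-robin across the two VPSes
example.org.  300  IN  A  203.0.113.10   ; vps-1
example.org.  300  IN  A  203.0.113.20   ; vps-2
```

Resolvers rotate between the two A records, so each box sees roughly half the traffic; if one dies, clients retry and land on the other. No load balancer to run, at the cost of slower failover than a real LB.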
Everybody wants to LARP like they're operating at Google scale. I've even heard people refer to 1 GB CSV files as "big data"; a file that size fits in the RAM of my laptop from 20 years ago!
But I've stopped fighting it when I have no stake in it. If it's good for their CV and they can afford the expense, then good for them, I guess. It's not my problem. Maybe it's problematic from an environmental standpoint, but the same could be said of many hobbies.
It kinda makes more sense for OpenTTD. The game is famous for creating huge complicated networks for moving cargo and passengers, it is natural that the devs gravitate towards huge complicated computer networks too.
>> In total, we store over 150GiB of data, transfer over 6TiB of data monthly, have more than 10M requests a month, and serve thousands of unique visitors every week.
> Basically my MacBook Pro from 2019 could host all their infra and data and serve the entire load (~3 RPS) with room to spare for my day-to-day work.
Yeah, their requirements remind me of somebody on a bittorrent forum describing their minimum acceptable seedbox.
Hmm, the web services interact with OpenTTD the game/application, across a few versions, so it's understandable to me that their infrastructure is complex.
> To keep the AWS infrastructure as cheap as possible, we wanted to avoid needing a NAT gateway: if you use IPv4, you need something that allows you to talk with the outside world. On AWS you do this by installing NAT gateways. Sadly, those are (relatively speaking) rather expensive. So instead, we run as much as we can IPv6-only.
If the OpenTTD folks are reading this: this is not true. You can assign public IP addresses to your t4g instances by using a public subnet in the VPC they are deployed to. They will correctly reach WAN addresses via their public interface and you will not need a NAT gateway. This incurs no additional cost. It sounds like your current instances were deployed to a private subnet.
You will not need any additional configuration on Amazon Linux 2 or the Ubuntu images those instances launch with via the Launch Wizard - neither in the Linux firewall nor in the security group configuration.
This will greatly simplify your configuration.
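Concretely, the change is two AWS CLI calls - a sketch with placeholder resource IDs (the subnet, route table, and internet gateway IDs here are made up; substitute your own):

```
# Make the subnet hand out public IPv4 addresses at launch
aws ec2 modify-subnet-attribute \
    --subnet-id subnet-0123456789abcdef0 \
    --map-public-ip-on-launch

# Route the subnet's 0.0.0.0/0 traffic through an internet gateway,
# not a NAT gateway -- this is what makes the subnet "public"
aws ec2 create-route \
    --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 0.0.0.0/0 \
    --gateway-id igw-0123456789abcdef0
```

Instances launched into that subnet then talk to the WAN via their own public interface, with security groups still filtering inbound traffic.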
> Nomad is similar to Kubernetes, AWS ECS, AWS EKS, Azure AKS, etc, but a bit simpler to work with in a day-to-day... For this we run nginx on all clients in the cluster. Nginx uses Nomad’s service discovery information to forward traffic to the right instance.
You are reinventing Kubernetes. This is okay. At the time that these decisions were made, ChatGPT 4 didn't exist - nowadays, if you want, you can "just" ask for Kubernetes manifests, and you will get correct ones, and you will see the light. The keywords for what you should ask for are `flux` and `eksctl`. You can create your Kubernetes cluster with one file & one command with `eksctl`, then `flux bootstrap` a Git repo that will contain all your YAML files describing your application.
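To make the "one file & one command" claim concrete, here's a hypothetical minimal `eksctl` cluster spec - the name, region, and instance type are invented for illustration, not taken from OpenTTD's setup:

```yaml
# cluster.yaml -- create with: eksctl create cluster -f cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: openttd-like
  region: eu-west-1
managedNodeGroups:
  - name: workers
    instanceType: t4g.small
    desiredCapacity: 2
```

After that, `flux bootstrap github --owner=<org> --repository=<infra-repo> --path=clusters/prod` wires the cluster to a Git repo holding your application manifests, and from then on pushing YAML is the whole deploy story.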
Another perspective: to fix your NAT gateway mistake, you have to be pretty familiar with AWS and Cloudflare networking details. If you used eksctl, you wouldn't need to be.
People have been running reliable and performant multiplayer game servers for decades on potato-quality hardware. It isn't some insurmountable problem that only became possible in the cloud computing age. Heck, popular Minecraft servers get orders of magnitude more traffic than OpenTTD, and they all run on things like DigitalOcean droplets and $5 VPSes. You don't need relationships with account managers at AWS and Cloudflare and an alphabet soup of proprietary products to serve a few thousand gamers playing a 30-year-old game.
> If the OpenTTD folks are reading this: this is not true. You can assign public IP addresses to your t4g instances by using a public subnet in the VPC they are deployed to. They will correctly reach WAN addresses via their public interface and you will not need a NAT gateway.
Ah, but putting an EC2 in a public subnet is not "The Way"!
We've let AWS convince us to not put instances in public subnets so that they can make money hand-over-fist on NAT Gateways, which are WAY too expensive for what they are and do.
Keeping your servers on the public network is not a good idea for a variety of reasons: security, cost, control, access, and compliance.
AWS makes money; it's a trade-off. You can just as well put up an EC2 instance as a NAT that can auto-scale if you need to give your servers outbound access, or only attach one during updates, etc.
> hosted on two rented VPS instances which split traffic
I would push back here. I think even this is ridiculous complexity in 2023 if you just need to serve a webapp or API. Managing VMs is a mistake now. Serverless is an extreme competitive advantage when you are small and trying to stay focused on the customers. I don't have time to babysit self-serving technological curiosities as we try to ramp. There definitely isn't any money in pet problems. There is also a benefit to noob developers - if you constrain yourself to shipping serverless function code, you can't possibly be tempted and fall into weird infra rabbit holes that rob you of your ability to deliver near-term value and learn about practical software work.
Our next-gen architecture consists of exactly 2 things: Azure SQL Database Hyperscale and Azure Functions. We deliver server-rendered HTML directly to the client from the HTTP trigger functions, and they in turn connect directly to the DB. I almost suspect Microsoft doesn't like us doing it this way (e.g. mandatory URL route prefixes). But, too bad for them - we worked around it, and our hosting model sidesteps the complexity circus everyone else employs. That's it. We are about to have a zero-VM cloud infrastructure, and someone with 2 days' worth of YouTube training could become semi-effective at monitoring it all.
Become the cockroach of technology users. Use the barest subset of what is needed to get the job done, but do it in a clever way. Infest someone's cloud so you don't have to screw with boring things like compliance, audits, power supply replacements, deployments, etc. Use their products in ways that feel manipulative, but are still strongly within the lines of ToS and the barycenter of the overall crowd. Stay away from bleeding edge technology.
Imagine if you literally only had to push code (aka not declarative infra) to GitHub and pay one cloud bill. Certs, networking, patching, backups, monitoring, recovery, scaling up/out, etc all completely handled for you. Why wouldn't you want your life to be this easy? Is it because "fuck Microsoft[0]" that we continue to dig the technological equivalents of ditches by hand all day? Are we just poorly incentivized?
[0] To be clear, you could replace "Microsoft" with "Cloudflare" or a number of other hyperscalers. I am tempted to play around with the CF offerings again. I like the idea of an "emergency backup vendor" if things ever get spicy with Microsoft.
I would definitely not call serving HTML from Azure Functions next-gen.
You are increasing response time by 20-30ms just by using HTTP trigger functions instead of plain ASP.NET (lots of abstractions and a gRPC channel to worker processes). Beyond that you will get long cold starts, no response streaming, and slow scaling. The cost of the consumption-based plan will skyrocket after a small number of requests compared to serving from a dedicated App Service.
We are running on a dedicated/isolated (ASEv3) plan right now. It's incredibly fast & responsive. You can run a SSR web app directly on top of this kind of configuration no problem.
I am not sure what you are referring to with gRPC channels and worker processes.
Since you're using Azure App Service Environment anyway, why bend Azure Functions to fit your web application use case rather than simply deploying ASP.NET Core on Azure App Service?
> why bend Azure Functions to fit your web application use case
Because it is even simpler than screwing with ASP.NET Core if all you need is a simple SSR webforms experience (which I'd argue is most of us). I don't even have to worry about a startup.cs file or play games with the types of serializer or pipeline features I want to use. The only thing I have to do is take a http trigger method dep on ClaimsPrincipal and turn on AAD auth in the portal to get all my hard shit taken care of.
Why not return final text/html from a function that would otherwise return application/json to some cartoonish client-side complexity? Why add some API/web proxy middleware bullshit that serves no value? I am not buying the narrative anymore. This works, it is simple, it is fast. HTTP in, HTTP out. Microsoft would have to go so far out of their way to make this not work that it would look absolutely comical. "text/html is banned return content type because we are capricious assholes".
It takes me 5 minutes to spin up a C#/V4 function webapp that is mapped via GH actions to a multi-stage Dev/Qa/Test/Prod deployment pipeline with certs, monitoring, backups, DR, compliance, etc all handled by default.
Can you get your LAMP stack to pass a PCI-DSS audit after 1 hour of work? I can make contractual guarantees that my stack is going to pass those audits before I log into my PC for the day.
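The "HTTP in, HTTP out" shape being described - the server produces final `text/html`, no API layer, no client-side rendering - can be sketched with nothing but the Python standard library. This is an illustration of the model, not Azure Functions itself; the handler and HTML are made up:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_home() -> str:
    # Server-side "render": the final HTML is produced right here;
    # no JSON API, no client-side framework in between
    return "<!doctype html><html><body><h1>Hello</h1></body></html>"

class SSRHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_home().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

# To serve for real:
# HTTPServer(("127.0.0.1", 8080), SSRHandler).serve_forever()
```

An HTTP trigger function is this same request-in, HTML-out cycle with the server loop, TLS, scaling, and auth handled by the platform instead of by you.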
Complexity is a weird thing. It is really hard to understand the whole picture when you only look at the technology through a narrow perspective like "what brand is my database engine".
I am sure you have seen other developers try to get this whole pipeline working on their dev box then.
Look, I have been doing things the way you are describing in the industry for 10 years now. Most of the time, the shop really really really does not need it. It is not difficult to set up a regular backup and a cert.
> And if your MBP were to die today? Or if a burst of traffic came through? Or if you closed the lid and forgot to turn Caffeinate on?
So ... two macbooks then?
> Surely you can understand why people don't routinely host all their project's infrastructure on their laptops, even if the technical specs are enough.
I think the poster was simply highlighting that 150GiB of storage, 6TiB of monthly transfer, and around 4 requests per second from an average of maybe[1] 26 users per minute might not necessarily need a k8s cluster and 8 different cloud services.
[1] I'm taking "thousands of users" to mean "up to 19000", otherwise they would have said tens of thousands, dozens of thousands, etc.
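The "around 4 requests per second" figure falls straight out of the numbers quoted in the post:

```python
# Back-of-envelope load from the quoted "10M requests a month"
requests_per_month = 10_000_000
seconds_per_month = 30 * 24 * 3600          # ~2.59 million seconds

rps = requests_per_month / seconds_per_month
print(f"{rps:.1f} requests/second on average")  # → 3.9 requests/second on average
```

Peaks will be spikier than the average, of course, but even 10-20x this is comfortably within one modest machine.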
But even then, you’re just pushing the cost onto the person maintaining the system.
Cloudflare R2 => Now you have to bootstrap and maintain your own high-availability MinIO cluster. On multiple servers in multiple data centers for redundancy, of course.
Cloudflare Workers/Pages => Now you have to maintain your own compute runner (granted, could be as simple as a docker container but that still both requires work to set up and transition over as well as maintain over time) and load balancer (once again, just Nginx or Apache but that requires setup and maintenance) to execute and serve this content.
Cloudflare Access => Now you have to maintain your own access control system like ory.
Cloudflare Tunnels => If you’re only running one node, this isn’t needed, so congrats I guess. You’d still need to provide internal access if you have multiple nodes in a HA environment, though.
AWS EC2 => Now you need to maintain your own VMs.
Etc.
This is a volunteer project. Having someone maintain all these things may not be even remotely practical.
It’s all the rage to hate on cloud services in 2023 but they abstract away a lot of operational work and that’s not something to be blindly discounted.
The more niche something is, the more vocal the members. So, yes. People would probably notice.
But also, the point was that a MacBook (or single server) can just replace the infrastructure defined in the article. That is clearly not the case if you’re suddenly losing uptime. That is a material deterioration, any way you put it.
And even if you didn’t choose to go the HA route, you still need to set up and maintain the server and all the things running on it. And fix issues when they come up. Choosing to have downtime does not magically make any of what I said go away; it at most eases the burden slightly.
Nobody cares about 99.999% uptime for an open source game. 99.999% of companies don't need sub-second convergence times in their networks either. A couple of minutes of downtime is a non-issue.
Totally agree. And as a point of reference, even professional services with significantly stronger uptime targets fail to meet these unrealistic expectations. For an open source game with volunteer resources, it is not reasonable to expect anything more.
Edit: I would go so far as to say they could even have a weekly “patch Wednesday” where the expectation is set that there would be an hour of downtime.
Oh yes, a random person on HN says something is a non-issue so their word must be gospel. Definitely.
I explained exactly what needs to be maintained and why that’s a lot of work. You are free to disagree, but just saying “trust me bro” is not a compelling argument.
Have you ever maintained your own server? Or even tried? It’s not nearly as easy as you clearly think it is.
Oh, I remember you. This amount of hostility to the world is difficult not to notice. So at least you're not a rando here. Clearly a name to remember.
> It’s not nearly as easy as you clearly think it is.
Dunno, I'm 20+ years in the industry. It's not easy, though it's just a skill. You can hire a person who would do that for you. It's not necessary to make it look like a wizardry. It's just a skill.
I offer counseling if you're struggling with the basics. For you personally it would be 200% extra though. I don't like unnecessary rude people.
If you think hiring a person to manage this is cheaper than the setup they have, or even remotely reasonable for a volunteer project, then there’s nothing more to discuss.
That is in no way shape or form grounded in reality.
I think I know a bit more about this particular problem than a (senior) software developer. Infra management is my profession, which I've been doing for decades. The conversation might have been more interesting if you hadn't been so rude initially.
This conversation might have been more interesting if your point was anything but "it's easy, trust me bro", which you continue to do.
Waving credentials around and proposing solutions not remotely feasible for the problem at hand generally does not sway people nor provide for interesting discussion.
It's an open source game run by volunteers, played by enthusiasts.
HA is not necessary. Having 5m downtime once every 4 years while they switch between the primary and the backup droplet because something went wrong is okay for a game!
The majority of folks running solutions that result in effectively 0% downtime are vastly overpaying for what they get.
I think the point is that most of those requirements could be met with a $60/month Hetzner server containing a 512 GB RAID array. I think that comes with 100 GB of free backup, and if you pay another 10 bucks or so you can upgrade to a terabyte. But it's been a while since I checked those costs.
It doesn't have to be _your_ MacBook; it could be someone else's, like a VPS somewhere with 99.9% uptime. That would be more than enough to serve 150GB of static data + a forum. You don't need to involve Amazon and Cloudflare for a '90s game.
Note that the forum is a separate entity from the OpenTTD project (though there is some overlap in people), and while OpenTTD is the most commonly played variety these days, the forum was started to discuss TTD and TTDPatch.
While I'm not so confident that it's a given that performance or reliability increase after a system is distributed, let's assume that it's true.
This turns my question into: does OpenTTD need more performance or reliability for its website?
I can kind of see how DDOS protection might be useful, but... I don't protect my stuff against DDOS: the loss of service is nullified by the effort and risk required to set it up and maintain. What would that calculation look like for a random forum?
>While I'm not so confident that it's a given that performance or reliability increase after a system is distributed, let's assume that it's true.
I think you can unequivocally agree that a distributed service that's designed to be fault tolerant is going to be more reliable than your MacBook Pro sitting in your closet on your home Internet connection.
>I can kind of see how DDOS protection might be useful, but... I don't protect my stuff against DDOS
If you don't care about your stuff going down, then it doesn't matter - but then comparing it to a setup where uptime is a feature isn't a fair comparison.
Even if you hosted this in your own closet on your MacBook Pro, OpenTTD's setup is still somewhat competitive. You might say "oh, I get all that for free", but a MacBook Pro costs money, your home internet costs money (and most consumer ISPs will push back if you do more than 1 TB up a month - this is the cue for everybody who wants to rave about how great their Internet is to be contrarian below), and you're paying for electricity and rent. Even hosting on a MacBook Pro only looks free because it piggybacks on expenses you already have; it's still not _free_.
I'm not sure even something ludicrous like a 10s latency would actually be a problem for OpenTTD. The uses of their hosting are:
1. The web page you're reading
2. Mod downloading and server listings. Sure, less latency would be nice, but is it vital?
In particular, if you join a multiplayer game, that's the end of their server involvement, they're not hosting the multiplayer servers, so it won't have a gameplay effect. Add on to that that a lot of their players get the game from Steam, linux package repositories, or JGR's github releases page anyway.
I just went to new.reddit.com and it's 9s for a full render. YouTube is 5s. Given openttd.org renders in a single request unlike those two sites, it actually wouldn't be that much of an outlier.
I wouldn't hold either of those up as good counter examples: 1) Single Page Apps vs. mostly static HTML 2) Not great exemplars of quickly loading pages 3) Your bandwidth might be bad- I was able to load youtube.com in <2s.
And yet they're some of the most popular web sites on the internet, so does that not indicate that they do so despite the performance and therefore users find their performance at least acceptable?
> 3) Your bandwidth might be bad- I was able to load youtube.com in
Good for you? I'm on a 2 Gbit/s fiber connection with a 3ms ping to youtube.com. I am, however, likely located further away from the actual servers doing the processing (while ping only measures the response from their edge servers).
You're right, they don't. The user experience is that they're front-loading their wait time loading the app, and aren't doing a whole request/response cycle for every action they want to take. The end result is the net wait for their active session being lower.
>so does that not indicate that they do so despite the performance
It indicates they're trying to maximize features, rather than minimize load time.
>I'm on a 2GBit/s fiber connection with 3ms ping to youtube.com
Cool maybe your computer is slow then. Sample size of 1, anecdata, etc.
I'm sure it is a reliable system, but I'm willing to bet that two $10/mo VPS instances splitting the load would be equally reliable and still be able to serve requests worldwide with sub 200ms latency.
Having played (and lost countless hours) Transport Tycoon Deluxe in the 90s and then OpenTTD 10 or 15 years ago, I'm just tickled that this game adaptation is still going strong.
Chris Sawyer's ability to create addictive building games that remain fun to play long after their contemporaries have ended up in the dustbin of history is superhuman, in my humble opinion. Add to the fact that he did it all in Assembly, and it's hard not to place his achievements on a bit of a pedestal.
I would say RollerCoaster Tycoon is his best design. It came out just before 3D became mandatory for games for some reason, and before game development stopped being a one-man job. I mean, at the same time Sim City was actually good too.
These are much more recent games; things have bounced back and there's loads of 2D games today, but starting around the year ~2000 there was a decade or so everything had to be 3D, which often meant "3D FPS shooter", whether it made sense or not.
This often meant a regression in graphics quality (e.g. Baldur's Gate 2 vs. Neverwinter Nights) and frequently a regression in gameplay too (usually due to horrible controls and/or camera, something like Monkey Island 4 is a good example).
They could probably host this on a $60 Hetzner dedi, with maybe a $5 VPS from some other provider in case the big one goes down, at least to host the website and possibly a status page from two locations.
That can easily handle TiBs of traffic (say, 20+ TiB/month) and a few thousand requests a second (depending on the kind of request), and you can easily parallelize a lot of this by running your backend in a few Docker containers on the dedi.
That's, say, $70 including the cost for a domain and such, and all it requires in maintenance is regular system updates.
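Those transfer volumes are easy to sanity-check as sustained bandwidth - both the article's 6 TiB/month and the 20 TiB/month figure above:

```python
# Monthly transfer volume expressed as average sustained bandwidth
TIB = 2**40
seconds_per_month = 30 * 24 * 3600

for tib_per_month in (6, 20):
    mbit_s = tib_per_month * TIB * 8 / seconds_per_month / 1e6
    print(f"{tib_per_month} TiB/month ≈ {mbit_s:.0f} Mbit/s sustained")
# → 6 TiB/month ≈ 20 Mbit/s sustained
# → 20 TiB/month ≈ 68 Mbit/s sustained
```

Both are a rounding error on the gigabit uplink that comes standard with a Hetzner dedicated box, even allowing generous headroom for peaks.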
Auction boxes on Hetzner often come with >1TiB of SSD storage, if you're so inclined.
Then again, I'm the kind of guy to play OpenTTD by making a few simple networks and watching the thing generate stable revenue, and then messing around with building something else, so maybe the mentality is just different.
Running anything you care about on Hetzner is a needless risk. There's numerous other discount providers who don't abruptly turn off traffic to your box or close your account the moment they see something they don't understand.
< 3500 USD/year (that was the cost of the previous iteration on AWS, and they mentioned one reason they moved to the current setup was because it was cheaper)
I don't even know if you'd need two servers. Spin up a few Docker containers on the 512 GB RAID array and you could serve the forum and mods on the same machine. It's not like OpenTTD has the user demands of Reddit or Huggingface.
This sounds like practically a textbook example of how to run a small-ish hosted service, and I like that they made more interesting choices with Pulumi and Nomad vs. the "nobody ever got fired for buying IBM" picks of Terraform and Kubernetes. If I needed to build the infrastructure of an early (or even mid, depending on the complexity) startup, I'd practically use this as a playbook.
> This means that if people want to play nasty and find issues in our services, they first need to bypass Cloudflare’s WAF. And this is not an easy thing to do.
Is it not the case that you just need to use their IP address to bypass essentially 100% of what Cloudflare offers?
I guess a targeted attack is hard (how do you find the "real" IP?), but there may be speculative attacks just scanning through IP ranges.
You default-deny all source IPs, then allowlist your CDN's IPs on your "origin server" or its network's firewall box (if you have such a thing). Is the usual way to solve this problem, anyway, IDK if that's what they're doing.
Then it doesn't matter if someone finds the IP of the actual server. Worst they can do is flood you with instantly-dropped connection attempts, but not probe services or run up your server hosting bill with large data transfers or anything like that. Scans won't find listening ports.
You can also set up TLS client authentication as a more complicated but a bit more assured method of refusing connections from anyone other than Cloudflare.
- accept traffic only from the published Cloudflare IP ranges
- connect out to Cloudflare rather than accept inbound traffic (Argo Tunnel)
The first one unfortunately doesn't protect you from someone scanning from the Cloudflare ranges themselves. You can add a custom header in that case, so that any traffic without the shared secret is rejected.
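Both halves of that fit in a few lines of nginx server config. A sketch, assuming the shared-secret header is injected on the Cloudflare side (e.g. via a Transform Rule); the two CIDRs shown are from Cloudflare's published list but are only a sample - the full, current set lives at cloudflare.com/ips:

```nginx
# (a) only accept traffic from Cloudflare's ranges (sample, not the full list)
allow 173.245.48.0/20;
allow 103.21.244.0/22;
deny  all;

# (b) also require a shared-secret header, so a request that merely
# *originates inside* Cloudflare's ranges still gets rejected
if ($http_x_origin_secret != "REPLACE_WITH_LONG_RANDOM_VALUE") {
    return 403;
}
```

The header name `X-Origin-Secret` is invented for this sketch; any name works as long as the CDN-side rule and the origin check agree.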