$120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
If there's an issue with the server while they're sick or on vacation, you just stop and wait.
If they take a new job, you need to find someone to take over or very quickly hire a replacement.
There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
Going on-prem like this is highly risky. It works well until the hardware starts developing problems or the person in charge gets a new job. The weeks and months lost to dealing with the server start to become a problem. The SRE team starts to get tired of having to do all of their work on weekends because they can't block active use during the week. Teams start complaining that they need to use cloud to keep their project moving forward.
> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
> If there's an issue with the server while they're sick or on vacation, you just stop and wait.
Very much depends on what you're doing, of course, but "you just stop and wait" for sickness/vacation sometimes is actually good enough uptime -- especially if it keeps costs down. I've had that role before... That said, it's usually better to have two or three people who know the systems (even if they're not full-time dedicated to them) to reduce the bus factor.
So the entire business was happy to go offline for two or three weeks whenever their infra person fancied going off on their summer holiday?
By doing this, you're guaranteeing a bus factor of 1. I can't think of any business that wouldn't see that as a completely unacceptable risk.
I never understand the drive to stay away from cloud services for small scale operations. It’s not your money that’s being spent on the cloud, but it is your free time being asked to be on call when you encourage your company to self-host!
Bus factor 1 is rarely enough for "entire business". But if the GPUs are for training models, and their users are the data scientists that are also on holiday around the same times - that might indeed be good enough policy.
Ouch, that is indeed a risk one must be wary of. Can be a "works for the company but sucks for employees". Which can also drain the company of skilled people, a poor trade in most cases.
If a business requires at least a quarter-million bucks' worth of hardware for basic operation yet can't pay the market rate for someone to operate it, maybe the fundamentals of that business aren't okay?
Companies following consultant reports will usually end up offering 50th-percentile ranges, which for SRE/SIE roles in major metros comes to around $163k. If they study BLS/FRED/CPI data and aim to pay someone enough for a 50/30/20 budget in a major metro at median rent, they’ll offer $175k to $200k+. If they want someone to stick around, buy an average home, and put down roots, it’s $210k+, minimum.
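For anyone unfamiliar with the 50/30/20 rule, the arithmetic behind figures like these can be sketched in a few lines. The rent, essentials, and tax inputs below are illustrative assumptions for the example, not sourced data:

```python
# Back-of-envelope 50/30/20 check: what gross salary keeps "needs"
# (rent plus other essentials) at 50% of take-home pay?
# All inputs are illustrative assumptions, not sourced figures.

def required_gross(monthly_rent, other_needs_monthly, effective_tax_rate):
    """Gross annual salary needed so that needs fit in 50% of take-home."""
    annual_needs = 12 * (monthly_rent + other_needs_monthly)
    # 50/30/20 rule: needs should be at most 50% of take-home pay
    takehome_needed = annual_needs / 0.50
    # gross up for an assumed effective tax burden
    return takehome_needed / (1 - effective_tax_rate)

# Example: $3,500 metro rent, $1,500/month other essentials,
# ~30% effective tax burden (assumed)
salary = required_gross(3500, 1500, 0.30)
print(round(salary))  # roughly $171k, in the same ballpark as the ranges above
```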
“Six figures” doesn’t cover essentials anymore for almost every major city in the USA, and the last thing you can afford to cheap out on is the labor supporting your IT infra. Every corner you cut today on TC (outsourcing, offshoring, consulting) is just letting fires rage until you either parachute out or everything burns down, and that’s not a game you can afford to play with critical business technologies.
I’m not disagreeing. I’m explaining to the commenter above that $120K isn’t going to cover the costs of a full-time SRE who will be on call 24/7.
If a business can’t afford a properly staffed crew with enough allowance to cover a rotation of on call duties and allow for vacations, they should prefer the managed cloud services.
You’re paying more but you’re buying freedom and flexibility.
> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one.
You can still use cloud for excess capacity when needed. E.g. use on-prem for base load, and spin up cloud instances for peaks in load.
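A hybrid setup like that can be as simple as a scheduler hook that watches the local queue and requests cloud capacity past a threshold. This is a minimal sketch; the slot counts and threshold are made-up numbers, and a real version would call an actual provider API:

```python
# Burst-to-cloud sketch: keep base load on-prem, request cloud
# capacity only when the local queue backs up. The constants below
# are illustrative assumptions, not a real deployment.

ONPREM_SLOTS = 8          # e.g. one 8xH100 box
BURST_THRESHOLD = 4       # queued jobs tolerated before bursting

def plan_capacity(queued_jobs, running_jobs):
    """Return how many cloud instances to request right now."""
    free_local = max(0, ONPREM_SLOTS - running_jobs)
    backlog = max(0, queued_jobs - free_local)
    if backlog <= BURST_THRESHOLD:
        return 0                      # base load fits on-prem
    # burst: one cloud instance per job beyond the tolerated backlog
    return backlog - BURST_THRESHOLD

print(plan_capacity(queued_jobs=3, running_jobs=8))   # 0: within threshold
print(plan_capacity(queued_jobs=10, running_jobs=8))  # 6: burst to cloud
```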
This is my favorite use of the public cloud: the modern-day “hot site”. It’s way cheaper to just pay reserved rates for failover instances of critical infra than a whole other unused site, assuming your particular compliance or regulatory frameworks allow it. Especially in an era of remote work, it’s highly practical and cost-effective.
> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
They come with a warranty, often with a technician guaranteed to arrive within a few hours or at most a day. Also, if SHTF, spinning up cloud capacity to cover the gap isn't hard.
And the other argument: every company I've ever known to do AWS has an AWS sysadmin (sorry, "devops"), same for Azure. Even for small deployments. And departments want their own person/team.
Out of all the comments on numbers, SREs, and scaling, here's the one response that actually meets numbers with numbers!
> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
Literally this. I can do SRE on-prem and cloud, and my 50/30/20 budget break-even point (as in, needs and savings but no wants - so 70%) is $170k before taxes. Rent is astonishingly high right now, and the sort of mid-career professional you want to handle SRE for your single DC is going to take $150k in this market before fucking off to the first $200k job they get.
Know your market, and pay accordingly. You cannot fuck around with SREs.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
This is less of an issue than you might think, but strongly dependent upon the quality of talent you’ve retained and the budget you’ve given them. Shitbox hardware or cheap-ass talent means you’ll need to double or triple up locally, but a quality candidate with discretion can easily be supported by a counterpart at another office or site, at least short-term. Ideally though, yeah, you’ll need two engineers to manage this stack, but AWS savings on even a modest (~700 VMs) estate will cover their TC inside of six months, generally.
> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
This strikes at another workload I neglected to mention, and one I highly recommend keeping in the public cloud: GPUs.
GPUs on-prem suck. Drivers are finicky, firmware is flaky, vendor support is inconsistent, and SR-IOV is a pain in the ass to manage at scale. They suck harder than HBAs, which I didn’t think was possible.
If you’re consuming GPUs 24x7 and can afford to support them on-prem, you’re definitely not here on HN killing time. For everyone else, tune your scaling controls on your cloud provider of choice to use what you need, when you need it, and accept the reality that hyperscalers are better suited for GPU workloads - for now.
> Going on-prem like this is highly risky.
Every transaction is risky, but the risk calculus for “static” (ADDS) or “stable” (ERP, HRIS, dev/test) work makes on-prem uniquely appealing when done right. Segment out your resources (resist the urge for HPC or HCI), build sensible redundancies (on-prem or in the cloud), and lean on workhorse products over newer, fancier platforms (bulletproof hypervisors instead of fragile K8s clusters), and you can make the move successful and sensible. The more cowboy you go with GPUs, K8s, or local Terraform, the more delicate your infra becomes on-prem - and thus the riskier it is to keep there.
> Out of all the comments on numbers, SREs, and scaling, you get the response for meeting numbers with numbers!
>> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
> Literally this. I can do SRE on-prem and cloud, and my 50/30/20 budget break-even point (as in, needs and savings but no wants - so 70%) is $170k before taxes. Rent is astonishingly high right now, and the sort of mid-career professional you want to handle SRE for your single DC is going to take $150k in this market before fucking off to the first $200k job they get.
That's $120k per pod. Four pods per rack at 50kW.
What universe are we living in that a single SRE can't manage even a single rack for less than half a million in total comp?
> What universe are we living in that a single SRE can't manage even a single rack for less than half a million in total comp?
The kind where TC isn’t measured per pod managed, but per person hired. Also the world where median rent in major metros is $3500 a month.
If you think $120k is rich, you’re either operating in the boonies, outside the USA/Canada, or incredibly out of touch with the cost of living today and need to seriously go study BLS/FRED/CPI data sets to understand how expensive it is to live right now.
Indeed, there's no reason for a company to host this kind of batch compute in North America. You can get very good people in Eastern Europe at 1/3 the cost.
I like how this simple claim about being cheaper to self-host a single server has now escalated to opening an office in Eastern Europe and hiring people there to manage it.
The trend of opening offices in Europe started one year into Covid. I'm sure that there are companies that haven't opened an office there yet, but fewer than one might imagine.
And somehow I have the impression that GPUs on Slurm/PBS could not be simpler.
You can use a VM for the head node; you don't even need the clustering, really, if you can accept taking 20 minutes to restore a VM. And the rest of the hardware is homogeneous - you set up one node right and the rest are identical.
And it's a cluster with a job queue - one node going down is not the end of the world.
OK, if you have PCIe GPUs you sometimes have to re-seat them, and that's a pain. Otherwise, if your H200 or disks fail, you just replace them, under warranty or not...
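For a rough sense of how little configuration that takes, here is a minimal Slurm GPU fragment along those lines; the node names, GPU counts, and device paths are placeholders, not a tested deployment:

```
# slurm.conf (fragment) -- one head node (can be a VM) plus
# homogeneous GPU nodes; names and counts are illustrative.
GresTypes=gpu
NodeName=gpu[01-04] Gres=gpu:h100:8 CPUs=96 RealMemory=1000000 State=UNKNOWN
PartitionName=gpu Nodes=gpu[01-04] Default=YES MaxTime=INFINITE State=UP

# gres.conf on each GPU node, mapping the GRES entries to devices
Name=gpu Type=h100 File=/dev/nvidia[0-7]
```

Because every GPU node is identical, the same two files can be copied to each node, which is what makes the "set up one right and the rest are identical" approach work.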
That sounds way easier than the methods I’ve had to manage GPUs in the Enterprise on-prem thus far (PCIe cards slotted into hypervisor boxes and shared via SR-IOV). I’ll have to look into it, but I doubt it’ll ever enter my personal wheelhouse given how quickly GPU-based workloads are either moved to the cloud for effective utilization at scale, or onto custom accelerators for edge workloads/inference.