$120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
If there's an issue with the server while they're sick or on vacation, you just stop and wait.
If they take a new job, you need to find someone to take over or very quickly hire a replacement.
There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
Going on-prem like this is highly risky. It works well until the hardware starts developing problems or the person in charge gets a new job. The weeks and months lost to dealing with the server start to become a problem. The SRE team starts to get tired of having to do all of their work on weekends because they can't block active use during the week. Teams start complaining that they need to use cloud to keep their project moving forward.
> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
> If there's an issue with the server while they're sick or on vacation, you just stop and wait.
Very much depends on what you're doing, of course, but "you just stop and wait" for sickness/vacation sometimes is actually good enough uptime -- especially if it keeps costs down. I've had that role before... That said, it's usually better to have two or three people who know the systems (even if they're not full-time dedicated to them) to reduce the bus factor.
So the entire business was happy to go offline for two or three weeks whenever their infra person fancied going off on their summer holiday?
By doing this, you're guaranteeing a bus factor of 1. I can't think of any business that wouldn't see that as a completely unacceptable risk.
I never understand the drive to stay away from cloud services for small scale operations. It’s not your money that’s being spent on the cloud, but it is your free time being asked to be on call when you encourage your company to self-host!
Bus factor 1 is rarely enough for "entire business". But if the GPUs are for training models, and their users are the data scientists that are also on holiday around the same times - that might indeed be good enough policy.
Ouch, that is indeed a risk one must be wary of. Can be a "works for the company but sucks for employees". Which can also drain the company of skilled people, a poor trade in most cases.
If a business requires at least a quarter-million bucks' worth of hardware for basic operation yet can't pay the market rate for someone to operate it, maybe the fundamentals of that business aren't okay?
Companies following consultant reports will usually end up offering 50th-percentile ranges, which for SRE/SIE roles in major metros comes to around $163k. If they study BLS/FRED/CPI data and aim to pay someone enough for a 50/30/20 budget in a major metro at median rent, they’ll offer $175k to $200k+. If they want someone to stick around, buy an average home, and put down roots, it’s $210k+, minimum.
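For anyone unfamiliar with the 50/30/20 rule, the arithmetic behind figures like these can be sketched in a few lines. The rent, essentials, and tax inputs below are illustrative assumptions for the example, not sourced data:

```python
# Back-of-envelope 50/30/20 check: what gross salary keeps "needs"
# (rent plus other essentials) at 50% of take-home pay?
# All inputs are illustrative assumptions, not sourced figures.

def required_gross(monthly_rent, other_needs_monthly, effective_tax_rate):
    """Gross annual salary needed so that needs fit in 50% of take-home."""
    annual_needs = 12 * (monthly_rent + other_needs_monthly)
    # 50/30/20 rule: needs should be at most 50% of take-home pay
    takehome_needed = annual_needs / 0.50
    # gross up for an assumed effective tax burden
    return takehome_needed / (1 - effective_tax_rate)

# Example: $3,500 metro rent, $1,500/month other essentials,
# ~30% effective tax burden (assumed)
salary = required_gross(3500, 1500, 0.30)
print(round(salary))  # roughly $171k, in the same ballpark as the ranges above
```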
“Six figures” doesn’t cover essentials anymore for almost every major city in the USA, and the last thing you can afford to cheap out on is the labor supporting your IT infra. Every corner you cut today on TC (outsourcing, offshoring, consulting) is just letting fires rage until you either parachute out or everything burns down, and that’s not a game you can afford to play with critical business technologies.
I’m not disagreeing. I’m explaining to the commenter above that $120K isn’t going to cover the costs of a full-time SRE who will be on call 24/7.
If a business can’t afford a properly staffed crew with enough allowance to cover a rotation of on call duties and allow for vacations, they should prefer the managed cloud services.
You’re paying more but you’re buying freedom and flexibility.
> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one.
You can still use cloud for excess capacity when needed. E.g. use on-prem for base load, and spin up cloud instances for peaks in load.
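A hybrid setup like that can be as simple as a scheduler hook that watches the local queue and requests cloud capacity past a threshold. This is a minimal sketch; the slot counts and threshold are made-up numbers, and a real version would call an actual provider API:

```python
# Burst-to-cloud sketch: keep base load on-prem, request cloud
# capacity only when the local queue backs up. The constants below
# are illustrative assumptions, not a real deployment.

ONPREM_SLOTS = 8          # e.g. one 8xH100 box
BURST_THRESHOLD = 4       # queued jobs tolerated before bursting

def plan_capacity(queued_jobs, running_jobs):
    """Return how many cloud instances to request right now."""
    free_local = max(0, ONPREM_SLOTS - running_jobs)
    backlog = max(0, queued_jobs - free_local)
    if backlog <= BURST_THRESHOLD:
        return 0                      # base load fits on-prem
    # burst: one cloud instance per job beyond the tolerated backlog
    return backlog - BURST_THRESHOLD

print(plan_capacity(queued_jobs=3, running_jobs=8))   # 0: within threshold
print(plan_capacity(queued_jobs=10, running_jobs=8))  # 6: burst to cloud
```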
This is my favorite use of the public cloud: the modern-day “hot site”. It’s way cheaper to just pay reserved rates for failover instances of critical infra than a whole other unused site, assuming your particular compliance or regulatory frameworks allow it. Especially in an era of remote work, it’s highly practical and cost-effective.
> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
They come with a warranty, often with a technician guaranteed to arrive within a few hours or at most a day. Also, if SHTF, spinning up cloud capacity to cover the gap isn't hard.
And the other argument: every company I've ever known to do AWS has an AWS sysadmin (sorry, "devops"), same for Azure. Even for small deployments. And departments want their own person/team.
Out of all the comments on numbers, SREs, and scaling, here's the one response that actually meets numbers with numbers!
> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
Literally this. I can do SRE on-prem and cloud, and my 50/30/20 budget break-even point (as in, needs and savings but no wants - so 70%) is $170k before taxes. Rent is astonishingly high right now, and the sort of mid-career professional you want to handle SRE for your single DC is going to take $150k in this market before fucking off to the first $200k job they get.
Know your market, and pay accordingly. You cannot fuck around with SREs.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
This is less of an issue than you might think, but strongly dependent upon the quality of talent you’ve retained and the budget you’ve given them. Shitbox hardware or cheap-ass talent means you’ll need to double or triple up locally, but a quality candidate with discretion can easily be supported by a counterpart at another office or site, at least short-term. Ideally though, yeah, you’ll need two engineers to manage this stack, but AWS savings on even a modest (~700 VMs) estate will cover their TC inside of six months, generally.
> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
This strikes at another workload I neglected to mention, and one I highly recommend keeping in the public cloud: GPUs.
GPUs on-prem suck. Drivers are finicky, firmware is flaky, vendor support is inconsistent, and SR-IOV is a pain in the ass to manage at scale. They suck harder than HBAs, which I didn’t think was possible.
If you’re consuming GPUs 24x7 and can afford to support them on-prem, you’re definitely not here on HN killing time. For everyone else, tune your scaling controls on your cloud provider of choice to use what you need, when you need it, and accept the reality that hyperscalers are better suited for GPU workloads - for now.
> Going on-prem like this is highly risky.
Every transaction is risky, but the risk calculus for “static” (ADDS) or “stable” (ERP, HRIS, dev/test) work makes on-prem uniquely appealing when done right. Segment out your resources (resist the urge for HPC or HCI), build sensible redundancies (on-prem or in the cloud), and lean on workhorse products over newer, fancier platforms (bulletproof hypervisors instead of fragile K8s clusters), and you can make the move successful and sensible. The more cowboy you go with GPUs, K8s, or local Terraform, the more delicate your infra becomes on-prem - and thus the riskier it is to keep there.
> Out of all the comments on numbers, SREs, and scaling, you get the response for meeting numbers with numbers!
>> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
> Literally this. I can do SRE on-prem and cloud, and my 50/30/20 budget break-even point (as in, needs and savings but no wants - so 70%) is $170k before taxes. Rent is astonishingly high right now, and the sort of mid-career professional you want to handle SRE for your single DC is going to take $150k in this market before fucking off to the first $200k job they get.
That's $120k per pod. Four pods per rack at 50kW.
What universe are we living in that a single SRE can't manage even a single rack for less than half a million in total comp?
> What universe are we living in that a single SRE can't manage even a single rack for less than half a million in total comp?
The kind where TC isn’t measured per pod managed, but per person hired. Also the world where median rent in major metros is $3500 a month.
If you think $120k is rich, you’re either operating in the boonies, outside the USA/Canada, or incredibly out of touch with the cost of living today and need to seriously go study BLS/FRED/CPI data sets to understand how expensive it is to live right now.
Indeed, there's no reason for a company to host this kind of batch compute in North America. You can get very good people in Eastern Europe at 1/3 the cost.
I like how this simple claim about being cheaper to self-host a single server has now escalated to opening an office in Eastern Europe and hiring people there to manage it.
The trend of opening offices in Europe started one year into Covid. I'm sure that there are companies that haven't opened an office there yet, but fewer than one might imagine.
And somehow I have the impression that GPUs on Slurm/PBS could not be simpler.
You can use a VM for the head node; you don't even need the clustering, really, if you can accept taking 20 minutes to restore a VM. And the rest of the hardware is homogeneous - you set up one node right and the rest are identical.
And it's a cluster with a job queue - one node going down is not the end of the world.
OK, if you have PCIe GPUs you sometimes have to re-seat them, and that's a pain. Otherwise, if your H200 or disks fail, you just replace them, under warranty or not...
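For a rough sense of how little configuration that takes, here is a minimal Slurm GPU fragment along those lines; the node names, GPU counts, and device paths are placeholders, not a tested deployment:

```
# slurm.conf (fragment) -- one head node (can be a VM) plus
# homogeneous GPU nodes; names and counts are illustrative.
GresTypes=gpu
NodeName=gpu[01-04] Gres=gpu:h100:8 CPUs=96 RealMemory=1000000 State=UNKNOWN
PartitionName=gpu Nodes=gpu[01-04] Default=YES MaxTime=INFINITE State=UP

# gres.conf on each GPU node, mapping the GRES entries to devices
Name=gpu Type=h100 File=/dev/nvidia[0-7]
```

Because every GPU node is identical, the same two files can be copied to each node, which is what makes the "set up one right and the rest are identical" approach work.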
That sounds way easier than the methods I’ve had to manage GPUs in the Enterprise on-prem thus far (PCIe cards slotted into hypervisor boxes and shared via SR-IOV). I’ll have to look into it, but I doubt it’ll ever enter my personal wheelhouse given how quickly GPU-based workloads are either moved to the cloud for effective utilization at scale, or onto custom accelerators for edge workloads/inference.