Five cost surprises when you host your own LLM

Most conversations about model selection focus on capability benchmarks. Accuracy on MMLU, reasoning scores, context window size. Those matter, but they’re not what ends up in your budget conversation six months after you deploy. The cost to host LLM infrastructure is where the real surprises live, and the gap between a 7B model and a 70B model is not a linear function of parameter count.

I want to walk through five specific places where the math gets uncomfortable, because I’ve watched teams underestimate each of them. The goal isn’t to scare anyone away from self-hosting. It’s to give you numbers concrete enough to defend a budget in a room full of people who don’t know what an H100 is.

The VRAM cliff between 7B and 70B

A 7B model in 16-bit precision needs roughly 14 GB of VRAM. An A100 80GB can hold it with room to spare for KV cache, which is the memory that stores intermediate attention states during generation. You can run inference on a single GPU at maybe $3–5/hour on most GPU clouds. That’s a manageable number.

A 70B model at 16-bit needs around 140 GB. No single GPU holds that. You’re now talking about 2–4 A100 80GB cards minimum, or 2 H100s if you want headroom. H100s run $7–12/hour on specialized GPU clouds, and closer to $25–40/hour on hyperscaler managed platforms. So before you’ve thought about storage, networking, or redundancy, you’ve multiplied your compute line by somewhere between 4x and 10x.

The cliff isn’t at 70B specifically. It’s wherever you cross a GPU boundary. A 13B model fits comfortably on one A100. A 30B model doesn’t. That boundary is worth knowing before you commit to a model family.
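The boundary arithmetic above can be sketched in a few lines. This is a rough estimate, not a sizing tool: it assumes 2 bytes per parameter for 16-bit weights, and it assumes you hold back 30% of each 80 GB card for KV cache and runtime overhead. That reserve fraction and the function names are illustrative assumptions, not measured values.

```python
import math


def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (16-bit = 2 bytes per parameter)."""
    return params_billion * bytes_per_param


def gpus_needed(
    params_billion: float,
    gpu_gb: float = 80.0,
    bytes_per_param: int = 2,
    kv_reserve: float = 0.30,  # assumed share of each card kept free for KV cache
) -> int:
    """Smallest number of identical GPUs whose usable memory holds the weights."""
    usable_per_gpu = gpu_gb * (1.0 - kv_reserve)
    return math.ceil(weights_gb(params_billion, bytes_per_param) / usable_per_gpu)


if __name__ == "__main__":
    for size in (7, 13, 30, 70):
        print(f"{size}B: ~{weights_gb(size):.0f} GB weights, "
              f"{gpus_needed(size)} x 80 GB GPU(s)")
```

Under these assumptions, 7B and 13B stay on one card while 30B and 70B do not. Change `kv_reserve` to match your context lengths and batch sizes; the point is to find where your candidate model crosses a GPU boundary.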

Monthly compute at realistic concurrency levels

Benchmarks are usually run at batch size 1, which tells you almost nothing about production cost. What matters is how many simultaneous requests your system needs to serve while staying within your latency target.

For an internal tool with maybe 50 concurrent users at peak, a single A100 80GB running a 7B model can probably keep up. Call it $3/hour, 730 hours in a month: roughly $2,200/month in raw compute. Add 30% for overhead and you’re around $2,800–3,000.

Now run the same math for a 70B model serving the same 50 users. You need at least 4 A100s or 2 H100s. At $10/hour for that cluster, you’re at $7,300/month before overhead. For a customer-facing product at 500 concurrent users, you’re looking at a cluster of 8–16 H100s. At $50–80/hour, that’s $36,000–58,000/month just in GPU time.
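The monthly figures above are just hourly rate times 730 hours, with an optional overhead factor. A minimal sketch that reproduces them, using the article's rates as inputs (the scenario labels and function name are illustrative):

```python
HOURS_PER_MONTH = 730  # average hours in a month, as used in the text


def monthly_compute(hourly_rate: float, overhead: float = 0.0) -> float:
    """Monthly GPU spend for an always-on deployment at a given hourly rate."""
    return hourly_rate * HOURS_PER_MONTH * (1.0 + overhead)


if __name__ == "__main__":
    scenarios = [
        ("7B, 1x A100, raw", 3.0, 0.0),
        ("7B, 1x A100, +30% overhead", 3.0, 0.30),
        ("70B, 50 users, raw", 10.0, 0.0),
        ("70B, 500 users, low end", 50.0, 0.0),
        ("70B, 500 users, high end", 80.0, 0.0),
    ]
    for label, rate, overhead in scenarios:
        print(f"{label}: ${monthly_compute(rate, overhead):,.0f}/month")
```

Swap in your own negotiated rates. The structure matters more than the specific prices, which vary by provider and commitment term.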

Large language model pricing at the infrastructure layer scales with concurrency faster than most people expect, because adding users doesn’t just add compute linearly. You also need to maintain latency guarantees, which often means running GPUs at 60–70% utilization rather than 100%, effectively paying for headroom you’re not fully using.

Storage is cheap until it isn’t

A 7B model checkpoint is roughly 15–20 GB. A 70B checkpoint is 150–200 GB. That sounds like a storage footnote until you account for everything else sitting next to it.

You typically need: the base model weights, at least one fine-tuned version, a previous checkpoint for rollback, your vector store if you’re running RAG, and logs. For a 70B deployment with two model versions and a mid-sized vector store, you’re easily at 600–800 GB. At $0.08–0.12/GB/month for NVMe-class storage, that’s $50–100/month, which is genuinely not a big deal.

The part that surprises people is egress. If you’re loading model weights from object storage at startup, or shipping embeddings between services, or running in a multi-region setup, bandwidth charges accumulate. Global egress typically runs $0.05–0.12/GB. Loading a 200 GB model checkpoint 20 times a month (restarts, scaling events, node failures) is 4 TB of egress. That’s $200–480 just in data movement, and that number grows with model size and operational instability.
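To make the storage-versus-egress comparison concrete, here is a small sketch using the price ranges quoted above. The function names are illustrative, and it treats 1 TB as 1,000 GB to match the rounding in the text.

```python
def storage_monthly(stored_gb: float, price_per_gb_month: float) -> float:
    """Monthly cost of keeping data at rest."""
    return stored_gb * price_per_gb_month


def egress_monthly(checkpoint_gb: float, loads_per_month: int,
                   price_per_gb: float) -> float:
    """Monthly cost of repeatedly pulling a checkpoint across a billed boundary."""
    return checkpoint_gb * loads_per_month * price_per_gb


if __name__ == "__main__":
    # Storage: 600-800 GB at $0.08-0.12 per GB-month
    print(f"storage: ${storage_monthly(600, 0.08):.0f}"
          f"-${storage_monthly(800, 0.12):.0f}/month")
    # Egress: 200 GB checkpoint, 20 loads, $0.05-0.12 per GB
    print(f"egress:  ${egress_monthly(200, 20, 0.05):.0f}"
          f"-${egress_monthly(200, 20, 0.12):.0f}/month")
```

The same `egress_monthly` call works for any load count, so it is easy to see how the bill moves as restarts become more frequent. Whether a given load is billed at all depends on where the weights live relative to the GPU nodes, so check your provider's egress rules before trusting the number.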

A concrete example worth thinking through

A team running a 70B model in a cloud environment with auto-scaling and no warm instance pool can easily trigger 30–50 cold starts per month during traffic spikes. Each cold start loads the full checkpoint. At 180 GB per load, that’s 5–9 TB of egress per month from model loading alone, before any actual inference traffic. Keeping a warm pool eliminates most of that cost but adds baseline compute spend. It’s a real tradeoff, not a configuration detail.

The fine-tuning cycle nobody budgets for

Initial deployment cost is what gets into the budget. Retraining cost is what blows it.

Fine-tuning a 7B model on a few thousand examples takes a few hours on a single A100. Fine-tuning a 70B model with LoRA on the same dataset takes 12–24 hours on 4–8 GPUs. If you’re doing that monthly to incorporate new domain data or correct model behavior, the retraining line can be $500–1,000/month for the small model or $3,000–8,000/month for the large one, depending on your cloud and dataset size.

Teams that run 7B models often fine-tune more aggressively because it’s cheap enough to experiment. Teams running 70B models sometimes under-tune because the cost is prohibitive, which is a different kind of problem: you end up with a more capable base model that’s less adapted to your specific use case than a well-tuned smaller one.

The operational staff multiplier

Infrastructure costs are visible on invoices. Staff costs are not, and they’re often larger over a 12-month horizon.

A 7B deployment that’s reasonably stable might be maintained by one engineer spending 20–30% of their time on it. A 70B multi-node cluster with high availability requirements, compliance logging, and regular retraining cycles needs closer to 2–3 engineers with meaningful GPU infrastructure experience. That’s a $300,000–500,000/year difference in fully-loaded salary cost, which dwarfs the compute delta for most mid-market teams.

This is the number that’s hardest to show a CFO, but it’s the one that most often determines whether a self-hosted deployment is actually cheaper than a managed API. Aimprosoft’s team worked through this specific tradeoff in their piece on private LLM hosting costs, including how deployment environment choices (cloud vs. bare metal vs. hybrid) change the staff burden alongside the infrastructure bill.
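One way to put the staffing gap in front of a CFO is to express it as full-time equivalents times a fully-loaded salary. The sketch below assumes $180,000 per engineer per year. That figure is a hypothetical placeholder, not a number from any survey, chosen only because it lands close to the $300,000–500,000 range quoted above.

```python
LOADED_SALARY = 180_000  # assumed fully-loaded annual cost per engineer (hypothetical)


def staff_cost(fte: float, loaded_salary: float = LOADED_SALARY) -> float:
    """Annual staff cost for a given number of full-time equivalents."""
    return fte * loaded_salary


def staff_delta(small_fte: float, large_fte: float,
                loaded_salary: float = LOADED_SALARY) -> float:
    """Extra annual staff cost of the larger deployment over the smaller one."""
    return staff_cost(large_fte, loaded_salary) - staff_cost(small_fte, loaded_salary)


if __name__ == "__main__":
    # 7B: one engineer at 20-30% of their time; 70B: 2-3 engineers
    low = staff_delta(small_fte=0.3, large_fte=2.0)
    high = staff_delta(small_fte=0.2, large_fte=3.0)
    print(f"staff delta: ${low:,.0f}-${high:,.0f}/year")
```

Replace `LOADED_SALARY` with your own figure for salary plus benefits and overhead. The result is most useful placed next to the annual compute difference for the same two deployments.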

The honest summary is that 7B and 70B models aren’t competing on the same cost curve. If your use case genuinely needs 70B-class capability, the cost is defensible. If you haven’t tested whether a well-tuned 13B model gets you 90% of the way there, that test is worth running before you commit to the larger infrastructure footprint.

