Hardware
H100 or H200 GPUs for serious inference. Consumer GPUs (RTX 4090) work for development. Cloud GPU instances (AWS p5, Azure ND-series, GCP A3) for variable workloads. Rough rule: one H100 serves ~20 concurrent users for 70B-class models (that's per GPU in a multi-GPU deployment; 70B weights alone run ~140 GB in fp16, more than a single H100's 80 GB unless quantized).
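The rule is throughput-bound: an interactive user only needs tokens faster than they can read, so concurrency is roughly aggregate decode throughput divided by per-user streaming speed. A minimal sketch; the 400 tok/s and 20 tok/s figures are illustrative assumptions, not benchmarks — measure your own stack.

```python
def max_concurrent_users(aggregate_tok_per_s: float,
                         per_user_tok_per_s: float) -> int:
    """Throughput-bound estimate: how many sessions can stream at once
    before each user's decode speed drops below the target."""
    return int(aggregate_tok_per_s // per_user_tok_per_s)

# Illustrative assumptions, not benchmarks:
# ~400 tok/s aggregate decode for a 70B model (per-H100 share),
# ~20 tok/s per user for a comfortable streaming read.
print(max_concurrent_users(400, 20))  # -> 20, the rough rule above
```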
Orchestration
vLLM or TGI for production inference serving. Kubernetes for scaling. Monitoring with Prometheus/Grafana. Cost tracking per team. Treat LLM serving as a first-class platform service, not a research project.
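For a feel of the serving layer, a minimal vLLM sketch using its offline Python API; the model name and tensor_parallel_size are placeholder assumptions, sized to whatever hardware you actually have.

```python
from vllm import LLM, SamplingParams

# Placeholder model and parallelism degree -- size to your hardware.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",
          tensor_parallel_size=4)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize our incident-response runbook."], sampling)
print(outputs[0].outputs[0].text)
```

In production you wouldn't embed the engine in-process like this: you'd run vLLM's OpenAI-compatible HTTP server behind a Kubernetes Deployment and point Prometheus at its metrics endpoint, which is where the monitoring and per-team cost tracking above plug in.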
Cost Math
Self-hosted 70B model: roughly $0.20-$0.40 per million tokens at scale. Proprietary API: $3-$15 per million tokens. The per-token gap favors self-hosting, but the break-even depends on your fixed costs: reserved GPUs plus staffing typically push it into the hundreds of millions of tokens per month. Below that, commercial is operationally cheaper.
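The break-even is just fixed monthly cost divided by per-token savings. A sketch with assumed inputs, not quotes: the $4,000/month fixed figure stands in for reserved GPUs plus a share of on-call time.

```python
def breakeven_tokens_m(fixed_monthly_usd: float,
                       self_hosted_per_m: float,
                       api_per_m: float) -> float:
    """Monthly volume (millions of tokens) where self-hosting breaks even:
    fixed costs divided by per-million-token savings."""
    savings = api_per_m - self_hosted_per_m  # assumes the API is pricier per token
    return fixed_monthly_usd / savings

# Assumed inputs: $4,000/mo fixed, $0.30/M self-hosted vs $9/M API.
print(round(breakeven_tokens_m(4_000, 0.30, 9.00)))  # ~460M tokens/month
```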
Skills Required
MLOps or platform-engineering expertise: model updates, scaling, incident response. Don’t embark on self-hosting unless you have the team, or the budget to hire one. Pilot on cloud GPUs first; avoid upfront hardware commitments.