Hardware
H100 or H200 GPUs for serious inference. Consumer GPUs (RTX 4090) work for development. Cloud GPU instances (AWS p5, Azure ND-series, GCP A3) for variable workloads. Rough rule: one H100 serves ~20 concurrent users for 70B-class models (that's per GPU in a multi-GPU deployment; 70B weights alone run ~140 GB in fp16, more than a single H100's 80 GB unless quantized).
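The rule is throughput-bound: an interactive user only needs tokens faster than they can read, so concurrency is roughly aggregate decode throughput divided by per-user streaming speed. A minimal sketch; the 400 tok/s and 20 tok/s figures are illustrative assumptions, not benchmarks — measure your own stack.

```python
def max_concurrent_users(aggregate_tok_per_s: float,
                         per_user_tok_per_s: float) -> int:
    """Throughput-bound estimate: how many sessions can stream at once
    before each user's decode speed drops below the target."""
    return int(aggregate_tok_per_s // per_user_tok_per_s)

# Illustrative assumptions, not benchmarks:
# ~400 tok/s aggregate decode for a 70B model (per-H100 share),
# ~20 tok/s per user for a comfortable streaming read.
print(max_concurrent_users(400, 20))  # -> 20, the rough rule above
```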
Orchestration
vLLM or TGI for production inference serving. Kubernetes for scaling. Monitoring with Prometheus/Grafana. Cost tracking per team. Treat LLM serving as a first-class platform service, not a research project.
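For a feel of the serving layer, a minimal vLLM sketch using its offline Python API; the model name and tensor_parallel_size are placeholder assumptions, sized to whatever hardware you actually have.

```python
from vllm import LLM, SamplingParams

# Placeholder model and parallelism degree -- size to your hardware.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",
          tensor_parallel_size=4)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize our incident-response runbook."], sampling)
print(outputs[0].outputs[0].text)
```

In production you wouldn't embed the engine in-process like this: you'd run vLLM's OpenAI-compatible HTTP server behind a Kubernetes Deployment and point Prometheus at its metrics endpoint, which is where the monitoring and per-team cost tracking above plug in.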
Cost Math
Self-hosted 70B model: roughly $0.20-$0.40 per million tokens at scale. Proprietary API: $3-$15 per million tokens. The per-token gap favors self-hosting, but the break-even depends on your fixed costs: reserved GPUs plus staffing typically push it into the hundreds of millions of tokens per month. Below that, commercial is operationally cheaper.
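The break-even is just fixed monthly cost divided by per-token savings. A sketch with assumed inputs, not quotes: the $4,000/month fixed figure stands in for reserved GPUs plus a share of on-call time.

```python
def breakeven_tokens_m(fixed_monthly_usd: float,
                       self_hosted_per_m: float,
                       api_per_m: float) -> float:
    """Monthly volume (millions of tokens) where self-hosting breaks even:
    fixed costs divided by per-million-token savings."""
    savings = api_per_m - self_hosted_per_m  # assumes the API is pricier per token
    return fixed_monthly_usd / savings

# Assumed inputs: $4,000/mo fixed, $0.30/M self-hosted vs $9/M API.
print(round(breakeven_tokens_m(4_000, 0.30, 9.00)))  # ~460M tokens/month
```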
Skills Required
MLOps or platform-engineering expertise: model updates, scaling, incident response. Don’t embark on self-hosting unless you have the team, or the budget to hire one. Pilot on cloud GPUs first; avoid upfront hardware commitments.