Wizard · Free
Pick the right LLM tier for your workload — frontier API, mid-tier, or self-hosted — based on data sensitivity, task complexity, volume, latency, fine-tuning needs, and operational appetite. Output: a recommendation with model candidates and explicit tradeoffs.
How it works
We don’t gate the tool behind a form. Take the assessment; share your email at the end if you want a written report.
Six questions cover data sensitivity, task complexity, workload volume, latency requirements, fine-tuning needs, and operational appetite. Each answer option weights the three tiers.
The result is a three-tier breakdown: frontier API / mid-tier / self-hosted. The headline recommendation is the highest-scoring tier; the breakdown shows how close the alternatives are.
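A minimal sketch of that scoring shape in Python, assuming additive per-answer weights (the question keys, answer options, and weight values here are illustrative, not the wizard's real calibration):

```python
from collections import Counter

TIERS = ("frontier_api", "mid_tier", "self_hosted")

# Each (question, answer) pair carries a weight per tier; e.g. an answer of
# "data must never leave our infrastructure" pushes hard toward self-hosted.
WEIGHTS = {
    ("data_sensitivity", "sovereign"):    {"self_hosted": 3},
    ("data_sensitivity", "standard"):     {"frontier_api": 2, "mid_tier": 2},
    ("task_complexity", "frontier_hard"): {"frontier_api": 3, "self_hosted": 1},
    ("task_complexity", "routine"):       {"mid_tier": 3, "self_hosted": 1},
    # ...the remaining four questions follow the same shape
}

def score(answers: dict[str, str]) -> tuple[str, Counter]:
    totals = Counter({tier: 0 for tier in TIERS})
    for question, option in answers.items():
        totals.update(WEIGHTS[(question, option)])
    headline = max(totals, key=totals.get)  # highest-scoring tier wins
    return headline, totals                 # totals show how close the others are

headline, breakdown = score({"data_sensitivity": "sovereign",
                             "task_complexity": "routine"})
print(headline, dict(breakdown))
# -> self_hosted {'frontier_api': 0, 'mid_tier': 3, 'self_hosted': 4}
```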
Each tier has specific model candidates we'd evaluate in real engagements (Claude / GPT / Gemini frontier, Haiku / mini / Flash mid-tier, Llama / Mistral self-hosted). The tradeoffs flag what each tier doesn't solve.
Real model selection lands after running candidates against your actual workload on a curated eval dataset. The wizard gives you the directional starting point; we'll do the calibrated evaluation.
Common questions
What happens when the models change?
Models change every quarter. Tier-level recommendations survive model upgrades; specific model recommendations don't. Within each tier we name candidates we'd evaluate — but the choice between Claude Haiku and GPT mini lands after running them against your real workload, not from a wizard.
What if more than one tier fits?
Common — and usually the right answer is model tiering: use mid-tier for routine work and frontier for the hard cases, with explicit routing logic. Most production AI deployments we ship use two or three tiers in tandem; the wizard's strongest signal is the tier you should anchor on.
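A stripped-down version of that routing pattern, assuming a cheap-first, escalate-on-doubt policy (the function names, stub bodies, and confidence threshold are placeholders, not a real SDK):

```python
CONFIDENCE_FLOOR = 0.7  # tune against your eval set, not by gut feel

def call_mid_tier(prompt: str) -> tuple[str, float]:
    """Placeholder for the mid-tier model: returns (answer, confidence 0..1)."""
    return "mid-tier answer", 0.9  # swap in a real API call

def call_frontier(prompt: str) -> str:
    """Placeholder for the frontier model."""
    return "frontier answer"  # swap in a real API call

def route(prompt: str, known_hard: bool = False) -> str:
    """Send routine work to the cheap tier; escalate hard or uncertain cases."""
    if known_hard:  # e.g. flagged upstream by a classifier or business rules
        return call_frontier(prompt)
    answer, confidence = call_mid_tier(prompt)
    if confidence < CONFIDENCE_FLOOR:  # escalate uncertain answers
        return call_frontier(prompt)
    return answer
```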
Is self-hosting always cheaper at scale?
No. Self-hosted economics start beating frontier-API pricing around 100M tokens/month and mid-tier API pricing around 1B tokens/month — assuming you have an existing GPU platform team. If you don't, the operational cost (engineering, monitoring, on-call) typically erases the per-token savings until volume gets very high.
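The arithmetic behind those thresholds is a fixed-vs-marginal comparison. A sketch with placeholder prices (the thresholds in the answer come from engagement data, not from these numbers):

```python
def break_even_tokens(fixed_usd_per_month: float,
                      api_usd_per_m_tokens: float,
                      self_hosted_usd_per_m_tokens: float) -> float:
    """Monthly token volume at which self-hosting (fixed cost plus a small
    marginal per-token cost) matches a purely usage-priced API tier."""
    saving_per_m = api_usd_per_m_tokens - self_hosted_usd_per_m_tokens
    return fixed_usd_per_month / saving_per_m * 1e6

# Placeholder inputs: a small GPU reservation at $1,500/mo (the platform team
# already exists, per the answer above), frontier at $15 per 1M tokens,
# self-hosted marginal cost at $0.50 per 1M tokens.
tokens = break_even_tokens(1_500, 15.0, 0.50)
print(f"break-even ~ {tokens / 1e6:,.0f}M tokens/month")  # ~ 103M
```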
Do we need fine-tuning?
Most enterprise AI doesn't benefit from fine-tuning. Frontier API + strong RAG covers ~85% of enterprise workloads. Fine-tuning is the right answer for narrow tasks with a consistent input/output shape, latency-sensitive deployments where a small fine-tuned model beats frontier latency, or sovereignty cases where self-hosted is mandatory and fine-tuning compensates for smaller base capability.
What about HIPAA and data sovereignty?
The BAA/DPA chain matters. AWS Bedrock under HIPAA-eligible accounts and Azure OpenAI under a BAA are production options for ePHI workloads. For data-sovereignty mandates (regulated jurisdictions, classified workloads, air-gapped environments), self-hosted is typically the only path. The wizard accounts for these constraints; bring your specific requirements to a discovery call for verification.
What does the calibrated evaluation involve?
Curated eval datasets (built with your domain experts), faithfulness and refusal scoring, latency and cost-per-task tracking, and equity-aware subgroup evaluation where applicable. We run candidates head-to-head on your real workload before recommending a final choice — not against generic benchmarks. Most engagements end with a model decision that surprised at least one stakeholder.
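In code terms, the head-to-head has roughly this shape (a sketch, not our production harness; `candidates` maps each model name to a callable and `faithfulness` stands in for the real judge):

```python
import time

def run_eval(candidates: dict, dataset: list[dict], faithfulness) -> dict:
    """Compare candidate models head-to-head on a curated dataset.
    Each dataset item is {"input": ..., "reference": ...}; each candidate
    is a callable returning (answer_text, cost_in_usd)."""
    results = {}
    for name, model in candidates.items():
        scores, latencies, costs = [], [], []
        for item in dataset:
            start = time.perf_counter()
            answer, usd = model(item["input"])
            latencies.append(time.perf_counter() - start)
            costs.append(usd)
            scores.append(faithfulness(answer, item["reference"]))
        n = len(dataset)
        results[name] = {
            "mean_faithfulness": sum(scores) / n,
            "p50_latency_s": sorted(latencies)[n // 2],
            "cost_per_task_usd": sum(costs) / n,
        }
    return results
```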
More tools
15-question scored report across data, infrastructure, talent, governance, and use cases.
Project the financial return of a proposed AI use case — labor savings, error reduction, payback period.
Score a legacy system for replatform vs rearchitect vs replace, with phased rollout recommendations.
Talk to us
Bring your scored report to a 30-minute call. Senior engineer plus department lead. No discovery gauntlet, no junior reps.