AI & Machine Learning · CORTEX
Production-grade generative AI — RAG, fine-tuning, agentic systems, and LLM integration — built with the evaluation, governance, and cost controls enterprises actually ship with.
The problem
The pattern is familiar: a data scientist wires a vector store to GPT-4, the demo wins a budget review, and six months later the project is still pre-production. The reasons are predictable — no evaluation harness, retrieval that works on toy data and breaks on the real corpus, hallucinations that nobody is measuring, prompt regressions that nobody catches, costs that surprise finance, and a security team that can't approve the architecture.
We build the production system. From day one we instrument retrieval quality, ground every generated answer in citations, run evaluation against ground-truth datasets on every prompt change, and design for the cost ceilings finance actually committed to. Generative AI is a software engineering problem dressed up as a research problem; we treat it that way.
Where it ships
Specific applications we’ve built and operated. Not speculative — every example below is grounded in a real shipped engagement.
12M
documents indexed
Knowledge & internal search
RAG over your documents, wikis, tickets, contracts, and research. Hybrid retrieval (dense + sparse + filters), reranking, and citation tracking on every answer.
−47%
ticket volume
Tier-1 response automation grounded in your help center and policies. Human-in-the-loop escalation, full audit trail, refusal patterns for out-of-scope queries.
94%
extraction accuracy
Contract review, claims processing, financial filings, clinical notes. OCR + LLM extraction with structured output schemas and confidence scoring.
$4.2M
annual savings
Domain-specific copilots for clinicians, analysts, and underwriters. Cite-or-refuse defaults, jurisdiction-aware filtering, role-based redaction.
+31%
PR throughput
Internal copilots fine-tuned on your codebase, conventions, and runbooks. IDE integrations, PR review automation, and incident-response triage.
8x
asset throughput
Brand-consistent generation with style guardrails, fact-checking against approved sources, and editorial review queues. Not a content firehose — a controlled pipeline.
How we engage
Each phase has a deliverable, an owner, and an acceptance criterion. Not slogans — operating rules.
We start by pruning. Many proposed LLM use cases are better solved by classic ML, deterministic rules, or simple search. We identify the workflows where generation, summarization, or natural-language interfaces unlock value the existing stack can't — and rule out the ones where an LLM is just an expensive search bar.
Before model selection, we design the retrieval pipeline (chunking, embedding, hybrid search, reranking), the evaluation harness (ground-truth dataset, faithfulness metrics, latency targets), and the observability stack (prompt versioning, response logging, drift detection). The architecture is what carries you across model upgrades; the model itself is a swap-out.
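To make the retrieval design concrete, here is a minimal sketch of hybrid search merged with reciprocal rank fusion. The `dense_search` and `sparse_search` callables are hypothetical stand-ins for a vector store and a BM25 index; in a real pipeline, a cross-encoder reranker consumes the fused list.

```python
# A minimal sketch of hybrid retrieval, assuming two hypothetical search
# backends: `dense_search` (vector similarity) and `sparse_search` (BM25).
# Reciprocal rank fusion (RRF) merges their rankings without requiring the
# raw scores to be comparable; a cross-encoder reranker would run after.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs; k=60 is the commonly used RRF constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query: str, dense_search, sparse_search, top_k: int = 10) -> list[str]:
    fused = rrf_fuse([dense_search(query), sparse_search(query)])
    return fused[:top_k]  # candidates handed to the reranker, then the LLM

# Toy usage with canned rankings standing in for real indexes:
dense = lambda q: ["doc3", "doc1", "doc7"]
sparse = lambda q: ["doc1", "doc9", "doc3"]
print(retrieve("termination clauses in vendor contracts", dense, sparse))
```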
Deploy on Amazon Bedrock, Azure OpenAI, or Vertex AI with managed endpoints, a managed vector store, and IAM-aware data boundaries. Self-hosted alternatives (vLLM, llama.cpp, on-prem GPU) where data sovereignty requires it. Security review and compliance evidence collection happen in parallel, not after launch.
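Because managed endpoints and self-hosted vLLM both expose an OpenAI-compatible API, the integration layer is unchanged when the deployment target changes. A minimal sketch, assuming a local vLLM server and an illustrative model name:

```python
# A minimal sketch, assuming a self-hosted vLLM server is already running
# (e.g. `vllm serve meta-llama/Llama-3.1-8B-Instruct`, which exposes an
# OpenAI-compatible API on localhost:8000). The same client code works
# against a managed endpoint, which is what keeps the model a swap-out.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # point at the sovereign deployment
    api_key="unused-for-local",           # vLLM does not check it by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    messages=[{"role": "user", "content": "Summarize the attached claim note."}],
    temperature=0.0,  # deterministic output keeps evals reproducible
)
print(response.choices[0].message.content)
```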
Every prompt change runs against the eval set in CI. Production traffic feeds back to the dataset for drift detection. Quarterly model upgrades get an A/B harness with clear go/no-go gates. Cost-per-query is tracked per use case with monthly optimization sprints.
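A minimal sketch of that CI gate, with an inline two-case dataset and a stub generator standing in for the real pipeline and faithfulness judge:

```python
# A minimal sketch of the eval gate: replay the ground-truth set on every
# prompt change and fail the build below threshold. The dataset, the stub
# generator, and the substring metric are all illustrative placeholders.
import sys

THRESHOLD = 0.90  # the go/no-go line agreed with stakeholders

CASES = [  # production keeps this in a versioned ground-truth file
    {"question": "What is the notice period?", "expected": "30 days"},
    {"question": "Who approves claims over $10K?", "expected": "senior adjuster"},
]

def score(predicted: str, expected: str) -> float:
    # Placeholder metric; production pairs LLM-as-judge with ground truth.
    return 1.0 if expected.lower() in predicted.lower() else 0.0

def run_eval(generate_answer) -> float:
    hits = sum(score(generate_answer(c["question"]), c["expected"]) for c in CASES)
    return hits / len(CASES)

if __name__ == "__main__":
    stub = lambda q: "The notice period is 30 days."  # stand-in for the RAG chain
    accuracy = run_eval(stub)
    print(f"eval accuracy: {accuracy:.0%}")
    sys.exit(0 if accuracy >= THRESHOLD else 1)  # non-zero exit fails the CI job
```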
Selected work
$4.2M
annual labor savings
FHIR-aligned retrieval over 12M clinical documents. SOC 2-aligned audit logs, ePHI encryption, citation tracking on every answer, refusal patterns for out-of-scope queries.
11 months
94%
extraction accuracy
OCR + LLM extraction across 180K loan documents. Structured output with confidence scoring, human-in-the-loop review queues, and full audit lineage from source clause to extracted field.
7 months
Common questions
How do you prevent hallucinations?
Three layers. First, retrieval grounding — every generated answer cites the source documents it drew from, and outputs without grounding are blocked or flagged. Second, evaluation — we run faithfulness scoring (LLM-as-judge plus ground-truth comparison) on every prompt change in CI. Third, refusal patterns — we explicitly train and prompt for 'I don't know' as an acceptable answer in domains where confident-but-wrong is dangerous (clinical, legal, financial).
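As a minimal sketch of the first layer, a grounding gate can be as simple as refusing any answer whose citations do not all resolve to retrieved chunks (the bracketed-ID format below is an illustrative convention, not a standard):

```python
# A minimal sketch of the grounding gate, assuming answers cite retrieved
# chunks as bracketed IDs like "[chunk42]".
import re

REFUSAL = "I don't know: no supporting source was retrieved for this question."

def enforce_citations(answer: str, retrieved_ids: set[str]) -> str:
    cited = set(re.findall(r"\[(\w+)\]", answer))
    if cited and cited <= retrieved_ids:
        return answer  # every citation resolves to retrieved evidence
    return REFUSAL     # ungrounded output is blocked, not shown

print(enforce_citations("Coverage lapses after 30 days [chunk42].", {"chunk42"}))
print(enforce_citations("Coverage lapses after 30 days.", {"chunk42"}))
```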
Should we fine-tune a model or use a frontier API?
Most enterprise generative AI work is better served by frontier APIs (GPT, Claude, Gemini) with strong RAG and prompt engineering. Fine-tuning is the right call for narrow domains with consistent input/output shapes, latency-sensitive applications where you can't afford a frontier-model round trip, or when data sovereignty requires self-hosted models. We assess honestly — fine-tuning is often the wrong answer because it's expensive, brittle to base-model updates, and frequently underperforms a well-engineered RAG system.
How do you choose which models to use?
Architecture before model. We design the retrieval pipeline, evaluation harness, and integration layer to be model-agnostic, then select based on eval results on your real workload. Frontier API models lead on reasoning-heavy tasks; mid-tier models (Claude Haiku, GPT-4o-mini, Gemini Flash) handle routine work at a fraction of the cost; open-weight models (Llama, Mistral) win when data sovereignty or fine-tuning is non-negotiable. Most production systems we ship use 2–3 models in tiers.
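A minimal sketch of that tiering, using a heuristic router; production routers are usually small trained classifiers, and the tier names here are placeholders rather than model recommendations:

```python
# A minimal sketch of model tiering: route routine queries to a cheap model
# and escalate hard ones. The heuristics and tier names are illustrative.
HARD_MARKERS = ("step by step", "compare", "reconcile", "trade-off")

def pick_tier(query: str, needs_tools: bool = False) -> str:
    hard = (
        needs_tools                      # agentic/tool-use runs on the big model
        or len(query) > 500              # long, multi-part asks
        or any(m in query.lower() for m in HARD_MARKERS)
    )
    return "frontier-tier" if hard else "cheap-tier"

print(pick_tier("What's the SLA for priority 1 tickets?"))            # cheap-tier
print(pick_tier("Compare the indemnity clauses in these contracts"))  # frontier-tier
```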
How do you keep costs under control?
Cost is an architectural concern, not a procurement one. We design with model tiering (routine queries routed to cheap models, expensive models reserved for hard cases), prompt caching, semantic caching of common queries, retrieval to reduce context length, and continuous monitoring of token spend per use case. Most clients see a 60–80% cost reduction versus their initial implementation within three months, with quality improving rather than degrading.
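Of those levers, semantic caching is the easiest to sketch. A minimal version, assuming a hypothetical `embed` function and a workload-specific similarity threshold:

```python
# A minimal sketch of a semantic cache: near-duplicate queries reuse a
# stored answer instead of paying for a new generation. `embed` is a
# stand-in for a real embedding model; the threshold needs tuning.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed, self.threshold = embed, threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> str | None:
        vector = self.embed(query)
        for cached_vector, answer in self.entries:
            if cosine(vector, cached_vector) >= self.threshold:
                return answer  # hit: zero marginal token spend
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

# Toy usage; a real embed call would hit an embedding model, not string stats.
cache = SemanticCache(embed=lambda q: [float(len(q)), float(q.count(" "))])
cache.put("What is our refund policy?", "30-day refunds on all plans.")
print(cache.get("What's our refund policy?"))  # near-duplicate, likely a hit
```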
Can you deploy entirely inside our environment?
Yes. We deploy self-hosted LLMs on customer infrastructure (AWS, Azure, GCP, or bare-metal) using vLLM, TGI, or llama.cpp depending on the workload. We've shipped private-cloud deployments for healthcare, financial services, and government clients where ePHI, regulated data, or sovereignty requirements rule out third-party APIs. The tradeoff is operational — self-hosted means you own the GPU bill and the on-call rotation.
How do you defend against prompt injection?
Defense in depth. Input filtering for known injection patterns, structured prompts that separate untrusted user content from system instructions, output validation against allowed action schemas, least-privilege tool permissions for agentic systems, and human-in-the-loop checkpoints for high-impact actions. There is no single fix — prompt injection is a class of vulnerabilities, not a single bug. We design the system assuming inputs are adversarial.
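A minimal sketch of the output-validation layer, assuming the model proposes tool calls as JSON; the tool allow-list and argument schemas are illustrative:

```python
# A minimal sketch of output validation for an agentic step: the model's
# proposed action is parsed and checked against an allow-list before
# anything executes. Tool names and argument fields here are illustrative.
import json

ALLOWED_TOOLS = {"search_docs": {"query"}, "create_ticket": {"title", "body"}}

def validate_action(raw_model_output: str) -> dict:
    action = json.loads(raw_model_output)  # non-JSON output raises and is dropped
    tool, args = action.get("tool"), action.get("args", {})
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"tool not on the allow-list: {tool!r}")
    if set(args) - ALLOWED_TOOLS[tool]:
        raise ValueError("unexpected arguments; refusing to execute")
    return action  # safe to hand to the least-privilege tool runtime

print(validate_action('{"tool": "search_docs", "args": {"query": "SLA terms"}}'))
```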
Do you keep evaluating models after launch?
Yes — through our Managed Services engagement model. We maintain the evaluation dataset, run continuous faithfulness scoring on production traffic, monitor for drift, and evaluate new model releases on a quarterly cadence. When a better model ships (and they do, frequently), we run an A/B harness against the eval set and the production traffic shadow, and present go/no-go data to stakeholders. Model selection is a recurring decision, not a one-time architecture choice.
How do engagements start, and what do they cost?
A focused 8–14 week build for a single high-leverage use case is the most common starting point: $250K – $750K, a senior engineer plus the CORTEX department lead, with a production-grade system at the end (not a demo). From there, most clients expand to additional use cases or move to a Managed Services engagement for ongoing operations. We publish budget brackets honestly so prospects can self-qualify before the first call.
Talk to us
A senior engineer plus the CORTEX department lead joins the first call. No discovery gauntlet, no junior reps.