How do you prevent agents from doing dangerous things?

Permission boundaries enforced at the tool layer, not the prompt. The agent may only call tools in its grant set, each tool has explicit scope (read, write, scoped-write, escalate-to-human), and high-impact actions land in a human-in-the-loop checkpoint by default. Prompt injection cannot escalate the agent's permissions — the worst-case is the agent calls a tool it was already allowed to call, and that tool's audit log shows it.

Which agent framework do you recommend?

Architecture before framework. We design the agent topology, tool layer, and evaluation harness to be framework-agnostic, then select the framework that fits the workload. LangGraph is our default for stateful, multi-step workflows. CrewAI for role-specialized teams. The OpenAI Agents SDK for OpenAI-native deployments. We've also built custom orchestrators for cases where existing frameworks impose more abstraction tax than they save.

How do you evaluate an agent before production?

Scenario-based evaluation, curated against real workflows. The dataset includes happy paths, ambiguous requests, malformed inputs, prompt injection attempts, tool timeouts, and partial successes. We score success rate, refusal correctness, cost-per-task, and tool-call efficiency on every prompt or tool change. The eval runs in CI and gates deployment — agents that regress on the dataset don't ship.

What does 'human-in-the-loop' actually mean in practice?

Specific checkpoints where the agent pauses, surfaces context, and waits for human approval before proceeding. Each checkpoint is named in the SOW: which actions trigger it, what context is shown, what approval signals are accepted, what the timeout behavior is. Not a generic safety net — a designed pause with explicit semantics. Most production agents have 2–5 named checkpoints, depending on the impact surface.

How do you keep agent costs bounded?

Multiple layers. Model tiering across agents — cheap models route routine work, expensive models handle ambiguous cases. Context compression to reduce per-call token spend. Aggressive caching on stable sub-tasks. Hard cost ceilings per task with explicit fallback to human escalation when exceeded. Continuous cost-per-task monitoring per workflow. Most clients see 60–80% cost reduction within three months versus their initial implementation.

Can the agent execute production actions autonomously?

Sometimes — when the action is reversible, low-blast-radius, and explicitly permitted by the grant set. Examples: updating a CRM field, sending a draft email to a queue for human review, creating a ticket. We do not deploy agents that execute irreversible high-impact actions (database writes outside scoped tables, payment transfers, public-facing communications, production deploys) without a human-in-the-loop checkpoint. The blast-radius rule is in the SOW, not in the prompt.

How do you defend against prompt injection in agentic systems?

Defense in depth. Permission boundaries at the tool layer (worst case the agent does what it was allowed to do). Structured separation between system instructions and untrusted content. Output validation against allowed action schemas. Refusal patterns for instructions that don't match the agent's role. Human-in-the-loop checkpoints for high-impact actions. We assume inputs are adversarial and design the system around that assumption — not around hoping the prompt holds.

What does an agentic engagement look like?

Discovery and safety modeling: 4–6 weeks, $50K–$150K. Production build for one to three agentic workflows: 4–9 months, $400K–$1.5M, with a senior engineer plus the CORTEX department lead and a CITADEL co-pilot for compliance-sensitive workloads. Multi-quarter expansions with managed operations: $1.5M–$4M+. Budget brackets are published honestly so visitors self-qualify before the first call.

AI & Machine Learning · CORTEX

AI Agents & Automation.

Production multi-agent systems with tool use, function calling, and orchestration — built with the safety constraints, evaluation harnesses, and audit logging that turn a flashy demo into a system on-call can actually run.

Practice: AI & Machine Learning
Department: CORTEX

The problem

Most agent demos are five tools and a hope.

Building an agent that calls three APIs and answers a question well is a weekend. Building one that runs in production — gracefully refuses out-of-scope requests, never executes a destructive action without confirmation, fails closed when a tool times out, leaves an audit trail your security team can read, and keeps cost bounded as load scales — is a different category of work entirely.

Prosigns ships agents that survive that gap. We design the safety boundary before picking a framework. We build the evaluation harness against ground-truth scenarios before wiring tools. We treat tool execution as a privileged operation with explicit permission grants, audit logs, and rollback paths. The result isn't a more clever prompt — it's an agentic system the rest of the engineering organization is willing to support on-call.

Where it ships

6 use cases, in production.

Specific applications we’ve built and operated. Not speculative — every example below is grounded in a real shipped engagement.

01
−47%
ticket handle time
Customer service triage
Multi-step agents that classify, gather context from CRM and knowledge base, draft responses, and route to humans on edge cases. Refusal patterns, escalation gates, full audit trail.
02
+38%
throughput
Workflow automation
Agents that orchestrate document review, approval routing, exception handling, and data entry across enterprise systems. Human-in-the-loop on high-impact actions.
03
8x
research output
Research & analysis
Multi-agent research workflows that decompose questions, search across sources, synthesize findings, and produce cited reports — with explicit boundaries on what the agent may execute vs draft.
04
+31%
PR throughput
Code assistants & PR review
Internal agents fine-tuned on your codebase that triage issues, draft PRs, run scoped tests, and propose remediation. Read-write boundaries enforced at the tool layer, not the prompt.
05
−28%
MTTR
Operations copilots
On-call assistants that gather observability data, correlate across signals, suggest runbooks, and execute scoped remediation under SRE supervision. No autonomous production action.
06
+42%
qualified-lead volume
Sales & marketing operations
Lead enrichment, account research, outbound personalization, and pipeline triage. Bounded write access to CRM, full audit, and refusal patterns for compliance-sensitive content.

How we engage

4 phases, named in the SOW.

Each phase has a deliverable, an owner, and an acceptance criterion. Not slogans — operating rules.

01
Safety boundary first
Before a single tool is wired, we define what the agent may read, may write, and must never execute autonomously. Permission grants are explicit, scoped, and reviewable. High-impact actions land in a human-in-the-loop checkpoint by default — escalation, not exception.
02
Eval scenarios from real failures
Ground-truth evaluation set built from real workflow scenarios, including the failure modes you've seen in production: ambiguous requests, malformed inputs, prompt injection attempts, tool timeouts, partial successes. Every prompt or tool change runs against this set in CI.
03
Tool layer with audit boundaries
Tools are enterprise APIs with permissions, rate limits, audit logging, and rollback semantics — not LLM function definitions stitched together. The same tool layer your agents use is the layer your humans use. One source of truth, one access policy, one audit trail.
04
Production with on-call
We deploy on infrastructure your platform team can support. Cost-per-task metrics. Tool execution logs. Drift monitoring on the eval set. Quarterly model upgrades behind A/B gates. Hand off to your team with runbooks or stay on as Managed Services.

Capabilities

What’s in scope.

Multi-agent architectures with explicit role boundaries and orchestration
Tool use, function calling, structured output, and JSON-schema enforcement
Production agent frameworks: LangGraph, CrewAI, AutoGen, custom orchestrators
Safety constraints: permission grants, refusal patterns, human-in-the-loop checkpoints
Evaluation harnesses for agentic workflows — scenario coverage, success rate, cost
Audit logging: tool inputs, outputs, decisions, model version, user identity
Cost optimization: model tiering across agents, caching, context compression
Integration with enterprise systems: CRM, ERP, ticketing, identity, observability

Stack

Tools we use in production.

Foundation models: Anthropic ClaudeOpenAI GPTGoogle GeminiCohere CommandLlama
Agent frameworks: LangGraphCrewAIAutoGenOpenAI Agents SDKVercel AI SDK
Tool layer: Custom MCP serversFunction callingZod schemasOpenAPI integration
Evaluation: LangSmithPromptfooDeepEvalHeliconeLangfuse
Cloud LLM serving: AWS BedrockAzure OpenAIVertex AIvLLMTGI
Observability: Datadog LLMArizeOpenTelemetrySentry

Selected work

Quantified outcomes, not adjectives.

All case studies

Custom Software × Agents

Building agentic features into a product?

When the agent is part of a larger product (not a standalone capability), we co-staff with FORGE — engineers who'll own the broader application. Same agent infrastructure, scoped to the product's surface area and ownership model.

Custom Software practice

Common questions

Asked before the first call.

01
How do you prevent agents from doing dangerous things?
Permission boundaries enforced at the tool layer, not the prompt. The agent may only call tools in its grant set, each tool has explicit scope (read, write, scoped-write, escalate-to-human), and high-impact actions land in a human-in-the-loop checkpoint by default. Prompt injection cannot escalate the agent's permissions — the worst-case is the agent calls a tool it was already allowed to call, and that tool's audit log shows it.
02
Which agent framework do you recommend?
Architecture before framework. We design the agent topology, tool layer, and evaluation harness to be framework-agnostic, then select the framework that fits the workload. LangGraph is our default for stateful, multi-step workflows. CrewAI for role-specialized teams. The OpenAI Agents SDK for OpenAI-native deployments. We've also built custom orchestrators for cases where existing frameworks impose more abstraction tax than they save.
03
How do you evaluate an agent before production?
Scenario-based evaluation, curated against real workflows. The dataset includes happy paths, ambiguous requests, malformed inputs, prompt injection attempts, tool timeouts, and partial successes. We score success rate, refusal correctness, cost-per-task, and tool-call efficiency on every prompt or tool change. The eval runs in CI and gates deployment — agents that regress on the dataset don't ship.
04
What does 'human-in-the-loop' actually mean in practice?
Specific checkpoints where the agent pauses, surfaces context, and waits for human approval before proceeding. Each checkpoint is named in the SOW: which actions trigger it, what context is shown, what approval signals are accepted, what the timeout behavior is. Not a generic safety net — a designed pause with explicit semantics. Most production agents have 2–5 named checkpoints, depending on the impact surface.
05
How do you keep agent costs bounded?
Multiple layers. Model tiering across agents — cheap models route routine work, expensive models handle ambiguous cases. Context compression to reduce per-call token spend. Aggressive caching on stable sub-tasks. Hard cost ceilings per task with explicit fallback to human escalation when exceeded. Continuous cost-per-task monitoring per workflow. Most clients see 60–80% cost reduction within three months versus their initial implementation.
06
Can the agent execute production actions autonomously?
Sometimes — when the action is reversible, low-blast-radius, and explicitly permitted by the grant set. Examples: updating a CRM field, sending a draft email to a queue for human review, creating a ticket. We do not deploy agents that execute irreversible high-impact actions (database writes outside scoped tables, payment transfers, public-facing communications, production deploys) without a human-in-the-loop checkpoint. The blast-radius rule is in the SOW, not in the prompt.
07
How do you defend against prompt injection in agentic systems?
Defense in depth. Permission boundaries at the tool layer (worst case the agent does what it was allowed to do). Structured separation between system instructions and untrusted content. Output validation against allowed action schemas. Refusal patterns for instructions that don't match the agent's role. Human-in-the-loop checkpoints for high-impact actions. We assume inputs are adversarial and design the system around that assumption — not around hoping the prompt holds.
08
What does an agentic engagement look like?
Discovery and safety modeling: 4–6 weeks, $50K–$150K. Production build for one to three agentic workflows: 4–9 months, $400K–$1.5M, with a senior engineer plus the CORTEX department lead and a CITADEL co-pilot for compliance-sensitive workloads. Multi-quarter expansions with managed operations: $1.5M–$4M+. Budget brackets are published honestly so visitors self-qualify before the first call.

Within AI & Machine Learning

Other capabilities in this practice.

Back to AI & Machine Learning

Talk to us

Bring a ai agents & automation problem. We’ll bring a senior engineer.

A senior engineer plus the CORTEX department lead joins the first call. No discovery gauntlet, no junior reps.

Book a discovery call Request a proposal

What’s in scope.

Multi-agent architectures with explicit role boundaries and orchestration

Tool use, function calling, structured output, and JSON-schema enforcement

Production agent frameworks: LangGraph, CrewAI, AutoGen, custom orchestrators

Safety constraints: permission grants, refusal patterns, human-in-the-loop checkpoints

Evaluation harnesses for agentic workflows — scenario coverage, success rate, cost

Audit logging: tool inputs, outputs, decisions, model version, user identity

Cost optimization: model tiering across agents, caching, context compression

Integration with enterprise systems: CRM, ERP, ticketing, identity, observability

Tools we use in production.

Foundation models

Anthropic ClaudeOpenAI GPTGoogle GeminiCohere CommandLlama

Agent frameworks

LangGraphCrewAIAutoGenOpenAI Agents SDKVercel AI SDK

Tool layer

Custom MCP serversFunction callingZod schemasOpenAPI integration

Evaluation

LangSmithPromptfooDeepEvalHeliconeLangfuse

Cloud LLM serving

AWS BedrockAzure OpenAIVertex AIvLLMTGI

Observability

Datadog LLMArizeOpenTelemetrySentry

Most agent demos are five tools and a hope.

6 use cases, in production.

Customer service triage

Workflow automation

Research & analysis

Code assistants & PR review

Operations copilots

Sales & marketing operations

Safety boundary first

Eval scenarios from real failures

Tool layer with audit boundaries

Production with on-call

What’s in scope.

Tools we use in production.

Quantified outcomes, not adjectives.

Multi-agent KYC and onboarding orchestrator for a US digital bank.

Prior authorization automation across 6 payer integrations.

Building agentic features into a product?

How do you prevent agents from doing dangerous things?

Which agent framework do you recommend?

How do you evaluate an agent before production?

What does 'human-in-the-loop' actually mean in practice?

How do you keep agent costs bounded?

Can the agent execute production actions autonomously?

How do you defend against prompt injection in agentic systems?

What does an agentic engagement look like?

Other capabilities in this practice.

Bring a ai agents & automation problem. We’ll bring a senior engineer.

Most agent demos are five tools and a hope.

6 use cases, in production.

Customer service triage

Workflow automation

Research & analysis

Code assistants & PR review

Operations copilots

Sales & marketing operations

Safety boundary first

Eval scenarios from real failures

Tool layer with audit boundaries

Production with on-call

What’s in scope.

Tools we use in production.

Quantified outcomes, not adjectives.

Multi-agent KYC and onboarding orchestrator for a US digital bank.

Prior authorization automation across 6 payer integrations.

Building agentic features into a product?

How do you prevent agents from doing dangerous things?

Which agent framework do you recommend?

How do you evaluate an agent before production?

What does 'human-in-the-loop' actually mean in practice?

How do you keep agent costs bounded?

Can the agent execute production actions autonomously?

How do you defend against prompt injection in agentic systems?

What does an agentic engagement look like?

Other capabilities in this practice.

Bring a ai agents & automation problem. We’ll bring a senior engineer.