−47%
ticket handle time
Customer service triage
Multi-step agents that classify, gather context from CRM and knowledge base, draft responses, and route to humans on edge cases. Refusal patterns, escalation gates, full audit trail.
AI & Machine Learning · CORTEX
Production multi-agent systems with tool use, function calling, and orchestration — built with the safety constraints, evaluation harnesses, and audit logging that turn a flashy demo into a system on-call can actually run.
The problem
Building an agent that calls three APIs and answers a question well is a weekend. Building one that runs in production — gracefully refuses out-of-scope requests, never executes a destructive action without confirmation, fails closed when a tool times out, leaves an audit trail your security team can read, and keeps cost bounded as load scales — is a different category of work entirely.
Prosigns ships agents that survive that gap. We design the safety boundary before picking a framework. We build the evaluation harness against ground-truth scenarios before wiring tools. We treat tool execution as a privileged operation with explicit permission grants, audit logs, and rollback paths. The result isn't a more clever prompt — it's an agentic system the rest of the engineering organization is willing to support on-call.
Where it ships
Specific applications we’ve built and operated. Not speculative — every example below is grounded in a real shipped engagement.
−47%
ticket handle time
Multi-step agents that classify, gather context from CRM and knowledge base, draft responses, and route to humans on edge cases. Refusal patterns, escalation gates, full audit trail.
+38%
throughput
Agents that orchestrate document review, approval routing, exception handling, and data entry across enterprise systems. Human-in-the-loop on high-impact actions.
8x
research output
Multi-agent research workflows that decompose questions, search across sources, synthesize findings, and produce cited reports — with explicit boundaries on what the agent may execute vs draft.
+31%
PR throughput
Internal agents fine-tuned on your codebase that triage issues, draft PRs, run scoped tests, and propose remediation. Read-write boundaries enforced at the tool layer, not the prompt.
−28%
MTTR
On-call assistants that gather observability data, correlate across signals, suggest runbooks, and execute scoped remediation under SRE supervision. No autonomous production action.
+42%
qualified-lead volume
Lead enrichment, account research, outbound personalization, and pipeline triage. Bounded write access to CRM, full audit, and refusal patterns for compliance-sensitive content.
How we engage
Each phase has a deliverable, an owner, and an acceptance criterion. Not slogans — operating rules.
Before a single tool is wired, we define what the agent may read, may write, and must never execute autonomously. Permission grants are explicit, scoped, and reviewable. High-impact actions land in a human-in-the-loop checkpoint by default — escalation, not exception.
Ground-truth evaluation set built from real workflow scenarios, including the failure modes you've seen in production: ambiguous requests, malformed inputs, prompt injection attempts, tool timeouts, partial successes. Every prompt or tool change runs against this set in CI.
Tools are enterprise APIs with permissions, rate limits, audit logging, and rollback semantics — not LLM function definitions stitched together. The same tool layer your agents use is the layer your humans use. One source of truth, one access policy, one audit trail.
We deploy on infrastructure your platform team can support. Cost-per-task metrics. Tool execution logs. Drift monitoring on the eval set. Quarterly model upgrades behind A/B gates. Hand off to your team with runbooks or stay on as Managed Services.
Capabilities
Stack
Selected work
−47%
onboarding timeFive-agent workflow: identity verification, document classification, sanctions screening, risk scoring, and orchestration. Human-in-the-loop at four explicit checkpoints. Full audit trail meets BSA / AML examination requirements.
8 months
−42%
auth turnaroundAgentic workflow that reads payer policy, extracts clinical evidence from the chart, drafts authorization narratives, and routes to clinician sign-off. No autonomous submission. SOC 2-aligned audit logs on every action.
7 months
Common questions
Permission boundaries enforced at the tool layer, not the prompt. The agent may only call tools in its grant set, each tool has explicit scope (read, write, scoped-write, escalate-to-human), and high-impact actions land in a human-in-the-loop checkpoint by default. Prompt injection cannot escalate the agent's permissions — the worst-case is the agent calls a tool it was already allowed to call, and that tool's audit log shows it.
Architecture before framework. We design the agent topology, tool layer, and evaluation harness to be framework-agnostic, then select the framework that fits the workload. LangGraph is our default for stateful, multi-step workflows. CrewAI for role-specialized teams. The OpenAI Agents SDK for OpenAI-native deployments. We've also built custom orchestrators for cases where existing frameworks impose more abstraction tax than they save.
Scenario-based evaluation, curated against real workflows. The dataset includes happy paths, ambiguous requests, malformed inputs, prompt injection attempts, tool timeouts, and partial successes. We score success rate, refusal correctness, cost-per-task, and tool-call efficiency on every prompt or tool change. The eval runs in CI and gates deployment — agents that regress on the dataset don't ship.
Specific checkpoints where the agent pauses, surfaces context, and waits for human approval before proceeding. Each checkpoint is named in the SOW: which actions trigger it, what context is shown, what approval signals are accepted, what the timeout behavior is. Not a generic safety net — a designed pause with explicit semantics. Most production agents have 2–5 named checkpoints, depending on the impact surface.
Multiple layers. Model tiering across agents — cheap models route routine work, expensive models handle ambiguous cases. Context compression to reduce per-call token spend. Aggressive caching on stable sub-tasks. Hard cost ceilings per task with explicit fallback to human escalation when exceeded. Continuous cost-per-task monitoring per workflow. Most clients see 60–80% cost reduction within three months versus their initial implementation.
Sometimes — when the action is reversible, low-blast-radius, and explicitly permitted by the grant set. Examples: updating a CRM field, sending a draft email to a queue for human review, creating a ticket. We do not deploy agents that execute irreversible high-impact actions (database writes outside scoped tables, payment transfers, public-facing communications, production deploys) without a human-in-the-loop checkpoint. The blast-radius rule is in the SOW, not in the prompt.
Defense in depth. Permission boundaries at the tool layer (worst case the agent does what it was allowed to do). Structured separation between system instructions and untrusted content. Output validation against allowed action schemas. Refusal patterns for instructions that don't match the agent's role. Human-in-the-loop checkpoints for high-impact actions. We assume inputs are adversarial and design the system around that assumption — not around hoping the prompt holds.
Discovery and safety modeling: 4–6 weeks, $50K–$150K. Production build for one to three agentic workflows: 4–9 months, $400K–$1.5M, with a senior engineer plus the CORTEX department lead and a CITADEL co-pilot for compliance-sensitive workloads. Multi-quarter expansions with managed operations: $1.5M–$4M+. Budget brackets are published honestly so visitors self-qualify before the first call.
Within AI & Machine Learning
Talk to us
A senior engineer plus the CORTEX department lead joins the first call. No discovery gauntlet, no junior reps.