How regulated institutions will manage the cost of every AI interaction they put in front of a customer.
AI isn’t free at scale. That’s the very first thing every organization learns during customer-facing AI implementations. When rollouts begin, it’s clear that the cost of answering is not a constant.
Two customer-facing AI experiences with identical behavior can have unit costs that differ by a factor of 10. The customer will not know which one they are on until the institution quietly retires the more expensive one.
This is what token economics means: the compounding effect of every decision behind every customer interaction. Experiences that scale token economics own those decisions. Experiences that don’t will accumulate costs faster than customers.
What Happens Behind One Customer AI Moment?
From the customer’s perspective, an AI moment appears as a single event. A question goes in, and an answer comes out. It’s seamless.
What customers don’t see, however, is what goes on inside the system: six cost events run every time.
One customer. One AI moment. Six cost events.
What actually happens behind a single AI experience, and where the unit economics get decided
A customer opens an AI surface and asks a question. A response appears in seconds. From the customer's perspective, this is a single interaction. From the cost ledger, it is six events.
Model inference is roughly one fifth of the cost. The other four fifths are choices made by the institution building the experience.
Context, routing, output shape, trace, lineage. Every one of them gets touched on every customer interaction. Every one of them compounds at scale.
Experiences that scale token economics own these six events.
They design context once. They route deliberately. They shape output by design. They trace by default. They govern in the pipeline.
Experiences that do not own them watch each event grow unbounded as customer volume rises.
Cost shares are illustrative averages across observed implementations and vary by surface, domain, and deployment maturity.
Figure 1. The six cost events behind a single customer interaction with an AI surface. The customer sees one. The system runs six.
Three things stand out in Figure 1.
- Inference is roughly 21% of the unit cost. The model call is the visible cost, but it is the smaller portion. The other 79% sits in context, routing, output shape, trace, and lineage.
- The highest cost is the one that the organization controls the most. Context assembly averages 62% of unit cost across observed deployments. Each prompt template, retrieval strategy, and grounding choice impacts that number.
- The cost curve is not the value curve. Customers see latency and quality, not the unit cost behind them.
What Happens Next: Two Trajectories
Two AI experience trajectories. One scales. One does not.
Modeled unit cost per interaction across customer growth, based on observed scaling patterns
Unmanaged experiences get more expensive per user as they gain users. The cost curve diverges from the value curve.
Managed experiences converge. Unit cost flattens. Every new feature lands on a cost base that has already been amortized.
Curves are illustrative, based on observed scaling patterns. Specific values vary by surface, domain, and architecture.
Figure 2. Unit cost across customer growth. Unmanaged experiences rise 3.6x. Managed experiences flatten at 1.15x.
By the time both experiences hit 10M monthly active customers, the gap in unit cost is roughly threefold. That gap shapes the economics, the conversations at the board table, and the roadmaps for years afterward.
In a regulated environment, the unmanaged trajectory also drags compliance overhead, audit findings, and rework cycles. The trajectories diverge on more than cost.
The unit cost curve is an experience capability, not a financial artifact. The shape of the curve is decided by the team building the experience.
Where Does the Cost Actually Go?
Where the cost goes inside a single AI moment
Six components, ranked by share of unit cost and by your team's degree of control
The largest cost component is the one you control most. The component you control least is the one most teams spend their energy debating.
Scaling token economics means inverting the energy. High-cost, high-control components first.
Figure 3. Cost share and degree of control across the six components. The largest components are also the ones the institution controls most.
Context assembly is 62% of unit cost and 92% under the institution’s control. Model inference accounts for 21% of the cost and is 38% under the institution’s control.
The energy ratio for most teams is inverted. Scaling token economics means flipping it.
Four Strategies that Scale Token Economics
Four strategies move unit cost on a customer-facing AI surface. Each has a deep metric tied to it, whether the surface is a statement explainer, an onboarding assistant, a disclosure walkthrough, or any other customer communication that has become an AI experience.
Strategy One: Context Discipline
Context discipline is the largest leverage in any AI experience, and the first place a regulated institution should look. It begins with setting a token ceiling for each AI surface and enforcing it in the experience itself. The customer history, brand voice, regulatory guardrails, and retrieved policy do not need to be shipped fresh on every interaction. The observed unit cost reduction once a surface has a budgeted context window is 30%-55%.
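A token ceiling enforced in the experience itself can be sketched as below. This is a minimal illustration, not a reference implementation: `ContextPart`, `assemble_context`, the priority convention, and the whitespace-based `count_tokens` are all hypothetical names and stand-ins, and a real surface would count tokens with the tokenizer of the model it actually calls.

```python
from dataclasses import dataclass


@dataclass
class ContextPart:
    name: str
    text: str
    priority: int  # assumption: lower number means more important


def count_tokens(text: str) -> int:
    # Rough stand-in: counts whitespace-separated words, not real model tokens.
    return len(text.split())


def assemble_context(parts: list[ContextPart], ceiling: int) -> tuple[str, int]:
    """Add parts in priority order, skipping any part that would break the ceiling."""
    used = 0
    kept: list[str] = []
    for part in sorted(parts, key=lambda p: p.priority):
        cost = count_tokens(part.text)
        if used + cost > ceiling:
            continue  # over budget: drop this part rather than exceed the ceiling
        kept.append(part.text)
        used += cost
    return "\n\n".join(kept), used


parts = [
    ContextPart("guardrails", "Never give investment advice.", 0),
    ContextPart("policy", "Fees are charged monthly on the first business day.", 1),
    ContextPart("history", "word " * 50, 2),  # oversized customer history
]
context, used = assemble_context(parts, ceiling=20)
print(used)  # tokens actually sent, never more than the ceiling
```

The design choice here is to drop the lowest-priority material first, so guardrails survive even when the budget is tight.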
On top of that, caching shared context compounds. System prompts, retrieved knowledge, response templates, and evaluator outputs all benefit. Each needs to be computed only once and can then be reused across every customer, every session, every interaction. Observed reduction: 25%-40%. Cache hit rates above 70% become normal in mature deployments and continue to rise as the corpus stabilizes.
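The compute-once, reuse-everywhere idea can be illustrated with a small in-memory cache. This is a sketch under stated assumptions: `SharedContextCache`, its key layout (surface, artifact, version), and the example prompt are hypothetical, and a production deployment would more likely use a shared store or a provider's own caching feature.

```python
import hashlib
from typing import Callable


class SharedContextCache:
    """In-memory cache for context that is identical across customers."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key_parts: tuple[str, ...], compute: Callable[[], str]) -> str:
        # Hash the key parts so that any change (e.g. a new version) misses the cache.
        key = hashlib.sha256("|".join(key_parts).encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute()  # computed only on the first request
        return self._store[key]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = SharedContextCache()
for _session in range(10):
    prompt = cache.get_or_compute(
        ("statement_explainer", "system_prompt", "v3"),  # hypothetical key
        lambda: "You explain account statements in plain language.",
    )
print(f"hit rate: {cache.hit_rate:.0%}")
```

Tracking hits and misses on the cache itself is what makes a hit-rate target observable per surface.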
Retrieval compression closes the loop:
- Retrieve fewer chunks
- Re-rank harder
- Send the model only what changes the answer
On retrieval-heavy surfaces typical of regulated communications, where policies, disclosures, and product documentation all compete for context, the observed reduction is 15%-30%.
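The three retrieval steps can be sketched as a single filter. Everything named here is an assumption made for illustration: `overlap_score` is a crude word-overlap stand-in for a real re-ranker, and `top_k` and `min_score` are hypothetical tuning parameters, not values from observed deployments.

```python
def overlap_score(question: str, chunk: str) -> float:
    # Stand-in for a re-ranker: share of question words that appear in the chunk.
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


def compress_retrieval(question: str, chunks: list[str],
                       top_k: int = 2, min_score: float = 0.7) -> list[str]:
    """Keep at most top_k chunks, and only those that clear the score threshold."""
    scored = [(overlap_score(question, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # re-rank
    return [c for score, c in scored[:top_k] if score >= min_score]


chunks = [
    "the monthly fee is charged on the first business day",
    "a late fee applies when the payment is missed",
    "our branches are closed on public holidays",
    "the card can be frozen in the app",
]
selected = compress_retrieval("when is the monthly fee charged", chunks)
print(len(selected), "of", len(chunks), "chunks sent to the model")
```

A strict threshold means a weakly related chunk is dropped even when there is room for it under `top_k`, which is the point of sending only what changes the answer.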
Together, the three context tactics routinely halve the cost of the largest component on the chart.
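As a rough check on that claim, the sketch below combines the three observed ranges. It assumes the reductions are independent and stack multiplicatively on the same cost base, which the figures above do not state; real tactics overlap, so treat the result as an illustration rather than a forecast.

```python
def combined_reduction(reductions: list[float]) -> float:
    """Combine fractional reductions, assuming each applies to what the previous one left."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


# Low and high ends of the three ranges: budgeting, caching, retrieval compression.
low = combined_reduction([0.30, 0.25, 0.15])
high = combined_reduction([0.55, 0.40, 0.30])
print(f"combined reduction: {low:.0%} to {high:.0%}")
```

Under that assumption, even the low end of each range combines to a little more than half of the context cost.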
Strategy Two: Model Routing
Different questions deserve different models. Routing is a layer inside the experience, not a vendor selection. For example, a simple lookup goes to a small, fast model. A reasoning-heavy explanation goes to a mid-tier model. A genuinely hard case escalates to the most capable model available.
The routing policy lives inside the institution and is expressed as a function of question type, customer segment, surface, and required latency. Observed inference cost reduction at constant quality, measured against a fixed evaluation suite: 40%-70%.
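A routing policy of this kind might look like the sketch below. The tier names, the 2,000 ms threshold, the `SURFACE_MAX_TIER` cap, and the surface names are all hypothetical; the customer segment is carried on the request but not branched on in this simplified version.

```python
from dataclasses import dataclass

TIERS = ["small", "mid", "frontier"]  # ordered from cheapest to most capable

# Hypothetical per-surface cap: this surface never escalates past mid-tier.
SURFACE_MAX_TIER = {"statement_explainer": "mid"}


@dataclass(frozen=True)
class Request:
    question_type: str   # assumed values: "lookup", "explanation", "complex"
    segment: str
    surface: str
    max_latency_ms: int


def route(req: Request) -> str:
    """Pick a model tier from question type and latency, then apply the surface cap."""
    if req.question_type == "lookup":
        tier = "small"
    elif req.question_type == "complex" and req.max_latency_ms >= 2000:
        tier = "frontier"
    else:
        tier = "mid"
    cap = SURFACE_MAX_TIER.get(req.surface, "frontier")
    return TIERS[min(TIERS.index(tier), TIERS.index(cap))]


print(route(Request("lookup", "retail", "onboarding_assistant", 500)))
print(route(Request("complex", "retail", "onboarding_assistant", 3000)))
```

Keeping the policy as a plain function makes it easy to test against a fixed evaluation suite whenever a threshold changes.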
Strategy Three: Output Shape
The structure of a customer response is a design decision, not a model default. Templates and structured outputs are faster to generate, easier to validate against the underlying source, and use fewer tokens than free-form prose. They include:
- JSON schemas
- Response templates
- Constrained generation
Observed unit cost reduction against free-form baselines: 20%-35%.
In a regulated context, structured output also makes the response easier to inspect, which is the entire point of governance.
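To show why structured output is easier to inspect, here is a minimal validation sketch using only the standard library. The field names in `REQUIRED_FIELDS` and the example reply are hypothetical, and a real surface would likely validate against a full JSON Schema rather than this hand-rolled check.

```python
import json

# Hypothetical response contract for one surface.
REQUIRED_FIELDS = {"answer": str, "source_id": str, "needs_human": bool}


def validate_reply(raw: str) -> dict:
    """Parse a model reply and reject anything that does not match the contract."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on malformed JSON
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        raise ValueError("reply does not have exactly the required fields")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} has the wrong type")
    return data


reply = validate_reply(
    '{"answer": "Your fee is charged monthly.", "source_id": "policy-12", "needs_human": false}'
)
print(reply["source_id"])
```

Because every reply carries a `source_id`, a reviewer can trace the answer back to the document it was grounded in.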
Strategy Four: Substitutable Architecture
The model market resets every 9 to 14 months.
Substitutable experiences adopt each new frontier. Locked-in experiences do not.
An experience wired to a single model on day one cannot adopt the better model when it arrives. An experience built behind a substitutable layer adopts each release.
Over four cycles, the capability-per-dollar gap widens by roughly an order of magnitude. The team that loses access to the frontier ships an experience that is structurally weaker than what users now expect.
Figure 4. The model market resets every 9 to 14 months. The capability-per-dollar gap between substitutable and locked-in experiences widens 8 to 12x over four cycles.
Substitutability, the ability to replace system components with minimal cost or disruption, is the strategy that protects the cost line on horizons longer than any single planning cycle. It’s the strategy that decides whether a regulated institution can keep pace with the frontier or watch as it pulls away.
Model layer abstraction places every customer-facing surface behind a stable internal API. Models are swapped out when better ones appear. The long-term capability-per-dollar gap between an institution built for substitutability and one that wasn’t runs 5-10 times over four release cycles.
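One way to picture a stable internal API is the sketch below. The names `TextModel`, `ModelGateway`, `VendorAModel`, and `VendorBModel` are hypothetical, and the vendor classes return canned strings instead of calling any real service.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only interface customer-facing surfaces are allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class VendorAModel:
    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"  # placeholder for a real model call


class VendorBModel:
    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"  # placeholder for a real model call


class ModelGateway:
    """Stable entry point; the model behind it can change without touching callers."""

    def __init__(self, model: TextModel) -> None:
        self._model = model

    def swap(self, model: TextModel) -> None:
        self._model = model

    def answer(self, question: str) -> str:
        return self._model.complete(question)


gateway = ModelGateway(VendorAModel())
before = gateway.answer("What is my balance?")
gateway.swap(VendorBModel())
after = gateway.answer("What is my balance?")
print(before, "|", after)
```

The calling code is identical before and after the swap, which is the property the abstraction is meant to protect.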
Eval-driven swaps make the property operational. Every model candidate runs against the institution’s evaluation suite before it ships into a customer surface. Swap decisions become data-driven, with regression tests against accuracy, tone, compliance markers, and unit cost.
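A swap gate along these lines could be as simple as the following sketch. The `EvalResult` fields mirror the four dimensions named above, but the scores, the `tolerance` parameter, and the pass rule are assumptions made for illustration.

```python
from dataclasses import dataclass

QUALITY_METRICS = ("accuracy", "tone", "compliance")


@dataclass(frozen=True)
class EvalResult:
    accuracy: float
    tone: float
    compliance: float
    unit_cost: float  # cost per interaction, in dollars


def approve_swap(incumbent: EvalResult, candidate: EvalResult,
                 tolerance: float = 0.01) -> bool:
    """Approve only if no quality metric regresses beyond tolerance and cost does not rise."""
    quality_ok = all(
        getattr(candidate, m) >= getattr(incumbent, m) - tolerance
        for m in QUALITY_METRICS
    )
    cost_ok = candidate.unit_cost <= incumbent.unit_cost
    return quality_ok and cost_ok


incumbent = EvalResult(accuracy=0.92, tone=0.90, compliance=0.99, unit_cost=0.020)
cheaper = EvalResult(accuracy=0.94, tone=0.90, compliance=0.99, unit_cost=0.011)
risky = EvalResult(accuracy=0.95, tone=0.92, compliance=0.90, unit_cost=0.008)

print(approve_swap(incumbent, cheaper), approve_swap(incumbent, risky))
```

Note that the cheaper, more accurate candidate is still rejected when its compliance score regresses, which reflects the priorities of a regulated surface.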
Prompt portability finishes the work. Prompts written to work across model families, not to one model’s quirks, mean time-to-swap drops from quarters to weeks. The institution stops being a hostage to its incumbent vendor’s release schedule.
What This Looks Like for the Customer
Same customer. Same intent. Different experience.
What changes step by step when token economics is enabled
Figure 5. Same customer, same five steps, two experiences. The managed one runs at ~1/12th the cost and ~17x the speed.
Unmanaged: 3.4 second cold start, 8.2k tokens on the first question, $0.082 for the frontier model call, 1240 tokens of free-form reply, cost roughly doubles on the follow-up.
Managed: 0.2 second cache hit, 1.6k tokens because context is budgeted and reused, $0.011 routed to mid-tier, 320 tokens of structured reply, follow-up mostly cached.
Same customer. Same intent. The customer notices the latency. The institution sees the rest.
What This Enables for the Future of Customer Communications
Token economics is not the goal. It is the precondition.
The institutions that scale token economics get to do things their peers cannot:
- They add new AI surfaces without renegotiating the budget every cycle.
- They keep pace with each new model release.
- They ship customer experiences that respond in 200 milliseconds (instead of three seconds) with the same regulatory posture they had before.
- They put AI in front of every customer they serve, not just the ones who happen to land on the most-funded surfaces.
Four Strategies. One Outcome.
Token economics is not a budget conversation. It’s the operating discipline that decides which AI experiences a regulated institution can put in front of its customers, and which ones it cannot.
- Context discipline
- Model routing
- Output shape
- Substitutable architecture
Each of the four strategies moves unit cost by a measurable amount. Together, they decide whether an AI experience scales token economics or stalls.
Across observed deployments, regulated institutions that have invested in all four sit at 8-15 times lower unit cost than those that have invested in none. The gap shows up in which customer experiences survive the next planning cycle, and in how many new ones the institution can stand up beside them.
That’s what enabling AI experiences actually looks like.