The Multi-Tenant Agent Challenge: Serving Thousands of Customers with AI

One customer is easy. A thousand customers with isolated context, per-tenant billing, and consistent performance? That's where things get interesting.

Why Multi-Tenancy Is Hard with AI

Traditional SaaS multi-tenancy challenges:

  • Data isolation between tenants
  • Fair resource allocation
  • Per-tenant customization

AI adds new dimensions:

  • Context isolation: Tenant A's context must never leak to Tenant B
  • Cost attribution: AI calls cost money—who pays for what?
  • Quality variance: Some tenants have rich context while others have sparse data, so the experience differs between them
  • Performance unpredictability: AI latency varies based on context size and query complexity

Context Isolation

The Risk

Context leakage isn't just a bug—it's a catastrophe:

  1. Tenant A asks a question
  2. The agent retrieves context from Tenant B
  3. The response includes a competitor's proprietary information
  4. Trust destroyed, lawsuit incoming

Isolation Patterns

// Approach 1: Tenant ID in every query
const context = await vectorDB.query({
  embedding: queryEmbedding,
  filter: { tenant_id: currentTenant },  // ALWAYS filter
  limit: 10
})

// Approach 2: Separate vector stores per tenant
const vectorStore = getVectorStore(currentTenant)
const context = await vectorStore.query(queryEmbedding)

-- Approach 3: Row-level security at the database (PostgreSQL)
-- Policies only take effect once RLS is enabled on the table
ALTER TABLE embeddings ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON embeddings
  USING (tenant_id = current_setting('app.tenant_id'));
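
Approach 1 only works if every call site remembers the filter. One way to make it impossible to forget is to hand callers a tenant-scoped wrapper instead of the raw client. A minimal sketch in TypeScript; the VectorDB interface is a stand-in for whatever client you actually use:

// Stand-in for your real vector client
interface VectorDB {
  query(params: {
    embedding: number[]
    filter?: Record<string, string>
    limit: number
  }): Promise<Record<string, unknown>[]>
}

// Callers get a client that can only see their own tenant's data;
// the tenant_id filter is applied internally and cannot be omitted
function scopedVectorDB(db: VectorDB, tenantId: string) {
  return {
    query: (embedding: number[], limit = 10) =>
      db.query({ embedding, filter: { tenant_id: tenantId }, limit })
  }
}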

Verification

Defense in depth:

  • Filter at query time
  • Verify tenant ownership in the application layer
  • Keep audit logs for cross-tenant access attempts
  • Run regular security tests with synthetic "leak detection" data (see the sketch below)
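
The last bullet is worth making concrete: plant a uniquely identifiable "canary" record under one tenant and assert that no other tenant's queries ever surface it. A sketch reusing the scopedVectorDB wrapper above; embed is an assumed embedding helper, and the canary is assumed to already exist under tenant_b:

// Fails loudly if tenant_a can retrieve a record planted under tenant_b
const CANARY = "canary-7f3a-do-not-leak"

async function testTenantIsolation(
  db: VectorDB,
  embed: (text: string) => Promise<number[]>
) {
  const tenantA = scopedVectorDB(db, "tenant_a")
  const results = await tenantA.query(await embed(CANARY), 50)
  const leaked = results.some(r => JSON.stringify(r).includes(CANARY))
  if (leaked) throw new Error("Cross-tenant leak: canary surfaced for tenant_a")
}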

Cost Attribution

The Problem

AI costs vary dramatically per request:

// Simple query: $0.001
"What's our current MRR?"
→ Retrieves 3 records, 500 tokens

// Complex query: $0.15
"Analyze our customer health trends vs competitors over the last year"
→ Retrieves 500 records, 50k tokens, multiple model calls

Attribution Models

// Per-call tracking
{
  tenant_id: "tenant_123",
  request_id: "req_abc",
  costs: {
    embedding_generation: 0.0001,
    vector_search: 0.0002,
    context_retrieval: 0.001,
    llm_input_tokens: 0.003,
    llm_output_tokens: 0.006
  },
  total: 0.0103
}

// Aggregated billing
{
  tenant_id: "tenant_123",
  period: "2025-02",
  usage: {
    queries: 15420,
    tokens_in: 3_400_000,
    tokens_out: 890_000,
    storage_mb: 245
  },
  charges: {
    platform_fee: 49.00,
    ai_usage: 34.20,
    storage: 2.45,
    total: 85.65
  }
}
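
A per-call record like the one above has to be computed somewhere. A sketch of a cost meter; every unit price here is an illustrative placeholder, not any provider's actual rate:

// Illustrative unit prices; substitute your provider's real rates
const PRICE = {
  llmInputPerToken: 0.000003,
  llmOutputPerToken: 0.000015,
  vectorSearch: 0.0002
}

function recordCost(
  tenantId: string,
  requestId: string,
  usage: { tokensIn: number; tokensOut: number; searches: number }
) {
  const costs = {
    llm_input_tokens: usage.tokensIn * PRICE.llmInputPerToken,
    llm_output_tokens: usage.tokensOut * PRICE.llmOutputPerToken,
    vector_search: usage.searches * PRICE.vectorSearch
  }
  const total = Object.values(costs).reduce((sum, c) => sum + c, 0)
  // Persist to your usage ledger for end-of-month aggregation
  return { tenant_id: tenantId, request_id: requestId, costs, total }
}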

Performance at Scale

The Challenge

User expectations don't scale with context size:

  • Tenant with 100 records: 200ms response
  • Tenant with 100,000 records: Should still be ~200ms
  • But retrieval complexity grows with the size of the corpus

Scaling Patterns

// Pattern 1: Smart caching
const cacheKey = `${tenantId}:${queryHash}`
const cached = await cache.get(cacheKey)
if (cached && !(await contextChanged(tenantId))) return cached  // serve cache unless tenant context changed

// Pattern 2: Tiered retrieval
// First: Check hot cache (recent, frequent)
// Then: Query warm tier (indexed, common)
// Finally: Deep search (everything)

// Pattern 3: Async processing
// For complex queries, return immediately with job ID
{
  status: "processing",
  job_id: "job_123",
  estimated_completion: "5s",
  poll_url: "/jobs/job_123"
}
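
Pattern 2 in code form: each tier answers the same question at a different cost, and the query falls through only when a cheaper tier comes up empty. The tiers themselves are placeholders for whatever stores you actually run:

interface Tier {
  query(tenantId: string, embedding: number[]): Promise<unknown[]>
}

async function tieredRetrieve(
  tenantId: string,
  embedding: number[],
  tiers: { hot: Tier; warm: Tier; cold: Tier }
) {
  const hot = await tiers.hot.query(tenantId, embedding)    // recent, frequent
  if (hot.length > 0) return hot
  const warm = await tiers.warm.query(tenantId, embedding)  // indexed, common
  if (warm.length > 0) return warm
  return tiers.cold.query(tenantId, embedding)              // everything
}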

Tenant-Level SLAs

// Different tiers, different guarantees
{
  "starter": {
    "p95_latency": "2000ms",
    "availability": "99%",
    "support_response": "48h"
  },
  "growth": {
    "p95_latency": "1000ms",
    "availability": "99.5%",
    "support_response": "24h"
  },
  "enterprise": {
    "p95_latency": "500ms",
    "availability": "99.9%",
    "support_response": "4h",
    "dedicated_capacity": true
  }
}
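
Enforcing those tiers at runtime means turning the SLA table into concrete budgets. A hard per-request timeout is a blunt instrument for a p95 target, but it bounds the worst case. A sketch; the numbers mirror the table above:

// Per-tier latency budgets, taken from the SLA table
const TIER_BUDGETS = {
  starter: { timeoutMs: 2000 },
  growth: { timeoutMs: 1000 },
  enterprise: { timeoutMs: 500 }
} as const

async function withSLA<T>(
  tier: keyof typeof TIER_BUDGETS,
  work: Promise<T>
): Promise<T> {
  const { timeoutMs } = TIER_BUDGETS[tier]
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`${tier}: ${timeoutMs}ms budget exceeded`)), timeoutMs)
  )
  return Promise.race([work, timeout])
}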

Subaccount Architecture

When You Need It

  • Agencies managing multiple clients
  • Enterprise with multiple business units
  • Partners reselling your platform

Structure

// Hierarchy
Organization (billing entity)
├── Workspace 1 (isolated context)
│   ├── User A (permissions)
│   └── User B (permissions)
├── Workspace 2 (isolated context)
│   └── User C (permissions)
└── Shared Resources (optional cross-workspace)
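
The same hierarchy as data types, which makes the isolation boundaries explicit in the type system. Field names are illustrative:

// Billing rolls up to the organization; context isolation is per workspace
interface Organization {
  id: string
  billingAccountId: string
  workspaces: Workspace[]
  sharedResourceIds?: string[]  // optional cross-workspace assets
}

interface Workspace {
  id: string
  orgId: string
  members: { userId: string; role: Role }[]
}

type Role = "org_admin" | "workspace_admin" | "workspace_member" | "workspace_viewer"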

Permission Model

// Role-based within workspace
{
  "org_admin": ["*"],
  "workspace_admin": ["workspace:*"],
  "workspace_member": ["workspace:read", "records:*"],
  "workspace_viewer": ["workspace:read", "records:read"]
}

// Cross-workspace (carefully controlled)
{
  "shared_templates": true,
  "shared_integrations": false,
  "cross_workspace_search": false  // Usually no!
}
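
A model like this needs a wildcard-aware check, so that records:* grants records:read without listing every action. A sketch, reusing the Role type above:

// Grants copied from the role table above
const ROLE_GRANTS: Record<Role, string[]> = {
  org_admin: ["*"],
  workspace_admin: ["workspace:*"],
  workspace_member: ["workspace:read", "records:*"],
  workspace_viewer: ["workspace:read", "records:read"]
}

// "workspace:*" matches "workspace:read", "workspace:delete", and so on
function can(role: Role, permission: string): boolean {
  return ROLE_GRANTS[role].some(grant =>
    grant === "*" ||
    grant === permission ||
    (grant.endsWith(":*") && permission.startsWith(grant.slice(0, -1)))
  )
}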

Monitoring Multi-Tenant AI

// Per-tenant metrics dashboard
{
  tenant_id: "tenant_123",
  period: "last_24h",
  metrics: {
    // Usage
    queries: 1420,
    tokens_consumed: 340000,

    // Performance
    p50_latency: 180,
    p95_latency: 450,
    p99_latency: 890,

    // Quality
    successful_responses: 1389,
    failed_responses: 31,
    user_satisfaction: 0.89,

    // Cost
    ai_cost: 3.40,
    projected_monthly: 102.00
  }
}
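
Metrics earn their keep when something acts on them automatically; this is also how you automate the usage limits recommended below. A minimal sketch of a spend guard, with illustrative thresholds and tenant shape:

// Degrade service before runaway usage becomes a surprise bill
function checkBudget(tenant: { monthlyBudget: number; spendThisMonth: number }) {
  const ratio = tenant.spendThisMonth / tenant.monthlyBudget
  if (ratio >= 1.0) return "block"  // reject new AI calls until reset or upgrade
  if (ratio >= 0.8) return "alert"  // notify the tenant and your own team
  return "ok"
}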

Getting It Right

  1. Isolation first. Build tenant isolation into the core architecture, not as an afterthought.
  2. Track everything. You can't bill for what you don't measure.
  3. Test at scale. What works for 10 tenants may break at 1,000.
  4. Plan for variance. Some tenants will use 100x what others use.
  5. Automate limits. Runaway usage can bankrupt you before you notice.

Multi-tenant AI is hard. But get it right, and you have a platform that scales with the market.

Multi-Tenancy Built In

Xtended handles context isolation, subaccounts, and per-tenant billing out of the box. Focus on your product, not infrastructure.
