// Cascade

Multi-provider AI middleware.

A self-hosted REST middleware that routes AI prompts across multiple free providers — Gemini, Groq, Mistral, and Cerebras. Deploy once, call from anywhere.

Node.js · Express · Self-hosted · Multiple AI Endpoints · v1.1.0
🔀 Multi-Provider Cascade: Route prompts across Gemini, Groq, Mistral, and Cerebras automatically.
🧠 Difficulty Scaling: A 0.0–1.0 float maps your prompt to the right starting model.
⬆️ Complexity Escalation: Too hard for this tier? Cascade climbs toward more capable models.
⬇️ Failure Fallback: On errors or rate limits, Cascade falls back toward cheaper models.
🔑 Key Rotation: Multiple API keys per provider, rotated automatically on 429s and auth errors.
🛡️ Rate Limiting: Per-IP rate limiting on /ask-ai via express-rate-limit.
🔒 Origin Whitelist: Restrict which domains can call your middleware.
Full Sweep Guarantee: Every model is tried exactly once before returning an error.

How it Works

Cascade maps a difficulty float (0.0–1.0) to a starting position in the MODELS array, then sweeps through models until one succeeds. A visited set guarantees every model is tried exactly once.

1. Difficulty → Index: The difficulty float is mapped to the nearest index in the MODELS array. 0.0 starts at the cheapest/fastest model; 1.0 starts at the most capable.
2. Call the model: Cascade calls the provider at that index with your prompt. Each provider is instructed to return a strict JSON response.
3. too_complex → climb: If the model returns state: "too_complex", Cascade searches upward only, toward more capable models, and tries the next unvisited one.
4. Error → fall back: On a network error, 429, or 503, Cascade searches downward first toward cheaper models, then upward if nothing is left below. Rate limits trigger a 1.5s cooldown.
5. Full sweep guarantee: A visited set tracks every attempted model. Cascade guarantees every model is tried exactly once before returning an error; no model is skipped or retried.
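The error-fallback path of this sweep can be sketched as a small function. This is a hypothetical illustration, not the actual cascade.js code, and it models only the failure path (a too_complex response would instead search upward only):

```javascript
// Hypothetical sketch of the failure-fallback sweep described above.
// From the starting index, prefer cheaper models below it first,
// then climb upward once nothing is left below.
function sweepOrder(difficulty, modelCount) {
  const start = Math.round(difficulty * (modelCount - 1));
  const order = [start];
  let down = start - 1;
  let up = start + 1;
  while (order.length < modelCount) {
    if (down >= 0) order.push(down--); // fall back toward cheaper models first
    else order.push(up++);             // then climb toward more capable ones
  }
  return order; // every index appears exactly once: the full sweep guarantee
}

console.log(sweepOrder(0.5, 5)); // [ 2, 1, 0, 3, 4 ]
```

Because the two cursors only move outward from the start index, each model is visited exactly once before the list is exhausted.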
Key rotation: Each provider supports multiple API keys separated by |. On a 429 or auth error, Cascade rotates to the next key before marking the model as failed.

Quick Start

1. Fork and clone

Fork the repo on GitHub so you have your own copy to modify and deploy, then clone your fork:

git clone https://github.com/YOUR_USERNAME/cascade
cd cascade
npm install

2. Set up environment variables

Copy .env.example to .env and fill in your API keys:

cp .env.example .env

# .env
MY_APP_SECRET=your-secret-here
GEMINI_KEY=your-gemini-key
GROQ_KEY=your-groq-key
MISTRAL_KEY=your-mistral-key
CEREBRAS_KEY=your-cerebras-key

3. Run locally

npm start
# Middleware listening on port 10000 at 0.0.0.0

4. Deploy to Render (free)

Push your fork to GitHub and create a new Web Service on Render pointing at your fork. Set the start command to npm start. Add your environment variables in the Render dashboard. Your endpoint will be at https://your-service.onrender.com.

Cold starts: Render's free tier spins down after 15 minutes of inactivity. The first request after idle may take 30–60 seconds. Call /wake to warm the server before your first prompt.

5. Make your first request

const res = await fetch("https://your-service.onrender.com/ask-ai", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    secret: "your-secret-here",
    difficulty: 0.5,
    prompt: "Return a JSON object with key 'status' and value 'ok'."
  })
});

const data = await res.json();
console.log(data.package);          // { status: "ok" }
console.log(data.answeredBy.model); // "llama-3.3-70b-versatile"

Endpoints

GET /wake

Health ping. Returns 200 "Full Sweep Online". Use this to warm the server after a cold start before sending your first prompt.

GET /health

Returns server status, uptime, model count, key pool sizes, rate limit config, and provider timeout.

{
  "status": "ok",
  "uptime": "0h 4m 21s",
  "models": 10,
  "keyPools": { "GEMINI": 1, "GROQ": 1, "MISTRAL": 1, "CEREBRAS": 1 },
  "rateLimit": { "max": 30, "windowMs": 60000 },
  "providerTimeoutMs": 15000
}
POST /ask-ai

Send a prompt through the cascade. Returns the first successful model response.

Request body

secret (string): Must match the MY_APP_SECRET env var.
difficulty (float 0.0–1.0): Starting position in the model array. 0.0 = cheapest, 1.0 = most capable.
prompt (string): Your prompt. Models are instructed to return a JSON object with state and package fields.

Response

{
  "state": "complete",
  "package": { /* your data */ },
  "answeredBy": {
    "index": 1,
    "provider": "groq",
    "model": "llama-3.3-70b-versatile"
  }
}

Error responses

403: Invalid or missing secret.
429: Rate limit exceeded for your IP.
503: All models exhausted; no provider responded successfully.

Model Array

MODELS is a plain array ordered from least to most capable. Difficulty 0.0 maps to index 0, difficulty 1.0 maps to the last entry. Add, remove, or reorder entries freely — the cascade adapts automatically. If a provider has no API key configured, it is skipped at runtime and treated as a failure.

const MODELS = [
  { provider: 'cerebras', model: 'llama3.1-8b' },             // 0.00 — fastest, lightest
  { provider: 'groq',     model: 'llama-3.3-70b-versatile' }, // 0.11 — fast, reliable
  { provider: 'cerebras', model: 'llama-3.3-70b' },           // 0.22
  { provider: 'gemini',   model: 'gemini-2.0-flash' },        // 0.33
  { provider: 'groq',     model: 'llama-3.3-70b-specdec' },   // 0.44
  { provider: 'mistral',  model: 'mistral-small-latest' },    // 0.56
  { provider: 'groq',     model: 'llama-3.3-70b-versatile' }, // 0.67
  { provider: 'mistral',  model: 'mistral-large-latest' },    // 0.78
  { provider: 'gemini',   model: 'gemini-2.5-flash' },        // 0.89
  { provider: 'gemini',   model: 'gemini-2.5-pro' },          // 1.00 — most capable
];
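The skip-on-missing-key behavior noted above can be approximated like this (an assumed helper for illustration; the real check lives in cascade.js):

```javascript
// Hypothetical helper mirroring the runtime skip described above:
// a model whose provider has no key configured is treated as a failure.
function hasKey(provider, env = process.env) {
  const value = env[provider.toUpperCase() + '_KEY'];
  return typeof value === 'string' && value.length > 0;
}

// Which entries of a MODELS array could actually be called:
const usable = (models, env) => models.filter(m => hasKey(m.provider, env));
```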

Supported providers

gemini: GEMINI_KEY (Google Generative Language API)
groq: GROQ_KEY (OpenAI-compatible)
mistral: MISTRAL_KEY (OpenAI-compatible)
cerebras: CEREBRAS_KEY (OpenAI-compatible)

Difficulty

The difficulty float maps to a starting index in the MODELS array using:

index = Math.round(difficulty * (MODELS.length - 1))

The mapping is proportional regardless of array size — a 5-model array and a 20-model array both treat 0.5 as "start in the middle."
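A quick check of the proportionality claim, using the same formula wrapped in a function:

```javascript
// The difficulty-to-index mapping from above, as a standalone function.
const toIndex = (difficulty, modelCount) =>
  Math.round(difficulty * (modelCount - 1));

toIndex(0.5, 5);   // 2: middle of a 5-model array
toIndex(0.5, 21);  // 10: middle of a 21-model array
toIndex(0.33, 10); // 3: matches the 0.33 annotation in the MODELS array above
```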

0.0–0.2: Simple tasks (classification, short answers, data formatting)
0.3–0.5: Medium tasks (summaries, Q&A, moderate reasoning)
0.6–0.8: Complex tasks (code generation, long-form content, analysis)
0.9–1.0: Hard tasks (deep reasoning, multi-step problems, nuanced generation)

Key Rotation

Each provider supports multiple API keys. Separate them with | in the env var. Cascade loads them into a pool at startup and tries each one in order if the previous key hits a rate limit (429) or auth error (401/403).

# .env
GEMINI_KEY=keyA|keyB|keyC
GROQ_KEY=keyOne|keyTwo

On startup, the server logs how many keys were loaded per provider:

[Keys] GEMINI: 3 key(s) loaded
[Keys] GROQ: 2 key(s) loaded
[Keys] MISTRAL: 1 key(s) loaded
[Keys] CEREBRAS: 1 key(s) loaded
Separator note: Use | as the separator, not - or ,. API keys from the supported providers do not contain the pipe character, so splitting on it is unambiguous.
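The pool behavior can be sketched as follows (a hypothetical helper; the real rotation logic lives in cascade.js):

```javascript
// Hypothetical sketch of the key pool described above: keys are split
// on "|" at startup and advanced on a 429 or auth error.
function makeKeyPool(envValue) {
  const keys = (envValue || '').split('|').filter(Boolean);
  let i = 0;
  return {
    size: keys.length,
    current: () => keys[i],
    // Advance to the next key; false means every key has been tried.
    rotate: () => ++i < keys.length,
  };
}

const pool = makeKeyPool('keyA|keyB|keyC');
pool.current(); // 'keyA'
pool.rotate();  // true; pool.current() is now 'keyB'
```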

Origin Whitelist

The ALLOWED_ORIGINS array at the top of cascade.js controls which origins can call /ask-ai. Browsers attach the Origin header automatically and page scripts cannot forge it, so the whitelist reliably blocks unwanted websites; non-browser clients, however, can send any Origin they like, which is why the secret field matters too.

// Allow everyone (dev / testing)
const ALLOWED_ORIGINS = [];

// Restrict to your domain
const ALLOWED_ORIGINS = [
  'https://yourdomain.com',
  'https://yourgame.itch.io'
];
Note: An empty array disables origin checking entirely — useful during local development. The secret field still provides a second layer of protection.
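The check itself can be sketched like this (assumed logic; see cascade.js for the actual middleware):

```javascript
// Hypothetical sketch of the whitelist check described above.
// An empty list allows every origin; otherwise the request's
// Origin header must match an entry exactly.
function originAllowed(origin, allowedOrigins) {
  if (allowedOrigins.length === 0) return true; // dev mode: check disabled
  return allowedOrigins.includes(origin);
}

originAllowed('https://yourdomain.com', ['https://yourdomain.com']); // true
originAllowed('https://evil.example', ['https://yourdomain.com']);   // false
```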

Rate Limiting

Cascade uses express-rate-limit to limit requests per IP on the /ask-ai endpoint. Defaults to 30 requests per minute. Configurable via env vars.

RATE_LIMIT_MAX (default 30): Max requests per window per IP
RATE_LIMIT_WINDOW_MS (default 60000): Window size in milliseconds

When the limit is exceeded, the server returns:

429 { "state": "error", "content": "Too many requests, please slow down." }
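A toy fixed-window counter shows the behavior (express-rate-limit does this per IP with more care; this sketch is only illustrative):

```javascript
// Illustrative fixed-window limiter, not the express-rate-limit internals.
// makeLimiter(30, 60000) allows 30 calls per 60-second window.
function makeLimiter(max, windowMs, now = Date.now) {
  let windowStart = now();
  let count = 0;
  return function allow() {
    if (now() - windowStart >= windowMs) { // window elapsed: start fresh
      windowStart = now();
      count = 0;
    }
    return ++count <= max;
  };
}

const allow = makeLimiter(30, 60000);
// The first 30 calls return true; the 31st within the window returns false.
```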

Environment Variables

MY_APP_SECRET (required): Secret key required on every /ask-ai request
GEMINI_KEY (required if using Gemini): Google AI key(s), pipe-separated
GROQ_KEY (required if using Groq): Groq key(s), pipe-separated
MISTRAL_KEY (required if using Mistral): Mistral key(s), pipe-separated
CEREBRAS_KEY (required if using Cerebras): Cerebras key(s), pipe-separated
PORT (optional, default 10000): Server port. Render sets this automatically.
PROVIDER_TIMEOUT_MS (optional, default 15000): Per-request timeout in milliseconds
RATE_LIMIT_MAX (optional, default 30): Max requests per window per IP
RATE_LIMIT_WINDOW_MS (optional, default 60000): Rate limit window in milliseconds

Known Limitations

Prompt structure is your responsibility

Models are instructed to return { state, package } JSON, but what goes inside package depends entirely on how you phrase your prompt. Be explicit about the shape you want.
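For example, a prompt that pins down the package shape explicitly (illustrative only; the field names here are an arbitrary choice):

```javascript
// Illustrative prompt that spells out the exact package shape wanted.
const reviewText = 'Arrived quickly and works great.';
const prompt =
  'Classify the sentiment of the review below as positive, negative, or neutral. ' +
  'Put the result in package as { "sentiment": string, "confidence": number }.' +
  '\n\nReview: ' + reviewText;
```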

too_complex is model-reported

Whether a model escalates is its own judgment call, not an objectively measured metric. Some models may never return too_complex regardless of prompt difficulty.

No streaming

/ask-ai is a blocking request. Cascade waits for a complete response before returning. For long prompts this can approach the PROVIDER_TIMEOUT_MS limit.

Free tier cold starts

On Render's free tier, the first request after 15 minutes of idle can take 30–60 seconds. Use /wake or point an uptime monitor at it to prevent this.

In-memory only

No request logging or response caching. Each call is fully stateless.

See it in Action

This assistant is built on top of Cascade — a simple chat interface that uses POST /ask-ai at difficulty 0.5 to answer questions about yzzy.online products. It demonstrates how easy it is to add AI to any browser app with a single fetch call.

yzzy.online AI Assistant

Ask questions about NetClient setup, Apex Racer tuning, Bullet Game enemies, and more. Powered by Cascade running on Render.

Try the demo →
// The entire integration is one fetch call
const res = await fetch("https://your-cascade.onrender.com/ask-ai", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    secret: SECRET,
    difficulty: 0.5,
    prompt: SYSTEM_PROMPT + "\n\nUser: " + userMessage
  })
});

const data = await res.json();
const reply = data.package.reply;
console.log("Answered by:", data.answeredBy.model);