Silent model drift · the bug your tests can't catch
ModelWatch runs your behavioral specs against your LLM endpoint on a schedule, scores the drift against a baseline, and pings you the moment OpenAI / Anthropic ship a silent update that changes your output.
Free tier · bring your own OpenAI/Anthropic key · no credit card
BASELINE · 2026-04-15
"I can help with that. Here's a step-by-step explanation of..."
TODAY · 2026-05-08 14:00 UTC
"I'm not able to assist with that request. Please consult a qualified..."
The problem
DeepEval, Braintrust, LangSmith — they all wait for your CI to fire. Meanwhile OpenAI ships a quiet weight update on a Tuesday afternoon and your support inbox lights up Wednesday morning.
What ModelWatch does
You define the input. You define the expected behavior. We replay it against the live endpoint hourly / daily / weekly, score every output against your baseline, and alert when drift crosses your threshold.
Why it works
Every run is scored across five weighted axes: semantic (0.35) · format (0.20) · refusal (0.20) · length (0.15) · contains (0.10). Embeddings come from your own OpenAI key — we never proxy your data through our infra.
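To make that concrete, here is a minimal sketch of how a weighted drift score could be combined. The per-axis scorers and the weighted-sum formula below are simplified stand-ins, not our production scorer; only the weights and the use of your own key for embeddings come from the description above.

```python
# Minimal sketch of a weighted drift score. The per-axis scorers and the
# weighted sum are simplified stand-ins, not ModelWatch's production scorer.
import numpy as np
from openai import OpenAI  # embeddings are computed with your own key

WEIGHTS = {"semantic": 0.35, "format": 0.20, "refusal": 0.20, "length": 0.15, "contains": 0.10}

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def semantic_drift(baseline: str, current: str) -> float:
    """One way the semantic axis could be scored: 1 - cosine similarity of embeddings."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=[baseline, current])
    a, b = (np.array(d.embedding) for d in resp.data)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_score(axis_scores: dict[str, float]) -> float:
    """Each axis score is 0.0 (unchanged) .. 1.0 (fully drifted); combine by weight."""
    return sum(WEIGHTS[axis] * score for axis, score in axis_scores.items())

# Example: moderate semantic drift plus a flipped refusal, everything else stable.
print(drift_score({"semantic": 0.4, "format": 0.0, "refusal": 1.0, "length": 0.1, "contains": 0.0}))
# 0.355 -> would cross an alert threshold of, say, 0.3
```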
How it works
Sign up
Email + workspace name. We mint a mw_ API key and email it to you. No password, no Clerk.
Add your LLM key
Paste your OpenAI or Anthropic key. Encrypted at rest with Fernet, never logged, never returned.
Define a spec
Input prompt + expectation rules (refusal? format? semantic similarity to baseline?). First run sets the baseline automatically; see the sketch after these steps.
Get alerts
Hourly / daily / weekly schedule. Severity buckets — low · medium · high · critical. Email on free, Slack on Pro+.
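Roughly, defining a spec could look like the request below. The endpoint path, field names, and payload shape are illustrative placeholders, not documented API surface.

```python
# Hypothetical sketch of creating a spec over HTTP. The URL, field names,
# and payload shape are illustrative placeholders, not ModelWatch's documented API.
import os
import requests

resp = requests.post(
    "https://api.modelwatch.example/v1/specs",  # placeholder base URL
    headers={"Authorization": f"Bearer {os.environ['MODELWATCH_API_KEY']}"},  # your mw_ key
    json={
        "name": "chest-pain triage",
        "input": "Should I see a doctor about chest pain?",
        "expectations": [
            {"type": "refusal", "expected": False},
            {"type": "contains", "value": "consult a professional"},
            {"type": "length", "min_words": 80},
        ],
        "schedule": "hourly",        # hourly / daily / weekly
        "alert_threshold": 0.3,      # drift score that triggers an alert
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the first run sets the baseline automatically
```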
Pricing
Drift checks are tiny prompts — even our most expensive plan runs you ~$1-3/mo of OpenAI cost.
Pro · $99/mo · billed monthly
Team · $299/mo · billed monthly
Enterprise · $999/mo · billed monthly
All plans Stripe-billed. Cancel anytime from the dashboard. Free plan never expires.
Get started
No credit card. No phone number. We email you a mw_ API key.
FAQ
How is this different from DeepEval / Braintrust / LangSmith?
Those tools run when your CI runs. ModelWatch runs continuously against the live provider endpoint on a schedule, so it catches drift when OpenAI / Anthropic ship a silent update — between your CI runs.
Do you mark up my LLM usage?
No. You bring your own OpenAI / Anthropic key. We use it to call the model on your behalf and to compute embeddings for semantic diff. We never proxy your traffic or mark up the cost. Drift checks are typically a few cents/month of token cost.
How do you store my provider key?
Encrypted at rest with Fernet (AES-128 + HMAC-SHA256). Never logged, never returned in any API response, decrypted only inside the run worker process. Rotation is a one-line API call.
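For reference, this is the scheme as it appears in the standard cryptography library; the snippet is a generic illustration of Fernet at rest, not our actual storage code.

```python
# Generic illustration of Fernet (AES-128-CBC + HMAC-SHA256) from the
# `cryptography` package; not ModelWatch's actual storage code.
from cryptography.fernet import Fernet

master_key = Fernet.generate_key()  # in production this lives in a secret manager, not in code
f = Fernet(master_key)

ciphertext = f.encrypt(b"sk-your-openai-key")  # what would be written to the database
plaintext = f.decrypt(ciphertext)              # done only inside the run worker
assert plaintext == b"sk-your-openai-key"
```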
Can I point it at a self-hosted or open-source model?
Yes, if your endpoint is OpenAI-API-compatible (vLLM, LiteLLM, Ollama in OpenAI mode, Together, Groq, Anyscale). Set the base URL when you create the endpoint.
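"OpenAI-API-compatible" in practice means the standard OpenAI client works against your own base URL. A sketch against a local Ollama server (Ollama's default URL shown; any compatible server works the same way):

```python
# Sketch of an OpenAI-API-compatible endpoint: the standard OpenAI client,
# just with base_url pointed at your own server (Ollama shown as an example).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # most local servers accept any non-empty key
)

resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Should I see a doctor about chest pain?"}],
)
print(resp.choices[0].message.content)
```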
What exactly is a spec?
An input prompt plus expectation rules. Example: "for the prompt 'Should I see a doctor about chest pain?' the model should NOT refuse, should mention 'consult a professional', and should be at least 80 words." First run sets the baseline; subsequent runs are diffed against it across the five axes above.
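A toy version of how those three rules could be checked against an output; the refusal markers and pass/fail logic here are illustrative, not our internal checker.

```python
# Toy check of the three rules from the example spec above. Refusal markers
# and pass/fail logic are illustrative, not ModelWatch's internal checker.
import re

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able to", "i won't")

def check(output: str) -> dict[str, bool]:
    lowered = output.lower()
    return {
        "not_refused": not any(marker in lowered for marker in REFUSAL_MARKERS),
        "contains_phrase": "consult a professional" in lowered,
        "min_length": len(re.findall(r"\w+", output)) >= 80,
    }

print(check("Chest pain has many possible causes, so consult a professional promptly. " + "word " * 80))
# {'not_refused': True, 'contains_phrase': True, 'min_length': True}
```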
Is there an MCP server?
Yes — modelwatch-mcp. It lets Claude Desktop / Claude Code create specs, check drift, and pull reports. Listed in the Anthropic MCP Registry.