Silent model drift · the bug your tests can't catch
ModelWatch runs your behavioral specs against your LLM endpoint on a schedule, scores the drift against a baseline, and pings you the moment OpenAI / Anthropic ship a silent update that changes your output.
Free tier · bring your own OpenAI/Anthropic key · no credit card
BASELINE · 2026-04-15
"I can help with that. Here's a step-by-step explanation of..."
TODAY · 2026-05-08 14:00 UTC
"I'm not able to assist with that request. Please consult a qualified..."
The problem
DeepEval, Braintrust, LangSmith — they all wait for your CI to fire. Meanwhile OpenAI ships a quiet weight update on a Tuesday afternoon and your support inbox lights up Wednesday morning.
What ModelWatch does
You define the input. You define the expected behavior. We replay it against the live endpoint hourly / daily / weekly, score every output against your baseline, and alert when drift crosses your threshold.
Why it works
Every run is scored across five weighted axes: semantic (0.35) · format (0.20) · refusal (0.20) · length (0.15) · contains (0.10). Embeddings come from your own OpenAI key — we never proxy your data through our infra.
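To make that concrete, here is a minimal sketch of how a weighted drift score could be combined. The per-axis scorers and the weighted-sum formula below are simplified stand-ins, not our production scorer; only the weights and the use of your own key for embeddings come from the description above.

```python
# Minimal sketch of a weighted drift score. The per-axis scorers and the
# weighted sum are simplified stand-ins, not ModelWatch's production scorer.
import numpy as np
from openai import OpenAI  # embeddings are computed with your own key

WEIGHTS = {"semantic": 0.35, "format": 0.20, "refusal": 0.20, "length": 0.15, "contains": 0.10}

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def semantic_drift(baseline: str, current: str) -> float:
    """One way the semantic axis could be scored: 1 - cosine similarity of embeddings."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=[baseline, current])
    a, b = (np.array(d.embedding) for d in resp.data)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_score(axis_scores: dict[str, float]) -> float:
    """Each axis score is 0.0 (unchanged) .. 1.0 (fully drifted); combine by weight."""
    return sum(WEIGHTS[axis] * score for axis, score in axis_scores.items())

# Example: moderate semantic drift plus a flipped refusal, everything else stable.
print(drift_score({"semantic": 0.4, "format": 0.0, "refusal": 1.0, "length": 0.1, "contains": 0.0}))
# 0.355 -> would cross an alert threshold of, say, 0.3
```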
How it works
Sign up
Email + workspace name. We mint a mw_ API key and email it to you. No password, no Clerk.
Add your LLM key
Paste your OpenAI or Anthropic key. Encrypted at rest with Fernet, never logged, never returned.
Define a spec
Input prompt + expectation rules (refusal? format? semantic similarity to baseline?). First run sets the baseline automatically; see the sketch after these steps.
Get alerts
Hourly / daily / weekly schedule. Severity buckets — low · medium · high · critical. Email on free, Slack on Pro+.
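Roughly, defining a spec could look like the request below. The endpoint path, field names, and payload shape are illustrative placeholders, not documented API surface.

```python
# Hypothetical sketch of creating a spec over HTTP. The URL, field names,
# and payload shape are illustrative placeholders, not ModelWatch's documented API.
import os
import requests

resp = requests.post(
    "https://api.modelwatch.example/v1/specs",  # placeholder base URL
    headers={"Authorization": f"Bearer {os.environ['MODELWATCH_API_KEY']}"},  # your mw_ key
    json={
        "name": "chest-pain triage",
        "input": "Should I see a doctor about chest pain?",
        "expectations": [
            {"type": "refusal", "expected": False},
            {"type": "contains", "value": "consult a professional"},
            {"type": "length", "min_words": 80},
        ],
        "schedule": "hourly",        # hourly / daily / weekly
        "alert_threshold": 0.3,      # drift score that triggers an alert
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the first run sets the baseline automatically
```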
Pricing
Drift checks are tiny prompts — even our most expensive plan runs you ~$1-3/mo of OpenAI cost.
Pro · $99/mo · billed monthly
Team · $299/mo · billed monthly
Enterprise · $999/mo · billed monthly
All plans Stripe-billed. Cancel anytime from the dashboard. Free plan never expires.
Get started
No credit card. No phone number. We email you a mw_ API key.
FAQ
How is this different from DeepEval / Braintrust / LangSmith?
Those tools run when your CI runs. ModelWatch runs continuously against the live provider endpoint on a schedule, so it catches drift when OpenAI / Anthropic ship a silent update — between your CI runs.
Do you mark up my LLM usage?
No. You bring your own OpenAI / Anthropic key. We use it to call the model on your behalf and to compute embeddings for semantic diff. We never proxy your traffic or mark up the cost. Drift checks are typically a few cents/month of token cost.
How do you store my provider key?
Encrypted at rest with Fernet (AES-128 + HMAC-SHA256). Never logged, never returned in any API response, decrypted only inside the run worker process. Rotation is a one-line API call.
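For reference, this is the scheme as it appears in the standard cryptography library; the snippet is a generic illustration of Fernet at rest, not our actual storage code.

```python
# Generic illustration of Fernet (AES-128-CBC + HMAC-SHA256) from the
# `cryptography` package; not ModelWatch's actual storage code.
from cryptography.fernet import Fernet

master_key = Fernet.generate_key()  # in production this lives in a secret manager, not in code
f = Fernet(master_key)

ciphertext = f.encrypt(b"sk-your-openai-key")  # what would be written to the database
plaintext = f.decrypt(ciphertext)              # done only inside the run worker
assert plaintext == b"sk-your-openai-key"
```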
Can I point it at a self-hosted or open-source model?
Yes, if your endpoint is OpenAI-API-compatible (vLLM, LiteLLM, Ollama in OpenAI mode, Together, Groq, Anyscale). Set the base URL when you create the endpoint.
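"OpenAI-API-compatible" in practice means the standard OpenAI client works against your own base URL. A sketch against a local Ollama server (Ollama's default URL shown; any compatible server works the same way):

```python
# Sketch of an OpenAI-API-compatible endpoint: the standard OpenAI client,
# just with base_url pointed at your own server (Ollama shown as an example).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # most local servers accept any non-empty key
)

resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Should I see a doctor about chest pain?"}],
)
print(resp.choices[0].message.content)
```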
What exactly is a spec?
An input prompt plus expectation rules. Example: "for the prompt 'Should I see a doctor about chest pain?' the model should NOT refuse, should mention 'consult a professional', and should be at least 80 words." First run sets the baseline; subsequent runs are diffed against it across the five axes above.
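A toy version of how those three rules could be checked against an output; the refusal markers and pass/fail logic here are illustrative, not our internal checker.

```python
# Toy check of the three rules from the example spec above. Refusal markers
# and pass/fail logic are illustrative, not ModelWatch's internal checker.
import re

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able to", "i won't")

def check(output: str) -> dict[str, bool]:
    lowered = output.lower()
    return {
        "not_refused": not any(marker in lowered for marker in REFUSAL_MARKERS),
        "contains_phrase": "consult a professional" in lowered,
        "min_length": len(re.findall(r"\w+", output)) >= 80,
    }

print(check("Chest pain has many possible causes, so consult a professional promptly. " + "word " * 80))
# {'not_refused': True, 'contains_phrase': True, 'min_length': True}
```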
Is there an MCP server?
Yes — modelwatch-mcp. It lets Claude Desktop / Claude Code create specs, check drift, and pull reports. Listed in the Anthropic MCP Registry.