1. Front Matter
Title: AI Scenario Seed — Content Safety Layer
Author: Joshua Uriel Tribiana
Reviewers: Sean Patrick (scorevi)
Creation Date: 2026-06-29
Status: Approved
References:
Issue: [2.4.1] AI Scenario Seed — Content Safety Layer
2. Introduction & Goals
Problem Summary
The AI Scenario Seed feature allows creators to submit free-form prompts, creating opportunities for misuse such as harmful or inappropriate outputs, off-brand content, and prompt injection attacks. Addressing these risks requires a layered content safety approach that protects users while ensuring the AI remains focused on its intended educational use.Goals
Prevent Harmful Content: Block the generation of scenarios related to illegal activities, hate speech, explicit content, and other policy violations.
Contextual Analysis: Differentiate between malicious use of a term (e.g., "hacking a bank") and its legitimate educational use (e.g., "preventing bank hacking").
Defense in Depth: Implement a multi-layered system combining fast regex filters with nuanced AI-based classification.
High Availability: Ensure the safety system remains operational even if the AI classifier service experiences temporary outages by using a circuit breaker pattern.
Evasion Resistance: Normalize user input to detect and block common evasion techniques (e.g., using special characters or misspellings).
Output Scanning: Scan the AI-generated output to ensure it does not contain harmful instructions or content that was not present in the prompt.
Non-Goals
100% Foolproof Guarantee: The safety layer aims to be robust but cannot guarantee blocking every conceivable malicious prompt. It is an evolving system.
Manual Moderation: The system is designed to be fully automated and does not include a manual review queue.
User-configurable Safety Levels: The safety policy is global and cannot be adjusted by individual users or agencies.
Glossary
Hard Block: An immediate rejection of a prompt based on zero-tolerance regex patterns for severe violations.
Sensitive Term: A keyword (e.g., "phishing") that is permissible only within an approved educational context.
AI Classifier: A specialized, secondary Gemini model used to perform nuanced safety analysis on prompts containing sensitive terms.
Circuit Breaker: A resilience pattern that temporarily bypasses the AI Classifier and falls back to a local-only check if the classifier fails repeatedly.
3. High-Level Architecture
System Diagram
AI Prompt Safety Workflow +----------------------+| User Input |+----------------------+ | v+----------------------+| Normalize Text |+----------------------+ | v+----------------------+| Hard Block Rules || (Regex Filter) |+----------------------+ | | Reject| |Pass v v+-----------+ +----------------------+| Reject | | Sensitive Term Check || (403) | +----------------------++-----------+ | +-----+-----+ | | Clean Sensitive | | v v +----------------+ +----------------------+ | Allow | | Circuit Breaker | +----------------+ +----------------------+ | | | +-------+-------+ | | | | Service OK Service Down | | | | v v | +----------------+ +----------------+ | | AI Classifier | | Local Check | | +----------------+ +----------------+ | | | | +------+------+ +----+----+ | | | | | | Pass Reject Pass Reject | | | | | +------+-------------+---+---------+ | v +----------------------+ | Generation Service | +----------------------+
Technologies Used
Next.js: For API routes.
Zod: For initial input validation at the API boundary.
Google Gemini: Used as the AI Classifier model.
Standard Regex: For fast, local pattern matching.
4. Detailed Design & Implementation
Data Model / Schema
This is a logic-based service and does not introduce its own database tables. It interacts with the
ai_usage_trackingtable to log safety-related events.Logic & Workflows
The core logic resides inlib/ai/scenario-seed-content-safety.service.tswithin theanalyzeScenarioSeedSafetyfunctionText Normalization:
Before any checks, the input
topicandcontextare passed through anormalizeTextfunction.This function converts text to lowercase, removes diacritics, and strips out zero-width spaces and other invisible characters used for evasion.
Layer 0a: Hard Blocks (Local Regex):
The normalized text is checked against a list of high-severity regex patterns defined in
ABSOLUTE_HARD_BLOCK_PATTERNS.These patterns cover unambiguous violations like illegal acts, hate speech, and self-harm.
A match at this stage results in an immediate
passed: falseresponse, and the request is rejected with a403 Forbiddenerror.
Layer 0b: Sensitive Term Detection (Local Regex):
If no hard blocks are found, the text is checked against
SENSITIVE_TERM_PATTERNS.These patterns identify terms that could be misused but are acceptable in a valid educational context (e.g., "phishing," "malware," "social engineering").
If a sensitive term is found, the request is flagged for further analysis by the AI Classifier. If no sensitive terms are found, the request is approved.
Layer 1: AI Classification (Gemini API):
For flagged requests, a prompt is constructed and sent to a specialized Gemini "classifier" model.
The prompt asks the model for a simple
passed: true/falseverdict based on whether the user's intent is educational or malicious.The system prompt for the classifier is heavily engineered to be strict and safety-focused.
The classifier's verdict determines the final outcome.
Circuit Breaker Logic:
The AI Classifier call is wrapped in a circuit breaker.
If the classifier API call fails (e.g., due to a timeout or
5xxerror) three consecutive times, the circuit breaker "trips" and enters an "open" state for 5 minutes.While open, all calls to the AI Classifier are bypassed, and the system falls back to a stricter, local-only regex check. This ensures the generation feature remains available, albeit with a less nuanced safety check.
After 5 minutes, the circuit breaker moves to a "half-open" state, allowing one test call. If it succeeds, the breaker closes; if it fails, it remains open.
Output Scanning:
After the main AI model generates the scenario content, a final, quick regex scan is performed on the output text.
This scan looks for patterns that would indicate the AI was jailbroken into providing harmful instructions.
If a violation is found in the output, the entire response is discarded, and an error is returned.
Key Files
lib/ai/scenario-seed-content-safety.service.ts: The core service containing all safety logic.app/api/creator/scenarios/generate/route.ts: The API route that calls the safety service.lib/ai/ai-usage-tracking.service.ts: Used to log safety-related events, including blocked prompts.
5. Infrastructure & Operations
Dependencies
Internal:
ai_usage_trackingservice for logging.External: Google Gemini API (for the classifier model).
Monitoring & Alerting
Logging:
All blocked prompts (both hard blocks and AI classifier blocks) are logged via
trackAIUsageSafewithsuccess: falseand a reason.Circuit breaker state changes (tripped, reset) are logged to
console.errorwith a specific prefix[Scenario Seed Circuit Breaker].
Alerts:
An alert should be configured for a high rate of
403 Forbiddenresponses, which could indicate a coordinated abuse attempt.A high-severity alert should be triggered when the circuit breaker trips, as this indicates a problem with the AI provider or our configuration that requires immediate attention.
Deployment Plan
The content safety service is an integral part of the AI Scenario Seed feature. It is deployed alongside the main API route (
.../generate/route.ts).Regex patterns for hard blocks and sensitive terms can be updated and deployed without requiring a full application rebuild.
6. Testing & Quality Assurance
Test Strategy
Unit Tests:
Create a test suite for
analyzeScenarioSeedSafety.Test hard block patterns with known malicious strings and verify they are blocked.
Test sensitive term patterns with both legitimate (e.g., "phishing awareness training") and malicious (e.g., "how to create a phishing email") prompts.
Mock the Gemini API call to test the AI classifier path and the circuit breaker logic (e.g., simulate API failures).
Integration Tests:
Test the
POST /api/creator/scenarios/generateendpoint with a variety of prompts.Verify that prompts matching hard block patterns correctly return a
403status code.Verify that prompts that should be allowed pass through and result in a
200status code.
E2E / QA:
AI-2 (Safety): As defined in
FEATURE_AUDIT_MAY2026.md, manually test the UI with flagged prompts to ensure the content-safety filter works as expected. Attempt to bypass the filter using common evasion techniques (leetspeak, special characters, etc.).
Known Limitations
False Positives/Negatives: The AI classifier, while powerful, is not perfect and may occasionally block a legitimate prompt (false positive) or allow a malicious one (false negative).
Static Regex: The regex patterns are static and require manual updates to counter new evasion techniques or address new threat vectors.
7. Maintenance & Support
Troubleshooting
User reports a legitimate prompt is being blocked:
Ask for the exact prompt text.
Check the
ai_usage_trackinglogs for the blocked request to see the reason (hard_blockorai_classifier).If it was a hard block, review the regex in
ABSOLUTE_HARD_BLOCK_PATTERNSto see if it's too broad.If it was the AI classifier, the system prompt for the classifier may need to be adjusted to be more lenient for that specific context.
Generation is slow or timing out:
Check the logs for circuit breaker alerts. If the breaker is tripped, it indicates a problem with the Gemini API for the classifier model.
Investigate the health of the external Gemini API.
Changelog
1.0 - Approved, Initial feature implementation, 2026-06-29