AI Scenario Seed — Content Safety Layer

1. Front Matter

  • Title: AI Scenario Seed — Content Safety Layer

  • Author: Joshua Uriel Tribiana

  • Reviewers: Sean Patrick (scorevi)

  • Creation Date: 2026-06-29

  • Status: Approved

  • References:

    • Issue: [2.4.1] AI Scenario Seed — Content Safety Layer

2. Introduction & Goals

  • Problem Summary
    The AI Scenario Seed feature allows creators to submit free-form prompts, creating opportunities for misuse such as harmful or inappropriate outputs, off-brand content, and prompt injection attacks. Addressing these risks requires a layered content safety approach that protects users while ensuring the AI remains focused on its intended educational use.

  • Goals

    • Prevent Harmful Content: Block the generation of scenarios related to illegal activities, hate speech, explicit content, and other policy violations.

    • Contextual Analysis: Differentiate between malicious use of a term (e.g., "hacking a bank") and its legitimate educational use (e.g., "preventing bank hacking").

    • Defense in Depth: Implement a multi-layered system combining fast regex filters with nuanced AI-based classification.

    • High Availability: Ensure the safety system remains operational even if the AI classifier service experiences temporary outages by using a circuit breaker pattern.

    • Evasion Resistance: Normalize user input to detect and block common evasion techniques (e.g., using special characters or misspellings).

    • Output Scanning: Scan the AI-generated output to ensure it does not contain harmful instructions or content that was not present in the prompt.

  • Non-Goals

    • 100% Foolproof Guarantee: The safety layer aims to be robust but cannot guarantee blocking every conceivable malicious prompt. It is an evolving system.

    • Manual Moderation: The system is designed to be fully automated and does not include a manual review queue.

    • User-configurable Safety Levels: The safety policy is global and cannot be adjusted by individual users or agencies.

  • Glossary

    • Hard Block: An immediate rejection of a prompt based on zero-tolerance regex patterns for severe violations.

    • Sensitive Term: A keyword (e.g., "phishing") that is permissible only within an approved educational context.

    • AI Classifier: A specialized, secondary Gemini model used to perform nuanced safety analysis on prompts containing sensitive terms.

    • Circuit Breaker: A resilience pattern that temporarily bypasses the AI Classifier and falls back to a local-only check if the classifier fails repeatedly.

3. High-Level Architecture

  • System Diagram

AI Prompt Safety Workflow
 
+----------------------+
| User Input |
+----------------------+
|
v
+----------------------+
| Normalize Text |
+----------------------+
|
v
+----------------------+
| Hard Block Rules |
| (Regex Filter) |
+----------------------+
| |
Reject| |Pass
v v
+-----------+ +----------------------+
| Reject | | Sensitive Term Check |
| (403) | +----------------------+
+-----------+ |
+-----+-----+
| |
Clean Sensitive
| |
v v
+----------------+ +----------------------+
| Allow | | Circuit Breaker |
+----------------+ +----------------------+
| |
| +-------+-------+
| | |
| Service OK Service Down
| | |
| v v
| +----------------+ +----------------+
| | AI Classifier | | Local Check |
| +----------------+ +----------------+
| | |
| +------+------+ +----+----+
| | | | |
| Pass Reject Pass Reject
| | | | |
+------+-------------+---+---------+
|
v
+----------------------+
| Generation Service |
+----------------------+
  • Technologies Used

    • Next.js: For API routes.

    • Zod: For initial input validation at the API boundary.

    • Google Gemini: Used as the AI Classifier model.

    • Standard Regex: For fast, local pattern matching.

4. Detailed Design & Implementation

  • Data Model / Schema

    This is a logic-based service and does not introduce its own database tables. It interacts with the ai_usage_tracking table to log safety-related events.

  • Logic & Workflows
    The core logic resides in lib/ai/scenario-seed-content-safety.service.ts within the analyzeScenarioSeedSafety function

    1. Text Normalization:

      • Before any checks, the input topic and context are passed through a normalizeText function.

      • This function converts text to lowercase, removes diacritics, and strips out zero-width spaces and other invisible characters used for evasion.

    2. Layer 0a: Hard Blocks (Local Regex):

      • The normalized text is checked against a list of high-severity regex patterns defined in ABSOLUTE_HARD_BLOCK_PATTERNS.

      • These patterns cover unambiguous violations like illegal acts, hate speech, and self-harm.

      • A match at this stage results in an immediate passed: false response, and the request is rejected with a 403 Forbidden error.

    3. Layer 0b: Sensitive Term Detection (Local Regex):

      • If no hard blocks are found, the text is checked against SENSITIVE_TERM_PATTERNS.

      • These patterns identify terms that could be misused but are acceptable in a valid educational context (e.g., "phishing," "malware," "social engineering").

      • If a sensitive term is found, the request is flagged for further analysis by the AI Classifier. If no sensitive terms are found, the request is approved.

    4. Layer 1: AI Classification (Gemini API):

      • For flagged requests, a prompt is constructed and sent to a specialized Gemini "classifier" model.

      • The prompt asks the model for a simple passed: true/false verdict based on whether the user's intent is educational or malicious.

      • The system prompt for the classifier is heavily engineered to be strict and safety-focused.

      • The classifier's verdict determines the final outcome.

    5. Circuit Breaker Logic:

      • The AI Classifier call is wrapped in a circuit breaker.

      • If the classifier API call fails (e.g., due to a timeout or 5xx error) three consecutive times, the circuit breaker "trips" and enters an "open" state for 5 minutes.

      • While open, all calls to the AI Classifier are bypassed, and the system falls back to a stricter, local-only regex check. This ensures the generation feature remains available, albeit with a less nuanced safety check.

      • After 5 minutes, the circuit breaker moves to a "half-open" state, allowing one test call. If it succeeds, the breaker closes; if it fails, it remains open.

    6. Output Scanning:

      • After the main AI model generates the scenario content, a final, quick regex scan is performed on the output text.

      • This scan looks for patterns that would indicate the AI was jailbroken into providing harmful instructions.

      • If a violation is found in the output, the entire response is discarded, and an error is returned.

  • Key Files

    • lib/ai/scenario-seed-content-safety.service.ts: The core service containing all safety logic.

    • app/api/creator/scenarios/generate/route.ts: The API route that calls the safety service.

    • lib/ai/ai-usage-tracking.service.ts: Used to log safety-related events, including blocked prompts.

5. Infrastructure & Operations

Dependencies

  • Internal: ai_usage_tracking service for logging.

  • External: Google Gemini API (for the classifier model).

Monitoring & Alerting

  • Logging:

    • All blocked prompts (both hard blocks and AI classifier blocks) are logged via trackAIUsageSafe with success: false and a reason.

    • Circuit breaker state changes (tripped, reset) are logged to console.error with a specific prefix [Scenario Seed Circuit Breaker].

  • Alerts:

    • An alert should be configured for a high rate of 403 Forbidden responses, which could indicate a coordinated abuse attempt.

    • A high-severity alert should be triggered when the circuit breaker trips, as this indicates a problem with the AI provider or our configuration that requires immediate attention.

Deployment Plan

  • The content safety service is an integral part of the AI Scenario Seed feature. It is deployed alongside the main API route (.../generate/route.ts).

  • Regex patterns for hard blocks and sensitive terms can be updated and deployed without requiring a full application rebuild.

6. Testing & Quality Assurance

Test Strategy

  • Unit Tests:

    • Create a test suite for analyzeScenarioSeedSafety.

    • Test hard block patterns with known malicious strings and verify they are blocked.

    • Test sensitive term patterns with both legitimate (e.g., "phishing awareness training") and malicious (e.g., "how to create a phishing email") prompts.

    • Mock the Gemini API call to test the AI classifier path and the circuit breaker logic (e.g., simulate API failures).

  • Integration Tests:

    • Test the POST /api/creator/scenarios/generate endpoint with a variety of prompts.

    • Verify that prompts matching hard block patterns correctly return a 403 status code.

    • Verify that prompts that should be allowed pass through and result in a 200 status code.

  • E2E / QA:

    • AI-2 (Safety): As defined in FEATURE_AUDIT_MAY2026.md, manually test the UI with flagged prompts to ensure the content-safety filter works as expected. Attempt to bypass the filter using common evasion techniques (leetspeak, special characters, etc.).

Known Limitations

  • False Positives/Negatives: The AI classifier, while powerful, is not perfect and may occasionally block a legitimate prompt (false positive) or allow a malicious one (false negative).

  • Static Regex: The regex patterns are static and require manual updates to counter new evasion techniques or address new threat vectors.

7. Maintenance & Support

Troubleshooting

  • User reports a legitimate prompt is being blocked:

    1. Ask for the exact prompt text.

    2. Check the ai_usage_tracking logs for the blocked request to see the reason (hard_block or ai_classifier).

    3. If it was a hard block, review the regex in ABSOLUTE_HARD_BLOCK_PATTERNS to see if it's too broad.

    4. If it was the AI classifier, the system prompt for the classifier may need to be adjusted to be more lenient for that specific context.

  • Generation is slow or timing out:

    1. Check the logs for circuit breaker alerts. If the breaker is tripped, it indicates a problem with the Gemini API for the classifier model.

    2. Investigate the health of the external Gemini API.

Changelog

1.0 - Approved, Initial feature implementation, 2026-06-29


Was this article helpful?