Security Middleware Regex Matching JSON Keys Causes False Positives

A bare XSS/injection regex pattern like (script|javascript|eval|expression) applied to raw JSON request bodies matches field names as well as field values. Field names like audio_description_vtt, transcript, and subscription all contain "script". The middleware rejects these valid payloads with 400, breaking application features. The fix is to only scan values, not keys, and to use word boundaries (\b) or require HTML-context markers for XSS patterns.

Key Points

Scanning raw JSON text (not parsed structure) means regex matches both keys and values indiscriminately
Common field names with false-positive substrings: audio_description_vtt (→ "script"), transcript (→ "script"), subscription (→ "script"), evaluate (→ "eval"), expression_tree (→ "expression")
Fix option 1: parse JSON first, then only validate the values recursively
Fix option 2: use word boundaries \bscript\b — won't match transcript but still catches <script>
Fix option 3: require HTML context markers (<, >, or javascript:) before flagging as XSS

Details

The Problematic Pattern

# ❌ BAD — matches field names and values equally
XSS_PATTERNS = re.compile(
    r'(script|javascript|eval|expression|vbscript|onload|onerror)',
    re.IGNORECASE
)

async def security_middleware(request: Request, call_next):
    body = await request.body()
    body_text = body.decode('utf-8')
    if XSS_PATTERNS.search(body_text):
        return JSONResponse(status_code=400, content={"error": "Invalid content"})
    return await call_next(request)

This rejects {"transcript": "Hello world"} because transcript contains script.

Fix 1: Parse JSON and Validate Values Only

import json

def check_values_recursive(obj, pattern) -> bool:
    """Returns True if any VALUE (not key) matches the pattern."""
    if isinstance(obj, dict):
        return any(check_values_recursive(v, pattern) for v in obj.values())
    elif isinstance(obj, list):
        return any(check_values_recursive(item, pattern) for item in obj)
    elif isinstance(obj, str):
        return bool(pattern.search(obj))
    return False

async def security_middleware(request: Request, call_next):
    body = await request.body()
    try:
        parsed = json.loads(body)
        if check_values_recursive(parsed, XSS_PATTERNS):
            return JSONResponse(status_code=400, content={"error": "Invalid content"})
    except json.JSONDecodeError:
        pass  # Non-JSON body — handle separately
    return await call_next(request)

Fix 2: Word Boundaries + HTML Context

# ✅ BETTER — requires HTML context for XSS patterns
XSS_PATTERNS = re.compile(
    r'(<\s*script|javascript\s*:|on\w+\s*=|<\s*iframe|vbscript\s*:)',
    re.IGNORECASE
)
# "transcript" → no match (no < before script)
# "<script>" → match ✓
# "javascript:alert(1)" → match ✓

Fix 3: Combined Approach (Recommended)

# Only flag values, and require HTML attack vectors
def is_suspicious_value(value: str) -> bool:
    html_xss = re.compile(r'(<\s*script|javascript\s*:|on\w+=)', re.IGNORECASE)
    sql_injection = re.compile(r"(';|\";\s*(drop|delete|insert|update)\s)", re.IGNORECASE)
    return bool(html_xss.search(value) or sql_injection.search(value))

False Positive Field Names to Watch For

Field Name	Offending Substring
`audio_description_vtt`	`script`
`transcript`	`script`
`subscription`	`script`
`transcription_id`	`script`
`evaluate_result`	`eval`
`expression_type`	`expression`

Command Injection Patterns: `\b` Word Boundary Is Mandatory

XSS patterns are not the only problem. Command injection regex patterns suffer the same false-positive issue when applied to raw JSON text, and the fix is different: word boundaries (\b) are required because HTML-context markers don't apply.

The False Positive Table for Command Tokens

Pattern (no `\b`)	False positive example
`sh\s+`	`"Josh Smith"` — the `sh` in "Josh S..." matches
`rm\s+`	`"Norm "` — `rm` in "Norm " matches
`nc\s+`	Any name containing `nc` followed by a space
`wget\s+`	Field value `"get widgets"` — `get w` doesn't match, but `wget` as substring would
`curl\s+`	`"security curl"` — substring match

Fix: Prefix All Command Tokens with `\b`

# ❌ BAD — matches substrings in names, field values, natural text
COMMAND_INJECTION = re.compile(
    r'(sh\s+|rm\s+|nc\s+|wget\s+|curl\s+|bash\s+|python\s+|perl\s+)',
    re.IGNORECASE
)

# ✅ GOOD — \b ensures we only match at a word boundary
COMMAND_INJECTION = re.compile(
    r'(\bsh\b|\brm\b|\bnc\b|\bwget\b|\bcurl\b|\bbash\b|\bpython\b|\bperl\b)',
    re.IGNORECASE
)
# Note: \b before the token, but also after — \bsh\b won't match "Josh" or "shell"

Combined Pattern (XSS + Command Injection, Value-Only Scan)

# Apply after parsing JSON — scan VALUES only
XSS_PATTERNS = re.compile(
    r'(<\s*script|javascript\s*:|on\w+\s*=)',
    re.IGNORECASE
)
CMD_INJECTION = re.compile(
    r'(\bsh\b|\brm\s+-rf|\bnc\b\s+\S+\s+\d+|\bwget\b|\bcurl\b)',
    re.IGNORECASE
)

def is_suspicious_value(value: str) -> bool:
    return bool(XSS_PATTERNS.search(value) or CMD_INJECTION.search(value))

[!tip] Parse first, then scan Word boundaries reduce false positives but don't eliminate them for common 2-letter tokens like rm and nc. The most robust approach is always: parse JSON first, scan string values only — as described in Fix 1 above.

wiki/tech-patterns/xss-validation-false-positive-word-boundary — existing tech-pattern covering this same fix
wiki/concepts/azure-ad-yaml-allowlist-pattern — related security middleware pattern (authz layer)

Sources

daily/2026-04-29.md — Identified when Video Accessibility transcript/VTT upload endpoints returned 400

6.3 KiB Raw Blame History