obsidian/wiki/concepts/security-middleware-json-key-regex.md
2026-05-06 21:05:03 +01:00

title: Security Middleware Regex Matching JSON Keys Causes False Positives
aliases: security-middleware-false-positive, xss-regex-json-keys, middleware-json-field-names
tags: security, middleware, regex, xss, fastapi, python
sources: daily/2026-04-29.md
created: 2026-04-29
updated: 2026-05-06

Security Middleware Regex Matching JSON Keys Causes False Positives

A bare XSS/injection regex such as (script|javascript|eval|expression), applied to the raw JSON request body, matches field names as well as field values. Field names like audio_description_vtt, transcript, and subscription all contain the substring "script", so the middleware rejects valid payloads with HTTP 400 and breaks application features. The fix is to scan only values, not keys, and to tighten the patterns with word boundaries (\b) or required HTML-context markers.

Key Points

  • Scanning raw JSON text (not parsed structure) means regex matches both keys and values indiscriminately
  • Common field names with false-positive substrings: audio_description_vtt (→ "script"), transcript (→ "script"), subscription (→ "script"), evaluate (→ "eval"), expression_tree (→ "expression")
  • Fix option 1: parse JSON first, then only validate the values recursively
  • Fix option 2: use word boundaries, e.g. \bscript\b, which does not match transcript but still matches <script>; note that the standalone word "script" in ordinary prose still matches
  • Fix option 3: require HTML context markers (<, >, or javascript:) before flagging as XSS

Details

The Problematic Pattern

import re

from fastapi import Request
from fastapi.responses import JSONResponse

# ❌ BAD: scans the raw body text, so it matches field names and values equally
XSS_PATTERNS = re.compile(
    r'(script|javascript|eval|expression|vbscript|onload|onerror)',
    re.IGNORECASE
)

async def security_middleware(request: Request, call_next):
    body = await request.body()
    body_text = body.decode('utf-8')
    # The regex sees keys, values, and punctuation as one undifferentiated string
    if XSS_PATTERNS.search(body_text):
        return JSONResponse(status_code=400, content={"error": "Invalid content"})
    return await call_next(request)

This rejects {"transcript": "Hello world"} because transcript contains script.
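
To see which part of a payload triggers the rejection, the match can be widened to the surrounding identifier. This is a minimal debugging sketch; `explain_match` is a hypothetical helper, not part of the middleware.

```python
import re
from typing import Optional

# The bare pattern from the example above, applied to raw JSON text.
BARE_XSS = re.compile(
    r'(script|javascript|eval|expression|vbscript|onload|onerror)',
    re.IGNORECASE,
)

def _is_word_char(ch: str) -> bool:
    return ch.isalnum() or ch == "_"

def explain_match(raw_body: str) -> Optional[str]:
    """Return the identifier surrounding the first match, or None if nothing matches."""
    m = BARE_XSS.search(raw_body)
    if m is None:
        return None
    # Widen the match to the full run of word characters around it.
    start, end = m.start(), m.end()
    while start > 0 and _is_word_char(raw_body[start - 1]):
        start -= 1
    while end < len(raw_body) and _is_word_char(raw_body[end]):
        end += 1
    return raw_body[start:end]

print(explain_match('{"transcript": "Hello world"}'))  # transcript
print(explain_match('{"title": "Hello world"}'))       # None
```

Logging the widened match alongside the 400 response makes it obvious when the hit came from a field name and not from user-supplied content.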

Fix 1: Parse JSON and Validate Values Only

import json

# Request, JSONResponse, and XSS_PATTERNS are the names from the example above.

def check_values_recursive(obj, pattern) -> bool:
    """Return True if any string VALUE (never a key) matches the pattern."""
    if isinstance(obj, dict):
        return any(check_values_recursive(v, pattern) for v in obj.values())
    elif isinstance(obj, list):
        return any(check_values_recursive(item, pattern) for item in obj)
    elif isinstance(obj, str):
        return bool(pattern.search(obj))
    return False  # numbers, booleans, and null cannot carry markup

async def security_middleware(request: Request, call_next):
    body = await request.body()
    try:
        parsed = json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        parsed = None  # Non-JSON or badly encoded body: handle separately
    if parsed is not None and check_values_recursive(parsed, XSS_PATTERNS):
        return JSONResponse(status_code=400, content={"error": "Invalid content"})
    return await call_next(request)
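
A short usage sketch of the value-only walk, reusing `check_values_recursive` as defined above. The pattern here is the loose one from the start of the note (named `BARE` for this example only), so the last sample shows what value-only scanning does not fix.

```python
import re

# The bare pattern from earlier, kept deliberately loose for the demo.
BARE = re.compile(r'(script|javascript|eval|expression)', re.IGNORECASE)

def check_values_recursive(obj, pattern) -> bool:
    """Return True if any string value (never a key) matches the pattern."""
    if isinstance(obj, dict):
        return any(check_values_recursive(v, pattern) for v in obj.values())
    if isinstance(obj, list):
        return any(check_values_recursive(item, pattern) for item in obj)
    if isinstance(obj, str):
        return bool(pattern.search(obj))
    return False

samples = [
    {"transcript": "Hello world"},                   # offending text only in the key
    {"items": [{"subscription": {"plan": "pro"}}]},  # nested keys, clean values
    {"comment": "<script>alert(1)</script>"},        # attack text in a value
    {"note": "see the transcript"},                  # benign value, loose pattern
]
for payload in samples:
    print(check_values_recursive(payload, BARE), payload)
```

The first two payloads pass because keys are never inspected. The last one is still flagged, because a loose pattern misfires on ordinary words inside values too; value-only scanning and a tighter pattern address different halves of the problem.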

Fix 2: Word Boundaries + HTML Context

# ✅ BETTER: requires HTML context for XSS patterns
XSS_PATTERNS = re.compile(
    r'(<\s*script|javascript\s*:|\bon\w+\s*=|<\s*iframe|vbscript\s*:)',
    re.IGNORECASE
)
# "transcript" → no match (no < before script)
# "<script>" → match ✓
# "javascript:alert(1)" → match ✓
# "?content=1" → no match (\b stops on\w+= from firing inside "content=")

# Only flag values, and require HTML or SQL attack markers.
# Patterns are compiled once at import time, not on every call.
HTML_XSS = re.compile(r'(<\s*script|javascript\s*:|\bon\w+\s*=)', re.IGNORECASE)
# A quote, then ";", then a SQL keyword (the quote alternatives are grouped
# so that a lone '; no longer matches on its own)
SQL_INJECTION = re.compile(r"['\"];\s*(drop|delete|insert|update)\s", re.IGNORECASE)

def is_suspicious_value(value: str) -> bool:
    return bool(HTML_XSS.search(value) or SQL_INJECTION.search(value))
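
The event-handler alternative (on\w+\s*=) is the easiest part of this pattern to get wrong, because "on" followed by letters and "=" also occurs inside ordinary words. The sketch below compares that alternative with and without a leading \b; the names `UNANCHORED` and `ANCHORED` and the sample strings are illustrative assumptions.

```python
import re

# Same HTML-context pattern twice; only the event-handler alternative differs.
UNANCHORED = re.compile(r'(<\s*script|javascript\s*:|on\w+\s*=)', re.IGNORECASE)
ANCHORED = re.compile(r'(<\s*script|javascript\s*:|\bon\w+\s*=)', re.IGNORECASE)

samples = [
    "transcript",
    "<script>alert(1)</script>",
    "javascript:alert(1)",
    "<img src=x onerror=alert(1)>",
    "https://example.com/?content=1",
]
for text in samples:
    print(bool(UNANCHORED.search(text)), bool(ANCHORED.search(text)), text)
```

Both patterns agree on the first four samples. Only the unanchored one flags the URL, because "ontent=" inside "content=" satisfies on\w+\s*=.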

False Positive Field Names to Watch For

| Field Name | Offending Substring |
|---|---|
| audio_description_vtt | script |
| transcript | script |
| subscription | script |
| transcription_id | script |
| evaluate_result | eval |
| expression_type | expression |
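
A list like this can be generated by running the bare pattern over an API's field names. The sketch below assumes the names are available as a plain list; `colliding_field_names` is a hypothetical helper.

```python
import re

BARE = re.compile(
    r'(script|javascript|eval|expression|vbscript|onload|onerror)',
    re.IGNORECASE,
)

def colliding_field_names(names):
    """Map each field name the bare pattern would flag to the substring it hit."""
    hits = {}
    for name in names:
        m = BARE.search(name)
        if m:
            hits[name] = m.group(0).lower()
    return hits

field_names = [
    "audio_description_vtt", "transcript", "subscription",
    "transcription_id", "evaluate_result", "expression_type",
    "title", "duration_ms",
]
for name, hit in colliding_field_names(field_names).items():
    print(f"{name}: {hit}")
```

Running a check like this against new schemas before deployment catches collisions before an endpoint starts returning 400.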

Command Injection Patterns: \b Word Boundary Is Mandatory

XSS patterns are not the only problem. Command injection regex patterns suffer the same false-positive issue when applied to raw JSON text, and the fix is different: word boundaries (\b) are required because HTML-context markers don't apply.

The False Positive Table for Command Tokens

| Pattern (no \b) | False positive example |
|---|---|
| sh\s+ | "Josh Smith": the sh in "Josh " matches |
| rm\s+ | "Norm ": the rm in "Norm " matches |
| nc\s+ | Any word ending in nc followed by a space, e.g. "zinc oxide" |
| wget\s+ | "get widgets" does not match (get w is not wget), but any text containing wget as a substring followed by whitespace would |
| curl\s+ | "uncurl the cable": the curl inside "uncurl " matches |

Fix: Wrap All Command Tokens in \b

# ❌ BAD: matches substrings in names, field values, natural text
COMMAND_INJECTION = re.compile(
    r'(sh\s+|rm\s+|nc\s+|wget\s+|curl\s+|bash\s+|python\s+|perl\s+)',
    re.IGNORECASE
)

# ✅ GOOD: \b on both sides, so each token only matches as a whole word
COMMAND_INJECTION = re.compile(
    r'\b(?:sh|rm|nc|wget|curl|bash|python|perl)\b',
    re.IGNORECASE
)
# \b is needed before and after the token: \bsh\b matches neither "Josh" nor "shell".
# Whole words in ordinary prose (e.g. "I teach Python") still match.
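
A quick side-by-side of the two patterns on sample values, as a sketch. `UNBOUNDED` and `BOUNDED` are names chosen for this example, and the sample strings are illustrative.

```python
import re

UNBOUNDED = re.compile(
    r'(sh\s+|rm\s+|nc\s+|wget\s+|curl\s+|bash\s+|python\s+|perl\s+)',
    re.IGNORECASE,
)
BOUNDED = re.compile(
    r'\b(?:sh|rm|nc|wget|curl|bash|python|perl)\b',
    re.IGNORECASE,
)

samples = [
    "Josh Smith",       # "sh " inside a name
    "Norm Macdonald",   # "rm " inside a name
    "zinc oxide",       # "nc " inside a word
    "; rm -rf /tmp/x",  # an actual command token
    "I teach Python",   # a whole word in ordinary prose
]
for text in samples:
    print(bool(UNBOUNDED.search(text)), bool(BOUNDED.search(text)), text)
```

The bounded pattern clears the three name and word cases and still catches the command. The last sample goes the other way: dropping the trailing \s+ means a bare "Python" at the end of a sentence now matches, which is one reason to tighten tokens further, for example by requiring arguments.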

Combined Pattern (XSS + Command Injection, Value-Only Scan)

import re

# Apply after parsing JSON: scan VALUES only
XSS_PATTERNS = re.compile(
    r'(<\s*script|javascript\s*:|\bon\w+\s*=)',  # \b keeps "content=" from matching
    re.IGNORECASE
)
# rm and nc additionally require typical arguments, not just a word boundary
CMD_INJECTION = re.compile(
    r'(\bsh\b|\brm\s+-rf|\bnc\b\s+\S+\s+\d+|\bwget\b|\bcurl\b)',
    re.IGNORECASE
)

def is_suspicious_value(value: str) -> bool:
    return bool(XSS_PATTERNS.search(value) or CMD_INJECTION.search(value))
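
To tie the pieces together, the sketch below parses a body and reports where in the structure a suspicious value was found, which is more useful in logs than a bare 400. `suspicious_paths` and its `$.key[index]` path notation are assumptions made for this example, and the XSS pattern uses a \b-anchored event-handler alternative.

```python
import json
import re
from typing import Any, Iterator

XSS_PATTERNS = re.compile(r'(<\s*script|javascript\s*:|\bon\w+\s*=)', re.IGNORECASE)
CMD_INJECTION = re.compile(
    r'(\bsh\b|\brm\s+-rf|\bnc\b\s+\S+\s+\d+|\bwget\b|\bcurl\b)',
    re.IGNORECASE,
)

def is_suspicious_value(value: str) -> bool:
    return bool(XSS_PATTERNS.search(value) or CMD_INJECTION.search(value))

def suspicious_paths(obj: Any, path: str = "$") -> Iterator[str]:
    """Yield a path for every string value that looks suspicious. Keys are never scanned."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from suspicious_paths(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from suspicious_paths(item, f"{path}[{index}]")
    elif isinstance(obj, str) and is_suspicious_value(obj):
        yield path

body = (
    b'{"transcript": "Hello world", "author": "Josh Smith", '
    b'"notes": ["ok", "<script>alert(1)</script>"], "cmd": "rm -rf /"}'
)
print(list(suspicious_paths(json.loads(body))))  # ['$.notes[1]', '$.cmd']
```

The transcript key and the "Josh Smith" value both pass, while the script tag and the rm -rf command are reported with their locations.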

[!tip] Parse first, then scan
Word boundaries reduce false positives but do not eliminate them for common two-letter tokens like rm and nc. The most robust approach is to parse the JSON first and scan string values only, as described in Fix 1 above.

Sources

  • daily/2026-04-29.md — Identified when Video Accessibility transcript/VTT upload endpoints returned 400