| title |
aliases |
tags |
sources |
created |
updated |
| Security Middleware Regex Matching JSON Keys Causes False Positives |
| security-middleware-false-positive |
| xss-regex-json-keys |
| middleware-json-field-names |
|
| security |
| middleware |
| regex |
| xss |
| fastapi |
| python |
|
|
2026-04-29 |
2026-05-06 |
Security Middleware Regex Matching JSON Keys Causes False Positives
A bare XSS/injection regex pattern like (script|javascript|eval|expression) applied to raw JSON request bodies matches field names as well as field values. Field names like audio_description_vtt, transcript, and subscription all contain "script". The middleware rejects these valid payloads with 400, breaking application features. The fix is to only scan values, not keys, and to use word boundaries (\b) or require HTML-context markers for XSS patterns.
Key Points
- Scanning raw JSON text (not parsed structure) means regex matches both keys and values indiscriminately
- Common field names with false-positive substrings:
audio_description_vtt (→ "script"), transcript (→ "script"), subscription (→ "script"), evaluate (→ "eval"), expression_tree (→ "expression")
- Fix option 1: parse JSON first, then only validate the values recursively
- Fix option 2: use word boundaries
\bscript\b — won't match transcript but still catches <script>
- Fix option 3: require HTML context markers (
<, >, or javascript:) before flagging as XSS
Details
The Problematic Pattern
# ❌ BAD — matches field names and values equally
XSS_PATTERNS = re.compile(
r'(script|javascript|eval|expression|vbscript|onload|onerror)',
re.IGNORECASE
)
async def security_middleware(request: Request, call_next):
body = await request.body()
body_text = body.decode('utf-8')
if XSS_PATTERNS.search(body_text):
return JSONResponse(status_code=400, content={"error": "Invalid content"})
return await call_next(request)
This rejects {"transcript": "Hello world"} because transcript contains script.
Fix 1: Parse JSON and Validate Values Only
import json
def check_values_recursive(obj, pattern) -> bool:
"""Returns True if any VALUE (not key) matches the pattern."""
if isinstance(obj, dict):
return any(check_values_recursive(v, pattern) for v in obj.values())
elif isinstance(obj, list):
return any(check_values_recursive(item, pattern) for item in obj)
elif isinstance(obj, str):
return bool(pattern.search(obj))
return False
async def security_middleware(request: Request, call_next):
body = await request.body()
try:
parsed = json.loads(body)
if check_values_recursive(parsed, XSS_PATTERNS):
return JSONResponse(status_code=400, content={"error": "Invalid content"})
except json.JSONDecodeError:
pass # Non-JSON body — handle separately
return await call_next(request)
Fix 2: Word Boundaries + HTML Context
# ✅ BETTER — requires HTML context for XSS patterns
XSS_PATTERNS = re.compile(
r'(<\s*script|javascript\s*:|on\w+\s*=|<\s*iframe|vbscript\s*:)',
re.IGNORECASE
)
# "transcript" → no match (no < before script)
# "<script>" → match ✓
# "javascript:alert(1)" → match ✓
Fix 3: Combined Approach (Recommended)
# Only flag values, and require HTML attack vectors
def is_suspicious_value(value: str) -> bool:
html_xss = re.compile(r'(<\s*script|javascript\s*:|on\w+=)', re.IGNORECASE)
sql_injection = re.compile(r"(';|\";\s*(drop|delete|insert|update)\s)", re.IGNORECASE)
return bool(html_xss.search(value) or sql_injection.search(value))
False Positive Field Names to Watch For
| Field Name |
Offending Substring |
audio_description_vtt |
script |
transcript |
script |
subscription |
script |
transcription_id |
script |
evaluate_result |
eval |
expression_type |
expression |
Command Injection Patterns: \b Word Boundary Is Mandatory
XSS patterns are not the only problem. Command injection regex patterns suffer the same false-positive issue when applied to raw JSON text, and the fix is different: word boundaries (\b) are required because HTML-context markers don't apply.
The False Positive Table for Command Tokens
Pattern (no \b) |
False positive example |
sh\s+ |
"Josh Smith" — the sh in "Josh S..." matches |
rm\s+ |
"Norm " — rm in "Norm " matches |
nc\s+ |
Any name containing nc followed by a space |
wget\s+ |
Field value "get widgets" — get w doesn't match, but wget as substring would |
curl\s+ |
"security curl" — substring match |
Fix: Prefix All Command Tokens with \b
# ❌ BAD — matches substrings in names, field values, natural text
COMMAND_INJECTION = re.compile(
r'(sh\s+|rm\s+|nc\s+|wget\s+|curl\s+|bash\s+|python\s+|perl\s+)',
re.IGNORECASE
)
# ✅ GOOD — \b ensures we only match at a word boundary
COMMAND_INJECTION = re.compile(
r'(\bsh\b|\brm\b|\bnc\b|\bwget\b|\bcurl\b|\bbash\b|\bpython\b|\bperl\b)',
re.IGNORECASE
)
# Note: \b before the token, but also after — \bsh\b won't match "Josh" or "shell"
Combined Pattern (XSS + Command Injection, Value-Only Scan)
# Apply after parsing JSON — scan VALUES only
XSS_PATTERNS = re.compile(
r'(<\s*script|javascript\s*:|on\w+\s*=)',
re.IGNORECASE
)
CMD_INJECTION = re.compile(
r'(\bsh\b|\brm\s+-rf|\bnc\b\s+\S+\s+\d+|\bwget\b|\bcurl\b)',
re.IGNORECASE
)
def is_suspicious_value(value: str) -> bool:
return bool(XSS_PATTERNS.search(value) or CMD_INJECTION.search(value))
[!tip] Parse first, then scan
Word boundaries reduce false positives but don't eliminate them for common 2-letter tokens like rm and nc. The most robust approach is always: parse JSON first, scan string values only — as described in Fix 1 above.
Related Concepts
Sources
- daily/2026-04-29.md — Identified when Video Accessibility transcript/VTT upload endpoints returned 400