obsidian/wiki/concepts/llamaextract-data-none-gotcha.md
2026-05-05 11:22:10 +01:00

title: LlamaExtract — Truthy Object with data=None
tags: concepts, llamaextract, llama-index, python, pdf, extraction, auto-generated
source: sandbox-notebookllamalm-nextjs
created: 2026-05-05

LlamaExtract — Truthy Object with data=None

The Gotcha

LlamaExtract.aextract() returns an object even when the PDF cannot be parsed (a scanned image, a password-protected file, or a layout the model cannot interpret). That object evaluates as truthy in Python, but its .data attribute is None.

result = await extractor.aextract(file_path)

# WRONG: the envelope is always truthy, so None reaches process()
if result:
    process(result.data)  # TypeError or AttributeError once process() uses None

# CORRECT — guard both object existence and data presence
if result and result.data:
    process(result.data)

Why It Happens

LlamaExtract returns a response envelope object regardless of extraction success. The envelope carries metadata (request ID, status, errors) even when no structured data was produced. In Python, an instance of a custom class is truthy unless the class defines __bool__ (or __len__), and LlamaExtract does not tie the envelope's truthiness to its data content.
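The truthiness behaviour can be reproduced without LlamaExtract. The following is a minimal sketch, assuming a hypothetical Envelope class that stands in for the real response object; the class and field names are illustrative and are not the library's API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Envelope:
    """Hypothetical stand-in for an extraction response (not the real class)."""
    status: str
    data: Optional[Any] = None
    # Neither __bool__ nor __len__ is defined, so every instance is truthy.


@dataclass
class StrictEnvelope(Envelope):
    """Same fields, but truthiness follows whether data is present."""

    def __bool__(self) -> bool:
        return self.data is not None


failed = Envelope(status="error", data=None)
strict_failed = StrictEnvelope(status="error", data=None)

print(bool(failed))         # True: default truthiness for plain objects
print(bool(strict_failed))  # False: __bool__ is overridden
```

Because the real envelope behaves like the first class, the caller has to check the data attribute explicitly.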

Failure Modes for PDFs

LlamaExtract silently returns data=None (rather than raising) for:

| PDF Type | Reason |
| --- | --- |
| Scanned image PDF | No text layer for LLM to read |
| Password-protected | Cannot decrypt |
| Heavily structured tables | LLM fails to map to schema |
| Corrupt / truncated | Parse error caught internally |

Resilient Pipeline Pattern

For document processing pipelines, prefer an LLM fallback over hard failure:

async def extract_with_fallback(file_path: str) -> dict:
    result = await extractor.aextract(file_path)
    
    if result and result.data:
        return result.data[0]  # structured extraction succeeded
    
    # Fallback: send raw text to a general LLM
    raw_text = extract_text_with_pdfplumber(file_path)
    return await llm_parse_freeform(raw_text)

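This keeps the pipeline running across document types: structured extraction for machine-readable PDFs, LLM freeform parsing for the rest. The fallback still depends on the text helper recovering something to parse; if it only reads the PDF's text layer, a scanned image PDF will come back empty there too.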

> [!tip] Always log extraction failures
> Log when result.data is None with the file name so you can audit which PDFs are falling back. Silent fallbacks without logging make debugging impossible later.
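One way to wire that logging into the fallback path is sketched below. StubExtractor and fallback_parse are hypothetical placeholders that only mimic the envelope behaviour described in this note; they are assumptions for illustration, not LlamaExtract calls.

```python
import asyncio
import logging
from types import SimpleNamespace
from typing import Any

logger = logging.getLogger("extraction")


class StubExtractor:
    """Hypothetical extractor: returns an envelope whose data may be None."""

    async def aextract(self, file_path: str) -> Any:
        # Pretend that files ending in "scanned.pdf" yield no data.
        data = None if file_path.endswith("scanned.pdf") else [{"title": "ok"}]
        return SimpleNamespace(data=data)


async def fallback_parse(file_path: str) -> dict:
    """Placeholder for the freeform LLM fallback."""
    return {"source": "fallback", "file": file_path}


async def extract_logged(extractor: Any, file_path: str) -> dict:
    result = await extractor.aextract(file_path)
    if result is not None and result.data:
        return result.data[0]
    # Record the file name so fallbacks can be audited later.
    logger.warning("No structured data for %s; using fallback", file_path)
    return await fallback_parse(file_path)


print(asyncio.run(extract_logged(StubExtractor(), "invoice_scanned.pdf")))
```

Logging at warning level keeps the fallback visible under default logging settings without stopping the pipeline.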

  • Python truthiness: if obj: tests bool(obj), which is not the same as obj is not None. Write if obj is not None: explicitly, or check the specific attributes you depend on.
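The difference between a truthiness test and an explicit None check matters when data can also be present but empty. A small sketch, assuming a hypothetical extraction_state helper and using SimpleNamespace objects in place of real responses:

```python
from types import SimpleNamespace
from typing import Any


def extraction_state(result: Any) -> str:
    """Classify a response as 'missing', 'empty', or 'ok'."""
    if result is None or getattr(result, "data", None) is None:
        return "missing"  # no envelope, or data is None
    if not result.data:
        return "empty"    # data exists but holds nothing ([], {}, "")
    return "ok"


print(extraction_state(SimpleNamespace(data=None)))        # missing
print(extraction_state(SimpleNamespace(data=[])))          # empty
print(extraction_state(SimpleNamespace(data=[{"a": 1}])))  # ok
```

Separating the two cases lets a pipeline log a failed parse differently from a parse that succeeded but matched nothing.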