| title | tags | source | created |
|---|---|---|---|
| LlamaExtract — Truthy Object with data=None | | sandbox-notebookllamalm-nextjs | 2026-05-05 |
LlamaExtract — Truthy Object with data=None
The Gotcha
LlamaExtract.aextract() always returns an object — even when the PDF cannot be parsed (scanned image, password-protected, or structured in a way the model cannot interpret). The returned object evaluates as truthy in Python, but its .data attribute is None.
```python
result = await extractor.aextract(file_path)

# WRONG: passes the truthiness check, then fails once process() touches None
if result:
    process(result.data)  # TypeError when iterating or indexing None

# CORRECT: guard both object existence and data presence
if result and result.data:
    process(result.data)
```
Why It Happens
LlamaExtract returns a response envelope object regardless of extraction success. The envelope tracks metadata (request ID, status, errors) even when no structured data was produced. In Python, an instance of a custom class is truthy by default unless the class defines `__bool__` (or `__len__`), and LlamaExtract does not tie the envelope's truthiness to the content of `data`.
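The default truthiness is easy to confirm without the library. A minimal sketch, assuming a hypothetical `Envelope` class that stands in for the response object (the class name and its fields are illustrative, not LlamaExtract's actual types):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Envelope:
    """Hypothetical stand-in for an extraction response; not the real type."""
    status: str
    data: Optional[Any] = None


failed = Envelope(status="error")      # no structured data was produced

# The class defines neither __bool__ nor __len__, so every instance is truthy.
print(bool(failed))                    # True
print(failed.data is None)             # True

# Guarding on the attribute as well as the object catches the failure.
print(bool(failed and failed.data))    # False
```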
Failure Modes for PDFs
LlamaExtract silently returns data=None (rather than raising) for:
| PDF Type | Reason |
|---|---|
| Scanned image PDF | No text layer for LLM to read |
| Password-protected | Cannot decrypt |
| Heavily structured tables | LLM fails to map to schema |
| Corrupt / truncated | Parse error caught internally |
Resilient Pipeline Pattern
For document processing pipelines, prefer an LLM fallback over hard failure:
```python
async def extract_with_fallback(file_path: str) -> dict:
    result = await extractor.aextract(file_path)
    if result and result.data:
        return result.data[0]  # structured extraction succeeded

    # Fallback: send raw text to a general LLM.
    # extract_text_with_pdfplumber and llm_parse_freeform are helpers
    # defined elsewhere in the pipeline (not shown here).
    raw_text = extract_text_with_pdfplumber(file_path)
    return await llm_parse_freeform(raw_text)
```
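This keeps the pipeline running across document types: structured extraction for machine-readable PDFs, LLM freeform parsing for the rest. A text-based fallback has nothing to work with when the PDF has no text layer or cannot be opened, so scanned or password-protected files may need OCR or a password first.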
> [!tip] Always log extraction failures
> Log when `result.data is None`, together with the file name, so you can audit which PDFs are falling back. Silent fallbacks without logging make debugging impossible later.
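
One way to wire that logging in, sketched with the standard `logging` module. The `has_data` helper and the `SimpleNamespace` result objects are hypothetical names introduced for illustration; only the `.data` attribute comes from the note itself.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("extraction")


def has_data(result, file_path: str) -> bool:
    """Return True if extraction produced data; otherwise log the file name."""
    if result is not None and result.data:
        return True
    # Record which file fell through so fallbacks can be audited later.
    logger.warning("extraction returned no data, falling back: %s", file_path)
    return False


# Hypothetical result objects standing in for the extractor's response.
ok = SimpleNamespace(data=[{"title": "Q3 report"}])
empty = SimpleNamespace(data=None)

print(has_data(ok, "report.pdf"))    # True
print(has_data(empty, "scan.pdf"))   # False, and a warning is logged
```

Calling a helper like this in place of the bare `if result and result.data:` check keeps the guard and the audit trail in one spot.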
Related
- Python truthiness: `if obj:` tests `bool(obj)`, not `obj is not None`. Use `if obj is not None` or check specific attributes.
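
The two checks also disagree for empty containers and zero. A short sketch with plain built-ins (no LlamaExtract types involved) showing where they differ:

```python
candidates = {
    "none": None,
    "empty list": [],
    "empty dict": {},
    "zero": 0,
    "filled": [1],
}

for label, obj in candidates.items():
    truthy = bool(obj)           # what `if obj:` tests
    present = obj is not None    # what `if obj is not None:` tests
    print(f"{label:10} truthy={truthy!s:5} present={present}")

# None fails both checks; [], {} and 0 are falsy but still not None.
disagree = [
    label for label, obj in candidates.items()
    if bool(obj) != (obj is not None)
]
print(disagree)   # ['empty list', 'empty dict', 'zero']
```

For the extraction guard, testing `result.data` for truthiness is reasonable when an empty result should also trigger the fallback; `result.data is not None` fits better when an empty result is a valid answer.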