Integration Guide: Augmenting PDF Accessibility Checker

This guide shows how to integrate external APIs and tools to check WCAG requirements that can't be validated programmatically with basic PDF parsing.

🎯 Integration Strategy Matrix

| WCAG Gap | Solution | API/Tool | Coverage Improvement |
|---|---|---|---|
| Alt text quality | AI Vision | OpenAI GPT-4V, Claude, Google Vision | 90%+ |
| Color contrast | Image analysis | Custom + Color libraries | 95%+ |
| OCR for scanned docs | Text extraction | Tesseract, Google Cloud Vision | 100% |
| Link text quality | NLP analysis | OpenAI, spaCy | 80% |
| Content readability | NLP analysis | TextBlob, GPT-4 | 75% |
| Heading hierarchy | Structure parsing | pdf-lib, pypdf enhanced | 70% |
| Form field validation | PDF parsing | pypdf, pdf-lib | 85% |
| Table structure | ML models | Custom + Camelot | 80% |

1. 🖼️ AI Vision APIs for Image Analysis (WCAG 1.1.1)

Problem We're Solving:

  • The basic tool can only detect that images exist
  • AI can generate alt text suggestions and validate existing descriptions

Solution A: OpenAI GPT-4 Vision

import openai
import base64

def check_image_alt_text_openai(image_bytes: bytes, existing_alt_text: str = None):
    """Use GPT-4V to analyze image and suggest/validate alt text"""
    
    # Encode image
    base64_image = base64.b64encode(image_bytes).decode('utf-8')
    
    client = openai.OpenAI(api_key="your-api-key")
    
    if existing_alt_text:
        # Validate existing alt text
        prompt = f"""Analyze this image and the provided alt text.
        
        Alt text: "{existing_alt_text}"
        
        Rate the alt text quality (1-10) and provide:
        1. What's missing from the description
        2. What's good about it
        3. Suggested improvement
        
        Consider: Is it accurate? Concise? Informative? Appropriate detail level?"""
    else:
        # Generate alt text suggestion
        prompt = """Describe this image for someone who cannot see it. 
        Provide a concise alt text (1-2 sentences) suitable for accessibility.
        Focus on the information the image conveys, not artistic details."""
    
    response = client.chat.completions.create(
        model="gpt-4o",  # gpt-4-vision-preview is deprecated; use any current vision-capable model
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )
    
    return response.choices[0].message.content

# Usage in checker:
def _check_images_with_openai(self):
    """Enhanced image checking with OpenAI"""
    for i, page in enumerate(self.pdf_plumber.pages):
        for img in page.images:
            # Extract image bytes from PDF
            image_bytes = self._extract_image_bytes(img)
            
            # Check if alt text exists in PDF structure first,
            # so each image costs exactly one API call
            alt_text = self._get_image_alt_text(page, img)
            
            if not alt_text:
                # No alt text: ask the AI for a suggestion
                suggestion = check_image_alt_text_openai(image_bytes)
                self.add_issue(
                    Severity.ERROR,
                    "Missing Alt Text",
                    f"Page {i+1}: Image has no alt text. AI suggests: {suggestion[:100]}...",
                    wcag_criterion="1.1.1"
                )
            else:
                # Alt text present: validate its quality
                validation = check_image_alt_text_openai(image_bytes, alt_text)
                # Parse validation response and create issues if needed

Cost: ~$0.01-0.03 per image
Setup: pip install openai
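
The validation branch in the checker leaves response parsing open. A minimal sketch of one way to pull the 1-10 rating out of a free-text reply is shown here. `parse_alt_text_rating`, `needs_review`, and the threshold of 6 are hypothetical names and values, and the patterns assume the model writes the score as "7/10" or "Rating: 7", which it is not guaranteed to do.

```python
import re
from typing import Optional


def parse_alt_text_rating(response_text: str) -> Optional[int]:
    """Return the 1-10 rating found in a model reply, or None if absent."""
    patterns = [
        r"\b(10|[1-9])\s*/\s*10\b",                      # "7/10" or "7 / 10"
        r"\b(?:rating|score)\b\D{0,20}?(10|[1-9])\b",    # "Rating: 7"
    ]
    for pattern in patterns:
        match = re.search(pattern, response_text, flags=re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None


def needs_review(response_text: str, min_rating: int = 6) -> bool:
    """Flag alt text whose rating is missing or below the threshold."""
    rating = parse_alt_text_rating(response_text)
    return rating is None or rating < min_rating
```

Treating a missing rating as "needs review" keeps an unparseable reply from silently passing the check.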


Solution B: Anthropic Claude Vision

import anthropic
import base64

def check_image_with_claude(image_bytes: bytes):
    """Use Claude to analyze image accessibility"""
    
    client = anthropic.Anthropic(api_key="your-api-key")
    
    base64_image = base64.b64encode(image_bytes).decode('utf-8')
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": """Analyze this image for accessibility:
                        
                        1. Provide a concise alt text (1-2 sentences)
                        2. Identify any text in the image (may fail WCAG 1.4.5)
                        3. Note any color-only information (may fail WCAG 1.4.1)
                        4. Assess if this is decorative or informational
                        
                        Format as JSON."""
                    }
                ],
            }
        ],
    )
    
    return message.content[0].text

Cost: ~$0.015 per image
Setup: pip install anthropic
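
The prompt asks for JSON, but model replies are often wrapped in a code fence or surrounded by prose. The following is a small sketch of a tolerant parser; `extract_json` is a hypothetical helper, not part of the Anthropic SDK, and it only handles a single top-level JSON object.

```python
import json
import re
from typing import Any, Optional


def extract_json(reply: str) -> Optional[Any]:
    """Parse the first JSON object in a model reply; return None on failure."""
    # Strip a Markdown code fence if the model wrapped its answer in one
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", reply, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else reply

    # Fall back to the outermost braces in case of leading/trailing prose
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
```

A caller could pass `message.content[0].text` to this helper and treat `None` as "analysis unavailable" rather than raising.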


Solution C: Google Cloud Vision API

from google.cloud import vision

def check_image_google_vision(image_bytes: bytes):
    """Use Google Cloud Vision for comprehensive image analysis"""
    
    client = vision.ImageAnnotatorClient()
    image = vision.Image(content=image_bytes)
    
    # Multiple detection types
    response = client.annotate_image({
        'image': image,
        'features': [
            {'type_': vision.Feature.Type.TEXT_DETECTION},  # OCR
            {'type_': vision.Feature.Type.LABEL_DETECTION},  # Content labels
            {'type_': vision.Feature.Type.IMAGE_PROPERTIES},  # Colors
            {'type_': vision.Feature.Type.OBJECT_LOCALIZATION},  # Objects
        ],
    })
    
    results = {
        'has_text': bool(response.text_annotations),
        'text_content': response.text_annotations[0].description if response.text_annotations else None,
        'labels': [label.description for label in response.label_annotations],
        'dominant_colors': response.image_properties_annotation.dominant_colors.colors[:5],
        'objects': [obj.name for obj in response.localized_object_annotations]
    }
    
    # Generate issues based on findings
    issues = []
    
    if results['has_text']:
        issues.append({
            'severity': 'ERROR',
            'wcag': '1.4.5',
            'description': f"Image contains text: '{results['text_content'][:100]}'",
            'recommendation': 'Text in images should be avoided. Use actual text or provide full text alternative.'
        })
    
    # Generate alt text suggestion from labels, falling back to detected objects
    terms = results['labels'][:3] or results['objects'][:3]
    suggested_alt = f"Image showing {', '.join(terms)}" if terms else None
    
    return results, suggested_alt, issues

Cost: $1.50 per 1,000 images (first 1,000/month free)
Setup:

pip install google-cloud-vision
# Requires Google Cloud project and credentials
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"

2. 🎨 Color Contrast Checking (WCAG 1.4.3, 1.4.11)

Solution A: PIL + Color Math

from PIL import Image
import numpy as np
from pdf2image import convert_from_path

def calculate_contrast_ratio(color1, color2):
    """Calculate WCAG contrast ratio between two colors"""
    
    def get_luminance(rgb):
        """Calculate relative luminance"""
        rgb = [x / 255.0 for x in rgb]
        rgb = [
            x / 12.92 if x <= 0.03928 
            else ((x + 0.055) / 1.055) ** 2.4 
            for x in rgb
        ]
        return 0.2126 * rgb[0] + 0.7152 * rgb[1] + 0.0722 * rgb[2]
    
    l1 = get_luminance(color1)
    l2 = get_luminance(color2)
    
    lighter = max(l1, l2)
    darker = min(l1, l2)
    
    return (lighter + 0.05) / (darker + 0.05)

def check_page_contrast(pdf_path, page_num, sample_size=100):
    """Check color contrast on a PDF page"""
    
    images = convert_from_path(pdf_path, first_page=page_num, last_page=page_num, dpi=150)
    image = images[0]
    
    # Convert to RGB
    rgb_image = image.convert('RGB')
    width, height = rgb_image.size
    
    # Sample points across the page
    low_contrast_areas = []
    
    for _ in range(sample_size):
        x = np.random.randint(0, width - 1)
        y = np.random.randint(0, height - 1)
        
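        # Get pixel and adjacent pixel
        pixel1 = rgb_image.getpixel((x, y))
        pixel2 = rgb_image.getpixel((min(x + 1, width - 1), y))
        
        # Identical neighbours are a flat region, not a text/background edge;
        # without this check a blank page would be flagged on every sample
        if pixel1 == pixel2:
            continue
        
        ratio = calculate_contrast_ratio(pixel1, pixel2)
        
        # WCAG AA requires 4.5:1 for normal text, 3:1 for large text.
        # Anti-aliased edges can still produce false positives, so treat hits as hints.
        if ratio < 4.5: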
            low_contrast_areas.append({
                'position': (x, y),
                'colors': (pixel1, pixel2),
                'ratio': ratio
            })
    
    return low_contrast_areas

# Integration
def _check_color_contrast_enhanced(self):
    """Enhanced contrast checking"""
    for i in range(len(self.pdf_reader.pages)):
        low_contrast = check_page_contrast(str(self.pdf_path), i + 1)
        
        if len(low_contrast) > 10:  # More than 10% of samples
            self.add_issue(
                Severity.ERROR,
                "Color Contrast",
                f"Page {i+1}: {len(low_contrast)} potential contrast issues detected",
                wcag_criterion="1.4.3",
                recommendation="Use Colour Contrast Analyser to verify specific areas"
            )

Cost: Free
Setup: pip install pillow pdf2image numpy
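
The sampler applies the 4.5:1 threshold everywhere, although WCAG 1.4.3 allows 3:1 for large text (at least 18 pt, or at least 14 pt bold). A minimal sketch of the size-dependent threshold follows. The function names are hypothetical, and it assumes the font size and weight are already known, for example from the per-character `size` and `fontname` values that pdfplumber reports.

```python
def required_contrast_ratio(font_size_pt: float, bold: bool = False) -> float:
    """Minimum WCAG 2.1 AA contrast ratio (SC 1.4.3) for text of this size."""
    is_large = font_size_pt >= 18 or (bold and font_size_pt >= 14)
    return 3.0 if is_large else 4.5


def passes_aa(ratio: float, font_size_pt: float, bold: bool = False) -> bool:
    """True if a measured contrast ratio meets the AA minimum for this text."""
    return ratio >= required_contrast_ratio(font_size_pt, bold)
```

For example, a ratio of 3.8 fails for 11 pt body text but passes for a 24 pt heading.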


Solution B: Colorblind Simulation

def simulate_colorblindness(image, cb_type='protanopia'):
    """Simulate how image appears to colorblind users"""
    
    # Transformation matrices for different types
    matrices = {
        'protanopia': [  # Red-blind
            [0.567, 0.433, 0],
            [0.558, 0.442, 0],
            [0, 0.242, 0.758]
        ],
        'deuteranopia': [  # Green-blind
            [0.625, 0.375, 0],
            [0.7, 0.3, 0],
            [0, 0.3, 0.7]
        ],
        'tritanopia': [  # Blue-blind
            [0.95, 0.05, 0],
            [0, 0.433, 0.567],
            [0, 0.475, 0.525]
        ]
    }
    
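    # Apply the 3x3 matrix to every RGB pixel
    import numpy as np
    from PIL import Image
    
    pixels = np.asarray(image.convert('RGB'), dtype=float)
    transformed = pixels @ np.array(matrices[cb_type]).T
    transformed_image = Image.fromarray(np.clip(transformed, 0, 255).astype(np.uint8))
    
    return transformed_image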

def check_accessibility_for_colorblind(pdf_path, page_num):
    """Check if content is accessible to colorblind users"""
    
    images = convert_from_path(pdf_path, first_page=page_num, last_page=page_num)
    original = images[0]
    
    issues = []
    
    for cb_type in ['protanopia', 'deuteranopia', 'tritanopia']:
        simulated = simulate_colorblindness(original, cb_type)
        
        # Compare information loss
        # If significant difference, color might be only differentiator
        # ... comparison logic ...
        
    return issues
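
One possible shape for the comparison step is to measure how much of the difference between two colors survives the simulation. The sketch below is an assumption, not the checker's actual logic: it uses plain RGB Euclidean distance, which is not perceptually uniform, and `distinction_loss` is a hypothetical helper. Only the protanopia matrix from the function above is repeated here.

```python
import math

# Same approximate protanopia matrix as in simulate_colorblindness
PROTANOPIA = [
    [0.567, 0.433, 0.0],
    [0.558, 0.442, 0.0],
    [0.0, 0.242, 0.758],
]


def apply_matrix(rgb, matrix):
    """Multiply an (r, g, b) tuple by a 3x3 matrix."""
    return tuple(sum(m * c for m, c in zip(row, rgb)) for row in matrix)


def distinction_loss(color_a, color_b, matrix) -> float:
    """Fraction (0..1) of the RGB distance between two colors lost after simulation."""
    before = math.dist(color_a, color_b)
    if before == 0:
        return 0.0
    after = math.dist(apply_matrix(color_a, matrix), apply_matrix(color_b, matrix))
    return max(0.0, 1.0 - after / before)
```

A pair such as pure red and pure green loses most of its separation under this matrix, while black and white lose essentially none, so a high loss on two colors that encode different meanings is a hint that color may be the only differentiator (WCAG 1.4.1).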

3. 📝 OCR for Scanned Documents (WCAG 1.1.1)

Solution A: Tesseract OCR (Free)

import pytesseract
from pdf2image import convert_from_path

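def add_ocr_layer(pdf_path, output_path, dpi=300):
    """Add an invisible OCR text layer to a scanned PDF"""
    
    from pypdf import PdfWriter, PdfReader
    from reportlab.pdfgen import canvas
    from io import BytesIO
    
    images = convert_from_path(pdf_path, dpi=dpi)
    original = PdfReader(pdf_path)
    writer = PdfWriter()
    scale = 72.0 / dpi  # OCR pixel coordinates -> PDF points
    
    for i, image in enumerate(images):
        # Run OCR with detailed data
        ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
        
        # Create an overlay page the same size as the rendered page
        page_width, page_height = image.width * scale, image.height * scale
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=(page_width, page_height))
        
        # Add invisible text at the OCR positions
        for j, text in enumerate(ocr_data['text']):
            if text.strip():
                x = ocr_data['left'][j] * scale
                # Tesseract measures y from the top edge; PDF measures from the bottom
                y = page_height - (ocr_data['top'][j] + ocr_data['height'][j]) * scale
                text_obj = c.beginText(x, y)
                text_obj.setTextRenderMode(3)  # 3 = invisible text
                text_obj.textOut(text)
                c.drawText(text_obj)
        
        c.showPage()  # always emit a page, even when OCR found no text
        c.save()
        
        # Merge the text overlay onto the original page
        packet.seek(0)
        page = original.pages[i]
        page.merge_page(PdfReader(packet).pages[0])
        writer.add_page(page)
    
    with open(output_path, 'wb') as f:
        writer.write(f)
    
    return output_path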

Cost: Free
Setup:

pip install pytesseract pdf2image
# Install Tesseract: https://github.com/tesseract-ocr/tesseract
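
Tesseract also reports a per-word confidence in the `conf` list returned by `image_to_data`, which can be used to flag pages whose OCR output should not be trusted. A minimal sketch, assuming the dictionary layout produced by `pytesseract.Output.DICT`; `mean_ocr_confidence` is a hypothetical helper, and the `float()` conversion is there because the type of the `conf` values has varied between versions.

```python
def mean_ocr_confidence(ocr_data: dict) -> float:
    """Average confidence (0-100) over recognized words; 0.0 if there are none."""
    confidences = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)               # may arrive as int, float, or string
        if text.strip() and conf >= 0:   # -1 marks boxes that are not words
            confidences.append(conf)
    return sum(confidences) / len(confidences) if confidences else 0.0
```

A page with a low average could be reported as "OCR unreliable, manual review needed" instead of being treated as fully accessible text.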

Solution B: Google Cloud Document AI

from google.cloud import documentai_v1 as documentai

def ocr_with_google_document_ai(pdf_bytes):
    """Use Google Document AI for superior OCR"""
    
    client = documentai.DocumentProcessorServiceClient()
    
    # Configure processor
    name = "projects/PROJECT_ID/locations/us/processors/PROCESSOR_ID"
    
    raw_document = documentai.RawDocument(
        content=pdf_bytes,
        mime_type="application/pdf"
    )
    
    request = documentai.ProcessRequest(
        name=name,
        raw_document=raw_document
    )
    
    result = client.process_document(request=request)
    document = result.document
    
    # Extract text with confidence scores
    return {
        'text': document.text,
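        # text_styles entries carry no confidence value; average the per-page layout confidence instead
        'confidence': (sum(p.layout.confidence for p in document.pages) / len(document.pages)) if document.pages else 0,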
        'pages': len(document.pages),
        'entities': document.entities  # Structured data extraction
    }

Cost: $1.50 per 1,000 pages (first 1,000/month free)
Better than Tesseract: Higher accuracy, handles complex layouts


4. Link Text Quality (WCAG 2.4.4)

Solution: OpenAI for Context Analysis

def check_link_quality_with_ai(link_text, surrounding_context):
    """Use AI to assess if link text is descriptive"""
    
    import openai
    
    client = openai.OpenAI()
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": """You are a WCAG accessibility expert. Evaluate link text quality.
                
                GOOD link text:
                - Describes destination clearly
                - Makes sense out of context
                - Unique (not repeated for different destinations)
                
                BAD link text:
                - "click here", "here", "read more", "link"
                - Repeated generic text
                - No indication of destination"""
            },
            {
                "role": "user",
                "content": f"""Evaluate this link:
                
                Link text: "{link_text}"
                Context: "{surrounding_context}"
                
                Respond with JSON:
                {{
                    "quality_score": 1-10,
                    "issues": ["list", "of", "problems"],
                    "suggestion": "better link text",
                    "wcag_pass": true/false
                }}"""
            }
        ]
    )
    
    return response.choices[0].message.content

Cost: ~$0.001 per link
Alternative: Use regex + NLP library (spaCy) for simpler checks
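
The regex alternative can catch the most common failures before any API call is made. The sketch below covers only the generic phrases listed in the system prompt plus bare URLs; `check_link_text` and its issue labels are hypothetical, and the phrase list is meant to be extended.

```python
import re

# Generic phrases taken from the "BAD link text" list in the prompt
GENERIC_LINK_TEXT = {"click here", "here", "read more", "link"}
URL_PATTERN = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def check_link_text(link_text: str) -> list:
    """Return a list of issue labels for a link's visible text (empty if none)."""
    # Collapse whitespace, lowercase, and drop trailing punctuation
    normalized = re.sub(r"\s+", " ", link_text).strip().lower().rstrip(".!: ")
    if not normalized:
        return ["empty"]
    if normalized in GENERIC_LINK_TEXT:
        return ["generic"]
    if URL_PATTERN.match(normalized):
        return ["raw_url"]
    return []
```

Links that pass this filter can still be sent to the AI check, so the paid call is reserved for text that needs judgment.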


5. 📖 Content Readability Analysis (WCAG 3.1.5)

Solution A: TextBlob (Simple, Free)

from textblob import TextBlob
import re

def analyze_readability(text):
    """Analyze text readability for WCAG 3.1.5 (AAA)"""
    
    # Clean text
    text = re.sub(r'\s+', ' ', text)
    
    # Split into sentences
    blob = TextBlob(text)
    sentences = blob.sentences
    
    # Calculate metrics
    total_words = len(blob.words)
    total_sentences = len(sentences)
    total_syllables = sum(count_syllables(word) for word in blob.words)
    
    # Flesch Reading Ease
    if total_sentences > 0 and total_words > 0:
        flesch = 206.835 - 1.015 * (total_words / total_sentences) - 84.6 * (total_syllables / total_words)
    else:
        flesch = 0
    
    # Flesch-Kincaid Grade Level
    if total_sentences > 0 and total_words > 0:
        fk_grade = 0.39 * (total_words / total_sentences) + 11.8 * (total_syllables / total_words) - 15.59
    else:
        fk_grade = 0
    
    return {
        'flesch_score': flesch,  # 60-70 = acceptable, 90-100 = very easy
        'grade_level': fk_grade,  # School grade level
        'avg_sentence_length': total_words / total_sentences if total_sentences else 0,
        'avg_word_length': sum(len(word) for word in blob.words) / total_words if total_words else 0,
        'recommendation': 'Target grade 8 or lower for general audience'
    }

def count_syllables(word):
    """Simple syllable counter"""
    word = word.lower()
    count = 0
    vowels = 'aeiouy'
    previous_was_vowel = False
    
    for char in word:
        is_vowel = char in vowels
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel
    
    if word.endswith('e'):
        count -= 1
    if count == 0:
        count = 1
    
    return count

Cost: Free
Setup: pip install textblob, then python -m textblob.download_corpora (sentence and word tokenization needs the NLTK corpora)
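
To make the two formulas concrete, here is a worked example with made-up counts (100 words, 5 sentences, 150 syllables). The numbers are illustrative only; `flesch_scores` is a hypothetical helper that repeats the arithmetic from `analyze_readability` without the TextBlob dependency.

```python
def flesch_scores(words: int, sentences: int, syllables: int):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    words_per_sentence = words / sentences      # 100 / 5   = 20
    syllables_per_word = syllables / words      # 150 / 100 = 1.5
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade


ease, grade = flesch_scores(words=100, sentences=5, syllables=150)
# ease  = 206.835 - 20.3 - 126.9 = 59.635
# grade = 7.8 + 17.7 - 15.59     = 9.91
print(round(ease, 1), round(grade, 1))  # 59.6 9.9
```

With these counts the text sits just under the 60-70 band and above the grade 8 target, so it would be flagged for simplification.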


Solution B: GPT-4 for Advanced Analysis

def analyze_content_quality_with_gpt(text_excerpt):
    """Use GPT-4 for comprehensive content analysis"""
    
    import openai
    
    client = openai.OpenAI()
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": f"""Analyze this content for accessibility:
                
                {text_excerpt[:2000]}
                
                Provide:
                1. Reading level (grade)
                2. Jargon/complex terms that need explanation
                3. Sentences over 25 words (too complex)
                4. Passive voice usage
                5. Suggestions for simplification
                
                Format as JSON."""
            }
        ]
    )
    
    return response.choices[0].message.content
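
Item 3 in the prompt (sentences over 25 words) does not need a model and can be checked locally first. A minimal sketch, assuming sentences end in `.`, `!`, or `?` followed by whitespace; `find_long_sentences` is a hypothetical helper, and abbreviations such as "e.g." will split sentences early.

```python
import re


def find_long_sentences(text: str, max_words: int = 25) -> list:
    """Return the sentences in text that contain more than max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]
```

The default of 25 matches the limit named in the prompt; only the excerpts that fail this check need to be sent to GPT-4 for rewriting suggestions.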

6. 🏗️ Structure and Heading Analysis

Solution: Enhanced PDF Tag Parsing

def analyze_heading_structure(pdf_path):
    """Parse PDF structure tree and check heading hierarchy"""
    
    from pypdf import PdfReader
    
    reader = PdfReader(pdf_path)
    
    # Index with [] so pypdf resolves the indirect reference to the catalog dictionary
    catalog = reader.trailer["/Root"]
    
    if "/StructTreeRoot" not in catalog:
        return {"error": "No structure tree"}
    
    struct_tree = catalog["/StructTreeRoot"]
    
    headings = []
    
    def traverse_structure(element, level=0):
        """Recursively traverse structure tree"""
        if hasattr(element, 'get_object'):
            element = element.get_object()
        
        # Kids can also be integers (marked-content IDs); skip anything that is not a dictionary
        if not isinstance(element, dict):
            return
        
        # /Type is optional on structure elements, so key off /S instead.
        # The root (/Type /StructTreeRoot) has no /S, but its children must still be visited.
        struct_type = str(element.get("/S", ""))
        
        # Check if it's a heading
        if struct_type in ["/H1", "/H2", "/H3", "/H4", "/H5", "/H6"]:
            headings.append({
                'level': int(struct_type.replace("/H", "")),
                'type': struct_type
            })
        
        # Traverse children
        if "/K" in element:
            children = element["/K"]
            if not isinstance(children, list):
                children = [children]
            
            for child in children:
                traverse_structure(child, level + 1)
    
    traverse_structure(struct_tree)
    
    # Check for heading hierarchy issues
    issues = []
    
    for i in range(1, len(headings)):
        prev_level = headings[i-1]['level']
        curr_level = headings[i]['level']
        
        # Check for skipped levels (H1 -> H3)
        if curr_level > prev_level + 1:
            issues.append({
                'type': 'skipped_level',
                'message': f'Heading jumps from H{prev_level} to H{curr_level}',
                'wcag': '1.3.1'
            })
    
    # Check for H1
    if not any(h['level'] == 1 for h in headings):
        issues.append({
            'type': 'no_h1',
            'message': 'Document has no H1 heading',
            'wcag': '1.3.1'
        })
    
    return {
        'headings': headings,
        'issues': issues
    }

7. 📋 Form Field Accessibility

Solution: Complete Form Analysis

def analyze_form_fields(pdf_path):
    """Comprehensive form field accessibility check"""
    
    from pypdf import PdfReader
    
    reader = PdfReader(pdf_path)
    
    if "/AcroForm" not in reader.trailer.get("/Root", {}):
        return {"has_forms": False}
    
    acro_form = reader.trailer["/Root"]["/AcroForm"]
    fields = acro_form.get("/Fields", [])
    
    issues = []
    field_details = []
    
    for field in fields:
        field = field.get_object()
        
        field_info = {
            'name': field.get("/T", "Unnamed"),
            'type': field.get("/FT", "Unknown"),
            'has_tooltip': "/TU" in field,  # Tooltip = description
            'required': field.get("/Ff", 0) & 2 != 0,  # Required flag
            'read_only': field.get("/Ff", 0) & 1 != 0,
        }
        
        # Check for issues (one issue per field; required fields are more severe)
        if not field_info['has_tooltip']:
            if field_info['required']:
                issue, severity = 'Required field missing description', 'CRITICAL'
            else:
                issue, severity = 'No tooltip/description', 'ERROR'
            issues.append({
                'field': field_info['name'],
                'issue': issue,
                'wcag': '3.3.2',
                'severity': severity
            })
        
        field_details.append(field_info)
    
    return {
        'has_forms': True,
        'field_count': len(fields),
        'fields': field_details,
        'issues': issues
    }
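
The `/Ff` entry is a bit field, which is why the code above masks it with 1 and 2. The sketch below decodes the three flags that apply to every field type in the PDF specification (bit 1 ReadOnly, bit 2 Required, bit 3 NoExport); `decode_field_flags` is a hypothetical helper.

```python
# Bit positions common to all field types (PDF field flags, /Ff)
FIELD_FLAGS = {
    "read_only": 1 << 0,   # bit 1
    "required": 1 << 1,    # bit 2
    "no_export": 1 << 2,   # bit 3
}


def decode_field_flags(ff: int) -> dict:
    """Translate a /Ff integer into named booleans."""
    return {name: bool(int(ff) & mask) for name, mask in FIELD_FLAGS.items()}
```

For instance, `decode_field_flags(field.get("/Ff", 0))` yields `{'read_only': False, 'required': True, 'no_export': False}` for a field whose `/Ff` is 2.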

8. 📊 Complete Integration Example

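# config.py
import os

class AccessibilityConfig:
    # API Keys (read from the environment or a .env file; never commit real keys)
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
    GOOGLE_CLOUD_CREDENTIALS = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")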
    
    # Feature flags
    ENABLE_AI_IMAGE_ANALYSIS = True
    ENABLE_OCR = True
    ENABLE_CONTRAST_CHECK = True
    ENABLE_CONTENT_ANALYSIS = True
    
    # Thresholds
    MIN_CONTRAST_RATIO = 4.5
    MAX_SENTENCE_LENGTH = 25
    TARGET_READING_LEVEL = 8

# Usage
from enhanced_pdf_checker import EnhancedPDFAccessibilityChecker, EnhancedCheckConfig

config = EnhancedCheckConfig(
    vision_api_provider="openai",
    vision_api_key=AccessibilityConfig.OPENAI_API_KEY,
    enable_ocr=True,
    enable_contrast_check=True,
    enable_content_analysis=True,
    verbose=True
)

checker = EnhancedPDFAccessibilityChecker("document.pdf", config)
issues = checker.check_all()
report = checker.generate_report("html")

💰 Cost Comparison

| Service | Cost | Use Case | Coverage |
|---|---|---|---|
| Tesseract OCR | Free | Scanned docs | 100% |
| TextBlob | Free | Readability | 80% |
| OpenAI GPT-4V | $0.01-0.03/image | Alt text validation | 95% |
| Google Vision | $1.50/1000 images | OCR + analysis | 95% |
| Google Document AI | $1.50/1000 pages | Complex OCR | 98% |
| Claude Vision | $0.015/image | Alt text + analysis | 95% |

Free Tier (~60% WCAG Coverage)

pip install pytesseract textblob pillow pdf2image numpy
# + Basic tool (20%) + OCR (15%) + Readability (15%) + Contrast check (10%)

Budget Tier (~80% WCAG Coverage) - $10/month

  • Basic tool (20%)
  • Tesseract OCR (15%)
  • TextBlob (15%)
  • OpenAI API for critical images only (20%)
  • Custom contrast checking (10%)

Professional Tier (~95% WCAG Coverage) - $100/month

  • All free tools
  • OpenAI GPT-4V for all images (30%)
  • Google Document AI for OCR (20%)
  • GPT-4 for content analysis (15%)
  • Automated link checking (10%)
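
To see how the tier prices relate to the per-unit costs in the table, a rough estimator can be sketched as below. It uses only the GPT-4V and Document AI figures quoted in this guide, assumes the 1,000-page free allowance applies each month, and ignores text-only GPT-4 calls; `estimate_monthly_cost` is a hypothetical helper, and real pricing should be checked with each provider.

```python
def estimate_monthly_cost(images: int, pages: int) -> dict:
    """Rough monthly API cost in USD from the per-unit prices in this guide."""
    vision_low = images * 0.01                       # GPT-4V, low end per image
    vision_high = images * 0.03                      # GPT-4V, high end per image
    doc_ai = max(0, pages - 1000) * 1.50 / 1000      # first 1,000 pages free
    return {
        "vision_low": round(vision_low, 2),
        "vision_high": round(vision_high, 2),
        "document_ai": round(doc_ai, 2),
        "total_low": round(vision_low + doc_ai, 2),
        "total_high": round(vision_high + doc_ai, 2),
    }
```

With 3,000 images and 5,000 pages per month this gives $30-$90 for vision plus $6 for OCR, which brackets the ~$100/month Professional Tier figure.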

🚀 Implementation Roadmap

  1. Week 1: Integrate OCR (Tesseract) - Free, high impact
  2. Week 2: Add color contrast checking - Free, fills major gap
  3. Week 3: Integrate TextBlob for readability - Free, easy win
  4. Week 4: Add OpenAI vision for critical documents - Paid, but transformative
  5. Week 5: Polish and optimize API usage - Reduce costs
  6. Week 6: Add batch processing and caching - Scale efficiently

Total implementation time: ~6 weeks for production-ready enhanced checker
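
For the caching step in week 6, one simple approach is to key results on a hash of the image bytes, so logos and icons repeated across pages are analyzed once. This is a minimal in-memory sketch under that assumption; `AnalysisCache` is a hypothetical class, and a production version would likely persist results to disk or a database.

```python
import hashlib
from typing import Callable, Dict


class AnalysisCache:
    """In-memory cache keyed by the SHA-256 digest of the image bytes."""

    def __init__(self, analyze: Callable[[bytes], str]):
        self._analyze = analyze
        self._results: Dict[str, str] = {}
        self.api_calls = 0  # number of times the wrapped function actually ran

    def get(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._results:
            self._results[key] = self._analyze(image_bytes)
            self.api_calls += 1
        return self._results[key]
```

Wrapping an analysis function, for example `AnalysisCache(check_image_alt_text_openai)`, and calling `cache.get(image_bytes)` in the page loop means each distinct image is billed only once per run.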