# Integration Guide: Augmenting PDF Accessibility Checker

This guide shows how to integrate external APIs and tools to check WCAG requirements that can't be validated programmatically with basic PDF parsing.

## 🎯 Integration Strategy Matrix

| WCAG Gap | Solution | API/Tool | Coverage Improvement |
|----------|----------|----------|---------------------|
| Alt text quality | AI Vision | OpenAI GPT-4V, Claude, Google Vision | ✅ 90%+ |
| Color contrast | Image analysis | Custom + Color libraries | ✅ 95%+ |
| OCR for scanned docs | Text extraction | Tesseract, Google Cloud Vision | ✅ 100% |
| Link text quality | NLP analysis | OpenAI, spaCy | ✅ 80% |
| Content readability | NLP analysis | TextBlob, GPT-4 | ✅ 75% |
| Heading hierarchy | Structure parsing | pdf-lib, pypdf enhanced | ✅ 70% |
| Form field validation | PDF parsing | pypdf, pdf-lib | ✅ 85% |
| Table structure | ML models | Custom + Camelot | ✅ 80% |

---

## 1. 🖼️ AI Vision APIs for Image Analysis (WCAG 1.1.1)

### Problem We're Solving:
- ❌ The basic tool can only detect that images exist
- ✅ AI can suggest alt text and validate existing descriptions

### Solution A: OpenAI GPT-4 Vision

```python
import openai
import base64

def check_image_alt_text_openai(image_bytes: bytes, existing_alt_text: str = None):
    """Use GPT-4V to analyze image and suggest/validate alt text"""

    # Encode image
    base64_image = base64.b64encode(image_bytes).decode('utf-8')

    client = openai.OpenAI(api_key="your-api-key")

    if existing_alt_text:
        # Validate existing alt text
        prompt = f"""Analyze this image and the provided alt text.

        Alt text: "{existing_alt_text}"

        Rate the alt text quality (1-10) and provide:
        1. What's missing from the description
        2. What's good about it
        3. Suggested improvement

        Consider: Is it accurate? Concise? Informative? Appropriate detail level?"""
    else:
        # Generate alt text suggestion
        prompt = """Describe this image for someone who cannot see it.
        Provide a concise alt text (1-2 sentences) suitable for accessibility.
        Focus on the information the image conveys, not artistic details."""

    response = client.chat.completions.create(
        model="gpt-4o",  # gpt-4-vision-preview has been retired; use any vision-capable model
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )

    return response.choices[0].message.content

# Usage in checker:
def _check_images_with_openai(self):
    """Enhanced image checking with OpenAI"""
    for i, page in enumerate(self.pdf_plumber.pages):
        for img in page.images:
            # Extract image bytes from PDF
            image_bytes = self._extract_image_bytes(img)

            # Check if alt text exists in PDF structure
            alt_text = self._get_image_alt_text(page, img)

            if not alt_text:
                # Only request a generated suggestion when alt text is missing,
                # so images that already have alt text cost one API call, not two
                suggestion = check_image_alt_text_openai(image_bytes)
                self.add_issue(
                    Severity.ERROR,
                    "Missing Alt Text",
                    f"Page {i+1}: Image has no alt text. AI suggests: {suggestion[:100]}...",
                    wcag_criterion="1.1.1"
                )
            else:
                # Validate quality
                validation = check_image_alt_text_openai(image_bytes, alt_text)
                # Parse validation response and create issues if needed
```

**Cost**: ~$0.01-0.03 per image
**Setup**: `pip install openai`
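
The validation prompt above asks for a 1-10 rating in free text, so the reply has to be parsed before it can become an issue. The following is a minimal sketch, assuming the model writes the score as something like `7/10` or `Rating: 7`; `extract_quality_score` and its patterns are hypothetical helpers, not part of the OpenAI SDK.

```python
import re
from typing import Optional

def extract_quality_score(response_text: str) -> Optional[int]:
    """Pull a 1-10 score out of a free-text model reply, or None if absent."""
    patterns = [
        r'\b(10|[1-9])\s*/\s*10\b',                          # "7/10" or "7 / 10"
        r'(?:rating|score)\D{0,20}?\b(10|[1-9])\b',          # "Rating: 7", "Score: 10"
    ]
    for pattern in patterns:
        match = re.search(pattern, response_text, flags=re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None

score = extract_quality_score("I would rate this alt text 6/10: accurate but vague.")
needs_review = score is None or score < 7  # the threshold of 7 is an arbitrary example
```

Free-text parsing like this is fragile. Asking the model to answer in JSON, as the link-quality prompt later in this guide does, is likely to be more robust.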

---

### Solution B: Anthropic Claude Vision

```python
import anthropic
import base64

def check_image_with_claude(image_bytes: bytes):
    """Use Claude to analyze image accessibility"""

    client = anthropic.Anthropic(api_key="your-api-key")

    base64_image = base64.b64encode(image_bytes).decode('utf-8')

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": """Analyze this image for accessibility:

                        1. Provide a concise alt text (1-2 sentences)
                        2. Identify any text in the image (would fail WCAG 1.4.5)
                        3. Note any color-only information (would fail WCAG 1.4.1)
                        4. Assess if this is decorative or informational

                        Format as JSON."""
                    }
                ],
            }
        ],
    )

    return message.content[0].text
```

**Cost**: ~$0.015 per image
**Setup**: `pip install anthropic`
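
Both vision snippets above hard-code `image/jpeg`, but images pulled out of a PDF may be in other formats. One possible way to choose the media type is to inspect the leading magic bytes. This is a sketch that assumes the bytes are a complete image file rather than a raw PDF pixel stream; `detect_image_media_type` is a hypothetical helper name.

```python
def detect_image_media_type(image_bytes: bytes) -> str:
    """Guess the MIME type from the file's leading magic bytes."""
    if image_bytes.startswith(b'\xff\xd8\xff'):
        return 'image/jpeg'
    if image_bytes.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'image/png'
    if image_bytes[:6] in (b'GIF87a', b'GIF89a'):
        return 'image/gif'
    if image_bytes[:4] == b'RIFF' and image_bytes[8:12] == b'WEBP':
        return 'image/webp'
    # Raw or unusual streams should be re-encoded (for example with Pillow) first
    raise ValueError('Unrecognized image format; convert to PNG or JPEG first')
```

The result can replace the literal `"image/jpeg"` in the Claude `media_type` field and in the OpenAI `data:` URL.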

---

### Solution C: Google Cloud Vision API

```python
from google.cloud import vision

def check_image_google_vision(image_bytes: bytes):
    """Use Google Cloud Vision for comprehensive image analysis"""

    client = vision.ImageAnnotatorClient()
    image = vision.Image(content=image_bytes)

    # Multiple detection types
    response = client.annotate_image({
        'image': image,
        'features': [
            {'type_': vision.Feature.Type.TEXT_DETECTION},  # OCR
            {'type_': vision.Feature.Type.LABEL_DETECTION},  # Content labels
            {'type_': vision.Feature.Type.IMAGE_PROPERTIES},  # Colors
            {'type_': vision.Feature.Type.OBJECT_LOCALIZATION},  # Objects
        ],
    })

    results = {
        'has_text': bool(response.text_annotations),
        'text_content': response.text_annotations[0].description if response.text_annotations else None,
        'labels': [label.description for label in response.label_annotations],
        'dominant_colors': response.image_properties_annotation.dominant_colors.colors[:5],
        'objects': [obj.name for obj in response.localized_object_annotations]
    }

    # Generate issues based on findings
    issues = []

    if results['has_text']:
        issues.append({
            'severity': 'ERROR',
            'wcag': '1.4.5',
            'description': f"Image contains text: '{results['text_content'][:100]}'",
            'recommendation': 'Text in images should be avoided. Use actual text or provide full text alternative.'
        })

    # Generate alt text suggestion from labels (None when Vision returns no labels)
    top_labels = results['labels'][:3]
    suggested_alt = f"Image showing {', '.join(top_labels)}" if top_labels else None

    return results, suggested_alt, issues
```

**Cost**: $1.50 per 1,000 images (first 1,000/month free)
**Setup**:
```bash
pip install google-cloud-vision
# Requires Google Cloud project and credentials
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
```

---

## 2. 🎨 Color Contrast Checking (WCAG 1.4.3, 1.4.11)

### Solution A: PIL + Color Math

```python
from PIL import Image
import numpy as np
from pdf2image import convert_from_path

def calculate_contrast_ratio(color1, color2):
    """Calculate WCAG contrast ratio between two colors"""

    def get_luminance(rgb):
        """Calculate relative luminance"""
        rgb = [x / 255.0 for x in rgb]
        rgb = [
            x / 12.92 if x <= 0.03928
            else ((x + 0.055) / 1.055) ** 2.4
            for x in rgb
        ]
        return 0.2126 * rgb[0] + 0.7152 * rgb[1] + 0.0722 * rgb[2]

    l1 = get_luminance(color1)
    l2 = get_luminance(color2)

    lighter = max(l1, l2)
    darker = min(l1, l2)

    return (lighter + 0.05) / (darker + 0.05)

def check_page_contrast(pdf_path, page_num, sample_size=100):
    """Check color contrast on a PDF page"""

    images = convert_from_path(pdf_path, first_page=page_num, last_page=page_num, dpi=150)
    image = images[0]

    # Convert to RGB
    rgb_image = image.convert('RGB')
    width, height = rgb_image.size

    # Sample points across the page
    low_contrast_areas = []

    for _ in range(sample_size):
        x = np.random.randint(0, width - 1)
        y = np.random.randint(0, height - 1)

        # Get pixel and adjacent pixel
        pixel1 = rgb_image.getpixel((x, y))
        pixel2 = rgb_image.getpixel((min(x + 1, width - 1), y))

        ratio = calculate_contrast_ratio(pixel1, pixel2)

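        # WCAG AA requires 4.5:1 for normal text, 3:1 for large text.
        # Skip near-identical neighbors (ratio close to 1): they sit inside a solid
        # region, not on a text/background edge. The 1.1 cutoff is a heuristic.
        if 1.1 < ratio < 4.5: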
            low_contrast_areas.append({
                'position': (x, y),
                'colors': (pixel1, pixel2),
                'ratio': ratio
            })

    return low_contrast_areas

# Integration
def _check_color_contrast_enhanced(self):
    """Enhanced contrast checking"""
    for i in range(len(self.pdf_reader.pages)):
        low_contrast = check_page_contrast(str(self.pdf_path), i + 1)

        if len(low_contrast) > 10:  # More than 10% of samples
            self.add_issue(
                Severity.ERROR,
                "Color Contrast",
                f"Page {i+1}: {len(low_contrast)} potential contrast issues detected",
                wcag_criterion="1.4.3",
                recommendation="Use Colour Contrast Analyser to verify specific areas"
            )
```

**Cost**: Free
**Setup**: `pip install pillow pdf2image numpy`
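
As a worked example of the contrast formula and the AA thresholds mentioned in the code comments (4.5:1 for normal text, 3:1 for large text), the sketch below evaluates a few foreground/background pairs. It restates the luminance math so it runs standalone; `passes_wcag_aa` is a hypothetical helper name.

```python
def relative_luminance(rgb):
    """WCAG relative luminance of an (R, G, B) tuple with 0-255 channels."""
    channels = []
    for value in rgb:
        c = value / 255.0
        channels.append(c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4)
    return 0.2126 * channels[0] + 0.7152 * channels[1] + 0.0722 * channels[2]

def contrast_ratio(fg, bg):
    """Contrast ratio between two colors, from 1.0 (identical) up to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    return (max(l1, l2) + 0.05) / (min(l1, l2) + 0.05)

def passes_wcag_aa(fg, bg, large_text=False):
    """True if the pair meets the AA threshold for the given text size."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(fg, bg) >= threshold

white = (255, 255, 255)
print(round(contrast_ratio((0, 0, 0), white), 2))        # black on white: 21.0
print(round(contrast_ratio((118, 118, 118), white), 2))  # #767676 on white: 4.54
print(passes_wcag_aa((119, 119, 119), white))            # #777777 falls just short: False
print(passes_wcag_aa((119, 119, 119), white, large_text=True))  # True
```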

---

### Solution B: Colorblind Simulation

```python
def simulate_colorblindness(image, cb_type='protanopia'):
    """Simulate how image appears to colorblind users"""

    # Transformation matrices for different types
    matrices = {
        'protanopia': [  # Red-blind
            [0.567, 0.433, 0],
            [0.558, 0.442, 0],
            [0, 0.242, 0.758]
        ],
        'deuteranopia': [  # Green-blind
            [0.625, 0.375, 0],
            [0.7, 0.3, 0],
            [0, 0.3, 0.7]
        ],
        'tritanopia': [  # Blue-blind
            [0.95, 0.05, 0],
            [0, 0.433, 0.567],
            [0, 0.475, 0.525]
        ]
    }

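    # Apply the 3x3 matrix to every RGB pixel
    # (local imports keep this snippet standalone)
    import numpy as np
    from PIL import Image

    pixels = np.asarray(image.convert('RGB'), dtype=np.float64)
    transformed = pixels @ np.array(matrices[cb_type]).T
    transformed_image = Image.fromarray(np.clip(transformed, 0, 255).astype(np.uint8))

    return transformed_image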

def check_accessibility_for_colorblind(pdf_path, page_num):
    """Check if content is accessible to colorblind users"""

    images = convert_from_path(pdf_path, first_page=page_num, last_page=page_num)
    original = images[0]

    issues = []

    for cb_type in ['protanopia', 'deuteranopia', 'tritanopia']:
        simulated = simulate_colorblindness(original, cb_type)

        # Compare information loss
        # If significant difference, color might be only differentiator
        # ... comparison logic ...

    return issues
```
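
The comparison step is left open above. One possible approach, sketched here under simplifying assumptions, is to take two colors that carry different meanings (say, two chart series) and check whether they stay distinguishable after simulation. Plain Euclidean RGB distance is a crude stand-in for perceptual difference, and `distinction_lost` and its `min_ratio` default are hypothetical.

```python
import math

# Protanopia matrix copied from simulate_colorblindness above
PROTANOPIA = [
    [0.567, 0.433, 0.0],
    [0.558, 0.442, 0.0],
    [0.0, 0.242, 0.758],
]

def simulate_color(rgb, matrix):
    """Apply a 3x3 color-vision matrix to a single (R, G, B) tuple."""
    return tuple(
        min(255, max(0, round(sum(m * c for m, c in zip(row, rgb)))))
        for row in matrix
    )

def distinction_lost(color1, color2, matrix, min_ratio=0.5):
    """True if the simulated colors are much closer together than the originals."""
    before = math.dist(color1, color2)
    after = math.dist(simulate_color(color1, matrix), simulate_color(color2, matrix))
    return before > 0 and after / before < min_ratio

red, green = (255, 0, 0), (0, 255, 0)
print(distinction_lost(red, green, PROTANOPIA))                   # True
print(distinction_lost((0, 0, 0), (255, 255, 255), PROTANOPIA))   # False
```

A hit suggests that color may be the only differentiator (WCAG 1.4.1) and that the area deserves a manual look.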

---

## 3. 📝 OCR for Scanned Documents (WCAG 1.1.1)

### Solution A: Tesseract OCR (Free)

```python
import pytesseract
from pdf2image import convert_from_path

def add_ocr_layer(pdf_path, output_path):
    """Add OCR text layer to scanned PDF"""

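    from pypdf import PdfWriter, PdfReader
    from reportlab.pdfgen import canvas
    from io import BytesIO

    images = convert_from_path(pdf_path, dpi=300)
    scale = 72.0 / 300  # image pixels at 300 dpi -> PDF points

    reader = PdfReader(pdf_path)
    writer = PdfWriter()

    for i, image in enumerate(images):
        # Run OCR with detailed data
        ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

        # Create PDF page with invisible text layer, sized to match the scan
        page_width, page_height = image.width * scale, image.height * scale
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=(page_width, page_height))

        # Add invisible text at correct positions
        for j, text in enumerate(ocr_data['text']):
            if text.strip():
                x = ocr_data['left'][j] * scale
                # Tesseract measures from the top edge, PDF from the bottom
                y = page_height - (ocr_data['top'][j] + ocr_data['height'][j]) * scale
                text_obj = c.beginText(x, y)
                text_obj.setTextRenderMode(3)  # 3 = invisible text
                text_obj.textOut(text)
                c.drawText(text_obj)

        c.showPage()  # Emit the page even when OCR found no text
        c.save()

        # Merge the text layer onto the original page (assumes unrotated pages)
        packet.seek(0)
        page = reader.pages[i]
        page.merge_page(PdfReader(packet).pages[0])
        writer.add_page(page)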

    with open(output_path, 'wb') as f:
        writer.write(f)

    return output_path
```

**Cost**: Free
**Setup**:
```bash
pip install pytesseract pdf2image
# Install Tesseract: https://github.com/tesseract-ocr/tesseract
```
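
OCR is only worth running on pages that lack a usable text layer. A simple heuristic, sketched below, flags pages whose extracted text is empty or nearly so. The `min_chars` cutoff is an assumed default and `find_pages_needing_ocr` is a hypothetical helper; the page texts could come from pypdf's `page.extract_text()`.

```python
def find_pages_needing_ocr(page_texts, min_chars=25):
    """Return 1-based page numbers whose extracted text is (nearly) empty."""
    return [
        page_number
        for page_number, text in enumerate(page_texts, start=1)
        if len((text or '').strip()) < min_chars
    ]

# Example with pre-extracted text; with pypdf this list could be built as
# [page.extract_text() for page in PdfReader(pdf_path).pages]
sample_texts = [
    "Annual report 2023: financial summary and outlook.",
    "",
    "  \n ",
]
print(find_pages_needing_ocr(sample_texts))  # [2, 3]
```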

---

### Solution B: Google Cloud Document AI

```python
from google.cloud import documentai_v1 as documentai

def ocr_with_google_document_ai(pdf_bytes):
    """Use Google Document AI for superior OCR"""

    client = documentai.DocumentProcessorServiceClient()

    # Configure processor
    name = "projects/PROJECT_ID/locations/us/processors/PROCESSOR_ID"

    raw_document = documentai.RawDocument(
        content=pdf_bytes,
        mime_type="application/pdf"
    )

    request = documentai.ProcessRequest(
        name=name,
        raw_document=raw_document
    )

    result = client.process_document(request=request)
    document = result.document

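    # Extract text with a confidence score: the mean of each page's layout
    # confidence (text_styles entries carry no confidence field)
    page_confidences = [page.layout.confidence for page in document.pages]

    return {
        'text': document.text,
        'confidence': sum(page_confidences) / len(page_confidences) if page_confidences else 0,
        'pages': len(document.pages),
        'entities': document.entities  # Structured data extraction
    }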
```

**Cost**: $1.50 per 1,000 pages (first 1,000/month free)
**Compared with Tesseract**: Generally higher accuracy and better handling of complex layouts

---

## 4. 🔗 Link Text Quality Check (WCAG 2.4.4)

### Solution: OpenAI for Context Analysis

```python
def check_link_quality_with_ai(link_text, surrounding_context):
    """Use AI to assess if link text is descriptive"""

    import openai

    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": """You are a WCAG accessibility expert. Evaluate link text quality.

                GOOD link text:
                - Describes destination clearly
                - Makes sense out of context
                - Unique (not repeated for different destinations)

                BAD link text:
                - "click here", "here", "read more", "link"
                - Repeated generic text
                - No indication of destination"""
            },
            {
                "role": "user",
                "content": f"""Evaluate this link:

                Link text: "{link_text}"
                Context: "{surrounding_context}"

                Respond with JSON:
                {{
                    "quality_score": 1-10,
                    "issues": ["list", "of", "problems"],
                    "suggestion": "better link text",
                    "wcag_pass": true/false
                }}"""
            }
        ]
    )

    return response.choices[0].message.content
```

**Cost**: ~$0.001 per link
**Alternative**: Use regex + NLP library (spaCy) for simpler checks
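
The regex half of that alternative might look like the sketch below, which flags the generic phrases listed in the system prompt above plus a few close variants. The pattern list is illustrative, not exhaustive, and `is_generic_link_text` is a hypothetical helper; no spaCy is involved here.

```python
import re

# Generic phrases from the "BAD link text" list, plus close variants (assumed)
GENERIC_LINK_PATTERNS = [
    r'^(click|tap)\s+here$',
    r'^here$',
    r'^(read|learn|see)\s+more$',
    r'^more$',
    r'^(this\s+)?link$',
]

def is_generic_link_text(link_text: str) -> bool:
    """True if the link text is empty or matches a known generic phrase."""
    normalized = re.sub(r'\s+', ' ', link_text).strip().lower()
    normalized = normalized.rstrip('.!:»>').strip()  # drop trailing punctuation
    if not normalized:
        return True
    return any(re.match(pattern, normalized) for pattern in GENERIC_LINK_PATTERNS)

for text in ["Click here", "Read more...", "Download the 2023 annual report (PDF)"]:
    print(f"{text!r}: generic={is_generic_link_text(text)}")
```

Running this cheap check first and sending only the remaining links to the AI check would keep API costs down.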

---

## 5. 📖 Content Readability Analysis (WCAG 3.1.5)

### Solution A: TextBlob (Simple, Free)

```python
from textblob import TextBlob
import re

def analyze_readability(text):
    """Analyze text readability for WCAG 3.1.5 (AAA)"""

    # Clean text
    text = re.sub(r'\s+', ' ', text)

    # Split into sentences
    blob = TextBlob(text)
    sentences = blob.sentences

    # Calculate metrics
    total_words = len(blob.words)
    total_sentences = len(sentences)
    total_syllables = sum(count_syllables(word) for word in blob.words)

    # Flesch Reading Ease
    if total_sentences > 0 and total_words > 0:
        flesch = 206.835 - 1.015 * (total_words / total_sentences) - 84.6 * (total_syllables / total_words)
    else:
        flesch = 0

    # Flesch-Kincaid Grade Level
    if total_sentences > 0 and total_words > 0:
        fk_grade = 0.39 * (total_words / total_sentences) + 11.8 * (total_syllables / total_words) - 15.59
    else:
        fk_grade = 0

    return {
        'flesch_score': flesch,  # 60-70 = acceptable, 90-100 = very easy
        'grade_level': fk_grade,  # School grade level
        'avg_sentence_length': total_words / total_sentences if total_sentences else 0,
        'avg_word_length': sum(len(word) for word in blob.words) / total_words if total_words else 0,
        'recommendation': 'Target grade 8 or lower for general audience'
    }

def count_syllables(word):
    """Simple syllable counter"""
    word = word.lower()
    count = 0
    vowels = 'aeiouy'
    previous_was_vowel = False

    for char in word:
        is_vowel = char in vowels
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel

    if word.endswith('e'):
        count -= 1
    if count == 0:
        count = 1

    return count
```

**Cost**: Free
**Setup**: `pip install textblob`
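
To sanity-check the two formulas without installing TextBlob, the sketch below plugs in made-up counts (100 words, 5 sentences, 150 syllables). `flesch_scores` is a hypothetical helper that repeats the arithmetic from `analyze_readability`.

```python
def flesch_scores(total_words, total_sentences, total_syllables):
    """Return (Flesch Reading Ease, Flesch-Kincaid grade) from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

# 20 words per sentence, 1.5 syllables per word
ease, grade = flesch_scores(total_words=100, total_sentences=5, total_syllables=150)
print(f"Reading ease: {ease:.1f}, grade level: {grade:.1f}")
# Reading ease: 59.6, grade level: 9.9
```

With these counts the text lands just below the 60-70 band and above the grade 8 target, so it would be flagged for simplification.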

---

### Solution B: GPT-4 for Advanced Analysis

```python
def analyze_content_quality_with_gpt(text_excerpt):
    """Use GPT-4 for comprehensive content analysis"""

    import openai

    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": f"""Analyze this content for accessibility:

                {text_excerpt[:2000]}

                Provide:
                1. Reading level (grade)
                2. Jargon/complex terms that need explanation
                3. Sentences over 25 words (too complex)
                4. Passive voice usage
                5. Suggestions for simplification

                Format as JSON."""
            }
        ]
    )

    return response.choices[0].message.content
```
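
Both GPT prompts in this guide ask for JSON, but chat models often wrap it in a Markdown code fence or surround it with prose. The sketch below is one tolerant way to parse such replies; `parse_json_response` is a hypothetical helper, and it returns `None` rather than raising when nothing parseable is found.

```python
import json
import re

FENCE = '`' * 3  # Markdown code-fence marker, built this way to avoid nesting fences here

def parse_json_response(response_text: str):
    """Parse a model reply that should contain one JSON object; None on failure."""
    # Prefer the contents of a fenced block if the model used one
    fenced = re.search(FENCE + r'(?:json)?\s*(.*?)' + FENCE, response_text, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else response_text

    # Fall back to the outermost {...} span when extra prose surrounds the object
    start, end = candidate.find('{'), candidate.rfind('}')
    if start == -1 or end < start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

reply = 'Here is my analysis: {"reading_level": 11, "passive_voice": 4}'
print(parse_json_response(reply))
```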

---

## 6. 🏗️ Structure and Heading Analysis

### Solution: Enhanced PDF Tag Parsing

```python
def analyze_heading_structure(pdf_path):
    """Parse PDF structure tree and check heading hierarchy"""

    from pypdf import PdfReader

    reader = PdfReader(pdf_path)

    catalog = reader.trailer.get("/Root", {})

    if "/StructTreeRoot" not in catalog:
        return {"error": "No structure tree"}

    struct_tree = catalog["/StructTreeRoot"]

    headings = []

    def traverse_structure(element, level=0):
        """Recursively traverse structure tree"""
        if hasattr(element, 'get_object'):
            element = element.get_object()

        # Leaves can be integers (marked-content IDs); only dictionaries have children
        if not isinstance(element, dict):
            return

        # /Type is optional on structure elements, so rely on /S alone
        struct_type = element.get("/S", "")

        # Check if it's a heading
        if struct_type in ["/H1", "/H2", "/H3", "/H4", "/H5", "/H6"]:
            headings.append({
                'level': int(str(struct_type).replace("/H", "")),
                'type': str(struct_type)
            })

        # Traverse children. This must also run for the root, whose /Type is
        # /StructTreeRoot rather than /StructElem.
        if "/K" in element:
            children = element["/K"]
            if not isinstance(children, list):
                children = [children]

            for child in children:
                traverse_structure(child, level + 1)

    traverse_structure(struct_tree)

    # Check for heading hierarchy issues
    issues = []

    for i in range(1, len(headings)):
        prev_level = headings[i-1]['level']
        curr_level = headings[i]['level']

        # Check for skipped levels (H1 -> H3)
        if curr_level > prev_level + 1:
            issues.append({
                'type': 'skipped_level',
                'message': f'Heading jumps from H{prev_level} to H{curr_level}',
                'wcag': '1.3.1'
            })

    # Check for H1
    if not any(h['level'] == 1 for h in headings):
        issues.append({
            'type': 'no_h1',
            'message': 'Document has no H1 heading',
            'wcag': '1.3.1'
        })

    return {
        'headings': headings,
        'issues': issues
    }
```

---

## 7. 📋 Form Field Accessibility

### Solution: Complete Form Analysis

```python
def analyze_form_fields(pdf_path):
    """Comprehensive form field accessibility check"""

    from pypdf import PdfReader

    reader = PdfReader(pdf_path)

    if "/AcroForm" not in reader.trailer.get("/Root", {}):
        return {"has_forms": False}

    acro_form = reader.trailer["/Root"]["/AcroForm"]
    fields = acro_form.get("/Fields", [])

    issues = []
    field_details = []

    for field in fields:
        field = field.get_object()

        field_info = {
            'name': field.get("/T", "Unnamed"),
            'type': field.get("/FT", "Unknown"),
            'has_tooltip': "/TU" in field,  # Tooltip = description
            'required': field.get("/Ff", 0) & 2 != 0,  # Required flag
            'read_only': field.get("/Ff", 0) & 1 != 0,
        }

        # Check for issues; report a missing description once, at the higher
        # severity when the field is also required
        if not field_info['has_tooltip']:
            if field_info['required']:
                issues.append({
                    'field': field_info['name'],
                    'issue': 'Required field missing description',
                    'wcag': '3.3.2',
                    'severity': 'CRITICAL'
                })
            else:
                issues.append({
                    'field': field_info['name'],
                    'issue': 'No tooltip/description',
                    'wcag': '3.3.2',
                    'severity': 'ERROR'
                })

        field_details.append(field_info)

    return {
        'has_forms': True,
        'field_count': len(fields),
        'fields': field_details,
        'issues': issues
    }
```
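
The `/Ff` entry read above is a bit field. In the PDF specification, bit position 1 (value 1) is ReadOnly, position 2 (value 2) is Required, and position 3 (value 4) is NoExport. A small decoder makes the masks explicit; `decode_field_flags` is a hypothetical helper name.

```python
def decode_field_flags(ff: int) -> dict:
    """Decode the flags common to all field types from a PDF /Ff integer."""
    return {
        'read_only': bool(ff & (1 << 0)),  # bit position 1
        'required': bool(ff & (1 << 1)),   # bit position 2
        'no_export': bool(ff & (1 << 2)),  # bit position 3
    }

print(decode_field_flags(2))
# {'read_only': False, 'required': True, 'no_export': False}
```

With pypdf the argument could be `int(field.get("/Ff", 0))`.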

---

## 8. 📊 Complete Integration Example

```python
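# config.py
import os

class AccessibilityConfig:
    # API Keys (read from the environment instead of committing them to source)
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
    GOOGLE_CLOUD_CREDENTIALS = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")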

    # Feature flags
    ENABLE_AI_IMAGE_ANALYSIS = True
    ENABLE_OCR = True
    ENABLE_CONTRAST_CHECK = True
    ENABLE_CONTENT_ANALYSIS = True

    # Thresholds
    MIN_CONTRAST_RATIO = 4.5
    MAX_SENTENCE_LENGTH = 25
    TARGET_READING_LEVEL = 8

# Usage
from enhanced_pdf_checker import EnhancedPDFAccessibilityChecker, EnhancedCheckConfig

config = EnhancedCheckConfig(
    vision_api_provider="openai",
    vision_api_key=AccessibilityConfig.OPENAI_API_KEY,
    enable_ocr=True,
    enable_contrast_check=True,
    enable_content_analysis=True,
    verbose=True
)

checker = EnhancedPDFAccessibilityChecker("document.pdf", config)
issues = checker.check_all()
report = checker.generate_report("html")
```

---

## 💰 Cost Comparison

| Service | Cost | Use Case | Coverage |
|---------|------|----------|----------|
| Tesseract OCR | Free | Scanned docs | 100% |
| TextBlob | Free | Readability | 80% |
| OpenAI GPT-4V | $0.01-0.03/image | Alt text validation | 95% |
| Google Vision | $1.50/1000 images | OCR + analysis | 95% |
| Google Document AI | $1.50/1000 pages | Complex OCR | 98% |
| Claude Vision | $0.015/image | Alt text + analysis | 95% |
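
To turn the table into a rough monthly estimate, the sketch below multiplies document volumes by the prices quoted in this guide, including the first-1,000-free allowance noted earlier for the Google services. Actual pricing changes over time, and `estimate_monthly_cost` is a hypothetical helper.

```python
def estimate_monthly_cost(num_images: int, num_pages: int) -> dict:
    """Estimate monthly spend in USD from the per-unit prices quoted in this guide."""
    free_units = 1000  # Google's free monthly allowance, as quoted earlier
    billable_images = max(0, num_images - free_units)
    billable_pages = max(0, num_pages - free_units)
    return {
        'openai_gpt4v': (num_images * 0.01, num_images * 0.03),  # (low, high)
        'claude_vision': num_images * 0.015,
        'google_vision': billable_images / 1000 * 1.50,
        'google_document_ai': billable_pages / 1000 * 1.50,
    }

for service, cost in estimate_monthly_cost(num_images=500, num_pages=3000).items():
    print(service, cost)
```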

---

## 🎯 Recommended Setup for Different Budgets

### Free Tier (~60% WCAG Coverage)
```bash
pip install pytesseract textblob pillow pdf2image
# + Basic tool (20%) + OCR (15%) + Readability (15%) + Contrast check (10%)
```

### Budget Tier (~80% WCAG Coverage) - $10/month
- Basic tool (20%)
- Tesseract OCR (15%)
- TextBlob (15%)
- OpenAI API for critical images only (20%)
- Custom contrast checking (10%)

### Professional Tier (~95% WCAG Coverage) - $100/month
- All free tools
- OpenAI GPT-4V for all images (30%)
- Google Document AI for OCR (20%)
- GPT-4 for content analysis (15%)
- Automated link checking (10%)

---

## 🚀 Implementation Roadmap

1. **Week 1**: Integrate OCR (Tesseract) - Free, high impact
2. **Week 2**: Add color contrast checking - Free, fills major gap
3. **Week 3**: Integrate TextBlob for readability - Free, easy win
4. **Week 4**: Add OpenAI vision for critical documents - Paid, but transformative
5. **Week 5**: Polish and optimize API usage - Reduce costs
6. **Week 6**: Add batch processing and caching - Scale efficiently

Total implementation time: ~6 weeks for production-ready enhanced checker
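
For the caching step in Week 6, one low-effort option is to key API results by a hash of the image bytes, so logos and other repeated images are analyzed only once. The sketch below keeps the cache in memory and assumes any one-argument analysis function (such as `check_image_with_claude`); `CachedImageAnalyzer` is a hypothetical class.

```python
import hashlib

class CachedImageAnalyzer:
    """Wrap an image-analysis function with a content-addressed in-memory cache."""

    def __init__(self, analyze_fn):
        self._analyze_fn = analyze_fn
        self._cache = {}
        self.api_calls = 0  # number of times the wrapped function was actually invoked

    def analyze(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._analyze_fn(image_bytes)
        return self._cache[key]

# Demonstration with a stand-in for a paid API call
analyzer = CachedImageAnalyzer(lambda data: f"analysis of {len(data)} bytes")
logo = b"same-logo-bytes"
analyzer.analyze(logo)
analyzer.analyze(logo)       # served from the cache
analyzer.analyze(b"other")
print(analyzer.api_calls)    # 2
```

Persisting the cache to disk or a key-value store would extend the savings across runs.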