# Integration Guide: Augmenting PDF Accessibility Checker

This guide shows how to integrate external APIs and tools to check WCAG requirements that can't be validated programmatically with basic PDF parsing.

## 🎯 Integration Strategy Matrix

| WCAG Gap | Solution | API/Tool | Coverage Improvement |
|----------|----------|----------|---------------------|
| Alt text quality | AI Vision | OpenAI GPT-4V, Claude, Google Vision | ✅ 90%+ |
| Color contrast | Image analysis | Custom + Color libraries | ✅ 95%+ |
| OCR for scanned docs | Text extraction | Tesseract, Google Cloud Vision | ✅ 100% |
| Link text quality | NLP analysis | OpenAI, spaCy | ✅ 80% |
| Content readability | NLP analysis | TextBlob, GPT-4 | ✅ 75% |
| Heading hierarchy | Structure parsing | pdf-lib, pypdf enhanced | ✅ 70% |
| Form field validation | PDF parsing | pypdf, pdf-lib | ✅ 85% |
| Table structure | ML models | Custom + Camelot | ✅ 80% |

---

## 1. 🖼️ AI Vision APIs for Image Analysis (WCAG 1.1.1)

### Problem We're Solving:
- ❌ The basic tool can only detect that images exist
- ✅ AI can generate/validate alt text descriptions

### Solution A: OpenAI GPT-4 Vision

```python
import base64
from typing import Optional

import openai

def check_image_alt_text_openai(image_bytes: bytes, existing_alt_text: Optional[str] = None):
    """Use a GPT-4 vision model to analyze an image and suggest/validate alt text"""
    # Encode image
    base64_image = base64.b64encode(image_bytes).decode('utf-8')

    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

    if existing_alt_text:
        # Validate existing alt text
        prompt = f"""Analyze this image and the provided alt text.

        Alt text: "{existing_alt_text}"

        Rate the alt text quality (1-10) and provide:
        1. What's missing from the description
        2. What's good about it
        3. Suggested improvement

        Consider: Is it accurate? Concise? Informative? Appropriate detail level?"""
    else:
        # Generate alt text suggestion
        prompt = """Describe this image for someone who cannot see it.
        Provide a concise alt text (1-2 sentences) suitable for accessibility.
        Focus on the information the image conveys, not artistic details."""

    response = client.chat.completions.create(
        # gpt-4-vision-preview has been retired; any vision-capable model works here
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            # Assumes JPEG data; adjust the MIME type for PNG etc.
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )

    return response.choices[0].message.content

# Usage in checker:
def _check_images_with_openai(self):
    """Enhanced image checking with OpenAI"""
    for i, page in enumerate(self.pdf_plumber.pages):
        for img in page.images:
            # Extract image bytes from PDF
            image_bytes = self._extract_image_bytes(img)

            # Check if alt text exists in PDF structure
            alt_text = self._get_image_alt_text(page, img)

            if not alt_text:
                # No alt text: ask the model for a suggestion (one API call per image)
                analysis = check_image_alt_text_openai(image_bytes)
                self.add_issue(
                    Severity.ERROR,
                    "Missing Alt Text",
                    f"Page {i+1}: Image has no alt text. AI suggests: {analysis[:100]}...",
                    wcag_criterion="1.1.1"
                )
            else:
                # Validate quality
                validation = check_image_alt_text_openai(image_bytes, alt_text)
                # Parse validation response and create issues if needed
```

**Cost**: ~$0.01-0.03 per image

**Setup**: `pip install openai`

---

### Solution B: Anthropic Claude Vision

```python
import anthropic
import base64

def check_image_with_claude(image_bytes: bytes):
    """Use Claude to analyze image accessibility"""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    base64_image = base64.b64encode(image_bytes).decode('utf-8')

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            # Assumes JPEG data; adjust the media type for PNG etc.
                            "media_type": "image/jpeg",
                            "data": base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": """Analyze this image for accessibility:
                        1. Provide a concise alt text (1-2 sentences)
                        2. Identify any text in the image (would fail WCAG 1.4.5)
                        3. Note any color-only information (would fail WCAG 1.4.1)
                        4. Assess if this is decorative or informational

                        Format as JSON."""
                    }
                ],
            }
        ],
    )

    return message.content[0].text
```

**Cost**: ~$0.015 per image

**Setup**: `pip install anthropic`

---

### Solution C: Google Cloud Vision API

```python
from google.cloud import vision

def check_image_google_vision(image_bytes: bytes):
    """Use Google Cloud Vision for comprehensive image analysis"""
    client = vision.ImageAnnotatorClient()

    image = vision.Image(content=image_bytes)

    # Multiple detection types
    response = client.annotate_image({
        'image': image,
        'features': [
            {'type_': vision.Feature.Type.TEXT_DETECTION},       # OCR
            {'type_': vision.Feature.Type.LABEL_DETECTION},      # Content labels
            {'type_': vision.Feature.Type.IMAGE_PROPERTIES},     # Colors
            {'type_': vision.Feature.Type.OBJECT_LOCALIZATION},  # Objects
        ],
    })
    if response.error.message:
        raise RuntimeError(response.error.message)

    results = {
        'has_text': bool(response.text_annotations),
        'text_content': response.text_annotations[0].description if response.text_annotations else None,
        'labels': [label.description for label in response.label_annotations],
        'dominant_colors': response.image_properties_annotation.dominant_colors.colors[:5],
        'objects': [obj.name for obj in response.localized_object_annotations]
    }

    # Generate issues based on findings
    issues = []

    if results['has_text']:
        issues.append({
            'severity': 'ERROR',
            'wcag': '1.4.5',
            'description': f"Image contains text: '{results['text_content'][:100]}'",
            'recommendation': 'Text in images should be avoided. Use actual text or provide full text alternative.'
        })

    # Generate alt text suggestion from labels (None if nothing was detected)
    suggested_alt = (
        f"Image showing {', '.join(results['labels'][:3])}" if results['labels'] else None
    )

    return results, suggested_alt, issues
```

**Cost**: $1.50 per 1,000 images (first 1,000/month free)

**Setup**:
```bash
pip install google-cloud-vision
# Requires Google Cloud project and credentials
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
```

---

## 2. 🎨 Color Contrast Checking (WCAG 1.4.3, 1.4.11)

### Solution A: PIL + Color Math

```python
import numpy as np
from pdf2image import convert_from_path

def calculate_contrast_ratio(color1, color2):
    """Calculate WCAG contrast ratio between two colors"""
    def get_luminance(rgb):
        """Calculate relative luminance"""
        rgb = [x / 255.0 for x in rgb]
        rgb = [
            x / 12.92 if x <= 0.03928 else ((x + 0.055) / 1.055) ** 2.4
            for x in rgb
        ]
        return 0.2126 * rgb[0] + 0.7152 * rgb[1] + 0.0722 * rgb[2]

    l1 = get_luminance(color1)
    l2 = get_luminance(color2)

    lighter = max(l1, l2)
    darker = min(l1, l2)

    return (lighter + 0.05) / (darker + 0.05)

def check_page_contrast(pdf_path, page_num, sample_size=100, min_edge_ratio=1.5):
    """Check color contrast on a PDF page by sampling adjacent pixel pairs"""
    images = convert_from_path(pdf_path, first_page=page_num, last_page=page_num, dpi=150)
    image = images[0]

    # Convert to RGB
    rgb_image = image.convert('RGB')
    width, height = rgb_image.size

    # Sample points across the page
    low_contrast_areas = []

    for _ in range(sample_size):
        x = np.random.randint(0, width - 1)
        y = np.random.randint(0, height - 1)

        # Get pixel and adjacent pixel
        pixel1 = rgb_image.getpixel((x, y))
        pixel2 = rgb_image.getpixel((min(x + 1, width - 1), y))

        ratio = calculate_contrast_ratio(pixel1, pixel2)

        # Near-identical neighbours (e.g. two background pixels) are not an edge.
        # Without this filter almost every sample would be flagged, since most
        # adjacent pixels share a color. min_edge_ratio is a heuristic cut-off.
        if ratio < min_edge_ratio:
            continue

        # WCAG AA requires 4.5:1 for normal text, 3:1 for large text
        if ratio < 4.5:
            low_contrast_areas.append({
                'position': (x, y),
                'colors': (pixel1, pixel2),
                'ratio': ratio
            })

    return low_contrast_areas

# Integration
def _check_color_contrast_enhanced(self):
    """Enhanced contrast checking"""
    for i in range(len(self.pdf_reader.pages)):
        low_contrast = check_page_contrast(str(self.pdf_path), i + 1)

        if len(low_contrast) > 10:  # More than 10% of the default 100 samples
            self.add_issue(
                Severity.ERROR,
                "Color Contrast",
                f"Page {i+1}: {len(low_contrast)} potential contrast issues detected",
                wcag_criterion="1.4.3",
                recommendation="Use Colour Contrast Analyser to verify specific areas"
            )
```

**Cost**: Free

**Setup**: `pip install pillow pdf2image numpy`

---

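As a quick sanity check of the thresholds used in Solution A (4.5:1 for normal text, 3:1 for large text), the sketch below applies the same relative-luminance formula to hex colors. The helper names `hex_to_rgb`, `contrast_ratio`, and `passes_aa` are illustrative and not part of the checker.

```python
def hex_to_rgb(hex_color: str) -> tuple:
    """Convert '#rrggbb' to an (r, g, b) tuple of 0-255 integers."""
    h = hex_color.lstrip('#')
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def relative_luminance(rgb) -> float:
    """WCAG 2.x relative luminance of an sRGB color."""
    channels = []
    for value in rgb:
        c = value / 255.0
        channels.append(c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4)
    r, g, b = channels
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg_hex: str, bg_hex: str) -> float:
    """Contrast ratio between two hex colors (order does not matter)."""
    l1 = relative_luminance(hex_to_rgb(fg_hex))
    l2 = relative_luminance(hex_to_rgb(bg_hex))
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)

def passes_aa(fg_hex: str, bg_hex: str, large_text: bool = False) -> bool:
    """WCAG AA: 4.5:1 for normal text, 3:1 for large text."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(fg_hex, bg_hex) >= threshold

if __name__ == "__main__":
    for fg in ("#000000", "#767676", "#777777"):
        ratio = contrast_ratio(fg, "#ffffff")
        print(f"{fg} on white: {ratio:.2f}:1, AA normal text: {passes_aa(fg, '#ffffff')}")
```

Black on white gives the maximum ratio of 21:1. The two grays show how narrow the margin can be: `#767676` on white comes out at about 4.54:1 and passes for normal text, while `#777777` lands at about 4.48:1 and only passes for large text.
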
### Solution B: Colorblind Simulation

```python
import numpy as np
from PIL import Image
from pdf2image import convert_from_path

def simulate_colorblindness(image, cb_type='protanopia'):
    """Simulate how image appears to colorblind users"""
    # Transformation matrices for different types
    matrices = {
        'protanopia': [  # Red-blind
            [0.567, 0.433, 0],
            [0.558, 0.442, 0],
            [0, 0.242, 0.758]
        ],
        'deuteranopia': [  # Green-blind
            [0.625, 0.375, 0],
            [0.7, 0.3, 0],
            [0, 0.3, 0.7]
        ],
        'tritanopia': [  # Blue-blind
            [0.95, 0.05, 0],
            [0, 0.433, 0.567],
            [0, 0.475, 0.525]
        ]
    }

    # Apply transformation: multiply every RGB pixel by the 3x3 matrix
    rgb = np.asarray(image.convert('RGB'), dtype=float)
    transformed = rgb @ np.array(matrices[cb_type]).T
    transformed_image = Image.fromarray(np.clip(transformed, 0, 255).astype(np.uint8))
    return transformed_image

def check_accessibility_for_colorblind(pdf_path, page_num):
    """Check if content is accessible to colorblind users"""
    images = convert_from_path(pdf_path, first_page=page_num, last_page=page_num)
    original = images[0]

    issues = []

    for cb_type in ['protanopia', 'deuteranopia', 'tritanopia']:
        simulated = simulate_colorblindness(original, cb_type)

        # Compare information loss
        # If significant difference, color might be only differentiator
        # ... comparison logic ...

    return issues
```

---

## 3. 📝 OCR for Scanned Documents (WCAG 1.1.1)

### Solution A: Tesseract OCR (Free)

```python
from io import BytesIO

import pytesseract
from pdf2image import convert_from_path
from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas

def add_ocr_layer(pdf_path, output_path, dpi=300):
    """Add an invisible OCR text layer to a scanned PDF"""
    images = convert_from_path(pdf_path, dpi=dpi)
    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    scale = 72.0 / dpi  # OCR coordinates are pixels; PDF coordinates are points

    for page, image in zip(reader.pages, images):
        # Run OCR with detailed data
        ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

        # Create an overlay page the same size as the original page
        page_width = float(page.mediabox.width)
        page_height = float(page.mediabox.height)
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=(page_width, page_height))

        # Add invisible text at correct positions
        for j, text in enumerate(ocr_data['text']):
            if text.strip():
                x = ocr_data['left'][j] * scale
                # Tesseract measures from the top-left corner, PDF from the bottom-left
                y = page_height - (ocr_data['top'][j] + ocr_data['height'][j]) * scale
                text_obj = c.beginText(x, y)
                text_obj.setTextRenderMode(3)  # 3 = invisible text
                text_obj.textOut(text)
                c.drawText(text_obj)

        c.showPage()  # always emit exactly one overlay page, even if it is empty
        c.save()

        # Merge with original page (page rotation and offsets are not handled here)
        packet.seek(0)
        page.merge_page(PdfReader(packet).pages[0])
        writer.add_page(page)

    with open(output_path, 'wb') as f:
        writer.write(f)

    return output_path
```

**Cost**: Free

**Setup**:
```bash
pip install pytesseract pdf2image pypdf reportlab
# Install Tesseract: https://github.com/tesseract-ocr/tesseract
```

---

### Solution B: Google Cloud Document AI

```python
from google.cloud import documentai_v1 as documentai

def ocr_with_google_document_ai(pdf_bytes):
    """Use Google Document AI for superior OCR"""
    client = documentai.DocumentProcessorServiceClient()

    # Configure processor
    name = "projects/PROJECT_ID/locations/us/processors/PROCESSOR_ID"

    raw_document = documentai.RawDocument(
        content=pdf_bytes,
        mime_type="application/pdf"
    )

    request = documentai.ProcessRequest(
        name=name,
        raw_document=raw_document
    )

    result = client.process_document(request=request)
    document = result.document

    # Extract text with confidence scores (averaged over the page layouts;
    # text styles do not carry a confidence value)
    page_confidences = [page.layout.confidence for page in document.pages]

    return {
        'text': document.text,
        'confidence': sum(page_confidences) / len(page_confidences) if page_confidences else 0,
        'pages': len(document.pages),
        'entities': document.entities  # Structured data extraction
    }
```

**Cost**: $1.50 per 1,000 pages (first 1,000/month free)

**Better than Tesseract**: Higher accuracy, handles complex layouts

---

## 4. 🔗 Link Text Quality Check (WCAG 2.4.4)

### Solution: OpenAI for Context Analysis

```python
def check_link_quality_with_ai(link_text, surrounding_context):
    """Use AI to assess if link text is descriptive"""
    import openai

    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": """You are a WCAG accessibility expert. Evaluate link text quality.

                GOOD link text:
                - Describes destination clearly
                - Makes sense out of context
                - Unique (not repeated for different destinations)

                BAD link text:
                - "click here", "here", "read more", "link"
                - Repeated generic text
                - No indication of destination"""
            },
            {
                "role": "user",
                "content": f"""Evaluate this link:

                Link text: "{link_text}"
                Context: "{surrounding_context}"

                Respond with JSON:
                {{
                    "quality_score": 1-10,
                    "issues": ["list", "of", "problems"],
                    "suggestion": "better link text",
                    "wcag_pass": true/false
                }}"""
            }
        ]
    )

    return response.choices[0].message.content
```

**Cost**: ~$0.001 per link

**Alternative**: Use regex + NLP library (spaCy) for simpler checks

---

## 5. 📖 Content Readability Analysis (WCAG 3.1.5)

### Solution A: TextBlob (Simple, Free)

```python
from textblob import TextBlob
import re

def analyze_readability(text):
    """Analyze text readability for WCAG 3.1.5 (AAA)"""
    # Clean text
    text = re.sub(r'\s+', ' ', text)

    # Split into sentences
    blob = TextBlob(text)
    sentences = blob.sentences

    # Calculate metrics
    total_words = len(blob.words)
    total_sentences = len(sentences)
    total_syllables = sum(count_syllables(word) for word in blob.words)

    # Flesch Reading Ease
    if total_sentences > 0 and total_words > 0:
        flesch = 206.835 - 1.015 * (total_words / total_sentences) - 84.6 * (total_syllables / total_words)
    else:
        flesch = 0

    # Flesch-Kincaid Grade Level
    if total_sentences > 0 and total_words > 0:
        fk_grade = 0.39 * (total_words / total_sentences) + 11.8 * (total_syllables / total_words) - 15.59
    else:
        fk_grade = 0

    return {
        'flesch_score': flesch,    # 60-70 = acceptable, 90-100 = very easy
        'grade_level': fk_grade,   # School grade level
        'avg_sentence_length': total_words / total_sentences if total_sentences else 0,
        'avg_word_length': sum(len(word) for word in blob.words) / total_words if total_words else 0,
        'recommendation': 'Target grade 8 or lower for general audience'
    }

def count_syllables(word):
    """Simple syllable counter (heuristic: counts vowel groups)"""
    word = word.lower()
    count = 0
    vowels = 'aeiouy'
    previous_was_vowel = False

    for char in word:
        is_vowel = char in vowels
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel

    # Treat a trailing 'e' as silent
    if word.endswith('e'):
        count -= 1
    # Every word has at least one syllable
    if count <= 0:
        count = 1

    return count
```

**Cost**: Free

**Setup**: `pip install textblob`, then `python -m textblob.download_corpora` to fetch the tokenizer data TextBlob needs

---

### Solution B: GPT-4 for Advanced Analysis

```python
def analyze_content_quality_with_gpt(text_excerpt):
    """Use GPT-4 for comprehensive content analysis"""
    import openai

    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": f"""Analyze this content for accessibility:

                {text_excerpt[:2000]}

                Provide:
                1. Reading level (grade)
                2. Jargon/complex terms that need explanation
                3. Sentences over 25 words (too complex)
                4. Passive voice usage
                5. Suggestions for simplification

                Format as JSON."""
            }
        ]
    )

    return response.choices[0].message.content
```

---

## 6. 🏗️ Structure and Heading Analysis

### Solution: Enhanced PDF Tag Parsing

```python
def analyze_heading_structure(pdf_path):
    """Parse PDF structure tree and check heading hierarchy"""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    catalog = reader.trailer["/Root"]

    if "/StructTreeRoot" not in catalog:
        return {"error": "No structure tree"}

    struct_tree = catalog["/StructTreeRoot"]
    headings = []
    heading_tags = ["/H1", "/H2", "/H3", "/H4", "/H5", "/H6"]

    def traverse_structure(element):
        """Recursively traverse structure tree"""
        if hasattr(element, 'get_object'):
            element = element.get_object()

        # Leaves can be integers (marked-content IDs); only dictionaries have children
        if not isinstance(element, dict):
            return

        # /Type is optional on structure elements, so rely on the /S tag instead
        struct_type = element["/S"] if "/S" in element else ""

        # Check if it's a heading
        if struct_type in heading_tags:
            headings.append({
                'level': int(str(struct_type).replace("/H", "")),
                'type': str(struct_type)
            })

        # Traverse children
        if "/K" in element:
            children = element["/K"]
            if not isinstance(children, list):
                children = [children]

            for child in children:
                traverse_structure(child)

    traverse_structure(struct_tree)

    # Check for heading hierarchy issues
    issues = []

    for i in range(1, len(headings)):
        prev_level = headings[i-1]['level']
        curr_level = headings[i]['level']

        # Check for skipped levels (H1 -> H3)
        if curr_level > prev_level + 1:
            issues.append({
                'type': 'skipped_level',
                'message': f'Heading jumps from H{prev_level} to H{curr_level}',
                'wcag': '1.3.1'
            })

    # Check for H1
    if not any(h['level'] == 1 for h in headings):
        issues.append({
            'type': 'no_h1',
            'message': 'Document has no H1 heading',
            'wcag': '1.3.1'
        })

    return {
        'headings': headings,
        'issues': issues
    }
```

---

## 7. 📋 Form Field Accessibility

### Solution: Complete Form Analysis

```python
def analyze_form_fields(pdf_path):
    """Comprehensive form field accessibility check"""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    catalog = reader.trailer["/Root"]

    if "/AcroForm" not in catalog:
        return {"has_forms": False}

    acro_form = catalog["/AcroForm"]
    # Only top-level fields are inspected; nested /Kids are not traversed
    fields = acro_form["/Fields"] if "/Fields" in acro_form else []

    issues = []
    field_details = []

    for field in fields:
        field = field.get_object()
        flags = int(field.get("/Ff", 0))

        field_info = {
            'name': str(field.get("/T", "Unnamed")),
            'type': str(field.get("/FT", "Unknown")),
            'has_tooltip': "/TU" in field,   # Tooltip = description
            'required': (flags & 2) != 0,    # Required flag (bit 2)
            'read_only': (flags & 1) != 0,   # ReadOnly flag (bit 1)
        }

        # Check for issues: report each field once, escalating required fields
        if not field_info['has_tooltip']:
            required = field_info['required']
            issues.append({
                'field': field_info['name'],
                'issue': 'Required field missing description' if required else 'No tooltip/description',
                'wcag': '3.3.2',
                'severity': 'CRITICAL' if required else 'ERROR'
            })

        field_details.append(field_info)

    return {
        'has_forms': True,
        'field_count': len(fields),
        'fields': field_details,
        'issues': issues
    }
```

---

## 8. 📊 Complete Integration Example

```python
# config.py
class AccessibilityConfig:
    # API Keys
    OPENAI_API_KEY = "sk-..."
    GOOGLE_CLOUD_CREDENTIALS = "path/to/creds.json"

    # Feature flags
    ENABLE_AI_IMAGE_ANALYSIS = True
    ENABLE_OCR = True
    ENABLE_CONTRAST_CHECK = True
    ENABLE_CONTENT_ANALYSIS = True

    # Thresholds
    MIN_CONTRAST_RATIO = 4.5
    MAX_SENTENCE_LENGTH = 25
    TARGET_READING_LEVEL = 8

# Usage
from enhanced_pdf_checker import EnhancedPDFAccessibilityChecker, EnhancedCheckConfig

config = EnhancedCheckConfig(
    vision_api_provider="openai",
    vision_api_key=AccessibilityConfig.OPENAI_API_KEY,
    enable_ocr=True,
    enable_contrast_check=True,
    enable_content_analysis=True,
    verbose=True
)

checker = EnhancedPDFAccessibilityChecker("document.pdf", config)
issues = checker.check_all()
report = checker.generate_report("html")
```

---

## 💰 Cost Comparison

| Service | Cost | Use Case | Coverage |
|---------|------|----------|----------|
| Tesseract OCR | Free | Scanned docs | 100% |
| TextBlob | Free | Readability | 80% |
| OpenAI GPT-4V | $0.01-0.03/image | Alt text validation | 95% |
| Google Vision | $1.50/1000 images | OCR + analysis | 95% |
| Google Document AI | $1.50/1000 pages | Complex OCR | 98% |
| Claude Vision | $0.015/image | Alt text + analysis | 95% |

---

## 🎯 Recommended Setup for Different Budgets

### Free Tier (~60% WCAG Coverage)
```bash
pip install pytesseract textblob pillow pdf2image numpy
# + Basic tool (20%) + OCR (15%) + Readability (15%) + Contrast check (10%)
```

### Budget Tier (~80% WCAG Coverage) - $10/month
- Basic tool (20%)
- Tesseract OCR (15%)
- TextBlob (15%)
- OpenAI API for critical images only (20%)
- Custom contrast checking (10%)

### Professional Tier (~95% WCAG Coverage) - $100/month
- All free tools
- OpenAI GPT-4V for all images (30%)
- Google Document AI for OCR (20%)
- GPT-4 for content analysis (15%)
- Automated link checking (10%)

---

## 🚀 Implementation Roadmap

1. **Week 1**: Integrate OCR (Tesseract) - Free, high impact
2. **Week 2**: Add color contrast checking - Free, fills major gap
3. **Week 3**: Integrate TextBlob for readability - Free, easy win
4. **Week 4**: Add OpenAI vision for critical documents - Paid, but transformative
5. **Week 5**: Polish and optimize API usage - Reduce costs
6. **Week 6**: Add batch processing and caching - Scale efficiently

Total implementation time: ~6 weeks for a production-ready enhanced checker
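
The caching step planned for Week 6 could be sketched as a small content-addressed cache: results are keyed by the SHA-256 hash of the image bytes, so an image that repeats across pages (a logo, for instance) triggers only one paid API call. The `CachedAnalyzer` class, its `cache_path` parameter, and the `analyze` callable are hypothetical names used for illustration; in practice `analyze` would be a function such as `check_image_alt_text_openai`.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, Optional

class CachedAnalyzer:
    """Wrap an image-analysis function with a cache keyed by image content."""

    def __init__(self, analyze: Callable[[bytes], str], cache_path: Optional[str] = None):
        self.analyze = analyze
        self.cache_path = Path(cache_path) if cache_path else None
        self.cache: Dict[str, str] = {}
        self.api_calls = 0  # number of times the wrapped function actually ran

        # Reload earlier results so the cache survives between runs
        if self.cache_path and self.cache_path.exists():
            self.cache = json.loads(self.cache_path.read_text(encoding="utf-8"))

    def __call__(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self.cache:
            # Cache miss: call the (possibly paid) API once and remember the result
            self.cache[key] = self.analyze(image_bytes)
            self.api_calls += 1
            if self.cache_path:
                self.cache_path.write_text(json.dumps(self.cache), encoding="utf-8")
        return self.cache[key]

if __name__ == "__main__":
    # Stand-in for a paid vision API call
    def fake_analyze(image_bytes: bytes) -> str:
        return f"image of {len(image_bytes)} bytes"

    cached = CachedAnalyzer(fake_analyze)
    for blob in (b"logo", b"chart", b"logo"):
        cached(blob)
    print(f"API calls made: {cached.api_calls}")
```

Keying on content rather than on page position means the cache stays valid when pages are reordered, and passing a `cache_path` lets repeated runs over the same document reuse earlier results. This sketch only stores string results; structured responses would need their own serialization.
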