volt-newsroom-scraper-report/SETUP_GUIDE.md
DJP be23e2d300 Complete working implementation with all fixes
- Fixed Python 3.14 compatibility (switched to Python 3.12)
- Upgraded Anthropic SDK to v0.75.0
- Updated Claude model to Sonnet 4.5 (claude-sonnet-4-5-20250929)
- Fixed Google Sheets hyperlink extraction using Sheets API v4
- Extracts rich text hyperlinks from cells correctly
- Fixed WeasyPrint PDF generation (upgraded to v67.0)
- Fixed Jinja2 template naming collision (items -> articles)
- Added extract_hyperlinks.py module for Sheets API v4 integration
- Dual-track scraping: Firecrawl for regular URLs, Apify for social media
- Newsletter-style PDF with Montserrat font
- Complete documentation and setup guides
- All components tested and working
2026-01-06 13:11:44 -05:00

4 KiB

Quick Setup Guide

Implementation Complete!

Your Newsroom Daily Report Generator is ready to use. Here's what's been built:

What It Does

  1. Reads today's URLs from your Google Sheet (column D for dates)
  2. Classifies URLs (social media vs regular websites)
  3. Scrapes content:
    • Firecrawl for news articles/blogs
    • Apify for Twitter/X, Instagram, TikTok, LinkedIn
  4. Summarizes with Claude API (title + 2-3 bullets per article)
  5. Generates beautiful newsletter-style PDF with Montserrat font

Categories Supported

  • HARD NEWS
  • POP CULTURE
  • PRODUCT SPOTTING
  • INTERNET CULTURE
  • INDUSTRY NEWS
  • SOCIAL UPDATES
  • INSPIRATION

Next Steps to Get Running

1. Install System Dependencies

macOS (you're on this):

brew install cairo pango gdk-pixbuf libffi

2. Set Up Google Sheets Access

You need to create a Google Service Account:

  1. Go to Google Cloud Console
  2. Create/select project
  3. Enable "Google Sheets API"
  4. Create Service Account
  5. Download JSON key and save as service_account.json in this directory
  6. Important: Share your Google Sheet with the service account email

Detailed instructions in README.md

3. Add Your Anthropic API Key

Edit the .env file:

nano .env

Replace your-anthropic-api-key-here with your actual API key.

4. Install Python Dependencies

# Activate virtual environment (already created)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

5. Test Run

# Make sure venv is activated
source venv/bin/activate

# Run the generator
python newsroom_report.py

What You Have

Files Created

  • newsroom_report.py - Main script (run this daily)
  • config.py - Configuration management
  • scraper.py - Firecrawl integration
  • apify_scrapers.py - Social media scraping
  • content_processor.py - URL classification
  • summarizer.py - Claude AI summarization
  • pdf_generator.py - PDF generation
  • templates/ - HTML/CSS for newsletter design
  • static/montserrat/ - Montserrat font files
  • requirements.txt - Python dependencies
  • .env - Configuration (edit this with your Anthropic key)
  • README.md - Complete documentation

Git Repository

Already initialized and ready to push:

# Push to Bitbucket when ready (using SSH key djp1971)
git push -u origin master

Costs

Monthly estimates for 20 URLs/day (600/month):

  • Firecrawl: ~$20-30
  • Apify: ~$15-50
  • Claude API: ~$10-20
  • Google Sheets: Free
  • Total: ~$45-100/month

Support Resources

  • README.md - Full documentation
  • Troubleshooting section - Common issues and solutions
  • Test commands - Verify each component works

Daily Usage

Once set up, just run:

cd /Users/daveporter/Desktop/CODING-2024/newsroom-reporter
source venv/bin/activate
python newsroom_report.py

PDF will be saved to: reports/Newsroom_Report_YYYY-MM-DD.pdf

Key Configuration

Already set in .env.example and copied to .env:

  • Google Sheet ID: 1vGSZIST0ruKdYRGSgNz1W8AueQGFHHbZ7D6zXFVNKeA
  • Firecrawl API Key: fc-3dfbb10dca12469998ad9e0db490d622
  • Apify API Key: apify_api_61KN8cz07owBqcFAcfcSdPWMwAJEAm3julCF
  • You need to add: Anthropic API Key

Notes

  • Date format in sheet: "Tuesday, January 6" (no year)
  • Dates should be in Column D
  • Script automatically finds today's date
  • Social media scraping via Apify (reliable for all platforms)
  • Falls back gracefully if any URL fails to scrape

Architecture Highlights

Smart URL Classification

Automatically detects and routes URLs to the appropriate scraper (social vs regular).

Parallel Processing

Scrapes multiple URLs efficiently using batch operations where possible.

Error Handling

Graceful fallbacks for failed scrapes - continues processing other URLs.

Newsletter Design

Professional PDF with:

  • Gradient header
  • Category sections with colored borders
  • Article cards with bullet points
  • Clean typography with Montserrat font
  • Source links for each article

Enjoy your automated newsroom reports!