DJP be23e2d300 Complete working implementation with all fixes

- Fixed Python 3.14 compatibility (switched to Python 3.12)
- Upgraded Anthropic SDK to v0.75.0
- Updated Claude model to Sonnet 4.5 (claude-sonnet-4-5-20250929)
- Fixed Google Sheets hyperlink extraction using Sheets API v4
- Extracts rich text hyperlinks from cells correctly
- Fixed WeasyPrint PDF generation (upgraded to v67.0)
- Fixed Jinja2 template naming collision (items -> articles)
- Added extract_hyperlinks.py module for Sheets API v4 integration
- Dual-track scraping: Firecrawl for regular URLs, Apify for social media
- Newsletter-style PDF with Montserrat font
- Complete documentation and setup guides
- All components tested and working

2026-01-06 13:11:44 -05:00

4 KiB

Raw Permalink Blame History

Quick Setup Guide

Implementation Complete!

Your Newsroom Daily Report Generator is ready to use. Here's what's been built:

What It Does

Reads today's URLs from your Google Sheet (column D for dates)
Classifies URLs (social media vs regular websites)
Scrapes content:
- Firecrawl for news articles/blogs
- Apify for Twitter/X, Instagram, TikTok, LinkedIn
Summarizes with Claude API (title + 2-3 bullets per article)
Generates beautiful newsletter-style PDF with Montserrat font

Categories Supported

HARD NEWS
POP CULTURE
PRODUCT SPOTTING
INTERNET CULTURE
INDUSTRY NEWS
SOCIAL UPDATES
INSPIRATION

Next Steps to Get Running

1. Install System Dependencies

macOS (you're on this):

brew install cairo pango gdk-pixbuf libffi

2. Set Up Google Sheets Access

You need to create a Google Service Account:

Go to Google Cloud Console
Create/select project
Enable "Google Sheets API"
Create Service Account
Download JSON key and save as service_account.json in this directory
Important: Share your Google Sheet with the service account email

Detailed instructions in README.md

3. Add Your Anthropic API Key

Edit the .env file:

nano .env

Replace your-anthropic-api-key-here with your actual API key.

4. Install Python Dependencies

# Activate virtual environment (already created)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

5. Test Run

# Make sure venv is activated
source venv/bin/activate

# Run the generator
python newsroom_report.py

What You Have

Files Created

newsroom_report.py - Main script (run this daily)
config.py - Configuration management
scraper.py - Firecrawl integration
apify_scrapers.py - Social media scraping
content_processor.py - URL classification
summarizer.py - Claude AI summarization
pdf_generator.py - PDF generation
templates/ - HTML/CSS for newsletter design
static/montserrat/ - Montserrat font files
requirements.txt - Python dependencies
.env - Configuration (edit this with your Anthropic key)
README.md - Complete documentation

Git Repository

Already initialized and ready to push:

# Push to Bitbucket when ready (using SSH key djp1971)
git push -u origin master

Costs

Monthly estimates for 20 URLs/day (600/month):

Firecrawl: ~$20-30
Apify: ~$15-50
Claude API: ~$10-20
Google Sheets: Free
Total: ~$45-100/month

Support Resources

README.md - Full documentation
Troubleshooting section - Common issues and solutions
Test commands - Verify each component works

Daily Usage

Once set up, just run:

cd /Users/daveporter/Desktop/CODING-2024/newsroom-reporter
source venv/bin/activate
python newsroom_report.py

PDF will be saved to: reports/Newsroom_Report_YYYY-MM-DD.pdf

Key Configuration

Already set in .env.example and copied to .env:

Google Sheet ID: 1vGSZIST0ruKdYRGSgNz1W8AueQGFHHbZ7D6zXFVNKeA
Firecrawl API Key: fc-3dfbb10dca12469998ad9e0db490d622
Apify API Key: apify_api_61KN8cz07owBqcFAcfcSdPWMwAJEAm3julCF
You need to add: Anthropic API Key

Notes

Date format in sheet: "Tuesday, January 6" (no year)
Dates should be in Column D
Script automatically finds today's date
Social media scraping via Apify (reliable for all platforms)
Falls back gracefully if any URL fails to scrape

Architecture Highlights

Smart URL Classification

Automatically detects and routes URLs to the appropriate scraper (social vs regular).

Parallel Processing

Scrapes multiple URLs efficiently using batch operations where possible.

Error Handling

Graceful fallbacks for failed scrapes - continues processing other URLs.

Professional PDF with:

Gradient header
Category sections with colored borders
Article cards with bullet points
Clean typography with Montserrat font
Source links for each article

Enjoy your automated newsroom reports!

4 KiB Raw Permalink Blame History