
Newsroom Daily Report Generator

Automatically generates beautiful PDF reports from your daily newsroom Google Sheet. Scrapes content from both regular websites and social media, summarizes with AI, and produces a professional newsletter-style PDF.

Features

  • Google Sheets Integration - Automatically reads URLs from your newsroom spreadsheet
  • Dual-Track Scraping
    • Firecrawl for news articles and regular websites
    • Apify for social media (Twitter/X, Instagram, TikTok, LinkedIn)
  • AI Summarization - Claude API generates concise bullet-point summaries
  • Beautiful PDF Output - Newsletter-style reports with Montserrat font
  • Category Organization - Preserves the sheet's seven-category structure
  • Simple Daily Usage - Just run one command each morning
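
The dual-track routing described above can be illustrated with a minimal sketch. It is not the project's actual content_processor.py; the function name choose_scraper and the domain list are assumptions made for illustration, based only on the platforms named in the feature list.

```python
from urllib.parse import urlparse

# Assumed domain list, derived from the platforms named in the feature list.
SOCIAL_DOMAINS = {"twitter.com", "x.com", "instagram.com", "tiktok.com", "linkedin.com"}


def choose_scraper(url):
    """Return "apify" for social media URLs and "firecrawl" for everything else."""
    host = (urlparse(url).hostname or "").lower()
    for domain in SOCIAL_DOMAINS:
        # Match the bare domain or any subdomain (www.instagram.com, m.twitter.com).
        # Requiring the leading dot keeps netflix.com from matching x.com.
        if host == domain or host.endswith("." + domain):
            return "apify"
    return "firecrawl"
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives from paths or query strings that merely mention a social platform.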

Prerequisites

System Dependencies

Python Version: Requires Python 3.10, 3.11, 3.12, or 3.13

  • Do NOT use Python 3.14 - it has compatibility issues with dependencies
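
To confirm the interpreter inside a virtual environment falls in the supported range, a small check like the following may help. It is a sketch only; the function name is_supported is hypothetical and the bounds are taken from the version list above.

```python
import sys

# Supported range from this README: 3.10 through 3.13 inclusive.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 13)


def is_supported(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


if __name__ == "__main__":
    current = "%d.%d" % sys.version_info[:2]
    status = "supported" if is_supported() else "NOT supported"
    print(f"Python {current} is {status} by this project")
```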

macOS (Homebrew):

# Verify Python version (should be 3.10-3.13, NOT 3.14)
python3.12 --version  # Use python3.12 specifically

brew install cairo pango gdk-pixbuf libffi

Linux (Ubuntu/Debian):

sudo apt-get install -y python3 python3-pip python3-venv libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev

API Keys

You'll need:

  1. Google Service Account (for Sheets access) - Setup Guide
  2. Firecrawl API Key - Already provided (keep the value in .env, not in this README)
  3. Apify API Key - Already provided (keep the value in .env, not in this README)
  4. Anthropic API Key - Your Claude API key

Installation

1. Clone or Navigate to Repository

cd newsroom-reporter

2. Create Virtual Environment

Local (macOS with Homebrew):

# IMPORTANT: Use Python 3.12 (NOT 3.14)
python3.12 --version

# Create virtual environment with Python 3.12
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

Server (Linux):

# Ensure Python 3.10+ is installed
python3 --version

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate

3. Install Python Dependencies

pip install -r requirements.txt

4. Google Sheets Setup

  1. Go to Google Cloud Console
  2. Create a new project (or use existing)
  3. Enable Google Sheets API:
    • Navigate to "APIs & Services" → "Enable APIs and Services"
    • Search for "Google Sheets API" and enable it
  4. Create a Service Account:
    • Go to "APIs & Services" → "Credentials"
    • Click "Create Credentials" → "Service Account"
    • Give it a name (e.g., "newsroom-reporter")
    • Skip optional fields and click "Done"
  5. Download the JSON key:
    • Click on the created service account
    • Go to "Keys" tab
    • Click "Add Key" → "Create new key" → "JSON"
    • Save the downloaded file as service_account.json in the project root
  6. Share your Google Sheet:
    • Open your newsroom Google Sheet
    • Click "Share"
    • Copy the email from service_account.json (looks like: your-service-account@project-id.iam.gserviceaccount.com)
    • Paste it in the share dialog and give "Viewer" access
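
The email address to share the sheet with is stored in the client_email field of the downloaded key file. A minimal sketch for printing it, assuming the file was saved as service_account.json in the project root (the function name is hypothetical):

```python
import json
from pathlib import Path


def service_account_email(path="service_account.json"):
    """Read the service account email from a Google key file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Google service account key files store the address under "client_email".
    return data["client_email"]
```

Calling print(service_account_email()) from the project root shows the address to paste into the share dialog.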

5. Configuration

Copy the example environment file:

cp .env.example .env

Edit .env and add your Anthropic API key:

nano .env  # or use your preferred editor

The file should look like:

# Google Sheets Configuration
GOOGLE_SHEET_ID=1vGSZIST0ruKdYRGSgNz1W8AueQGFHHbZ7D6zXFVNKeA
GOOGLE_SHEET_TAB=2025 newsroom

# API Keys
FIRECRAWL_API_KEY=your-firecrawl-api-key-here
APIFY_API_KEY=your-apify-api-key-here
ANTHROPIC_API_KEY=your-actual-anthropic-api-key-here

# Google Service Account
GOOGLE_SERVICE_ACCOUNT_FILE=service_account.json
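
Before the first run it can be useful to check that no required setting was left empty. The following is a standard-library sketch, not the project's config.py; the helper names are hypothetical and the key list simply mirrors the example file above.

```python
# Keys taken from the example .env above.
REQUIRED_KEYS = (
    "GOOGLE_SHEET_ID",
    "GOOGLE_SHEET_TAB",
    "FIRECRAWL_API_KEY",
    "APIFY_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_SERVICE_ACCOUNT_FILE",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text):
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, missing_keys(open(".env").read()) returns an empty list when every setting is filled in.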

Usage

Daily Report Generation

Make sure your virtual environment is activated:

source venv/bin/activate

Run the report generator:

python newsroom_report.py

The script will:

  1. Find today's date in your Google Sheet
  2. Extract all URLs from the 7 categories
  3. Scrape content from each URL
  4. Generate AI summaries
  5. Create a PDF report in the reports/ folder

Output:

reports/Newsroom_Report_YYYY-MM-DD.pdf
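
The output name follows an ISO date pattern. A small sketch of how such a path can be built, assuming the date in the filename is the report date (the function name report_path is hypothetical):

```python
from datetime import date
from pathlib import Path


def report_path(report_date, reports_dir="reports"):
    """Build reports/Newsroom_Report_YYYY-MM-DD.pdf for the given date."""
    # date.isoformat() yields the zero-padded YYYY-MM-DD form used in the filename.
    return Path(reports_dir) / f"Newsroom_Report_{report_date.isoformat()}.pdf"
```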

Testing

To test individual components:

Test PDF generation:

python pdf_generator.py

Test configuration:

python -c "from config import Config; print('Configuration valid!')"

Project Structure

newsroom-reporter/
├── venv/                        # Virtual environment
├── newsroom_report.py           # Main script - run this daily
├── config.py                    # Configuration loader
├── scraper.py                   # Firecrawl integration
├── apify_scrapers.py           # Apify social media scrapers
├── content_processor.py         # URL detection & routing
├── summarizer.py                # Claude API summarization
├── pdf_generator.py             # PDF creation
├── requirements.txt             # Python dependencies
├── .env                        # Your API keys (not in git)
├── .gitignore                  # Git ignore rules
├── README.md                   # This file
├── service_account.json        # Google credentials (not in git)
├── templates/
│   ├── newsletter_template.html
│   └── newsletter_styles.css
├── static/
│   └── montserrat/             # Montserrat font files
└── reports/                    # Generated PDFs
    └── Newsroom_Report_YYYY-MM-DD.pdf

Troubleshooting

Virtual Environment Issues

Problem: source venv/bin/activate doesn't work

Solution:

# Make sure you're in the project directory
cd /path/to/newsroom-reporter

# If venv doesn't exist, create it
python3 -m venv venv

# Activate (bash/zsh)
source venv/bin/activate

# Activate (fish shell)
source venv/bin/activate.fish

Google Sheets Connection Failed

Problem: "Error connecting to Google Sheet"

Solutions:

  1. Verify service_account.json is in the project root
  2. Check that you shared the sheet with the service account email
  3. Verify the sheet ID in .env is correct
  4. Ensure Google Sheets API is enabled in Google Cloud Console

Date Not Found

Problem: "Could not find today's date in the sheet"

Solutions:

  1. Check that today's date exists in Column D
  2. Verify date format matches: "Tuesday, January 6"
  3. Make sure you're looking at the correct tab: "2025 newsroom"
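
To see exactly which string must appear in Column D for a given day, the expected label can be generated with a short sketch like this. The function name sheet_date_label is hypothetical, and English weekday and month names are assumed (they depend on the active locale).

```python
from datetime import date


def sheet_date_label(d):
    """Format a date like the sheet does, e.g. "Tuesday, January 6"."""
    # %A and %B give the full weekday and month names; d.day avoids the
    # zero padding that %d would add ("06" instead of "6").
    return f"{d:%A, %B} {d.day}"


if __name__ == "__main__":
    print(sheet_date_label(date.today()))
```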

PDF Generation Failed

Problem: "Error generating PDF"

Solutions:

  1. macOS: Install system dependencies:

    brew install cairo pango gdk-pixbuf libffi
    
  2. Linux: Install system dependencies:

    sudo apt-get install -y libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev
    
  3. Ensure Montserrat fonts are in static/montserrat/

API Rate Limits

If you hit rate limits:

  • Firecrawl: Space out requests or upgrade plan
  • Apify: Check your monthly credit usage
  • Anthropic: Monitor token usage in console

Production Server Installation

Step 1: Server Setup

Requirements:

  • Ubuntu/Debian Linux server
  • Apache 2.4+ with PHP 8.0+
  • Python 3.10-3.13
  • SSL certificate (for HTTPS)
  • SSH access

Step 2: Upload Files

# From your local machine, upload to server
rsync -avz --exclude='venv' --exclude='.git' --exclude='reports' --exclude='temp_screenshots' \
  /Users/daveporter/Desktop/CODING-2024/newsroom-reporter/ \
  user@yourserver.com:/var/www/newsroom-reporter/

Step 3: Install System Dependencies

SSH into your server:

ssh user@yourserver.com
cd /var/www/newsroom-reporter

Install required packages:

# Python and dependencies
sudo apt-get update
sudo apt-get install -y python3 python3-pip python3-venv

# System dependencies for WeasyPrint PDF generation
sudo apt-get install -y libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev

# Apache and PHP
sudo apt-get install -y apache2 php libapache2-mod-php php-curl php-json

Step 4: Python Virtual Environment

# Create virtual environment
python3 -m venv venv

# Activate
source venv/bin/activate

# Install Python dependencies
pip install -r requirements.txt

Step 5: Configuration

A. Python Configuration:

# Copy and edit main .env
cp .env.example .env
nano .env

Add your API keys (same as local setup).

B. Upload Service Account:

From local machine:

scp service_account.json user@yourserver.com:/var/www/newsroom-reporter/

C. Web Interface Configuration:

cd web/
cp .env.example .env
nano .env

Add SSO settings:

SSO_ENABLED=true
SSO_TENANT_ID=your-tenant-id
SSO_CLIENT_ID=your-client-id

Step 6: Set File Permissions

# Make reports directory writable by web server
sudo mkdir -p reports temp_screenshots
sudo chown -R www-data:www-data reports/ temp_screenshots/
sudo chmod -R 775 reports/ temp_screenshots/

# Make Python script executable
chmod +x newsroom_report.py

# Protect sensitive files
chmod 600 .env web/.env service_account.json
sudo chown www-data:www-data .env web/.env service_account.json

Step 7: Apache Virtual Host

Create /etc/apache2/sites-available/newsroom-reporter.conf:

<VirtualHost *:80>
    ServerName newsroom.yourdomain.com
    ServerAdmin admin@yourdomain.com
    DocumentRoot /var/www/newsroom-reporter/web

    <Directory /var/www/newsroom-reporter/web>
        Options -Indexes +FollowSymLinks
        AllowOverride All
        Require all granted

        # PHP settings for long-running scripts
        php_value max_execution_time 600
        php_value memory_limit 512M
    </Directory>

    # Deny access to parent directory
    <Directory /var/www/newsroom-reporter>
        Require all denied
    </Directory>

    # Allow access to reports for downloads
    <Directory /var/www/newsroom-reporter/reports>
        Options -Indexes
        Require all granted
    </Directory>

    ErrorLog ${APACHE_LOG_DIR}/newsroom-error.log
    CustomLog ${APACHE_LOG_DIR}/newsroom-access.log combined

    # Redirect to HTTPS
    RewriteEngine On
    RewriteCond %{HTTPS} off
    RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
</VirtualHost>

<VirtualHost *:443>
    ServerName newsroom.yourdomain.com
    DocumentRoot /var/www/newsroom-reporter/web

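    <Directory /var/www/newsroom-reporter/web>
        Options -Indexes +FollowSymLinks
        AllowOverride All
        Require all granted

        # PHP settings for long-running scripts (port 80 only redirects here)
        php_value max_execution_time 600
        php_value memory_limit 512M
    </Directory>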

    # SSL Configuration
    SSLEngine on
    SSLCertificateFile /etc/ssl/certs/your-cert.crt
    SSLCertificateKeyFile /etc/ssl/private/your-key.key
    SSLCertificateChainFile /etc/ssl/certs/your-chain.crt

    # Security Headers
    Header always set Strict-Transport-Security "max-age=31536000"
    Header always set X-Frame-Options "SAMEORIGIN"
    Header always set X-Content-Type-Options "nosniff"

    ErrorLog ${APACHE_LOG_DIR}/newsroom-ssl-error.log
    CustomLog ${APACHE_LOG_DIR}/newsroom-ssl-access.log combined
</VirtualHost>

Step 8: Enable Apache Modules and Site

# Enable required modules
sudo a2enmod rewrite
sudo a2enmod headers
sudo a2enmod ssl

# Enable the site
sudo a2ensite newsroom-reporter.conf

# Test configuration
sudo apache2ctl configtest

# Restart Apache
sudo systemctl restart apache2

Step 9: Update Azure AD

Add production redirect URI to your Azure AD app:

https://newsroom.yourdomain.com/index-simple.php

Step 10: Test Production

  1. Visit: https://newsroom.yourdomain.com
  2. Sign in with Microsoft SSO
  3. Generate a test report
  4. Verify PDF downloads correctly

Automated Daily Reports (Optional)

Set up cron job for automatic generation:

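# Create a log file that www-data can write to (it cannot create files in /var/log by default)
sudo touch /var/log/newsroom-reports.log
sudo chown www-data:www-data /var/log/newsroom-reports.log

# Edit crontab for www-data user
sudo crontab -u www-data -e

# Add: Run daily at 9 AM
0 9 * * * cd /var/www/newsroom-reporter && /var/www/newsroom-reporter/venv/bin/python newsroom_report.py >> /var/log/newsroom-reports.log 2>&1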

Web Interface Usage

Local Development (MAMP)

# Create symlink in MAMP htdocs
cd /Applications/MAMP/htdocs
ln -s /Users/daveporter/Desktop/CODING-2024/newsroom-reporter/web newsroom-reporter

# Configure SSO (or disable for testing)
cd /Users/daveporter/Desktop/CODING-2024/newsroom-reporter/web
nano .env  # Set SSO_ENABLED=false for local testing

# Access at: http://localhost:8888/newsroom-reporter/index-simple.php

Production (Apache)

Access at: https://newsroom.yourdomain.com

Features:

  • SSO authentication (same as NANO-RESEARCH)
  • Date selection with auto-populated current date
  • Real-time processing feedback
  • Secure PDF downloads
  • Black theme with #FFC407 yellow accents

Command Line Options

Generate report for specific date:

python newsroom_report.py "Tuesday, December 23"

Generate report for today:

python newsroom_report.py
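
The date argument carries no year, so a label such as "Tuesday, December 23" only identifies a day once a year is chosen. The sketch below shows one way to resolve it; it is not taken from newsroom_report.py, and both function names are hypothetical.

```python
from datetime import date, timedelta


def sheet_date_label(d):
    """Format a date like the sheet does, e.g. "Tuesday, December 23"."""
    return f"{d:%A, %B} {d.day}"


def resolve_label(label, year):
    """Return the date in the given year whose label matches, or None."""
    day = date(year, 1, 1)
    while day.year == year:
        if sheet_date_label(day) == label:
            return day
        day += timedelta(days=1)
    return None
```

Because the weekday is part of the label, a mismatched pair such as "Monday, December 23" for 2025 resolves to None, which makes typos easy to catch before any scraping starts.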

Features

  • Screenshot Capture - Automatic screenshots of regular websites (500px wide)
  • Smart Date Handling - Generate reports for any date in the sheet
  • Dual Scraping - Firecrawl for websites, Apify for social media
  • AI Summaries - Claude Sonnet 4.5 generates concise bullet points
  • Newsletter PDF - Black/yellow theme, Montserrat font
  • Web Interface - Beautiful UI with SSO authentication

Security Checklist for Production

  • HTTPS enabled (SSL certificate)
  • .env files have 600 permissions
  • service_account.json has 600 permissions
  • Reports directory writable by www-data
  • Azure AD redirect URI updated
  • Firewall configured (ports 80, 443 open)
  • Errors logged to file, with PHP display_errors turned off
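
The permission items in this checklist can be verified with a short script. This is a sketch under the assumption that it runs from the project root on the server; the function name is hypothetical.

```python
import os
import stat


def is_owner_only(path):
    """Return True if the file mode is exactly 600 (owner read/write only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


if __name__ == "__main__":
    # File list taken from the "Protect sensitive files" step above.
    for name in (".env", "web/.env", "service_account.json"):
        if os.path.exists(name):
            print(name, "OK" if is_owner_only(name) else "permissions too open")
        else:
            print(name, "missing")
```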

Monitoring

# Check Apache logs
tail -f /var/log/apache2/newsroom-error.log

# Check automated report logs (if cron enabled)
tail -f /var/log/newsroom-reports.log

# Check disk space (reports accumulate)
du -sh reports/
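
Since reports accumulate, it may help to list the PDFs that are old enough to archive. The following sketch only reports candidates and deletes nothing; the function names and the 30-day threshold in the usage note are assumptions.

```python
import time
from pathlib import Path


def reports_older_than(reports_dir, days, now=None):
    """Return report PDFs whose modification time is more than `days` days ago."""
    cutoff = (now if now is not None else time.time()) - days * 86400
    return sorted(
        p for p in Path(reports_dir).glob("Newsroom_Report_*.pdf")
        if p.stat().st_mtime < cutoff
    )


def total_size_bytes(reports_dir):
    """Sum the size of all PDFs in the reports directory."""
    return sum(p.stat().st_size for p in Path(reports_dir).glob("*.pdf"))
```

For example, reports_older_than("reports", 30) lists reports last modified more than 30 days ago.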

Cost Estimates

Monthly costs for ~20 URLs per day (600/month):

  • Firecrawl: ~$20-30
  • Apify: ~$15-50
  • Anthropic Claude API: ~$10-20
  • Google Sheets API: Free

Total: ~$45-100/month
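
The total and the implied cost per URL follow directly from the ranges above. A small worked calculation, using only the figures listed in this section (the function names are illustrative):

```python
# Monthly cost ranges in USD (low, high), copied from the list above.
COSTS = {
    "Firecrawl": (20, 30),
    "Apify": (15, 50),
    "Anthropic Claude API": (10, 20),
    "Google Sheets API": (0, 0),
}
URLS_PER_MONTH = 600  # ~20 URLs per day


def total_range(costs):
    """Sum the low and high ends of every service's range."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


def per_url_range(costs, urls):
    """Divide the monthly totals by the number of URLs processed."""
    low, high = total_range(costs)
    return round(low / urls, 3), round(high / urls, 3)
```

With these inputs the total is $45 to $100 per month, or roughly $0.08 to $0.17 per URL.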

Much cheaper than official social media APIs (which would cost $200-5000/month)!

Support

For issues or questions:

  1. Check the Troubleshooting section
  2. Review error messages carefully
  3. Verify all API keys are correct
  4. Ensure virtual environment is activated

License

Proprietary - Internal Use Only

Git Repository

This project can be backed up to:

bitbucket.org:zlalani/volt-newsroom-scraper-report.git

Using SSH key: djp1971