Newsroom Daily Report Generator
Automatically generates beautiful PDF reports from your daily newsroom Google Sheet. Scrapes content from both regular websites and social media, summarizes with AI, and produces a professional newsletter-style PDF.
Features
- Google Sheets Integration - Automatically reads URLs from your newsroom spreadsheet
- Dual-Track Scraping
- Firecrawl for news articles and regular websites
- Apify for social media (Twitter/X, Instagram, TikTok, LinkedIn)
- AI Summarization - Claude API generates concise bullet-point summaries
- Beautiful PDF Output - Newsletter-style reports with Montserrat font
- Category Organization - Maintains your 7-category structure
- Simple Daily Usage - Just run one command each morning
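The dual-track routing described above can be sketched as a simple hostname check. This is a minimal illustration, not the project's actual content_processor.py: the SOCIAL_DOMAINS set and the choose_scraper function are hypothetical names, and the domain list is assumed from the platforms named in the feature list.

```python
from urllib.parse import urlparse

# Hypothetical domain list, assumed from the platforms named above.
SOCIAL_DOMAINS = {
    "twitter.com", "x.com", "instagram.com", "tiktok.com", "linkedin.com",
}


def choose_scraper(url):
    """Return "apify" for social media URLs and "firecrawl" for everything else."""
    host = (urlparse(url).hostname or "").lower()
    # Match the bare domain or any subdomain (www., mobile., and so on).
    for domain in SOCIAL_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return "apify"
    return "firecrawl"
```

Matching on the parsed hostname rather than on a substring keeps an unrelated site such as notx.com from being routed to the social media track.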
Prerequisites
System Dependencies
Python Version: Requires Python 3.10, 3.11, 3.12, or 3.13
- Do NOT use Python 3.14 - it has compatibility issues with dependencies
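The version constraint above can be verified from inside Python before installing anything. A minimal sketch, assuming the supported range is exactly 3.10 through 3.13 as stated; is_supported_python is a hypothetical helper, not part of the project:

```python
import sys

# Supported range taken from this README: 3.10 through 3.13 inclusive.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 13)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    status = "supported" if is_supported_python() else "NOT supported"
    print(f"Python {current} is {status}")
```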
macOS (Homebrew):
# Verify Python version (should be 3.10-3.13, NOT 3.14)
python3.12 --version # Use python3.12 specifically
brew install cairo pango gdk-pixbuf libffi
Linux (Ubuntu/Debian):
sudo apt-get install -y python3 python3-pip libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev
API Keys
You'll need:
- Google Service Account (for Sheets access) - Setup Guide
- Firecrawl API Key - Already provided (set as FIRECRAWL_API_KEY in .env)
- Apify API Key - Already provided (set as APIFY_API_KEY in .env)
- Anthropic API Key - Your Claude API key
Installation
1. Clone or Navigate to Repository
cd newsroom-reporter
2. Create Virtual Environment
Local (macOS with Homebrew):
# IMPORTANT: Use Python 3.12 (NOT 3.14)
python3.12 --version
# Create virtual environment with Python 3.12
python3.12 -m venv venv
# Activate virtual environment
source venv/bin/activate
Server (Linux):
# Ensure Python 3.10-3.13 is installed (NOT 3.14)
python3 --version
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate
3. Install Python Dependencies
pip install -r requirements.txt
4. Google Sheets Setup
- Go to Google Cloud Console
- Create a new project (or use existing)
- Enable Google Sheets API:
- Navigate to "APIs & Services" → "Enable APIs and Services"
- Search for "Google Sheets API" and enable it
- Create a Service Account:
- Go to "APIs & Services" → "Credentials"
- Click "Create Credentials" → "Service Account"
- Give it a name (e.g., "newsroom-reporter")
- Skip optional fields and click "Done"
- Download the JSON key:
- Click on the created service account
- Go to "Keys" tab
- Click "Add Key" → "Create new key" → "JSON"
- Save the downloaded file as service_account.json in the project root
- Share your Google Sheet:
- Open your newsroom Google Sheet
- Click "Share"
- Copy the email from service_account.json (looks like: your-service-account@project-id.iam.gserviceaccount.com)
- Paste it in the share dialog and give "Viewer" access
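The service account email to share the sheet with is stored in the client_email field of the downloaded key file. A minimal sketch for printing it, assuming the file sits in the project root as described; service_account_email is a hypothetical helper name:

```python
import json


def service_account_email(path="service_account.json"):
    """Return the client_email field from a Google service account key file."""
    with open(path, encoding="utf-8") as handle:
        key_data = json.load(handle)
    return key_data["client_email"]


if __name__ == "__main__":
    print(service_account_email())
```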
5. Configuration
Copy the example environment file:
cp .env.example .env
Edit .env and add your Anthropic API key:
nano .env # or use your preferred editor
The file should look like:
# Google Sheets Configuration
GOOGLE_SHEET_ID=1vGSZIST0ruKdYRGSgNz1W8AueQGFHHbZ7D6zXFVNKeA
GOOGLE_SHEET_TAB=2025 newsroom
# API Keys
FIRECRAWL_API_KEY=your-firecrawl-api-key-here
APIFY_API_KEY=your-apify-api-key-here
ANTHROPIC_API_KEY=your-actual-anthropic-api-key-here
# Google Service Account
GOOGLE_SERVICE_ACCOUNT_FILE=service_account.json
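A quick way to catch a missing or empty setting before running the generator is to parse the file and compare it against the keys listed above. This is a sketch under assumptions: it does not reflect how config.py actually loads settings, and parse_env and missing_keys are hypothetical helpers.

```python
# Keys taken from the example .env above.
REQUIRED_KEYS = [
    "GOOGLE_SHEET_ID", "GOOGLE_SHEET_TAB", "FIRECRAWL_API_KEY",
    "APIFY_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_SERVICE_ACCOUNT_FILE",
]


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values):
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, missing_keys(parse_env(open(".env").read())) returns an empty list when every setting is present.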
Usage
Daily Report Generation
Make sure your virtual environment is activated:
source venv/bin/activate
Run the report generator:
python newsroom_report.py
The script will:
- Find today's date in your Google Sheet
- Extract all URLs from the 7 categories
- Scrape content from each URL
- Generate AI summaries
- Create a PDF report in the reports/ folder
Output:
reports/Newsroom_Report_YYYY-MM-DD.pdf
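The output naming pattern above maps directly onto an ISO date. A minimal sketch of how such a path could be built; report_path is a hypothetical helper and may differ from what pdf_generator.py does:

```python
from datetime import date
from pathlib import Path


def report_path(report_date, reports_dir="reports"):
    """Build reports/Newsroom_Report_YYYY-MM-DD.pdf for the given date."""
    filename = f"Newsroom_Report_{report_date.isoformat()}.pdf"
    return Path(reports_dir) / filename


if __name__ == "__main__":
    print(report_path(date.today()))
```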
Testing
To test individual components:
Test PDF generation:
python pdf_generator.py
Test configuration:
python -c "from config import Config; print('Configuration valid!')"
Project Structure
newsroom-reporter/
├── venv/ # Virtual environment
├── newsroom_report.py # Main script - run this daily
├── config.py # Configuration loader
├── scraper.py # Firecrawl integration
├── apify_scrapers.py # Apify social media scrapers
├── content_processor.py # URL detection & routing
├── summarizer.py # Claude API summarization
├── pdf_generator.py # PDF creation
├── requirements.txt # Python dependencies
├── .env # Your API keys (not in git)
├── .gitignore # Git ignore rules
├── README.md # This file
├── service_account.json # Google credentials (not in git)
├── templates/
│ ├── newsletter_template.html
│ └── newsletter_styles.css
├── static/
│ └── montserrat/ # Montserrat font files
└── reports/ # Generated PDFs
└── Newsroom_Report_YYYY-MM-DD.pdf
Troubleshooting
Virtual Environment Issues
Problem: source venv/bin/activate doesn't work
Solution:
# Make sure you're in the project directory
cd /path/to/newsroom-reporter
# If venv doesn't exist, create it
python3 -m venv venv
# Activate (bash/zsh)
source venv/bin/activate
# Activate (fish shell)
source venv/bin/activate.fish
Google Sheets Connection Failed
Problem: "Error connecting to Google Sheet"
Solutions:
- Verify service_account.json is in the project root
- Check that you shared the sheet with the service account email
- Verify the sheet ID in .env is correct
- Ensure Google Sheets API is enabled in Google Cloud Console
Date Not Found
Problem: "Could not find today's date in the sheet"
Solutions:
- Check that today's date exists in Column D
- Verify date format matches: "Tuesday, January 6"
- Make sure you're looking at the correct tab: "2025 newsroom"
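To debug a date mismatch, it can help to generate the expected label and look for it in the Column D values by hand. A minimal sketch, assuming the sheet stores dates as text in the "Tuesday, January 6" format with no year; sheet_date_label and find_date_row are hypothetical helpers, and the list of cell values is assumed to have been fetched already:

```python
from datetime import date


def sheet_date_label(day):
    """Format a date like "Tuesday, January 6" (no year, no zero padding)."""
    # day.day avoids the platform-specific %-d / %#d strftime flags.
    return f"{day:%A}, {day:%B} {day.day}"


def find_date_row(column_values, label):
    """Return the 1-based row whose cell matches label, or None if absent."""
    wanted = label.strip().lower()
    for row_number, cell in enumerate(column_values, start=1):
        if str(cell).strip().lower() == wanted:
            return row_number
    return None
```

The comparison ignores case and surrounding whitespace, which are common causes of a "date not found" error when cells are typed by hand.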
PDF Generation Failed
Problem: "Error generating PDF"
Solutions:
- macOS: Install system dependencies:
brew install cairo pango gdk-pixbuf libffi
- Linux: Install system dependencies:
sudo apt-get install -y libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev
- Ensure Montserrat fonts are in static/montserrat/
API Rate Limits
If you hit rate limits:
- Firecrawl: Space out requests or upgrade plan
- Apify: Check your monthly credit usage
- Anthropic: Monitor token usage in console
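One common way to space out requests is to retry with exponential backoff. The sketch below is a generic illustration, not code from this project: backoff_delays and call_with_retry are hypothetical helpers, and catching every Exception is a simplification; in practice you would catch only the rate-limit error raised by the client you are using.

```python
import time


def backoff_delays(max_retries, base_seconds=1.0, cap_seconds=30.0):
    """Return the wait before each retry: base * 2**n, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(max_retries)]


def call_with_retry(func, max_retries=4, base_seconds=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry up to max_retries times."""
    for delay in backoff_delays(max_retries, base_seconds):
        try:
            return func()
        except Exception:
            # Simplification: a real caller should catch the specific error type.
            sleep(delay)
    # Final attempt: let any exception propagate to the caller.
    return func()
```

With the defaults, a request is attempted up to five times, waiting 1, 2, 4, and 8 seconds between attempts.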
Production Server Installation
Step 1: Server Setup
Requirements:
- Ubuntu/Debian Linux server
- Apache 2.4+ with PHP 8.0+
- Python 3.10-3.13
- SSL certificate (for HTTPS)
- SSH access
Step 2: Upload Files
# From your local machine, upload to server
rsync -avz --exclude='venv' --exclude='.git' --exclude='reports' --exclude='temp_screenshots' \
/Users/daveporter/Desktop/CODING-2024/newsroom-reporter/ \
user@yourserver.com:/var/www/newsroom-reporter/
Step 3: Install System Dependencies
SSH into your server:
ssh user@yourserver.com
cd /var/www/newsroom-reporter
Install required packages:
# Python and dependencies
sudo apt-get update
sudo apt-get install -y python3 python3-pip python3-venv
# System dependencies for WeasyPrint PDF generation
sudo apt-get install -y libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev
# Apache and PHP
sudo apt-get install -y apache2 php libapache2-mod-php php-curl php-json
Step 4: Python Virtual Environment
# Create virtual environment
python3 -m venv venv
# Activate
source venv/bin/activate
# Install Python dependencies
pip install -r requirements.txt
Step 5: Configuration
A. Python Configuration:
# Copy and edit main .env
cp .env.example .env
nano .env
Add your API keys (same as local setup).
B. Upload Service Account:
From local machine:
scp service_account.json user@yourserver.com:/var/www/newsroom-reporter/
C. Web Interface Configuration:
cd web/
cp .env.example .env
nano .env
Add SSO settings:
SSO_ENABLED=true
SSO_TENANT_ID=your-tenant-id
SSO_CLIENT_ID=your-client-id
Step 6: Set File Permissions
# Make reports directory writable by web server
sudo mkdir -p reports temp_screenshots
sudo chown -R www-data:www-data reports/ temp_screenshots/
sudo chmod -R 775 reports/ temp_screenshots/
# Make Python script executable
chmod +x newsroom_report.py
# Protect sensitive files
chmod 600 .env web/.env service_account.json
sudo chown www-data:www-data .env web/.env service_account.json
Step 7: Apache Virtual Host
Create /etc/apache2/sites-available/newsroom-reporter.conf:
<VirtualHost *:80>
ServerName newsroom.yourdomain.com
ServerAdmin admin@yourdomain.com
DocumentRoot /var/www/newsroom-reporter/web
<Directory /var/www/newsroom-reporter/web>
Options -Indexes +FollowSymLinks
AllowOverride All
Require all granted
# PHP settings for long-running scripts
php_value max_execution_time 600
php_value memory_limit 512M
</Directory>
# Deny access to parent directory
<Directory /var/www/newsroom-reporter>
Require all denied
</Directory>
# Allow access to reports for downloads
<Directory /var/www/newsroom-reporter/reports>
Options -Indexes
Require all granted
</Directory>
ErrorLog ${APACHE_LOG_DIR}/newsroom-error.log
CustomLog ${APACHE_LOG_DIR}/newsroom-access.log combined
# Redirect to HTTPS
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
</VirtualHost>
<VirtualHost *:443>
ServerName newsroom.yourdomain.com
DocumentRoot /var/www/newsroom-reporter/web
<Directory /var/www/newsroom-reporter/web>
Options -Indexes +FollowSymLinks
AllowOverride All
Require all granted
</Directory>
# SSL Configuration
SSLEngine on
SSLCertificateFile /etc/ssl/certs/your-cert.crt
SSLCertificateKeyFile /etc/ssl/private/your-key.key
SSLCertificateChainFile /etc/ssl/certs/your-chain.crt
# Security Headers
Header always set Strict-Transport-Security "max-age=31536000"
Header always set X-Frame-Options "SAMEORIGIN"
Header always set X-Content-Type-Options "nosniff"
ErrorLog ${APACHE_LOG_DIR}/newsroom-ssl-error.log
CustomLog ${APACHE_LOG_DIR}/newsroom-ssl-access.log combined
</VirtualHost>
Step 8: Enable Apache Modules and Site
# Enable required modules
sudo a2enmod rewrite
sudo a2enmod headers
sudo a2enmod ssl
# Enable the site
sudo a2ensite newsroom-reporter.conf
# Test configuration
sudo apache2ctl configtest
# Restart Apache
sudo systemctl restart apache2
Step 9: Update Azure AD
Add production redirect URI to your Azure AD app:
https://newsroom.yourdomain.com/index-simple.php
Step 10: Test Production
- Visit: https://newsroom.yourdomain.com
- Sign in with Microsoft SSO
- Generate a test report
- Verify PDF downloads correctly
Automated Daily Reports (Optional)
Set up cron job for automatic generation:
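# Create the log file first so www-data can write to it
sudo touch /var/log/newsroom-reports.log
sudo chown www-data:www-data /var/log/newsroom-reports.log
# Edit crontab for www-data user
sudo crontab -u www-data -e
# Add: Run daily at 9 AM
0 9 * * * cd /var/www/newsroom-reporter && /var/www/newsroom-reporter/venv/bin/python newsroom_report.py >> /var/log/newsroom-reports.log 2>&1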
Web Interface Usage
Local Development (MAMP)
# Create symlink in MAMP htdocs
cd /Applications/MAMP/htdocs
ln -s /Users/daveporter/Desktop/CODING-2024/newsroom-reporter/web newsroom-reporter
# Configure SSO (or disable for testing)
cd /Users/daveporter/Desktop/CODING-2024/newsroom-reporter/web
nano .env # Set SSO_ENABLED=false for local testing
# Access at: http://localhost:8888/newsroom-reporter/index-simple.php
Production (Apache)
Access at: https://newsroom.yourdomain.com
Features:
- SSO authentication (same as NANO-RESEARCH)
- Date selection with auto-populated current date
- Real-time processing feedback
- Secure PDF downloads
- Black theme with #FFC407 yellow accents
Command Line Options
Generate report for specific date:
python newsroom_report.py "Tuesday, December 23"
Generate report for today:
python newsroom_report.py
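The two invocations above amount to one optional positional argument that falls back to today's date. A minimal sketch of that behavior using argparse; resolve_date_label is a hypothetical name, and the real newsroom_report.py may parse its arguments differently:

```python
import argparse
from datetime import date


def resolve_date_label(argv, today=None):
    """Return the date label from argv, or today's label when none is given."""
    parser = argparse.ArgumentParser(description="Generate a newsroom report")
    parser.add_argument("date_label", nargs="?", default=None,
                        help='date as written in the sheet, e.g. "Tuesday, December 23"')
    args = parser.parse_args(argv)
    if args.date_label:
        return args.date_label
    day = today or date.today()
    # Same format the sheet uses: weekday, month name, day without padding.
    return f"{day:%A}, {day:%B} {day.day}"
```

The date must be quoted on the command line because it contains a space and a comma.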
Features
- ✅ Screenshot Capture - Automatic screenshots of regular websites (500px wide)
- ✅ Smart Date Handling - Generate reports for any date in the sheet
- ✅ Dual Scraping - Firecrawl for websites, Apify for social media
- ✅ AI Summaries - Claude Sonnet 4.5 generates concise bullet points
- ✅ Newsletter PDF - Black/yellow theme, Montserrat font
- ✅ Web Interface - Beautiful UI with SSO authentication
Security Checklist for Production
- HTTPS enabled (SSL certificate)
- .env files have 600 permissions
- service_account.json has 600 permissions
- Reports directory writable by www-data
- Azure AD redirect URI updated
- Firewall configured (ports 80, 443 open)
- Error logging enabled (not display_errors)
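The permission items in the checklist can be verified with a short script. A minimal sketch for POSIX systems, assuming it is run from the project root; has_mode and insecure_files are hypothetical helpers:

```python
import os
import stat

# Files the checklist says must be readable and writable by the owner only.
SENSITIVE_FILES = [".env", "web/.env", "service_account.json"]


def has_mode(path, expected=0o600):
    """Return True if the file's permission bits equal expected."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


def insecure_files(paths, expected=0o600):
    """Return the existing paths whose permissions differ from expected."""
    return [p for p in paths if os.path.exists(p) and not has_mode(p, expected)]


if __name__ == "__main__":
    for path in insecure_files(SENSITIVE_FILES):
        print(f"WARNING: {path} is not mode 600")
```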
Monitoring
# Check Apache logs
tail -f /var/log/apache2/newsroom-error.log
# Check automated report logs (if cron enabled)
tail -f /var/log/newsroom-reports.log
# Check disk space (reports accumulate)
du -sh reports/
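Because reports accumulate, it may be useful to measure the folder and list candidates for cleanup from Python. This is a sketch only: the retention count of 30 is an arbitrary assumption, both function names are hypothetical, and nothing is deleted.

```python
from pathlib import Path


def reports_size_bytes(reports_dir="reports"):
    """Sum the sizes of all PDF files directly inside reports_dir."""
    return sum(p.stat().st_size for p in Path(reports_dir).glob("*.pdf"))


def oldest_reports(reports_dir="reports", keep=30):
    """Return reports beyond the newest `keep`, oldest first."""
    # Newsroom_Report_YYYY-MM-DD.pdf sorts chronologically as plain text.
    reports = sorted(Path(reports_dir).glob("Newsroom_Report_*.pdf"))
    return reports[:-keep] if keep > 0 else reports
```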
Cost Estimates
Monthly costs for ~20 URLs per day (600/month):
- Firecrawl: ~$20-30
- Apify: ~$15-50
- Anthropic Claude API: ~$10-20
- Google Sheets API: Free
Total: ~$45-100/month
Much cheaper than official social media APIs (which would cost $200-5000/month)!
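The total follows from summing the low and high ends of each range. The short calculation below reproduces it and derives a per-URL figure from the 600 URLs/month assumption stated above; the per-URL number is an illustration, not a quoted price.

```python
# Monthly ranges in USD, copied from the estimates above.
COSTS = {
    "Firecrawl": (20, 30),
    "Apify": (15, 50),
    "Anthropic Claude API": (10, 20),
    "Google Sheets API": (0, 0),
}
URLS_PER_MONTH = 600  # ~20 URLs per day


def total_range(costs):
    """Return (sum of low ends, sum of high ends)."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


low, high = total_range(COSTS)
print(f"Total: ${low}-{high}/month")  # Total: $45-100/month
print(f"Per URL: ${low / URLS_PER_MONTH:.3f}-{high / URLS_PER_MONTH:.3f}")  # Per URL: $0.075-0.167
```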
Support
For issues or questions:
- Check the Troubleshooting section
- Review error messages carefully
- Verify all API keys are correct
- Ensure virtual environment is activated
License
Proprietary - Internal Use Only
Git Repository
This project can be backed up to:
bitbucket.org:zlalani/volt-newsroom-scraper-report.git
Using SSH key: djp1971