Implement comprehensive link checker with image focus and CI/CD integration #220

Draft pull request: wants to merge 10 commits into master

1 change: 1 addition & 0 deletions .eleventyignore
@@ -1,3 +1,4 @@
./README.md
./_image_sources
./_drafts
./.github
15 changes: 15 additions & 0 deletions .github/linkchecker/Dockerfile
@@ -0,0 +1,15 @@
FROM ubuntu:22.04
Member

Can we remove the copy of this linkchecker folder from the _drafts folder?

Contributor Author

Done! Removed the outdated linkchecker copy from _drafts/linkchecker/ since the comprehensive implementation is now in .github/linkchecker/. Commit c0a7eed

RUN apt-get -y update && \
    apt-get install -y ca-certificates linkchecker python3-pip curl --no-install-recommends \
    && apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN pip3 install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org jinja2

WORKDIR /linkchecker
COPY filter_csv.py output_template.html linkchecker.conf run_linkcheck.sh ./

# Make script executable
RUN chmod +x run_linkcheck.sh

# Default command to run linkchecker
CMD ["linkchecker", "--config=linkchecker.conf"]
138 changes: 138 additions & 0 deletions .github/linkchecker/README.md
@@ -0,0 +1,138 @@
# OrionRobots Link Checker

This directory contains the link-checking functionality for the OrionRobots website. It detects broken links, with a particular focus on broken images and broken internal links.

## 🎯 Features

- **Image-focused checking**: Prioritizes broken image links that affect visual content
- **Categorized results**: Separates internal, external, image, and email links
- **HTML reports**: Generates detailed, styled reports with priority indicators
- **Docker integration**: Runs in isolated containers for consistency
- **CI/CD integration**: Automated nightly checks and PR-based checks

## πŸš€ Usage

### Local Usage

Run the link checker locally using the provided script:

```bash
./.github/scripts/local_linkcheck.sh
```

This will:
1. Build the site
2. Start a local HTTP server
3. Run the link checker
4. Generate a report in `./linkchecker_reports/`
5. Clean up containers
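
The script is roughly equivalent to the following sequence (a sketch only; the authoritative version is `.github/scripts/local_linkcheck.sh`, and the service names come from the compose commands shown in the next section):

```bash
#!/usr/bin/env bash
# Approximate outline of the local link-check flow (see local_linkcheck.sh for the real script)
set -euo pipefail

mkdir -p ./linkchecker_reports

# 1-2. Build the site and start the local HTTP server
docker compose --profile manual up -d http_serve

# 3. Run the link checker against the served site
docker compose --profile manual up broken_links

# 4. Reports land in the mounted volume
ls ./linkchecker_reports/

# 5. Clean up containers
docker compose down
```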

### Manual Docker Compose

You can also run individual services manually:

```bash
# Build and serve the site
docker compose --profile manual up -d http_serve

# Run link checker
docker compose --profile manual up broken_links

# View logs
docker compose logs broken_links

# Cleanup
docker compose down
```

### GitHub Actions Integration

#### Nightly Checks
- Runs every night at 2 AM UTC
- Checks the production site (https://orionrobots.co.uk)
- Creates warnings for broken links
- Uploads detailed reports as artifacts
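
To approximate the nightly job from a local machine, one option (a sketch only, assuming the image is built from `.github/linkchecker/` and the report directory is mounted as in the compose setup) is to run the checker container directly against production:

```bash
# Build the checker image from this PR and point it at the production site (sketch only)
docker build -t orion-linkchecker .github/linkchecker
docker run --rm \
  -v "$PWD/linkchecker_reports:/linkchecker_reports" \
  orion-linkchecker \
  linkchecker --config=linkchecker.conf https://orionrobots.co.uk
```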

#### PR-based Checks
- Triggered when a PR is labeled with `link-check`
- Deploys a staging version of the PR
- Runs link checker on the staging deployment
- Comments results on the PR
- Automatically cleans up staging deployment

To run link checking on a PR:
1. Add the `link-check` label to the PR
2. The workflow will automatically deploy staging and run checks
3. Results will be commented on the PR
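
The label can also be added from the command line with the GitHub CLI, if installed (adjust the PR number as needed):

```bash
# Trigger the link-check workflow on this PR by applying the label
gh pr edit 220 --add-label link-check
```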

## πŸ“ Files

- `Dockerfile`: Container definition for the link checker
- `linkchecker.conf`: Configuration for linkchecker tool
- `filter_csv.py`: Python script to process and categorize results
- `output_template.html`: HTML template for generating reports
- `run_linkcheck.sh`: Main script that orchestrates the checking process

## πŸ“Š Report Categories

The generated reports categorize broken links by priority:

1. **πŸ–ΌοΈ Images** (High Priority): Broken image links that affect visual content
2. **🏠 Internal Links** (High Priority): Broken internal links under our control
3. **🌐 External Links** (Medium Priority): Broken external links (may be temporary)
4. **πŸ“§ Email Links** (Low Priority): Broken email links (complex to validate)

## βš™οΈ Configuration

The link checker configuration in `linkchecker.conf` includes:

- **Recursion**: Checks up to 2 levels deep (`recursionlevel=2`)
- **Output**: CSV format for easy processing
- **Filtering**: Ignores common social media sites that block crawlers
- **Anchor checking**: Validates internal page anchors
- **Warning handling**: Configurable warning levels

## πŸ”§ Customization

To modify the link checking behavior:

1. **Change checking depth**: Edit `recursionlevel` in `linkchecker.conf`
2. **Add ignored URLs**: Add patterns to the `ignore` section in `linkchecker.conf`
3. **Modify report styling**: Edit `output_template.html`
4. **Change categorization**: Modify `filter_csv.py`
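
After editing any of these files, rebuild the checker image before the next run, for example:

```bash
# Rebuild the checker image and rerun it against the locally served site
docker compose build broken_links
docker compose --profile manual up broken_links
```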

## 🐳 Docker Integration

The link checker integrates with the existing Docker Compose setup:

- Uses the `http_serve` service as the target
- Depends on health checks to ensure site availability
- Outputs reports to a mounted volume for persistence
- Runs in the `manual` profile to avoid automatic execution

## πŸ“‹ Requirements

- Docker and Docker Compose
- Python 3 with Jinja2 (handled in container)
- linkchecker tool (handled in container)
- curl for health checks (handled in container)

## πŸ” Troubleshooting

### Site not available
If you get "Site not available" errors:
1. Ensure the site builds successfully first
2. Check that the HTTP server is running
3. Verify port 8082 is not in use

### Permission errors
If you get permission errors with volumes:
1. Check Docker permissions
2. Ensure the linkchecker_reports directory exists
3. Try running with sudo (not recommended for production)

### Missing dependencies
If linkchecker fails to run:
1. Check the Dockerfile builds successfully
2. Verify Python dependencies are installed
3. Check linkchecker configuration syntax
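
A few commands that can help narrow these problems down (suggestions only; the service names match the compose examples above):

```bash
# Is something already bound to port 8082?
lsof -i :8082

# Is the site container up and healthy, and what is it logging?
docker compose ps
docker compose logs http_serve

# Does the checker image build cleanly?
docker compose build broken_links
```
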
80 changes: 80 additions & 0 deletions .github/linkchecker/filter_csv.py
@@ -0,0 +1,80 @@
# -*- coding: utf-8 -*-
import csv
import sys
import os
from urllib.parse import urlparse

from jinja2 import Environment, FileSystemLoader, select_autoescape


def is_image_url(url):
    """Check if URL points to an image file"""
    image_extensions = {'.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico', '.bmp'}
    parsed = urlparse(url)
    path = parsed.path.lower()
    return any(path.endswith(ext) for ext in image_extensions)


def categorize_link(item):
    """Categorize link by type"""
    url = item['url']
    if is_image_url(url):
        return 'image'
    elif url.startswith('mailto:'):
        return 'email'
    elif url.startswith('http'):
        return 'external'
    else:
        return 'internal'


def output_file(items):
    # Get the directory where this script is located
    script_dir = os.path.dirname(os.path.abspath(__file__))
    env = Environment(
        loader=FileSystemLoader(script_dir),
        autoescape=select_autoescape(['html', 'xml'])
    )
    template = env.get_template('output_template.html')

    # Categorize items
    categorized = {}
    for item in items:
        category = categorize_link(item)
        if category not in categorized:
            categorized[category] = []
        categorized[category].append(item)

    print(template.render(
        categorized=categorized,
        total_count=len(items),
        image_count=len(categorized.get('image', [])),
        internal_count=len(categorized.get('internal', [])),
        external_count=len(categorized.get('external', [])),
        email_count=len(categorized.get('email', []))
    ))


def main():
    filename = sys.argv[1] if len(sys.argv) > 1 else '/linkchecker/output.csv'

    if not os.path.exists(filename):
        print(f"Error: CSV file {filename} not found")
        sys.exit(1)

    with open(filename, encoding='utf-8') as csv_file:
        data = csv_file.readlines()
    reader = csv.DictReader((row for row in data if not row.startswith('#')), delimiter=';')

    # Filter out successful links and redirects
    non_200 = (item for item in reader if 'OK' not in item['result'])
    non_redirect = (item for item in non_200 if '307' not in item['result'] and '301' not in item['result'] and '302' not in item['result'])
    non_ssl = (item for item in non_redirect if 'ssl' not in item['result'].lower())

    total_list = sorted(list(non_ssl), key=lambda item: (categorize_link(item), item['parentname']))

    output_file(total_list)


if __name__ == '__main__':
    main()
44 changes: 44 additions & 0 deletions .github/linkchecker/linkchecker.conf
@@ -0,0 +1,44 @@
[checking]
# Check links with limited recursion for faster execution
recursionlevel=2
# Focus on internal links
allowedschemes=http,https,file
# Check for broken images specifically
checkextern=1
# Limit number of URLs to check for faster execution
maxrequestspersecond=10
# Timeout for each request
timeout=10
# Hard time limit - 2 minutes maximum for PR checks
maxrunseconds=120
threads=4

[output]
# Output in CSV format for easier processing
log=csv
filename=/linkchecker_reports/output.csv
# Also output to console
verbose=1
warnings=1

[filtering]
# Ignore certain file types that might cause issues
ignorewarnings=url-whitespace,url-content-size-zero,url-content-too-large
# Skip external social media links that often block crawlers
ignore=
  url:facebook\.com
  url:twitter\.com
  url:instagram\.com
  url:linkedin\.com
  url:youtube\.com
  url:tiktok\.com

[AnchorCheck]
# Check for broken internal anchors
add=1

[authentication]
# No authentication required for most checks

[plugins]
# No additional plugins needed for basic checking