Migrate run_court_session_parser to Celery Task #1830

Overview

Migrate the run_court_session_parser management command to a Celery task with robust error handling, retry logic, and proper monitoring. This is the most complex and critical background task in the system, parsing court session data from multiple Polish courts daily.

Current Implementation Analysis

Command Location

  • File: poradnia/judgements/management/commands/run_court_session_parser.py
  • Current Schedule: Daily at 23:10 ("10 23 * * *")
  • Execution: Via cron with file locking (.contrib/docker/cron/run_locked.sh)

Task Complexity

  • Purpose: Parses court session data from various Polish courts (NSA, WSA cities)
  • Operations: Web scraping, database operations, event creation
  • Dependencies: Court parsers, database models (Court, CourtCase, CourtSession, Event), User model
  • Runtime: Variable (depends on court data volume and network conditions)
  • Failure Points: Network issues, parser changes, database constraints, memory usage

Scope

This issue covers ONLY the migration of the management command to a Celery task; any work beyond that migration is out of scope.

Implementation Tasks

1. Create Celery Task

  • Create poradnia/judgements/tasks.py
  • Convert the management command logic to a @shared_task-decorated function (a transitional sketch follows this list)
  • Maintain existing functionality while adding Celery capabilities
  • Preserve all current parsing logic and database operations
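
A minimal transitional sketch, assuming the task starts out by delegating to the existing management command through Django's call_command; all current parsing logic then stays in the command untouched:

from celery import shared_task
from django.core.management import call_command


@shared_task
def run_court_session_parser():
    # Delegate to the existing management command so all current parsing
    # logic and database operations are preserved unchanged.
    call_command("run_court_session_parser")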

2. Error Handling & Reliability

  • Implement exponential backoff retry logic (max 3 retries)
  • Add specific exception handling for:
    • Network connection errors
    • Parser/scraping failures
    • Database constraint violations
    • Memory/timeout issues
  • Graceful failure handling with admin notifications (see the sketch after this list)
  • Task timeout configuration (prevent hanging tasks)
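
One possible shape for the retry and notification logic, assuming the parsers surface network problems as requests-style exceptions (an assumption about the parser internals); the timeout value is illustrative:

from celery import shared_task
from django.core.mail import mail_admins
from django.core.management import call_command
from requests.exceptions import RequestException


@shared_task(bind=True, max_retries=3, soft_time_limit=3600)  # timeout value is illustrative
def run_court_session_parser(self):
    try:
        call_command("run_court_session_parser")
    except RequestException as exc:
        if self.request.retries >= self.max_retries:
            # Retries exhausted: notify admins instead of failing silently.
            mail_admins(
                "Court session parser failed",
                f"Aborted after {self.request.retries} retries: {exc}",
            )
            raise
        # Exponential backoff: 60s, 120s, 240s between attempts.
        raise self.retry(exc=exc, countdown=60 * 2 ** self.request.retries)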

3. Logging & Monitoring

  • Structured logging with task progress tracking
  • Log court-by-court processing status (sketched after this list)
  • Error categorization for better debugging
  • Performance metrics (processing time, records processed)
  • Integration with Celery result backend for status tracking
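
A sketch of per-court progress logging with simple timing metrics; the Court import path and the parse_court callable are assumptions standing in for the existing parser entry points:

import time

from celery.utils.log import get_task_logger

from poradnia.judgements.models import Court

logger = get_task_logger(__name__)


def log_court_progress(parse_court):
    """Run parse_court(court) for every court, logging status and timing."""
    for court in Court.objects.all():
        started = time.monotonic()
        try:
            records = parse_court(court)
        except Exception:
            logger.exception("Parsing failed for court %s", court.pk)
            continue
        logger.info(
            "Parsed court %s: %s records in %.1fs",
            court.pk, records, time.monotonic() - started,
        )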

4. Memory & Performance Optimization

  • Process courts in batches to prevent memory buildup (see the sketch after this list)
  • Implement progress checkpoints for long-running operations
  • Add memory monitoring and cleanup
  • Database connection management for long-running tasks
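
A sketch of batch processing with periodic cleanup for the long-running task; the import path and the parse_court callable are again assumptions, not existing code:

import gc

from django.db import close_old_connections, reset_queries

from poradnia.judgements.models import Court


def parse_in_batches(parse_court, batch_size=20):
    courts = Court.objects.all().iterator(chunk_size=batch_size)
    for index, court in enumerate(courts, start=1):
        parse_court(court)
        if index % batch_size == 0:
            reset_queries()          # drop Django's per-connection query log
            close_old_connections()  # recycle stale database connections
            gc.collect()             # release parser objects held in memory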

5. Scheduling Configuration

  • Configure Celery beat periodic task (daily at 23:10)
  • Use database-backed scheduling (django-celery-beat); a registration sketch follows this list
  • Allow runtime schedule modifications through Django admin
  • Enable parallel execution with existing cron job during transition
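
A sketch of registering the schedule through django-celery-beat so it can later be edited in the Django admin; the task path matches the beat schedule example below:

from django_celery_beat.models import CrontabSchedule, PeriodicTask

schedule, _ = CrontabSchedule.objects.get_or_create(
    minute="10", hour="23", day_of_week="*", day_of_month="*", month_of_year="*",
)
PeriodicTask.objects.get_or_create(
    name="run-court-session-parser",
    defaults={
        "task": "poradnia.judgements.tasks.run_court_session_parser",
        "crontab": schedule,
    },
)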

Files to Modify/Create

New Files

  • poradnia/judgements/tasks.py - NEW FILE - Celery task implementation

Modified Files

  • poradnia/settings/base.py - Add court parser task to Celery beat schedule
  • docs/celery.rst - Documentation updates

Files NOT Modified (Kept for Safety)

  • .contrib/docker/cron/ scripts - the existing cron setup stays in place for parallel execution during the transition period

Configuration Files

  • poradnia/settings/base.py - Celery beat schedule entry to be added:

from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'run-court-session-parser': {
        'task': 'poradnia.judgements.tasks.run_court_session_parser',
        'schedule': crontab(hour=23, minute=10),  # Daily at 23:10
    },
}
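
If django-celery-beat is adopted as proposed above, beat also has to be pointed at the database scheduler; a minimal setting, assuming Celery reads Django settings with the CELERY_ namespace (the same convention as the example above):

# poradnia/settings/base.py
CELERY_BEAT_SCHEDULER = "django_celery_beat.schedulers:DatabaseScheduler"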

Dependencies

Work on this issue cannot begin until the Celery infrastructure is fully set up and operational.

Related Issues

Testing Requirements

Unit Tests

  • Test task execution with mock court data (see the sketch after this list)
  • Test error handling for various failure scenarios
  • Test retry logic with exponential backoff
  • Test memory usage with large datasets
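
A sketch of one such unit test, running the task eagerly; the patch target assumes the task delegates to the existing management command as in the transitional sketch earlier:

from unittest import mock

from django.test import TestCase

from poradnia.judgements.tasks import run_court_session_parser


class RunCourtSessionParserTaskTest(TestCase):
    @mock.patch("poradnia.judgements.tasks.call_command")
    def test_task_reports_success(self, mock_call_command):
        result = run_court_session_parser.apply()  # executes locally, no broker needed
        result.get()  # raises if the task body failed
        mock_call_command.assert_called_once_with("run_court_session_parser")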

Integration Tests

  • Test full Celery task execution
  • Test scheduling via Celery beat
  • Test task result storage and retrieval
  • Test task monitoring and status tracking
  • Test parallel execution with existing cron job

Performance Tests

  • Compare execution time with the original management command (a timing sketch follows this list)
  • Test memory usage under various court data volumes
  • Test concurrent execution handling
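
A rough manual timing comparison, intended to be run from manage.py shell against a staging database rather than automated in CI:

import time

from django.core.management import call_command

from poradnia.judgements.tasks import run_court_session_parser

start = time.monotonic()
call_command("run_court_session_parser")
command_seconds = time.monotonic() - start

start = time.monotonic()
run_court_session_parser.apply()  # executes the task body in-process
task_seconds = time.monotonic() - start

print(f"command: {command_seconds:.1f}s  task: {task_seconds:.1f}s")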

Implementation Example Structure

# poradnia/judgements/tasks.py
from celery import shared_task
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)


@shared_task(
    bind=True,
    autoretry_for=(ConnectionError,),
    retry_backoff=60,                  # exponential backoff starting at 60s
    retry_kwargs={"max_retries": 3},
    soft_time_limit=3600,              # illustrative limit to prevent hanging tasks
)
def run_court_session_parser(self):
    """
    Parse court session data from various Polish courts.
    Migrated from the management command with enhanced error handling.
    """
    try:
        logger.info("Starting court session parser task")
        processed = 0
        # Existing parsing logic goes here; increment `processed` per record
        # and add progress tracking / memory management as described above.
        return {"status": "success", "processed": processed}
    except Exception:
        # Unexpected errors are logged and re-raised; only ConnectionError
        # is retried automatically via autoretry_for above.
        logger.exception("Court parser task failed")
        raise
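
For manual triggering during testing and debugging (see the acceptance criteria below), the task can be enqueued or run locally with standard Celery calls:

from poradnia.judgements.tasks import run_court_session_parser

run_court_session_parser.delay()                      # enqueue on the broker
run_court_session_parser.apply()                      # run synchronously in-process
run_court_session_parser.apply_async(countdown=300)   # enqueue, start in 5 minutes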

Acceptance Criteria

  • Celery task successfully parses court session data
  • Task runs on the same daily schedule (23:10)
  • All current functionality is preserved
  • Error handling improves reliability over cron system
  • Task execution can be monitored through Celery
  • Memory usage is optimized for long-running operations
  • Retry logic handles transient failures automatically
  • Logging provides detailed execution information
  • Task can be manually triggered for testing/debugging
  • Parallel execution with cron job works safely during transition period

Transition Strategy

Success Metrics

  • Reliability: Task success rate > 95% with automatic retry handling
  • Performance: Execution time comparable to or better than original command
  • Monitoring: Full visibility into task execution status and errors
  • Maintainability: Easier debugging and administration through Celery interface
  • Safety: Successful parallel execution with legacy cron system

References

  • Current management command: poradnia/judgements/management/commands/run_court_session_parser.py
  • Cron configuration: .contrib/docker/cron/set_crontab.sh
  • Court parser implementations in poradnia/judgements/ module
  • Related models: Court, CourtCase, CourtSession, Event
