Migrate run_court_session_parser to Celery Task
Overview
Migrate the run_court_session_parser management command to a Celery task with robust error handling, retry logic, and proper monitoring. This is the most complex and critical background task in the system, parsing court session data from multiple Polish courts daily.
Current Implementation Analysis
Command Location
- File: `poradnia/judgements/management/commands/run_court_session_parser.py`
- Current Schedule: Daily at 23:10 (`"10 23 * * *"`)
- Execution: Via cron with file locking (`.contrib/docker/cron/run_locked.sh`)
Task Complexity
- Purpose: Parses court session data from various Polish courts (NSA and the WSA courts in individual cities)
- Operations: Web scraping, database operations, event creation
- Dependencies: Court parsers, database models (Court, CourtCase, CourtSession, Event), User model
- Runtime: Variable (depends on court data volume and network conditions)
- Failure Points: Network issues, parser changes, database constraints, memory usage
Scope
This issue covers ONLY the migration of the management command to a Celery task. It does NOT include:
- Removing the original management command (kept for manual execution)
- Removing cron configuration files (handled in Phase 3: Remove Legacy Cron-Based Background Job System, #1834)
- Disabling the existing cron job (parallel execution during transition)
Implementation Tasks
1. Create Celery Task
- Create `poradnia/judgements/tasks.py`
- Convert the management command logic to a `@shared_task`-decorated function
- Maintain existing functionality while adding Celery capabilities
- Preserve all current parsing logic and database operations
2. Error Handling & Reliability
- Implement exponential backoff retry logic (max 3 retries)
- Add specific exception handling for:
  - Network connection errors
  - Parser/scraping failures
  - Database constraint violations
  - Memory/timeout issues
- Graceful failure handling with admin notifications
- Task timeout configuration to prevent hanging tasks (see the configuration sketch after this list)
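Most of the retry and timeout behaviour above can be expressed declaratively through Celery's task options. A minimal sketch, assuming the builtin ConnectionError/TimeoutError stand in for whatever exceptions the court parsers actually raise and that a one-hour hard limit is acceptable:

```python
# A minimal sketch of the retry and timeout options discussed above; the
# exception types, limits, and delays are illustrative assumptions.
from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded


@shared_task(
    bind=True,
    autoretry_for=(ConnectionError, TimeoutError),  # transient network failures
    retry_backoff=True,       # exponential backoff between retries: 1s, 2s, 4s, ...
    retry_backoff_max=600,    # cap any single retry delay at 10 minutes
    retry_jitter=True,        # randomize delays to avoid synchronized retries
    max_retries=3,
    soft_time_limit=3300,     # SoftTimeLimitExceeded is raised after 55 minutes
    time_limit=3600,          # the worker kills the task after 60 minutes
)
def run_court_session_parser(self):
    try:
        ...  # existing parsing logic
    except SoftTimeLimitExceeded:
        # Log/clean up partial progress and notify admins instead of hanging.
        raise
```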
3. Logging & Monitoring
- Structured logging with task progress tracking
- Log court-by-court processing status
- Error categorization for better debugging
- Performance metrics (processing time, records processed)
- Integration with the Celery result backend for status tracking (a progress-reporting sketch follows this list)
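For status tracking against the result backend, one option is Celery's `update_state()` with a custom state. A sketch, assuming a placeholder `parse_court()` helper and court list in place of the real parser objects:

```python
# Sketch of per-court progress reporting through the Celery result backend.
# The courts list and parse_court() helper are hypothetical placeholders
# for the existing parser logic.
from celery import shared_task
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)


def parse_court(court):
    """Placeholder for the existing parser logic; returns records processed."""
    return 0


@shared_task(bind=True)
def run_court_session_parser(self):
    courts = ["NSA", "WSA Warszawa", "WSA Kraków"]  # placeholder court list
    processed = 0
    for index, court in enumerate(courts, start=1):
        logger.info("Parsing %s (%d/%d)", court, index, len(courts))
        processed += parse_court(court)
        # PROGRESS is a custom state stored in the result backend; a caller
        # can poll AsyncResult(task_id).info to read this metadata.
        self.update_state(
            state="PROGRESS",
            meta={"current": index, "total": len(courts), "records": processed},
        )
    return {"status": "success", "processed": processed}
```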
4. Memory & Performance Optimization
- Process courts in batches to prevent memory buildup
- Implement progress checkpoints for long-running operations
- Add memory monitoring and cleanup
- Database connection management for long-running tasks (see the batching sketch after this list)
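A sketch of the batching and connection-management idea, assuming a batch size of 10 and leaving the per-court parsing step as a placeholder:

```python
# Sketch of batch processing with connection and memory housekeeping;
# the batch size and the shape of the courts collection are assumptions.
import gc

from django.db import close_old_connections


def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_courts_in_batches(courts, batch_size=10):
    for batch in chunked(list(courts), batch_size):
        for court in batch:
            ...  # run the existing parser for this court
        # Drop stale DB connections that may have timed out during scraping
        # and release parsed documents held in memory before the next batch.
        close_old_connections()
        gc.collect()
```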
5. Scheduling Configuration
- Configure Celery beat periodic task (daily at 23:10)
- Use database-backed scheduling via django-celery-beat (see the sketch after this list)
- Allow runtime schedule modifications through the Django admin
- Enable parallel execution with existing cron job during transition
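With django-celery-beat the schedule lives in the database, so it can be created once (for example from a data migration or a one-off shell snippet) and then edited in the Django admin. A sketch, assuming the task path proposed in this issue:

```python
# Sketch of registering the daily 23:10 schedule through django-celery-beat's
# database models; the task name assumes the module path proposed here.
from django_celery_beat.models import CrontabSchedule, PeriodicTask

schedule, _ = CrontabSchedule.objects.get_or_create(
    minute="10",
    hour="23",
    day_of_week="*",
    day_of_month="*",
    month_of_year="*",
)

PeriodicTask.objects.update_or_create(
    name="run-court-session-parser",
    defaults={
        "crontab": schedule,
        "task": "poradnia.judgements.tasks.run_court_session_parser",
        "enabled": True,  # can later be toggled from the Django admin
    },
)
```

Alternatively, the static CELERY_BEAT_SCHEDULE entry shown under Configuration Files below can be used; django-celery-beat's scheduler should mirror such entries into its database tables when beat starts.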
Files to Modify/Create
New Files
- `poradnia/judgements/tasks.py` - NEW FILE - Celery task implementation
Modified Files
- `poradnia/settings/base.py` - Add the court parser task to the Celery beat schedule
- `docs/celery.rst` - Documentation updates
Files NOT Modified (Kept for Safety)
- `poradnia/judgements/management/commands/run_court_session_parser.py` - KEPT UNCHANGED for manual execution
- `.contrib/docker/cron/set_crontab.sh` - NO CHANGES (handled in Phase 3: Remove Legacy Cron-Based Background Job System, #1834)
Configuration Files
`poradnia/settings/base.py` - Celery beat schedule configuration:

```python
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'run-court-session-parser': {
        'task': 'poradnia.judgements.tasks.run_court_session_parser',
        'schedule': crontab(hour=23, minute=10),  # Daily at 23:10
    },
}
```

Dependencies
- BLOCKED BY: Phase 1: Celery Infrastructure Setup (#1828)
- REQUIRES: Redis service, Celery worker service, Celery beat service
- PARENT: Phase 2: Background Task Migration to Celery (Umbrella Issue) (#1829)
This issue cannot begin until the Celery infrastructure is fully set up and operational.
Related Issues
- Infrastructure: Phase 1: Celery Infrastructure Setup (#1828) - must be completed first
- Umbrella: Phase 2: Background Task Migration to Celery (#1829) - tracks overall progress
- Parallel Tasks:
  - Migrate send_event_reminders to Celery Task (#1831)
  - Migrate send_old_cases_reminder to Celery Task (#1832)
  - Migrate Django clearsessions to Celery Task (#1833)
- Follow-up: Phase 3: Remove Legacy Cron-Based Background Job System (#1834) - handles cron cleanup
Testing Requirements
Unit Tests
- Test task execution with mock court data
- Test error handling for various failure scenarios
- Test retry logic with exponential backoff (a minimal test sketch follows this list)
- Test memory usage with large datasets
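A minimal unit-test sketch for the success path, assuming the task module path proposed above and a hypothetical patchable `parse_court` helper inside it:

```python
# Sketch of a unit test for the task; the patch target parse_court is a
# hypothetical helper standing in for the real parsing step.
from unittest import mock

from django.test import TestCase

from poradnia.judgements.tasks import run_court_session_parser


class RunCourtSessionParserTaskTest(TestCase):
    @mock.patch("poradnia.judgements.tasks.parse_court", return_value=5)
    def test_task_reports_processed_records(self, parse_court_mock):
        # apply() executes the task body synchronously, without a worker.
        result = run_court_session_parser.apply()
        self.assertEqual(result.status, "SUCCESS")
        self.assertEqual(result.result["status"], "success")
        self.assertTrue(parse_court_mock.called)
```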
Integration Tests
- Test full Celery task execution (an eager-mode sketch follows this list)
- Test scheduling via Celery beat
- Test task result storage and retrieval
- Test task monitoring and status tracking
- Test parallel execution with existing cron job
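An eager-mode integration-test sketch; it assumes the common setup where CELERY_TASK_ALWAYS_EAGER makes `.delay()` run inline instead of hitting the broker. If the project's Celery app caches its configuration at import time, setting task_always_eager in the test settings module is the more reliable variant:

```python
# Sketch of an eager-mode integration test; whether override_settings is
# picked up depends on how the Celery app loads its configuration.
from django.test import TestCase, override_settings

from poradnia.judgements.tasks import run_court_session_parser


@override_settings(CELERY_TASK_ALWAYS_EAGER=True, CELERY_TASK_EAGER_PROPAGATES=True)
class CourtSessionParserIntegrationTest(TestCase):
    def test_delay_runs_task_and_stores_result(self):
        result = run_court_session_parser.delay()
        self.assertTrue(result.successful())
        self.assertEqual(result.result["status"], "success")
```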
Performance Tests
- Compare execution time with original management command
- Test memory usage under various court data volumes
- Test concurrent execution handling
Implementation Example Structure

```python
# poradnia/judgements/tasks.py
from celery import shared_task
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)


@shared_task(bind=True, autoretry_for=(ConnectionError,), retry_kwargs={"max_retries": 3, "countdown": 60})
def run_court_session_parser(self):
    """
    Parse court session data from various Polish courts.

    Migrated from the management command with enhanced error handling.
    """
    try:
        logger.info("Starting court session parser task")
        processed = 0
        # Existing parsing logic goes here; add progress tracking and memory
        # management, incrementing `processed` as records are handled.
        return {"status": "success", "processed": processed}
    except Exception as exc:
        logger.error(f"Court parser task failed: {exc}")
        raise self.retry(exc=exc)
```
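For the manual-triggering acceptance criterion below, the task could be invoked from a Django shell; `.apply()` runs it inline without a worker, while `.delay()` requires a running worker and broker. This assumes the module path proposed above:

```python
# Manual invocation sketch for testing/debugging.
from poradnia.judgements.tasks import run_court_session_parser

# Run synchronously in the current process (no worker needed):
result = run_court_session_parser.apply()

# Or enqueue for a worker and poll the result backend:
async_result = run_court_session_parser.delay()
print(async_result.status)
```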
Acceptance Criteria
- Celery task successfully parses court session data
- Task runs on the same daily schedule (23:10)
- All current functionality is preserved
- Error handling improves reliability over cron system
- Task execution can be monitored through Celery
- Memory usage is optimized for long-running operations
- Retry logic handles transient failures automatically
- Logging provides detailed execution information
- Task can be manually triggered for testing/debugging
- Parallel execution with cron job works safely during transition period
Transition Strategy
- Deploy Celery task alongside existing cron job
- Monitor both systems for consistency
- Verify Celery task reliability over time
- Keep original management command for manual execution
- Do NOT remove the cron configuration (handled in Phase 3: Remove Legacy Cron-Based Background Job System, #1834)
Success Metrics
- Reliability: Task success rate > 95% with automatic retry handling
- Performance: Execution time comparable to or better than original command
- Monitoring: Full visibility into task execution status and errors
- Maintainability: Easier debugging and administration through Celery interface
- Safety: Successful parallel execution with legacy cron system
References
- Current management command: `poradnia/judgements/management/commands/run_court_session_parser.py`
- Cron configuration: `.contrib/docker/cron/set_crontab.sh`
- Court parser implementations in the `poradnia/judgements` module
- Related models: Court, CourtCase, CourtSession, Event