Cleanup Worker
Overview
The Cleanup Worker is a Dart-based service that runs independently to manage the lifecycle of deployed functions and their container images. It performs automated cleanup of stale functions and removes unused container images from the system.
Purpose: Prevent resource waste by automatically identifying and removing functions that haven't been invoked within a configurable threshold period, and cleaning up their associated container images.
Architecture
The cleanup worker operates as a separate containerized service within the Docker Compose environment:
┌─────────────────────────────────────────────────────────────────────┐
│                     Docker Compose Environment                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │   Postgres   │◄───│   Backend    │    │    Cleanup Worker    │   │
│  │   Database   │    │   Service    │    │    (Dart + Cron)     │   │
│  └──────────────┘    └──────────────┘    └──────────┬───────────┘   │
│         ▲                                           │               │
│         │                                           │               │
│         └───────────────────────────────────────────┘               │
│                                                     │               │
│                                           ┌─────────▼───────────┐   │
│                                           │    Python Podman    │   │
│                                           │       Client        │   │
│                                           └─────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
How It Works
The cleanup worker runs as a scheduled cron job and performs a two-phase cleanup on each run:
Phase 1: Delete Pending Images
- Queries the `pending_image_deletions` table for entries awaiting deletion
- For each entry with `retry_count < MAX_RETRY_COUNT`:
  - Calls the Python Podman client to delete the container image
  - On success: Removes the entry from the table
  - On failure: Increments `retry_count`, logs the error, and schedules for retry
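The Phase 1 loop above can be sketched in a few lines. This is an illustrative Python model, not the worker's Dart source: `PendingDeletion`, `process_pending`, and the `delete_image` callback are hypothetical names, and the choice to leave entries with exhausted retries in the queue untouched is an assumption, since the text does not say what happens to them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_RETRY_COUNT = 3  # default value from the Configuration table


@dataclass
class PendingDeletion:
    """Hypothetical in-memory stand-in for a pending_image_deletions row."""
    image_tag: str
    retry_count: int = 0
    last_error: Optional[str] = None


def process_pending(
    queue: List[PendingDeletion],
    delete_image: Callable[[str], None],
    max_retries: int = MAX_RETRY_COUNT,
) -> List[PendingDeletion]:
    """Run one Phase 1 pass and return the entries that stay queued."""
    remaining = []
    for entry in queue:
        if entry.retry_count >= max_retries:
            # Assumption: exhausted entries are skipped and left in place.
            remaining.append(entry)
            continue
        try:
            delete_image(entry.image_tag)
        except Exception as exc:
            # On failure: bump the counter, keep the error, retry on a later run.
            entry.retry_count += 1
            entry.last_error = str(exc)
            remaining.append(entry)
        # On success the entry is dropped, i.e. removed from the table.
    return remaining


def fake_delete(tag: str) -> None:
    """Stand-in for the Python Podman client call; fails for tags starting with 'bad'."""
    if tag.startswith("bad"):
        raise RuntimeError("image in use")


queue = [PendingDeletion("func-abc123:v1"), PendingDeletion("bad-func:v1")]
left = process_pending(queue, fake_delete)
```

After this pass, `left` holds only the failed entry, with its `retry_count` raised to 1 and `last_error` set.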
Phase 2: Identify Stale Functions
- Scans the database for functions matching stale criteria:
  - Status is `active` with an active deployment
  - Last invocation is older than `STALE_THRESHOLD_DAYS`, or the function has never been invoked
- For each stale function:
  - Inserts its image into the `pending_image_deletions` table with reason `stale_function`
  - Updates function status to `pending_cleanup`
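The stale criteria above amount to a small predicate. The sketch below is a Python illustration under assumed names (`is_stale` and its parameters are hypothetical, not taken from the worker's code). It treats "older than" as strictly older and, following the text literally, counts a never-invoked function as stale regardless of when it was created.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    status: str,
    has_active_deployment: bool,
    last_invoked_at: Optional[datetime],
    now: datetime,
    threshold_days: int = 30,  # STALE_THRESHOLD_DAYS default
) -> bool:
    """Return True when a function matches the Phase 2 stale criteria."""
    if status != "active" or not has_active_deployment:
        return False
    if last_invoked_at is None:
        return True  # never invoked
    return last_invoked_at < now - timedelta(days=threshold_days)


now = datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc)
recent = is_stale("active", True, now - timedelta(days=5), now)
old = is_stale("active", True, now - timedelta(days=31), now)
never = is_stale("active", True, None, now)
```

Here `recent` is `False`, while `old` and `never` are both `True`.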
Configuration
Configure the cleanup worker via environment variables:
| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | Required |
| `CLEANUP_CRON_SCHEDULE` | Cron expression for cleanup job | `0 3 * * *` (3 AM daily) |
| `STALE_THRESHOLD_DAYS` | Days without invocation before marking as stale | `30` |
| `PODMAN_SOCKET_PATH` | Path to Podman socket for image deletion | `/run/podman/podman.sock` |
| `PYTHON_CLIENT_PATH` | Path to `podman_client.py` script | `/app/podman_client.py` |
| `MAX_RETRY_COUNT` | Maximum retry attempts for failed deletions | `3` |
| `LOG_LEVEL` | Logging verbosity (`debug`, `info`, `warning`, `error`) | `info` |
| `RUN_ON_STARTUP` | Execute cleanup immediately on service startup | `false` |
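To make the defaults concrete, here is a minimal Python sketch of how these variables could be resolved. `WorkerConfig` and `load_config` are hypothetical names, and the boolean parsing rule (only the string `true`, case-insensitive, enables `RUN_ON_STARTUP`) is an assumption. The worker itself is written in Dart and may parse values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerConfig:
    database_url: str
    cron_schedule: str
    stale_threshold_days: int
    podman_socket_path: str
    python_client_path: str
    max_retry_count: int
    log_level: str
    run_on_startup: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> WorkerConfig:
    """Read settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise ValueError("DATABASE_URL is required")
    return WorkerConfig(
        database_url=database_url,
        cron_schedule=env.get("CLEANUP_CRON_SCHEDULE", "0 3 * * *"),
        stale_threshold_days=int(env.get("STALE_THRESHOLD_DAYS", "30")),
        podman_socket_path=env.get("PODMAN_SOCKET_PATH", "/run/podman/podman.sock"),
        python_client_path=env.get("PYTHON_CLIENT_PATH", "/app/podman_client.py"),
        max_retry_count=int(env.get("MAX_RETRY_COUNT", "3")),
        log_level=env.get("LOG_LEVEL", "info"),
        # Assumption: only "true" (any letter case) turns this on.
        run_on_startup=env.get("RUN_ON_STARTUP", "false").strip().lower() == "true",
    )


cfg = load_config({
    "DATABASE_URL": "postgres://user:pass@localhost:5432/dart_cloud",
    "STALE_THRESHOLD_DAYS": "7",
    "RUN_ON_STARTUP": "true",
})
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to exercise without touching the real environment.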
Database Schema
The worker uses the `pending_image_deletions` table to track images awaiting deletion:
CREATE TABLE pending_image_deletions (
id SERIAL PRIMARY KEY,
uuid UUID UNIQUE NOT NULL DEFAULT uuid_generate_v4(),
function_id INTEGER NOT NULL REFERENCES functions(id) ON DELETE CASCADE,
image_tag VARCHAR(255) NOT NULL,
reason VARCHAR(100) NOT NULL DEFAULT 'stale_function',
retry_count INTEGER DEFAULT 0,
last_error TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_attempted_at TIMESTAMP
);
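The sketch below shows how the `retry_count`, `last_error`, and `last_attempted_at` columns could be used together. It is an approximation only: it runs against an in-memory SQLite database rather than PostgreSQL, keeps just a subset of the columns, and the helper names `eligible_images` and `record_failure` are hypothetical. The worker's actual queries are not shown in this document.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL, with a trimmed-down table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pending_image_deletions (
        id INTEGER PRIMARY KEY,
        image_tag TEXT NOT NULL,
        reason TEXT NOT NULL DEFAULT 'stale_function',
        retry_count INTEGER DEFAULT 0,
        last_error TEXT,
        last_attempted_at TIMESTAMP
    )
    """
)
conn.executemany(
    "INSERT INTO pending_image_deletions (image_tag, retry_count) VALUES (?, ?)",
    [("func-a:v1", 0), ("func-b:v1", 3)],
)


def eligible_images(conn, max_retry_count=3):
    """Return image tags whose retry budget is not yet exhausted."""
    rows = conn.execute(
        "SELECT image_tag FROM pending_image_deletions "
        "WHERE retry_count < ? ORDER BY id",
        (max_retry_count,),
    )
    return [row[0] for row in rows]


def record_failure(conn, image_tag, error):
    """Increment retry_count and store the error for a failed attempt."""
    conn.execute(
        "UPDATE pending_image_deletions "
        "SET retry_count = retry_count + 1, last_error = ?, "
        "last_attempted_at = CURRENT_TIMESTAMP "
        "WHERE image_tag = ?",
        (error, image_tag),
    )


first_pass = eligible_images(conn)
record_failure(conn, "func-a:v1", "image in use")
```

With the two seeded rows, `first_pass` contains only `func-a:v1`, because `func-b:v1` has already reached the default retry limit.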
Running Locally
To run the cleanup worker in development:
cd dart_cloud_backend/packages/cleanup_worker
# Install dependencies
dart pub get
# Run with custom configuration
DATABASE_URL="postgres://user:pass@localhost:5432/dart_cloud" \
CLEANUP_CRON_SCHEDULE="*/5 * * * *" \
STALE_THRESHOLD_DAYS="7" \
RUN_ON_STARTUP="true" \
dart run bin/worker.dart
Docker Deployment
Run the cleanup worker in Docker Compose:
cd dart_cloud_backend/deploy
docker-compose up cleanup-worker
The service will automatically connect to the PostgreSQL database and Podman socket defined in the compose configuration.
Logging
The worker uses structured logging to track all operations:
[2024-01-15T03:00:00.000Z] INFO: CleanupService: Starting cleanup job
[2024-01-15T03:00:00.100Z] INFO: CleanupService: === Phase 1: Deleting pending images ===
[2024-01-15T03:00:00.200Z] INFO: CleanupService: Found 5 pending image deletions
[2024-01-15T03:00:01.000Z] INFO: ImageDeletionService: Successfully deleted image: func-abc123:v1
[2024-01-15T03:00:05.000Z] INFO: CleanupService: === Phase 2: Identifying stale functions ===
[2024-01-15T03:00:05.100Z] INFO: CleanupService: Found 2 stale functions
[2024-01-15T03:00:05.200Z] INFO: CleanupService: Queued function 'old-function' for cleanup
[2024-01-15T03:00:05.300Z] INFO: CleanupService: Cleanup job completed
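Because every line in the sample follows the same `[timestamp] LEVEL: Component: message` shape, the output is straightforward to parse for monitoring or tests. The Python sketch below infers the format from the sample lines alone; `LOG_LINE` and `parse_log_line` are hypothetical helpers, and real output may contain lines that do not match.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the sample output above, not from a format specification.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] (?P<level>[A-Z]+): "
    r"(?P<component>\w+): (?P<message>.*)$"
)


def parse_log_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The trailing "Z" marks UTC, so attach the UTC timezone explicitly.
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return fields


entry = parse_log_line(
    "[2024-01-15T03:00:01.000Z] INFO: ImageDeletionService: "
    "Successfully deleted image: func-abc123:v1"
)
```

The message field keeps everything after the component name, so colons inside the message (as in the image tag) are preserved.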
Dependencies
- cron - Cron job scheduling for periodic execution
- logging - Structured logging for operations tracking
- database - Internal database package with entity definitions and migrations
Integration with Backend
The cleanup worker integrates with the backend service at the following points:
- Database Access: Shares the same PostgreSQL instance as the backend
- Image Deletion: Uses the Python Podman client for container image management
- Status Tracking: Updates function status in the database as cleanup progresses
- Error Handling: Implements retry logic with exponential backoff for failed deletions
Benefits
- Resource Optimization: Automatically frees up storage by removing unused container images
- Cost Reduction: Reduces infrastructure costs by cleaning up stale deployments
- Autonomous Operation: Runs independently without manual intervention
- Configurable Thresholds: Adjust stale detection based on your usage patterns
- Reliable: Implements retry logic and comprehensive error handling