ContainerPub

Cleanup Worker

Overview

The Cleanup Worker is a standalone Dart service that manages the lifecycle of deployed functions and their container images. It automatically cleans up stale functions and removes container images that are no longer in use.

Purpose: Prevent resource waste by automatically identifying and removing functions that haven't been invoked within a configurable threshold period, and cleaning up their associated container images.

Architecture

The cleanup worker operates as a separate containerized service within the Docker Compose environment:

┌─────────────────────────────────────────────────────────────────────┐
│                     Docker Compose Environment                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐  │
│  │   Postgres   │◄───│   Backend    │    │   Cleanup Worker     │  │
│  │   Database   │    │   Service    │    │   (Dart + Cron)      │  │
│  └──────────────┘    └──────────────┘    └──────────┬───────────┘  │
│         ▲                                           │               │
│         │                                           │               │
│         └───────────────────────────────────────────┘               │
│                                                     │               │
│                                           ┌─────────▼───────────┐   │
│                                           │   Python Podman     │   │
│                                           │   Client            │   │
│                                           └─────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

How It Works

The cleanup worker runs as a cron-scheduled job. Each run performs a two-phase cleanup:

Phase 1: Delete Pending Images

  1. Queries the pending_image_deletions table for entries awaiting deletion
  2. For each entry with retry_count < MAX_RETRY_COUNT:
    • Calls the Python Podman client to delete the container image
    • On success: Removes the entry from the table
    • On failure: Increments retry_count, logs the error, and schedules for retry
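The Phase 1 loop can be sketched in a few lines. This is a minimal illustration in Python, not the worker's actual Dart code: `PendingDeletion`, `process_pending`, and the `delete_image` callback are hypothetical names, and the list stands in for the pending_image_deletions table. What happens to entries that have exhausted their retries is not stated above; the sketch assumes they are simply skipped and left in place.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PendingDeletion:
    """In-memory stand-in for one pending_image_deletions row (hypothetical)."""
    image_tag: str
    retry_count: int = 0
    last_error: Optional[str] = None


def process_pending(
    entries: list[PendingDeletion],
    delete_image: Callable[[str], None],
    max_retry_count: int = 3,
) -> list[PendingDeletion]:
    """Run one Phase 1 pass and return the entries that remain queued."""
    remaining = []
    for entry in entries:
        if entry.retry_count >= max_retry_count:
            # Assumption: exhausted entries are skipped, not deleted.
            remaining.append(entry)
            continue
        try:
            delete_image(entry.image_tag)
        except Exception as exc:
            # Failure: bump the counter and keep the entry for a later run.
            entry.retry_count += 1
            entry.last_error = str(exc)
            remaining.append(entry)
        # Success: the entry is not carried over (its row would be removed).
    return remaining
```

Passing the deletion step in as a callback keeps the retry bookkeeping testable without a Podman socket.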

Phase 2: Identify Stale Functions

  1. Scans the database for functions matching stale criteria:
    • Status is active with an active deployment
    • The last invocation was more than STALE_THRESHOLD_DAYS days ago, or the function has never been invoked
  2. For each stale function:
    • Inserts its image into the pending_image_deletions table with reason stale_function
    • Updates function status to pending_cleanup
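The stale criteria above reduce to a small predicate. The following Python sketch is illustrative only: the function name `is_stale` and its parameters are assumptions, and the real worker presumably evaluates the same conditions in a database query.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    status: str,
    has_active_deployment: bool,
    last_invoked_at: Optional[datetime],
    now: datetime,
    stale_threshold_days: int = 30,
) -> bool:
    """Return True when a function matches the Phase 2 stale criteria."""
    # Only active functions with an active deployment are candidates.
    if status != "active" or not has_active_deployment:
        return False
    # A function that has never been invoked counts as stale.
    if last_invoked_at is None:
        return True
    # Otherwise compare the last invocation against the threshold.
    return last_invoked_at < now - timedelta(days=stale_threshold_days)


now = datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc)
recently_used = is_stale("active", True, now - timedelta(days=2), now)
long_unused = is_stale("active", True, now - timedelta(days=45), now)
```

Note that, read literally, the "never invoked" rule also matches a function deployed moments ago, so a deployment-age check may be worth confirming in the actual implementation.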

Configuration

Configure the cleanup worker via environment variables:

| Variable              | Description                                      | Default                  |
|-----------------------|--------------------------------------------------|--------------------------|
| DATABASE_URL          | PostgreSQL connection string                     | Required                 |
| CLEANUP_CRON_SCHEDULE | Cron expression for cleanup job                  | 0 3 * * * (3 AM daily)   |
| STALE_THRESHOLD_DAYS  | Days without invocation before marking as stale  | 30                       |
| PODMAN_SOCKET_PATH    | Path to Podman socket for image deletion         | /run/podman/podman.sock  |
| PYTHON_CLIENT_PATH    | Path to podman_client.py script                  | /app/podman_client.py    |
| MAX_RETRY_COUNT       | Maximum retry attempts for failed deletions      | 3                        |
| LOG_LEVEL             | Logging verbosity (debug, info, warning, error)  | info                     |
| RUN_ON_STARTUP        | Execute cleanup immediately on service startup   | false                    |
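To make the defaults concrete, here is a hedged Python sketch of how such settings could be read from the environment. `WorkerConfig` and `load_config` are hypothetical names, and the boolean parsing (only the string "true", case-insensitive, enables RUN_ON_STARTUP) is an assumption; the worker itself is written in Dart and may accept other spellings.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerConfig:
    database_url: str
    cleanup_cron_schedule: str
    stale_threshold_days: int
    podman_socket_path: str
    python_client_path: str
    max_retry_count: int
    log_level: str
    run_on_startup: bool


def load_config(env: Mapping[str, str] = os.environ) -> WorkerConfig:
    """Build a config from environment variables, applying documented defaults."""
    database_url = env.get("DATABASE_URL")
    if not database_url:
        # DATABASE_URL is the only variable without a default.
        raise ValueError("DATABASE_URL is required")
    return WorkerConfig(
        database_url=database_url,
        cleanup_cron_schedule=env.get("CLEANUP_CRON_SCHEDULE", "0 3 * * *"),
        stale_threshold_days=int(env.get("STALE_THRESHOLD_DAYS", "30")),
        podman_socket_path=env.get("PODMAN_SOCKET_PATH", "/run/podman/podman.sock"),
        python_client_path=env.get("PYTHON_CLIENT_PATH", "/app/podman_client.py"),
        max_retry_count=int(env.get("MAX_RETRY_COUNT", "3")),
        log_level=env.get("LOG_LEVEL", "info"),
        # Assumption: only "true" (any case) turns this on.
        run_on_startup=env.get("RUN_ON_STARTUP", "false").strip().lower() == "true",
    )
```

Accepting a mapping instead of reading `os.environ` directly lets the same function be exercised with a plain dictionary.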

Database Schema

The worker uses the pending_image_deletions table to track images awaiting deletion:

CREATE TABLE pending_image_deletions (
    id SERIAL PRIMARY KEY,
    uuid UUID UNIQUE NOT NULL DEFAULT uuid_generate_v4(),  -- requires the uuid-ossp extension
    function_id INTEGER NOT NULL REFERENCES functions(id) ON DELETE CASCADE,
    image_tag VARCHAR(255) NOT NULL,
    reason VARCHAR(100) NOT NULL DEFAULT 'stale_function',
    retry_count INTEGER DEFAULT 0,
    last_error TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_attempted_at TIMESTAMP
);
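The queries a worker might run against this table can be tried out without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module with a simplified copy of the schema (no SERIAL, UUID, or foreign key), so it illustrates the access pattern rather than the production SQL. All function names are hypothetical.

```python
import sqlite3


def open_queue() -> sqlite3.Connection:
    """Create an in-memory queue table (SQLite stand-in for PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE pending_image_deletions (
            id INTEGER PRIMARY KEY,
            function_id INTEGER NOT NULL,
            image_tag TEXT NOT NULL,
            reason TEXT NOT NULL DEFAULT 'stale_function',
            retry_count INTEGER DEFAULT 0,
            last_error TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            last_attempted_at TEXT
        )
        """
    )
    return conn


def queue_image(conn: sqlite3.Connection, function_id: int, image_tag: str) -> int:
    """Queue an image for deletion and return the new row id."""
    cur = conn.execute(
        "INSERT INTO pending_image_deletions (function_id, image_tag) VALUES (?, ?)",
        (function_id, image_tag),
    )
    return cur.lastrowid


def record_failure(conn: sqlite3.Connection, row_id: int, error: str) -> None:
    """Increment retry_count and store the error for a failed attempt."""
    conn.execute(
        "UPDATE pending_image_deletions "
        "SET retry_count = retry_count + 1, last_error = ?, "
        "last_attempted_at = CURRENT_TIMESTAMP WHERE id = ?",
        (error, row_id),
    )


def eligible_entries(conn: sqlite3.Connection, max_retry_count: int = 3):
    """Return (id, image_tag, retry_count) rows still under the retry limit."""
    return conn.execute(
        "SELECT id, image_tag, retry_count FROM pending_image_deletions "
        "WHERE retry_count < ? ORDER BY created_at, id",
        (max_retry_count,),
    ).fetchall()
```

Parameterized statements are used throughout so that image tags and error text are never interpolated into the SQL string.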

Running Locally

To run the cleanup worker in development:

cd dart_cloud_backend/packages/cleanup_worker

# Install dependencies
dart pub get

# Run with custom configuration
DATABASE_URL="postgres://user:pass@localhost:5432/dart_cloud" \
CLEANUP_CRON_SCHEDULE="*/5 * * * *" \
STALE_THRESHOLD_DAYS="7" \
RUN_ON_STARTUP="true" \
dart run bin/worker.dart

Docker Deployment

Run the cleanup worker in Docker Compose:

cd dart_cloud_backend/deploy
docker-compose up cleanup-worker

The service connects to the PostgreSQL database and Podman socket defined in the compose configuration; no additional setup is needed.

Logging

The worker uses structured logging to track all operations:

[2024-01-15T03:00:00.000Z] INFO: CleanupService: Starting cleanup job
[2024-01-15T03:00:00.100Z] INFO: CleanupService: === Phase 1: Deleting pending images ===
[2024-01-15T03:00:00.200Z] INFO: CleanupService: Found 5 pending image deletions
[2024-01-15T03:00:01.000Z] INFO: ImageDeletionService: Successfully deleted image: func-abc123:v1
[2024-01-15T03:00:05.000Z] INFO: CleanupService: === Phase 2: Identifying stale functions ===
[2024-01-15T03:00:05.100Z] INFO: CleanupService: Found 2 stale functions
[2024-01-15T03:00:05.200Z] INFO: CleanupService: Queued function 'old-function' for cleanup
[2024-01-15T03:00:05.300Z] INFO: CleanupService: Cleanup job completed
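Because every line follows the same `[timestamp] LEVEL: Logger: message` shape, the output is easy to parse for monitoring. The Python sketch below is inferred solely from the sample lines above; the regular expression and the names `LogRecord` and `parse_log_line` are assumptions, and the real format may include fields not shown here.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

# Pattern inferred from the sample output: [ISO timestamp] LEVEL: Logger: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+): (?P<logger>\w+): (?P<message>.*)$"
)


class LogRecord(NamedTuple):
    timestamp: datetime
    level: str
    logger: str
    message: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Parse one worker log line, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    # Timestamps in the sample are UTC with millisecond precision.
    timestamp = datetime.strptime(
        match["ts"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return LogRecord(timestamp, match["level"], match["logger"], match["message"])


record = parse_log_line(
    "[2024-01-15T03:00:05.100Z] INFO: CleanupService: Found 2 stale functions"
)
```

Filtering parsed records on the `logger` field separates per-image results (ImageDeletionService) from job-level progress (CleanupService).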

Dependencies

  • cron - Cron job scheduling for periodic execution
  • logging - Structured logging for operations tracking
  • database - Internal database package with entity definitions and migrations

Integration with Backend

The cleanup worker integrates with the backend service at the following points:

  • Database Access: Shares the same PostgreSQL instance as the backend
  • Image Deletion: Uses the Python Podman client for container image management
  • Status Tracking: Updates function status in the database as cleanup progresses
  • Error Handling: Implements retry logic with exponential backoff for failed deletions
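The backoff parameters are not documented here, so the following Python sketch only illustrates the general technique: the wait before the next attempt doubles with each failure, up to a cap. The base delay, the cap, and the names `backoff_delay` and `is_due` are all hypothetical; the comparison against last_attempted_at is one plausible way to use that column, not a confirmed one.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backoff_delay(
    retry_count: int,
    base: timedelta = timedelta(hours=1),   # hypothetical base delay
    cap: timedelta = timedelta(hours=24),   # hypothetical upper bound
) -> timedelta:
    """Delay before the next attempt: base * 2^retry_count, capped."""
    return min(base * (2 ** retry_count), cap)


def is_due(
    retry_count: int,
    last_attempted_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True when enough time has passed to retry a failed deletion."""
    if last_attempted_at is None:
        # Never attempted: eligible immediately.
        return True
    return now - last_attempted_at >= backoff_delay(retry_count)


now = datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc)
retry_now = is_due(2, now - timedelta(hours=5), now)    # delay is 4 hours
retry_later = is_due(2, now - timedelta(hours=3), now)  # still inside the delay
```

With a daily cron schedule, delays shorter than the interval between runs have no practical effect, so real parameters would need to be chosen relative to CLEANUP_CRON_SCHEDULE.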

Benefits

  • Resource Optimization: Automatically frees up storage by removing unused container images
  • Cost Reduction: Reduces infrastructure costs by cleaning up stale deployments
  • Autonomous Operation: Runs independently without manual intervention
  • Configurable Thresholds: Adjust stale detection based on your usage patterns
  • Reliability: Retries failed deletions up to MAX_RETRY_COUNT times and records the last error for each entry