Cleanup Worker

Overview

The Cleanup Worker is a Dart-based service that runs independently to manage the lifecycle of deployed functions and their container images. It performs automated cleanup of stale functions and removes unused container images from the system.

Purpose: Prevent resource waste by automatically identifying and removing functions that haven't been invoked within a configurable threshold period, and cleaning up their associated container images.

Architecture

The cleanup worker operates as a separate containerized service within the Docker Compose environment:

Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                     Docker Compose Environment                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐  │
│  │   Postgres   │◄───│   Backend    │    │   Cleanup Worker     │  │
│  │   Database   │    │   Service    │    │   (Dart + Cron)      │  │
│  └──────────────┘    └──────────────┘    └──────────┬───────────┘  │
│         ▲                                           │               │
│         │                                           │               │
│         └───────────────────────────────────────────┘               │
│                                                     │               │
│                                           ┌─────────▼───────────┐   │
│                                           │   Python Podman     │   │
│                                           │   Client            │   │
│                                           └─────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

How It Works

The cleanup worker runs as a scheduled cron job. Each run performs cleanup in two phases:

Phase 1: Delete Pending Images

Queries the pending_image_deletions table for entries awaiting deletion
For each entry with retry_count < MAX_RETRY_COUNT:
    Calls the Python Podman client to delete the container image
    On success: Removes the entry from the table
    On failure: Increments retry_count, logs the error, and schedules for retry
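
The Phase 1 loop can be sketched in Dart as follows. This is a minimal in-memory illustration, not the worker's actual code: PendingDeletion, DeleteImage, and deletePendingImages are hypothetical names, the list stands in for the pending_image_deletions table, and the callback stands in for the call to the Python Podman client. Leaving entries that have exhausted their retries in the queue is also an assumption, since the behavior for those entries is not specified here.

```dart
/// One row of pending_image_deletions, reduced to the fields this sketch needs.
class PendingDeletion {
  PendingDeletion(this.imageTag, {this.retryCount = 0});

  final String imageTag;
  int retryCount;
  String? lastError;
}

/// Hypothetical callback that deletes an image and throws on failure.
/// In the worker this role is played by the Python Podman client.
typedef DeleteImage = Future<void> Function(String imageTag);

/// Runs one Phase 1 pass over [queue] and returns the entries that remain.
Future<List<PendingDeletion>> deletePendingImages(
  List<PendingDeletion> queue,
  DeleteImage deleteImage, {
  int maxRetryCount = 3,
}) async {
  final remaining = <PendingDeletion>[];
  for (final entry in queue) {
    if (entry.retryCount >= maxRetryCount) {
      // Retry budget exhausted: skip the entry (assumed behavior).
      remaining.add(entry);
      continue;
    }
    try {
      await deleteImage(entry.imageTag);
      // Success: the entry is dropped from the queue.
    } catch (error) {
      // Failure: record the error and keep the entry for a later run.
      entry.retryCount += 1;
      entry.lastError = error.toString();
      remaining.add(entry);
    }
  }
  return remaining;
}
```

Passing the deletion step in as a callback keeps the retry bookkeeping testable without a Podman socket.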

Phase 2: Identify Stale Functions

Scans the database for functions that match all of the stale criteria:
    Status is active with an active deployment
    The last invocation is more than STALE_THRESHOLD_DAYS days ago, or the function has never been invoked
For each stale function:
    Inserts its image into the pending_image_deletions table with reason stale_function
    Updates function status to pending_cleanup
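
The stale criteria can be expressed as a single predicate. The sketch below is an assumption about how such a check might look in Dart; isStale and its parameter names are hypothetical, and the real worker may evaluate the same conditions in SQL instead.

```dart
/// Returns true when a function matches the stale criteria listed above.
bool isStale({
  required String status,
  required bool hasActiveDeployment,
  required DateTime? lastInvokedAt,
  required DateTime now,
  int staleThresholdDays = 30,
}) {
  // Only active functions with an active deployment are candidates.
  if (status != 'active' || !hasActiveDeployment) return false;
  // A function that has never been invoked counts as stale.
  if (lastInvokedAt == null) return true;
  // Otherwise it is stale when the last invocation predates the cutoff.
  final cutoff = now.subtract(Duration(days: staleThresholdDays));
  return lastInvokedAt.isBefore(cutoff);
}
```

Taking the current time as a parameter rather than calling DateTime.now() inside the function makes the threshold easy to exercise with fixed dates.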

Configuration

Configure the cleanup worker via environment variables:

Variable	Description	Default
DATABASE_URL	PostgreSQL connection string	Required
CLEANUP_CRON_SCHEDULE	Cron expression for cleanup job	0 3 * * * (3 AM daily)
STALE_THRESHOLD_DAYS	Days without invocation before marking as stale	30
PODMAN_SOCKET_PATH	Path to Podman socket for image deletion	/run/podman/podman.sock
PYTHON_CLIENT_PATH	Path to podman_client.py script	/app/podman_client.py
MAX_RETRY_COUNT	Maximum retry attempts for failed deletions	3
LOG_LEVEL	Logging verbosity (debug, info, warning, error)	info
RUN_ON_STARTUP	Execute cleanup immediately on service startup	false
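
As an illustration of how these variables and defaults might be read, the following sketch parses a subset of them into a typed object. WorkerConfig and its fields are hypothetical names chosen for this example; only the variable names and default values come from the table above.

```dart
/// Typed view of some of the cleanup worker's environment variables.
class WorkerConfig {
  WorkerConfig({
    required this.databaseUrl,
    required this.cronSchedule,
    required this.staleThresholdDays,
    required this.maxRetryCount,
    required this.runOnStartup,
  });

  /// Builds a config from an environment map, applying the documented defaults.
  factory WorkerConfig.fromEnvironment(Map<String, String> env) {
    final databaseUrl = env['DATABASE_URL'];
    if (databaseUrl == null || databaseUrl.isEmpty) {
      // DATABASE_URL has no default and must be provided.
      throw ArgumentError('DATABASE_URL is required');
    }
    return WorkerConfig(
      databaseUrl: databaseUrl,
      cronSchedule: env['CLEANUP_CRON_SCHEDULE'] ?? '0 3 * * *',
      // int.parse throws a FormatException on non-numeric input.
      staleThresholdDays: int.parse(env['STALE_THRESHOLD_DAYS'] ?? '30'),
      maxRetryCount: int.parse(env['MAX_RETRY_COUNT'] ?? '3'),
      runOnStartup: (env['RUN_ON_STARTUP'] ?? 'false').toLowerCase() == 'true',
    );
  }

  final String databaseUrl;
  final String cronSchedule;
  final int staleThresholdDays;
  final int maxRetryCount;
  final bool runOnStartup;
}
```

In a real process the map would typically be Platform.environment from dart:io; accepting any Map<String, String> keeps the parsing logic independent of the process environment.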

Database Schema

The worker uses the pending_image_deletions table to track images awaiting deletion:

sql

CREATE TABLE pending_image_deletions (
    id SERIAL PRIMARY KEY,
    uuid UUID UNIQUE NOT NULL DEFAULT uuid_generate_v4(),
    function_id INTEGER NOT NULL REFERENCES functions(id) ON DELETE CASCADE,
    image_tag VARCHAR(255) NOT NULL,
    reason VARCHAR(100) NOT NULL DEFAULT 'stale_function',
    retry_count INTEGER DEFAULT 0,
    last_error TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_attempted_at TIMESTAMP
);

Running Locally

To run the cleanup worker in development:

bash

cd dart_cloud_backend/packages/cleanup_worker

# Install dependencies
dart pub get

# Run with custom configuration
DATABASE_URL="postgres://user:pass@localhost:5432/dart_cloud" \
CLEANUP_CRON_SCHEDULE="*/5 * * * *" \
STALE_THRESHOLD_DAYS="7" \
RUN_ON_STARTUP="true" \
dart run bin/worker.dart

Docker Deployment

Run the cleanup worker in Docker Compose:

bash

cd dart_cloud_backend/deploy
docker-compose up cleanup-worker

The service will automatically connect to the PostgreSQL database and Podman socket defined in the compose configuration.

Logging

The worker uses structured logging to track all operations:

text

[2024-01-15T03:00:00.000Z] INFO: CleanupService: Starting cleanup job
[2024-01-15T03:00:00.100Z] INFO: CleanupService: === Phase 1: Deleting pending images ===
[2024-01-15T03:00:00.200Z] INFO: CleanupService: Found 5 pending image deletions
[2024-01-15T03:00:01.000Z] INFO: ImageDeletionService: Successfully deleted image: func-abc123:v1
[2024-01-15T03:00:05.000Z] INFO: CleanupService: === Phase 2: Identifying stale functions ===
[2024-01-15T03:00:05.100Z] INFO: CleanupService: Found 2 stale functions
[2024-01-15T03:00:05.200Z] INFO: CleanupService: Queued function 'old-function' for cleanup
[2024-01-15T03:00:05.300Z] INFO: CleanupService: Cleanup job completed
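
A line in this shape could be produced by a small formatter such as the one below. This is a sketch that reproduces the layout of the sample output, not the worker's actual logging setup; formatLogLine is a hypothetical helper.

```dart
/// Formats one log line as "[<UTC timestamp>] <LEVEL>: <logger>: <message>".
String formatLogLine(
  DateTime time,
  String level,
  String loggerName,
  String message,
) {
  // Truncate to millisecond precision so the timestamp always has
  // exactly three fractional digits, as in the sample output.
  final utc = DateTime.fromMillisecondsSinceEpoch(
    time.millisecondsSinceEpoch,
    isUtc: true,
  );
  return '[${utc.toIso8601String()}] ${level.toUpperCase()}: '
      '$loggerName: $message';
}
```

With the logging package, a formatter like this would typically be called from a listener on Logger.root.onRecord, passing the record's time, level name, logger name, and message.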

Dependencies

cron - Cron job scheduling for periodic execution
logging - Structured logging for operations tracking
database - Internal database package with entity definitions and migrations

Integration with Backend

The cleanup worker runs alongside the backend service and shares its infrastructure:

Database Access: Shares the same PostgreSQL instance as the backend
Image Deletion: Uses the Python Podman client for container image management
Status Tracking: Updates function status in the database as cleanup progresses
Error Handling: Implements retry logic with exponential backoff for failed deletions
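
One way to combine exponential backoff with the retry_count and last_attempted_at columns is sketched below. The base delay, the cap, and the function names backoffDelay and isDueForRetry are assumptions made for illustration; the worker's actual delays are not documented here.

```dart
/// Delay to wait after [retryCount] failed attempts: base * 2^retryCount,
/// capped at [maxDelay]. The default base and cap are assumed values.
Duration backoffDelay(
  int retryCount, {
  Duration base = const Duration(minutes: 1),
  Duration maxDelay = const Duration(hours: 1),
}) {
  // Clamp the exponent so the shift cannot overflow for large counts.
  final exponent = retryCount.clamp(0, 20).toInt();
  final delay = base * (1 << exponent);
  return delay > maxDelay ? maxDelay : delay;
}

/// Returns true when enough time has passed since the last attempt.
bool isDueForRetry(DateTime? lastAttemptedAt, int retryCount, DateTime now) {
  // An entry that has never been attempted is always due.
  if (lastAttemptedAt == null) return true;
  final nextAttempt = lastAttemptedAt.add(backoffDelay(retryCount));
  return !now.isBefore(nextAttempt);
}
```

Because the worker only runs on its cron schedule, a check like isDueForRetry would decide whether an entry is attempted in the current run or deferred to a later one.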

Benefits

Resource Optimization: Automatically frees up storage by removing unused container images
Cost Reduction: Reduces infrastructure costs by cleaning up stale deployments
Autonomous Operation: Runs independently without manual intervention
Configurable Thresholds: Adjust stale detection based on your usage patterns
Reliable: Retries failed image deletions up to MAX_RETRY_COUNT times and records the last error for each pending entry