
Observability and Monitoring

Overview

This document describes the observability strategy for the ALPR platform, covering logging, metrics, and alerting across the central server and the edge collectors.

Implementation Details: See Logging PRP for logging implementation patterns, code examples, and best practices.

Log Aggregation Platform

TBD: Log aggregation platform not yet selected.

Options under consideration:

  • Grafana Loki (lightweight, pairs with Prometheus)
  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Datadog (SaaS, APM integration)
  • AWS CloudWatch (if deploying to AWS)

Design logs to be platform-agnostic by following universal structured logging standards.


Logging Standards

Format: Single-Line JSON

All logs MUST be single-line JSON with timestamp as the first field:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","src_ip":"192.168.1.100","path":"/api/v1/events/batch","method":"POST","status_code":201,"duration_ms":45,"request_id":"550e8400-e29b-41d4-a716-446655440000","collector_id":"col_abc123","message":"Batch accepted","events_count":50}

Why Single-Line JSON:

  • ELK: Direct JSON parsing without grok patterns
  • Splunk: Automatic field extraction
  • Datadog: Native JSON parsing
  • Loki: LogQL JSON field extraction
  • CloudWatch: JSON fields filterable in Logs Insights and metric filters
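The format rule can be illustrated with the standard library alone. The following is a minimal sketch, not the structlog-based implementation planned in the checklist: the class name `SingleLineJsonFormatter` and the `extra={"fields": ...}` convention for passing structured fields are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class SingleLineJsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, timestamp first."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        # Dicts preserve insertion order, so "timestamp" is serialized first.
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Assumed convention: structured fields arrive via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        # json.dumps escapes embedded newlines, so the output stays on one line.
        return json.dumps(payload, separators=(",", ":"), default=str)


handler = logging.StreamHandler()
handler.setFormatter(SingleLineJsonFormatter(service="central"))
logger = logging.getLogger("alpr.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Batch accepted", extra={"fields": {"events_count": 50}})
```

Building the payload as an ordered dict is what keeps `timestamp` in the first position; any renderer that sorts keys alphabetically would break that rule.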

Required Fields

| Field | When Required | Type | Description |
|---|---|---|---|
| timestamp | Always (FIRST) | string | ISO 8601 with timezone |
| level | Always | string | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| service | Always | string | "central", "edge", "worker" |
| message | Always | string | Human-readable description |
| src_ip | HTTP requests | string | Client IP address |
| path | HTTP requests | string | Request path |
| method | HTTP requests | string | HTTP method |
| status_code | HTTP requests | number | Response status code |
| duration_ms | Timed operations | number | Duration in milliseconds |
| request_id | When available | string | UUID for correlation |
| user_id | When authenticated | string | User identifier |
| collector_id | Edge operations | string | Collector identifier |
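These rules lend themselves to an automated check, for example in a unit test that inspects emitted log lines. The sketch below is illustrative only: the function names and the `is_inbound_http` flag are assumptions, and it covers only the "Always" and "HTTP requests" rows of the table.

```python
ALWAYS_REQUIRED = ("timestamp", "level", "service", "message")
HTTP_REQUIRED = ("src_ip", "path", "method", "status_code")
VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def missing_fields(entry: dict, is_inbound_http: bool = False) -> list:
    """Return the required field names that are absent from a parsed log entry."""
    required = ALWAYS_REQUIRED + (HTTP_REQUIRED if is_inbound_http else ())
    return [name for name in required if name not in entry]


def validate_entry(entry: dict, is_inbound_http: bool = False) -> list:
    """Return a list of human-readable problems; an empty list means the entry passes."""
    problems = [f"missing field: {name}" for name in missing_fields(entry, is_inbound_http)]
    # Parsed JSON objects keep key order, so the first key can be checked directly.
    if entry and next(iter(entry)) != "timestamp":
        problems.append("timestamp is not the first field")
    if "level" in entry and entry["level"] not in VALID_LEVELS:
        problems.append(f"unknown level: {entry['level']}")
    return problems
```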

Log Levels

| Level | When to Use | Examples |
|---|---|---|
| DEBUG | Development diagnostics | Query details, internal state |
| INFO | Routine operations | Request completed, batch accepted |
| WARNING | Recoverable issues | Backpressure activated, retry attempt |
| ERROR | Operation failures | Database error, upload failed |
| CRITICAL | System failures | Service crash, data corruption |

Timestamp Format

GOOD: 2025-01-15T12:34:56.789Z
GOOD: 2025-01-15T12:34:56.789+00:00
BAD:  1736944496 (Unix epoch - requires transformation)
BAD:  01/15/2025 12:34:56 (non-standard - requires parsing)
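A short sketch of producing the accepted format with Python's `datetime` module. The helper names `iso_utc` and `iso_from_epoch` are made up for this example; the second shows the transformation an epoch value needs before it can be logged.

```python
from datetime import datetime, timedelta, timezone


def iso_utc(dt: datetime) -> str:
    """Format an aware datetime as ISO 8601 UTC with millisecond precision."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: a timezone is required")
    text = dt.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return text.replace("+00:00", "Z")


def iso_from_epoch(seconds: float) -> str:
    """Convert Unix epoch seconds to the accepted string format."""
    return iso_utc(datetime.fromtimestamp(seconds, tz=timezone.utc))


# A timestamp recorded in UTC+1 is normalized to UTC.
local = datetime(2025, 1, 15, 13, 34, 56, 789000, tzinfo=timezone(timedelta(hours=1)))
print(iso_utc(local))              # 2025-01-15T12:34:56.789Z
print(iso_from_epoch(1736944496))  # 2025-01-15T12:34:56.000Z
```

Rejecting naive datetimes keeps a timestamp without timezone information from being logged as if it were UTC.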

Field Naming Convention

Use snake_case consistently:

GOOD: request_id, user_id, status_code, duration_ms
BAD:  requestId, userId, statusCode (camelCase)
BAD:  request-id, user-id (kebab-case)
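The convention can be enforced mechanically. A minimal sketch, assuming hypothetical helpers `is_snake_case` (for lint-style checks) and `to_snake_case` (for normalizing field names that arrive from other systems):

```python
import re

SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """True for names such as request_id or duration_ms."""
    return SNAKE_CASE.fullmatch(name) is not None


def to_snake_case(name: str) -> str:
    """Convert camelCase or kebab-case field names to snake_case."""
    name = name.replace("-", "_")
    # Insert an underscore at each lowercase-or-digit to uppercase boundary.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


print(to_snake_case("statusCode"))  # status_code
print(to_snake_case("request-id"))  # request_id
```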

Log Output Examples

Implementation Details: See Logging PRP for implementation code and patterns.

Central Server

HTTP Request:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","src_ip":"192.168.1.100","path":"/api/v1/events/batch","method":"POST","status_code":201,"duration_ms":145,"request_id":"uuid-here","collector_id":"col_abc123","message":"Request completed","events_count":50}

Error with Stack Trace:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"ERROR","service":"central","src_ip":"192.168.1.100","path":"/api/v1/events/batch","method":"POST","status_code":500,"duration_ms":5023,"request_id":"uuid-here","collector_id":"col_abc123","message":"Database write failed","error_type":"OperationalError","error_message":"connection refused","exc_info":"Traceback..."}

Audit Trail:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","src_ip":"192.168.1.50","path":"/api/v1/watchlist","method":"POST","status_code":201,"request_id":"uuid-here","user_id":"usr_admin_001","message":"Audit: CREATE watchlist_entry","action":"CREATE","entity_type":"watchlist_entry","entity_id":"wl_xyz789","changes":{"plate_number":{"old":null,"new":"ABC123"}}}
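The `changes` object in the audit entry records old and new values per field. One way to derive it from before and after snapshots of an entity is sketched below; the function name `build_changes` is an assumption for illustration, not part of the audit logging service.

```python
from typing import Any, Dict, Optional


def build_changes(old: Optional[dict], new: Optional[dict]) -> Dict[str, Dict[str, Any]]:
    """Return {field: {"old": ..., "new": ...}} for every field whose value differs."""
    old = old or {}
    new = new or {}
    changes: Dict[str, Dict[str, Any]] = {}
    for field in sorted(old.keys() | new.keys()):
        before, after = old.get(field), new.get(field)
        if before != after:
            changes[field] = {"old": before, "new": after}
    return changes


# A CREATE has no prior state, so every old value is None.
print(build_changes(None, {"plate_number": "ABC123"}))
# {'plate_number': {'old': None, 'new': 'ABC123'}}
```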

Storage Operation:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","path":"s3://alpr-images/2025/01/15/abc123_full.jpg","status_code":200,"duration_ms":89,"request_id":"uuid-here","message":"Storage upload completed","operation":"upload","bucket":"alpr-images","object_name":"2025/01/15/abc123_full.jpg","file_size_bytes":85432}

Edge Collector

Detection Capture:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"edge","collector_id":"col_site01","camera_id":"cam_entrance","message":"Plate detected","plate_number":"ABC123","confidence":0.95,"direction":"entering"}

Upload Status:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"edge","collector_id":"col_site01","path":"/api/v1/events/batch","method":"POST","status_code":201,"duration_ms":245,"message":"Batch uploaded","events_count":50}

Connectivity Warning:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"WARNING","service":"edge","collector_id":"col_site01","message":"Central server unreachable","retry_attempt":3,"backoff_seconds":120,"buffer_depth":1500}


Metrics

Central Server Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| alpr_detections_received_total | Counter | collector_id | Total detections received |
| alpr_detections_batch_size | Histogram | collector_id | Detections per batch |
| alpr_request_duration_ms | Histogram | path, method | Request latency |
| alpr_storage_upload_duration_ms | Histogram | bucket | MinIO upload latency |
| alpr_storage_size_bytes | Histogram | bucket | Object sizes |
| alpr_alert_matches_total | Counter | watchlist_id | Watchlist matches |
| alpr_alert_processing_duration_ms | Histogram | - | Alert engine latency |
| alpr_alert_queue_depth | Gauge | - | Pending alerts |
| alpr_websocket_connections | Gauge | - | Active WebSocket clients |
| alpr_db_query_duration_ms | Histogram | query_type | Database latency |
| alpr_collectors_online | Gauge | - | Connected collectors |

Edge Collector Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| edge_detections_total | Counter | camera_id | Total detections |
| edge_buffer_depth | Gauge | - | Detections in SQLite buffer |
| edge_buffer_oldest_seconds | Gauge | - | Age of oldest buffered detection |
| edge_upload_success_total | Counter | - | Successful uploads |
| edge_upload_failure_total | Counter | error_type | Failed uploads |
| edge_upload_duration_ms | Histogram | - | Upload latency |
| edge_camera_online | Gauge | camera_id | Camera connectivity |
| edge_backpressure_active | Gauge | - | Backpressure status |
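The two buffer gauges can both be read from the SQLite buffer with a single query. The sketch below assumes a hypothetical table `detections` with a `captured_at` column holding Unix epoch seconds; the real buffer schema is not specified here, so adapt the names accordingly.

```python
import sqlite3
import time
from typing import Optional


def buffer_gauges(conn: sqlite3.Connection, now: Optional[float] = None) -> dict:
    """Compute current values for edge_buffer_depth and edge_buffer_oldest_seconds."""
    now = time.time() if now is None else now
    # Assumed schema: detections(plate_number TEXT, captured_at REAL).
    depth, oldest = conn.execute(
        "SELECT COUNT(*), MIN(captured_at) FROM detections"
    ).fetchone()
    # An empty buffer has no oldest row; report an age of zero in that case.
    oldest_seconds = 0.0 if oldest is None else max(0.0, now - oldest)
    return {"edge_buffer_depth": depth, "edge_buffer_oldest_seconds": oldest_seconds}
```

The returned values would then be assigned to the corresponding gauges in whichever metrics client the collector uses.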

System Metrics

Collected via node_exporter or equivalent:

| Metric | Source | Alert Threshold |
|---|---|---|
| CPU usage | node_exporter | >80% sustained |
| Memory usage | node_exporter | >85% |
| Disk usage | node_exporter | >80% |
| Network I/O | node_exporter | Anomaly detection |

Alerting Rules

Critical Alerts (Immediate Response)

| Alert | Condition | Response |
|---|---|---|
| Central server down | No heartbeat response for 5 min | Page on-call |
| Database unreachable | Connection failures for 2 min | Page on-call |
| MinIO unreachable | Upload failures for 5 min | Page on-call |
| Alert queue overflow | Queue depth > 10,000 for 10 min | Page on-call |
| Disk full | >95% usage | Page on-call |
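Several of these conditions fire only when a breach is sustained for a period, similar in spirit to the `for` clause in Prometheus alerting rules. The sketch below shows that logic for the alert queue overflow rule; the function names and the sample layout (time-ordered `(timestamp, value)` pairs) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def breach_duration(samples, threshold) -> timedelta:
    """How long the most recent samples have continuously exceeded the threshold.

    samples: (datetime, value) pairs in ascending time order.
    """
    start = None
    # Walk backwards from the newest sample until the condition stops holding.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        start = ts
    if start is None:
        return timedelta(0)
    return samples[-1][0] - start


def is_firing(samples, threshold, hold_for: timedelta) -> bool:
    """True once the breach has lasted at least hold_for."""
    return breach_duration(samples, threshold) >= hold_for


t0 = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
# One sample per minute: the queue depth exceeds 10,000 from minute 1 onward.
queue_depth = [(t0 + timedelta(minutes=i), 5000 if i == 0 else 12000) for i in range(12)]
print(is_firing(queue_depth, 10_000, timedelta(minutes=10)))  # True
```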

High Priority Alerts (1-Hour Response)

| Alert | Condition | Response |
|---|---|---|
| Collector offline | No heartbeat for 10 min | Investigate |
| High error rate | >5% requests failing | Investigate |
| Slow requests | p99 latency > 5s | Investigate |
| Backpressure sustained | Active for > 30 min | Scale resources |
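The error-rate and latency conditions can be checked against a window of request records as sketched below. Two assumptions are made for this example: a request counts as "failing" when its status code is 5xx, and p99 uses the nearest-rank method. Neither definition is fixed by this document.

```python
import math


def error_rate(status_codes) -> float:
    """Fraction of requests that returned a 5xx status (assumed definition of failing)."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if code >= 500)
    return failures / len(status_codes)


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def high_priority_alerts(status_codes, durations_ms) -> list:
    """Evaluate the error-rate and p99 conditions for one window of requests."""
    alerts = []
    if error_rate(status_codes) > 0.05:
        alerts.append("High error rate")
    if durations_ms and percentile(durations_ms, 99) > 5000:
        alerts.append("Slow requests")
    return alerts
```

In production these values would normally come from the `alpr_request_duration_ms` histogram rather than from raw samples; the sketch only makes the thresholds concrete.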

Warning Alerts (Next Business Day)

| Alert | Condition | Response |
|---|---|---|
| Collector degraded | Heartbeat but cameras offline | Schedule maintenance |
| Disk usage high | >80% | Plan cleanup/expansion |
| Certificate expiring | <30 days | Renew certificate |
| Unusual traffic | 10x normal volume | Review for issues |
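The collector offline and collector degraded conditions together imply a three-state status. A minimal sketch follows, assuming that "cameras offline" means at least one camera is offline; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone

OFFLINE_AFTER = timedelta(minutes=10)  # "No heartbeat for 10 min"


def collector_status(last_heartbeat: datetime, cameras_online: int,
                     cameras_total: int, now: datetime) -> str:
    """Classify a collector as online, degraded, or offline."""
    if now - last_heartbeat >= OFFLINE_AFTER:
        return "offline"
    # Assumption: any offline camera marks the collector as degraded.
    if cameras_online < cameras_total:
        return "degraded"
    return "online"


now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
print(collector_status(now - timedelta(minutes=2), 3, 4, now))  # degraded
```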

Dashboards

Operations Dashboard

Panels:

  • Events ingested (rate, 24h total)
  • Active collectors (count, map view)
  • Request latency (p50, p95, p99)
  • Error rate (%)
  • Alert queue depth
  • Storage usage

Collector Health Dashboard

Panels:

  • Collector status table (online/degraded/offline)
  • Buffer depth per collector
  • Upload success rate per collector
  • Camera status per collector
  • Last heartbeat times

Security Dashboard

Panels:

  • Authentication failures (rate, by IP)
  • Authorization denials (rate, by user)
  • API key usage (requests per key)
  • Unusual access patterns


Log Retention

| Log Type | Retention | Storage |
|---|---|---|
| Application logs | 30 days | Log aggregator |
| Audit logs | 7 years | PostgreSQL + log aggregator |
| Security logs | 1 year | Log aggregator |
| Debug logs | 7 days | Local only |
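A cleanup job could encode these retention periods as data. The sketch below is illustrative: the dictionary keys and the `is_expired` helper are assumptions, and years are approximated as 365 days, which ignores leap days.

```python
from datetime import datetime, timedelta, timezone

RETENTION = {
    "application": timedelta(days=30),
    "audit": timedelta(days=7 * 365),  # assumption: 7 years approximated as 2555 days
    "security": timedelta(days=365),   # assumption: 1 year approximated as 365 days
    "debug": timedelta(days=7),
}


def is_expired(log_type: str, logged_at: datetime, now: datetime) -> bool:
    """True when an entry is older than the retention period for its log type."""
    return now - logged_at > RETENTION[log_type]


now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
print(is_expired("debug", now - timedelta(days=8), now))        # True
print(is_expired("application", now - timedelta(days=8), now))  # False
```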

Implementation Checklist

Logging Setup

Implementation Details: See Logging PRP for implementation patterns and code examples.

  • [ ] Configure structlog with JSON renderer
  • [ ] Implement timestamp-first processor
  • [ ] Set up request ID generation/propagation
  • [ ] Create logging middleware for FastAPI
  • [ ] Wrap storage operations with logging
  • [ ] Implement audit logging service
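For the request ID item, one possible approach is a context variable that middleware sets once per request and that log calls read. This is a sketch using only the standard library; the helper names are hypothetical, and how the incoming ID reaches the server (for example, which header carries it) is left open.

```python
import contextvars
import uuid
from typing import Optional

# Holds the ID for the current request or task; None outside a request.
_request_id = contextvars.ContextVar("request_id", default=None)


def bind_request_id(incoming: Optional[str] = None) -> str:
    """Reuse an upstream ID if one was supplied, otherwise generate a UUID4."""
    request_id = incoming or str(uuid.uuid4())
    _request_id.set(request_id)
    return request_id


def current_request_id() -> Optional[str]:
    """Return the ID bound in the current context, if any."""
    return _request_id.get()


def with_request_id(fields: dict) -> dict:
    """Add request_id to a dict of log fields when one is bound."""
    request_id = current_request_id()
    return {**fields, "request_id": request_id} if request_id else dict(fields)
```

Context variables are isolated per asyncio task, which is why they suit an async framework such as FastAPI better than a module-level global would.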

Metrics Setup

  • [ ] Add Prometheus client library
  • [ ] Instrument HTTP endpoints
  • [ ] Instrument database queries
  • [ ] Instrument storage operations
  • [ ] Instrument alert processing
  • [ ] Add edge collector metrics

Alerting Setup

  • [ ] Define alerting rules (Prometheus/Grafana)
  • [ ] Configure notification channels
  • [ ] Document runbooks for each alert
  • [ ] Test alert delivery

Dashboard Setup

  • [ ] Create operations dashboard
  • [ ] Create collector health dashboard
  • [ ] Create security dashboard
  • [ ] Configure dashboard refresh rates

Decision Date: 2025-12-29
Status: Draft (log aggregation platform TBD)
Rationale: Platform-agnostic logging standards enable flexibility in tooling selection