Skip to content

Search Architecture

Overview

Search is a core feature for investigations - users need to quickly find plate sightings across millions of events spanning 7 years of data.

Note: This document includes SQL examples because they define the search capabilities themselves (WHAT queries are supported), not implementation patterns (HOW to build them). The SQL is inseparable from explaining each search type's functionality. See Database PRP for general database patterns and indexing strategies.

Search Types

Find all sightings of a specific plate number.

Input: ABC123
Match: ABC123 only

Implementation: B-tree index on normalized plate column

CREATE INDEX idx_events_plate ON events (plate_number);

-- Query
SELECT * FROM events
WHERE plate_number = 'ABC123'
ORDER BY captured_at DESC;

2. Partial Plate Search (Wildcard)

Find plates matching a pattern when full plate is unknown.

Input: ABC*     → Matches ABC123, ABC456, ABCXYZ
Input: *123     → Matches ABC123, XYZ123
Input: *BC1*    → Matches ABC123, XBC199

Implementation: pg_trgm extension for efficient LIKE queries

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_events_plate_trgm ON events USING gin (plate_number gin_trgm_ops);

-- Query
SELECT * FROM events
WHERE plate_number LIKE 'ABC%'
ORDER BY captured_at DESC;

3. Fuzzy Plate Search (Similarity)

Find plates similar to input, accounting for OCR misreads.

Common OCR confusions: - 0OQD - 1IL7 - 5S - 8B - 2Z - 6G

Input: ABC123
Also finds: A8C123, ABC1Z3, ABCI23

Implementation: pg_trgm similarity with threshold

-- Query with similarity
SELECT *, similarity(plate_number, 'ABC123') AS sim
FROM events
WHERE plate_number % 'ABC123'  -- Uses trigram similarity
  AND similarity(plate_number, 'ABC123') > 0.6
ORDER BY sim DESC, captured_at DESC;

Alternative: Custom OCR-aware normalization

-- Create OCR-normalized column for matching
ALTER TABLE events ADD COLUMN plate_normalized VARCHAR(20);

-- Normalize on insert: O→0, I→1, S→5, etc.
CREATE OR REPLACE FUNCTION normalize_plate_ocr(plate TEXT) RETURNS TEXT AS $$
BEGIN
    RETURN UPPER(
        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
            plate,
            'O', '0'), 'Q', '0'), 'D', '0'),
            'I', '1'), 'L', '1'),
            'S', '5')
    );
END;
$$ LANGUAGE plpgsql IMMUTABLE;

CREATE INDEX idx_events_plate_norm ON events (plate_normalized);

-- Query matches OCR variants automatically
SELECT * FROM events
WHERE plate_normalized = normalize_plate_ocr('ABC123');

Filter by capture timestamp.

SELECT * FROM events
WHERE captured_at BETWEEN '2025-01-01' AND '2025-01-31'
ORDER BY captured_at DESC;

Implementation: B-tree index (covered by partitioning)

-- Partitioning handles this efficiently
-- Queries only scan relevant partitions

Find events within an area.

Option A: Bounding Box (Simple)

SELECT * FROM events
WHERE latitude BETWEEN 41.0 AND 41.1
  AND longitude BETWEEN -74.1 AND -74.0;

Option B: Radius Search (PostGIS)

CREATE EXTENSION IF NOT EXISTS postgis;

-- Add geography column
ALTER TABLE events ADD COLUMN location GEOGRAPHY(POINT);

-- Index
CREATE INDEX idx_events_location ON events USING gist (location);

-- Query: within 1km of a point
SELECT * FROM events
WHERE ST_DWithin(
    location,
    ST_MakePoint(-74.0, 41.0)::geography,
    1000  -- meters
);

Recommendation: Start with bounding box (no PostGIS dependency); add PostGIS if radius search is needed.

Combine filters for investigation queries.

SELECT * FROM events
WHERE plate_number LIKE 'ABC%'
  AND captured_at BETWEEN '2025-01-01' AND '2025-01-31'
  AND collector_id IN ('col_site01', 'col_site02')
  AND confidence >= 0.8
ORDER BY captured_at DESC
LIMIT 100 OFFSET 0;

Indexing Strategy

Primary Indexes

Column(s) Index Type Purpose
plate_number B-tree Exact match
plate_number GIN (pg_trgm) Wildcard, fuzzy
plate_normalized B-tree OCR-aware matching
captured_at B-tree Time range (via partitioning)
collector_id B-tree Filter by location
(collector_id, captured_at) B-tree Common compound query

Partial Indexes

For frequent query patterns:

-- High-confidence events only
CREATE INDEX idx_events_high_conf ON events (plate_number, captured_at)
WHERE confidence >= 0.9;

-- Recent events (hot data)
CREATE INDEX idx_events_recent ON events (plate_number, captured_at)
WHERE captured_at > NOW() - INTERVAL '30 days';

Partition Strategy

Table partitioned by month for efficient time-range queries:

CREATE TABLE events (
    id BIGSERIAL,
    captured_at TIMESTAMPTZ NOT NULL,
    -- ... other columns
) PARTITION BY RANGE (captured_at);

-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

Benefits: - Time-range queries scan only relevant partitions - Old data deletion is instant (DROP partition) - Indexes are smaller and faster per partition


Query Patterns

Pattern 1: Plate History (Timeline)

"Show me everywhere this plate has been seen"

SELECT
    e.captured_at,
    e.plate_number,
    e.confidence,
    c.name AS collector_name,
    c.latitude,
    c.longitude
FROM events e
JOIN collectors c ON e.collector_id = c.id
WHERE e.plate_number = 'ABC123'
ORDER BY e.captured_at DESC
LIMIT 100;

Pattern 2: Location Activity

"Show me all plates seen at this location today"

SELECT
    e.captured_at,
    e.plate_number,
    e.direction,
    e.confidence
FROM events e
WHERE e.collector_id = 'col_site01'
  AND e.captured_at >= CURRENT_DATE
ORDER BY e.captured_at DESC;

Pattern 3: Plate Co-occurrence

"Find plates frequently seen with this plate"

-- Plates seen within 5 minutes of target plate
WITH target_sightings AS (
    SELECT captured_at, collector_id
    FROM events
    WHERE plate_number = 'ABC123'
)
SELECT
    e.plate_number,
    COUNT(*) AS co_occurrences
FROM events e
JOIN target_sightings t ON
    e.collector_id = t.collector_id
    AND e.captured_at BETWEEN t.captured_at - INTERVAL '5 minutes'
                          AND t.captured_at + INTERVAL '5 minutes'
WHERE e.plate_number != 'ABC123'
GROUP BY e.plate_number
HAVING COUNT(*) >= 3
ORDER BY co_occurrences DESC;

Pattern 4: Route Reconstruction

"Show the path this plate traveled"

SELECT
    e.captured_at,
    c.name,
    c.latitude,
    c.longitude
FROM events e
JOIN collectors c ON e.collector_id = c.id
WHERE e.plate_number = 'ABC123'
  AND e.captured_at BETWEEN '2025-01-15 08:00' AND '2025-01-15 18:00'
ORDER BY e.captured_at ASC;

Returns ordered list of locations for map visualization.


Performance Targets

Query Type Target Latency Data Volume
Exact plate lookup <100ms 7 years
Partial plate (wildcard) <500ms 7 years
Fuzzy plate search <1s 7 years
Time range + plate <200ms Any range
Location activity (today) <100ms 1 day

Optimization Techniques

  1. LIMIT early - Always paginate, never return unbounded results
  2. Covering indexes - Include frequently selected columns
  3. Partition pruning - Always include time range when possible
  4. Connection pooling - Reuse connections for search queries
  5. Query caching - Cache common searches (Redis, optional)

API Design

Search Endpoint

GET /api/v1/events/search

Query Parameters:

Parameter Type Description
plate string Plate number (exact or pattern with *)
plate_fuzzy boolean Enable fuzzy matching (default: false)
start_date datetime Start of time range
end_date datetime End of time range
collector_id string[] Filter by collector(s)
min_confidence float Minimum confidence (0.0-1.0)
direction string "entering" or "leaving"
lat_min, lat_max float Bounding box latitude
lon_min, lon_max float Bounding box longitude
page int Page number (default: 1)
per_page int Results per page (default: 50, max: 100)
sort string Sort field (default: captured_at)
order string "asc" or "desc" (default: desc)

Response:

{
  "data": [
    {
      "id": "evt_123",
      "plate_number": "ABC123",
      "captured_at": "2025-01-15T12:34:56Z",
      "confidence": 0.95,
      "direction": "entering",
      "collector": {
        "id": "col_site01",
        "name": "Main Street"
      },
      "images": {
        "full_url": "https://...",
        "crop_url": "https://..."
      }
    }
  ],
  "meta": {
    "page": 1,
    "per_page": 50,
    "total": 1234,
    "total_pages": 25
  }
}

Autocomplete Endpoint (Optional)

For plate number autocomplete in search UI:

GET /api/v1/plates/autocomplete?q=ABC&limit=10
{
  "suggestions": [
    {"plate": "ABC123", "sightings": 45},
    {"plate": "ABC456", "sightings": 12},
    {"plate": "ABC789", "sightings": 8}
  ]
}

Implementation: Materialized view or separate plates table with aggregates

CREATE MATERIALIZED VIEW plate_summary AS
SELECT
    plate_number,
    COUNT(*) AS sighting_count,
    MAX(captured_at) AS last_seen
FROM events
GROUP BY plate_number;

CREATE INDEX idx_plate_summary_prefix ON plate_summary (plate_number text_pattern_ops);

-- Refresh periodically
REFRESH MATERIALIZED VIEW CONCURRENTLY plate_summary;

Future Enhancements

Saved Searches

Users can save frequent search queries:

CREATE TABLE saved_searches (
    id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    name VARCHAR(100) NOT NULL,
    query_params JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

Search Alerts

Notify when new matches appear:

CREATE TABLE search_alerts (
    id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    search_criteria JSONB NOT NULL,
    last_checked_at TIMESTAMPTZ,
    is_active BOOLEAN DEFAULT true
);

Background worker checks periodically for new matches.

Elasticsearch (Future Scale)

If PostgreSQL search becomes a bottleneck at very high scale:

  • Sync events to Elasticsearch for full-text search
  • Keep PostgreSQL as source of truth
  • Use Elasticsearch for complex text queries only

Trigger: Consider when query latency exceeds targets consistently.


Summary

Feature Implementation Complexity
Exact plate search B-tree index Low
Wildcard search pg_trgm GIN index Low
Fuzzy/OCR-aware Normalized column + index Medium
Time range Table partitioning Low
Geographic (bbox) Compound index Low
Geographic (radius) PostGIS Medium
Autocomplete Materialized view Medium

Recommendation for POC: 1. Start with exact + wildcard + time range (covers 90% of use cases) 2. Add OCR-normalized column if fuzzy matching is needed 3. Defer PostGIS until radius search is required 4. Defer Elasticsearch unless PostgreSQL can't meet targets


Decision Date: 2025-12-29 Status: Approved Rationale: PostgreSQL capabilities (pg_trgm, partitioning) sufficient for expected scale; clear upgrade path if needed