Search Architecture¶
Overview¶
Search is a core feature for investigations - users need to quickly find plate sightings across millions of events spanning 7 years of data.
Note: This document includes SQL examples because they define the search capabilities themselves (WHAT queries are supported), not implementation patterns (HOW to build them). The SQL is inseparable from explaining each search type's functionality. See Database PRP for general database patterns and indexing strategies.
Search Types¶
1. Exact Plate Search¶
Find all sightings of a specific plate number.
Implementation: B-tree index on normalized plate column
CREATE INDEX idx_events_plate ON events (plate_number);
-- Query
SELECT * FROM events
WHERE plate_number = 'ABC123'
ORDER BY captured_at DESC;
2. Partial Plate Search (Wildcard)¶
Find plates matching a pattern when full plate is unknown.
Input: ABC* → Matches ABC123, ABC456, ABCXYZ
Input: *123 → Matches ABC123, XYZ123
Input: *BC1* → Matches ABC123, XBC199
Implementation: pg_trgm extension for efficient LIKE queries
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_events_plate_trgm ON events USING gin (plate_number gin_trgm_ops);
-- Query
SELECT * FROM events
WHERE plate_number LIKE 'ABC%'
ORDER BY captured_at DESC;
3. Fuzzy Plate Search (Similarity)¶
Find plates similar to input, accounting for OCR misreads.
Common OCR confusions:
- 0 ↔ O ↔ Q ↔ D
- 1 ↔ I ↔ L ↔ 7
- 5 ↔ S
- 8 ↔ B
- 2 ↔ Z
- 6 ↔ G
Implementation: pg_trgm similarity with threshold
-- Query with similarity
SELECT *, similarity(plate_number, 'ABC123') AS sim
FROM events
WHERE plate_number % 'ABC123' -- Uses trigram similarity
AND similarity(plate_number, 'ABC123') > 0.6
ORDER BY sim DESC, captured_at DESC;
Alternative: Custom OCR-aware normalization
-- Create OCR-normalized column for matching
ALTER TABLE events ADD COLUMN plate_normalized VARCHAR(20);
-- Normalize on insert: O→0, I→1, S→5, etc.
CREATE OR REPLACE FUNCTION normalize_plate_ocr(plate TEXT) RETURNS TEXT AS $$
BEGIN
RETURN UPPER(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
plate,
'O', '0'), 'Q', '0'), 'D', '0'),
'I', '1'), 'L', '1'),
'S', '5')
);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
CREATE INDEX idx_events_plate_norm ON events (plate_normalized);
-- Query matches OCR variants automatically
SELECT * FROM events
WHERE plate_normalized = normalize_plate_ocr('ABC123');
4. Time Range Search¶
Filter by capture timestamp.
SELECT * FROM events
WHERE captured_at BETWEEN '2025-01-01' AND '2025-01-31'
ORDER BY captured_at DESC;
Implementation: B-tree index (covered by partitioning)
5. Geographic Search¶
Find events within an area.
Option A: Bounding Box (Simple)
Option B: Radius Search (PostGIS)
CREATE EXTENSION IF NOT EXISTS postgis;
-- Add geography column
ALTER TABLE events ADD COLUMN location GEOGRAPHY(POINT);
-- Index
CREATE INDEX idx_events_location ON events USING gist (location);
-- Query: within 1km of a point
SELECT * FROM events
WHERE ST_DWithin(
location,
ST_MakePoint(-74.0, 41.0)::geography,
1000 -- meters
);
Recommendation: Start with bounding box (no PostGIS dependency); add PostGIS if radius search is needed.
6. Multi-Criteria Search¶
Combine filters for investigation queries.
SELECT * FROM events
WHERE plate_number LIKE 'ABC%'
AND captured_at BETWEEN '2025-01-01' AND '2025-01-31'
AND collector_id IN ('col_site01', 'col_site02')
AND confidence >= 0.8
ORDER BY captured_at DESC
LIMIT 100 OFFSET 0;
Indexing Strategy¶
Primary Indexes¶
| Column(s) | Index Type | Purpose |
|---|---|---|
plate_number |
B-tree | Exact match |
plate_number |
GIN (pg_trgm) | Wildcard, fuzzy |
plate_normalized |
B-tree | OCR-aware matching |
captured_at |
B-tree | Time range (via partitioning) |
collector_id |
B-tree | Filter by location |
(collector_id, captured_at) |
B-tree | Common compound query |
Partial Indexes¶
For frequent query patterns:
-- High-confidence events only
CREATE INDEX idx_events_high_conf ON events (plate_number, captured_at)
WHERE confidence >= 0.9;
-- Recent events (hot data)
CREATE INDEX idx_events_recent ON events (plate_number, captured_at)
WHERE captured_at > NOW() - INTERVAL '30 days';
Partition Strategy¶
Table partitioned by month for efficient time-range queries:
CREATE TABLE events (
id BIGSERIAL,
captured_at TIMESTAMPTZ NOT NULL,
-- ... other columns
) PARTITION BY RANGE (captured_at);
-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
Benefits: - Time-range queries scan only relevant partitions - Old data deletion is instant (DROP partition) - Indexes are smaller and faster per partition
Query Patterns¶
Pattern 1: Plate History (Timeline)¶
"Show me everywhere this plate has been seen"
SELECT
e.captured_at,
e.plate_number,
e.confidence,
c.name AS collector_name,
c.latitude,
c.longitude
FROM events e
JOIN collectors c ON e.collector_id = c.id
WHERE e.plate_number = 'ABC123'
ORDER BY e.captured_at DESC
LIMIT 100;
Pattern 2: Location Activity¶
"Show me all plates seen at this location today"
SELECT
e.captured_at,
e.plate_number,
e.direction,
e.confidence
FROM events e
WHERE e.collector_id = 'col_site01'
AND e.captured_at >= CURRENT_DATE
ORDER BY e.captured_at DESC;
Pattern 3: Plate Co-occurrence¶
"Find plates frequently seen with this plate"
-- Plates seen within 5 minutes of target plate
WITH target_sightings AS (
SELECT captured_at, collector_id
FROM events
WHERE plate_number = 'ABC123'
)
SELECT
e.plate_number,
COUNT(*) AS co_occurrences
FROM events e
JOIN target_sightings t ON
e.collector_id = t.collector_id
AND e.captured_at BETWEEN t.captured_at - INTERVAL '5 minutes'
AND t.captured_at + INTERVAL '5 minutes'
WHERE e.plate_number != 'ABC123'
GROUP BY e.plate_number
HAVING COUNT(*) >= 3
ORDER BY co_occurrences DESC;
Pattern 4: Route Reconstruction¶
"Show the path this plate traveled"
SELECT
e.captured_at,
c.name,
c.latitude,
c.longitude
FROM events e
JOIN collectors c ON e.collector_id = c.id
WHERE e.plate_number = 'ABC123'
AND e.captured_at BETWEEN '2025-01-15 08:00' AND '2025-01-15 18:00'
ORDER BY e.captured_at ASC;
Returns ordered list of locations for map visualization.
Performance Targets¶
| Query Type | Target Latency | Data Volume |
|---|---|---|
| Exact plate lookup | <100ms | 7 years |
| Partial plate (wildcard) | <500ms | 7 years |
| Fuzzy plate search | <1s | 7 years |
| Time range + plate | <200ms | Any range |
| Location activity (today) | <100ms | 1 day |
Optimization Techniques¶
- LIMIT early - Always paginate, never return unbounded results
- Covering indexes - Include frequently selected columns
- Partition pruning - Always include time range when possible
- Connection pooling - Reuse connections for search queries
- Query caching - Cache common searches (Redis, optional)
API Design¶
Search Endpoint¶
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
plate |
string | Plate number (exact or pattern with *) |
plate_fuzzy |
boolean | Enable fuzzy matching (default: false) |
start_date |
datetime | Start of time range |
end_date |
datetime | End of time range |
collector_id |
string[] | Filter by collector(s) |
min_confidence |
float | Minimum confidence (0.0-1.0) |
direction |
string | "entering" or "leaving" |
lat_min, lat_max |
float | Bounding box latitude |
lon_min, lon_max |
float | Bounding box longitude |
page |
int | Page number (default: 1) |
per_page |
int | Results per page (default: 50, max: 100) |
sort |
string | Sort field (default: captured_at) |
order |
string | "asc" or "desc" (default: desc) |
Response:
{
"data": [
{
"id": "evt_123",
"plate_number": "ABC123",
"captured_at": "2025-01-15T12:34:56Z",
"confidence": 0.95,
"direction": "entering",
"collector": {
"id": "col_site01",
"name": "Main Street"
},
"images": {
"full_url": "https://...",
"crop_url": "https://..."
}
}
],
"meta": {
"page": 1,
"per_page": 50,
"total": 1234,
"total_pages": 25
}
}
Autocomplete Endpoint (Optional)¶
For plate number autocomplete in search UI:
{
"suggestions": [
{"plate": "ABC123", "sightings": 45},
{"plate": "ABC456", "sightings": 12},
{"plate": "ABC789", "sightings": 8}
]
}
Implementation: Materialized view or separate plates table with aggregates
CREATE MATERIALIZED VIEW plate_summary AS
SELECT
plate_number,
COUNT(*) AS sighting_count,
MAX(captured_at) AS last_seen
FROM events
GROUP BY plate_number;
CREATE INDEX idx_plate_summary_prefix ON plate_summary (plate_number text_pattern_ops);
-- Refresh periodically
REFRESH MATERIALIZED VIEW CONCURRENTLY plate_summary;
Future Enhancements¶
Saved Searches¶
Users can save frequent search queries:
CREATE TABLE saved_searches (
id UUID PRIMARY KEY,
user_id UUID NOT NULL,
name VARCHAR(100) NOT NULL,
query_params JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Search Alerts¶
Notify when new matches appear:
CREATE TABLE search_alerts (
id UUID PRIMARY KEY,
user_id UUID NOT NULL,
search_criteria JSONB NOT NULL,
last_checked_at TIMESTAMPTZ,
is_active BOOLEAN DEFAULT true
);
Background worker checks periodically for new matches.
Elasticsearch (Future Scale)¶
If PostgreSQL search becomes a bottleneck at very high scale:
- Sync events to Elasticsearch for full-text search
- Keep PostgreSQL as source of truth
- Use Elasticsearch for complex text queries only
Trigger: Consider when query latency exceeds targets consistently.
Summary¶
| Feature | Implementation | Complexity |
|---|---|---|
| Exact plate search | B-tree index | Low |
| Wildcard search | pg_trgm GIN index | Low |
| Fuzzy/OCR-aware | Normalized column + index | Medium |
| Time range | Table partitioning | Low |
| Geographic (bbox) | Compound index | Low |
| Geographic (radius) | PostGIS | Medium |
| Autocomplete | Materialized view | Medium |
Recommendation for POC: 1. Start with exact + wildcard + time range (covers 90% of use cases) 2. Add OCR-normalized column if fuzzy matching is needed 3. Defer PostGIS until radius search is required 4. Defer Elasticsearch unless PostgreSQL can't meet targets
Decision Date: 2025-12-29 Status: Approved Rationale: PostgreSQL capabilities (pg_trgm, partitioning) sufficient for expected scale; clear upgrade path if needed