Geospatial Data Conflation with DuckDB and Embeddings
Posted on Tue 01 October 2024 in Programming
Drew Breunig presents a fascinating case study of geospatial data conflation: the process of identifying and merging similar records from different sources.
The Challenge: Integrating Restaurant Data
The goal was to connect two data sources:
- Restaurant inspections from Alameda County
- Place data from Overture Maps Foundation
Two datasets with similar information but structured differently.
Modern Stack Tools
- DuckDB as the main engine
- Staging and querying large volumes of data
- Native support for geospatial formats
- Exceptional performance for exploratory analysis
H3 for spatial grouping
-- Group nearby places using H3 hexagons
SELECT h3_cell_to_lat_lng(h3_latlng_to_cell(lat, lng, 9)) as h3_9
FROM places
- Ollama for local embeddings
- ML framework running locally
- Generation of contextual embeddings
- No dependence on external APIs
Three Matching Approaches
- 1. Exact Name Matching
- Result: ~31% matches
- Limitations: Chain names, unit numbers in addresses
2. String Similarity (Jaro-Winkler)
-- Comparison combining name and address
WHERE jaro_winkler_similarity(name1, name2) > 0.8
AND levenshtein(address1, address2) < 5
- Result: ~68% matches
- Disadvantage: Complex SQL with many conditional rules
3. Embedding-Based Matching
# Generate contextual description
description = f"{name} at {address} in {city}"
embedding = ollama.embeddings(description)
- Result: ~71% matches
- Advantages: Simpler pipeline, greater flexibility
- Disadvantage: Longer processing time
Key Insights
- There is no silver bullet - each method has specific strengths
- Local tools are powerful - DuckDB + Ollama allow sophisticated analysis without cloud
- Embeddings are promising - especially for cases with complex context
- Conflation requires iteration - combining multiple techniques improves results
The Future of Data Integration
This work shows how modern tools democratize techniques that previously required complex infrastructure. The combination of:
- Analytical databases (DuckDB)
- Spatial indices (H3)
- Local ML (Ollama)
...opens new possibilities for intelligent data integration.
Original article: Conflating Overture Places Using DuckDB, Ollama, Embeddings, and More