This post walks you through the process of creating an API on Overture Maps data using DuckDB and the FastAPI framework.

What is the Overture Maps Foundation?

The Overture Maps Foundation, a Joint Development Foundation project, is dedicated to the development of reliable, easy-to-use, and interoperable open map data that will power current and next-generation map products.

The dataset is organized by theme:

- addresses
- base: water, land, land use, infrastructure, land cover
- buildings
- divisions
- places
- transportation

Here are links to read more: Schema, Data, Docs, Examples, and Data Guides.

The latest release of Overture Maps Foundation is now available in GeoParquet format, providing a more efficient way to perform spatial operations when working with subsets of the dataset.
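For example, DuckDB can read a theme straight from Overture's public S3 bucket and fetch only a study-area subset by filtering on the GeoParquet bbox columns. A minimal sketch in Python; `<release>` is a placeholder for an actual Overture release version, and the bbox values approximate the study area used later in this post:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; INSTALL httpfs;")
con.execute("LOAD spatial; LOAD httpfs;")
con.execute("SET s3_region = 'us-west-2';")  # Overture's public bucket region

# The bbox struct column lets DuckDB skip row groups outside the window,
# so only a small slice of the full dataset is actually downloaded.
rows = con.execute("""
    SELECT id, ST_AsText(ST_GeomFromWKB(geometry)) AS wkt
    FROM read_parquet('s3://overturemaps-us-west-2/release/<release>/theme=buildings/type=building/*',
                      hive_partitioning=1)
    WHERE bbox.xmin > -95.745 AND bbox.xmax < -95.684
      AND bbox.ymin >  29.594 AND bbox.ymax <  29.662
""").fetchdf()
```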

The Overture dataset adopts the Global Entity Reference System (GERS). GERS is a framework for structuring, encoding, and matching map data to a shared universal reference. Every feature in Overture has a unique, stable ID (called an Overture ID or GERS ID) that is registered to GERS. GERS provides a mechanism to match features across datasets, track data stability, and detect errors in the data.

What is DuckDB?

DuckDB is an analytics tool that allows you to query remote files and download only the data you want.

DuckDB is a powerful analytical database designed for efficient handling of large datasets. It is particularly notable for its support of geospatial data through its spatial extension and its seamless integration with Parquet, a popular columnar storage format.

DuckDB is often called “SQLite for analytics.” It operates as an in-process database, making it simple to embed within larger applications, such as directly querying pandas DataFrame objects without the need for network or inter-process communication. Additionally, DuckDB offers a command-line interface (CLI) for direct querying from the terminal.
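A quick illustration of the in-process model (a minimal sketch; the sample data is made up):

```python
import duckdb
import pandas as pd

# An ordinary in-memory DataFrame...
df = pd.DataFrame({"city": ["Houston", "Austin"], "population": [2300000, 970000]})

# ...queried directly by name: DuckDB resolves the local variable `df`
# via a replacement scan, with no copy and no server round trip.
print(duckdb.sql("SELECT city FROM df WHERE population > 1000000").fetchall())
# [('Houston',)]
```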

DuckDB is not intended for use in applications that require concurrent write access or strict transactional integrity, as it is tailored for analytical workloads. Although it can spill data to disk when memory is exceeded, it performs best with datasets that mostly fit in RAM (tip: SET preserve_insertion_order = false for better memory handling), limiting its effectiveness for processing data at the terabyte scale. Moreover, DuckDB is not designed for streaming purposes or real-time analytics on constantly updating streams.
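The setting mentioned above is a single pragma (sketch; the database file name is illustrative):

```python
import duckdb

con = duckdb.connect("analysis.duckdb")  # illustrative file name
# Allowing DuckDB to reorder rows relaxes a constraint that otherwise
# forces buffering, lowering memory use on large loads and exports.
con.execute("SET preserve_insertion_order = false;")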

Project Goal

The aim of this project is to build an API for analytics on the latest data (buildings, POIs, addresses, infrastructure) from the Overture Maps Foundation, using DuckDB.

Data Prep

  1. Install DuckDB locally

  2. I used the Python command-line tool to download the Overture Maps data.

    pip install overturemaps

  3. Download the data in GeoParquet format.

    overturemaps download --bbox=-95.74443405978232,29.594188821735415,-95.68407326319057,29.661140215266016 -f geoparquet --type=building -o buildings.parquet

  4. I chose the study/research area below. The study area falls in Fort Bend County, TX.

    study-area

  5. The envelope of the study area is below:

    study-area-envelope

  6. On analyzing the Overture Maps data for the building theme, I found the attribute data limited, so I wanted to enrich it with local county parcel data from the state government. Merging the county data with the building dataset provided rich attributes for my analysis.

Our approach focuses on joining two sets of geospatial polygons: we simply need to determine whether a building is contained within a parcel shape. However, there is a small catch to consider: buildings located on the boundaries of a parcel should not be assigned to multiple parcels. To handle this, I used a straightforward method: finding the building's center point with the ST_Centroid function and then checking whether this point falls within a parcel shape using the ST_Within function.

```sql
COPY (
    SELECT
        buildings.id AS gers,
        parcel.ownername AS owner,
        parcel.quickrefid AS propertyid,
        parcel.oaddr1 AS address,
        parcel.totalvalue AS totalvalue,
        parcel.landvalue AS landvalue,
        parcel.yearbuilt AS yearbuilt,
        parcel.landsizeac AS landsizeinacre,
        parcel.totsqftlvg AS totalsqftliving,
        parcel.ownerstate AS ownerstate,
        parcel.impvalue AS improvementvalue,
        parcel.legal AS legaldescription,
        parcel.taxunits AS tax,
        parcel.nbhdcode AS neighborhoodcode,
        ST_GeomFromWKB(buildings.geometry) AS building_geom
    FROM '/path/to/buildings.parquet' AS buildings
    JOIN ST_Read('/path/to/PARCELS.shp') AS parcel
        ON ST_Within(ST_Centroid(ST_GeomFromWKB(buildings.geometry)), parcel.geom)
) TO '/path/to/output.csv';
```
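One note on the query above: ST_Read, ST_Centroid, and friends come from DuckDB's spatial extension, which has to be installed once and loaded in each session before the COPY will run. From Python that looks like this (the SQL file name is hypothetical):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")  # one-time download of the extension
con.execute("LOAD spatial;")     # must be loaded in every new session
# The COPY ... TO query above can then be executed verbatim, e.g.:
# con.execute(open("join_buildings_parcels.sql").read())
```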

What am I analyzing?

For the selected study area, I set a goal of answering the following queries.

  1. Determine the number of residential houses and classify the data based on land value, total value, year built, total living square footage, and land size of the houses. This provides a good demographic overview of the locality.

  2. Identify the number of commercial establishments.

  3. Identify property owners who are in-state and out-of-state.

  4. Identify residential houses within a specified area.

  5. Generate a list of residential houses with swimming pools, useful for pool service or grill companies.

  6. Assess the impact on nearby houses and buildings in case man-made lakes or streams overflow during flood season; relevant for home insurance companies.

  7. Identify establishments within a buffer radius of a given point.

  8. Find the shortest path to a specified destination.

  9. Identify parking spaces and perform other spatial queries.

Development Environment Setup

Here are the steps to set up your environment:

  1. Here is another article of mine on how to set up pixi.

  2. Folder structure for this project

    project-structure

    • __init__.py is the main entry point (a minimal sketch follows)
    • the routers folder has routers for places, buildings, infrastructure, statistics, and impact analysis
    • the studyarea folder has all data required for loading into DuckDB
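    A minimal sketch of how the entry point might wire things up (module and router names follow the folder structure above but are otherwise illustrative):

```python
# Illustrative entry point: create the app and mount the routers.
import uvicorn
from fastapi import FastAPI

from routers import buildings, infrastructure, places, statistics  # hypothetical modules

app = FastAPI(title="Pecan Grove API")
for module in (places, buildings, infrastructure, statistics):
    app.include_router(module.router)

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
```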
  3. Here is the list of dependencies installed:

    Library             Version
    -------             -------
    python              3.11
    uvicorn             >=0.30.6,<0.31
    debugpy             >=1.8.5,<2
    duckdb              >=1.0.0,<2
    fastapi             >=0.110.0,<0.111
    pydantic-settings   >=2.4.0,<3
    pyarrow             >=17.0.0
    pyproj              >=3.6.1
  4. To run or debug a simple app in VS Code, click “Run and Debug” or press F5, and VS Code will run the active file. For more detailed debugging, create a launch configuration file to save your settings. This information is stored in a launch.json file in the .vscode folder of your workspace or in your user settings. I named the API “Pecan Grove API”.

    launch-config

  5. Run pixi shell to activate the project in the terminal. You can view all environment variables using pixi shell-hook.

  6. If all the above steps worked, you can run and debug the code using F5. You should see the API like this in the browser:

    API

  7. The GET endpoint /find-places should return the list of places (251 unique categories available for search) matching the query:

    Find-places
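    A hedged sketch of what this endpoint might look like (the table and column names follow the queries later in this post; the database path and parameter name are illustrative):

```python
import duckdb
from fastapi import APIRouter, Query

router = APIRouter()
con = duckdb.connect("studyarea/studyarea.duckdb")  # illustrative path
con.execute("LOAD spatial;")

@router.get("/find-places")
def find_places(category: str = Query(..., description="one of the 251 place categories")):
    # Parameterized query against the `place` table loaded during data prep;
    # geometry columns are excluded to keep the JSON response lightweight.
    rows = con.execute(
        "SELECT * EXCLUDE (geom, geomwgs84) FROM place WHERE categories_primary = ?",
        [category],
    ).fetchdf()
    return rows.to_dict(orient="records")
```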

  8. The GET endpoint /building/extent-search finds the list of buildings for a given extent:

    Find-Buildings
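    A sketch of the extent search: the bbox corners are turned into a WKT polygon and tested for intersection against the projected building geometries (parameter, table, and column names are illustrative):

```python
import duckdb
from fastapi import APIRouter

router = APIRouter()
con = duckdb.connect("studyarea/studyarea.duckdb")  # illustrative path
con.execute("LOAD spatial;")

@router.get("/building/extent-search")
def extent_search(xmin: float, ymin: float, xmax: float, ymax: float):
    # Build the extent as a WKT polygon (EPSG:2278 coordinates) and
    # return buildings whose footprint intersects it.
    wkt = (f"POLYGON(({xmin} {ymin}, {xmax} {ymin}, {xmax} {ymax}, "
           f"{xmin} {ymax}, {xmin} {ymin}))")
    rows = con.execute(
        "SELECT gers, address FROM building_copy WHERE ST_Intersects(geom, ST_GeomFromText(?))",
        [wkt],
    ).fetchall()
    return {"count": len(rows), "buildings": rows}
```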

  9. Buildings Search and other classification results

    Buildings-Result

    Buildings with Type and class search

    Infrastructure

    Land and Total Values

    Land and Total Values Result

    Sq Ft Living Ranges

    Land Size in Acres Ranges

  10. Database and spatial queries

    Database

    Buildings Table

  11. Data loading: since Fort Bend County uses EPSG:2278, I projected the data to the same coordinate system for easier comparison.

       
```sql
CREATE TABLE {table_name} AS
SELECT
    {column_selection},
    ST_Transform(ST_FlipCoordinates(ST_GeomFromWKB(geometry)), '{crs_info}', 'EPSG:2278') AS geom,
    ST_GeomFromWKB(geometry) AS geomwgs84
FROM read_parquet('{file_path}');
```
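    The template above is filled in and executed from Python; a condensed sketch of the loader (file, table, and column names are illustrative):

```python
import duckdb

con = duckdb.connect("studyarea/studyarea.duckdb")  # illustrative path
con.execute("INSTALL spatial; LOAD spatial;")

def load_theme(table_name: str, file_path: str, column_selection: str,
               crs_info: str = "EPSG:4326") -> None:
    # Keep two geometry columns: one projected to the county CRS for
    # analysis, one left in WGS84 for display.
    con.execute(f"""
        CREATE TABLE {table_name} AS
        SELECT
            {column_selection},
            ST_Transform(ST_FlipCoordinates(ST_GeomFromWKB(geometry)),
                         '{crs_info}', 'EPSG:2278') AS geom,
            ST_GeomFromWKB(geometry) AS geomwgs84
        FROM read_parquet('{file_path}');
    """)

load_theme("building", "studyarea/buildings.parquet", "id, subtype, class")
```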
               
  12. Some spatial queries

Find a park within a 1-mile radius of a given point:


```sql
SELECT *
FROM place
WHERE
    ST_Within(
        geom,
        ST_Buffer(
            ST_Transform(ST_Point(29.6251357968, -95.7340760595), 'EPSG:4326', 'EPSG:2278'),
            5280.0  -- 1 mile in feet (EPSG:2278 is in US survey feet)
        )
    )
    AND categories_primary = 'park';
```

Classify the land values into ranges:


```sql
SELECT
    CASE
        WHEN landvalue BETWEEN 0 AND 100000 THEN '0-100,000'
        WHEN landvalue BETWEEN 100001 AND 200000 THEN '100,001-200,000'
        WHEN landvalue BETWEEN 200001 AND 300000 THEN '200,001-300,000'
        -- Add more ranges as needed
        ELSE 'Above 300,000' -- any other range above the highest defined range
    END AS landvalue_range,
    COUNT(*) AS count
FROM building_copy
GROUP BY landvalue_range
ORDER BY MIN(landvalue);
```

Residential houses with swimming pools within a buffer distance of 20 ft:


```sql
SELECT
    sp.*,
    b.*
FROM water sp
JOIN building_copy b
    ON ST_Intersects(
        ST_Buffer(sp.geom, 20),  -- 20 ft buffer (EPSG:2278 is in feet)
        b.geom
    )
WHERE
    sp.subtype = 'human_made'
    AND sp.class = 'swimming_pool'
    AND b.subtype = 'residential'
    AND b.class = 'house';
```

Learnings

  1. I automated the data loading from Parquet files into DuckDB using the local coordinate system, EPSG:2278.

  2. The columns were extracted and flattened, and a schema structure was provided for each theme.

  3. During data validation, I noticed several inaccuracies—particularly in the Overture building dataset, where many attributes were missing, especially for man-made structures like swimming pools in our neighborhood.

  4. Spatial queries in DuckDB differed somewhat from those in other databases.

Conclusion

I am developing a front-end application using React to display the data and provide additional details. This project offers insights into Overture Maps data, its schema, and its data quality. DuckDB shows great potential for data analysis and enables the creation of simple dashboards with ease. GitHub repo.

Useful References

  1. DuckDB spatial function reference
  2. Geospatial analysis using DuckDB
  3. DuckDB configuration
  4. DuckDB performance tuning
  5. DuckDB persistent database
  6. Data conversion tool
  7. Kepler.gl for data visualization
  8. DBeaver: working with spatial data
  9. ST_Transform with flipped coordinates
  10. Bounding box
  11. Simon's blog
  12. A step-by-step guide on how to access Overture data in ArcGIS Pro
  13. Reading and writing Parquet in DuckDB
  14. GeoJSON viewer
  15. How to view metadata in DuckDB (thread)
  16. Simple data analysis (SDA): an easy-to-use and high-performance JavaScript library for data analysis
  17. Overture Maps community blog posts
  18. Overture Maps building dataset
  19. Featured repos
  20. QGIS DuckDB plugin