The First Step of Open GeoAgent: A Technical Dive into Finding Useful Geospatial Data Sources
Abstract
Efficiently discovering and preparing optimal geospatial datasets from heterogeneous, large-scale open repositories like Google Earth Engine (GEE), Microsoft Planetary Computer, NASA Earth Data, and AWS Open Data is a critical bottleneck for advanced geospatial analysis and AI applications. Open GeoAgent’s initial phase tackles this by implementing a robust system for discovering and preparing relevant geospatial data sources.
This post details a two-stage architecture:
- A data processing pipeline converting STAC (SpatioTemporal Asset Catalog) metadata to GeoParquet, indexed by DuckDB for high-performance spatio-temporal querying.
- An intelligent retrieval layer using a multi-document Retrieval Augmented Generation (RAG) system built with LlamaIndex to identify task-specific collections and extract necessary parameters.
- Scripts and the resulting GeoParquet data: https://huggingface.co/datasets/Yusin/GeoAgent-Geoparquet
Part 1: High-Performance Data Preparation with STAC, GeoParquet, and DuckDB
The foundation of our data discovery pipeline is the standardized STAC metadata. To optimize for analytical queries, STAC Items from target collections across major open data providers are transformed into GeoParquet files.
1. STAC to GeoParquet Transformation:
- Why GeoParquet? GeoParquet is an open, cloud-optimized, columnar format for geospatial vector data. Its columnar nature is key, allowing efficient storage and partial reads. This structure enables query engines like DuckDB to leverage predicate pushdown (filtering data at the source before reading it into memory) and columnar vectorization for significantly faster I/O and query processing. This is particularly advantageous compared to row-oriented formats or iterating through individual STAC items via an API for large-scale filtering.
- Organization: Typically, each set of GeoParquet files corresponds to STAC items from a specific data collection, maintaining a clear and organized data lake structure. The schema of the GeoParquet files is derived directly from the STAC Item structure, including common metadata fields, asset links, and the geometry. A minimal conversion sketch is shown below.
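The conversion step can be scripted with pystac-client and the stac-geoparquet library. The snippet below is a minimal sketch of this step, assuming STAC Items are fetched from the Planetary Computer STAC API; the collection id, date range, and output path are illustrative rather than the project's actual settings.

```python
# Minimal sketch: fetch STAC Items for one collection and write them as GeoParquet.
# Collection id, date range, and output path are examples only.
import pystac_client
import stac_geoparquet

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    datetime="2023-01-01/2023-12-31",
    max_items=1000,
)
item_dicts = [item.to_dict() for item in search.items()]

# Flatten the STAC Items into a GeoDataFrame and write it out as (Geo)Parquet
gdf = stac_geoparquet.to_geodataframe(item_dicts)
gdf.to_parquet("sentinel-2-l2a_items.parquet")
```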
2. DuckDB for Accelerated Indexing and Querying:
DuckDB, an in-process analytical (OLAP) DBMS, is employed for its exceptional speed, ease of integration (especially its Python bindings and direct Parquet reading capabilities), and rich SQL dialect.
Key DuckDB features utilized:
- Direct Parquet Querying: DuckDB can directly query one or more Parquet files, including those stored in cloud object storage.

```python
import duckdb

conn = duckdb.connect()
conn.sql("SELECT count(id) FROM 'path/to/your/collection_geoparquet/*.parquet';")
```
- Spatial Extension: This extension is crucial for geospatial filtering. After installation (`INSTALL spatial; LOAD spatial;`), powerful spatial SQL operations can be performed.

```python
# Example Python usage with DuckDB and the spatial extension
import duckdb

conn = duckdb.connect()
# It's good practice to install and load extensions once per session if needed.
# For persistent storage, these might be set in the database configuration.
conn.execute("INSTALL spatial;")
conn.execute("LOAD spatial;")

# Define WKT for an Area of Interest (AOI) and a time range
aoi_wkt = "POLYGON((-5.0 47.0, -5.0 48.0, -4.0 48.0, -4.0 47.0, -5.0 47.0))"  # Example polygon
start_time = "2023-03-01T00:00:00Z"
end_time = "2023-08-31T23:59:59Z"

query = f"""
SELECT id, properties_datetime, assets_B04_href, properties_eo_cloud_cover
FROM read_parquet('path_to_your_geoparquet_files/*.parquet', union_by_name=True)
WHERE ST_Intersects(geometry, ST_GeomFromText('{aoi_wkt}'))
  AND properties_datetime >= '{start_time}'
  AND properties_datetime <= '{end_time}'
  -- Cloud cover applies to optical collections (e.g., Sentinel-2);
  -- for SAR collections (e.g., Sentinel-1), filter on "sar:product_type" = 'GRD'
  -- and "sar:instrument_mode" = 'IW' instead.
  AND properties_eo_cloud_cover < 20;
"""
result = conn.execute(query).fetchdf()
print(result.head())
```
- CQL2 to SQL Translation: To maintain compatibility with existing STAC API workflows and allow users to leverage familiar query languages, the `pygeofilter` library, along with its `pygeofilter-duckdb` backend, can parse CQL2-JSON filters and translate them into DuckDB SQL `WHERE` clauses. This allows users to define filters once and apply them to both STAC APIs and the GeoParquet/DuckDB backend.

```python
# Conceptual Python usage with pygeofilter
from pygeofilter.parsers.cql2_json import parse as json_parse
from pygeofilter.backends.duckdb import to_sql_where
from pygeofilter.util import IdempotentDict

cql2_filter = {
    "op": "and",
    "args": [
        {"op": "between", "args": [{"property": "eo:cloud_cover"}, 0, 21]},
        {"op": "between", "args": [{"property": "datetime"}, "2023-02-01T00:00:00Z", "2023-02-28T23:59:59Z"]},
        {"op": "s_intersects", "args": [{"property": "geometry"}, {"type": "Polygon", "coordinates": [[[...]]]}]},
    ],
}

# field_mapping can be used if property names differ from GeoParquet column names
sql_where_clause = to_sql_where(json_parse(cql2_filter), IdempotentDict())
# full_query = f"SELECT * FROM read_parquet('{geoparquet_path}') WHERE {sql_where_clause}"
```
- STAC Item Reconstruction: Query results from DuckDB (often retrieved as Arrow tables or Pandas DataFrames) can be efficiently converted back to `pystac.Item` objects using the `stac_geoparquet.arrow._api.stac_table_to_items` function. This ensures seamless integration with downstream STAC-aware tools and libraries.

```python
# Conceptual snippet: rebuild pystac.Item objects from a DuckDB Arrow result.
# Continues from the spatial-extension example above, where `conn` and `query` are defined.
import json

import pystac
from stac_geoparquet.arrow._api import stac_table_to_items

db_result_arrow_table = conn.execute(query).fetch_arrow_table()
stac_items = []
for item_dict in stac_table_to_items(db_result_arrow_table):
    item_dict["assets"] = json.loads(item_dict["assets"])  # assets may be stored as JSON strings
    stac_items.append(pystac.Item.from_dict(item_dict))
```
This spatio-temporal pre-filtering significantly prunes the search space, making the subsequent semantic retrieval step more focused and computationally feasible.
Part 2: Intelligent Collection Selection via Multi-Document RAG with LlamaIndex
Even after spatio-temporal filtering, multiple data collections might meet the basic criteria. Selecting the most suitable collection for a nuanced task (e.g., “which LiDAR dataset is best for sub-meter vertical accuracy canopy height models in a temperate forest region?”) requires deeper semantic understanding of each collection’s specifications, processing levels, and suitability for specific applications. A Multi-Document Retrieval Augmented Generation (RAG) system, built using LlamaIndex, provides this capability.
1. Architecture Overview:
The system employs a two-tiered agent structure:
- Collection-Specific Agents (`FunctionAgent`): Each candidate geospatial data collection, represented by its detailed metadata, user guides, scientific papers describing it, or even API documentation, is managed by a dedicated LlamaIndex `FunctionAgent`.
- Top-Level Orchestrator Agent (`FunctionAgent` or `ReActAgent`): This agent receives the user's primary query and intelligently routes sub-queries to the appropriate collection-specific agents.
2. Building Collection-Specific Agents:
The process, inspired by LlamaIndex’s multi-document agent patterns, involves creating specialized agents for each data collection:
- Document Processing: For each collection, relevant documents (e.g., landing pages, technical specifications, usage tutorials) are loaded and parsed into nodes using `SentenceSplitter`.
- Index Creation:
  - `VectorStoreIndex`: Built from these nodes using an embedding model (e.g., `OpenAIEmbedding(model="text-embedding-3-small")`). This index allows for semantic search within the collection's documentation (e.g., "find information about radiometric correction").
  - `SummaryIndex`: Optionally, a `SummaryIndex` can be built to generate concise summaries of each document/collection. This is useful for providing quick overviews to the top-level agent or the user.
- Query Engines and Tools:
  - The indices are exposed as query engines (e.g., `vector_index.as_query_engine()`, `summary_index.as_query_engine()`).
  - These query engines are then wrapped into `QueryEngineTool` instances. Each tool is given a name and a description that outlines its specific capabilities (e.g., "Useful for answering specific factual questions about Collection X's processing levels").
- Agent Definition: A `FunctionAgent` (often powered by an LLM like OpenAI's `gpt-4o`) is instantiated for each collection. It is equipped with the tools created above and a system prompt that directs it to use these tools to answer questions exclusively about its assigned data collection, avoiding reliance on prior knowledge. An asynchronous function `build_agent_per_doc` typically encapsulates this logic, as sketched below.
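To make the per-collection construction concrete, here is a minimal sketch of what `build_agent_per_doc` could look like with current LlamaIndex imports; the chunk size, model choices, and tool names are assumptions for illustration, not fixed project settings.

```python
# Minimal sketch of a per-collection agent builder (names and parameters illustrative).
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.tools import QueryEngineTool
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

async def build_agent_per_doc(documents, collection_name: str) -> FunctionAgent:
    llm = OpenAI(model="gpt-4o")
    nodes = SentenceSplitter(chunk_size=1024).get_nodes_from_documents(documents)

    # Semantic search over the collection's documentation
    vector_index = VectorStoreIndex(
        nodes, embed_model=OpenAIEmbedding(model="text-embedding-3-small")
    )
    # Concise overviews of the collection for quick orientation
    summary_index = SummaryIndex(nodes)

    tools = [
        QueryEngineTool.from_defaults(
            query_engine=vector_index.as_query_engine(llm=llm),
            name=f"vector_tool_{collection_name}",
            description=f"Answers specific factual questions about {collection_name}.",
        ),
        QueryEngineTool.from_defaults(
            query_engine=summary_index.as_query_engine(llm=llm),
            name=f"summary_tool_{collection_name}",
            description=f"Provides a high-level summary of {collection_name}.",
        ),
    ]

    return FunctionAgent(
        tools=tools,
        llm=llm,
        system_prompt=(
            f"You answer questions about the {collection_name} data collection. "
            "Always use the provided tools; do not rely on prior knowledge."
        ),
    )
```

Each agent built this way is later exposed to the orchestrator as a single callable tool (Section 3).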
3. Top-Level Orchestrator Agent:
- Tool Aggregation: The specialized collection agents are exposed as tools to the top-level agent. This is done by wrapping the `agent.run` (or `agent.arun` for async) method of each collection agent into a `FunctionTool` using `FunctionTool.from_defaults`. The `description` of each `FunctionTool` is critical; it is often derived from a summary of the collection the agent represents, enabling the top-level agent to make informed decisions about which tool (i.e., which collection agent) to engage for a given query.
- Tool Retrieval and Reranking:
  - An `ObjectIndex.from_objects(all_tools, index_cls=VectorStoreIndex)` is created to index all the collection-agent tools.
  - When the top-level agent receives a query, an initial set of relevant tools (collection agents) is retrieved using `obj_index.as_node_retriever(similarity_top_k=N)`.
  - This retrieved set can be further refined using a postprocessor like `CohereRerank(top_n=M, model="rerank-english-v3.0")` (or similar) to improve the precision of tool selection.
  - A `CustomObjectRetriever` (as demonstrated in LlamaIndex examples) can be implemented. This custom retriever can not only fetch the top N tools post-reranking but also dynamically inject a "comparison" or "query planning" sub-agent/tool. This sub-agent takes the original query and the selected tools as input, enabling it to explicitly compare information from multiple collections if the user's query requires it (e.g., "Compare Sentinel-2 and Landsat 9 for vegetation monitoring").
- Query Execution: The top-level agent (e.g., a `FunctionAgent` or `ReActAgent` with an LLM like `gpt-4o`) then utilizes these selected and reranked tools (i.e., collection agents) to synthesize an answer to the user's query, effectively delegating sub-questions and information-gathering tasks to the most relevant collection specialists. A sketch of this orchestration layer follows this list.
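The sketch below outlines the orchestrator layer under the pattern described above. It assumes a `collection_agents` dict mapping collection ids to (agent, summary) pairs produced by `build_agent_per_doc`; the retrieval depth, tool names, and prompts are illustrative, and the reranking or `CustomObjectRetriever` step would slot in before the retrieved tools are handed to the agent.

```python
# Sketch of the top-level orchestrator (illustrative names; assumes `collection_agents`
# holds (agent, summary) pairs built in the previous step).
import asyncio

from llama_index.core import VectorStoreIndex
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.objects import ObjectIndex
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def make_collection_tool(name: str, agent: FunctionAgent, summary: str) -> FunctionTool:
    async def query_collection(query: str) -> str:
        """Route a sub-query to this collection's dedicated agent."""
        response = await agent.run(query)
        return str(response)

    # The description (a collection summary) is what guides tool selection.
    return FunctionTool.from_defaults(
        async_fn=query_collection, name=f"tool_{name}", description=summary
    )

all_tools = [make_collection_tool(n, a, s) for n, (a, s) in collection_agents.items()]

# Index the tools so only the most relevant collection agents are considered per query
obj_index = ObjectIndex.from_objects(all_tools, index_cls=VectorStoreIndex)
tool_retriever = obj_index.as_retriever(similarity_top_k=3)

async def answer(user_query: str) -> str:
    retrieved_tools = tool_retriever.retrieve(user_query)  # returns the FunctionTool objects
    top_agent = FunctionAgent(
        tools=retrieved_tools,
        llm=OpenAI(model="gpt-4o"),
        system_prompt=(
            "Answer geospatial data discovery questions by querying the most relevant "
            "collection tools; compare collections when the question requires it."
        ),
    )
    response = await top_agent.run(user_query)
    return str(response)

print(asyncio.run(answer("Which collection suits monthly vegetation monitoring over Brittany?")))
```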
4. Output:
The system aims to identify the most suitable collection(s) for the user’s task. Furthermore, it can be prompted to extract key parameters (e.g., specific band names, asset keys for direct access, relevant processing levels, or even Python code snippets for data loading) based on the context retrieved by the RAG pipeline.
Conclusion and Future Work
This two-stage architecture, combining high-performance pre-filtering with DuckDB/GeoParquet and semantic retrieval with a LlamaIndex RAG pipeline, forms a robust and technically sound first step for Open GeoAgent. It addresses the dual challenges of massive data volume and nuanced semantic relevance in data discovery, paving the way for more automated, intelligent, and insightful geospatial data utilization. Future enhancements will focus on loading data from the selected collections into local analysis workflows.