This demo shows how to use LLMs to improve the discoverability of Earth observation datasets:
- Enrich STAC collection descriptions with application-focused context (e.g., agriculture monitoring, disaster response).
- Embed the enriched descriptions and topics as vectors, stored locally in Parquet and queried with DuckDB.
- Retrieve & Refine results for user queries using vector similarity search, with an LLM agent acting as a semantic judge.
The result is a lightweight prototype of a RAG-style system for semantic dataset discovery; rough sketches of each step follow below.
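The enrichment step is essentially prompt engineering over existing STAC metadata. Below is a minimal sketch, assuming the `openai` Python SDK and a collection's title and description already in hand; the prompt wording and model name are placeholders, not the ones used in the notebook.

```python
# Sketch of the enrichment step (assumes OPENAI_API_KEY is set in the environment).
from openai import OpenAI

client = OpenAI()

def enrich_description(title: str, description: str) -> str:
    """Ask the model to rewrite a STAC collection description with application-focused context."""
    prompt = (
        "You help users discover Earth observation datasets.\n"
        f"Collection: {title}\n"
        f"Original description: {description}\n\n"
        "Rewrite the description and list concrete application areas it supports, "
        "e.g. agriculture monitoring or disaster response."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```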
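The embedding step turns the enriched text into vectors and persists them locally. The sketch below assumes OpenAI's `text-embedding-3-small` model and a `pandas`/`pyarrow` stack; the file name and column names are illustrative, not the repo's actual schema.

```python
# Sketch of embedding enriched descriptions and storing them in Parquet.
import pandas as pd
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts; each embedding is a list of floats."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]

# Illustrative rows; in the real pipeline these come from the enriched STAC collections.
collections = pd.DataFrame({
    "collection_id": ["sentinel-2-l2a", "landsat-c2-l2"],
    "enriched_description": ["...enriched text...", "...enriched text..."],
})
collections["embedding"] = embed(collections["enriched_description"].tolist())
collections.to_parquet("collections.parquet")  # DuckDB can query this file directly
```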
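Retrieval then runs a similarity query in DuckDB and refines the candidates with an LLM. In this sketch, `list_cosine_similarity` and the yes/no judging prompt are my assumptions; the notebook uses an LLM agent as the semantic judge, which a single chat call stands in for here.

```python
# Sketch of retrieve & refine over the Parquet file written above.
import duckdb
from openai import OpenAI

client = OpenAI()
query = "monitoring vegetation stress during droughts"
qvec = client.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding

# Retrieve: rank stored collections by cosine similarity to the query embedding.
candidates = duckdb.execute(
    """
    SELECT collection_id, enriched_description,
           list_cosine_similarity(embedding, ?) AS score
    FROM 'collections.parquet'
    ORDER BY score DESC
    LIMIT 5
    """,
    [qvec],
).df()

# Refine: a chat call acting as a crude semantic judge over each candidate.
def is_relevant(query: str, description: str) -> bool:
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Query: {query}\nDataset description: {description}\n"
                       "Answer strictly 'yes' or 'no': is this dataset relevant to the query?",
        }],
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")

refined = candidates[candidates["enriched_description"].apply(lambda d: is_relevant(query, d))]
print(refined[["collection_id", "score"]])
```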
Disclaimer: this is the product of some weekend hacking to upskill with these technologies. I'm not claiming this is the best or even the right way to do this.
This project uses `uv` (see installation instructions). After cloning the repo and installing `uv`, run `uv sync` to create a virtual environment and install dependencies.
Note: I developed the Jupyter notebook in VSCode, so `jupyterlab` is not included in the project requirements. To run the notebook within a Jupyter server, run `uv sync --extra jupyterlab` followed by `jupyter lab` to start the server.
This demo uses OpenAI's API, so an API key is required. The notebook is also instrumented with `logfire` for observability; this is entirely optional, but useful for understanding what is happening while the agents run.
Run `cp .env.example .env` and add your OpenAI API key. Optionally, set the Logfire API key.
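For reference, the resulting `.env` might look like the example below. The variable names are assumptions based on the conventional names read by the `openai` and `logfire` libraries; check `.env.example` for the names this project actually uses.

```
# required
OPENAI_API_KEY=sk-...
# optional: enables Logfire tracing
LOGFIRE_TOKEN=...
```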