How to programmatically re-import changing datasets in production? #57

@derhuerst

@dancesWithCycles asked in the berlin-gtfs-rt-server project (which transitively depends on gtfs-via-postgres) how to programmatically import GTFS and switch some running service (e.g. an API) to the newly imported data.

I'll explain the experience I have had with different approaches here. Everyone is very welcome to share theirs and discuss the trade-offs!

why the import needs to be (more or less) atomic

From derhuerst/berlin-gtfs-rt-server#9 (comment):

An alternative approach would be a script that cleans up an existing database without dropping it so that the update happens on a clean database.

With this design, if your script crashes after it has cleaned the DB, you'll leave your service in a non-functional state. Also, even if it runs through, you'll have an unpredictable period of downtime.

separate DBs

I am using a Managed Server where I do not want to drop and create a database every time I update the GTFS feed. I rather drop and create the respective schema.

At the end of the day I need to make sure to prepare a fresh environment for the GTFS feed import into PostgreSQL without dropping the database. How would you do it?

Recently, in postgis-gtfs-importer, I tackled the problem differently by using >1 DBs:

  • For each GTFS import, a new DB named gtfs_$unix_timestamp gets created, and data is imported into it.
  • After a successful import, the newly created DB is marked as the latest in a special "bookkeeping" DB.
  • Before each import, all import DBs other than the latest two are deleted.
  • This whole process is done as atomically as PostgreSQL allows, by combining a transaction and an exclusive lock (see the sketch below).
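
To make that concrete, here is a minimal sketch of the flow, *not* postgis-gtfs-importer's actual code; the bookkeeping DB's name, the `latest_import` table and the timestamp are all illustrative.

```sql
-- minimal sketch, not postgis-gtfs-importer's actual code; run via psql
-- assumed bookkeeping DB: gtfs_bookkeeping, containing
--   CREATE TABLE latest_import (db_name TEXT NOT NULL);

-- 1) create the DB for this import; CREATE DATABASE cannot run inside
--    a transaction, which is why the import itself cannot be atomic
CREATE DATABASE gtfs_1700000000;

-- 2) import the GTFS feed into gtfs_1700000000 (e.g. using gtfs-via-postgres)

-- 3) atomically mark the new DB as the latest one
\connect gtfs_bookkeeping
BEGIN;
-- the exclusive lock serialises concurrent imports
LOCK TABLE latest_import IN ACCESS EXCLUSIVE MODE;
DELETE FROM latest_import;
INSERT INTO latest_import (db_name) VALUES ('gtfs_1700000000');
COMMIT;

-- before the next import, DROP DATABASE all import DBs except the latest two
```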

One problem remains: the consuming program then needs to connect to a DB with a dynamic name. Because we have PgBouncer in place anyway at MobiData BW IPL, we use it to "alias" this dynamically named DB under a stable name (e.g. gtfs). There are a lot of gotchas involved here, though.
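
For illustration, such an alias could look like this in pgbouncer.ini (host, port and DB name are made up); after each import, the entry has to be rewritten and PgBouncer's config reloaded, which is where many of the gotchas come from:

```ini
; sketch of a pgbouncer.ini entry; values are illustrative
[databases]
; clients connect to "gtfs", PgBouncer routes them to the latest import DB
gtfs = host=127.0.0.1 port=5432 dbname=gtfs_1700000000
```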

TLDR: If you do have the option to programmatically create PostgreSQL DBs, for now I recommend using this tool or its underlying process. Otherwise, consider the other options described below.

separate schemas

Now that gtfs-via-postgres, as of version 4.9.0, has gained the ability to import >1 GTFS datasets into 1 DB, one could also adapt the aforementioned import process to use separate schemas instead of separate DBs.

I see the following advantages:

  • It works in hosting environments that don't allow creating an arbitrary number of DBs.
  • We can get rid of the whole PgBouncer hassle (see above).
  • A schema can be created or deleted within a transaction, so the process is truly atomic (see the sketch below). It seems that the "bookkeeping" DB and the exclusive lock wouldn't be necessary anymore.
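
As a sketch of that last point, assuming the consumer always reads from a schema called gtfs and each dataset is first imported into a separate schema (here gtfs_new; all names are illustrative), the switch-over could then be a single transaction:

```sql
-- minimal sketch; assumes the new dataset has already been imported into gtfs_new
BEGIN;
DROP SCHEMA IF EXISTS gtfs CASCADE;   -- drop the previously published dataset
ALTER SCHEMA gtfs_new RENAME TO gtfs; -- publish the new dataset
COMMIT;
```

Because DDL in PostgreSQL is transactional, consumers see either the old schema or the new one, never a half-initialised state.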

However, there are disadvantages:
