Description
@dancesWithCycles asked in the berlin-gtfs-rt-server project (which transitively depends on gtfs-via-postgres) how to programmatically import GTFS and switch some running service (e.g. an API) to the newly imported data.
I'll describe the experience I have had with different approaches here. Everyone is very welcome to share theirs and to discuss the trade-offs!
why the import needs to be (more or less) atomic
From derhuerst/berlin-gtfs-rt-server#9 (comment):
> An alternative approach would be a script that cleans up an existing database without dropping it so that the update happens on a clean database.
With this design, if your script crashes after it has cleaned the DB, you'll leave your service in a non-functional state. Also, even if it runs through, you'll have an unpredictable period of downtime.
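For illustration, a minimal sketch of such a clean-up-and-reimport script (the table and file names are made up):

```sql
BEGIN;
-- wipe the previous import
TRUNCATE stops, routes, trips CASCADE;
COMMIT;
-- from here on, the service reads an empty DB until the import is done

-- if the script crashes anywhere below, the DB stays empty
\copy stops FROM 'gtfs/stops.txt' CSV HEADER
\copy routes FROM 'gtfs/routes.txt' CSV HEADER
\copy trips FROM 'gtfs/trips.txt' CSV HEADER
```

Even wrapping everything into a single transaction wouldn't help much here: TRUNCATE takes an ACCESS EXCLUSIVE lock, so all readers would block until the whole import has committed.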
separate DBs
> I am using a Managed Server where I do not want to drop and create a database every time I update the GTFS feed. I rather drop and create the respective schema.
>
> At the end of the day I need to make sure to prepare a fresh environment for the GTFS feed import into PostgreSQL without dropping the database. How would you do it?
Recently, in postgis-gtfs-importer, I tackled the problem differently by using >1 DBs:
- For each GTFS import, a new DB named `gtfs_$unix_timestamp` gets created, and the data is imported into it.
- After a successful import, the newly created DB is marked as the latest in a special "bookkeeping" DB.
- Before each import, all import DBs other than the latest two are deleted.
- This whole process is done as atomically as PostgreSQL allows, by combining a transaction and an exclusive lock (see the sketch below).
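As a rough sketch of what such bookkeeping could look like (the table layout and all names here are made up, they're not necessarily what postgis-gtfs-importer actually does):

```sql
-- inside the (hypothetical) "bookkeeping" DB
CREATE TABLE IF NOT EXISTS latest_import (
	-- constrain the table to a single row
	id boolean PRIMARY KEY DEFAULT true CHECK (id),
	db_name text NOT NULL
);

BEGIN;
-- prevent two concurrent imports from racing on the pointer
LOCK TABLE latest_import IN ACCESS EXCLUSIVE MODE;
-- upsert the single row, pointing it at the freshly imported DB
INSERT INTO latest_import (db_name)
VALUES ('gtfs_1700000000')
ON CONFLICT (id) DO UPDATE SET db_name = EXCLUDED.db_name;
COMMIT;
```

Note that `CREATE DATABASE` itself cannot run inside a transaction, so only marking the latest DB is transactional; the exclusive lock serialises concurrent imports.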
One problem remains: The consuming program then needs to connect to a DB with a dynamic name. Because at MobiData BW IPL we have PgBouncer in place anyway, we use it to "alias" this dynamic DB into a stable name (e.g. `gtfs`). There are a lot of gotchas involved here though.
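As a rough idea of the aliasing: it is just an entry in pgbouncer.ini's `[databases]` section, which the import process rewrites after each successful import (host, port and DB names here are placeholders):

```ini
[databases]
; Clients always connect to "gtfs"; after each successful import,
; the importer rewrites dbname= and then reloads PgBouncer.
gtfs = host=127.0.0.1 port=5432 dbname=gtfs_1700000000
```

PgBouncer picks the new mapping up after a RELOAD on its admin console. One of the gotchas: as far as I know, already-established client connections keep talking to the old DB until they reconnect.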
TL;DR: If you do have the option to programmatically create PostgreSQL DBs, for now I recommend using this tool or process. Otherwise, consider the schema-based approach described below.
separate schemas
Now that gtfs-via-postgres has gained the ability to import >1 GTFS datasets into 1 DB with version 4.9.0, one could also adapt the aforementioned import process to use separate schemas instead of separate DBs.
I see the following advantages:
- Hosting environments where creating an arbitrary number of DBs is not allowed are supported.
- We can get rid of the whole PgBouncer hassle (see above).
- A schema can be created or deleted within a transaction, so the process is truly atomic (see the sketch below). It seems that the "bookkeeping" DB and the exclusive lock wouldn't be necessary anymore.
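For illustration, assuming schemas named by timestamp (analogous to the DB names above) and a stable entry point via the DB-level search_path, a switch-over could look like this; if anything fails before the COMMIT, the previous state stays fully intact:

```sql
BEGIN;
-- import target for the new feed version
CREATE SCHEMA gtfs_1700000000;
-- … import all GTFS tables into gtfs_1700000000 here …

-- point consumers at the latest schema; note that this
-- only affects *new* connections
ALTER DATABASE gtfs SET search_path = gtfs_1700000000, public;

-- clean up an old import within the very same transaction
DROP SCHEMA IF EXISTS gtfs_1690000000 CASCADE;
COMMIT;
```

Consumers would pick the new search_path up whenever they (re-)connect; the `ALTER DATABASE … SET` trick is just one way to give them a stable name.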
However, there are disadvantages:
- As outlined in support importing >1 GTFS datasets into one Postgres DB #51 (comment), importing >1 datasets with different versions of gtfs-via-postgres won't be possible.
- If there were a bug in the logic cleaning up old imports, it could break the running service, e.g. by deleting parts of the live/latest schema (see the sketch below).
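To make the cleanup logic defensive against exactly this, one could derive the schemas to delete from a query that always keeps the two most recent imports, instead of hard-coding names (a sketch, assuming the `gtfs_$unix_timestamp` naming from above):

```sql
-- all import schemas except the two most recent ones,
-- judged by the Unix timestamp embedded in their name
SELECT nspname
FROM pg_namespace
WHERE nspname ~ '^gtfs_[0-9]+$'
ORDER BY split_part(nspname, '_', 2)::bigint DESC
OFFSET 2;
```

Each returned name would then be dropped via `DROP SCHEMA … CASCADE`, ideally within the same transaction as the switch-over.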