How to programmatically re-import changing datasets in production? #57

@derhuerst

@dancesWithCycles asked in the berlin-gtfs-rt-server project (which transitively depends on gtfs-via-postgres) how to programmatically import GTFS and switch some running service (e.g. an API) to the newly imported data.

I'll explain the experience I have had with different approaches here. Everyone is very welcome to share theirs and discuss the trade-offs!

why the import needs to be (more or less) atomic

From derhuerst/berlin-gtfs-rt-server#9 (comment):

An alternative approach would be a script that cleans up an existing database without dropping it so that the update happens on a clean database.

With this design, if your script crashes after it has cleaned the DB, you'll leave your service in a non-functional state. Also, even if it runs through, you'll have an unpredictable period of downtime.

separate DBs

I am using a Managed Server where I do not want to drop and create a database every time I update the GTFS feed. I rather drop and create the respective schema.

At the end of the day I need to make sure to prepare a fresh environment for the GTFS feed import into PostgreSQL without dropping the database. How would you do it?

Recently, in postgis-gtfs-importer, I tackled the problem differently by using >1 DBs:

  • For each GTFS import, a new DB named gtfs_$unix_timestamp gets created, and data is imported into it.
  • After a successful import, the newly created DB is marked as the latest in a special "bookkeeping" DB.
  • Before each import, all import DBs other than the latest two are deleted.
  • This whole process is done as atomically as PostgreSQL allows, by combining a transaction and an exclusive lock (see the sketch below).
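
To make that concrete, here is a minimal sketch of the flow, *not* postgis-gtfs-importer's actual code; the bookkeeping DB's name, the `latest_import` table and the timestamp are all illustrative.

```sql
-- minimal sketch, not postgis-gtfs-importer's actual code; run via psql
-- assumed bookkeeping DB: gtfs_bookkeeping, containing
--   CREATE TABLE latest_import (db_name TEXT NOT NULL);

-- 1) create the DB for this import; CREATE DATABASE cannot run inside
--    a transaction, which is why the import itself cannot be atomic
CREATE DATABASE gtfs_1700000000;

-- 2) import the GTFS feed into gtfs_1700000000 (e.g. using gtfs-via-postgres)

-- 3) atomically mark the new DB as the latest one
\connect gtfs_bookkeeping
BEGIN;
-- the exclusive lock serialises concurrent imports
LOCK TABLE latest_import IN ACCESS EXCLUSIVE MODE;
DELETE FROM latest_import;
INSERT INTO latest_import (db_name) VALUES ('gtfs_1700000000');
COMMIT;

-- before the next import, DROP DATABASE all import DBs except the latest two
```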

One problem remains: the consuming program then needs to connect to a DB with a dynamic name. Because we have PgBouncer in place anyway at MobiData BW IPL, we use it to "alias" this dynamically named DB under a stable name (e.g. gtfs). There are a lot of gotchas involved here, though.
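
For illustration, such an alias could look like this in pgbouncer.ini (host, port and DB name are made up); after each import, the entry has to be rewritten and PgBouncer's config reloaded, which is where many of the gotchas come from:

```ini
; sketch of a pgbouncer.ini entry; values are illustrative
[databases]
; clients connect to "gtfs", PgBouncer routes them to the latest import DB
gtfs = host=127.0.0.1 port=5432 dbname=gtfs_1700000000
```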

TLDR: If you do have the option to programmatically create PostgreSQL DBs, for now I recommend using this tool or its underlying process. Otherwise, consider the other options described below.

separate schemas

Now that gtfs-via-postgres, as of version 4.9.0, has gained the ability to import >1 GTFS datasets into 1 DB, one could also adapt the aforementioned import process to use separate schemas instead of separate DBs.

I see the following advantages:

  • It works in hosting environments that don't allow creating an arbitrary number of DBs.
  • We can get rid of the whole PgBouncer hassle (see above).
  • A schema can be created or deleted within a transaction, so the process is truly atomic (see the sketch below). It seems that the "bookkeeping" DB and the exclusive lock wouldn't be necessary anymore.
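
As a sketch of that last point, assuming the consumer always reads from a schema called gtfs and each dataset is first imported into a separate schema (here gtfs_new; all names are illustrative), the switch-over could then be a single transaction:

```sql
-- minimal sketch; assumes the new dataset has already been imported into gtfs_new
BEGIN;
DROP SCHEMA IF EXISTS gtfs CASCADE;   -- drop the previously published dataset
ALTER SCHEMA gtfs_new RENAME TO gtfs; -- publish the new dataset
COMMIT;
```

Because DDL in PostgreSQL is transactional, consumers see either the old schema or the new one, never a half-initialised state.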

However, there are disadvantages:
