Further AIP and SIP generation speed improvements #25

Open
@nutjob4life

The Issue

Issue #13 demonstrated how certain data (like the 1.3 TiB insight_cameras) effectively kept sipgen from terminating. We've addressed that with better algorithms and caching, but we can go further.

For example, sipgen still does some redundant XML parsing, and aipgen does some single-threaded hash generation that is limited by the Python GIL. In issue #13 we architected things to include a temporary sqlite3 database that can be shared by multiple processes (using the multiprocessing module, for example); that design is ripe for further optimization.
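To make the "parse once, look up later" idea concrete, here is a minimal sketch. It is not how sipgen is currently written. The `labels` table, its columns, and the helper names `open_cache`, `index_label`, and `lookup` are all hypothetical, and it assumes the fields of interest are elements named `logical_identifier` and `version_id` somewhere in each label.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_cache(db_path):
    """Open (or create) the temporary database; table layout is hypothetical."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS labels "
        "(path TEXT PRIMARY KEY, lid TEXT, vid TEXT)"
    )
    return con


def index_label(con, path):
    """Parse one XML label a single time and cache the fields we need."""
    wanted = {"logical_identifier": None, "version_id": None}
    for elem in ET.parse(path).iter():
        # Drop any "{namespace}" prefix so the match is namespace-agnostic.
        name = elem.tag.rsplit("}", 1)[-1]
        if name in wanted and wanted[name] is None:
            wanted[name] = (elem.text or "").strip()
    con.execute(
        "INSERT OR REPLACE INTO labels (path, lid, vid) VALUES (?, ?, ?)",
        (path, wanted["logical_identifier"], wanted["version_id"]),
    )


def lookup(con, path):
    """Return (lid, vid) from the cache without touching the XML again."""
    return con.execute(
        "SELECT lid, vid FROM labels WHERE path = ?", (path,)
    ).fetchone()
```

Later stages would call `lookup` instead of re-parsing the file, so each label gets parsed once per run.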

Some Ideas

  • Additional use of sqlite3 in sipgen: process XML files just once and store the useful information in multiple tables
  • Multiprocessing: in sipgen, use parallel processes and the shared sqlite3 database to speed up processing
  • Producer-consumer: have multiprocessing workers pick up XML parsing and hash computation tasks as they become available; in aipgen, for example, make one worker walk the directory tree and put files on a queue while multiple other workers take files off it and compute MD5 digests.
  • Streaming: report information as it becomes available so users get feedback that work is progressing instead of wondering whether things are just hanging
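The producer-consumer and streaming ideas could be combined roughly as follows. This is only a sketch, not aipgen's actual code. The names `walk_tree`, `md5_worker`, and `stream_md5` are hypothetical, as are the default worker count and the 1 MiB chunk size. Unreadable files are reported with a digest of `None` rather than stopping the run, which is also an assumption about the desired behavior.

```python
import hashlib
import multiprocessing as mp
import os

_DONE = None  # sentinel: "nothing more is coming on this queue"


def walk_tree(root, paths, n_workers):
    """Producer: queue every file under root, then one sentinel per worker."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.put(os.path.join(dirpath, name))
    for _ in range(n_workers):
        paths.put(_DONE)


def md5_worker(paths, results):
    """Consumer: hash queued files chunk by chunk until the sentinel arrives."""
    while True:
        path = paths.get()
        if path is _DONE:
            break
        try:
            digest = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            results.put((path, digest.hexdigest()))
        except OSError:
            results.put((path, None))  # report the failure, keep going
    results.put(_DONE)  # tell the parent this worker has finished


def stream_md5(root, n_workers=4, start_method=None):
    """Yield (path, md5) pairs as soon as any worker finishes a file."""
    ctx = mp.get_context(start_method)
    paths, results = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=walk_tree, args=(root, paths, n_workers))]
    procs += [
        ctx.Process(target=md5_worker, args=(paths, results))
        for _ in range(n_workers)
    ]
    for proc in procs:
        proc.start()
    finished = 0
    while finished < n_workers:
        item = results.get()
        if item is _DONE:
            finished += 1
        else:
            yield item  # streamed to the caller immediately
    for proc in procs:
        proc.join()
```

Because `stream_md5` is a generator, a caller can print or log each digest as it arrives, which gives users the progress feedback described above. On platforms where multiprocessing uses the spawn start method, the calling code needs to sit beneath an `if __name__ == "__main__":` guard.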

Context

See issue #13 and the commits made against it.
