Description
The Issue
Issue #13 demonstrated how certain data (like the 1.3 TiB `insight_cameras`) effectively caused `sipgen` to never terminate. We've addressed that with better algorithms and caching, but we can go further. For example, `sipgen` still does some redundant XML parsing, and `aipgen` does single-threaded hash generation that is bottlenecked by the Python GIL. In issue #13 we also architected things to include a temporary `sqlite3` database that can be shared by multiple processes (using the `multiprocessing` module, for example), and that database is ripe for further optimizations.
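To make that concrete, here is a minimal sketch of several worker processes sharing one temporary `sqlite3` database through the `multiprocessing` module. The `files` table, its columns, and the sample jobs are hypothetical placeholders rather than the actual `sipgen` schema; the point is simply that each worker opens its own connection to the same temporary database file.

```python
# Minimal sketch: worker processes sharing one temporary sqlite3 database.
# The "files" table and the sample jobs are hypothetical, not the real schema.
import os
import sqlite3
import tempfile
from multiprocessing import Pool


def init_db(db_path):
    # One writer creates the (hypothetical) table up front.
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER)"
        )


def record_file(job):
    db_path, path, size = job
    # Each worker opens its own connection; sqlite3 serializes writers with
    # file-level locking, and the timeout rides out brief contention.
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        with conn:  # commits on success
            conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, size))
    finally:
        conn.close()


if __name__ == "__main__":
    fd, db_path = tempfile.mkstemp(suffix=".sqlite3")
    os.close(fd)
    init_db(db_path)
    jobs = [(db_path, "a.xml", 123), (db_path, "b.xml", 456), (db_path, "c.xml", 789)]
    with Pool(processes=4) as pool:
        pool.map(record_file, jobs)
    os.unlink(db_path)  # throwaway database, discarded when done
```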
Some Ideas
- Additional use of `sqlite3` in `sipgen`: process each XML file just once and store the useful information in multiple tables (a minimal sketch follows this list).
- Multiprocessing: in `sipgen`, use parallel processes together with the `sqlite3` database to speed up processing (along the lines of the sketch above).
- Producer-consumer: have multiprocessing workers consume XML and hash work as it becomes available; in `aipgen`, for example, have one worker walk the directory tree and push files onto a queue while multiple other workers pull files off for MD5 digests (see the hashing sketch after this list).
- Streaming: report progress as it becomes available so users can see that work is getting done instead of wondering whether things are just hanging; the hashing sketch below streams each digest as it completes.
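As a rough illustration of the first idea, here is a hedged sketch that parses each XML file exactly once and stores what it finds in `sqlite3` tables that later stages can query instead of re-parsing. The `documents` and `elements` tables and the fields pulled from the XML are invented for the example, not the real `sipgen` layout.

```python
# Sketch only: parse each XML file once, keep the results in sqlite3 tables.
# Table names and extracted fields are illustrative, not the real schema.
import sqlite3
import xml.etree.ElementTree as ET


def load_xml_once(xml_paths, db_path):
    conn = sqlite3.connect(db_path)
    with conn:  # one transaction; commits on success
        conn.executescript(
            """
            CREATE TABLE IF NOT EXISTS documents (path TEXT PRIMARY KEY, root_tag TEXT);
            CREATE TABLE IF NOT EXISTS elements  (path TEXT, tag TEXT, text TEXT);
            """
        )
        for path in xml_paths:
            root = ET.parse(path).getroot()  # each file is parsed exactly once
            conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?)", (path, root.tag)
            )
            conn.executemany(
                "INSERT INTO elements VALUES (?, ?, ?)",
                ((path, el.tag, (el.text or "").strip()) for el in root.iter()),
            )
    conn.close()
```

Later stages then query `documents` and `elements` rather than touching the XML again.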
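And here is a hedged sketch of the producer-consumer and streaming ideas for `aipgen`-style hashing: one producer process walks the directory tree and feeds a queue, several consumer processes compute MD5 digests, and each digest is reported the moment it is ready. Every name here is illustrative; a real implementation would presumably record results in the shared `sqlite3` database rather than printing them.

```python
# Sketch only: producer-consumer MD5 hashing with streamed results.
import hashlib
import os
from multiprocessing import Process, Queue, cpu_count

SENTINEL = None  # end-of-work marker


def walk(root, jobs, n_workers):
    # Producer: enqueue every file under root, then tell each worker to stop.
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            jobs.put(os.path.join(dirpath, name))
    for _ in range(n_workers):
        jobs.put(SENTINEL)


def digest(jobs, results):
    # Consumer: hash in a separate process so the work is not serialized
    # behind a single interpreter's GIL.
    while (path := jobs.get()) is not SENTINEL:
        md5 = hashlib.md5()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                md5.update(chunk)
        results.put((path, md5.hexdigest()))
    results.put(SENTINEL)


if __name__ == "__main__":
    jobs, results = Queue(maxsize=1024), Queue()
    n_workers = max(1, cpu_count() - 1)
    Process(target=walk, args=(".", jobs, n_workers)).start()
    for _ in range(n_workers):
        Process(target=digest, args=(jobs, results)).start()
    done = 0
    while done < n_workers:
        item = results.get()
        if item is SENTINEL:
            done += 1
        else:
            print(f"{item[1]}  {item[0]}")  # streamed as each digest completes
```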
Context
See issue #13 and the commits made against it.
Metadata
Status: ToDo