ENH, CI: e3sm_io_heatmap_and_dxt.darshan memory usage causes CI error #692

Open
@nawtrey

Description

Background

As discussed in gh-691, e3sm_io_heatmap_and_dxt.darshan is one of the larger logs in the darshan-logs repo, and when generating a summary report it yields an error in the CI:

/home/runner/work/_temp/67ef905d-ab26-4af3-a5d0-5d8bf0d04f00.sh: line 7:  6525 Killed                  pytest --pyargs darshan --cov-report xml --cov=$site_packages/darshan
tests/test_summary.py ...................
Error: Process completed with exit code 137.

This is due to the peak memory usage exceeding the 7 GB memory cap on the Linux CI runners, which Tyler highlighted here. This (unsurprisingly) occurs during the get_heatmap_df() call when generating the darshan summary report, specifically for the DXT_POSIX module, which contains over 600,000 DXT segments.
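As a rough sanity check on the scale involved (a back-of-envelope sketch, not measured numbers from the CI run; the 600,000 × 200 shape comes from Tyler's estimate quoted below, and the exact intermediates inside get_heatmap_df() may differ):

```python
# One dense float64 intermediate of ~600,000 DXT segments x ~200 time bins
# already approaches a gigabyte, and the pipeline materializes more than
# one such structure before the final (ranks x bins) heatmap is reduced.
rows, cols, itemsize = 600_000, 200, 8  # segments x time bins x float64
size_gb = rows * cols * itemsize / 1e9
print(f"{size_gb:.2f} GB per dense copy")  # ~0.96 GB
```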

Solutions

The current workaround is to leave e3sm_io_heatmap_and_dxt.darshan marked xfail in the CI, but other long-term solutions were proposed that should allow us to remove the xfail status for this log:

  • Tyler suggested the following solution here:

    There's probably a way to use a smaller memory buffer to accumulate/bin/combine/process DXT records by rank as they get read in, as long as the time bins/bounds are established early on. pd.get_dummies also supports sparse data, though that may not be the right choice b/c it limits the valid ops that can be used on the data structures.

    Avoiding the generation of intermediate data structures with > 600,000 rows x 200 columns for an input with only 512 ranks seems desirable, at least at some point. Switching from single to multiple reduction operations (summing/combining a few DXT segments instead of all of them at once) is likely slower in Python, so a low-level + concurrent solution may be warranted.

  • Phil proposed an alternate solution here:

    In the long run, for logs with DXT data, we might have to decide if the DXT data should be used automatically (because of its higher fidelity) or if it should be an optional argument, at least when the module data is above some threshold. We could put a message in the warning box along the lines of "DXT data is present in the log but was not used to generate this summary. Rerun the report with --dxt (this may be more memory and compute intensive) to ..."
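A minimal sketch of the streaming/binning idea Tyler describes, assuming the time bounds and bin count are established up front and using a hypothetical per-rank segment layout (the real pydarshan DXT record structure differs):

```python
import numpy as np

def accumulate_heatmap(segments_by_rank, n_ranks, t_min, t_max, n_bins=200):
    """Accumulate DXT segments into a fixed (ranks x time-bins) array,
    one rank at a time, so peak memory stays O(n_ranks * n_bins) rather
    than O(total_segments * n_bins).

    segments_by_rank: iterable of (rank, start_times, end_times, lengths)
    tuples, one entry per DXT segment for that rank (hypothetical shape).
    """
    bin_edges = np.linspace(t_min, t_max, n_bins + 1)
    bin_width = bin_edges[1] - bin_edges[0]
    heatmap = np.zeros((n_ranks, n_bins))
    for rank, starts, ends, lengths in segments_by_rank:
        # Spread each segment's bytes uniformly over the bins it overlaps,
        # instead of materializing one dummy-encoded row per segment.
        for s, e, nbytes in zip(starts, ends, lengths):
            lo = int(np.clip((s - t_min) // bin_width, 0, n_bins - 1))
            hi = int(np.clip((e - t_min) // bin_width, 0, n_bins - 1))
            heatmap[rank, lo:hi + 1] += nbytes / (hi - lo + 1)
    return heatmap
```

Peak memory here is bounded by the output array (about 512 ranks × 200 bins of float64 for this log) regardless of segment count, at the cost of a Python-level loop per segment, which is the slowdown Tyler anticipates a low-level or concurrent implementation would address.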
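Phil's opt-in behavior could look something like the following sketch; the --dxt flag name comes from his comment, while the threshold value and the should_use_dxt helper are hypothetical:

```python
import argparse

# Hypothetical cutoff; the actual threshold would need tuning against
# real logs and the CI memory cap.
DXT_SEGMENT_THRESHOLD = 100_000

def should_use_dxt(n_dxt_segments, force_dxt):
    """Decide whether to build heatmaps from DXT data.

    force_dxt corresponds to a --dxt command-line flag; without it, very
    large DXT module data falls back to the coarser module counters and
    the warning box carries the message Phil suggests.
    """
    if force_dxt:
        return True
    if n_dxt_segments > DXT_SEGMENT_THRESHOLD:
        print("DXT data is present in the log but was not used to generate "
              "this summary. Rerun the report with --dxt (this may be more "
              "memory and compute intensive) to include it.")
        return False
    return True

parser = argparse.ArgumentParser(prog="python -m darshan summary")
parser.add_argument("--dxt", action="store_true",
                    help="use DXT data even when the module is very large")
```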


Labels

CI (continuous integration), enhancement (new feature or request), pydarshan
