ENH, CI: e3sm_io_heatmap_and_dxt.darshan memory usage causes CI error #692

Open
@nawtrey

Description

Background

As discussed in gh-691, e3sm_io_heatmap_and_dxt.darshan is one of the larger logs in the darshan-logs repo, and when generating a summary report it yields an error in the CI:

/home/runner/work/_temp/67ef905d-ab26-4af3-a5d0-5d8bf0d04f00.sh: line 7:  6525 Killed                  pytest --pyargs darshan --cov-report xml --cov=$site_packages/darshan
tests/test_summary.py ...................
Error: Process completed with exit code 137.

This is due to the peak memory usage exceeding the 7 GB memory cap on the Linux CI runners, which Tyler highlighted here. This (unsurprisingly) occurs during the get_heatmap_df() call when generating the darshan summary report, specifically for the DXT_POSIX module, which contains over 600,000 DXT segments.
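As a rough sanity check on the scale involved (a back-of-envelope sketch, not measured numbers from the CI run; the 600,000 × 200 shape comes from Tyler's estimate quoted below, and the exact intermediates inside get_heatmap_df() may differ):

```python
# One dense float64 intermediate of ~600,000 DXT segments x ~200 time bins
# already approaches a gigabyte, and the pipeline materializes more than
# one such structure before the final (ranks x bins) heatmap is reduced.
rows, cols, itemsize = 600_000, 200, 8  # segments x time bins x float64
size_gb = rows * cols * itemsize / 1e9
print(f"{size_gb:.2f} GB per dense copy")  # ~0.96 GB
```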

Solutions

The current workaround is to leave e3sm_io_heatmap_and_dxt.darshan marked xfail in the CI, but other long-term solutions were proposed that should allow us to remove the xfail status for this log:

  • Tyler suggested the following solution here:

    There's probably a way to use a smaller memory buffer to accumulate/bin/combine/process DXT records by rank as they get read in, as long as the time bins/bounds are established early on. pd.get_dummies also supports sparse data, though that may not be the right choice b/c it limits the valid ops that can be used on the data structures.

    Avoiding the generation of intermediate data structures with > 600,000 rows x 200 columns for an input with only 512 ranks seems desirable, at least at some point. Switching from single to multiple reduction operations (summing/combining a few DXT segments instead of all of them at once) is likely slower in Python, so a low-level + concurrent solution may be warranted.

  • Phil proposed an alternate solution here:

    In the long run, for logs with DXT data, we might have to decide if the DXT data should be used automatically (because of its higher fidelity) or if it should be an optional argument, at least when the module data is above some threshold. We could put a message in the warning box along the lines of "DXT data is present in the log but was not used to generate this summary. Rerun the report with --dxt (this may be more memory and compute intensive) to ..."
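A minimal sketch of the streaming/binning idea Tyler describes, assuming the time bounds and bin count are established up front and using a hypothetical per-rank segment layout (the real pydarshan DXT record structure differs):

```python
import numpy as np

def accumulate_heatmap(segments_by_rank, n_ranks, t_min, t_max, n_bins=200):
    """Accumulate DXT segments into a fixed (ranks x time-bins) array,
    one rank at a time, so peak memory stays O(n_ranks * n_bins) rather
    than O(total_segments * n_bins).

    segments_by_rank: iterable of (rank, start_times, end_times, lengths)
    tuples, one entry per DXT segment for that rank (hypothetical shape).
    """
    bin_edges = np.linspace(t_min, t_max, n_bins + 1)
    bin_width = bin_edges[1] - bin_edges[0]
    heatmap = np.zeros((n_ranks, n_bins))
    for rank, starts, ends, lengths in segments_by_rank:
        # Spread each segment's bytes uniformly over the bins it overlaps,
        # instead of materializing one dummy-encoded row per segment.
        for s, e, nbytes in zip(starts, ends, lengths):
            lo = int(np.clip((s - t_min) // bin_width, 0, n_bins - 1))
            hi = int(np.clip((e - t_min) // bin_width, 0, n_bins - 1))
            heatmap[rank, lo:hi + 1] += nbytes / (hi - lo + 1)
    return heatmap
```

Peak memory here is bounded by the output array (about 512 ranks × 200 bins of float64 for this log) regardless of segment count, at the cost of a Python-level loop per segment, which is the slowdown Tyler anticipates a low-level or concurrent implementation would address.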
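Phil's opt-in behavior could look something like the following sketch; the --dxt flag name comes from his comment, while the threshold value and the should_use_dxt helper are hypothetical:

```python
import argparse

# Hypothetical cutoff; the actual threshold would need tuning against
# real logs and the CI memory cap.
DXT_SEGMENT_THRESHOLD = 100_000

def should_use_dxt(n_dxt_segments, force_dxt):
    """Decide whether to build heatmaps from DXT data.

    force_dxt corresponds to a --dxt command-line flag; without it, very
    large DXT module data falls back to the coarser module counters and
    the warning box carries the message Phil suggests.
    """
    if force_dxt:
        return True
    if n_dxt_segments > DXT_SEGMENT_THRESHOLD:
        print("DXT data is present in the log but was not used to generate "
              "this summary. Rerun the report with --dxt (this may be more "
              "memory and compute intensive) to include it.")
        return False
    return True

parser = argparse.ArgumentParser(prog="python -m darshan summary")
parser.add_argument("--dxt", action="store_true",
                    help="use DXT data even when the module is very large")
```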


Labels

CI (continuous integration), enhancement (new feature or request), pydarshan
