Project
PUDL
Summary
Some of the most impactful energy datasets on our users’ wish list are trapped in hard-to-access formats like PDFs and require machine learning to extract. We’ve begun building infrastructure to process this data, and now need to scale that work by providing access to archived raw data and defining clear handoff points into the PUDL pipeline. This will make the process more reproducible and contributor-friendly, and let us deliver regularly updated, well-documented datasets.
Submitter
Katie Lamb
Project lead
@katie-lamb
Community benefit
The Public Utility Data Liberation project (PUDL) is an open-source data initiative addressing information asymmetries that hinder rapid and equitable U.S. decarbonization. Operated by Catalyst Cooperative, PUDL exists to make public utility data truly public—freely and openly available, standardized, and accessible to all.
We are seeking support to upgrade PUDL’s infrastructure to handle larger, more complex modeling problems and datasets. Many of our users request high-impact datasets that are currently beyond PUDL’s capacity due to their size, complexity, and reliance on machine learning for extraction. The technical capacity required to process these datasets reproducibly is out of reach for many of the small organizations we work with – making it all the more important that Catalyst’s engineers build an open framework that others can contribute to and benefit from.
Regular, reliable updates to existing large datasets: To date, we have developed ad hoc processes for handling datasets like these, but this limits our ability to provide regular, reliable updates. For example, with support from the Mozilla Foundation we built a model to extract subsidiary company ownership information from PDF attachments to SEC 10-K filings. This ownership data is crucial for understanding the political economies of U.S. corporations, and when combined with PUDL’s existing electricity system data it provides insight into the behavior of U.S. utilities. The data is beginning to see use, but it is difficult to update and maintain because processing is split between an upstream modeling repository and the PUDL pipeline. This fragmentation makes it hard for users and contributors to understand the modeling assumptions and processing steps, reducing trust and increasing maintenance overhead. Establishing clear standards for the state of data at the handoff point between upstream models and the PUDL pipeline would let more shared processing happen within the PUDL pipeline, streamlining updates and ensuring that key datasets remain maintained, well-documented, and broadly useful to the energy and climate research community. These qualities are essential to the organizations we’ve heard from that use SEC 10-K data, such as Global Energy Monitor (GEM), which publishes information on corporate ownership in the energy sector, and researchers at institutions like RMI.
A common infrastructural foundation for high-impact dataset integration: More broadly, the highest-priority and highest-impact datasets PUDL’s users have requested are predominantly large and complex. Future datasets, such as FERC’s Electric Quarterly Reports and information extracted from public utility commission dockets, will also depend on this upgraded infrastructure. Shared tooling will increase the pace of new data integration, make it more cost-effective for funders to support the integration of new data, and decrease the future maintenance burden. Additionally, we will create a versioned archive system for large datasets hosted in Google Cloud Storage, providing access to raw data where bulk access is currently limited or unavailable.
Amount requested
10,000
Execution plan
Funds from the grant will pay for the time of two developers at Catalyst Cooperative: Katie Lamb (@katie-lamb) and Zach Schira (@zschira). At Catalyst, Katie has spearheaded the development of a document analysis pipeline and an unsupervised machine learning model to extract structured data from PDFs, and she built PUDL’s entity matching framework. Zach has led the design of automation tooling and cloud infrastructure that make PUDL’s processing workflows and data archiving scalable and reproducible. During this project, their time will involve the following:
- Designing, integrating, and testing new infrastructure in PUDL; coding in Python
- Writing documentation, blog posts, and newsletters to communicate the changes to external contributors and downstream users of PUDL data
- Design discussions and project management with the larger Catalyst team
Deliverables and Timeline
Month 1: Enable Versioned Archives of Raw Data
Deliverable: A versioned archive system for large raw datasets hosted in Google Cloud Storage, including a publicly accessible, requester-pays bucket and accompanying documentation for access and use.
For smaller datasets, we currently use Zenodo to host raw data, but the size of future high-priority datasets exceeds its storage limits (50 GB and 100 files per record). To handle this, we will create versioned archives of raw data in Google Cloud Storage (GCS). As a test case, we will archive a new year of SEC 10-K filings, linking each version number to a specific set of raw documents in GCS. Access will be provided via a requester-pays model, with clear documentation for users on how to retrieve the data.
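To make the requester-pays access pattern concrete, here is a minimal sketch using the google-cloud-storage Python client. The bucket name and object path are placeholders for illustration, not the final archive layout:

```python
from google.cloud import storage

# Placeholder names; the real bucket and version-prefixed layout will be
# defined during the project.
BUCKET = "pudl-raw-archive"              # hypothetical bucket name
OBJECT = "sec10k/v2025.1.0/filings.zip"  # hypothetical versioned path

# Requester-pays buckets bill the caller, so every request must name a
# billing project the caller controls.
client = storage.Client()
bucket = client.bucket(BUCKET, user_project="your-gcp-billing-project")
bucket.blob(OBJECT).download_to_filename("filings.zip")
```

The same download works from the command line with gsutil’s -u flag (e.g., gsutil -u your-gcp-billing-project cp gs://pudl-raw-archive/... .); the documentation deliverable would walk users through both routes.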
Month 2: Create System for Handing Off Modeled Datasets to Standard PUDL Processing Module
Deliverable: A documented handoff system for modeled datasets entering the PUDL pipeline, including updated data ingestion standards and modules and an upgraded entity matching workflow using Splink for large datasets.
Currently, it’s unclear which modeling and data processing steps should occur before data is ingested by PUDL and which should happen within the PUDL repository. We will define clear standards for the format and readiness of extracted data prior to ingestion, based on factors like computational intensity and update frequency, and create modules that ingest data extracted by upstream models into the PUDL pipeline. Most of these larger, complex datasets require entity matching (record linkage) to connect them with other datasets that refer to the same entities but share no join key. PUDL uses the Splink library for entity matching, so we will upgrade our implementation to better handle large datasets, either by switching to a Spark backend or by moving compute-intensive steps to cloud resources; a sketch of the workflow follows.
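As a rough illustration (not PUDL’s actual implementation), here is a minimal Splink linkage sketch under assumed inputs: two tables of company records with no shared key, illustrative column names, and toy data. The DuckDB backend shown is the kind of component that would be swapped for Splink’s Spark backend or cloud compute under this proposal:

```python
import pandas as pd
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# Toy stand-ins for, e.g., SEC 10-K subsidiaries and PUDL utility records.
# Splink expects a unique_id column in each input table.
df_sec10k = pd.DataFrame({
    "unique_id": [1, 2],
    "company_name": ["Acme Power Co", "Basin Electric"],
    "state": ["CO", "ND"],
})
df_pudl = pd.DataFrame({
    "unique_id": [1, 2],
    "company_name": ["Acme Power Company", "Basin Electric Power Coop"],
    "state": ["CO", "ND"],
})

settings = SettingsCreator(
    link_type="link_only",  # match records across the two tables
    comparisons=[
        cl.JaroWinklerAtThresholds("company_name"),
        cl.ExactMatch("state"),
    ],
    # Blocking keeps the candidate-pair count tractable on large inputs.
    blocking_rules_to_generate_predictions=[block_on("state")],
)

# Swapping DuckDBAPI for Splink's Spark backend leaves the Linker
# interface unchanged; only the execution engine moves.
linker = Linker([df_sec10k, df_pudl], settings, db_api=DuckDBAPI())
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
matches = linker.inference.predict(threshold_match_probability=0.9)
print(matches.as_pandas_dataframe())
```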
Month 3: Test the System and Conduct Outreach
Deliverable: Open access to SEC 10-K raw filings and processed outputs, updated with a new year of data, demonstrating completion of the first two deliverables.
We will test the newly built infrastructure by running a data update on the newly released year of SEC 10-K data and ensuring that it can be integrated into PUDL’s normal update and release process. We will solicit feedback on the updated documentation and raw data access method from organizations already using the SEC 10-K data, and we will write a blog post and newsletter about the systems built over the course of the project.