Project
Taskflow: A General-purpose Task-parallel Programming System
Summary
This project aims to enhance Taskflow's scheduling performance by integrating a Reinforcement Learning (RL)-based optimization framework to automatically tune and improve task graph execution. By learning optimal task scheduling and transformations, this approach will enable Taskflow users to achieve significant and adaptive performance gains for their complex parallel applications.
Submitter
Tsung-Wei Huang ([email protected])
Project lead
@tsung-wei-huang
Community benefit
With Taskflow currently reaching 4-7K unique weekly clones, the integration of an automatic task graph optimization module will significantly benefit its large and growing user community. This enhancement will empower users to achieve substantial performance improvements in their Taskflow applications without requiring manual, domain-specific tuning.
Amount requested
$10,000
Execution plan
Rationale
To fully harness the potential of Taskflow, optimizing task-parallel programs is crucial for efficient execution: the task graphs that applications construct typically have no knowledge of the underlying hardware availability or scheduling constraints, which often leads to suboptimal performance. For instance, our research showed that by optimizing the structure of a task-parallel timing analysis workload, a compiler can produce a graph that runs 43% faster than the original. Traditional optimization methods often rely on hand-tuned heuristics or require extensive manual effort to adapt to diverse hardware and dynamic workloads. Reinforcement Learning (RL) presents a compelling alternative: it lets the system learn task graph transformations and scheduling decisions in a black-box fashion by directly interacting with the execution environment. This adaptive learning capability allows an RL-based optimizer to discover non-intuitive optimization strategies that can significantly enhance Taskflow application performance across varied computing environments.
Technical Tasks (6 Months)
We expect to complete the project in 6 months. All results and code will be directly integrated into the Taskflow repository.
Months 1-2: Designing RL Framework, GNN, Reward, and Feature Collection
Objective: To establish the foundational components of the RL-based optimization framework.
Literature Review & Design Specification: Conduct a focused review of state-of-the-art RL approaches for graph optimization, specifically exploring Graph Neural Networks (GNNs) for task graph representation. Design the RL agent's architecture, action space (e.g., task fusion, reordering, parallelization decisions), and initial reward functions (e.g., based on estimated execution time, resource utilization).
Feature Representation Design: Develop a robust strategy for extracting features from Taskflow programs. This will involve analyzing both the task graph structure and the underlying task code at the LLVM IR level to create comprehensive feature representations for the GNN.
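The sketch below illustrates the kind of feature vector we have in mind, combining simple structural features (in/out degree) with a coarse opcode histogram computed from a task's textual LLVM IR. The bucket definitions, the adjacency-list graph encoding, and the assumption that each task body is available as textual IR are placeholders for illustration:

```python
# Sketch of per-task feature extraction for the GNN. The feature set and the
# assumption that each task's body is available as textual LLVM IR are
# illustrative only.
from collections import Counter

# Coarse buckets of LLVM IR opcodes used as code features.
OPCODE_BUCKETS = {
    "memory": {"load", "store", "alloca", "getelementptr"},
    "arith": {"add", "sub", "mul", "fadd", "fsub", "fmul", "fdiv", "sdiv", "udiv"},
    "control": {"br", "switch", "call", "ret", "phi"},
}


def ir_features(llvm_ir: str) -> list[float]:
    """Normalized histogram of coarse opcode categories in a task's IR."""
    counts = Counter()
    for line in llvm_ir.splitlines():
        tokens = line.strip().split()
        if not tokens:
            continue
        # '%x = add i32 ...' -> opcode after '=';  'br label %bb' -> first token
        opcode = tokens[2] if len(tokens) > 2 and tokens[1] == "=" else tokens[0]
        for bucket, ops in OPCODE_BUCKETS.items():
            if opcode in ops:
                counts[bucket] += 1
    total = max(sum(counts.values()), 1)
    return [counts[bucket] / total for bucket in OPCODE_BUCKETS]


def graph_features(preds: dict[str, list[str]],
                   succs: dict[str, list[str]], task: str) -> list[float]:
    """Simple structural features: in-degree and out-degree of a task."""
    return [float(len(preds.get(task, []))), float(len(succs.get(task, [])))]


def task_feature_vector(task, preds, succs, llvm_ir):
    # One node feature vector per task, fed to the GNN.
    return graph_features(preds, succs, task) + ir_features(llvm_ir)
```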
Initial Prototype & Simulator: Develop a basic simulator for task graph execution to allow for initial training and testing of the RL agent without direct integration into Taskflow's core. Implement a prototype GNN model for graph representation and initial RL agent training.
Months 3-4: Task Graph Benchmarking and Training Method Development
Objective: To gather real-world Taskflow benchmarks and develop effective training methodologies for the RL framework.
Benchmark Collection: Identify and collect representative task graph benchmarks from existing Taskflow use cases. The focus will be on applications in EDA (Electronic Design Automation), Quantum Simulation, and Computer Graphics, the domains where Taskflow has seen mainstream adoption.
Data Preprocessing & Augmentation: Prepare the collected benchmarks for RL training, including data normalization, graph serialization, and potentially data augmentation techniques to diversify the training set.
Training Methodology Development: Design and implement training methods for the RL agent. This includes defining training algorithms (e.g., Proximal Policy Optimization (PPO), Deep Q-Networks (DQN)), hyperparameter tuning strategies, and metrics for evaluating training progress.
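For illustration, the sketch below shows a simplified policy-gradient (REINFORCE-style) training loop standing in for the PPO/DQN algorithms we plan to evaluate. The environment interface (reset/step over graph transformations) and the flat state encoding are placeholders, since the real agent would consume GNN embeddings of the task graph:

```python
# Simplified policy-gradient (REINFORCE-style) training loop, shown as a
# stand-in for PPO/DQN. The `env` interface and flat state encoding are
# placeholders.
import torch
import torch.nn as nn
from torch.distributions import Categorical


class PolicyNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))

    def forward(self, state):
        return self.net(state)  # action logits


def train(env, state_dim, num_actions, episodes=500, lr=1e-3, gamma=0.99):
    policy = PolicyNet(state_dim, num_actions)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

    for _ in range(episodes):
        state, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:
            logits = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = Categorical(logits=logits)
            action = dist.sample()
            state, reward, done = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)

        # Discounted returns, then the standard REINFORCE loss.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```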
Adaptation Scripts: Develop preliminary scripts that will allow users to adapt the RL framework to their specific computing environments, considering hardware specifics and workload characteristics.
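A minimal sketch of such an entry point is shown below; all flag names are placeholders rather than a committed command-line interface:

```python
# Sketch of an environment-adaptation entry point: users describe their
# hardware and workload so training or fine-tuning can be specialized.
# All flag names are placeholders.
import argparse
import os


def parse_env_config():
    parser = argparse.ArgumentParser(
        description="Adapt the RL optimizer to a target environment")
    parser.add_argument("--num-threads", type=int, default=os.cpu_count(),
                        help="worker threads the Taskflow executor will use")
    parser.add_argument("--benchmark-dir", type=str, required=True,
                        help="directory of task graph benchmarks for fine-tuning")
    parser.add_argument("--time-budget", type=float, default=3600.0,
                        help="seconds allowed for environment-specific fine-tuning")
    return parser.parse_args()
```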
Months 5-6: Integration of RL Task Graph Optimization into Taskflow
Objective: To integrate the developed RL-based optimization framework into Taskflow's programming environment, focusing on static graph parallelism.
Integration with Taskflow API: Implement an interface within Taskflow that allows the RL agent to receive task graph information and apply optimized transformations. This will initially focus on static task graph structures at compile-time or graph construction time.
Static Graph Optimization Module: Develop a module within Taskflow that leverages the trained RL agent to perform static task graph optimizations. This involves passing the constructed task graph to the RL framework, obtaining optimization decisions, and applying these transformations back to the Taskflow graph.
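The sketch below outlines the envisioned round trip: the C++ side exports the constructed graph (Taskflow can already dump a graph in GraphViz DOT format), the RL framework scores candidate transformations, and the resulting decisions are handed back to be applied to the Taskflow graph. The DOT parsing is deliberately simplistic, and the `best_action` policy query is a placeholder for the trained agent:

```python
# Sketch of the envisioned round trip for the static optimization module.
# The DOT parsing is simplistic and `policy.best_action` is a placeholder
# for the trained agent, not an existing API.
import re


def parse_dot_edges(dot_text: str) -> list[tuple[str, str]]:
    """Extract 'u -> v' dependency edges from a Taskflow DOT dump."""
    return re.findall(r'"?(\w+)"?\s*->\s*"?(\w+)"?', dot_text)


def optimize_static_graph(dot_text: str, policy) -> list[dict]:
    edges = parse_dot_edges(dot_text)
    decisions = []
    for u, v in edges:
        action = policy.best_action(u, v, edges)   # placeholder policy query
        if action != "NO_OP":
            decisions.append({"action": action, "tasks": [u, v]})
    # The decision list is returned to the C++ side, which applies the
    # corresponding transformations to the Taskflow graph.
    return decisions
```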
Performance Evaluation & Refinement: Conduct comprehensive performance evaluations using the collected benchmarks. Compare the performance of RL-optimized task graphs against unoptimized graphs and potentially other baseline optimization strategies. Iterate on the RL model and integration based on performance feedback.
Throughout the project, we will create clear and concise user guides demonstrating how to integrate and use the RL-based optimization framework within Taskflow applications. We will provide illustrative code examples and common use cases (e.g., EDA, Quantum Simulation, and Computer Graphics workloads). The documentation will be available in the Taskflow Handbook.
Expected Impacts
The successful completion of this project will have a positive impact on the Taskflow community and the broader field of high-performance computing. With Taskflow currently reaching 4-7K unique weekly clones, the integration of an RL-based automatic task graph optimization module will significantly benefit its large and growing user community by providing substantial performance improvements for complex parallel applications. This will empower users to achieve highly optimized task scheduling without manual, domain-specific tuning, thereby accelerating scientific discovery, engineering simulations, and complex problem-solving across various fields.