Bug 1990742 - Ingest perfherder_data from JSON artifacts instead of parsing logs #8997
base: master
Conversation
treeherder/log_parser/tasks.py
Outdated
return artifact_list

def post_perfherder_artifacts(job_log):
@junngo I think it would be better for us to put this into a separate area. This folder seems to be specifically for parsing logs, but we're parsing JSONs instead. What do you think about having this task defined here in the perf directory? https://github.com/mozilla/treeherder/blob/505ad6b4047f77fc3ecdea63e57881116340d0fb/treeherder/perf/tasks.py
@gmierz Splitting the code is a great idea. Creating a separate file under the code directory [0] looks good to me. It feels more cohesive to put it there, since the log parsing [1] also lives in that folder.
Please share your thoughts on the directory location.
with make_request(self.url, stream=True) as response:
I added the new file based on your feedback. It seems more suitable since the JSON artifact isn’t part of the log parsing process :)
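As a rough illustration of the streaming fetch in the hunk above (`make_request(self.url, stream=True)`): the artifact body is streamed and decoded as JSON. `make_request` is Treeherder code; here `open_stream` and `fake_opener` are hypothetical stand-ins so the sketch stays self-contained.

```python
import io
import json
from contextlib import contextmanager

def fetch_perfherder_artifact(open_stream, url):
    """Stream the artifact body and decode it as JSON.

    `open_stream` stands in for an HTTP client context manager
    (e.g. a wrapper around a streaming GET); this is an assumption,
    not Treeherder's actual make_request signature.
    """
    with open_stream(url) as response:
        return json.load(response)

# A fake opener standing in for an HTTP client, for illustration only.
@contextmanager
def fake_opener(url):
    yield io.StringIO('{"framework": {"name": "awsy"}, "suites": []}')
```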
treeherder/etl/perf.py
Outdated
existing_replicates = set(
    PerformanceDatumReplicate.objects.filter(
        performance_datum=subtest_datum
    ).values_list("value", flat=True)
I'm guessing this is happening because of duplicate ingestion tasks (log and JSON). I think we should find a way to default to using the JSON artifacts if they exist, and ignore the data we find in the logs. Maybe we could start with a list of tests for trying this out? I'm thinking of these tasks, since the data they produce is not useful, so any failures won't be problematic: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=regress&revision=6bd2ea6b9711dc7739d8ee7754b9330b11d0719d&selectedTaskRun=K87CGE6IT1GHl6wD4Skbyw.0
Exactly: log parsing and the JSON artifact feature are both active right now, so I handled the duplication. I'll revert that, add an allowlist, and call _load_perf_datum only for allowlisted tests.
Force-pushed from 34855c7 to 26bc32d (Compare)
    "awsy": ["ALL"],
    "build_metrics": ["decision", "compiler warnings"],
    "browsertime": ["constant-regression"],
}
The job is processed if at least one suite name matches the allowlist (e.g. compiler warnings).
This list is just a sample. We’ll gradually update it to expand JSON artifact usage.
[0]
https://firefoxci.taskcluster-artifacts.net/KZ6krBACTcyC1_q_tUejTA/0/public/build/perfherder-data-building.json
I have a list of frameworks generated locally by Django code. [0]
Note:
# treeherder/etl/jobs.py
parse_logs.apply_async(queue=queue, args=[job.id, [job_log.id], priority])
I considered splitting the queues, but decided to keep using the existing ones to avoid code duplication and increased complexity.
Force-pushed from 26bc32d to 7ec7ee8 (Compare)
Hi there :) I updated the code.
@retryable_task(name="ingest-perfherder-data", max_retries=10)
def ingest_perfherder_data(job_id, job_log_ids):
I kept the overall flow consistent with the existing log parsing task code.
treeherder/treeherder/log_parser/tasks.py
Line 22 in b626c64
def parse_logs(job_id, job_log_ids, priority):
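A sketch of the task flow kept "consistent with the existing log parsing task": process each JSON-artifact record, remember the first failure, and re-raise it at the end so a retryable task can retry while the remaining artifacts still get processed. The `@retryable_task` decorator and ORM lookups are omitted; `process()` is a hypothetical stand-in for fetching and storing one artifact.

```python
def ingest_perfherder_data(job_logs):
    """Process every artifact record, deferring the first exception."""
    first_exception = None
    for job_log in job_logs:
        try:
            job_log.process()  # stand-in: fetch + store one JSON artifact
        except Exception as exc:
            # Keep going so one bad artifact doesn't block the others.
            if first_exception is None:
                first_exception = exc
    if first_exception is not None:
        raise first_exception

# Minimal fake for illustration; not Treeherder's JobLog model.
class FakeJobLog:
    def __init__(self, fail=False):
        self.fail = fail
        self.processed = False

    def process(self):
        if self.fail:
            raise ValueError("bad artifact")
        self.processed = True
```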
Force-pushed from 7ec7ee8 to b29d246 (Compare)
Great start @junngo! It looks like we're getting close :)
job_log_name = job_log.name.replace("-", "_")
if job_log_name.startswith("perfherder_data"):
    _schedule_perfherder_ingest(job, job_log, result, repository)
Instead of calling the schedule function here, we should call it in the _load_job method, similar to where we call the _schedule_log_parsing function.
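An illustrative sketch of that suggestion: decide in one place (as _load_job does when it calls _schedule_log_parsing) whether an incoming log record is a perfherder-data JSON artifact or a regular log. Only the name normalization and prefix check come from the diff; the routing labels are hypothetical.

```python
def route_job_log(job_log_name):
    """Return which scheduler a log record should go to.

    Mirrors the diff's check: dashes are normalized to underscores,
    then names beginning with "perfherder_data" take the JSON path.
    """
    normalized = job_log_name.replace("-", "_")
    if normalized.startswith("perfherder_data"):
        return "schedule_perfherder_ingest"
    return "schedule_log_parsing"
```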
return any(suite["name"] in allowed for suite in suites)

def _should_ingest(framework_name: str, suites: list, is_perfherder_data_json: bool) -> bool:
I'm not seeing this being used for determining which JSON artifacts should be ingested, I might have missed it though.
)

first_exception = None
for job_log in job_logs:
It looks like this is parsing the logs, but this new task should only be responsible for handling the JSON artifacts.
for perfdatum in performance_data:
    framework_name = perfdatum["framework"]["name"]
    suites = perfdatum.get("suites", [])
    if not _should_ingest(framework_name, suites, is_perfherder_data_json):
The _should_ingest function is used here; its return value determines whether the data will be stored or not.
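A tiny usage sketch of that gating loop: each perfdatum is checked against the allowlist predicate before being stored. The predicate is injected so the sketch stays self-contained, and collecting into a list stands in for the real _load_perf_datum call.

```python
def filter_perf_data(performance_data, should_ingest):
    """Keep only the perfdata entries the predicate accepts."""
    stored = []
    for perfdatum in performance_data:
        framework_name = perfdatum["framework"]["name"]
        suites = perfdatum.get("suites", [])
        if not should_ingest(framework_name, suites):
            continue  # skipped: not allowlisted for JSON ingestion
        stored.append(perfdatum)  # stand-in for _load_perf_datum(...)
    return stored
```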
Currently, Treeherder ingests performance data (PERFHERDER_DATA:) by parsing raw logs. This patch supports reading data from the perfherder-data.json artifact instead. For now, both the existing log parsing and the new JSON ingestion run in parallel to maintain compatibility.
Bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=1990742
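A hedged sketch contrasting the two ingestion paths the summary describes: extracting PERFHERDER_DATA: payloads from raw log lines versus reading the same payload straight from the JSON artifact. Function names are illustrative, not Treeherder's; only the PERFHERDER_DATA: marker comes from the PR.

```python
import json

MARKER = "PERFHERDER_DATA: "

def perf_data_from_log(log_text):
    """Current path: parse PERFHERDER_DATA: lines out of a raw log."""
    data = []
    for line in log_text.splitlines():
        idx = line.find(MARKER)
        if idx != -1:
            data.append(json.loads(line[idx + len(MARKER):]))
    return data

def perf_data_from_artifact(artifact_text):
    """New path: read the payload directly from perfherder-data.json."""
    return [json.loads(artifact_text)]
```

Both paths should yield the same payload for the same job, which is what lets the PR run them in parallel during the transition.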