Bug 1990742 - Ingest perfherder_data from JSON artifacts instead of parsing logs #8997
base: master
Conversation
treeherder/log_parser/tasks.py
Outdated
return artifact_list

def post_perfherder_artifacts(job_log):
@junngo I think it would be better for us to put this into a separate area. This folder seems to be specifically for parsing logs, but we're parsing JSONs instead. What do you think about having this task defined here in the perf directory? https://github.com/mozilla/treeherder/blob/505ad6b4047f77fc3ecdea63e57881116340d0fb/treeherder/perf/tasks.py
@gmierz Splitting the code is a great idea. Creating a separate file under the code directory [0] looks good to me. It feels more cohesive to put it there, since the log parsing [1] also lives in that folder.
Please share your thoughts on the directory location.
with make_request(self.url, stream=True) as response:
I added the new file based on your feedback. It seems more suitable since the JSON artifact isn’t part of the log parsing process :)
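As a rough illustration of the streaming fetch in the hunk above (`make_request(self.url, stream=True)`): the artifact body is streamed and decoded as JSON. `make_request` is Treeherder code; here `open_stream` and `fake_opener` are hypothetical stand-ins so the sketch stays self-contained.

```python
import io
import json
from contextlib import contextmanager

def fetch_perfherder_artifact(open_stream, url):
    """Stream the artifact body and decode it as JSON.

    `open_stream` stands in for an HTTP client context manager
    (e.g. a wrapper around a streaming GET); this is an assumption,
    not Treeherder's actual make_request signature.
    """
    with open_stream(url) as response:
        return json.load(response)

# A fake opener standing in for an HTTP client, for illustration only.
@contextmanager
def fake_opener(url):
    yield io.StringIO('{"framework": {"name": "awsy"}, "suites": []}')
```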
treeherder/etl/perf.py
Outdated
existing_replicates = set(
    PerformanceDatumReplicate.objects.filter(
        performance_datum=subtest_datum
    ).values_list("value", flat=True)
I'm guessing this is happening because of duplicate ingestion tasks (log and JSON). I think we should find a way to default to using the JSON artifacts if they exist, and ignore the data we find in the logs. Maybe we could start with a list of tests for trying this out? I'm thinking of these tasks, since the data they produce is not useful, so any failures won't be problematic: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=regress&revision=6bd2ea6b9711dc7739d8ee7754b9330b11d0719d&selectedTaskRun=K87CGE6IT1GHl6wD4Skbyw.0
Exactly: log parsing and the JSON artifact feature are both active right now, so I handled the duplication. I'll revert that, add an allowlist, and call _load_perf_datum only for allowlisted tests.
Force-pushed from 34855c7 to 26bc32d (Compare)
    "awsy": ["ALL"],
    "build_metrics": ["decision", "compiler warnings"],
    "browsertime": ["constant-regression"],
}
The job is processed if at least one suite name matches the allowlist (e.g. compiler warnings).
This list is just a sample. We’ll gradually update it to expand JSON artifact usage.
[0]
https://firefoxci.taskcluster-artifacts.net/KZ6krBACTcyC1_q_tUejTA/0/public/build/perfherder-data-building.json
I have a list of frameworks generated locally by Django code. [0]
Note:
# treeherder/etl/jobs.py
parse_logs.apply_async(queue=queue, args=[job.id, [job_log.id], priority])
I considered splitting the queues, but decided to keep using the existing ones to avoid code duplication and increased complexity.
Force-pushed from 26bc32d to 7ec7ee8 (Compare)
Hi there :) I updated the code.
@retryable_task(name="ingest-perfherder-data", max_retries=10)
def ingest_perfherder_data(job_id, job_log_ids):
I kept the overall flow consistent with the existing log parsing task code.
treeherder/treeherder/log_parser/tasks.py
Line 22 in b626c64
def parse_logs(job_id, job_log_ids, priority):
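A sketch of the task flow kept "consistent with the existing log parsing task": process each JSON-artifact record, remember the first failure, and re-raise it at the end so a retryable task can retry while the remaining artifacts still get processed. The `@retryable_task` decorator and ORM lookups are omitted; `process()` is a hypothetical stand-in for fetching and storing one artifact.

```python
def ingest_perfherder_data(job_logs):
    """Process every artifact record, deferring the first exception."""
    first_exception = None
    for job_log in job_logs:
        try:
            job_log.process()  # stand-in: fetch + store one JSON artifact
        except Exception as exc:
            # Keep going so one bad artifact doesn't block the others.
            if first_exception is None:
                first_exception = exc
    if first_exception is not None:
        raise first_exception

# Minimal fake for illustration; not Treeherder's JobLog model.
class FakeJobLog:
    def __init__(self, fail=False):
        self.fail = fail
        self.processed = False

    def process(self):
        if self.fail:
            raise ValueError("bad artifact")
        self.processed = True
```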
Force-pushed from 7ec7ee8 to b29d246 (Compare)
Great start @junngo! It looks like we're getting close :)
job_log_name = job_log.name.replace("-", "_")
if job_log_name.startswith("perfherder_data"):
    _schedule_perfherder_ingest(job, job_log, result, repository)
Instead of calling the schedule function here, we should call it in the _load_job method, similar to where we call the _schedule_log_parsing function.
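An illustrative sketch of that suggestion: decide in one place (as _load_job does when it calls _schedule_log_parsing) whether an incoming log record is a perfherder-data JSON artifact or a regular log. Only the name normalization and prefix check come from the diff; the routing labels are hypothetical.

```python
def route_job_log(job_log_name):
    """Return which scheduler a log record should go to.

    Mirrors the diff's check: dashes are normalized to underscores,
    then names beginning with "perfherder_data" take the JSON path.
    """
    normalized = job_log_name.replace("-", "_")
    if normalized.startswith("perfherder_data"):
        return "schedule_perfherder_ingest"
    return "schedule_log_parsing"
```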
return any(suite["name"] in allowed for suite in suites)

def _should_ingest(framework_name: str, suites: list, is_perfherder_data_json: bool) -> bool:
I'm not seeing this being used for determining which JSON artifacts should be ingested, I might have missed it though.
)

first_exception = None
for job_log in job_logs:
It looks like this is parsing the logs, but this new task should only be responsible for handling the JSON artifacts.
for perfdatum in performance_data:
    framework_name = perfdatum["framework"]["name"]
    suites = perfdatum.get("suites", [])
    if not _should_ingest(framework_name, suites, is_perfherder_data_json):
The _should_ingest function is used here; its return value determines whether the data will be stored or not.
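A tiny usage sketch of that gating loop: each perfdatum is checked against the allowlist predicate before being stored. The predicate is injected so the sketch stays self-contained, and collecting into a list stands in for the real _load_perf_datum call.

```python
def filter_perf_data(performance_data, should_ingest):
    """Keep only the perfdata entries the predicate accepts."""
    stored = []
    for perfdatum in performance_data:
        framework_name = perfdatum["framework"]["name"]
        suites = perfdatum.get("suites", [])
        if not should_ingest(framework_name, suites):
            continue  # skipped: not allowlisted for JSON ingestion
        stored.append(perfdatum)  # stand-in for _load_perf_datum(...)
    return stored
```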
Currently, Treeherder ingests performance data (PERFHERDER_DATA:) by parsing raw logs. This patch supports reading data from the perfherder-data.json artifact instead. For now, both the existing log parsing and the new JSON ingestion run in parallel to maintain compatibility.
Bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=1990742
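A hedged sketch contrasting the two ingestion paths the summary describes: extracting PERFHERDER_DATA: payloads from raw log lines versus reading the same payload straight from the JSON artifact. Function names are illustrative, not Treeherder's; only the PERFHERDER_DATA: marker comes from the PR.

```python
import json

MARKER = "PERFHERDER_DATA: "

def perf_data_from_log(log_text):
    """Current path: parse PERFHERDER_DATA: lines out of a raw log."""
    data = []
    for line in log_text.splitlines():
        idx = line.find(MARKER)
        if idx != -1:
            data.append(json.loads(line[idx + len(MARKER):]))
    return data

def perf_data_from_artifact(artifact_text):
    """New path: read the payload directly from perfherder-data.json."""
    return [json.loads(artifact_text)]
```

Both paths should yield the same payload for the same job, which is what lets the PR run them in parallel during the transition.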