docs/ci: Add some links in the CI docs to how to track job flakes

and also to how to figure out how many boards are available for sharding
management.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25806>
Author: Eric Anholt
Date: 2023-10-19 10:21:04 +02:00
Committed by: Marge Bot
Parent: 553070f993
Commit: 7a3fb60ac8

2 changed files with 28 additions and 8 deletions


@@ -34,7 +34,7 @@ at the job's log for which specific tests failed).
DUT requirements
----------------

-In addition to the general :ref:`CI-farm-expectations`, using
+In addition to the general :ref:`CI-job-user-expectations`, using
Docker requires:
* DUTs must have a stable kernel and GPU reset (if applicable).


@@ -148,10 +148,10 @@ If you're having issues with the Intel CI, your best bet is to ask about
it on ``#dri-devel`` on OFTC and tag `Nico Cortes
<https://gitlab.freedesktop.org/ngcortes>`__ (``ngcortes`` on IRC).

-.. _CI-farm-expectations:
+.. _CI-job-user-expectations:

-CI farm expectations
---------------------
+CI job user expectations:
+-------------------------

To make sure that testing of one vendor's drivers doesn't block
unrelated work by other vendors, we require that a given driver's test
@@ -160,11 +160,23 @@ driver had CI and failed once a week, we would be seeing someone's
code getting blocked on a spurious failure daily, which is an
unacceptable cost to the project.

+To ensure that, driver maintainers with CI enabled should watch the Flakes panel
+of the `CI flakes dashboard
+<https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1>`__,
+particularly the "Flake jobs" pane, to inspect jobs in their driver where the
+automatic retry of a failing job produced a success a second time.
+Additionally, most CI reports test-level flakes to an IRC channel, and flakes
+reported as NEW are not expected and could cause spurious failures in jobs.
+Please track the NEW reports in jobs and add them as appropriate to the
+``-flakes.txt`` file for your driver.
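
As an illustration of the ``-flakes.txt`` workflow described in the added text
above (this example is not part of the commit, and it assumes deqp-runner's
usual flake-list format of one test-name pattern per line with ``#`` comment
lines), a new entry might look like::

   # 2023-10: intermittent GPU hang, hypothetical example entry
   dEQP-GLES31.functional.ssbo.layout.random.all_per_block_buffers.*

With such an entry in place, a failure of a matching test is reported as a
flake rather than failing the job, which is deqp-runner's usual handling of its
flake list.
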
Additionally, the test farm needs to be able to provide a short enough
-turnaround time that we can get our MRs through marge-bot without the
-pipeline backing up. As a result, we require that the test farm be
-able to handle a whole pipeline's worth of jobs in less than 15 minutes
-(to compare, the build stage is about 10 minutes).
+turnaround time that we can get our MRs through marge-bot without the pipeline
+backing up. As a result, we require that the test farm be able to handle a
+whole pipeline's worth of jobs in less than 15 minutes (to compare, the build
+stage is about 10 minutes). Given boot times and intermittent network delays,
+this generally means that the test runtime as reported by deqp-runner should be
+kept to 10 minutes.
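
To make the arithmetic behind that guideline concrete (the overhead figures
here are assumptions, not taken from the diff): with roughly 15 minutes of
wall-clock budget per job and a few minutes spent on boot, artifact download,
and result upload, about 10 minutes are left for deqp-runner itself, so a test
list that takes, say, 60 minutes on a single board needs on the order of six
shards to fit.
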
If a test farm is short the HW to provide these guarantees, consider dropping
tests to reduce runtime. dEQP job logs print the slowest tests at the end of
@@ -179,6 +191,14 @@ artifacts. Or, you can add the following to your job to only run some fraction
to just run 1/10th of the test list.
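
The exact snippet that "add the following to your job" refers to falls outside
this hunk's context lines, so it is not reproduced here. As a sketch only,
assuming the driver's test jobs read a fraction variable (the ``DEQP_FRACTION``
name below is an assumption, not something confirmed by this diff), the
job-level GitLab CI YAML would look roughly like::

   variables:
     DEQP_FRACTION: 10  # assumed variable name; run only 1/10th of the test list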

+For Collabora's LAVA farm, the `device types
+<https://lava.collabora.dev/scheduler/device_types>`__ page can tell you how
+many boards of a specific tag are currently available by adding the "Idle" and
+"Busy" columns. For bare-metal, a gitlab admin can look at the `runners
+<https://gitlab.freedesktop.org/admin/runners>`__ page. A pipeline should
+probably not create more jobs for a board type than there are boards, unless you
+clearly have some short-runtime jobs.
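
As a worked example of that last guideline (the numbers are invented): if the
device-type page lists 4 boards Idle and 2 Busy, there are 6 boards in total,
so up to 6 concurrent jobs of that type can run in a single pass, while
scheduling 12 jobs would force a second pass on every board and roughly double
the wall-clock time unless the individual jobs are short.
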
If a HW CI farm goes offline (network dies and all CI pipelines end up
stalled) or its runners are consistently spuriously failing (disk
full?), and the maintainer is not immediately available to fix the