docs/ci: Add some links in the CI docs to how to track job flakes

and also figuring out how many boards are available for sharding management.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25806>
@@ -34,7 +34,7 @@ at the job's log for which specific tests failed).
 DUT requirements
 ----------------
 
-In addition to the general :ref:`CI-farm-expectations`, using
+In addition to the general :ref:`CI-job-user-expectations`, using
 Docker requires:
 
 * DUTs must have a stable kernel and GPU reset (if applicable).
@@ -148,10 +148,10 @@ If you're having issues with the Intel CI, your best bet is to ask about
 it on ``#dri-devel`` on OFTC and tag `Nico Cortes
 <https://gitlab.freedesktop.org/ngcortes>`__ (``ngcortes`` on IRC).
 
-.. _CI-farm-expectations:
+.. _CI-job-user-expectations:
 
-CI farm expectations
---------------------
+CI job user expectations:
+-------------------------
 
 To make sure that testing of one vendor's drivers doesn't block
 unrelated work by other vendors, we require that a given driver's test
@@ -160,11 +160,23 @@ driver had CI and failed once a week, we would be seeing someone's
 code getting blocked on a spurious failure daily, which is an
 unacceptable cost to the project.
 
+To ensure that, driver maintainers with CI enabled should watch the Flakes panel
+of the `CI flakes dashboard
+<https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1>`__,
+particularly the "Flake jobs" pane, to inspect jobs in their driver where the
+automatic retry of a failing job produced a success a second time.
+Additionally, most CI reports test-level flakes to an IRC channel, and flakes
+reported as NEW are not expected and could cause spurious failures in jobs.
+Please track the NEW reports in jobs and add them as appropriate to the
+``-flakes.txt`` file for your driver.
+
 Additionally, the test farm needs to be able to provide a short enough
-turnaround time that we can get our MRs through marge-bot without the
-pipeline backing up. As a result, we require that the test farm be
-able to handle a whole pipeline's worth of jobs in less than 15 minutes
-(to compare, the build stage is about 10 minutes).
+turnaround time that we can get our MRs through marge-bot without the pipeline
+backing up. As a result, we require that the test farm be able to handle a
+whole pipeline's worth of jobs in less than 15 minutes (to compare, the build
+stage is about 10 minutes). Given boot times and intermittent network delays,
+this generally means that the test runtime as reported by deqp-runner should be
+kept to 10 minutes.
 
 If a test farm is short the HW to provide these guarantees, consider dropping
 tests to reduce runtime. dEQP job logs print the slowest tests at the end of
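For illustration, the per-driver ``-flakes.txt`` lists referenced above are
plain-text files of test-name patterns, one per line, with ``#`` comment lines.
A minimal sketch, assuming that usual one-pattern-per-line format; the comment
and test names below are made up rather than taken from any real driver's file::

   # Example: started flaking after a kernel uprev on this board
   dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.36
   KHR-GL46.shader_image_load_store.basic-allTargets-atomicFS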
@@ -179,6 +191,14 @@ artifacts. Or, you can add the following to your job to only run some fraction
 
 to just run 1/10th of the test list.
 
+For Collabora's LAVA farm, the `device types
+<https://lava.collabora.dev/scheduler/device_types>`__ page can tell you how
+many boards of a specific tag are currently available by adding the "Idle" and
+"Busy" columns. For bare-metal, a gitlab admin can look at the `runners
+<https://gitlab.freedesktop.org/admin/runners>`__ page. A pipeline should
+probably not create more jobs for a board type than there are boards, unless you
+clearly have some short-runtime jobs.
+
 If a HW CI farm goes offline (network dies and all CI pipelines end up
 stalled) or its runners are consistently spuriously failing (disk
 full?), and the maintainer is not immediately available to fix the
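As a rough sketch of how that board count maps onto job configuration: GitLab
CI's ``parallel:`` keyword controls how many instances of a job a pipeline
creates, so capping it at the number of available boards follows the advice
above. The job name, tag, and script below are hypothetical, not taken from
Mesa's CI files::

   # Hypothetical job: the device types page reports 4 boards of this type
   # (Idle + Busy), so cap the shard count at 4.
   example-driver-deqp:
     stage: test
     tags:
       - example-board-type   # LAVA device type / bare-metal runner tag
     parallel: 4              # GitLab sets CI_NODE_INDEX and CI_NODE_TOTAL
     script:
       # The test runner can split the list across shards, e.g. deqp-runner's
       # --fraction-start $CI_NODE_INDEX --fraction $CI_NODE_TOTAL.
       - ./run-job-tests.sh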