docs/ci: Add some links in the CI docs to how to track job flakes

and also to how to figure out how many boards are available for sharding
management.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25806>
Author: Eric Anholt
Date: 2023-10-19 10:21:04 +02:00
Committed by: Marge Bot
Parent: 553070f993
Commit: 7a3fb60ac8

2 changed files with 28 additions and 8 deletions


@@ -34,7 +34,7 @@ at the job's log for which specific tests failed).
DUT requirements
----------------

-In addition to the general :ref:`CI-farm-expectations`, using
+In addition to the general :ref:`CI-job-user-expectations`, using
Docker requires:
* DUTs must have a stable kernel and GPU reset (if applicable).


@@ -148,10 +148,10 @@ If you're having issues with the Intel CI, your best bet is to ask about
it on ``#dri-devel`` on OFTC and tag `Nico Cortes
<https://gitlab.freedesktop.org/ngcortes>`__ (``ngcortes`` on IRC).

-.. _CI-farm-expectations:
+.. _CI-job-user-expectations:

-CI farm expectations
---------------------
+CI job user expectations:
+-------------------------

To make sure that testing of one vendor's drivers doesn't block
unrelated work by other vendors, we require that a given driver's test
@@ -160,11 +160,23 @@ driver had CI and failed once a week, we would be seeing someone's
code getting blocked on a spurious failure daily, which is an
unacceptable cost to the project.

+To ensure that, driver maintainers with CI enabled should watch the Flakes panel
+of the `CI flakes dashboard
+<https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1>`__,
+particularly the "Flake jobs" pane, to inspect jobs in their driver where the
+automatic retry of a failing job produced a success a second time.
+Additionally, most CI reports test-level flakes to an IRC channel, and flakes
+reported as NEW are not expected and could cause spurious failures in jobs.
+Please track the NEW reports in jobs and add them as appropriate to the
+``-flakes.txt`` file for your driver.
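
As an illustration of the ``-flakes.txt`` workflow described in the added text
above (this example is not part of the commit, and it assumes deqp-runner's
usual flake-list format of one test-name pattern per line with ``#`` comment
lines), a new entry might look like::

   # 2023-10: intermittent GPU hang, hypothetical example entry
   dEQP-GLES31.functional.ssbo.layout.random.all_per_block_buffers.*

With such an entry in place, a failure of a matching test is reported as a
flake rather than failing the job, which is deqp-runner's usual handling of its
flake list.
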
Additionally, the test farm needs to be able to provide a short enough
-turnaround time that we can get our MRs through marge-bot without the
-pipeline backing up. As a result, we require that the test farm be
-able to handle a whole pipeline's worth of jobs in less than 15 minutes
-(to compare, the build stage is about 10 minutes).
+turnaround time that we can get our MRs through marge-bot without the pipeline
+backing up. As a result, we require that the test farm be able to handle a
+whole pipeline's worth of jobs in less than 15 minutes (to compare, the build
+stage is about 10 minutes). Given boot times and intermittent network delays,
+this generally means that the test runtime as reported by deqp-runner should be
+kept to 10 minutes.
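
To make the arithmetic behind that guideline concrete (the overhead figures
here are assumptions, not taken from the diff): with roughly 15 minutes of
wall-clock budget per job and a few minutes spent on boot, artifact download,
and result upload, about 10 minutes are left for deqp-runner itself, so a test
list that takes, say, 60 minutes on a single board needs on the order of six
shards to fit.
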
If a test farm is short the HW to provide these guarantees, consider dropping
tests to reduce runtime. dEQP job logs print the slowest tests at the end of
@@ -179,6 +191,14 @@ artifacts. Or, you can add the following to your job to only run some fraction
to just run 1/10th of the test list.
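
The exact snippet that "add the following to your job" refers to falls outside
this hunk's context lines, so it is not reproduced here. As a sketch only,
assuming the driver's test jobs read a fraction variable (the ``DEQP_FRACTION``
name below is an assumption, not something confirmed by this diff), the
job-level GitLab CI YAML would look roughly like::

   variables:
     DEQP_FRACTION: 10  # assumed variable name; run only 1/10th of the test list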

+For Collabora's LAVA farm, the `device types
+<https://lava.collabora.dev/scheduler/device_types>`__ page can tell you how
+many boards of a specific tag are currently available by adding the "Idle" and
+"Busy" columns. For bare-metal, a gitlab admin can look at the `runners
+<https://gitlab.freedesktop.org/admin/runners>`__ page. A pipeline should
+probably not create more jobs for a board type than there are boards, unless you
+clearly have some short-runtime jobs.
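
As a worked example of that last guideline (the numbers are invented): if the
device-type page lists 4 boards Idle and 2 Busy, there are 6 boards in total,
so up to 6 concurrent jobs of that type can run in a single pass, while
scheduling 12 jobs would force a second pass on every board and roughly double
the wall-clock time unless the individual jobs are short.
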
If a HW CI farm goes offline (network dies and all CI pipelines end up
stalled) or its runners are consistently spuriously failing (disk
full?), and the maintainer is not immediately available to fix the