ci/freedreno: Detect cheza HFI errors and restart the run.

These are intermittent (~1/day), seem to be around GPU faults (so hopefully will go away once we clean up piglit's fault errors), and are probably also related to our vintage firmware. Until we can get new hardware in the farm, just restart the flaked job.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8722>
2021-01-26 08:47:50 -08:00
parent 4c3ad4d065
commit ce1bb26b06
1 changed file with 13 additions and 0 deletions
--- a/.gitlab-ci/bare-metal/cros_servo_run.py
+++ b/.gitlab-ci/bare-metal/cros_servo_run.py
@@ -120,6 +120,19 @@ class CrosServoRun:
                print("Detected cheza power management bus error, restarting run...")
                return 2

+            # These HFI response errors started appearing with the introduction
+            # of piglit runs.  CosmicPenguin says:
+            #
+            # "message ID 106 isn't a thing, so likely what happened is that we
+            # got confused when parsing the HFI queue.  If it happened on only
+            # one run, then memory corruption could be a possible clue"
+            #
+            # Given that it seems to trigger randomly near a GPU fault and then
+            # break many tests after that, just restart the whole run.
+            if re.search("a6xx_hfi_send_msg.*Unexpected message id .* on the response queue", line):
+                print("Detected cheza HFI response queue error, restarting run...")
+                return 2
+
            result = re.search("bare-metal result: (\S*)", line)
            if result:
                if result.group(1) == "pass":
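
The check added in this diff can be tried outside the servo loop. The following is a minimal sketch, not the script's actual structure: `classify_line`, `RESTART`, and both sample log lines are hypothetical names and made-up data for illustration. Only the regular expression and the return value of 2 are taken from the diff.

```python
import re

# Same pattern the diff adds to cros_servo_run.py.
HFI_ERROR = re.compile(
    "a6xx_hfi_send_msg.*Unexpected message id .* on the response queue")

# The diff returns 2 when it wants the run restarted.
RESTART = 2


def classify_line(line):
    """Return RESTART if the line shows the HFI response error, else None.

    Hypothetical helper: the real script does this check inline while
    reading serial output.
    """
    if HFI_ERROR.search(line):
        return RESTART
    return None


# Made-up log lines for illustration only, not captured cheza output.
sample_bad = "a6xx_hfi_send_msg: Unexpected message id 106 on the response queue"
sample_ok = "bare-metal result: pass"

if __name__ == "__main__":
    for sample in (sample_bad, sample_ok):
        print(repr(sample), "->", classify_line(sample))
```

Keeping the match in a small function like this makes it easy to check a new pattern against a saved serial log before a flaked job has to hit it in CI.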