From f8b230b0a298deeb368b50276f434d0c6551e425 Mon Sep 17 00:00:00 2001
From: Nick Fagerlund <nick.fagerlund@gmail.com>
Date: Tue, 7 Nov 2023 18:48:53 -0800
Subject: [PATCH] MAYBE fix flaky
 `TestPolicySetsCreate/with_vcs_policy_updated`
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This flake has been extremely rowdy lately. The flake signature is:

- PoliciesPath is zeroed out
- After logging that failure, it segfaults on the next line (nil pointer dereference)

It turns out PoliciesPath is kind of a red herring — the backend erases that
value if the VCSRepo is removed, as documented in the API reference. The real
action is in VCSRepo, which comes back as a nil pointer after that update. And,
if you modify the test to do an additional Read on the policy set to compare, it
agrees with what the Update returned — the repo got nilled out for some reason.

I went looking in the backend itself, and it turns out that OAuthClients have
some async cleanup behavior on deletion. And the prior subtest that sets up the
policy set used by the flaky test cleans up its OAuthClient when it's done,
leaving the next subtest to create a new one. My working theory is that there's
a race in the backend: if the Update call with the new VCSRepo values manage to
slip in before the async cleanup for the old OAuthClient is done, it'll get
nilled out instead of creating a new VCSRepo model.

IF this guess is right, then we can avoid the flake by not trying to eagerly
clean up the first OAuthClient before the rest of the subtests run. We're
cleaning up the whole org at the end of the outer test block anyway, so there's
no real benefit to being in a hurry, and an org can have multiple OAuthClient
instances at once.
---
 policy_set_integration_test.go | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/policy_set_integration_test.go b/policy_set_integration_test.go
index d1f02a666..208d0025a 100644
--- a/policy_set_integration_test.go
+++ b/policy_set_integration_test.go
@@ -225,8 +225,13 @@ func TestPolicySetsCreate(t *testing.T) {
 			t.Skip("Export a valid GITHUB_POLICY_SET_IDENTIFIER before running this test")
 		}
 
-		oc, ocTestCleanup := createOAuthToken(t, client, orgTest)
-		defer ocTestCleanup()
+		// We are deliberately ignoring the cleanup func here, because there's a potential race condition
+		// against the subsequent subtest -- TFC performs some async cleanup on VCS repos when deleting an
+		// OAuthClient, and we've seen evidence that it will zero out the next test's NEW VCSRepo values if
+		// they manage to slip in before the async stuff completes, even though the new values link it to a
+		// new OAuthToken. Anyway, there's a deferred cleanup for orgTest in the outer scope, so the org's
+		// dependent: destroy clause on OAuthClients will clean this up when the test as a whole ends.
+		oc, _ := createOAuthToken(t, client, orgTest)
 
 		options := PolicySetCreateOptions{
 			Name:         String("vcs-policy-set"),
@@ -265,8 +270,13 @@ func TestPolicySetsCreate(t *testing.T) {
 			t.Skip("Export a valid GITHUB_POLICY_SET_IDENTIFIER before running this test")
 		}
 
-		oc, ocTestCleanup := createOAuthToken(t, client, orgTest)
-		defer ocTestCleanup()
+		// We are deliberately ignoring the cleanup func here, because it's not really necessary: there's a
+		// deferred cleanup for orgTest in the outer scope, so the org's dependent: destroy clause on
+		// OAuthClients will clean this up when the test as a whole ends. Unlike the one in the previous
+		// subtest, there's no known race condition here because there aren't any later subtests that modify
+		// this same policy set. But I'm being consistent with the prior case just to reduce risks from future
+		// copypasta code.
+		oc, _ := createOAuthToken(t, client, orgTest)
 
 		options := PolicySetUpdateOptions{
 			Name:         String("vcs-policy-set"),