Agentbeat packaging failures: aarch64-linux-gnu/bin/ld.gold: internal error in maybe_apply_stub, at ../../gold/aarch64.cc:5407 #41270

Closed
cmacknz opened this issue Oct 16, 2024 · 11 comments · Fixed by #41365

@cmacknz
Member

cmacknz commented Oct 16, 2024

The following error has been observed intermittently in Beats packaging after dependency updates of the AWS SDK and the GCP SDK.

# github.com/elastic/beats/v7/x-pack/agentbeat
/usr/local/go/pkg/tool/linux_amd64/link: running aarch64-linux-gnu-gcc failed: exit status 1
/usr/lib/gcc-cross/aarch64-linux-gnu/6/../../../../aarch64-linux-gnu/bin/ld.gold: internal error in maybe_apply_stub, at ../../gold/aarch64.cc:5407
collect2: error: ld returned 1 exit status
Error: running "go build -o build/golang-crossbuild/agentbeat-linux-arm64 -buildmode pie -gcflags=all=-N -l -tags=agentbeat -ldflags -X github.com/elastic/beats/v7/libbeat/version.buildTime=2024-10-03T12:37:52Z -X github.com/elastic/beats/v7/libbeat/version.commit=698951e8c895eff9d03a8d0ececadba3b8b4c6bb" failed with exit code 1

This error was resolved by reverting the following two unrelated PRs:

The go.mod changes do not overlap for those two PRs. Some non-obvious problem in the dependency graphs is causing this.

I suspect this is related to a change in https://github.com/elastic/golang-crossbuild that only reproduces under specific, infrequent conditions. It may be a bug in aarch64-linux-gnu-gcc, in which case updating the version included in the crossbuild image would resolve it.

@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Oct 16, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Member Author

cmacknz commented Oct 16, 2024

Quoting @rdner on Slack with more context on how to reproduce this:

I managed to reproduce this error locally using the same command (needs to be run in the root of the Beats repo):

docker run --env DEV=true --rm --env GOFLAGS="-mod=readonly -buildvcs=false" --env MAGEFILE_VERBOSE= --env MAGEFILE_TIMEOUT= --env SNAPSHOT=true -v $PWD:/go/src/github.com/elastic/beats -w /go/src/github.com/elastic/beats/x-pack/agentbeat docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-arm --build-cmd "build/mage-linux-arm64 golangCrossBuild" --platforms linux/arm64

When I reverted my repository to a25c5a5 (the latest successful packaging run on CI), the command succeeded.

So it's safe to say that this failure is due to 89ed20d.
I've created a revert PR: #41269

@pierrehilbert
Collaborator

@cmacknz as this is only happening when we are bumping the GCP SDK to a newer version, should we ask observability to investigate?

@cmacknz
Member Author

cmacknz commented Oct 17, 2024

It happened with both the AWS SDK bump and the GCP SDK bump, but I'm not sure we can conclude that it has something to do with cloud SDKs. Both of those PRs have a large set of dependencies and it's more likely there is some conflict in an indirect dependency (possibly different each time) triggering a bug in the linker.
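One way to hunt for such an indirect-dependency change is to compare the resolved module lists (for example, the output of `go list -m all`) from the last good commit and from the failing one. The sketch below assumes both listings have already been captured as text. `parseModules` and `diffModules` are hypothetical helper names, and the sample modules in `main` are made up.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// parseModules turns `go list -m all` style output ("path version" per line)
// into a map from module path to version. Lines without a version (such as
// the main module) are skipped.
func parseModules(listing string) map[string]string {
	mods := make(map[string]string)
	for _, line := range strings.Split(listing, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		mods[fields[0]] = fields[1]
	}
	return mods
}

// diffModules returns one sorted line per module whose version differs
// between the two listings, including modules that were added or removed.
func diffModules(before, after string) []string {
	b, a := parseModules(before), parseModules(after)
	var out []string
	for path, oldV := range b {
		newV, ok := a[path]
		switch {
		case !ok:
			out = append(out, fmt.Sprintf("%s %s -> (removed)", path, oldV))
		case newV != oldV:
			out = append(out, fmt.Sprintf("%s %s -> %s", path, oldV, newV))
		}
	}
	for path, newV := range a {
		if _, ok := b[path]; !ok {
			out = append(out, fmt.Sprintf("%s (added) -> %s", path, newV))
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Made-up sample data, for illustration only.
	before := "example.com/a v1.0.0\nexample.com/b v0.3.0\n"
	after := "example.com/a v1.1.0\nexample.com/c v0.1.0\n"
	for _, line := range diffModules(before, after) {
		fmt.Println(line)
	}
}
```

Running this for each of the two reverted PRs and intersecting the results would show whether they share any changed indirect dependency.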

@mauri870
Member

mauri870 commented Oct 17, 2024

This is very likely a bug in the gcc cross-compiler or in gold. Checking the docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-arm image, the toolchain is quite old:

$ ld.gold --version
GNU gold (GNU Binutils for Debian 2.28) 1.14
Copyright (C) 2017 Free Software Foundation, Inc.

$ aarch64-linux-gnu-gcc --version
aarch64-linux-gnu-gcc (Debian 6.3.0-18) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.

$ gcc --version
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.

As you can see, the toolchain is GCC 6, from 2017. GCC stable is currently at version 14. We should probably focus on getting these up-to-date, it is likely that it will improve compatibility in general.
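A build could flag an old linker up front by parsing the banner line printed by `ld.gold --version`. This is only a sketch: `parseBinutilsVersion` and `olderThan` are hypothetical helpers, and the 2.30 threshold in `main` is an arbitrary placeholder, not a known fixed-in version for this bug.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// binutilsRe captures the Binutils version from a banner such as
// "GNU gold (GNU Binutils for Debian 2.28) 1.14".
var binutilsRe = regexp.MustCompile(`Binutils[^)]*?(\d+)\.(\d+)`)

// parseBinutilsVersion extracts the major and minor Binutils version from
// the banner. The final result is false when no version could be found.
func parseBinutilsVersion(banner string) (int, int, bool) {
	m := binutilsRe.FindStringSubmatch(banner)
	if m == nil {
		return 0, 0, false
	}
	major, errMajor := strconv.Atoi(m[1])
	minor, errMinor := strconv.Atoi(m[2])
	if errMajor != nil || errMinor != nil {
		return 0, 0, false
	}
	return major, minor, true
}

// olderThan reports whether major.minor is below minMajor.minMinor.
func olderThan(major, minor, minMajor, minMinor int) bool {
	return major < minMajor || (major == minMajor && minor < minMinor)
}

func main() {
	banner := "GNU gold (GNU Binutils for Debian 2.28) 1.14"
	major, minor, ok := parseBinutilsVersion(banner)
	if !ok {
		fmt.Println("could not parse the Binutils version")
		return
	}
	// 2.30 is a hypothetical threshold chosen for illustration.
	if olderThan(major, minor, 2, 30) {
		fmt.Printf("Binutils %d.%d is older than the assumed minimum\n", major, minor)
	}
}
```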

@shmsr
Member

shmsr commented Oct 20, 2024

Agree with @mauri870's comment; bumping the linker and gcc versions is definitely a good idea.

I spent some time debugging this when I saw it in the other Go 1.23.2 upgrade PR. Initially, I couldn't reproduce it, and everything worked fine even when I tried different Go versions. I was using the command PLATFORMS=linux/arm64 PACKAGES=tar.gz mage -v package.

The reproducer shared here does replicate the issue on my setup as well: #41270 (comment)

I compared the two Docker commands: the one invoked internally by PLATFORMS=linux/arm64 PACKAGES=tar.gz mage -v package, and the one shared above that reproduces the issue.

After experimenting for a while, I noticed that the DEV var is causing this.

When DEV=true, I can reproduce the issue. If DEV=false, I cannot.

So, here's another reproducer:

DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage -v package

This fails too, but DEV=false works.

I believe the culprit here is -gcflags=all=-N -l, which is being added here:

args.ExtraFlags = append(args.ExtraFlags, `-gcflags=all=-N -l`)
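Roughly, the DEV switch behaves like the sketch below. Only the appended flag string comes from the line above. The `buildArgs` type, the `applyDevMode` function, and the way `DEV` is parsed are hypothetical stand-ins, not the actual Beats build code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// buildArgs is a hypothetical stand-in for the real build arguments struct.
type buildArgs struct {
	ExtraFlags []string
}

// applyDevMode appends the debugger-friendly compiler flags when devMode is
// set: -N disables optimizations and -l disables inlining, for all packages.
func applyDevMode(args buildArgs, devMode bool) buildArgs {
	if devMode {
		args.ExtraFlags = append(args.ExtraFlags, `-gcflags=all=-N -l`)
	}
	return args
}

func main() {
	// Assumption: DEV is read as a boolean; unset or invalid means false.
	devMode, _ := strconv.ParseBool(os.Getenv("DEV"))
	args := applyDevMode(buildArgs{ExtraFlags: []string{"-tags=agentbeat"}}, devMode)
	fmt.Println(args.ExtraFlags)
}
```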

[Update: I narrowed down which flag triggers the issue, and it is -N. Building with only -l (disabling inlining) works, but disabling optimizations with -N breaks the link.]
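Isolating the flag amounts to building once per flag combination and noting which builds fail to link. The sketch below only generates the `-gcflags` arguments to try and does not run any build. `gcflagVariants` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// gcflagVariants returns every subset of the given compiler flags, formatted
// as a -gcflags argument applied to all packages. The empty subset yields "",
// meaning a build with no extra compiler flags.
func gcflagVariants(flags []string) []string {
	var out []string
	for mask := 0; mask < 1<<len(flags); mask++ {
		var picked []string
		for i, f := range flags {
			if mask&(1<<i) != 0 {
				picked = append(picked, f)
			}
		}
		if len(picked) == 0 {
			out = append(out, "")
			continue
		}
		out = append(out, "-gcflags=all="+strings.Join(picked, " "))
	}
	return out
}

func main() {
	for _, v := range gcflagVariants([]string{"-N", "-l"}) {
		fmt.Printf("go build -buildmode pie %s -tags=agentbeat\n", v)
	}
}
```

With two flags this yields four builds: no extra flags, -N only, -l only, and both.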

It seems there's an issue with the linker when compiler optimizations are disabled. Could someone else check whether DEV=false also avoids the issue in their setup?

I think this also explains why the CI passes when packaging agentbeat because DEV=false during that step.

@rdner rdner self-assigned this Oct 21, 2024
@cmacknz
Member Author

cmacknz commented Oct 21, 2024

For snapshot builds I think we have DEV=true on, but not for staging.

  - group: Packaging snapshot
    if: build.branch =~ /^[0-9]+\.[0-9x]+\$/ || build.branch == 'main' || build.env('RUN_SNAPSHOT') == "true"
    key: packaging-snapshot
    depends_on: start-gate-snapshot
    steps:
      - label: "SNAPSHOT: {{matrix}}"
        env:
          PLATFORMS: "${PLATFORMS}"
          SNAPSHOT: true
          DEV: true

I need to dig further to see where the snapshot DRA artifacts actually get used. If these end up in the official snapshot images, we need to turn this off; otherwise the way they are built doesn't match what we eventually release.

Upgrading the cross toolchain is a good idea. I recall we were limited by the version of glibc we needed to build the 7.17 branch on all supported platforms, but there have been a few revisions to the support matrix since then.

@rdner
Member

rdner commented Oct 21, 2024

@cmacknz is there any case where we need DEV=true images? I can't come up with any.

@cmacknz
Member Author

cmacknz commented Oct 21, 2024

It enables using a debugger. I would think locally rebuilding with DEV=true would be acceptable, since this capability is not in the release binaries anyway.

rdner added a commit to rdner/beats that referenced this issue Oct 22, 2024
Packaging with `DEV=true` adds additional Go flags that sometimes lead
to linker failures using the old versions of `ld.gold`.

See elastic#41270
@rdner
Member

rdner commented Oct 22, 2024

> We should probably focus on getting these up-to-date, it is likely that it will improve compatibility in general.

@mauri870

According to our support matrix https://www.elastic.co/support/matrix, we still support Debian 10 (released on 2019-07-06, EOL 2022-09-10) in the latest version of Beats (8.15.x).

This is the main reason why we're still crossbuilding using this image docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-darwin-arm64-debian10. AFAIK, there is a strict dependency on a certain glibc version when we build the Beats binaries and this is why we have to use Debian 10 for building them. For more details, please read this thread #34921 (comment)

@shmsr thank you very much for your investigation. I opened a PR to remove the DEV=true mode when building snapshots: #41365

This should prevent such failures in the future.
Unfortunately, I don't think we can update to a newer Debian version just yet.

rdner added a commit that referenced this issue Oct 22, 2024
Packaging with `DEV=true` adds additional Go flags that sometimes lead
to linker failures using the old versions of `ld.gold`.

See #41270
rdner added a commit to rdner/beats that referenced this issue Oct 23, 2024
We're dropping support for Debian 10, so no need to crossbuild using
the outdated image anymore.

The old linker in Debian 10 caused a packaging issue with some Go
dependency updates elastic#41270

So, this update should also help with that.
rdner added a commit that referenced this issue Oct 24, 2024
We're dropping support for Debian 10, so no need to crossbuild using
the outdated image anymore.

The old linker in Debian 10 caused a packaging issue with some Go
dependency updates #41270

So, this update should also help with that.

This also updates the statically linked glibc from 2.28 to 2.31.
leehinman added a commit to leehinman/elastic-agent that referenced this issue Oct 28, 2024
leehinman added a commit to elastic/elastic-agent that referenced this issue Oct 29, 2024
mergify bot pushed a commit to elastic/elastic-agent that referenced this issue Oct 29, 2024
(cherry picked from commit c9cd580)

# Conflicts:
#	.buildkite/scripts/steps/integration-package.sh
#	.buildkite/scripts/steps/k8s-extended-tests.sh
ycombinator pushed a commit to elastic/elastic-agent that referenced this issue Oct 30, 2024
(cherry picked from commit c9cd580)

# Conflicts:
#	.buildkite/scripts/steps/integration-package.sh
#	.buildkite/scripts/steps/k8s-extended-tests.sh
leehinman added a commit to elastic/elastic-agent that referenced this issue Oct 31, 2024
(cherry picked from commit c9cd580)
leehinman added a commit to elastic/elastic-agent that referenced this issue Nov 8, 2024
(cherry picked from commit c9cd580)
pierrehilbert added a commit to elastic/elastic-agent that referenced this issue Nov 8, 2024
* Fix like elastic/beats#41270 (#5868)

(cherry picked from commit c9cd580)

# Conflicts:
#	.buildkite/scripts/steps/integration-package.sh
#	.buildkite/scripts/steps/k8s-extended-tests.sh

* Update integration-package.sh

* Update k8s-extended-tests.sh

---------

Co-authored-by: Lee E Hinman <57081003+leehinman@users.noreply.github.com>
Co-authored-by: Pierre HILBERT <pierre.hilbert@elastic.co>
@belimawr
Contributor

I managed to reproduce this issue today on an ARM64 MacBook:

exec: go "build" "-o" "build/golang-crossbuild/agentbeat-linux-arm64" "-buildmode" "pie" "-gcflags=all=-N -l" "-tags=agentbeat" "-ldflags" "-X github.com/elastic/beats/v7/libbeat/version.buildTime=2024-12-13T17:00:15Z -X github.com/elastic/beats/v7/libbeat/version.commit=9758447b09134d9959f387fb3fae5ffb375f0d44"
# github.com/elastic/beats/v7/x-pack/agentbeat
/usr/local/go/pkg/tool/linux_arm64/link: running aarch64-linux-gnu-gcc failed: exit status 1
/usr/bin/ld.gold: internal error in maybe_apply_stub, at ../../gold/aarch64.cc:5407
collect2: error: ld returned 1 exit status

The same error happens with Go 1.23.0 and Go 1.22.10.

I'm using docker.elastic.co/beats-dev/golang-crossbuild:1.22.9-arm, which seems to be the latest one.
