Sensor fails to detect containers #7

Open
PierreRust opened this issue May 19, 2021 · 8 comments
Labels: bug (Something isn't working)

@PierreRust

Hi,

I have a system where the sensor fails to detect any running container, both with plain Docker and with Kubernetes.

When looking at the source code, I see that the sensor

  • looks for containers in the subdirectories of the perf_event cgroup, /sys/fs/cgroup/perf_event/ (this can be changed with the -p flag)
  • tries to detect a type based on the path in this directory (see
    target_detect_type(const char *cgroup_path))
  • then, for Docker and Kubernetes, validates the type against an expected path:
    • perf_event/docker/... for Docker
    • perf_event/kubepods/... for Kubernetes
  • if the validation fails, the container is ignored.

On my system,

  • Kubernetes pods appear in /sys/fs/cgroup/perf_event/kubepods.slice/; they are detected but fail type validation.
  • Docker containers appear in /sys/fs/cgroup/perf_event/system.slice/; they are detected but fail type validation too.

As a consequence, no container is monitored.
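To make the failure concrete, here is a minimal sketch of a prefix-based validation step, assuming it only checks for the perf_event/docker/ and perf_event/kubepods/ fragments listed above. The enum and sketch_validate_type are hypothetical names, not the sensor's actual target_validate_type code.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical target types; the real sensor's enum may differ. */
enum sketch_target_type { SKETCH_UNKNOWN, SKETCH_DOCKER, SKETCH_KUBERNETES };

/*
 * Hypothetical sketch: accept a cgroup path for a given type only if it
 * contains the expected path fragment, as described in the list above.
 */
bool sketch_validate_type(const char *cgroup_path, enum sketch_target_type type)
{
    switch (type) {
    case SKETCH_DOCKER:
        return strstr(cgroup_path, "perf_event/docker/") != NULL;
    case SKETCH_KUBERNETES:
        return strstr(cgroup_path, "perf_event/kubepods/") != NULL;
    default:
        return false;
    }
}
```

With a check of this shape, paths such as .../perf_event/system.slice/docker-<id>.scope or .../perf_event/kubepods.slice/... are rejected even though they belong to real containers, because "kubepods." and "system.slice/" never match the expected fragments.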

Maybe we could bypass type detection and validation altogether, as I proposed in the old PR #2? That should solve this issue.
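Bypassing type detection could look roughly like this: name each target after its cgroup path relative to the base directory (the one set with -p) instead of classifying it first. This is a hypothetical helper written for illustration, not code taken from PR #2 or from the sensor.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: name a target after its path relative to the cgroup
 * base directory, without trying to detect whether it is Docker, Kubernetes,
 * libvirt, etc. Returns NULL if the path is not under the base directory.
 */
const char *sketch_target_name(const char *base, const char *cgroup_path)
{
    size_t len = strlen(base);

    /* Ignore trailing slashes on the base so "/a/b/" and "/a/b" behave alike. */
    while (len > 1 && base[len - 1] == '/')
        len--;

    if (strncmp(cgroup_path, base, len) != 0)
        return NULL;
    /* The match must end on a path boundary: "/a/bc" is not under "/a/b". */
    if (cgroup_path[len] != '/' && cgroup_path[len] != '\0')
        return NULL;
    return cgroup_path + len; /* e.g. "/system.slice/docker-<id>.scope" */
}
```

The trade-off is that targets lose their Docker/Kubernetes-specific names, but nothing is silently dropped because of an unexpected directory layout.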

Finally, I have no idea why the /sys/fs/cgroup/perf_event layout is different on this system. It's CentOS 7.2; I usually use Debian and Ubuntu systems and have never had this issue before.

Please let me know if I can add anything useful, or if you have any ideas to get me started on this issue.

@gfieni
Collaborator

gfieni commented May 19, 2021

Hello,
Could you share a sample of the output of find /sys/fs/cgroup/perf_event/kubepods.slice/ and find /sys/fs/cgroup/perf_event/system.slice/, please?

@gfieni gfieni added the bug Something isn't working label May 19, 2021
@gfieni gfieni self-assigned this May 19, 2021
@PierreRust
Author

Sure:
find /sys/fs/cgroup/perf_event/system.slice/ > system_find.txt
find /sys/fs/cgroup/perf_event/kubepods.slice/ > kube_find.txt

@PierreRustOrange
Contributor

PierreRustOrange commented Nov 17, 2021

Hello,
I still have this issue, and I use a workaround where I

  • disable all target type validation (in target_validate_type)
  • change the regexp used to resolve the target's name (I had to add .slice):
#define TARGET_KUBERNETES_EXTRACT_CONTAINER_ID_REGEX \
    "perf_event/kubepods.slice/" \
    "(besteffort/|burstable/|)" \
    "(pod[a-zA-Z0-9][a-zA-Z0-9.-]+)/" /* Pod ID */ \
    "([a-f0-9]{64})" /* Container ID */ \
    "(/[a-zA-Z0-9][a-zA-Z0-9.-]+|)" /* Resource group */
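For reference, here is a self-contained sketch of applying a pattern like this with POSIX <regex.h>; that library choice is an assumption, and the sensor's own matching code may differ. Unlike the macro above, it escapes the dot in kubepods.slice and uses an optional group instead of an empty alternative, because POSIX leaves empty alternatives in extended regexes undefined. The macro and function names are hypothetical.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variant of the workaround regex (resource-group suffix omitted). */
#define SKETCH_K8S_SLICE_REGEX \
    "perf_event/kubepods\\.slice/" \
    "((besteffort|burstable)/)?" \
    "(pod[a-zA-Z0-9][a-zA-Z0-9.-]+)/" /* group 3: pod ID */ \
    "([a-f0-9]{64})"                  /* group 4: container ID */

/*
 * Hypothetical helper: copy the container ID matched in cgroup_path into out
 * (at least 65 bytes). Returns 0 on success, -1 on compile failure or no match.
 */
int sketch_extract_container_id(const char *cgroup_path, char out[65])
{
    regex_t re;
    regmatch_t m[5];
    int ret = -1;

    if (regcomp(&re, SKETCH_K8S_SLICE_REGEX, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, cgroup_path, 5, m, 0) == 0 && m[4].rm_so >= 0) {
        size_t len = (size_t)(m[4].rm_eo - m[4].rm_so); /* always 64 here */
        memcpy(out, cgroup_path + m[4].rm_so, len);
        out[len] = '\0';
        ret = 0;
    }

    regfree(&re);
    return ret;
}
```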

I think the first step could be enabled with a CLI flag for cases where the type does not matter, but I'm not sure how to properly handle the second point. Any ideas? Perhaps allow overriding the regexes with a CLI option?

I'd like to make a PR for this, to avoid maintaining a fork for our CentOS platform. Besides, I suspect we might encounter other systems where these paths have different values.
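One possible shape for the CLI override discussed above, using getopt_long. Both options (--skip-type-validation and --k8s-regex) and the config struct are hypothetical, not existing hwpc-sensor flags; they are long-only so they cannot clash with the sensor's existing short options.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical configuration; not the sensor's actual config struct. */
struct sketch_config {
    bool skip_type_validation; /* bypass target_validate_type-style checks */
    const char *k8s_regex;     /* NULL: keep the built-in regex */
};

/* Hypothetical long-only option identifiers (outside the char range). */
enum { SKETCH_OPT_SKIP_TYPE_VALIDATION = 1000, SKETCH_OPT_K8S_REGEX };

/* Parses the hypothetical options; returns 0 on success, -1 on bad input. */
int sketch_parse_args(int argc, char *argv[], struct sketch_config *cfg)
{
    static const struct option opts[] = {
        {"skip-type-validation", no_argument, NULL, SKETCH_OPT_SKIP_TYPE_VALIDATION},
        {"k8s-regex", required_argument, NULL, SKETCH_OPT_K8S_REGEX},
        {NULL, 0, NULL, 0},
    };
    int c;

    cfg->skip_type_validation = false;
    cfg->k8s_regex = NULL;
    optind = 0; /* glibc and musl treat 0 as a request to reinitialize getopt */

    while ((c = getopt_long(argc, argv, "", opts, NULL)) != -1) {
        switch (c) {
        case SKETCH_OPT_SKIP_TYPE_VALIDATION:
            cfg->skip_type_validation = true;
            break;
        case SKETCH_OPT_K8S_REGEX:
            cfg->k8s_regex = optarg; /* points into argv, no copy */
            break;
        default:
            return -1; /* unknown option or missing argument */
        }
    }
    return 0;
}
```

Defaulting k8s_regex to NULL keeps the built-in regexes unless a user explicitly overrides them, so existing deployments would behave as before.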

@PierreRustOrange
Contributor

I almost forgot: I still have a pending PR for the first point: #5

@marceloamaral

I'm having a related problem when deploying in Kubernetes, but in my case the /sys/fs/cgroup/perf_event directory doesn't exist.

Any clues on how to fix this?

By the way, my kernel has CONFIG_CGROUP_PERF=y.

@gfieni
Collaborator

gfieni commented Jun 21, 2022

Hello @marceloamaral,
This is probably because your distribution uses the unified cgroup v2 hierarchy.
Unfortunately, support for the unified cgroup hierarchy is still a work in progress.

For now, the only way to get the sensor working in this kind of environment is to disable the unified hierarchy and to set up the Kubernetes cluster to use cgroupfs as its cgroup driver (more information).
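To check which hierarchy a node uses before deploying the sensor, one hedged approach is to look at the filesystem type mounted on /sys/fs/cgroup: cgroup2 means the unified hierarchy (no perf_event v1 directory), while tmpfs means per-controller v1 mounts, possibly with a hybrid unified mount alongside. The sketch uses statfs(2); the magic values come from <linux/magic.h> and are redefined here only as a fallback. The function and enum names are hypothetical.

```c
#include <sys/vfs.h>

/* Values from <linux/magic.h>; defined here in case that header is missing. */
#ifndef CGROUP2_SUPER_MAGIC
#define CGROUP2_SUPER_MAGIC 0x63677270
#endif
#ifndef TMPFS_MAGIC
#define TMPFS_MAGIC 0x01021994
#endif

/* Hypothetical classification of the cgroup root. */
enum sketch_cgroup_mode {
    SKETCH_CGROUP_ERROR = -1, /* statfs failed */
    SKETCH_CGROUP_V1 = 0,     /* tmpfs holding v1 controller mounts (legacy or hybrid) */
    SKETCH_CGROUP_V2 = 1,     /* unified hierarchy only */
    SKETCH_CGROUP_OTHER = 2,  /* unexpected filesystem type */
};

enum sketch_cgroup_mode sketch_classify_fs_magic(long f_type)
{
    if (f_type == CGROUP2_SUPER_MAGIC)
        return SKETCH_CGROUP_V2;
    if (f_type == TMPFS_MAGIC)
        return SKETCH_CGROUP_V1;
    return SKETCH_CGROUP_OTHER;
}

/* Hypothetical helper: classify the filesystem mounted at root. */
enum sketch_cgroup_mode sketch_detect_cgroup_mode(const char *root)
{
    struct statfs st;

    if (statfs(root, &st) != 0)
        return SKETCH_CGROUP_ERROR;
    return sketch_classify_fs_magic((long)st.f_type);
}
```

From a shell, stat -fc %T /sys/fs/cgroup gives the same information (cgroup2fs versus tmpfs).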

@marceloamaral

Thanks, I'll check it out!

@marceloamaral

Hi, I deployed hwpc-sensor in the cluster with cgroup v1.

The sensor finds the VMs running on the system, but not the containers.
I don't see any error; it simply doesn't find the containers.

kubectl logs hwpc-sensor-exporter-fb4cs
I: 22-06-29 11:14:22 build: version v1.1.2 (rev: eba2fe195878bae1afadb29fb6da7c4151c890ad) (Jan 21 2022 - 14:54:06)
I: 22-06-29 11:14:22 uname: Linux 5.4.0-66-generic #74~18.04.2-Ubuntu SMP Fri Feb 5 11:17:31 UTC 2021 x86_64
I: 22-06-29 11:14:22 pmu: found ix86arch 'Intel X86 architectural PMU' having 7 events, 7 counters (4 general, 3 fixed)
I: 22-06-29 11:14:22 pmu: found perf 'perf_events generic PMU' having 181 events, 0 counters (0 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found rapl 'Intel RAPL' having 2 events, 3 counters (0 general, 3 fixed)
I: 22-06-29 11:14:22 pmu: found perf_raw 'perf_events raw PMU' having 1 events, 0 counters (0 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha0 'Intel SkylakeX CHA0 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha1 'Intel SkylakeX CHA1 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha2 'Intel SkylakeX CHA2 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha3 'Intel SkylakeX CHA3 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha4 'Intel SkylakeX CHA4 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha5 'Intel SkylakeX CHA5 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha6 'Intel SkylakeX CHA6 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha7 'Intel SkylakeX CHA7 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha8 'Intel SkylakeX CHA8 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha9 'Intel SkylakeX CHA9 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha10 'Intel SkylakeX CHA10 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha11 'Intel SkylakeX CHA11 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha12 'Intel SkylakeX CHA12 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha13 'Intel SkylakeX CHA13 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha14 'Intel SkylakeX CHA14 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha15 'Intel SkylakeX CHA15 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha16 'Intel SkylakeX CHA16 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha17 'Intel SkylakeX CHA17 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha18 'Intel SkylakeX CHA18 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha19 'Intel SkylakeX CHA19 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio0 'Intel SkylakeX IIO0 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio1 'Intel SkylakeX IIO1 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio2 'Intel SkylakeX IIO2 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio3 'Intel SkylakeX IIO3 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio4 'Intel SkylakeX IIO4 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio5 'Intel SkylakeX IIO5 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc0 'Intel SkylakeX IMC0 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc1 'Intel SkylakeX IMC1 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc2 'Intel SkylakeX IMC2 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc3 'Intel SkylakeX IMC3 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc4 'Intel SkylakeX IMC4 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc5 'Intel SkylakeX IMC5 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m2m0 'Intel SkylakeX M2M0 uncore' having 121 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m2m1 'Intel SkylakeX M2M1 uncore' having 121 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m3upi0 'Intel SkylakeX M3UPI0 uncore' having 111 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m3upi1 'Intel SkylakeX M3UPI1 uncore' having 111 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m3upi2 'Intel SkylakeX M3UPI2 uncore' having 111 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_pcu 'Intel SkylakeX PCU uncore' having 29 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_ubo 'Intel SkylakeX U-Box uncore' having 5 events, 3 counters (2 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_upi0 'Intel SkylakeX UPI0 uncore' having 34 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_upi1 'Intel SkylakeX UPI1 uncore' having 34 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_upi2 'Intel SkylakeX UPI2 uncore' having 34 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found clx 'Intel CascadeLake X' having 85 events, 11 counters (8 general, 3 fixed)
I: 22-06-29 11:14:22 pmu: found intel_msr 'Intel MSR' having 6 events, 6 counters (0 general, 6 fixed)
I: 22-06-29 11:14:22 sensor: configuration is valid, starting monitoring...
I: 22-06-29 11:14:23 perf<all>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-63-guestvm-clone.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf<system>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-66-guestvm2x8-clone.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-64-guestvm-clone1.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-62-guestvm.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-67-guestvm2x8-clone1.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-65-guestvm2x8.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-69-guestvm8x32-clone.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-68-guestvm8x32.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf<system>: monitoring actor started

Any idea how to debug/fix that?
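A hedged first step for debugging is to list which directories under the perf_event hierarchy actually carry a 64-character container ID, then compare those paths with the prefixes the sensor expects (perf_event/docker/ and perf_event/kubepods/, per the first comment). This is a hypothetical sketch, assuming Docker/containerd-style 64-hex IDs appear in the cgroup directory names.

```c
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Returns true if name contains a run of exactly 64 lowercase hex characters. */
bool sketch_has_container_id(const char *name)
{
    size_t run = 0;

    for (const char *p = name; ; p++) {
        if (*p && ((*p >= '0' && *p <= '9') || (*p >= 'a' && *p <= 'f'))) {
            run++;
        } else {
            if (run == 64)
                return true;
            if (*p == '\0')
                return false;
            run = 0;
        }
    }
}

/*
 * Hypothetical helper: recursively print every directory below path whose
 * name carries a container ID. Returns the number of directories printed.
 */
int sketch_list_container_cgroups(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *ent;
    int found = 0;

    if (dir == NULL)
        return 0; /* missing or unreadable: nothing to report */

    while ((ent = readdir(dir)) != NULL) {
        char child[4096];
        struct stat st;

        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        if (snprintf(child, sizeof(child), "%s/%s", path, ent->d_name) >= (int)sizeof(child))
            continue; /* path too long; skip it */
        if (lstat(child, &st) != 0 || !S_ISDIR(st.st_mode))
            continue; /* cgroup control files are regular files */

        if (sketch_has_container_id(ent->d_name)) {
            printf("%s\n", child);
            found++;
        }
        found += sketch_list_container_cgroups(child);
    }
    closedir(dir);
    return found;
}
```

Calling sketch_list_container_cgroups("/sys/fs/cgroup/perf_event") on the node (or in a pod with the host's /sys/fs/cgroup mounted) prints the candidate container cgroups; if they sit under something like kubepods.slice rather than kubepods, that would point to the same validation issue described at the top of this thread.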
