Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#301 session/active time #305

Closed
wants to merge 2 commits into from
Closed

Conversation

yhuelf
Copy link
Contributor

@yhuelf yhuelf commented Mar 3, 2022

No description provided.

yhuelf added 2 commits March 3, 2022 15:37
This commit only gives the opportunity to write PG14 specific queries.
@rjuju
Copy link
Member

rjuju commented Mar 3, 2022

Hi,

IMHO it's a good idea to mix instant and cumulated counters in the same service. I'm also unsure of how useful it's, so it's another argument to be a in a dedicated service.

Also, the "global_" prefix is quite misleading, maybe using "instance_" instead?

@yhuelf
Copy link
Contributor Author

yhuelf commented Mar 3, 2022

Hi Julien,

I also have a patch for a dedicated service, but @ioguix told me it would be better to include these new metrics in a existing one. I'll let the two of you debate this ;-)

And yes, the "_instance" prefix would certainly be better (the code was very much modelled on the commit_ratio service).

@rjuju
Copy link
Member

rjuju commented Mar 3, 2022

Was it a private discussion or did I miss some activity?

In any case, agreed let's wait for him to chime in.

@yhuelf
Copy link
Contributor Author

yhuelf commented Mar 3, 2022

Yes it was a private discussion (sorry for the confusion). I've been given time to work on check_pgactivity, a few weeks spread into this year, and I asked him what I could do to get a foothold in this project. Other suggestions are welcomed :)

@rjuju
Copy link
Member

rjuju commented Mar 3, 2022

Well, there are already a few bugs opened (https://github.com/OPMDG/check_pgactivity/issues?q=is%3Aopen+is%3Aissue+label%3Abug) or otherwise quite a lot of other enhancements ideas in the bug tracker. Or simply improve the tap tests :)

@ioguix
Copy link
Member

ioguix commented Mar 3, 2022

Hi,

We reviewed some existing issues and spot the #301 that leads to the internal short discussion, and this PR.

When discussing it, I couldn't find any interesting alerting use case based on these metrics. Creating a service only for two little metrics seemed overkill as well.

On the other hand, we usually fetch more metrics than needed in other services for pure metrics. Adding two metrics to an existing service which is already related to the backends numbers was easier, seemed logical, coherent with other services and allow to add them without setting anything on the monitoring side.

Moreover, the metrics have different units and are not supposed to be on the same graph anyway.

Now, I really have no strong feelings and didn't review the patch yet.

@rjuju
Copy link
Member

rjuju commented Mar 3, 2022

Well, as I said in my first answer I don't really see how helpful those metrics are. It becomes even worse on environment where you have connection pooling.

@yhuelf
Copy link
Contributor Author

yhuelf commented Mar 4, 2022

I guess it could be helpful to decide whether or not you should set up a connection pooling.

And maybe, once it has been set up, how efficient it is. If there's no activity at night, but pgBouncer maintains a pool of server connections, then a graph of these metrics over time could be useful (good active ratio during workday, bad average).

@rjuju
Copy link
Member

rjuju commented Mar 4, 2022

I guess it could be helpful to decide whether or not you should set up a connection pooling.

I'm not sure it's really that helpful. Using a pooler is already a solution for specific problems that can already be identified without those info.

And maybe, once it has been set up, how efficient it is. If there's no activity at night, but pgBouncer maintains a pool of server connections, then a graph of these metrics over time could be useful (good active ratio during workday, bad average).

I'm also doubtful. It depends on your configuration as e.g. your pooler could aggresively drop idle connection or not.

If you really want to integrate it I still thing that having a dedicate check with a bit more thinking on what could be done with the info would be better. For instance, we could detect which how the busy time is distributed among the databases.

@ioguix
Copy link
Member

ioguix commented Mar 4, 2022

I'm not sure it's really that helpful. Using a pooler is already a solution for specific problems that can already be identified without those info.

What are the other means are you thinking about? From check_pga perspective, I don't remember there's much, out of the backend_status check which gather only sampled data and nothing about new sessions.

One of the reason you might want to add a pooler is to lower the number of sessions per second and we have nothing about this currently (neither in current code base or in this PR).

If you really want to integrate it I still thing that having a dedicate check with a bit more thinking on what could be done with the info would be better. For instance, we could detect which how the busy time is distributed among the databases.

Thinking about the other new pg_stat_database fields, I can imagine some of them might actually be useful to monitor with the thresholds:

  • sessions: useful to compute the number of sessions/s
  • sessions_killed: useful to detect eg. a backend crash killing all other backends

Moreover, active_time, idle_in_transaction_time and session_time metrics might help to diagnose some issues after some application update if their shape change much more than expected. I can't imagine how a threshold might be useful on them though.

I believe the sessions_abandoned and sessions_fatal fields would rise way too many false positive, but htey might be useful as pure metrics as well.

So, indeed, this might qualify for a dedicate check if you agree with these certainly not exhausted use case.

@rjuju
Copy link
Member

rjuju commented Mar 6, 2022

What are the other means are you thinking about? From check_pga perspective, I don't remember there's much, out of the backend_status check which gather only sampled data and nothing about new sessions.

Nothing in check_pgactivity indeed. But the need for a pooler is something really tied to your architecture, I don't think that's the job of monitoring to tell you that you need a pooler, as it's not its job to tell you that you need to do backups or setup replicas.

Now if we can have a new service that gives some useful information and can also help you notice that a pooler would be a good idea that's great, but it's not something that should really change, unlike other properties discussed here.

sessions_killed: useful to detect eg. a backend crash killing all other backends

Huh? This field shows controlled kill, like pg_terminate_backend() or a postmaster clean stop. If there's a crash that kills everything, you will lose all pg_stat_* data anyway.

Moreover, active_time, idle_in_transaction_time and session_time metrics might help to diagnose some issues after some application update if their shape change much more than expected. I can't imagine how a threshold might be useful on them though.

Agreed, especially idle_in_transaction_time. This one would deserve threshold.

So, indeed, this might qualify for a dedicate check if you agree with these certainly not exhausted use case.

Agreed.

@yhuelf
Copy link
Contributor Author

yhuelf commented Mar 8, 2022

OK I will submit a new PR during my second week of check_pgactivity dedicated week (mid-april).

@ioguix
Copy link
Member

ioguix commented May 3, 2022

PR #308 now exists, so we can close this one and move on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants