
dnscollector_requesters_total could be extended to include time dimension and reduced with netmasks #520

Open
johnhtodd opened this issue Dec 21, 2023 · 2 comments
Labels
bug (Something isn't working), feature request

Comments


johnhtodd commented Dec 21, 2023

Is your feature request related to a problem? Please describe.
It seems dnscollector_requesters_total in the Prometheus stats keeps a list of every IP address that has ever been seen. This is potentially a crash risk (IPv6 hosts with dynamic low-order bits can make nearly every request arrive from a new address), and it also lacks the dimension that is most useful for monitoring, which to me is: "How many unique IP addresses have I seen in the last X seconds?"

By tracking a gauge of the number of addresses seen in the last X seconds, an operator could graph the result and understand the volume of IP addresses seen by a particular data stream. Right now it is not possible to see how many clients have been in a data stream recently; it is only possible to see the maximum number of clients ever seen by that data stream across its entire lifespan.

Describe the solution you'd like
I am wary of dnscollector_requesters_total keeping all IP addresses forever; this seems to be a bug, since unbounded storage is impossible. It would be more useful to keep a time-limited count of how many requesters have been seen in the last X minutes, where X is configurable. It may also be very useful to apply a netmask, configured separately for v4 and v6 addresses, to the uniqueness criteria. This is probably most important for v6 addresses, and no meaningful data would be lost by aggregating at wider netmasks where hosts control the low-order bits.

If no DNS event is seen from an IP address in X seconds, then that IP address would be removed from the list of addresses, the memory associated with that address would be freed, and no data would exist in the program memory space until at least one DNS event is seen again from that IP address. This prevents infinite memory creep.

@dmachard (Owner)

You're correct; the Prometheus logger retains all the IPs it encounters (as well as all qnames) in memory. This poses a significant problem and could potentially lead to app crashes.

Indeed, I also believe the Prometheus logger should be rewritten to support pipeline mode.
In the meantime, it might be possible to mitigate this risk.

@dmachard added the bug label Dec 22, 2023
@johnhtodd (Author)

Perhaps having a maximum retention time for each type of stored data (qname, IP address), plus a total maximum number of bytes allowed in the pool for each, would be the way to minimize the risk. I can speak for my own interests and say that in many cases these limits could be set very low and would still be quite effective.

For each stored pool of values, keep per-value [first seen, last seen] timestamps, then run a housekeeping sweep every X seconds to delete anything that has expired. If the maximum memory for that particular pool is reached, delete the entries with the oldest "last seen" timestamps even if they have not yet passed the configured maximum age.

Ideally, this could just be attached to Prometheus (see #510) so that each routing stanza could split its output to a named Prometheus route, with statistics kept as separate counters. It would be a breaking change, since the current model of storing everything in one big "pool" of qnames and IP addresses wouldn't match this model, though perhaps some default behavior could be implied and shut off explicitly. It would also alter the REST API/swagger, but not by much, since it would be easy to add a name structure to the swagger calls (http://127.0.0.1:8080/my-prometheus-tagname/clients instead of http://127.0.0.1:8080/clients) plus a command to list all the various sets of data. I don't know enough about swagger in this implementation to guess whether this is sufficient, but it seems reasonable.

These might be potential configuration items that would be added to Prometheus.

recording:
  ip-recording:
    maximum-memory-mb: 100
    maximum-ttl-seconds: 600
    v4-netmask: 24
    v6-netmask: 56
  qname-recording:
    maximum-memory-mb: 100
    maximum-ttl-seconds: 600
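If the proposed keys above were adopted, the corresponding Go-side config could be mirrored by a struct along these lines. The struct and its validation are hypothetical; the field tags simply echo the suggested YAML key names, and the 600-second default comes from the example values above.

```go
package main

import "fmt"

// RecordingConfig mirrors one proposed recording block (ip-recording or
// qname-recording). Netmask fields would be ignored for qname pools.
type RecordingConfig struct {
	MaximumMemoryMB   int `yaml:"maximum-memory-mb"`
	MaximumTTLSeconds int `yaml:"maximum-ttl-seconds"`
	V4Netmask         int `yaml:"v4-netmask"`
	V6Netmask         int `yaml:"v6-netmask"`
}

// Validate applies a default TTL and rejects out-of-range netmasks.
func (c *RecordingConfig) Validate() error {
	if c.MaximumTTLSeconds <= 0 {
		c.MaximumTTLSeconds = 600 // default taken from the example above
	}
	if c.V4Netmask < 0 || c.V4Netmask > 32 {
		return fmt.Errorf("v4-netmask out of range: %d", c.V4Netmask)
	}
	if c.V6Netmask < 0 || c.V6Netmask > 128 {
		return fmt.Errorf("v6-netmask out of range: %d", c.V6Netmask)
	}
	return nil
}

func main() {
	c := RecordingConfig{MaximumMemoryMB: 100, V4Netmask: 24, V6Netmask: 56}
	if err := c.Validate(); err != nil {
		panic(err)
	}
	fmt.Println(c.MaximumTTLSeconds) // prints 600 (defaulted)
}
```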
