Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metric queries return null data until whisper file exists #629

Closed
bitprophet opened this issue Feb 20, 2014 · 25 comments · Fixed by graphite-project/carbon#960 · May be fixed by graphite-project/carbon#888
Closed

Metric queries return null data until whisper file exists #629

bitprophet opened this issue Feb 20, 2014 · 25 comments · Fixed by graphite-project/carbon#960 · May be fixed by graphite-project/carbon#888
Labels
Milestone

Comments

@bitprophet
Copy link
Member

Scenario

  • You have a large Graphite install and/or are just stuck with slow disks, such that data enters carbon-cache and then doesn't reach disk for a number of minutes.
  • You add new metric paths to your system that didn't exist before - new hosts or services came online, you renamed some metrics in your collectors, whatever.
  • You are querying Graphite-web for these new metrics directly (aka bypassing the 'finder') via direct HTTP calls, or a dashboard, etc.
  • Those queries result in no data until the above to-disk lag period passes.
  • Users get sad because they made all the changes on their end (re: data collection) but then they have to wait a long time for validation/usefulness.

Cause

Querying the webapp for a specific metric path returns no data until it hits disk even if data is in the cache, due to the short-circuit linked here.

Solution

Update readers.py such that it performs a best-effort check of the cache for the requested metric path before giving up and returning None.

Downsides/challenges

  • This may incur a (hopefully minor!) performance hit compared to existing behavior, in the case of truly-bogus metric path queries, but feels quite worth it, at least in my use case where lag is 20-30 minutes at times.
  • Unless Carbon/CarbonLink is capable of servicing glob-expression requests (guessing not, given the abovelinked code) this will only solve things for fully explicit queries. Still useful some of the time, but not a full solution.
    • EDIT: Yup, the query is straight-up keying into the carbon-cache's cache dict. Could probably shoehorn in a way to handle globs, however, if core team thinks that's a reasonable thing to do.
@obfuscurity
Copy link
Member

I'm really surprised to see that we don't attempt to hit the cache before disk. I'd like to get @mleinart's feedback on this proposed change and why we started returning None in 764bdf7 without checking the cache too.

@g76r
Copy link
Contributor

g76r commented Dec 17, 2014

@obfuscurity: I didn't talk to @mleinart but I had a look in the code to change that behaviour by myself.

IMHO it should be interesting to search the cache for series not yet present on disk, however it needs some changes since carbon cache store series in a hashtable, therefore resolving glob-expression will need another memory structure in addition to the hashtable (or, less likely, instead of the hashtable). This is probably why Graphite-web does not request carbon cache first.

@ocervell
Copy link

Any updates on this ? I am running into the same problem.

@deniszh
Copy link
Member

deniszh commented Jul 11, 2016

@ocervell - unfortunately not yet.
We still need to attack this issue.

@deniszh deniszh added this to the 1.1.0 milestone Sep 8, 2016
@deniszh
Copy link
Member

deniszh commented Sep 8, 2016

It completely slips off our radars. Developers on my work hitting that constantly, I even using dirty hack for 0.9.x
Will try to make a fix for a master, but adding 1.1.0 milestone just in case.

@ocervell
Copy link

Issue is 3 years old. Any update ?
My company moving out of Graphite because of this.
I was still hoping that this issue would get resolved since I'm using Graphite in my own projects.

@DanCech
Copy link
Member

DanCech commented Oct 12, 2017

The big thing that would be needed for this to work would be the ability for carbon to respond to a find request, since right now if the whisper file doesn't exist then the standard finder is never going to find the series and the code won't even get as far as calling the whisper reader.

If carbon were to provide a find method then it seems like it should be possible to move all the cache functionality out into a finder and reader for the cache, and handle merging the cached data into the final results via new the MultiReader mechanism. The biggest issue there seems to be aligning the data from the cache since it's stored with the raw timestamps and not aligned to a step.

@deniszh
Copy link
Member

deniszh commented Oct 14, 2017

@ocervell : There're 2 main reasons why this issue is not in active development yet. First - that logic buried quite deep in whisper itself, and fix for this is quite big and massive, as @DanCech said.
On the other hand - if you will use SSDs and increase the number of your caches delay can be quite low and acceptable. I'm not saying that we should fix it, just explaining why even 3rd party graphite implementations still have this issue. :(

@Exocomp
Copy link

Exocomp commented Oct 14, 2017

You have a large Graphite install and/or are just stuck with slow disks, such that data enters carbon-cache and then doesn't reach disk for a number of minutes.

I'm hitting the above scenario. I have to lower MAX_UPDATES_PER_SECOND because it kills the disk's throughput but that causes the cache to increase. Then queries start returning null for a few mintues because the cache is not being checked. ¯\_(ツ)_/¯

@deniszh deniszh modified the milestones: 1.1.0, 1.2.0 Nov 19, 2017
@AsenZahariev
Copy link

@deniszh , i saw that you included this issue into milestones 1.1.0 and 1.2.0, can you confirm in which version it will be definitely included ? Seems that we are struggling with the same problem.
Thank you!

@deniszh
Copy link
Member

deniszh commented Dec 15, 2017

Milestone 1.2.0 means only 'not now'. Currently, we have no solution for that problem, and not actively working on it.
Sorry.

@mwtzzz-zz
Copy link

I just wanted to add my two cents. We also are running a large graphite installation. This problem has been apparent for a while, but was tolerable because we were using nvme-backed storage. But I recently converted to iops-provisioned EBS volumes (because we were losing historical data every time an nvme drive bombed out) and the problem has been exasperated.

It's problematic for us because a lot our monitoring and auto-scaling stuff relies on being able to retrieve timely metrics from the graphite front-end.

I should note that for us, some metrics current values are retrievable from the cache, while others aren't. I don't know if this is because of a hashing problem or because of the sheer number of metrics in the cache at any given time. But in either case, the fact that it retrieves some of them means it shouldn't be a big deal to update the code to retrieve any of them, right? In our case, we're not using wildcards, if that makes a difference.

@piotr1212
Copy link
Member

I should note that for us, some metrics current values are retrievable from the cache, while others aren't. I don't know if this is because of a hashing problem or because of the sheer number of metrics in the cache at any given time. But in either case, the fact that it retrieves some of them means it shouldn't be a big deal to update the code to retrieve any of them, right? In our case, we're not using wildcards, if that makes a difference.

This issue is about metrics for which a .wsp file does not exist yet. They never return results. I don't get why it would sometimes return data for you. In graphite-project/carbon#782 (comment) you indicated that the metrics you are having trouble with already have a .wsp file. Is this another issue you are experiencing?

@mwtzzz-zz
Copy link

@piotr1212 Ok, if this issue #629 is about metrics for which a .wsp file does not exist yet, then my problem is different. Sorry to bother.

@stale
Copy link

stale bot commented Apr 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 13, 2020
@twm
Copy link

twm commented Apr 14, 2020

This is still a thing AFAIK.

@stale stale bot removed the stale label Apr 14, 2020
@piotr1212
Copy link
Member

There is no easy way to fix this without changing the whole architecture.

@g76r
Copy link
Contributor

g76r commented Apr 14, 2020

@piotr1212 the whole architecture I don't think so, but it would need a kind of index in addition to the hashtable in the write cache, in order to be able to find new series names in the cache with a globing pattern and not only knowing their exact name after globing on the filesystem.

@piotr1212
Copy link
Member

@g76r Graphite-web needs to figure out which cache to query, as long as globbing is not done it just doesn't know which cache has that metric. Querying all caches on every metric with a glob is not scalable at all.

So you will need a index daemon which keeps a list (or rather tree or some faster datastructure) of all metrics. When a cache receives a metric it will first have to check if it already exists on disk instead of just pushing it to cache and worrying about creating it later at write time. It can check it on disk which is slow or keep an internal index. With an internal index you have to reindex at startup or save it on disk.
If the metric doesn't exists it has to inform the index daemon. Graphite-web will have to query the index daemon on every request which has a glob, the index daemon will return a list of caches which have this metric, or just metricnames so that web can do the carbon hash lookup itself.
Then this index daemon needs to be distributed in some way because otherwise it will become a huge bottleneck in performance.

@g76r
Copy link
Contributor

g76r commented Apr 15, 2020

@piotr1212 graphite-web already query every carbon instance, to complete on-disk data with its still-in-memory data. because graphite-web has already no way to known which cache contains still-in-memory data.
at less it's the way it worked when I wrote my comment above, on 17 Dec 2014.

@g76r
Copy link
Contributor

g76r commented Apr 15, 2020

PS:
I don't say : "it's easy it's should be done" I'm only saying "it's probably more easy and with less impact than you seems to think"
and to be frank maybe in 2014 I would have coded it because I needed graphite in those days, but I'm no longer using it so I won't contribute, therefore I wouldn't be offended if this issue was closed...

@g76r
Copy link
Contributor

g76r commented Apr 15, 2020

@piotr1212 imho what you call a distributed index is actually carbon cache daemon

@piotr1212
Copy link
Member

@piotr1212 graphite-web already query every carbon instance, to complete on-disk data with its still-in-memory data. because graphite-web has already no way to known which cache contains still-in-memory data.
at less it's the way it worked when I wrote my comment above, on 17 Dec 2014.

No, it uses carbon consistent hashing to determine which cache to query. I don't know if it ever worked differently.

basically:

  • web receives a metric with globs
  • web checks disk to expand the globs
  • web runs carbon consistent hash function on expanded metrics to determine which cache to query
  • web queries the cache's which hasve the in memory data.

@piotr1212
Copy link
Member

Another option which would be easy to implement would be to create the new files earlier.

piotr1212 added a commit to piotr1212/carbon that referenced this issue Apr 16, 2020
In large Graphite installations the queue's can get really long. It can take
an hour for Graphite to write all metrics in queue. New db files are created
when the metric is written, which can take too long. This separates the creation
of metrics from writing data to them and moves the creation to an earlier moment.

Whenever a new metric is received it's name is pushed to a new_metric list. The
first step in the writer loop is to check if there are new metrics received and
creates them if they don't exist on disk yet. After the creation the writer
continues as usual with writing metrics from the queue but it does not check if
the file already exists, to prevent that the check occurs twice and has impact
on IO. If the file does not exists at thie point it is logged.

Fixes: graphite-project/graphite-web#629
@stale
Copy link

stale bot commented Jun 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet