frontend not finding some metrics in carbon cache. How to determine whether this is a problem with carbon-cache or if the metrics are stuck in a relay? #782
Comments
Update: I'm pretty sure (but not positive) the problem is with carbon-cache. I ran a test where I turned off the relays and then refreshed the browser. The idea was that with the relays off, the data would either get flushed to carbon-cache or disappear with the relay. The result was that recent data still did not show up in the browser until about 10 minutes later, when it was written to disk. This tells me the data is being held by carbon-cache but is not being returned when the front-end requests it. Is there a way to log when carbon-cache receives a new metric from the line? That way we would know for sure it is receiving the metric. More importantly, why would it not return metrics that it has in its cache? Maybe an internal queue limit is being hit? I'm running six instances that are processing about 2 million updates/minute. If I increase this number from six to eight or ten, would that help with this issue? |
are these "new" metrics for which a .wsp file does not exist yet? Are you querying the right cache? |
Unfortunately, that's not true. It's one of the oldest bugs that still exists - graphite-project/graphite-web#629 |
@piotr1212 These are existing metrics that already have a .wsp file and I'm definitely querying the right cache. @deniszh Well, shoot... It sounds like adding cache instances can help alleviate the problem, though? I should note that we're not using wildcards. |
I never saw this behavior TBH. |
@deniszh Can you help me troubleshoot and fix it? |
@deniszh Here is an example of a metric that is giving me trouble: |
@mwtzzz : please check @piotr1212 comment - graphite-project/graphite-web#629 (comment) |
@deniszh Yes, looks like it's a different problem because mine involves .wsp files that already exist. Can you help me to troubleshoot? I'm stuck, I'm not sure what else to do. |
What is your setup? Are you running a relay? With hashing? If so, make sure that the carbonlink destinations in your carbon config are literally equal to CARBONLINK_HOSTS in local_settings.py. Are there any exceptions in query.log, console.log, or the webapp log? Which OS and which Python version? Could you post your carbon.conf and local_settings.py online? |
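The consistency check suggested above can be sketched in a few lines of Python. This is only an illustration: the host strings are hypothetical, the function names are made up for this example, and it assumes the usual `host:port:instance` entry format. It also assumes that the relay destinations and CARBONLINK_HOSTS legitimately use different ports (receiver port versus cache query port), so it compares only the host and instance parts. IPv6 addresses are not handled.

```python
def parse_hosts(entries):
    """Turn 'host:port:instance' strings into a set of (host, instance) pairs.

    The port is deliberately ignored (assumption: relay destinations and
    CARBONLINK_HOSTS point at different ports of the same cache instance).
    """
    pairs = set()
    for entry in entries:
        parts = entry.strip().split(":")
        host = parts[0]
        instance = parts[2] if len(parts) > 2 else None
        pairs.add((host, instance))
    return pairs


def compare(relay_destinations, carbonlink_hosts):
    """Return (only_in_relay, only_in_webapp) as sorted lists of pairs."""
    relay = parse_hosts(relay_destinations)
    web = parse_hosts(carbonlink_hosts)
    return sorted(relay - web, key=str), sorted(web - relay, key=str)


# Hypothetical values, not taken from the configuration in this thread.
relay_side = ["127.0.0.1:2104:a", "127.0.0.1:2106:b"]
webapp_side = ["127.0.0.1:7201:a", "127.0.0.1:7202:c"]
only_relay, only_web = compare(relay_side, webapp_side)
print("only in relay destinations:", only_relay)
print("only in CARBONLINK_HOSTS:", only_web)
```

Two empty lists mean both sides name the same instances. That alone does not prove the two sides hash metrics the same way, since the hashing method also has to match.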
@piotr1212 Here's how the carbonlink and cluster_servers destinations look in local_settings.py:
Here's my carbon.conf. There are six entries (2105, 2107, etc) with cache_query_port 7201, 7202, etc.:
|
Here are some logs that might help: From cache.log:
|
@piotr1212 I found a clue. I decided to test each cache instance one at a time. I edited local_settings.py and got rid of five entries in CARBONLINK_HOSTS, leaving only one. I iterated through these, and when I tested
These 25 data points correspond to the 25 minutes that have elapsed since the whisper file was last updated. But if I put CARBONLINK_HOSTS back to its original setting: A hashing problem? |
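Instead of editing CARBONLINK_HOSTS for every test, each cache instance could be asked directly over its cache query port. The sketch below is an assumption about the CarbonLink wire format as I understand it (a 4-byte big-endian length prefix followed by a pickled dict with `type` and `metric`, answered by a length-prefixed pickled dict containing `datapoints`). Please verify it against the carbon and graphite-web versions you run. The function names are invented for this example, and the pickle protocol version is a guess chosen for compatibility.

```python
import pickle
import socket
import struct


def build_request(metric):
    """Frame a cache-query request: 4-byte length prefix + pickled dict."""
    payload = pickle.dumps({"type": "cache-query", "metric": metric}, protocol=2)
    return struct.pack("!L", len(payload)) + payload


def recv_exactly(sock, n):
    """Read exactly n bytes from the socket or raise if it closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before full response")
        buf += chunk
    return buf


def query_cache(host, port, metric, timeout=3.0):
    """Ask one carbon-cache instance which datapoints it holds in memory."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(metric))
        (length,) = struct.unpack("!L", recv_exactly(sock, 4))
        # Only unpickle responses from cache instances you operate yourself.
        response = pickle.loads(recv_exactly(sock, length))
    return response.get("datapoints", [])


# Example use (hypothetical metric name; ports follow the 7201, 7202, ...
# pattern mentioned in this thread):
#
#   for port in range(7201, 7207):
#       points = query_cache("127.0.0.1", port, "some.metric.name")
#       print(port, len(points))
```

If exactly one instance reports datapoints for a metric while the webapp reports none, the data is in a cache and the webapp is asking a different instance than the one the relay wrote to.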
@piotr1212 @deniszh
My relay writes to these cache instances like this:
|
I fixed it by updating graphite-web and adding CARBONLINK_HASHING_TYPE = 'fnv1a_ch' to local_settings. Apparently this problem had been with us for quite some time, but we didn't realize it until we moved from fast NVMe drives to slower EBS volumes. |
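Why a hashing-type mismatch produces this symptom can be shown with a toy consistent-hash ring. This is a simplified illustration, not carbon's or graphite-web's actual ring: the `Ring` class, the replica count, the node labels, and the metric names are all made up. It only demonstrates that two rings built over the same nodes with different hash functions send many metrics to different nodes, so the webapp ends up asking a cache instance that never received the metric.

```python
import bisect
import hashlib


def md5_position(key):
    """Ring position from the first 16 bits of an MD5 digest (toy version)."""
    return int(hashlib.md5(key.encode()).hexdigest()[:4], 16)


def fnv1a_position(key):
    """Ring position from the standard 32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in key.encode():
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, kept to 32 bits
    return h


class Ring:
    """Minimal consistent-hash ring; each node gets `replicas` positions."""

    def __init__(self, nodes, position, replicas=100):
        self.position = position
        self.entries = sorted(
            (position(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.keys = [pos for pos, _ in self.entries]

    def get_node(self, key):
        # First ring entry at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self.keys, self.position(key)) % len(self.entries)
        return self.entries[idx][1]


def mismatches(metrics, ring_a, ring_b):
    """Metrics that the two rings assign to different nodes."""
    return [m for m in metrics if ring_a.get_node(m) != ring_b.get_node(m)]


nodes = ["a", "b", "c", "d", "e", "f"]                  # six hypothetical instances
metrics = [f"servers.host{i}.cpu" for i in range(200)]  # hypothetical metric names
relay_ring = Ring(nodes, fnv1a_position)   # what the relay uses to write
webapp_ring = Ring(nodes, md5_position)    # what the webapp uses to query
wrong = mismatches(metrics, relay_ring, webapp_ring)
print(f"{len(wrong)} of {len(metrics)} metrics would be queried on the wrong instance")
```

Metrics that happen to land on the same node under both rings keep working, which fits the observation that only some metrics were missing recent values. With fast disks the gap before the flush to the .wsp file was short enough to go unnoticed.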
Cool, good to know that's solved now! |
webapp is supposed to retrieve and display metrics that are in carbon cache's memory before they have been written to disk (in addition to the metrics that are already on disk).
I can verify this is working as expected for some of my metrics, by enabling LOG_CACHE_HITS, looking at the logs and seeing "query returned" 1+ results when I refresh my browser. The timestamp on disk will be older, and so I know that the webapp is successfully getting the values from both disk and in-cache.
But I have other metrics that are not returning any recent values until they have been flushed to disk. In some cases, this means waiting 30+ minutes. During this wait, the log shows "query return 0 results" when I refresh my browser. Data is not lost; eventually it makes it to disk, but in the meantime the browser is showing an absence of recent data.
I've done some testing and I don't know if this is because:
(a) metrics are in carbon-cache memory but carbon-cache cannot find them
or
(b) metrics are stuck on the relay and have not made it to carbon-cache yet
How do I figure it out? And what can be done to fix it?