
Frontend not finding some metrics in carbon cache. How to determine whether this is a problem with carbon-cache or if the metrics are stuck in a relay? #782

Closed · mwtzzz-zz opened this issue Jun 10, 2018 · 16 comments

mwtzzz-zz commented Jun 10, 2018

The webapp is supposed to retrieve and display metrics that are in carbon-cache's memory before they have been written to disk (in addition to metrics that are already on disk).

I can verify this is working as expected for some of my metrics by enabling LOG_CACHE_HITS, looking at the logs, and seeing "query returned" with 1+ results when I refresh my browser. The timestamp on disk is older, so I know the webapp is successfully getting values from both disk and cache.
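
(For reference, that check just needs cache-hit logging switched on in the relevant [cache:X] sections of carbon.conf:)

LOG_CACHE_HITS = True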

But I have other metrics that do not return any recent values until they have been flushed to disk. In some cases this means waiting 30+ minutes. During the wait, the log shows the query returning 0 results when I refresh my browser. Data is not lost; it eventually makes it to disk, but in the meantime the browser shows an absence of recent data.

I've done some testing and I don't know if this is because:

(a) metrics are in carbon-cache memory but carbon-cache cannot find them
or
(b) metrics are stuck on the relay and have not made it to carbon-cache yet

How do I figure it out? And what can be done to fix it?
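
One way to distinguish (a) from (b) would be to query each carbon-cache instance directly on its CACHE_QUERY_PORT, bypassing the webapp entirely. A rough sketch, assuming carbon's usual cache-query protocol (the same length-prefixed pickle framing graphite-web's CarbonLink uses); the ports and metric name are placeholders for whatever a given setup actually runs:

import pickle
import socket
import struct

def cache_query(host, port, metric):
    # CarbonLink framing: a 4-byte big-endian length prefix followed by a
    # pickled request dict; the reply uses the same framing.
    payload = pickle.dumps({'type': 'cache-query', 'metric': metric}, protocol=2)
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(struct.pack('!L', len(payload)) + payload)
        size = struct.unpack('!L', sock.recv(4))[0]
        data = b''
        while len(data) < size:
            chunk = sock.recv(size - len(data))
            if not chunk:
                raise IOError('connection closed mid-response')
            data += chunk
        return pickle.loads(data)
    finally:
        sock.close()

# Ask every instance whether it is holding the metric in memory. If some
# instance returns datapoints, the metric reached carbon-cache (case a);
# if none do, it is probably stuck upstream in the relay (case b).
for port in (7201, 7202, 7203, 7204, 7205, 7206):
    reply = cache_query('127.0.0.1', port, 'some.metric.path')
    print(port, len(reply.get('datapoints', [])), 'datapoints')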

mwtzzz-zz (Author) commented Jun 10, 2018

Update: I'm pretty sure (but not positive) the problem is with carbon-cache. I ran a test where I turned off the relays and then refreshed the browser. The idea was that, by turning off the relays, the data would either get flushed to carbon-cache or disappear with the relay.

The result: still no recent data showed up in the browser until about ten minutes later, when it was written to disk. This tells me the data is being held by carbon-cache but not returned when the front-end requests it.

Is there a way to log when carbon-cache receives a new metric on its line receiver? That way we would know for sure it is receiving the metric.

More importantly, why would it not return metrics that it has in its cache? Maybe an internal queue limit is being hit? I'm running six instances that process about 2 million updates/minute. If I increase that from six to eight or ten instances, would that help?

piotr1212 (Member) commented:

Are these "new" metrics for which a .wsp file does not exist yet? Are you querying the right cache?

deniszh (Member) commented Jun 10, 2018

@mwtzzz

webapp is supposed to retrieve and display metrics that are in carbon cache's memory before they have been written to disk (in addition to the metrics that are already on disk).

Unfortunately, that's not true. It's one of the oldest bugs that still exists: graphite-project/graphite-web#629

mwtzzz-zz (Author) commented Jun 10, 2018

@piotr1212 These are existing metrics that already have a .wsp file and I'm definitely querying the right cache.

@deniszh Well, shoot.... It sounds like adding cache instances can help alleviate the problem, though?
Some of my metric values do get displayed even when they haven't made it to disk yet, so it works for some of them. Does this mean the problem is with the size of the data, or with the hash of the metric name?

I should note that we're not using wildcards.

deniszh (Member) commented Jun 10, 2018

@mwtzzz

Some of my metrics values do get displayed even when they haven't made it to disk yet, so it works for some of them.

I never saw this behavior TBH.

mwtzzz-zz (Author) commented:

@deniszh Can you help me troubleshoot and fix it?

mwtzzz-zz (Author) commented:

@deniszh
Here is an example of a metric that is not giving me trouble:
atlantic_exchange.request.sum_all.hosts

Here is an example of a metric that is giving me trouble:
atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts

deniszh (Member) commented Jun 11, 2018

@mwtzzz: please check @piotr1212's comment: graphite-project/graphite-web#629 (comment)
Looks like you're facing a different issue...

mwtzzz-zz (Author) commented Jun 11, 2018

@deniszh Yes, it looks like a different problem, because mine involves .wsp files that already exist. Can you help me troubleshoot? I'm stuck and not sure what else to try.

piotr1212 (Member) commented:

What is your setup? Running a relay? Hashing? If so, make sure the carbonlink destinations in the carbon config are literally equal to CARBONLINK_HOSTS in local_settings.py. No exceptions in query.log, console.log, or the webapp log? Which OS? Which Python version? Could you post your carbon.conf and local_settings.py online?

mwtzzz-zz (Author) commented Jun 11, 2018

@piotr1212
I'm running a relay layer using carbon-c-relay with the fnv1a_ch hash.
No exceptions in the logs.
OS: Amazon Linux 2015.09
Python: 2.7

Here's how the carbonlink and cluster_servers destinations look in local_settings.py:

CARBONLINK_HOSTS = ["127.0.0.1:7201:2105","127.0.0.1:7202:2107","127.0.0.1:7203:2109","127.0.0.1:7204:2111","127.0.0.1:7205:2113","127.0.0.1:7206:2115"]
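# Note: each CARBONLINK_HOSTS entry above is "host:port:instance"; the instance
# name must match a [cache:NAME] section in carbon.conf, and the webapp hashes
# each metric to one (host, instance) pair to decide which cache to query.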

CLUSTER_SERVERS = ['radar-be-a.mgmt.yyy.com:8030','radar-be-b.mgmt.yyy.com:8030','radar-be-c.mgmt.yyy.com:8030','radar-be-d.mgmt.yyy.com:8030','radar-be-e.mgmt.yyy.com:8030', ... ]

Here's my carbon.conf. There are six [cache] entries (2105, 2107, etc.) with CACHE_QUERY_PORT 7201, 7202, etc.:

[cache:2105]
ENABLE_LOGROTATION = True
USER = yyy
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 10000
MAX_CREATES_PER_MINUTE = 3000
USE_INSECURE_UNPICKLER = False
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CACHE_HITS = False
LOG_CREATES = False
LOG_CACHE_QUEUE_SORTS = False
CACHE_WRITE_STRATEGY = timesorted
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
WHISPER_LOCK_WRITES = False
CARBON_METRIC_PREFIX = carbon.agents.radar123
CARBON_METRIC_INTERVAL = 60
ENABLE_AMQP = False
ENABLE_MANHOLE = False

ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 0
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2104
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2105
CACHE_QUERY_PORT = 7201

mwtzzz-zz (Author) commented:

Here are some logs that might help:

From cache.log:

Mon Jun 11 11:08:09 2018 :: CarbonLink cache-query request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts returned 0 datapoints
Mon Jun 11 11:08:12 2018 :: Request-Cache miss [8c49f1d46f13caab4a3409ea7eeff4dc]
Mon Jun 11 11:08:12 2018 :: Data-Cache miss [e8da12dd07d2498aca62b1e42d1c829a]
Mon Jun 11 11:08:12 2018 :: Request-Cache miss [7f8cee809c167513113e667baf88e08e]
Mon Jun 11 11:08:12 2018 :: Data-Cache miss [a6fadeacec0b735c4d577bb25e7ca05e]


mwtzzz-zz (Author) commented Jun 12, 2018

@piotr1212 I found a clue.

I decided to test each cache instance one at a time: I edited local_settings.py and removed five of the six entries in CARBONLINK_HOSTS, leaving only one, then iterated through them. When I tested CARBONLINK_HOSTS = ["127.0.0.1:7205:2113"], the data points showed up for this metric and cache.log shows 25 data points retrieved:

Mon Jun 11 20:35:56 2018 :: Request-Cache miss [8c49f1d46f13caab4a3409ea7eeff4dc]
Mon Jun 11 20:35:56 2018 :: Data-Cache miss [52fd8f9bd86cdba7ec015a07be13d28a]
Mon Jun 11 20:35:56 2018 :: Request-Cache miss [7f8cee809c167513113e667baf88e08e]
Mon Jun 11 20:35:56 2018 :: Data-Cache miss [9d471edfc61931ce1613fc64cd86df7b]
Mon Jun 11 20:35:57 2018 :: CarbonLink sending request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts to ('127.0.0.1', '2113')
Mon Jun 11 20:35:57 2018 :: CarbonLink finished receiving atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts from ('127.0.0.1', '2113')
Mon Jun 11 20:35:57 2018 :: CarbonLink cache-query request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts returned 25 datapoints

These 25 data points correspond to the 25 minutes that have elapsed since the whisper file was last updated.

But if I put CARBONLINK_HOSTS back to its original setting:
CARBONLINK_HOSTS = ["127.0.0.1:7201:2105","127.0.0.1:7202:2107","127.0.0.1:7203:2109","127.0.0.1:7204:2111","127.0.0.1:7205:2113","127.0.0.1:7206:2115"]
it can no longer retrieve the data points.

A hashing problem?

mwtzzz-zz (Author) commented Jun 12, 2018

@piotr1212 @deniszh
It's querying the wrong cache instance: 2115 instead of 2113:

Mon Jun 11 20:56:20 2018 :: Request-Cache miss [8c49f1d46f13caab4a3409ea7eeff4dc]
Mon Jun 11 20:56:20 2018 :: Data-Cache miss [b610953d155b2e73634893bc1a6b5d89]
Mon Jun 11 20:56:20 2018 :: Request-Cache miss [7f8cee809c167513113e667baf88e08e]
Mon Jun 11 20:56:20 2018 :: Data-Cache miss [0ba2e41252e01e4fab74db41b4f8b741]
Mon Jun 11 20:56:20 2018 :: CarbonLink sending request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts to ('127.0.0.1', '2115')
Mon Jun 11 20:56:20 2018 :: CarbonLink finished receiving atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts from ('127.0.0.1', '2115')
Mon Jun 11 20:56:20 2018 :: CarbonLink cache-query request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts returned 0 datapoints

My relay writes to these cache instances like this:


cluster cache
  fnv1a_ch
    127.0.0.1:2105=2105
    127.0.0.1:2107=2107
    127.0.0.1:2109=2109
    127.0.0.1:2111=2111
    127.0.0.1:2113=2113
    127.0.0.1:2115=2115
  ;
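
This would explain the misrouting: graphite-web's default CarbonLink hashing (carbon_ch) positions keys with an MD5-based hash, while the relay's fnv1a_ch uses FNV-1a, so the same metric name maps to a different ring position and usually a different owning instance. A rough standalone sketch of the two hash functions (an illustration only, not the actual ring code from either project):

import hashlib

def fnv1a_32(data):
    # Standard 32-bit FNV-1a, as used by fnv1a_ch-style rings.
    h = 0x811c9dc5
    for byte in bytearray(data):
        h = ((h ^ byte) * 0x01000193) & 0xffffffff
    return h

def carbon_ch_position(key):
    # carbon_ch derives a ring position from the first 2 bytes of md5(key).
    return int(hashlib.md5(key).hexdigest()[:4], 16)

metric = b'atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts'
print('fnv1a-style position:', fnv1a_32(metric))
print('carbon_ch position:  ', carbon_ch_position(metric))
# The two schemes give unrelated positions, so a ring populated by the relay
# with one scheme and queried by the webapp with the other will often pick
# a different instance, e.g. 2115 instead of 2113.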

mwtzzz-zz (Author) commented Jun 12, 2018

I fixed it by updating graphite-web, adding CARBONLINK_HASHING_TYPE = 'fnv1a_ch' to local_settings.py, and installing scandir via pip as per graphite-project/graphite-web#1867.

Apparently this problem had been with us for quite some time, but we didn't realize it until we moved from fast NVMe drives to slower EBS volumes.
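
In other words, the webapp's CarbonLink ring must use the same hashing scheme as the relay. The one-line change in local_settings.py:

CARBONLINK_HASHING_TYPE = 'fnv1a_ch'  # match the relay's fnv1a_ch cluster hash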

deniszh (Member) commented Jun 12, 2018

Cool, good to know it's solved now!
Sorry, I was busy and wasn't able to help you, unfortunately.
