
no simple batch retrieval (possible memory issues) #441

Open
chungg opened this issue Oct 23, 2017 · 5 comments
chungg commented Oct 23, 2017

if we're doing a basic batch retrieval, the current code needlessly combines the series into a matrix and evaluates it.

two issues:

  • we have to pass a start/stop timestamp and overlap==0% to actually get the entire series, or else it will clip start/stop to where the series overlap (and they might not overlap at all)
  • more importantly, the matrix can be extremely sparse if the series are big and have no overlap
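A minimal sketch of what a direct batch path could look like (hypothetical names and dtype, assumed from the `'timestamps'`/`'values'` fields used later in this thread; this is not the current code):

```python
import numpy

# Hypothetical sketch: for a plain batch GET, each series can be returned
# as-is, with no intermediate (n_series x n_unique_timestamps) grid.
def batch_get_simple(tss):
    # tss: list of structured arrays with 'timestamps' and 'values' fields
    return {i: (ts['timestamps'], ts['values']) for i, ts in enumerate(tss)}
```

Since nothing is combined, memory stays proportional to the points actually stored, regardless of overlap.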
sileht commented Oct 24, 2017

About the first issue, I would like to change some defaults: set needed_overlap to None by default.
When both fill and needed_overlap are None, I plan to use fill='null' and then remove all NaNs from the result.
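The NaN-stripping half of that plan could look roughly like this (a sketch under the assumption that `'null'` fill produces NaN points, as in the benchmark below; the helper name is made up):

```python
import numpy

# Sketch of the proposed default: gaps are filled with NaN ('null'),
# then the NaNs are dropped so non-overlapping points are still returned.
def drop_null_fill(values):
    # values: 1-D array possibly containing NaN fill points
    return values[~numpy.isnan(values)]
```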

@jd jd added the enhancement label Oct 24, 2017
@jd jd added this to the 4.1 milestone Oct 24, 2017
@sileht sileht self-assigned this Oct 24, 2017
chungg commented Oct 26, 2017

copied from PR, for reference.

ok, so i ran:

In [25]: def extra_work(tss):
    ...:     # concatenate every series, then spread them onto a
    ...:     # (n_series x n_unique_timestamps) grid padded with NaN
    ...:     combine = numpy.concatenate(tss)
    ...:     times, indices = numpy.unique(combine['timestamps'], return_inverse=True)
    ...:     filler = numpy.NaN
    ...:     val_grid = numpy.full((len(tss), len(times)), filler)
    ...:     start = 0
    ...:     for i, split in enumerate(tss):
    ...:         size = len(split)
    ...:         val_grid[i][indices[start:start + size]] = split['values']
    ...:         start += size
    ...:     values = val_grid.T
    ...:     # strip the NaN padding back out of each series
    ...:     for blah in values.T:
    ...:         pos = ~numpy.isnan(blah)
    ...:         v = blah[pos]
    ...:     return v
    ...: 

In [27]: %timeit extra_work(tss[:5])
10000 loops, best of 3: 172 µs per loop

In [28]: %timeit extra_work(tss[:50])
100 loops, best of 3: 2.76 ms per loop

In [29]: %timeit extra_work(tss[:10])
1000 loops, best of 3: 331 µs per loop

In [30]: %timeit extra_work(tss[:100])
10 loops, best of 3: 27.4 ms per loop

In [31]: %timeit extra_work(tss[:200])
10 loops, best of 3: 110 ms per loop

In [32]: %timeit extra_work(tss)
1 loop, best of 3: 694 ms per loop

tss is 500 series of 500 points that do not overlap...
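For reference, input like that could be generated along these lines (a hypothetical generator; the dtype is an assumption based on the `'timestamps'`/`'values'` fields `extra_work` reads):

```python
import numpy

# Hypothetical benchmark input: n_series structured arrays whose
# timestamp ranges never overlap.
def make_series(n_series=500, n_points=500):
    tss = []
    for i in range(n_series):
        ts = numpy.zeros(n_points, dtype=[('timestamps', 'i8'),
                                          ('values', 'f8')])
        start = i * n_points  # shift each series past the previous one
        ts['timestamps'] = numpy.arange(start, start + n_points)
        ts['values'] = numpy.random.random(n_points)
        tss.append(ts)
    return tss
```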

so i guess if a user attempts to pull 500 series with 500 points each, current code adds 700ms... which i guess isn't bad compared to how long it'll take to render all those points?

that said, the memory :(

In [47]: tss[0].nbytes
Out[47]: 8000
# memory for a single ts.. 500  * 8K ~= 4MB for all series.

In [48]: extra_work(tss).nbytes
Out[48]: 1000000000
# 1GB size of the intermediate matrix for 500 non-overlapping series

In [58]: extra_work(tss[:100]).nbytes
Out[58]: 40000000
# 40MB size of intermediate matrix for 100 non-overlapping series
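The arithmetic behind those numbers: with zero overlap, every point gets its own column, so the grid is n_series × (n_series × n_points) float64 cells.

```python
# Sanity check of the reported intermediate-matrix sizes.
n_series, n_points = 500, 500
n_unique_times = n_series * n_points          # no overlap: 250,000 columns
grid_bytes = n_series * n_unique_times * 8    # 8 bytes per float64 cell
print(grid_bytes)                             # 1000000000, i.e. ~1 GB
```

The same formula gives 100 × (100 × 500) × 8 = 40,000,000 bytes for the 100-series case, matching the 40 MB figure above.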

jd commented Oct 26, 2017

@chungg @sileht so we have a fix on the way for this?

chungg commented Oct 26, 2017

nope. my only solution is to branch and skip our matrix code if it's a simple batch get...

it's just the edge case that's an issue: A LOT of BIG series that do not overlap
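That branch-and-skip idea could be sketched like this (illustrative names only, not Gnocchi's actual API):

```python
import numpy

# Hypothetical branch: skip the matrix code entirely for a plain batch get.
def retrieve(tss, operations=None):
    if not operations:
        # simple retrieval: hand each series back untouched, no grid built
        return [ts['values'] for ts in tss]
    # an aggregation was requested: fall back to the matrix-based path
    # (stubbed here; the real code would build the val_grid shown above)
    raise NotImplementedError("matrix-based evaluation goes here")
```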

sileht commented Oct 27, 2017

As @chungg said, it's just an edge case. We can reduce the memory usage (a lot), but that will slow down the applied-operations computation (a bit).

I think we should just keep this ticket open and see if people run into it. If it becomes problematic, we can implement the branching approach.

I bet people will usually retrieve metrics that share the same archive policy: you want all the CPU info, all the memory info, or all the disk usages.

@jd jd removed this from the 4.1 milestone Oct 27, 2017