Write new indexing process #348

hackartisan · 2017-02-23T14:27:22Z

ActiveFedora::Base.reindex_everything is inefficient and must be run twice. Given that it takes more than an entire workday to run even a single time, it would be worth rewriting so that we can run it only once and more efficiently.

hackartisan · 2017-02-23T14:31:40Z

Two strategies discussed on slack for improving efficiency are:

- Multithread the process, e.g.: https://github.com/avalonmediasystem/avalon/blob/develop/lib/tasks/avalon.rake#L250-L264
- Do not call update_index on every object. Instead, call to_solr, accumulate the resulting documents, and write them to Solr only once a batch (say 1000) has built up. Update: Bill Dueber mentioned that he experimented with this and found 100 to be a good batch size.
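
A minimal sketch of the batching idea, not the eventual implementation. `FakeSolr`, `Item`, and `reindex_in_batches` are hypothetical stand-ins so the example runs on its own; a real task would pass actual repository objects and a real Solr connection.

```ruby
# Stand-in for a Solr connection; it only records each batch it receives.
class FakeSolr
  attr_reader :batches

  def initialize
    @batches = []
  end

  def add(docs)
    @batches << docs
  end
end

# Stand-in for a repository object that can describe itself as a Solr document.
Item = Struct.new(:id) do
  def to_solr
    { "id" => id }
  end
end

# Hypothetical helper: build documents with to_solr and send them to Solr
# in groups of batch_size instead of one write per object.
def reindex_in_batches(objects, solr, batch_size: 100)
  objects.each_slice(batch_size) do |slice|
    solr.add(slice.map(&:to_solr))
  end
end

solr = FakeSolr.new
items = (1..250).map { |i| Item.new("item-#{i}") }
reindex_in_batches(items, solr, batch_size: 100)
puts solr.batches.map(&:size).inspect # prints [100, 100, 50]
```

With 250 objects and a batch size of 100, Solr receives three writes instead of 250.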

hackartisan · 2017-02-23T14:32:22Z

See also this issue about 2 invocations: samvera/hydra-head#402

hackartisan · 2017-02-24T14:12:56Z

A single baseline run on staging took nearly 10.5 hrs:

$ time bundle exec rake chf:reindex RAILS_ENV=production
reindex complete

real    625m20.498s
user    54m13.081s
sys     1m18.049s

Note that to reindex correctly you must run it twice, so a full reindex would take about 21 hrs.

hackartisan · 2017-02-24T15:44:58Z

Benchmarking a multithreaded run:

$ time bundle exec rake chf:reindex[2] RAILS_ENV=production

Well, that didn't work: the wall-clock time barely changed.

real    602m5.382s
user    75m10.674s
sys     1m27.205s
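
For scale, the two `real` values reported in this thread can be compared directly. This is only arithmetic on those two numbers; `real_seconds` is a hypothetical helper written for this illustration.

```ruby
# Parse the "real" value printed by `time`, e.g. "625m20.498s", into seconds.
def real_seconds(str)
  m = str.match(/\A(\d+)m([\d.]+)s\z/)
  raise ArgumentError, "unexpected format: #{str}" unless m

  m[1].to_i * 60 + m[2].to_f
end

baseline = real_seconds("625m20.498s") # single-threaded run
threaded = real_seconds("602m5.382s")  # chf:reindex[2] run

puts format("baseline: %.2f h", baseline / 3600)      # prints "baseline: 10.42 h"
puts format("threaded: %.2f h", threaded / 3600)      # prints "threaded: 10.03 h"
puts format("speedup:  %.2fx", baseline / threaded)   # prints "speedup:  1.04x"
```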

jrochkind · 2017-03-27T16:07:00Z

My PRs have been merged into ActiveFedora to batch Solr adds, and to index permissions objects first, which should remove the need to index twice.

Batching Solr adds speeds up reindexing significantly (roughly an order of magnitude).
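
As a rough illustration of the "permissions first" ordering (a sketch under assumptions, not the merged ActiveFedora code): `Record`, `PRIORITY_MODELS`, and the model names are placeholders invented for this example.

```ruby
# Hypothetical record: an ID plus the name of the model it belongs to.
Record = Struct.new(:id, :model)

# Placeholder list of model names that should be indexed before everything else.
PRIORITY_MODELS = ["Permission"].freeze

# Move priority records to the front, keeping the original relative order
# within each group.
def priority_first(records, priority_models = PRIORITY_MODELS)
  first, rest = records.partition { |r| priority_models.include?(r.model) }
  first + rest
end

records = [
  Record.new("work-1", "Work"),
  Record.new("perm-1", "Permission"),
  Record.new("work-2", "Work"),
  Record.new("perm-2", "Permission")
]

puts priority_first(records).map(&:id).inspect
# prints ["perm-1", "perm-2", "work-1", "work-2"]
```

The idea is that anything a later document depends on is already in the index by the time that document is built, so a second pass should not be needed.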

I am going to put both of these fixes in locally, not by monkey patching but by making a new class that our reindex task can use. I will add a load-time check for the ActiveFedora version expected to include these fixes, which will warn that we may not need the local code anymore.
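
A sketch of what such a load-time check could look like. The threshold version below is a placeholder, not the real release that contains the fixes, and the helper names are hypothetical; in the app the installed version would presumably come from `ActiveFedora::VERSION`.

```ruby
# Placeholder threshold: replace with the first ActiveFedora release that
# actually ships the fixes. "99.0.0" is NOT a real release number.
FIXED_IN = Gem::Version.new("99.0.0")

# True when the installed version is at or past the release with the fixes.
def local_reindexer_obsolete?(installed, fixed_in = FIXED_IN)
  Gem::Version.new(installed) >= fixed_in
end

# Print a warning at load time when the local class is probably redundant.
def warn_if_obsolete(installed, io = $stderr)
  return false unless local_reindexer_obsolete?(installed)

  io.puts "ActiveFedora #{installed} may already include the reindex fixes; " \
          "the local reindexer class may no longer be needed."
  true
end

warn_if_obsolete("11.1.4") # below the placeholder threshold, prints nothing
```

Comparing with `Gem::Version` rather than raw strings avoids mistakes such as treating "9.10.0" as older than "9.9.0".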

@HackMasterA , does this make sense as an approach?

Concurrency may speed it up even more, but I think these changes may make it fast enough to be workable. I can try working on concurrency if we want it even faster. I'm not totally positive I'll have avoided the need for a double index; we'll have to check that.
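
If concurrency does turn out to be worth trying, one possible shape is a small pool of worker threads pulling IDs from a shared queue. This is a sketch only: `process_concurrently` is a hypothetical helper, and the block stands in for whatever per-object work the real task would do.

```ruby
# Hypothetical helper: run the given block for every id using a fixed number
# of worker threads. Results come back in completion order, not input order.
# Ids must not be nil or false, since a nil pop marks the end of the queue.
def process_concurrently(ids, thread_count: 2, &block)
  queue = Queue.new
  ids.each { |id| queue << id }
  queue.close # once closed and drained, pop returns nil and workers stop

  results = Queue.new
  workers = Array.new(thread_count) do
    Thread.new do
      while (id = queue.pop)
        results << block.call(id)
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

docs = process_concurrently(%w[a b c d e], thread_count: 2) { |id| { "id" => id } }
puts docs.size # prints 5
```

On MRI, threads mainly help when the per-object work spends its time waiting on I/O, so the gain depends on where the time actually goes.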

hackartisan · 2017-03-27T17:52:55Z

Sounds great. Feel free to close this ticket without doing any concurrency; the performance gain was definitely the goal, as opposed to the strategy for getting there.

jrochkind · 2017-03-27T18:07:56Z

okay, can't close the ticket without doing that other stuff first, will do.

It'll still be slow: it takes 20 minutes just to get all the IDs out of Fedora, plus probably another 20-40 minutes to actually index. But that'll still be an order of magnitude improvement! One step at a time, we'll do that first.

hackartisan added the gantt: best practices label Feb 23, 2017

hackartisan added this to the Soft launch milestone Feb 23, 2017

hackartisan mentioned this issue Feb 23, 2017

Facets overhaul / configuration #308

Closed

hackartisan added a commit that referenced this issue Feb 24, 2017

Update solr reindex job to run multithreaded, refs #348

ef5a283

hackartisan added gantt: solr and removed gantt: best practices labels Mar 13, 2017

jrochkind added the jrochkind-interested label Mar 27, 2017

jrochkind mentioned this issue Mar 27, 2017

Better mass index #412

Merged

hackartisan closed this as completed in #412 Mar 31, 2017
