Recipe: Add full text indexing to your app
There are a number of areas in the Hydra stack that need to be touched to support full-text indexing. Sufia supports full-text indexing using Apache Tika (which is provided in Apache Solr), and here's how it's implemented. (Note: if you're using Sufia, this is already done for you!)
The Solr schema contains a field called all_text_timv.
The Solr config pulls in a bunch of extraction libraries and adds the all_text_timv field to the default qf and pf. The ExtractingRequestHandler must be enabled as well.
Sufia uses a rake task to download extraction libraries and store them where Solr looks for them.
The all_text_timv field is added to the all_fields search qf in the Catalog controller.
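As a rough illustration of what adding the field to qf amounts to, here is a minimal plain-Ruby sketch. The helper name all_fields_solr_parameters and the sample field names are hypothetical; a real catalog controller would set these parameters through its own search field configuration rather than a standalone method.

```ruby
# Hypothetical helper (not part of Sufia or Blacklight): builds the Solr
# parameters for an "all fields" search, with the full-text field appended
# to the list of query fields (qf).
def all_fields_solr_parameters(field_names, full_text_field: "all_text_timv")
  # uniq guards against listing the full-text field twice.
  { qf: (field_names + [full_text_field]).uniq.join(" ") }
end

# Sample field names are assumptions made for this sketch.
params = all_fields_solr_parameters(%w[title_tesim description_tesim])
puts params[:qf]
```

With the sample fields above, the qf string ends in all_text_timv, so a query against all_fields also matches the extracted text.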
Sufia's GenericFile model mixes in a module that knows how to talk to Solr's ExtractingRequestHandler. (The #extract_content method is where that happens.)
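To make the conversation with the ExtractingRequestHandler more concrete, here is a minimal sketch of building such a request with Ruby's standard library. It is not Sufia's actual #extract_content code: the helper name build_extract_request, the Solr base URL, and the update/extract handler path are assumptions, and the request is only constructed, not sent.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) a POST to Solr's
# ExtractingRequestHandler. The handler path "update/extract" is the
# conventional one and is an assumption about your solrconfig.
def build_extract_request(solr_url, content, mime_type)
  base = solr_url.end_with?("/") ? solr_url : "#{solr_url}/"
  uri = URI.join(base, "update/extract")
  uri.query = URI.encode_www_form(
    "extractOnly"   => "true", # return the extracted text instead of indexing it
    "extractFormat" => "text", # plain text rather than XHTML
    "wt"            => "json"  # ask for a JSON response
  )

  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = mime_type
  request.body = content
  [uri, request]
end

# The URL and file content below are placeholders.
uri, request = build_extract_request(
  "http://localhost:8983/solr/development", "%PDF-1.4 ...", "application/pdf"
)
puts uri
```

Sending it would look something like Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }, after which the extracted text can be read out of the response body.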
Sufia has an indexing service that takes the output of Apache Tika and indexes it in Solr. (This is the equivalent of overriding #to_solr on an ActiveFedora model.)
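The #to_solr equivalence can be sketched as follows. FileRecord is a hypothetical plain-Ruby stand-in for an ActiveFedora model, and title_tesim is only an illustrative field name; the point is that the extracted text lands in all_text_timv when the Solr document is built.

```ruby
# Hypothetical stand-in for an ActiveFedora-style model; not Sufia's class.
class FileRecord
  attr_reader :title, :full_text

  def initialize(title:, full_text: nil)
    @title = title
    @full_text = full_text
  end

  # Builds the Solr document hash, adding the extracted text only when
  # there is some to index.
  def to_solr(solr_doc = {})
    solr_doc["title_tesim"] = [title]
    solr_doc["all_text_timv"] = full_text unless full_text.nil? || full_text.empty?
    solr_doc
  end
end

record = FileRecord.new(title: "Annual report", full_text: "Text extracted by Tika")
puts record.to_solr.keys.join(", ")
```

Skipping the field when no text was extracted keeps empty values out of the index; whether Sufia's indexing service does the same is not something this sketch claims.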
When a file is uploaded, Sufia spawns a background job that characterizes the file. The #characterize method calls #append_metadata. That method in turn calls the #extract_content method, which hits Apache Tika via the Solr API.
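The call chain described above can be sketched with stand-in classes. SketchFile and CharacterizeSketchJob are hypothetical names, and the extractor is a plain lambda standing in for the Solr request; only the order of the calls (#characterize, then #append_metadata, then #extract_content) comes from the description.

```ruby
# Hypothetical stand-in for the file model; it records the order of calls.
class SketchFile
  attr_reader :full_text, :calls

  # extractor is any callable that returns extracted text (here it stands
  # in for the request to Solr's ExtractingRequestHandler).
  def initialize(extractor)
    @extractor = extractor
    @calls = []
  end

  def characterize
    @calls << :characterize
    append_metadata
  end

  def append_metadata
    @calls << :append_metadata
    extract_content
  end

  def extract_content
    @calls << :extract_content
    @full_text = @extractor.call
  end
end

# Hypothetical stand-in for the background job spawned on upload.
class CharacterizeSketchJob
  def initialize(file)
    @file = file
  end

  def run
    @file.characterize
  end
end

file = SketchFile.new(-> { "Text extracted by Tika" })
CharacterizeSketchJob.new(file).run
puts file.calls.inspect
```

Passing the extractor in as a callable keeps the sketch runnable without a Solr instance; in a real application that step would be the HTTP call to Solr.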