forked from industria/solrprocessors
Jyllands-Posten/solrprocessors
Solr processors

Implements update request processors for use in the Solr update request
processor chain.

============
Installation
============

1) Build the project jar file by running "sbt package".

2) Make the jar file available in the classpath.
   If you are running Solr with multiple cores, define a shared lib in
   solr.xml and place the jar file in that folder. Solr will add the files
   in that folder to the classpath on startup.

3) Make sure you have scala-library-2.11.7
   (http://central.maven.org/maven2/org/scala-lang/scala-library/2.11.7/scala-library-2.11.7.jar)
   in the classpath.
   You can place it in the lib folder as described above if it is not
   already available in the classpath of the container.

4) Define an update request processor chain with the processors needed.
   Below is an example with four processor factories:
   AllowDisallowIndexingProcessorFactory, HTMLStripCharFilterProcessorFactory,
   LogUpdateProcessorFactory and RunUpdateProcessorFactory.

    <updateRequestProcessorChain name="customChain">
      <processor class="dk.industria.solr.processors.AllowDisallowIndexingProcessorFactory">
        <lst name="allow">
          <str name="content_type">article</str>
        </lst>
      </processor>
      <processor class="dk.industria.solr.processors.HTMLStripCharFilterProcessorFactory">
        <str name="field">header</str>
        <str name="field">content</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

5) Register the processor chain with your update request handler as shown
   below.

    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.processor">customChain</str>
      </lst>
    </requestHandler>

==========
Processors
==========

The project contains the following processors:

- AllowDisallowIndexingProcessor
- HTMLStripCharFilterProcessor
- PatternReplaceProcessor

==============================
AllowDisallowIndexingProcessor
==============================

The
AllowDisallowIndexingProcessor makes it possible to configure rules, based
on field content, that decide whether a given document should be indexed.

The use case for this processor: a system, for instance a content
management system, pushes documents to Solr for indexing and you don't want
all the document types to be indexed. An example of this use case is using
the Escenic indexer-webapp to create a site search index.

The processor is configured by supplying a <lst> element with a name
attribute indicating the semantics of the processor, which can be either
allow or disallow.

The semantics work as follows:

allow    : Index documents matching at least one rule in the list,
           dropping everything else.
disallow : Index documents that do not match any rule in the list.

Rules are defined by using the <str> element, giving the field to check in
the name attribute and the match rule (a regular expression) as the value
of the element.

Example allow rule indexing documents with a field content_type set to
article:

    <updateRequestProcessorChain name="customChain">
      <processor class="dk.industria.solr.processors.AllowDisallowIndexingProcessorFactory">
        <lst name="allow">
          <str name="content_type">article</str>
        </lst>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.processor">customChain</str>
      </lst>
    </requestHandler>

If more than one rule is defined, the rules are tested one by one until
either one of them matches (declaring a match) or none of them match
(declaring no match). That is, multiple rules are combined with a logical
OR.

============================
HTMLStripCharFilterProcessor
============================

The HTMLStripCharFilterProcessor makes it possible to run the Solr
character filter HTMLStripCharFilter on a field before it's delivered to
Solr for indexing.
This can be especially convenient when the application doing the indexing
is not under your control. This could be a content management system
sending fields with markup that you want to use for highlighting and
therefore need the markup removed before the field is stored in the index.

In addition to running HTMLStripCharFilter the processor will:

- Remove no-break spaces (Unicode code point: 00A0) from the result of
  HTMLStripCharFilter
- Remove leading and trailing spaces from the result
- Remove multiple consecutive spaces from the result

The above can be turned off by placing a <bool> element with a name
attribute set to normalize and a value of false.

An example configuration of the HTMLStripCharFilterProcessor:

    <updateRequestProcessorChain name="customChain">
      <processor class="dk.industria.solr.processors.HTMLStripCharFilterProcessorFactory">
        <str name="field">header</str>
        <str name="field">content</str>
        <bool name="normalize">true</bool>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.processor">customChain</str>
      </lst>
    </requestHandler>

=======================
PatternReplaceProcessor
=======================

The PatternReplaceProcessor makes it possible to replace patterns defined
by regular expressions with a replacement string.

The replacements are done with the matcher object's replaceAll method,
meaning all matches in a field value will be replaced.

If the field is multivalued, the processor will replace in all of its
values, provided the values are strings.

It is possible to attach multiple rules to a field by repeating the field
in the fields list. Each attached rule is run in order of appearance. In
the example configuration below, the field card2 has both punctuation and
prefix attached.
An example configuration of the PatternReplaceProcessor is shown below.
The configuration contains two rules, punctuation and prefix. The rule
punctuation is defined for the fields title, name and comment. The prefix
rule is defined for the card field.

    <updateRequestProcessorChain name="customChain">
      <processor class="dk.industria.solr.processors.PatternReplaceProcessorFactory">
        <lst name="rule">
          <str name="id">punctuation</str>
          <str name="pattern">\p{P}</str>
          <str name="replace"/>
        </lst>
        <lst name="rule">
          <str name="id">prefix</str>
          <str name="pattern">^\d{4}</str>
          <str name="replace">****</str>
        </lst>
        <lst name="fields">
          <str name="title">punctuation</str>
          <str name="name">punctuation</str>
          <str name="comment">punctuation</str>
          <str name="card">prefix</str>
          <str name="card2">punctuation</str>
          <str name="card2">prefix</str>
        </lst>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.processor">customChain</str>
      </lst>
    </requestHandler>
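
To get a feel for what the two rules above do to a value, the following is a
minimal standalone sketch in plain Java. It is not the project's code (the
project is written in Scala); the class PatternReplaceSketch, the Rule holder
and the method applyRules are hypothetical names made up for illustration. It
assumes that the empty <str name="replace"/> element means an empty
replacement string, and it only mirrors the documented behaviour: rules run
in order of appearance, each through the matcher's replaceAll.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; the project's actual implementation is written in Scala.
final class PatternReplaceSketch {

    // Hypothetical holder for one <lst name="rule"> entry.
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    // The two rules from the example configuration.
    // Assumption: the empty <str name="replace"/> element means an empty replacement string.
    static final Rule PUNCTUATION = new Rule("\\p{P}", "");
    static final Rule PREFIX = new Rule("^\\d{4}", "****");

    // Applies each rule in order; replaceAll rewrites every match in the value.
    static String applyRules(String value, List<Rule> rules) {
        String result = value;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        // card2 has punctuation attached first, then prefix.
        List<Rule> card2Rules = Arrays.asList(PUNCTUATION, PREFIX);
        System.out.println(applyRules("1234-5678-9012", card2Rules)); // prints ****56789012
    }
}
```

Rule order matters in this sketch: with the rules reversed, the prefix rule
would first produce "****-5678-9012" and the punctuation rule would then
strip the asterisks as well as the hyphens, leaving "56789012".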