GitHub - Jyllands-Posten/solrprocessors: Solr processors

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 161 Commits
project		project
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README		README
build.sbt		build.sbt
Repository files navigation

Solr processors

Implements update request processors for use in the Solr update request processor chain.

============
Installation
============
1) Make the project jar-file running "sbt package".

2) Make the jar-file available in the classpath.
   If you are running Solr with multiple cores, define a shared lib in the solr.xml
   and place the jar-file in that folder. Solr will add the files in that folder to
   the classpath on startup.

3) Make sure you have scala-library-2.11.7 (http://central.maven.org/maven2/org/scala-lang/scala-library/2.11.7/scala-library-2.11.7.jar) in the classpath
   You can place it in the lib folder as described above if you don't have it
   available in the classpath of the container already.

4) Define an update request processor chain with the processors needed.
   Below is an example with four processor factories AllowDisallowIndexingProcessorFactory,
   HTMLStripCharFilterProcessorFactory, LogUpdateProcessorFactory and RunUpdateProcessorFactory.

   <updateRequestProcessorChain name="customChain">
     <processor class="dk.industria.solr.processors.AllowDisallowIndexingProcessorFactory">
       <lst name="allow">
         <str name="content_type">article</str>
       </lst>
     </processor>
     <processor class="dk.industria.solr.processors.HTMLStripCharFilterProcessorFactory">
       <str name="field">header</str>
       <str name="field">content</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>

5) Register the processor chain with your update request handler as shown below

   <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
     <lst name="defaults">
       <str name="update.processor">customChain</str>
     </lst>
   </requestHandler>


==========
Processors
==========

The project contains the following processors:

- AllowDisallowIndexingProcessor
- HTMLStripCharFilterProcessor
- PatternReplaceProcessor

==============================
AllowDisallowIndexingProcessor
==============================
The AllowDisallowIndexingProcessor makes it possible to configure rules based on
field content for deciding whether or not a given document should be indexed or not.

The use case for this processor:
A system, for instance a content management system, pushes documents to be indexed to Solr
and you don't want all the document types to be indexed. An example of this use case is using
the Escenic indexer-webapp to create a site search index.

The processor is configured by supplying the <lst> element with a name attribute
indicating the semantics of the processor, which can be either allow or disallow.

The semantics work as follows:

allow    : Index documents matching at least one rule in the list, dropping everything else.
disallow : Index documents that doesn't match any rules in the list.

Rules are defined by using the <str> element giving the field to check in the name attribute
and the match rule (regular expression) as the value of the element.

Example allow rule indexing documents with a field content_type set to article:

<updateRequestProcessorChain name="customChain">
  <processor class="dk.industria.solr.processors.AllowDisallowIndexingProcessorFactory">
    <lst name="allow">
      <str name="content_type">article</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">customChain</str>
  </lst>
</requestHandler>

If more than one rule is defined they will be tested one by one until either one of
them match (declaring a match) or none of the matched (declaring no match) that is
more rules work as logical or.


============================
HTMLStripCharFilterProcessor
============================
The HTMLStripCharFilterProcessor makes it possible to run the Solr character filter
HTMLStripCharFilter on a field before it's delivered to Solr for indexing. This can
be especially convenient when the application doing the indexing is not under you control.

This could be a content management system sending fields with markup that you want to use
for highlighting and therefore need the markup removed before the field is stored in the index.

In addition to running HTMLStripCharFilter the processor will:

- Remove no-break spaces (unicode point: 00A0) from the result of HTMLStripCharFilter
- Remove leading and trailing spaces from the result
- Remove multiple continuous spaces from the result

The above can be turned off by placing a bool element with a name attribute set
to normalize and a value of false.

An Example configuration of the HTMLStripCharFilterProcessor:

<updateRequestProcessorChain name="customChain">
  <processor class="dk.industria.solr.processors.HTMLStripCharFilterProcessorFactory">
    <str name="field">header</str>
    <str name="field">content</str>
    <bool name="normalize">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">customChain</str>
  </lst>
</requestHandler>

=======================
PatternReplaceProcessor
=======================
The PatternReplaceProcessor makes it possible to replace patterns defined by
regular expressions with a replacement string. The replacements are done withe
the matcher objects replaceAll method meaning all matches in a field value
will be replaced.

The processor will replace all values in a field if it is a multivalued field
and if the values are strings.

It is possible to attach multiple rules to a field by repeating the field in the
fields list. Each rule attached will be run in order of appearance. In the example
configuration later, the field card2 has both punctuation and prefix attached.

An example configuration of the PatternReplaceProcessor is shown below. The
configuration contains two rules, punctuation and prefix. The rule punctuation
is defined for the fields title, name and comment. The prefix rule is
defined for the card field.

<updateRequestProcessorChain name="customChain">
  <processor class="dk.industria.solr.processors.PatternReplaceProcessorFactory">
    <lst name="rule">
      <str name="id">punctuation</str>
      <str name="pattern">\p{P}</str>
      <str name="replace"/>
    </lst>
    <lst name="rule">
      <str name="id">prefix</str>
      <str name="pattern">^\d{4}</str>
      <str name="replace">****</str>
    </lst>
    <lst name="fields">
      <str name="title">punctuation</str>
      <str name="name">punctuation</str>
      <str name="comment">punctuation</str>
      <str name="card">prefix</str>
      <str name="card2">punctuation</str>
      <str name="card2">prefix</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">customChain</str>
  </lst>
</requestHandler>