Solr processors
Implements update request processors for use in the Solr update request processor chain.
============
Installation
============
1) Build the project jar file by running "sbt package".
2) Make the jar-file available in the classpath.
If you are running Solr with multiple cores, define a shared lib in the solr.xml
and place the jar-file in that folder. Solr will add the files in that folder to
the classpath on startup.
3) Make sure scala-library-2.11.7 (http://central.maven.org/maven2/org/scala-lang/scala-library/2.11.7/scala-library-2.11.7.jar) is in the classpath.
You can place it in the lib folder described above if the container's classpath
does not already provide it.
4) Define an update request processor chain with the processors needed.
Below is an example with four processor factories: AllowDisallowIndexingProcessorFactory,
HTMLStripCharFilterProcessorFactory, LogUpdateProcessorFactory and RunUpdateProcessorFactory.
<updateRequestProcessorChain name="customChain">
  <processor class="dk.industria.solr.processors.AllowDisallowIndexingProcessorFactory">
    <lst name="allow">
      <str name="content_type">article</str>
    </lst>
  </processor>
  <processor class="dk.industria.solr.processors.HTMLStripCharFilterProcessorFactory">
    <str name="field">header</str>
    <str name="field">content</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
5) Register the processor chain with your update request handler as shown below:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">customChain</str>
  </lst>
</requestHandler>
==========
Processors
==========
The project contains the following processors:
- AllowDisallowIndexingProcessor
- HTMLStripCharFilterProcessor
- PatternReplaceProcessor
==============================
AllowDisallowIndexingProcessor
==============================
The AllowDisallowIndexingProcessor makes it possible to configure rules based on
field content that decide whether a given document should be indexed.
The use case for this processor:
A system, for instance a content management system, pushes documents to be indexed to Solr
and you don't want all the document types to be indexed. An example of this use case is using
the Escenic indexer-webapp to create a site search index.
The processor is configured by supplying a <lst> element whose name attribute
selects the semantics of the processor, which can be either allow or disallow.
The semantics work as follows:
allow : Index documents matching at least one rule in the list, dropping everything else.
disallow : Index documents that don't match any rule in the list.
Rules are defined by using the <str> element giving the field to check in the name attribute
and the match rule (regular expression) as the value of the element.
Example allow rule indexing documents with a field content_type set to article:
<updateRequestProcessorChain name="customChain">
  <processor class="dk.industria.solr.processors.AllowDisallowIndexingProcessorFactory">
    <lst name="allow">
      <str name="content_type">article</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">customChain</str>
  </lst>
</requestHandler>
If more than one rule is defined, the rules are tested one by one until either one of
them matches (declaring a match) or none of them has matched (declaring no match).
In other words, multiple rules are combined as a logical or.
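The allow/disallow decision described above can be sketched in plain Java. This is a
minimal illustration, not the project's implementation: the class AllowDisallowSketch,
the Rule and Mode types and the method names are hypothetical, documents are modelled
as a simple field-to-value map, and the use of Matcher.find() (a rule may match part
of a field value) is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of the allow/disallow semantics; not the project's code.
public class AllowDisallowSketch {
    enum Mode { ALLOW, DISALLOW }

    // One rule: the field to check and the regular expression to test it with.
    static final class Rule {
        final String field;
        final Pattern pattern;

        Rule(String field, String regex) {
            this.field = field;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Rules are tested in order; the first match wins (logical or).
    static boolean anyRuleMatches(List<Rule> rules, Map<String, String> doc) {
        for (Rule rule : rules) {
            String value = doc.get(rule.field);
            // Assumption: find() is used, so the pattern may match a substring.
            if (value != null && rule.pattern.matcher(value).find()) {
                return true;
            }
        }
        return false;
    }

    // allow: index only on a match. disallow: index only when nothing matches.
    static boolean shouldIndex(Mode mode, List<Rule> rules, Map<String, String> doc) {
        boolean matched = anyRuleMatches(rules, doc);
        return mode == Mode.ALLOW ? matched : !matched;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(new Rule("content_type", "article"));
        Map<String, String> article = Collections.singletonMap("content_type", "article");
        Map<String, String> image = Collections.singletonMap("content_type", "image");

        System.out.println(shouldIndex(Mode.ALLOW, rules, article));    // true
        System.out.println(shouldIndex(Mode.ALLOW, rules, image));      // false
        System.out.println(shouldIndex(Mode.DISALLOW, rules, article)); // false
        System.out.println(shouldIndex(Mode.DISALLOW, rules, image));   // true
    }
}
```

In this sketch a document that lacks the field named by a rule never matches that rule,
so it is dropped under allow and indexed under disallow. Whether the processor treats
missing fields the same way is not stated here.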
============================
HTMLStripCharFilterProcessor
============================
The HTMLStripCharFilterProcessor makes it possible to run the Solr character filter
HTMLStripCharFilter on a field before it's delivered to Solr for indexing. This can
be especially convenient when the application doing the indexing is not under your control.
An example is a content management system sending fields with markup that you want to use
for highlighting, so the markup must be removed before the field is stored in the index.
In addition to running HTMLStripCharFilter the processor will:
- Remove no-break spaces (Unicode code point U+00A0) from the result of HTMLStripCharFilter
- Remove leading and trailing spaces from the result
- Remove multiple consecutive spaces from the result
This normalization can be turned off by adding a bool element with the name attribute
set to normalize and a value of false.
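The normalization steps listed above can be sketched in plain Java as follows. This is
an illustration only: the class NormalizeSketch and its normalize method are hypothetical
names, the input is assumed to be text that HTMLStripCharFilter has already processed,
and the sketch assumes that a no-break space is replaced by an ordinary space and that
runs of spaces are collapsed to a single space. The processor itself may handle these
cases differently.

```java
// Hypothetical sketch of the normalization steps; not the project's code.
public class NormalizeSketch {

    static String normalize(String stripped) {
        return stripped
            // Assumption: each no-break space (U+00A0) becomes an ordinary space.
            .replace('\u00A0', ' ')
            // Drop leading and trailing whitespace.
            .trim()
            // Assumption: runs of two or more spaces collapse to one space.
            .replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        String input = "\u00A0 Breaking\u00A0\u00A0news   today  ";
        System.out.println("[" + normalize(input) + "]"); // [Breaking news today]
    }
}
```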
An example configuration of the HTMLStripCharFilterProcessor:
<updateRequestProcessorChain name="customChain">
  <processor class="dk.industria.solr.processors.HTMLStripCharFilterProcessorFactory">
    <str name="field">header</str>
    <str name="field">content</str>
    <bool name="normalize">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">customChain</str>
  </lst>
</requestHandler>
=======================
PatternReplaceProcessor
=======================
The PatternReplaceProcessor makes it possible to replace patterns defined by
regular expressions with a replacement string. The replacements are done with
the matcher object's replaceAll method, meaning all matches in a field value
will be replaced.
If the field is multivalued, the processor replaces matches in every value,
provided the values are strings.
It is possible to attach multiple rules to a field by repeating the field in the
fields list. The attached rules are run in order of appearance. In the example
configuration below, the field card2 has both punctuation and prefix attached.
An example configuration of the PatternReplaceProcessor is shown below. The
configuration contains two rules, punctuation and prefix. The punctuation rule
is defined for the fields title, name and comment. The prefix rule is
defined for the card field. Both rules are defined for the card2 field.
<updateRequestProcessorChain name="customChain">
  <processor class="dk.industria.solr.processors.PatternReplaceProcessorFactory">
    <lst name="rule">
      <str name="id">punctuation</str>
      <str name="pattern">\p{P}</str>
      <str name="replace"/>
    </lst>
    <lst name="rule">
      <str name="id">prefix</str>
      <str name="pattern">^\d{4}</str>
      <str name="replace">****</str>
    </lst>
    <lst name="fields">
      <str name="title">punctuation</str>
      <str name="name">punctuation</str>
      <str name="comment">punctuation</str>
      <str name="card">prefix</str>
      <str name="card2">punctuation</str>
      <str name="card2">prefix</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">customChain</str>
  </lst>
</requestHandler>
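To illustrate how the two rules in the configuration above behave when chained on a
field such as card2, here is a minimal plain-Java sketch built on java.util.regex.
The class PatternReplaceSketch, the Rule type and the applyRules method are hypothetical
names, and treating the empty replace element as an empty replacement string is an
assumption. Only the patterns and the replacement **** are taken from the configuration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of chained pattern replacement; not the project's code.
public class PatternReplaceSketch {

    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    // Patterns copied from the example configuration.
    // Assumption: an empty replace element means "replace with nothing".
    static final Rule PUNCTUATION = new Rule("\\p{P}", "");
    static final Rule PREFIX = new Rule("^\\d{4}", "****");

    // Apply each rule in order; replaceAll rewrites every match in the value.
    static String applyRules(String value, List<Rule> rules) {
        String result = value;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        // title: punctuation only
        System.out.println(applyRules("Hello, world!", Arrays.asList(PUNCTUATION)));
        // prints: Hello world

        // card2: punctuation first, then prefix
        System.out.println(applyRules("1234-5678-9012", Arrays.asList(PUNCTUATION, PREFIX)));
        // prints: ****56789012
    }
}
```

The order matters: with punctuation applied first the hyphens are already gone when the
prefix rule masks the first four digits. For a multivalued field the same call would be
made once per string value.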