E. Lynette Rayle edited this page Mar 2, 2022 · 3 revisions

References

Terminology

Solr Instance

  • multiple instances can run ('multiple solr instances are running')
  • deploy webapp on multiple servers, each of which is an instance

Solr Core

  • each solr instance can have multiple cores
  • also referred to as Solr Index, or simply Core or Index
  • implemented as a database (a Lucene index plus its own configuration)
  • generally, each core runs in isolation, but some communication between cores can be configured via CoreContainer

Document

  • 0..m documents live in a core
  • basic unit of information

Field

  • 0..m fields live in a document
  • various types: text, numeric, date, etc.
  • type tells solr how to interpret the field and how it can be queried
  • type: String stores a word/sentence as an exact string without performing tokenization etc. Commonly useful for storing exact matches, e.g., for faceting.
  • type: Text typically performs tokenization, and secondary processing (such as lower-casing etc.). Useful for all scenarios when we want to match part of a sentence.
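
The difference between the two types can be illustrated with a rough simulation in plain Ruby. This is not Solr's analysis chain: the analyze_text helper below is a hypothetical stand-in that only splits on non-word characters and lower-cases, whereas a real Text field type runs whatever tokenizer and filters its schema defines.

```ruby
# Rough stand-in for a Text field's analysis: tokenize, then lower-case.
# (Hypothetical helper; real analysis is configured per field type in the schema.)
def analyze_text(value)
  value.split(/\W+/).reject(&:empty?).map(&:downcase)
end

# A String field keeps the whole value as one exact term.
def analyze_string(value)
  [value]
end

title = 'The Art of Archery'

string_terms = analyze_string(title)
text_terms   = analyze_text(title)

# In this simulation a query term matches only if it equals an indexed term.
puts string_terms.include?('archery')            # false: only the full exact string matches
puts string_terms.include?('The Art of Archery') # true
puts text_terms.include?('archery')              # true: matches one token of the sentence
```

This is why a String field suits faceting and exact matching, while a Text field suits matching part of a sentence.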

Facet


Indexing Documents

index via...

  • Request Handlers & Update Handlers (via HTTP POST/PUT)
    • default: XML, Binary, JSON, CSV, etc.
    • can define own handlers in config
  • Index Handlers
    • import from databases
  • Solr Cell framework (extracts text and metadata from rich documents such as PDF or Word files via Apache Tika)
  • custom Java application to ingest data through Solr's Java Client and other apps
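
As a minimal sketch of the HTTP POST route, the Ruby below builds a JSON update request without sending it. The host, the core name development, and the document fields (id, name, category) are assumptions for illustration; build_index_request is a hypothetical helper, not part of any Solr client.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumed host and core name; adjust for your installation.
SOLR_UPDATE_URL = 'http://localhost:8983/solr/development/update'

# Build (but do not send) a POST request that adds documents and commits.
def build_index_request(docs)
  uri = URI(SOLR_UPDATE_URL)
  uri.query = URI.encode_www_form(commit: true)
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(docs) # a JSON array of documents
  request
end

docs = [{ id: 'SOLR1000', name: 'Example book', category: 'book' }]
request = build_index_request(docs)
puts request.path
puts request.body

# To actually send it (requires a running Solr):
# uri = URI(SOLR_UPDATE_URL)
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

Building the request separately from sending it makes it easy to inspect the payload before anything touches the index.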

update processors

  • signature
  • logging
  • indexing

Request Handlers

<!--  solr.SearchHandler  -->
<requestHandler name="standard" class="solr.SearchHandler">               <!-- /select -->
<requestHandler name="search" class="solr.SearchHandler" default="true">
<requestHandler name="permissions" class="solr.SearchHandler" >
<requestHandler name="document" class="solr.SearchHandler" >

<!--  solr.UpdateRequestHandler  -->
<requestHandler name="/update" class="solr.UpdateRequestHandler"  />

<!--  other handlers  -->
<requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />
<requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">

To see what a requestHandler returns, change the value of qt from /select to the name of the handler in the solr admin Query page. NOTE: You will need to change the host to your solr admin host and may need to change the name of the core from development to the name of your core.


Querying

  • receive XML, JSON, CSV, or binary (via HTTP GET)
  • request handlers (via HTTP GET)
    • default: /admin, /select, /spell
    • can define own handlers in config
  • search components
    • query
    • spelling
    • faceting
    • highlighting
    • statistics
    • debug
    • clustering
  • search process (see Common Query Parameters)
| parameter | description | default | example |
|---|---|---|---|
| qt | selects the Request Handler for a query using /select | DisMaxRequestHandler | |
| defType | selects a Query Parser for the query | parser configured in the Request Handler | |
| q | query in the form field_name:field_value, with * as wildcard | *:* | q=title:Archery |
| fq | filters the query by applying an additional query to the initial query's results and caches the results (same syntax as q) | *:* | fq=popularity:[10 TO *]&fq=section:0 |
| sort | sort field | score desc | |
| start | an offset into the query results where the returned response should begin | 0 | start=0 |
| rows | the number of rows to be displayed at one time | 10 | rows=20 |
| fl | fields to return in result | all | fl=id,name |
| df | default field to search when the query does not name a field | all indexed fields | df=description |
| wt | selects a Response Writer for formatting the query response | xml or json | wt=json |
| qf | list of fields and the "boosts" to associate with each of them when building DisjunctionMaxQueries (see also SOLR df and qf explanation) | all indexed fields are required (???) | qf=title^20 description^10 |
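
The parameters can be combined into a single request URL. The sketch below assumes a local Solr with a core named development; solr_query_url is a hypothetical helper that only builds the URL and does not contact Solr.

```ruby
require 'uri'

# Assumed host and core name; adjust for your installation.
SOLR_SELECT_URL = 'http://localhost:8983/solr/development/select'

# Build a select URL from a hash of query parameters.
# Array values produce a repeated parameter, which is how multiple fq filters are sent.
def solr_query_url(params)
  uri = URI(SOLR_SELECT_URL)
  uri.query = URI.encode_www_form(params) # escapes spaces, brackets, colons, etc.
  uri.to_s
end

url = solr_query_url(
  q: 'title:Archery',
  fq: ['popularity:[10 TO *]', 'section:0'],
  sort: 'score desc',
  start: 0,
  rows: 20,
  fl: 'id,name',
  wt: 'json'
)
puts url
```

Letting URI.encode_www_form do the escaping avoids hand-encoding the spaces and brackets that range queries need.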

Features

High Level

  • Advanced Full-Text Search
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces - XML, JSON, HTTP
  • Comprehensive HTML Admin Interfaces
  • Service statistics exposed over JMX for monitoring
  • Near Real-time indexing and Adaptable with XML configuration
  • Linearly scalable, auto index replication, auto failover and recovery, extensible plugin architecture

Specific Features

  • faceting
  • highlighting
  • spell checking
  • query re-ranking
  • transforming
  • suggesters
  • more like this
  • pagination
  • grouping & clustering
  • spatial search
  • components
  • real time (get & update)
  • labs

Configuration

  • schema.xml
    • field types
    • etc.
  • solrconfig.xml
    • register Request Handlers for querying the index
    • register Update Handlers for indexing documents
    • register Event Handlers for searcher events (e.g. queries to execute to warm new searchers)
    • activate version-dependent features in Lucene
    • lib directives indicate where Solr can find JAR files for extensions
    • Index management settings
    • Enable JMX instrumentation of Solr MBeans
    • Cache-management settings
  • solr.xml
  • core.properties

Fields

Defined in schema.xml

Hyrax Data Types:

Reference: schema.xml

defined by <types><fieldType>...</></>

| postfix code | meaning |
|---|---|
| t | text (tokenized) |
| te | english text (tokenized) |
| s | string |
| i | integer |
| it | trie integer |
| f | float |
| ft | trie float |
| l | long |
| lt | trie long |
| db | double |
| dbt | trie double |
| b | boolean |
| dt | date |
| dtt | trie date |
| ll | location |
| coordinate | trie double used to index the lat and long of a location, with indexed=true/stored=false |

NOTE: letter indicates the postfix indicator that sets the type for Hyrax dynamic fields. Ex. name_tsi means that name has type="text"

Hyrax Field Def Parameters:

defined by <fields><dynamicField>...</></>

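| postfix code | parameter | impact | meaning if true |
|---|---|---|---|
| s | stored | sets stored to true | value is returned in the solr document |
| i | indexed | sets indexed to true | value is searchable |
| m | multiValued | sets multiValued to true | field can have multiple values |
| v | termVectors | sets termVectors to true | term vectors are stored for the field (used by features such as highlighting and More Like This) |
| v | termPositions | sets termPositions to true | term position information is stored in the term vectors |
| v | termOffsets | sets termOffsets to true | term offset information is stored in the term vectors |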

NOTE: letter indicates the postfix indicator that sets that parameter to true for Hyrax dynamic fields. Ex. name_tsi means that name has stored=true,indexed=true
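
To make the suffix convention concrete, here is a small sketch that splits a dynamic field name into its type code and parameter flags. It covers only a subset of the type codes listed above, and decode_dynamic_field is a hypothetical helper for illustration, not something Hyrax provides.

```ruby
# Subset of the type codes from the type table above (for illustration only).
TYPE_CODES = {
  't' => 'text', 'te' => 'english text', 's' => 'string',
  'i' => 'integer', 'it' => 'trie integer', 'f' => 'float',
  'b' => 'boolean', 'll' => 'location'
}.freeze

# Parameter flags from the parameter table above.
FLAG_CODES = {
  's' => :stored, 'i' => :indexed, 'm' => :multi_valued, 'v' => :term_vectors
}.freeze

# Split "name_tsi" into base name, type, and flags.
# The type code is the shortest prefix of the suffix whose remainder is all flag letters.
def decode_dynamic_field(field_name)
  match = field_name.match(/\A(?<base>.+)_(?<type>[a-z]+?)(?<flags>[simv]+)\z/)
  return nil unless match && TYPE_CODES.key?(match[:type])

  {
    base: match[:base],
    type: TYPE_CODES[match[:type]],
    flags: match[:flags].chars.map { |c| FLAG_CODES[c] }
  }
end

p decode_dynamic_field('name_tsi')    # text; stored, indexed
p decode_dynamic_field('title_tesim') # english text; stored, indexed, multi-valued
```

The helper returns nil for names that do not follow the convention or that use a type code outside the subset.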

Examples for values of stored and indexed:

stored="true" indexed="false"

  • destination URL
  • file system path
  • time stamp
  • icon image
  • sort string - have a name that is tokenized text with stored=false/indexed=true and this field that is the exact string for sorting

stored="false" indexed="true"

  • bag of words - want to be able to search for all terms in the bag, but don't want them in the solr document search results
  • common misspellings - allow common misspellings to match in search, but don't include in solr document search results

indexed="false" stored="false"

  • Use this when you want to ignore fields. For example, the following will ignore unknown fields that don't match a defined field rather than throwing an error by default.
<fieldType name="ignored" class="solr.StrField" stored="false" indexed="false" />
<dynamicField name="*" type="ignored" />

Solr Cloud Features

  • horizontal scaling (for sharding and replication)
  • elastic scaling
  • high availability
  • distributed indexing
  • distributed searching
  • central configuration for entire cluster
  • automatic load balancing
  • automatic failover for queries
  • zookeeper integration for coordination & configurations

CRUD

Create

Read

Return all results with search term = "book"

http://localhost:8983/solr/development/select?q=book

Update

Delete

NOTE: Examples use stream.body to show how to do this through a URL. Usually done via HTTP POST.

Delete by ID

http://localhost:8983/solr/development/update?stream.body=<delete><id>SOLR1000</id></delete>
http://localhost:8983/solr/development/update?stream.body=<commit/>

Delete by Query

http://localhost:8983/solr/development/update?stream.body=<delete><query>cat:software</query></delete>
http://localhost:8983/solr/development/update?stream.body=<commit/>
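
Since deletes are usually done via HTTP POST, a rough Ruby equivalent of the delete-by-query example might look like the following. The host and core name are assumptions, and build_delete_request is a hypothetical helper that builds the request without sending it.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a POST request carrying an XML delete command.
# base_url is an assumed host and core; adjust for your installation.
def build_delete_request(query, base_url: 'http://localhost:8983/solr/development')
  uri = URI("#{base_url}/update")
  uri.query = URI.encode_www_form(commit: true) # commit in the same request
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'text/xml'
  # Escape &, < and > so the query cannot break the XML envelope.
  request.body = "<delete><query>#{query.encode(xml: :text)}</query></delete>"
  request
end

request = build_delete_request('cat:software')
puts request.path
puts request.body

# To actually send it (requires a running Solr):
# uri = URI('http://localhost:8983/solr/development/update')
# Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

Passing commit=true on the same request replaces the separate commit call used in the URL examples.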

Steps to delete all via Solr Admin UI

  • In Solr UI, select the core to affect from the selection box on the left side menu
  • select Documents on left side menu
  • set Document Type = XML
  • set Document(s) text area to <delete><query>*:*</query></delete>
  • leave Commit Within and Overwrite at their defaults
  • Submit

Delete All in Hyrax

require 'active_fedora/cleaner'
ActiveFedora::Cleaner.clean!

Delete All in Valkyrie-Solr in Hyrax

conn = Valkyrie::IndexingAdapter.find(:solr_index).connection
conn.delete_by_query('*:*', params: { 'softCommit' => true })

More Query Examples

Search for a specific field, category, containing a search term, book

http://localhost:8983/solr/development/select?q=category:book

Search for price between 0 and 400, inclusive

http://localhost:8983/solr/development/select?q=price:[0 TO 400]

Limit search results to return only fields id, name, and price.

http://localhost:8983/solr/development/select?q=book&fl=id,name,price

Return facets for a specific field, category, with counts for each value of category based on the search results.

http://localhost:8983/solr/development/select?q=book&fl=id,name,price&facet=on&facet.field=category

Partial Response as relates to returned facet information.

<lst name="facet_counts">
  <lst name="facet_queries" />
  <lst name="facet_fields">
    <lst name="category">
      <int name="book">10</int>
      <int name="video">2</int>
      <int name="audio">2</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
</lst>

Return facets for a specific field, category, with specific value for category, book, with counts for each value of category based on the search results.

http://localhost:8983/solr/development/select?q=book&fl=id,name,price&facet=on&facet.field=category&fq=category:book

Partial Response as relates to returned facet information.

<lst name="facet_counts">
  <lst name="facet_queries" />
  <lst name="facet_fields">
    <lst name="category">
      <int name="book">10</int>
      <int name="video">0</int>
      <int name="audio">0</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
</lst>

NOTE: Can include multiple filter queries (fq).

NOTE: When filter query is applied, all categories are still listed, but now have 0 for count if they don't include the filtered value.
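
When the response is requested with wt=json, Solr by default returns each facet field as a flat list that alternates value and count. The sketch below turns that list into a hash. The sample JSON is hand-written to mirror the filtered XML excerpt above rather than captured from a live Solr, and facet_counts is a hypothetical helper.

```ruby
require 'json'

# Hand-written sample mirroring the XML excerpt above; not captured output.
SAMPLE_RESPONSE = <<~JSON
  {
    "facet_counts": {
      "facet_queries": {},
      "facet_fields": {
        "category": ["book", 10, "video", 0, "audio", 0]
      }
    }
  }
JSON

# Convert the flat [value, count, value, count, ...] list into a hash.
def facet_counts(response_json, field)
  flat = JSON.parse(response_json).dig('facet_counts', 'facet_fields', field) || []
  flat.each_slice(2).to_h
end

counts = facet_counts(SAMPLE_RESPONSE, 'category')
counts.each { |value, count| puts "#{value}: #{count}" }

# Drop zero-count values, e.g. for display after a filter query is applied.
puts counts.reject { |_value, count| count.zero? }.keys.inspect
```

Filtering out the zero counts on the client side is one way to hide categories that the filter query excluded.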