Language Filter with Advanced Search #116

jduss4 · 2015-05-08T20:13:11Z

While implementing chronam for the University of Nebraska-Lincoln's newspapers project, which involves several Czech language papers, we discovered that selecting a language on the advanced search page does not filter by language unless you also enter a keyword.

For example, the following query on the main Chronicling America site has no parameters except for "Spanish" as the language, yet it returns nearly 9,500,000 results (presumably the entire set of OCR). The same happens if you select French, German, or English without any andtext / ortext / phrasetext, etc.

http://chroniclingamerica.loc.gov/search/pages/results/?dateFilterType=yearRange&date1=1836&date2=1922&language=spa&ortext=&andtext=&phrasetext=&proxtext=&proxdistance=5&rows=20&searchType=advanced

This seems unintuitive to us, as it is very possible that users will want to browse Czech pages rather than search for a specific phrase. A quick fix is to include language:Czech in the query, either in q or as fq=language:Czech. Hopefully this will not interfere with the keyword queries, which use the selected language to decide whether to search ocr_eng, ocr_cze, etc.

I've added some quick code which uses the requested code ("cze") to find a language name that the Solr results will recognize ("Czech") and appends a language filter query.
https://gist.github.com/jduss4/d4f71929fbcf946d1c64
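
To illustrate the idea without reproducing the gist, here is a minimal sketch of the mapping step. The lookup table and the function names `language_filter` and `build_params` are hypothetical and are not taken from chronam; the sketch only assumes, as described above, that the Solr index has a `language` field holding names such as "Czech".

```python
from typing import Dict, List, Optional

# Hypothetical lookup table for illustration only. A real implementation
# would likely read the code-to-name mapping from the application's data.
LANGUAGE_NAMES: Dict[str, str] = {
    "cze": "Czech",
    "eng": "English",
    "ita": "Italian",
    "spa": "Spanish",
}


def language_filter(code: Optional[str]) -> List[str]:
    """Return Solr fq clauses for the requested language code."""
    name = LANGUAGE_NAMES.get(code or "")
    if name is None:
        return []  # unknown or empty code: apply no language filter
    # Quote the value so a multi-word name stays a single term.
    return ['language:"%s"' % name]


def build_params(q: str, code: Optional[str]) -> Dict[str, object]:
    """Combine the keyword query (possibly empty) with the language filter."""
    params: Dict[str, object] = {"q": q or "*:*"}  # match all when no keyword
    fq = language_filter(code)
    if fq:
        params["fq"] = fq
    return params
```

Keeping the language in `fq` rather than `q` leaves the keyword query untouched, so an empty keyword search still narrows to the chosen language.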

This is the location in the current chronam repo where we have made the changes to our project:
https://github.com/CDRH/nebnews/blob/4ea06e3c3b4ed2e23e3119454493572d6c30604d/core/index.py#L453

The language filtering only for keyword searches may be expected behavior in chronam, but I wanted to make an issue of it as it surprised us and seems unintuitive. If it is expected behavior, then perhaps a change in the UI may make it more clear how the language dropdown is going to affect search results.

johnscancella · 2018-04-09T18:05:35Z

@jduss4 Thanks for submitting this! I agree this is confusing. It seems like we should be able to filter the results from Solr based on whether the page has any OCR text associated with the selected language. Unfortunately, there is an existing bug in the upstream library that prevents this from being merged in at this time. It is being tracked here: search5/solrpy#45

johnscancella · 2018-04-10T14:11:18Z

Upon further research, the correct way to do this would be to add new boolean fields to Solr, one for each language, and set each one when the page contains that language.

jduss4 · 2018-04-10T18:16:37Z

@johnscancella Have you considered a multivalued field that could hold all of the languages used on a page, such as [eng, fra, cze, ...]? If I understand the problem, it's that solrpy can't filter on a field name, but I would hope that it can filter on the contents of a field using the built-in Solr faceting.
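
A rough sketch of what that could look like, assuming page documents carry per-language OCR fields named like `ocr_eng` and `ocr_cze`. The field name `language_codes` and both helper functions are hypothetical, and the new field would still need to be declared as multivalued in the Solr schema.

```python
from typing import Any, Dict, List


def codes_from_ocr_fields(doc: Dict[str, Any]) -> List[str]:
    """Collect language codes from non-empty ocr_<code> fields."""
    prefix = "ocr_"
    return sorted(
        key[len(prefix):]
        for key, value in doc.items()
        if key.startswith(prefix) and value
    )


def with_language_codes(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the document with the hypothetical multivalued field."""
    updated = dict(doc)
    updated["language_codes"] = codes_from_ocr_fields(doc)
    return updated


def language_codes_fq(code: str) -> str:
    """Filter on the field's contents instead of on a field name."""
    return "language_codes:%s" % code
```

With a field like this, the filter is a single term match on one field, so no wildcard over field names is needed.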

johnscancella · 2018-04-10T18:24:18Z

@jduss4 I haven't, but only because I am not very familiar with Solr. After much tracing, the problem is that you can't use a * because Solr expands it, and then you run into the tooManyBooleanClauses error. From what I have read, the correct way around that is to modify the schema. The multivalued field may work here, but it still involves modifying the schema and re-creating the index, which is more work than the product owner is willing to do right now. If you think of a way to accomplish this without having to re-create the index, please submit a pull request and I will be happy to review it!

jduss4 · 2018-04-10T18:27:08Z

@johnscancella That makes sense. My solution would also require a reindex, or at least an update script of some sort!

lorellav · 2018-04-11T12:48:23Z

Hi All,
I just wanted to share with you how I solved the language filtering issue without the need for any script. After getting the list of titles and countries of the Italian language newspapers in CA, I selected them in the advanced search. Remember to select both the countries AND the titles. Done! It's that easy :-)
For example, for Italian, it returned the following https://chroniclingamerica.loc.gov/search/pages/results/?date1=1880&date2=1920&searchType=advanced&language=ita&sequence=1&lccn=2012271201&lccn=sn85066408&lccn=sn85055164&lccn=sn85054967&lccn=sn88064299&lccn=sn84037024&lccn=sn84037025&lccn=sn86092310&proxdistance=5&state=California&state=District+of+Columbia&state=Massachusetts&state=Pennsylvania&state=Piedmont&state=Vermont&state=West+Virginia&rows=20&ortext=&proxtext=&phrasetext=&andtext=&dateFilterType=yearRange&page=1&sort=relevance
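
For anyone who wants to assemble such a URL programmatically rather than through the form, here is a minimal sketch using only the Python standard library. The parameter names are copied from the URL above; the function name `build_search_url` is hypothetical, and the sketch does not contact the site or check that the search accepts any particular combination.

```python
from typing import Iterable
from urllib.parse import urlencode

BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(
    language: str,
    lccns: Iterable[str],
    states: Iterable[str],
    date1: int,
    date2: int,
) -> str:
    """Build an advanced search URL with repeated lccn and state parameters."""
    params = {
        "dateFilterType": "yearRange",
        "date1": date1,
        "date2": date2,
        "language": language,
        "lccn": list(lccns),    # one lccn=... pair per title
        "state": list(states),  # one state=... pair per place
        "searchType": "advanced",
        "rows": 20,
    }
    # doseq=True expands each list into repeated key=value pairs.
    return BASE_URL + "?" + urlencode(params, doseq=True)


url = build_search_url(
    "ita",
    ["2012271201", "sn85066408"],
    ["California", "District of Columbia"],
    1880,
    1920,
)
print(url)
```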

Now I have my little corpus of Italian language newspapers.

I hope this helps!

Lorella.

karindalziel mentioned this issue Jul 23, 2015

Language Filter with Advanced Search open-oni/open-oni#9

Closed

johnscancella added a commit that referenced this issue Apr 9, 2018

refs #116 - filter by language even if there is no text

252fd1c

johnscancella added a commit that referenced this issue Apr 9, 2018

refs #116 - filter by language based on ocr text in solr.

07d105f

johnscancella added the bug label Apr 9, 2018
