
IR and Search

The web search engine was the first big data system built to collect and organize the data on the World Wide Web. Information retrieval is the broader field behind the search engine.

https://sarielhp.org/blog/

If the recommender system solves the information overload problem for individuals, modern information retrieval and search technology solves it for everyone at web scale. Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information. Information comes in diverse forms, such as character strings (texts), images, voices and videos, so information retrieval has diverse subfields such as multimedia information retrieval and music information retrieval. The search engine is considered a practical application of information retrieval.

Critical to all search engines is the problem of designing an effective retrieval model that can rank documents accurately for a given query. A main goal of any IR system is to rank documents optimally given a query so that highly relevant documents are ranked above less relevant ones and non-relevant ones. Relevance, ranking and context are three foundation stones of web search. In this section, we focus on relevance more than ranking.

If interested in the history of information retrieval, Mark Sanderson and W. Bruce Croft wrote the paper The History of Information Retrieval Research.

The basic functions of a search engine can be described as crawling, data mining, indexing and query processing. Crawling is the act of sending small programmed bots out to collect information. Data mining is storing the information collected by the bots. Indexing is ordering the information systematically. And query processing is the mathematical process in which a person's query is compared to the index and the results are presented to that person.

One of the most fundamental and important challenges is to develop a truly optimal retrieval model that is both effective and efficient and that can learn from the feedback information over time, which is discussed in Rating and Ranking.

Information Acquisition: Web Crawling

The first step of information retrieval is to acquire the information itself. Web-scale information brings the information overload problem, which the search engine, or web search, attempts to solve.

Web Crawling By Christopher Olston and Marc Najork

A web crawler (also known as a robot or a spider) is a system for the bulk downloading of web pages.
Web crawlers are used for a variety of purposes. Most prominently, they are one of the main components of web search engines, systems that assemble a corpus of web pages, index them, and allow users to issue queries against the index and find the web pages that match the queries.
A related use is web archiving (a service provided by e.g., the Internet archive), where large sets of web pages are periodically collected and archived for posterity. A third use is web data mining, where web pages are analyzed for statistical properties, or where data analytics is performed on them (an example would be Attributor, a company that monitors the web for copyright and trademark infringements). Finally, web monitoring services allow their clients to submit standing queries, or triggers, and they continuously crawl the web and notify clients of pages that match those queries (an example would be GigaAlert).

Information Organization and Storage: Indexing and Index

An index, as a data structure, organizes information efficiently so that specific terms can be searched quickly.

First, consider the case where we do not remember some key terms while reading a reference: the appendix may include an index recording the places where each term first appears.

Search engines take advantage of this idea: they store where the terms/words appear in key-value format, where the keys are the terms and the values are the places where they occur.
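
As a minimal sketch of this key-value idea (the toy corpus and names below are illustrative, not taken from any particular engine), the following Python snippet builds an inverted index and answers a conjunctive keyword query by intersecting posting lists:

```python
from collections import defaultdict

docs = {
    1: "web search engines crawl and index the web",
    2: "information retrieval ranks documents for a query",
    3: "an inverted index maps each term to the documents containing it",
}

# key -> value: term -> set of docIDs in which the term appears.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def search(query):
    """Conjunctive (AND) query: intersect the posting sets of all query terms."""
    postings = [inverted_index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("index web"))          # {1}
print(search("documents query"))    # {2}
```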

Index Creation

Here we introduce some data structures which can speed up the search procedure given a query. Note that index technology is also used in database performance optimization. The index is used to organize and manage the documents; it is a core function of an information retrieval system.

B+ Tree

A B+ tree is primarily utilized for implementing dynamic multilevel indexing. Compared to a B-tree, the B+ tree stores data pointers only at the leaf nodes of the tree, which keeps internal nodes compact and makes the search process faster.

BitMap

In computing, a bitmap is a mapping from some domain (for example, a range of integers) to bits, that is, values which are zero or one. It is also called a bit array or bitmap index.
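
As a small illustration (the column and its values are made up), a bitmap index can keep one bit array per distinct value of a low-cardinality field, here encoded as plain Python integers, so that Boolean queries reduce to bitwise operations:

```python
# Toy bitmap index over a low-cardinality "color" column of 8 rows.
rows = ["red", "blue", "red", "green", "blue", "red", "green", "blue"]

bitmaps = {}
for i, value in enumerate(rows):
    # Set bit i in the bitmap of this value (Python ints act as bit arrays).
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def row_ids(bits):
    return [i for i in range(len(rows)) if bits & (1 << i)]

# Which rows are red OR green?  Bitwise OR of the two bitmaps.
print(row_ids(bitmaps["red"] | bitmaps["green"]))   # [0, 2, 3, 5, 6]
```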

Hashing

To quote the hash function at wikipedia:

A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

Hashed indexes use a hashing function to compute the hash of the value of the index field. The hashing function collapses embedded documents and computes the hash for the entire value but does not support multi-key (i.e. arrays) indexes.

Locality-Sensitive Hashing

Locality-Sensitive Hashing (LSH) is a class of methods for the nearest neighbor search problem, which is defined as follows: given a dataset of points in a metric space (e.g., $\mathbb{R}^d$ with the Euclidean distance), our goal is to preprocess the data set so that we can quickly answer nearest neighbor queries: given a previously unseen query point, we want to find one or several points in our dataset that are closest to the query point.
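
A minimal sketch of one classical LSH family, random-hyperplane hashing for cosine similarity (the dimensionality, number of hash bits and the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 64, 16                       # data dimensionality, hash length in bits

# One random hyperplane per hash bit.
hyperplanes = rng.standard_normal((n_bits, d))

def lsh_hash(x):
    """Hash code = sign pattern of the projections onto the random hyperplanes."""
    return tuple((hyperplanes @ x > 0).astype(int))

# Index a random dataset into hash buckets.
data = rng.standard_normal((1000, d))
buckets = {}
for i, x in enumerate(data):
    buckets.setdefault(lsh_hash(x), []).append(i)

# At query time only the points in the query's bucket are compared exactly;
# nearby points collide with high probability while far points rarely do.
query = data[42] + 0.01 * rng.standard_normal(d)
candidates = buckets.get(lsh_hash(query), [])
print(len(candidates), 42 in candidates)
```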

Cuckoo Hashing

Cuckoo Hashing is a technique for resolving collisions in hash tables that produces a dictionary with constant-time worst-case lookup and deletion operations as well as amortized constant-time insertion operations.
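
A toy sketch of the two-table scheme (the table size, hash functions and kick limit are illustrative; a production implementation would rehash or resize when a cycle is detected):

```python
class CuckooHash:
    """Toy cuckoo hash table: every key can live in one of exactly two slots."""

    def __init__(self, size=11, max_kicks=50):
        self.size, self.max_kicks = size, max_kicks
        self.t1 = [None] * size
        self.t2 = [None] * size

    def _h1(self, key):
        return hash(key) % self.size

    def _h2(self, key):
        return (hash(key) // self.size) % self.size

    def lookup(self, key):
        # Worst-case constant time: only two slots can ever hold the key.
        return self.t1[self._h1(key)] == key or self.t2[self._h2(key)] == key

    def insert(self, key):
        for _ in range(self.max_kicks):
            slot = self._h1(key)
            key, self.t1[slot] = self.t1[slot], key   # place key, evict old occupant
            if key is None:
                return True
            slot = self._h2(key)
            key, self.t2[slot] = self.t2[slot], key   # move the evicted key to its other slot
            if key is None:
                return True
        raise RuntimeError("insertion cycle: rebuild with fresh hash functions")

table = CuckooHash()
for word in ["web", "search", "index", "query", "ranking"]:
    table.insert(word)
print(table.lookup("index"), table.lookup("corpus"))   # True False
```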

Bio-Inspired Hashing

The fruit fly Drosophila's olfactory circuit has inspired a new locality sensitive hashing (LSH) algorithm, FlyHash. In contrast with classical LSH algorithms that produce low dimensional hash codes, FlyHash produces sparse high-dimensional hash codes and has also been shown to have superior empirical performance compared to classical LSH algorithms in similarity search. However, FlyHash uses random projections and cannot learn from data. Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. We show that BioHash outperforms previously published benchmarks for various hashing methods. Since our learning algorithm is based on a local and biologically plausible synaptic plasticity rule, our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems. We also propose a convolutional variant BioConvHash that further improves performance. From the perspective of computer science, BioHash and BioConvHash are fast, scalable and yield compressed binary representations that are useful for similarity search.

FlyHash

Learning to Hash

By using hash codes to construct the index, we can achieve constant or sub-linear search time complexity.

Hash functions are learned from a given training dataset.

KNN graph
Learned Indexes

Index Compression

Index compression is a specialized data compression technique used to save storage space. Compressing the index is necessary at web scale.
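
As one concrete illustration of index compression in general (this is not the BitFunnel approach discussed below), posting lists are often stored as gaps between sorted docIDs and the gaps encoded with variable-byte codes; a minimal sketch:

```python
def vbyte_encode(numbers):
    """Variable-byte (little-endian) encode a list of non-negative integers."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)   # continuation bit set
            n >>= 7
        out.append(n)                        # final byte, high bit clear
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers

def compress_postings(doc_ids):
    """Delta-encode sorted docIDs (store gaps), then variable-byte encode."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    doc_ids, total = [], 0
    for g in vbyte_decode(data):
        total += g
        doc_ids.append(total)
    return doc_ids

postings = [3, 7, 21, 150, 151, 4000]
blob = compress_postings(postings)
assert decompress_postings(blob) == postings
print(len(blob), "bytes instead of", 8 * len(postings), "bytes as 64-bit integers")
```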

Index Compression for BitFunnel Query Processing

Large-scale search engines utilize inverted indexes which store ordered lists of document identifiers (docIDs) relevant to query terms, which can be queried thousands of times per second. In order to reduce storage requirements, we propose a dictionary-based compression approach for the recently proposed bitwise data structure BitFunnel, which makes use of a Bloom filter. Compression is achieved through storing frequently occurring blocks in a dictionary. Infrequently occurring blocks (those which are not represented in the dictionary) are instead referenced using similar blocks that are in the dictionary, introducing additional false positive errors. We further introduce a docID reordering strategy to improve compression.

Information Retrieval

Query Languages

Boolean Queries

Keywords combined with Boolean operators: OR, AND, BUT

Phrasal Queries

Retrieve documents with a specific phrase (ordered list of contiguous words)

Proximity Queries

A list of words with specific maximal distance constraints between terms. Example: “dogs” and “race” within 4 words matches “…dogs will begin the race…”

Pattern Matching

Allow queries that match strings rather than word tokens. Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.

Edit (Levenshtein) Distance is defined as the minimum number of character deletions, additions, or replacements needed to make two strings equivalent.

Longest Common Subsequence (LCS) is the length of the longest subsequence of characters shared by two strings.
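
Minimal dynamic-programming implementations of both measures (the example strings are the misspelling pair used in the spelling-correction section below):

```python
def edit_distance(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (ca != cb),  # replacement (free if characters match)
            )
    return dp[-1]

def lcs_length(a, b):
    """Length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

print(edit_distance("elefnat", "elephant"))   # 3
print(lcs_length("elefnat", "elephant"))      # 5
```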

Regular Expressions

Language for composing complex patterns from simpler ones: Union, Concatenation, Repetition.

Structural Queries

Assumes documents have structure that can be exploited in search, allow queries for text appearing in specific fields.

SQL

Query Parser and Understanding

A query often consists of some keywords in a natural language such as English or Chinese. We use the search engine when we would like to find some information related to the keywords on the web/internet, which means we do not completely know what the result will be. Additionally, all information is digitized in the computer, and computers do not understand natural language natively. For example, synonyms are different as character or string data structures in the computer. Natural language processing (NLP) or natural language understanding (NLU) helps the computer comprehend the query.

| Operation | Response Time |
| --- | --- |
| Query Auto Completion | Before the query input is finished |
| Spelling Correction | When the query input is finished |
| Semantic Analysis | After the query input is finished |
| Query Suggestion | After the query input is finished |
| Intention Analysis | After the query input |

Query Operations
  • Query Reformulation:
    • Query Expansion: Add new terms to query from relevant documents.
    • Term Reweighting: Increase weight of terms in relevant documents and decrease weight of terms in irrelevant documents.

Standard Rocchio Method

Ide Regular Method

Query Auto Completion

The auto complete is a drop-down list populated with suggestions of what one can write in the search box.

The auto complete is a list of suggestions of what one can write in the search box to reach different products or categories. These suggestions will also be referred to as query suggestions or completions. After one has written a few letters of the beginning of the query, the list is populated with query suggestions that in some way match the input. In the normal case, matching means that the suggestion starts with the input.

Spelling Correction

For simplicity let us first consider correction of individual misspelled words (e.g., “elefnat” to “elephant”). One simple approach to spelling error correction is to calculate the edit distance between the query word and each of the dictionary words. Dictionary words within a fixed range of edit distance or a variable range of edit distance depending on word length are selected as candidates for correction. There are at least two drawbacks for this approach, however. First, probabilities of word usages as well as word misspellings are not considered in the model. Second, context information of correction is not taken into consideration.

To address the issues, probabilistic approaches, both generative approach and discriminative approach, have been proposed. Suppose that the query word is represented as $q$ and a correction is represented as $c$. We want to find the correction $\hat{c}$ having the largest conditional probability $P(c|q)$. Different ways of defining the model lead to different methods.

By Bayes’ rule, we can consider finding the correction $\hat c$ having the largest product of probability $P(c)$ and conditional probability $P(q|c)$ $$\hat c=\arg\max_{c} P(c\mid q)=\arg\max_{c}P(c)P(q\mid c).$$ The former is called source model and the latter channel model.

The source model can be trained by using the document collection and/or search log. (Due to the wide variety of searches it is better to find the legitimate words from data.) A straightforward way would be to estimate the probabilities of words based on their occurrences in the dataset with a smoothing technique applied. The channel model can be defined based on weighted edit distance, where the model is usually trained by using data consisting of pairs of correct word and misspelled word.
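
A minimal sketch of this noisy-channel formulation; the corpus is a toy example and the channel model is crudely approximated by edit distance rather than a trained weighted edit model:

```python
import re
from collections import Counter
from functools import lru_cache

def edit_distance(a, b):
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1,                           # deletion
                   d(i, j - 1) + 1,                           # insertion
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))  # substitution
    return d(len(a), len(b))

# Source model P(c): word probabilities estimated from a (tiny) collection.
corpus = "the elephant is the largest living land animal and the elephant has a trunk"
counts = Counter(re.findall(r"[a-z]+", corpus.lower()))
total = sum(counts.values())

def correct(query_word, max_dist=2):
    """Return argmax_c P(c) * P(q|c), with P(q|c) crudely decaying in edit distance."""
    candidates = [c for c in counts if edit_distance(query_word, c) <= max_dist]
    if not candidates:
        return query_word
    return max(candidates,
               key=lambda c: (counts[c] / total) * 0.1 ** edit_distance(query_word, c))

print(correct("elefant"))   # elephant
```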

Query Suggestion

When a user provides a root input, such as a search query, these algorithms dynamically retrieve, curate, and present a list of related inputs, such as search suggestions. Although ubiquitous in online platforms, a lack of research addressing the ephemerality of their outputs and the opacity of their functioning raises concerns of transparency and accountability on where inquiry is steered.

Query Expansion

Query expansion is a technique studied intensively and widely in IR. The basic idea is to enrich the query with additional terms (words or phrases) and to use the expanded query to conduct search in order to circumvent the query-document mismatch challenge.

Query Relaxation

Unlike relational databases where the schema is relatively small and fixed, XML model allows varied/missing structures and values, which make it difficult for users to ask questions precisely and completely. To address such problems, query relaxation technique enables systems to automatically weaken, when not satisfactory, the given user query to a less restricted form to permit approximate answers as well.

Query Segmentation

Query segmentation is to separate the input query into multiple segments, roughly corresponding to natural language phrases, for improving search relevance.

Query Scoping

We propose Voronoi scoping, a distributed algorithm to constrain the dissemination of messages from different sinks. It has the property that a query originated by a given sink is forwarded only to the nodes for which that sink is the closest (under the chosen metric). Thus each query is forwarded to the smallest possible number of nodes, and per-node dissemination overhead does not grow with network size or with number of sinks. The algorithm has a simple distributed implementation and requires only a few bytes of state at each node. Experiments over a network of 54 motes confirm the algorithm's effectiveness.

Query Understanding

Query understanding: query normalization (encoding, tokenization, spelling); query rewriting (expansion, relaxation, segmentation, scoping)

Levels of Query Understanding:

NTENT’s Search platform choreographs the interpretation of singular query constituents, and the dissemination of relevant answers through a specialized combination of Language Detection, Linguistic Processing, Semantic Processing and Pragmatic Processing.

  • Language Detection: The first step is to understand which language the user is using. Sometimes this is obvious, but many things can make this hard. For example, many users find themselves forced to use a keyboard that makes it hard to use accented characters, so they “ascify” their query. Users sometimes use multiple languages within a single query (“code switching”) or proper names that are the same in many languages.
  • Linguistic Processing: Every language has its own rules for how text should be broken down into individual words (“tokenized”), whether distinctions of case and accent are significant, how to normalize words to a base form (“lemmatization” or “stemming”), and categorization of words and phrases by parts of speech (“POS tagging”).
  • Semantic Processing: A traditional keyword search engine would stop after linguistic processing, but NTENT’s technology goes further, and determines what the user’s words actually mean. Many words have multiple meanings (“homonyms”), and many concepts have multiple ways to express them (“synonyms”). Drawing on many sources of information, such as a large-scale ontology, notability data, and the user’s context (e.g., location), we are able to determine all the possible interpretations of the user’s query, and assign a probability to each one. By distinguishing a particular sense of a word, and by knowing which phrases denote a single concept, we are able to improve the precision of our applications. At the same time, by recognizing that multiple expressions refer to the same concept, and also that broad terms encompass narrower ones (“hyponymy”), we are able to improve recall. Furthermore, NTENT is able to analyze the syntax of how multiple words are combined into composite concepts.
  • Intent Detection (Pragmatic Processing): NTENT goes beyond just the surface semantics of the user’s utterance, and develops hypotheses about why they typed what they did: what their information need is; what transactions they intend to perform; what website they’re looking for; or what local facilities they’re trying to find. This inductive reasoning is key to harnessing NTENT’s extensive set of experts to give the user what they want.
NLP Pipeline of Query Understanding:

  • Tokenization
  • Stop words removal
  • Text Normalization
  • POS Tagging
  • Named Entity Recognition

Intention Analysis

Intent analysis goes a level deeper than sentiment analysis and gives an idea of whether a string of text is a complaint, a suggestion or a query. Gauging the intent of messages on social media opens a lot of new possibilities. It uses Long Short-Term Memory (LSTM) networks to classify a text into different intent classes. LSTMs model sentences as a chain of forget-remember decisions based on context. The model is trained separately on social media data and news data to handle casual and formal language. We have also trained this algorithm on various custom datasets for different clients.

Relevance and Rank

Recall the definition of Discounted Cumulative Gain(DCG):

$${DCG}_p= \sum_{i=1}^{p} \frac{{rel}_i}{\log_{2}(i+1)}$$

where ${rel}_{i}$ is the graded relevance of the document at position $i$ with respect to the query.
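
A direct transcription of this definition into Python (the relevance grades below are hypothetical), together with the normalized variant nDCG obtained by dividing by the DCG of the ideal ordering:

```python
import math

def dcg(relevances, p=None):
    """Discounted Cumulative Gain over the top-p positions of a ranked list."""
    rels = relevances[:p] if p is not None else relevances
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

ranked_rels = [3, 2, 3, 0, 1, 2]   # hypothetical graded relevance, in ranked order
ideal = sorted(ranked_rels, reverse=True)
print(round(dcg(ranked_rels, p=6), 3),
      round(dcg(ranked_rels, p=6) / dcg(ideal, p=6), 3))   # DCG@6, nDCG@6
```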

However, we have not yet discussed how to compute the relevance between a document and a query. Documents are usually text, such as HTML files, so natural language processing plays a lead role in computing relevance. For other types of information retrieval systems, relevance is computed differently. For example, imagine a search engine that finds and returns images on the internet similar to a given image query, where the information is mostly in pixel format rather than text/strings.

| | A part of the Ranking Model |
| --- | --- |
| Query-independent ranking | on-document evidence (retrievability, readability, maliciousness); off-document evidence (centrality, popularity, credibility) |
| Query understanding | query normalization (encoding, tokenization, spelling); query rewriting (expansion, relaxation, segmentation, scoping) |
| Query-dependent ranking | basic models (algebraic models, probabilistic models, information-theoretic models); proximity models (Markov random fields models); structural models (field-based models); semantic models (latent semantics, explicit semantics) |
| Contextual ranking | personalization; diversification; interactivity |
| Machine-learned ranking | query-document representation; loss functions (pointwise, pairwise, listwise loss); optimization strategies; adaptation strategies (intent-awareness, exploration-exploitation) |
| Ranking evaluation | behavioral models; evaluation design; evaluation metrics; offline evaluation; online evaluation |

Machine-learned ranking and ranking evaluation are partially discussed in Rating and Ranking. Contextual ranking is not discussed here.

Query-independent Ranking

Query independent scores are prepared and applied to search results. Search results applying query term relevance criteria are combined with query independent scores to form a combined score. The combined score may alter the original ranking using only the query scores. The query independent scores can be used to increase the combined scores of important objects, where importance measurements include frequently accessed objects, objects with more connections and/or objects that are the subject of discussion.

Query-independent ranking includes :

  • on-document evidence (retrievability, readability, maliciousness);
  • off-document evidence (centrality, popularity, credibility)

And it is computed before the query is given.

Centrality of network assigns an importance score based purely on the number of links held by each node. PageRank is introduced in Graph Algorithms. HITS algorithm is in the same spirit as PageRank. They both make use of the link structure of the Web graph in order to decide the relevance of the pages. The difference is that unlike the PageRank algorithm, HITS only operates on a small subgraph (the seed SQ) from the web graph. This subgraph is query dependent; whenever we search with a different query phrase, the seed changes as well. HITS ranks the seed nodes according to their authority and hub weights. The highest ranking pages are displayed to the user by the query engine.

Search Engine Optimization (SEO) is the business of boosting a website's position in the search results.

Query-dependent ranking

Query-dependent ranking:

  • basic models (algebraic models, probabilistic models, information-theoretic models);
  • proximity models (Markov random fields models); structural models (field-based models);
  • semantic models (latent semantics, explicit semantics).
Features/Attributes for ranking:

  • Average Absolute Query Frequency
  • Query Length
  • Average Absolute Document Frequency
  • Document Length
  • Average Inverse Document Frequency
  • Number of Terms in common between query and document

TF-IDF

Term frequency (tf) of a word ${w}$ in a given document ${doc}$ is defined as $$ tf(w\mid doc)=\frac{\text{the number of occurrences of the word } w \text{ in } doc}{\text{the number of words in } doc}. $$ It measures how popular the word ${w}$ is in the document $doc$. Inverse document frequency (idf) of a word ${w}$ in a given document list $D=\{doc_i\mid i=1,2,\cdots, N\}$ is defined as $$ idf(w\mid D)=\log\frac{\text{the number of documents in the list } D}{\text{the number of documents containing the word } w+1}, $$ which measures how popular the word $w$ is across the document list. tf-idf is a rough way of approximating how users value the relevance of a text match, defined as $$\text{tf-idf}=tf(w\mid doc)\times idf(w\mid D).$$
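
A direct transcription of these definitions with a toy corpus (the documents are made up):

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
N = len(docs)

def tf(w, doc):
    return doc.count(w) / len(doc)

def idf(w):
    n_w = sum(1 for doc in docs if w in doc)
    return math.log(N / (n_w + 1))

def tf_idf(w, doc):
    return tf(w, doc) * idf(w)

# "the" occurs in most documents, so its idf (and hence tf-idf) is driven toward
# zero, while the rarer "mat" gets a positive weight in the document containing it.
print(round(tf_idf("mat", docs[0]), 4), round(tf_idf("the", docs[0]), 4))
```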

Robertson-Sparck Jones Model

The goal of a probabilistic retrieval model is clearly to retrieve the documents with the highest probability of relevance to the given query.

Three random variables: the query $Q$, the document $D$ and the relevance $R \in \{0,1\}$. The goal is to estimate the rank of $D$ based on $P(R=1\mid Q,D)$.

The basic idea is to compute $Odd(R=1|Q,D)$ using Bayes’ rule $$Odd(R=1\mid Q, D)=\frac{P(R=1\mid Q, D)}{P(R=0\mid Q, D)}=\frac{P(Q, D\mid R=1)}{P(Q, D\mid R=0)}\frac{P(R=1)}{P(R=0)}.$$

BM25

The basic idea of BM25 is to rank documents by the log-odds of their relevance. Actually BM25 is not a single model, but defines a whole family of ranking models, with slightly different components and parameters. One of the popular instantiations of the model is as follows.

Given a query $q$, containing terms $t_1,\cdots , t_M,$ the BM25 score of a document $d$ is computed as $$ BM25(d, q)=\sum_{i=1}^{M}\frac{IDF(t_i)\cdot TF(t_i, d)\cdot (k_1 + 1)}{TF(t_i, d)+k_1 \cdot (1-b+b\cdot \frac{LEN(d)}{avdl})} $$

where $TF(t, d)$ is the term frequency of the term $t$ in the document $d$, $LEN(d)$ is the length (number of words) of document $d$, and $avdl$ is the average document length in the text collection from which documents are drawn. $k_1$ and $b$ are free parameters, and $IDF(t)$ is the IDF weight of the term $t$, computed by $IDF(t)=\log(\frac{N}{n(t)})$, where $N$ is the total number of documents in the collection and $n(t)$ is the number of documents containing the term $t$.
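
A minimal transcription of this instantiation with a toy corpus; $k_1=1.2$ and $b=0.75$ are common default values rather than anything prescribed by the formula:

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the yard".split(),
    "cats and dogs are common pets".split(),
]
N = len(docs)
avdl = sum(len(d) for d in docs) / N

def idf(term):
    n_t = sum(1 for d in docs if term in d)
    return math.log(N / n_t) if n_t else 0.0

def bm25(doc, query, k1=1.2, b=0.75):
    score = 0.0
    for t in query:
        tf = doc.count(t)
        score += idf(t) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avdl))
    return score

query = "cat mat".split()
for i, d in enumerate(docs):
    print(i, round(bm25(d, query), 3))
```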

The language model for information retrieval (LMIR)

The language model for information retrieval (LMIR) is an application of the statistical language model on information retrieval. A statistical language model assigns a probability to a sequence of terms. When used in information retrieval, a language model is associated with a document.

With query $q$ as input, documents are ranked based on the query likelihood, or the probability that the document’s language model will generate the terms in the query(i.e., $P(q\mid d)$). By further assuming the independence between terms, one has $$P(q\mid d)=\prod_{i=1}^{M}P(t_i\mid d)$$ if query $q$ contains terms $t_1,\dots, t_M$.

To learn the document’s language model, a maximum likelihood method is used. As in many maximum likelihood methods, the issue of smoothing the estimate is critical. Usually a background language model estimated using the entire collection is used for this purpose. Then, the document’s language model can be constructed as follows: $$p(t_i\mid d)=(1-\lambda)\frac{TF(t_i, d)}{LEN(d)}+\lambda p(t_i\mid C)$$ where $p(t_i\mid C)$ is the background language model for term $t_i$, and $\lambda \in[0, 1]$ is a smoothing factor.
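
A minimal sketch of query-likelihood scoring with the smoothing formula above (toy collection; $\lambda=0.5$ is chosen arbitrarily):

```python
from collections import Counter

collection = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
background = Counter(t for doc in collection for t in doc)
total_terms = sum(background.values())

def p_background(t):
    return background[t] / total_terms

def query_likelihood(query, doc, lam=0.5):
    """P(q|d) with linear (Jelinek-Mercer) smoothing, as in the formula above."""
    p = 1.0
    for t in query:
        p *= (1 - lam) * doc.count(t) / len(doc) + lam * p_background(t)
    return p

query = "cat mat".split()
for i, doc in enumerate(collection):
    print(i, query_likelihood(query, doc))
```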

TextRank

David Ten wrote a blog on TextRank:

For keyword extraction we want to identify a subset of terms that best describe the text. We follow these steps:

  1. Tokenize and annotate with Part of Speech (PoS). Only consider single words. No n-grams used, multi-words are reconstructed later.
  2. Use syntactic filter on all the lexical units (e.g. all words, nouns and verbs only).
  3. Create an edge if lexical units co-occur within a window of N words, to obtain an unweighted undirected graph.
  4. Run the text rank algorithm to rank the words.
  5. We take the top lexical words.
  6. Adjacent keywords are collapsed into a multi-word keyword.

The TextRank model is graph-based, derived from Google’s PageRank. It constructs a weighted graph $G$:

  • the node set $V$ consists of all sentences in the document;
  • the weight is the similarity of each sentence pair, i.e., $w_{i,j}=Similarity (V_i, V_j)$.

The weight of each sentence depends on the weights of its neighbors: $$WS(V_i)=(1-d)+d\times {\sum}_{V_j\in In(V_i)}\frac{w_{ji}}{\sum_{V_k\in Out(V_j)} w_{jk}}WS(V_j).$$
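
A minimal sketch of the sentence-level variant (toy sentences; the similarity is the normalized word overlap from the original TextRank paper, and a fixed number of power-iteration steps stands in for a convergence check):

```python
import math

sentences = [
    "information retrieval studies the search for information in documents",
    "search engines are a practical application of information retrieval",
    "cats are popular pets",
]
tokens = [set(s.split()) for s in sentences]

def similarity(a, b):
    """Normalized word overlap between two sentences."""
    overlap = len(tokens[a] & tokens[b])
    return overlap / (math.log(len(tokens[a])) + math.log(len(tokens[b])))

n = len(sentences)
w = [[similarity(i, j) if i != j else 0.0 for j in range(n)] for i in range(n)]

# Power iteration of the weighted-PageRank formula above.
d, ws = 0.85, [1.0] * n
for _ in range(50):
    ws = [(1 - d) + d * sum(w[j][i] / sum(w[j]) * ws[j]
                            for j in range(n) if sum(w[j]) > 0)
          for i in range(n)]

print(sorted(range(n), key=lambda i: -ws[i]))   # sentence indices, most central first
```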


Text Summarization

A summary can be defined as “a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that”. Automatic text summarization is the process of extracting such a summary from given document(s).

In the popular open-source search engine Elasticsearch, the scoring formula is considerably more complex.

Document Similarity

Document similarity (or distance between documents) is one of the central themes in information retrieval. How do humans usually judge how similar documents are? Usually documents are treated as similar if they are semantically close and describe similar concepts. One classical technique for this is w-shingling.

Vector Quantization

Vector quantization (VQ) is an efficient coding technique to quantize signal vectors. It has been widely used in signal and image processing, such as pattern recognition and speech and image coding. A VQ compression procedure has two main steps: codebook training (sometimes also referred to as codebook generation) and coding (i.e., codevector matching). In the training step, similar vectors in a training sequence are grouped into clusters, and each cluster is assigned to a single representative vector called a codevector. In the coding step, each input vector is then compressed by replacing it with the nearest codevector referenced by a simple cluster index. The index (or address) of the matched codevector in the codebook is then transmitted to the decoder over a channel and is used by the decoder to retrieve the same codevector from an identical codebook. This is the reconstructed reproduction of the corresponding input vector. Compression is thus obtained by transmitting the index of the codevector rather than the entire codevector itself.

Latent semantic indexing

Latent semantic indexing (LSI) is a concept used by search engines to discover how a term and content work together to mean the same thing, even if they do not share keywords or synonyms.

Regularized Latent Semantic Indexing

It is a matching method between query and document at the topic level, based on matrix factorization, which scales up to large datasets. The parametric model is expressed in the following form:

$$\min_{U, \{v_n\}}\sum_{n=1}^{N}{\|d_n - U v_n\|}_2^2+\underbrace{\lambda_1\sum_{k=1}^K {\|u_k\|}_1}_{\text{topics are sparse}} + \underbrace{\lambda_2\sum_{n=1}^{N}{\|v_n\|}_2^2}_{\text{documents are smooth}}$$

where

  • $d_n$ is term representation of doc $n$;
  • $U$ represents topics;
  • $v_n$ is the topic representation of doc $n$;
  • $\lambda_1$ and $\lambda_2$ are regularization parameters.

It is optimized by coordinate descent: $$u_{mk}=\arg\min_{\bar u_m}\sum_{m=1}^M {\|\bar d_m - V^T \bar u_m\|}_2^2+\lambda_1\sum_{m=1}^{M}{\|\bar u_m\|}_1,\qquad v_n^{\ast}=\arg\min_{v_n}\sum_{n=1}^{N}{\|d_n -U v_n\|}_2^2+\lambda_2\sum_{n=1}^N{\|v_n\|}_2^2=(U^T U + \lambda_2 I)^{-1}U^T {d}_n.$$

Partial Least Square (PLS)

The input training data set is $\{(x_i, y_i, r_i)\mid i=1, 2,\cdots, N\}$ where $r_i \in \{+1, -1\}$.

It is to optimize the following cost function $$\arg\max_{L_x, L_y}\sum_{r_i=+1}\left<L_x x_i, L_y y_i\right>-\sum_{r_i=-1}\left<L_x x_i, L_y y_i\right>\ s.t. \quad L_x^T L_x=L_y^TL_y=I_k.$$ Regularized Mapping to Latent Space will change the constraints $$\arg\max_{L_x, L_y}\sum_{r_i=+1}\left<L_x x_i, L_y y_i\right>-\sum_{r_i=-1}\left<L_x x_i, L_y y_i\right>\ s.t. \quad L_x^T L_x=L_y^TL_y=I_k.$$

Comparison and Matching

Query and indexed object are analogous to question and answers. The user issues a query, and the matched response is supposed to match the query in semantics. Before that, we must understand the query.

The most common way to model similarity is by means of a distance function. A distance function assigns high values to objects that are dissimilar and small values to objects that are similar, reaching 0 when the two compared objects are the same. Mathematically, a distance function is defined as follows:

Let $X$ be a set. A function $\delta:X\times X\to \mathbb R$ is called a distance function if it holds for all $x,y\in X$:

  • $\delta(x,x)=0$ (reflexivity)
  • $\delta(x,y)=\delta(y,x)$ (symmetry)
  • $\delta(x,y)≥0$ (non-negativity)

When it comes to efficient query processing, as we will see later, it is useful if the utilized distance function is a metric.

Let $\delta:X\times X\to \mathbb R$ be a distance function. $\delta$ is called a metric if it holds for all $x,y,z\in X$:

  • $\delta(x,y)=0\iff x=y$ (identity of indiscernibles)
  • $\delta(x,y)\leq \delta(x,z)+\delta(z,y)$ (triangle inequality)

In a similarity-based retrieval model, it is assumed that the relevance status of a document with respect to a query is correlated with the similarity between the query and the document at some level of representation; the more similar to a query, the more relevant the document is assumed to be. $\color{red}{Similarity \not= Relevance}$


| Similarity matching | Relevance matching |
| --- | --- |
| Whether two sentences are semantically similar | Whether a document is relevant to a query |
| Homogeneous texts with comparable lengths | Heterogeneous texts (keywords query, document) and very different in lengths |
| Matches at all positions of both sentences | Matches in different parts of documents |
| Symmetric matching function | Asymmetric matching function |
| Representative task: Paraphrase Identification | Representative task: ad-hoc retrieval |

Each search is made up of $\color{red}{\fbox{Match + Rank}}$. The matching problem describes the task in IR of:

We focus on the problem of identifying those documents in a corpus that match a conjunctive query of keywords.


Learning to Match

The user’s intent is explicitly reflected in the query, such as keywords or questions. Content is in webpages and images. The key challenge is the query-document semantic gap. The gap is even more severe in recommendation than in search, since users and items are two different types of entities and are represented by different features.

Common goal: matching a need (may or may not include an explicit query) to a collection of information objects (product descriptions, web pages, etc.) Difference for search and recommendation: features used for matching!

| | Matching | Ranking |
| --- | --- | --- |
| Prediction | Matching degree between a query and a document | Ranking list of documents |
| Model | $f(q, d)$ | $f(q,\{d_1,d_2,\dots\})$ |
| Goal | Correct matching between query and document | Correct ranking on the top |

Methods of Representation Learning for Matching:

  • DSSM: Learning Deep Structured Semantic Models for Web Search using Click-through Data (Huang et al., CIKM ’13)
  • CDSSM: A latent semantic model with convolutional-pooling structure for information retrieval (Shen et al. CIKM ’14)
  • CNTN: Convolutional Neural Tensor Network Architecture for Community-Based Question Answering (Qiu and Huang, IJCAI ’15)
  • CA-RNN: Representing one sentence with the other sentence as its context (Chen et al., AAAI ’18)

Bit-string Signatures

COBS

We present COBS, a COmpact Bit-sliced Signature index, which is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain a number of false positives which decreases exponentially with the query length. We compare COBS to seven other index software packages on 100000 microbial DNA samples. COBS' compact but simple data structure outperforms the other indexes in construction time and query performance with Mantis by Pandey et al. in second place. However, unlike Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.

BitFunnel

In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures. This index, known as BitFunnel, replaced an existing production system based on an inverted index.

The key idea of bit-string signatures is that each document in the corpus is represented by its signature. In BitFunnel, the signature is essentially the sequence of bits that make up a Bloom filter representing the set of terms in the document.

Approximate Nearest Neighbors

[Nearest neighbour search is the problem of finding the most similar data-points to a query in a large database of data-points](https://learning2hash.github.io/), and is a fundamental operation that has found wide applicability in many fields, from Bioinformatics, through to Natural Language Processing (NLP) and Computer Vision.

An approximate nearest neighbor search algorithm is allowed to return points, whose distance from the query is at most c times the distance from the query to its nearest points.

Annoy

There are some other libraries to do nearest neighbor search. Annoy is almost as fast as the fastest libraries (see below), but there is actually another feature that really sets Annoy apart: it has the ability to use static files as indexes. In particular, this means you can share an index across processes. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing about Annoy is that it tries to minimize its memory footprint, so the indexes are quite small.

FALCONN

FALCONN is a library with algorithms for the nearest neighbor search problem. The algorithms in FALCONN are based on Locality-Sensitive Hashing (LSH), which is a popular class of methods for nearest neighbor search in high-dimensional spaces. The goal of FALCONN is to provide very efficient and well-tested implementations of LSH-based data structures.

SPTAG

SPTAG assumes that the samples are represented as vectors and that the vectors can be compared by L2 distances or cosine distances. Vectors returned for a query vector are the vectors that have smallest L2 distance or cosine distances with the query vector.

SPTAG provides two methods: kd-tree and relative neighborhood graph (SPTAG-KDT) and balanced k-means tree and relative neighborhood graph (SPTAG-BKT). SPTAG-KDT is advantageous in index building cost, and SPTAG-BKT is advantageous in search accuracy in very high-dimensional data.

It explains how SPTAG works:

SPTAG is inspired by the NGS approach [WangL12]. It contains two basic modules: index builder and searcher. The RNG is built on the k-nearest neighborhood graph [WangWZTG12, WangWJLZZH14] for boosting the connectivity. Balanced k-means trees are used to replace kd-trees to avoid the inaccurate distance bound estimation in kd-trees for very high-dimensional vectors. The search begins with the search in the space partition trees for finding several seeds to start the search in the RNG. The searches in the trees and the graph are iteratively conducted.

Faiss

Faiss contains several methods for similarity search. It assumes that the instances are represented as vectors and are identified by an integer, and that the vectors can be compared with L2 distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also supports cosine similarity, since this is a dot product on normalized vectors.

Most of the methods, like those based on binary vectors and compact quantization codes, solely use a compressed representation of the vectors and do not require to keep the original vectors. This generally comes at the cost of a less precise search but these methods can scale to billions of vectors in main memory on a single server.

The GPU implementation can accept input from either CPU or GPU memory. On a server with GPUs, the GPU indexes can be used as a drop-in replacement for the CPU indexes (e.g., replace IndexFlatL2 with GpuIndexFlatL2) and copies to/from GPU memory are handled automatically. Results will be faster, however, if both input and output remain resident on the GPU. Both single- and multi-GPU usage is supported.
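
A minimal usage sketch, assuming the faiss Python package is installed (e.g., via pip install faiss-cpu); IndexFlatL2 is the exact brute-force index, and the approximate indexes expose the same add/search interface:

```python
import numpy as np
import faiss

d = 64                                               # vector dimensionality
xb = np.random.random((10000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")      # query vectors

index = faiss.IndexFlatL2(d)   # exact L2 search
index.add(xb)                  # index the database vectors
D, I = index.search(xq, 4)     # distances and ids of the 4 nearest neighbors
print(I.shape)                 # (5, 4)
```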

HNSW: Hierarchical Navigable Small World

We present a new approach for the approximate K-nearest neighbor search based on navigable small world graphs with controllable hierarchy (Hierarchical NSW, HNSW). The proposed solution is fully graph-based, without any need for additional search structures, which are typically used at the coarse search stage of the most proximity graph techniques. Hierarchical NSW incrementally builds a multi-layer structure consisting from hierarchical set of proximity graphs (layers) for nested subsets of the stored elements. The maximum layer in which an element is present is selected randomly with an exponentially decaying probability distribution. This allows producing graphs similar to the previously studied Navigable Small World (NSW) structures while additionally having the links separated by their characteristic distance scales. Starting search from the upper layer together with utilizing the scale separation boosts the performance compared to NSW and allows a logarithmic complexity scaling. Additional employment of a heuristic for selecting proximity graph neighbors significantly increases performance at high recall and in case of highly clustered data. Performance evaluation has demonstrated that the proposed general metric space search index is able to strongly outperform previous opensource state-of-the-art vector-only approaches. Similarity of the algorithm to the skip list structure allows straightforward balanced distributed implementation.

Scalable Similarity Search

The aim of the project is to improve theory and practice of algorithms for high-dimensional similarity search on big data, and to extend similarity search algorithms to work in settings where data is distributed (using a communication complexity perspective) or uncertain (using a statistical perspective).

Semantic Search

Alexis Sanders as an SEO Account Manager at MERKLE | IMPAQT wrote a blog on semantic search:

The word "semantic" refers to the meaning or essence of something. Applied to search, "semantics" essentially relates to the study of words and their logic. Semantic search seeks to improve search accuracy by understanding a searcher’s intent through contextual meaning. Through concept matching, synonyms, and natural language algorithms, semantic search provides more interactive search results through transforming structured and unstructured data into an intuitive and responsive database. Semantic search brings about an enhanced understanding of searcher intent, the ability to extract answers, and delivers more personalized results. Google’s Knowledge Graph is a paradigm of proficiency in semantic search.

Personalized Search

Personalized search fetches results and delivers search suggestions individually for each of its users based on their interests and preferences, which are mined from the information that the search engine has about the user at the given time, such as their location, search history and demographics, much as recommenders do.

And here the search engine and the recommender system coincide, except that the recommender system pushes some items in order to attract the users' attention, while the search engine recalls the information that the users already have in mind.

Learning from User Interactions

When users interact with online services (e.g. search engines, recommender systems, conversational agents), they leave behind traces of interaction patterns. The ability to record and interpret user interaction signals and understand user behavior gives online systems a vast treasure trove of insights for improving user engagement and experimentation.

TrustRank

Information Distribution: Search Engine Results Page

Information Distribution Methods – Information distribution is the timely collection, sharing and distribution of information to the project team. Methods can be portals, collaborative work management tools, web conferencing, web publishing, and when all technology is not available, manual filing systems and hard copy distribution.

Keywords Highlight

Webpage Snapshot

Google Cache normally refers to the copies of web pages cached by Google. Google crawls the web and takes snapshots of each page as a backup just in case the current page is not available. These pages then become part of Google's cache. These Google cached pages can be extremely useful if a site is temporarily down; you can always access these pages by visiting Google’s cached version.

The Google cache is usually updated within a few days. The actual time between updates depends on how frequently the website itself is updated.

Information Retrieval Evaluation

Evaluation is used to measure and enhance the quality of information retrieval results.

Neural Information Retrieval

Neural networks, or deep learning, as a subfield of machine learning, are widely applied in information processing.

During the opening keynote of the SIGIR 2016 conference, Christopher Manning predicted a significant influx of deep neural network related papers for IR in the next few years. However, he encouraged the community to be mindful of some of the “irrational exuberance” that plagues the field today. The first SIGIR workshop on neural information retrieval received an unexpectedly high number of submissions and registrations. These are clear indications that the IR community is excited by the recent developments in the area of deep neural networks. This is indeed an exciting time for this area of research and we believe that besides attempting to simply demonstrate empirical progress on retrieval tasks, our explorations with neural models should also provide new insights about IR itself. In return, we should also look for opportunities to apply IR intuitions into improving these neural models, and their application to non-IR tasks.

Deep Structured Semantic Model

DSSM stands for Deep Structured Semantic Model, or more general, Deep Semantic Similarity Model. DSSM, developed by the MSR Deep Learning Technology Center(DLTC), is a deep neural network (DNN) modeling technique for representing text strings (sentences, queries, predicates, entity mentions, etc.) in a continuous semantic space and modeling semantic similarity between two text strings (e.g., Sent2Vec).

DSSM can be used to develop latent semantic models that project entities of different types (e.g., queries and documents) into a common low-dimensional semantic space for a variety of machine learning tasks such as ranking and classification. For example, in web search ranking, the relevance of a document given a query can be readily computed as the distance between them in that space. With the latest GPUs from Nvidia, we are able to train our models on billions of words.

DSSM: Brief Summary

  • Inputs: Bag of letter-trigrams as input for improving the scalability and generalizability
  • Representations: mapping sentences to vectors with DNN: semantically similar sentences are close to each other
  • Matching: cosine similarity as the matching function
  • Problem: the order information of words is missing (bag of letter-trigrams cannot keep the word order information)

Matching Function Learning:

  • Step 1: construct basic low-level matching signals
  • Step 2: aggregate matching patterns

Deep Relevance Matching Model

It is asserted that

the ad-hoc retrieval task is mainly about relevance matching while most NLP matching tasks concern semantic matching, and there are some fundamental differences between these two matching tasks. Successful relevance matching requires proper handling of the exact matching signals, query term importance, and diverse matching requirements.

A novel deep relevance matching model (DRMM) for ad-hoc retrieval employs a joint deep architecture at the query term level for relevance matching. By using matching histogram mapping, a feed forward matching network, and a term gating network, we can effectively deal with the three relevance matching factors mentioned above.

  • Matching histogram mapping for summarizing each query matching signals
  • Term gating network for weighting the query matching signals
  • Lost word order information (during histogram mapping)

DeepRank: Text Matching as Image Recognition

Calculate relevance by mimicking the human relevance judgement process

  1. Detecting Relevance locations: focusing on locations of query terms when scanning the whole document
  2. Determining local relevance: relevance between query and each location context, using MatchPyramid/MatchSRNN etc.
  3. Matching signals aggregation

Challenges

  • Representation: representing the word level matching signals as well as the matching positions
  • Modeling: discovering the matching patterns between two texts
  • Our solutions
    • Step 1: representing as matching matrix
    • Step 2: matching as image recognition

Matching matrix $M_{ij}=\mathbb I_{w_i=v_j}$ or $M_{ij}= \frac{w_i^T v_j}{\|w_i\| \|v_j\|}$ or $M_{ij}=\left<w_i, v_j\right>$.

$$\fbox{MatchPyramid} =\underbrace{\text{Matching Matrix}}_{\text{Bridging the semantic gap between words}}+\underbrace{\text{Hierarchical Convolution}}_{\text{Capturing rich matching patterns}}$$
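
A minimal sketch of the matching-matrix construction (random vectors stand in for trained word embeddings; the hierarchical convolution stage is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical word embeddings for a query and a document (the vocabulary and
# dimensionality are made up for illustration).
vocab = ["deep", "learning", "for", "text", "matching", "neural", "ranking"]
emb = {w: rng.standard_normal(8) for w in vocab}

query = ["deep", "text", "matching"]
doc = ["neural", "ranking", "for", "text", "matching"]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Matching matrix M_ij = cos(w_i, v_j); MatchPyramid then applies 2-D
# convolutions over this "image" to capture rich matching patterns.
M = np.array([[cosine(emb[q], emb[d]) for d in doc] for q in query])
print(np.round(M, 2))
```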

Tree-based Deep Match(TDM)

The idea of TDM is to predict user interests from coarse to fine by traversing tree nodes in a top-down fashion and making decisions for each user-node pair.

A recommendation tree consists of a set of nodes $N$, where $N = \{n_1, n_2, \dots, n_{|N|}\}$ represents $|N|$ individual non-leaf or leaf nodes. Each node in $N$ except the root node has one parent and an arbitrary number of children. Specifically, each item $c_i$ in the corpus $C$ corresponds to one and only one leaf node in the tree, and the non-leaf nodes are coarse-grained concepts. Without loss of generality, we suppose that node $n_1$ is always the root node. In the example tree from the paper, each circle represents a node and the number of a node is its index in the tree; that tree has 8 leaf nodes in total, each of which corresponds to an item in the corpus. It is worth mentioning that though the given example is a complete binary tree, completeness and binarity are not imposed as restrictions on the type of tree in the model.

Vector Search Engine

A vector search engine (aka neural search engine or deep search engine) uses deep learning models to encode data sets into meaningful vector representations, where the distance between vectors represents the similarity between items.

Weaviate

Weaviate uses vector indexing mechanisms at its core to represent the data. The vectorization modules (e.g., the NLP module) vectorizes the above-mentioned data object in a vector-space where the data object sits near the text ”landmarks in France”. This means that Weaviate can’t make a 100% match, but a very high one to show you the results.

Milvus

As an open source vector similarity search engine, Milvus is easy-to-use, highly reliable, scalable, robust, and blazing fast. Adopted by over 100 organizations and institutions worldwide, Milvus empowers applications in a variety of fields, including image processing, computer vision, natural language processing, voice recognition, recommender systems, drug discovery, etc.

Jina

Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g. text, images, video, audio) on the cloud.

Modeling Diverse Ranking with MDP

Key points:

  • Mimic user top-down browsing behaviors
  • Model dynamic information needs with MDP state

The state $s_t =[Z_t, X_t, \mathrm{h}_t]$ consists of

  • sequence of $t$ preceding documents, $\it Z_t$ and $\it Z_0=\emptyset$;
  • set of candidate documents, $\it X_t$ and $\it X_0 = \it X$
  • latent vector $\mathrm h_t$, which encodes the user-perceived utility from the preceding documents, initialized with the information needs from the query: $\mathrm h_0=\sigma(V_q q)$
| MDP factors | Corresponding diverse ranking factors |
| --- | --- |
| Timesteps | The ranking positions |
| States | $s_t=[Z_t, X_t, h_t]$ |
| Policy | $\pi(a_t\mid s_t=[Z_t, X_t, h_t])=\frac{\exp(\mathrm x^T_{m(a_t)}\mathrm U \mathrm h_t)}{Z}$ |
| Action | Selecting a document and placing it at rank $t+1$ |
| Reward | Based on evaluation measures such as nDCG, SRecall, etc. |
| State Transition | $s_{t+1}=T(s_t, a_t)=[Z_t\oplus \{\mathrm{x}_{m(a_t)}\}, X_t\setminus \{\mathrm {x}_{m(a_t)}\}, \sigma(V\mathrm{x}_{m(a_t)}+W\mathrm {h}_t)]$ |

Here $\mathrm x_{m(a_t)}$ is the document embedding. The model parameters $\Theta=(V_q, U, V, W)$ are optimized to maximize the expected reward. The goal is maximizing the expected return (discounted sum of rewards) of each training query $$\max_{\Theta} v(q)=\mathbb E_{\pi}[G_0]=\mathbb E_{\pi}\left[\sum_{k=0}^{M-1}\gamma^k r_{k+1}\right].$$

Monte-Carlo stochastic gradient ascent is used to conduct the optimization (REINFORCE algorithm) $$\nabla_{\Theta}\hat{v}(q)=\gamma^t G_t\nabla_{\Theta}\log(\pi(a_t\mid S_t;\Theta)).$$


Vertical Domain Search: Beyond String and Texts

As we have learned how to handle text, information retrieval is moving on to projects in sound and image retrieval, along with electronic provision of much of what is now in libraries.

Web Table Extraction, Retrieval and Augmentation

Tables on the web, referred to as web tables henceforth, differ from traditional tables (that is, tables in relational databases and tables created in spreadsheet programs) in a number of ways. First, web tables are embedded in webpages. There is a lot of contextual information, such as the embedding page’s title and link structure, the surrounding text, etc. that can be utilized. Second, web tables are rather heterogeneous regarding their quality, organization, and content. For example, tables on the Web are often used for layout and navigation purposes.

Among the different table types, relational tables (also referred to as genuine tables) are of special interest. These describe a set of entities (such as people, organizations, locations, etc.) along with their attributes. Relational tables are considered to be of high-quality, because of the relational knowledge contained in them. However, unlike from tables in relational databases, these relationships are not made explicit in web tables; uncovering them is one of the main research challenges. The uncovered semantics can be leveraged in various applications, including table search, question answering, knowledge base augmentation, and table completion. For each of these tasks we identify seminal work, describe the key ideas behind the proposed approaches, discuss relevant resources, and point out interdependencies among the different tasks.

Scholar Search Engine

Scholar Search Engine helps to find the digital scholar publication.

For example, Semantic Scholar helps researchers find better academic publications faster. Our engine analyzes publications and extracts important features using machine learning techniques. The resulting influential citations, images and key phrases allow our engine to “cut through the clutter” and give you results that are more relevant and impactful to your work.

Different from general web search engines, a scholar search engine ranks scholarly publications rather than generic web pages; the citation network is used to evaluate the importance of a scholarly article.

Patent Search and Analysis

MACHINE LEARNING FOR HEALTHCARE

Health Information Retrieval: Biomedical and Health Informatics

ChartRequest claims that:

Requesting medical records is vital to your operations as a health insurance company. From workers’ compensation claims to chronic-condition care, insurance companies require numerous medical records—daily. Obtain records quickly and accurately with our medical information retrieval software. ChartRequest offers a complete enterprise solution for health insurance companies—facilitating swift fulfillment and secure, HIPAA-compliant records release.

What is Biomedical and Health Informatics?

Biomedical and health informatics (BMHI) is the field concerned with the optimal use of information, often aided by technology, to improve individual health, healthcare, public health, and biomedical research.

  • Unified Medical Language System (UMLS)
  • Systematized Nomenclature of Medicine--Clinical Terms (SNOMED-CT)
  • International Classification of Diseases (ICD)

We focus on ICD10.

Why is medical information retrieval important?

To health professionals, applications providing an easy access to validated and up-to-date health knowledge are of great importance to the dissemination of knowledge and have the potential to impact the quality of care provided by health professionals. On the other side, the Web opened doors to the access of health information by patients, their family and friends, making them more informed and changing their relation with health professionals.

To professionals, one of the main and oldest IR applications is PubMed from the US National Library of Medicine (NLM) that gives access to the world’s medical research literature. To consumers, health information is available through different services and with different quality. Lately, the control over and access to health information by consumers has been a hot topic, with plenty government initiatives all over the world that aim to improve consumer health giving consumers more information and making easier the sharing of patient records.

Why is medical information retrieval difficult?

Simply, it is because medical information is highly specialized yet critical. It is not easy for patients, the very people who need such information, to understand the semantics of texts full of medical terms.

This can be difficult because health information is constantly changing as a result of new research and because there may be different valid approaches to treating particular conditions.

It is therefore of interest to find out how well web search engines work for diagnostic queries and what factors contribute to successes and failures. Among diseases, rare (or orphan) diseases represent an especially challenging and thus interesting class to diagnose as each is rare, diverse in symptoms and usually has scattered resources associated with it.

How can knowledge bases improve retrieval performance?

Digital Health

The opportunities for interdisciplinary digital health research bringing together computer science to dramatically improve public health, global health and wellbeing of individuals and populations globally are extraordinary.

Multimedia Search Engine

Indexing multimedia is much more complex than indexing text. In some cases the media can be converted to text: broadcast television often includes digital text as closed-captions for the hearing impaired, and scene titles and captions within a video can be converted to text using OCR. Speech-recognition technology can digitize words spoken on audio tracks. Continuous media, such as video, also can be broken up into chunks by transitional effects, for better precision in results. Some groups are also working on form and shape recognition, which could allow searchers to draw a shape, such as a bridge or a tumor; or select an example picture and find others like it.

FGCrossNet
Image Search Engine

Milvus is the world's fastest similarity search engine for massive-scale feature vectors. Built with heterogeneous computing architecture for the best cost efficiency. Searches over billion-scale vectors take only milliseconds with minimum computing resources.

Visual search

Popular sites like Houzz, Pinterest, and LikeThatDecor, have communities of users helping each other answer questions about products in images.

Music Information Retrieval

Music information retrieval (MIR) is an exciting and challenging area of research. Music not only connects people but also relates to many different research disciplines including signal processing, information retrieval, machine learning, musicology, and psychoacoustics. In its beginnings, research in MIR borrowed many ideas and concepts from more established disciplines such as speech processing or computer linguistics. After twenty years, the MIR field has matured into an independent research area that has many things to offer to signal processing and other research disciplines.

Interactive Search

E-commerce Search

According to a recent survey, over 55% of online customers begin their online shopping journey by searching on an E-Commerce website like Amazon as opposed to a generic web search engine like Google. While information retrieval research to date has been primarily focused on optimizing generic search experiences, and search engines in the past decade have improved significantly, not too much attention has been paid to search for E-Commerce. In this talk, I will explore some intrinsic differences between web search and E-Commerce search that makes the direct application of traditional search ranking algorithms to E-Commerce search difficult. In addition, I will present some recent attempts at Etsy to tackle challenges in E-Commerce search.

Multimodal Search

A multilingual, multimodal search and access system for biomedical information and documents. The system allows access to biomedical data:

  • from many sources,
  • analyzing and indexing multi-dimensional (2D, 3D) medical images, with improved search capabilities due to the integration of technologies to link the texts and images to facts in a knowledge base,
  • in a multilingual environment,
  • providing trustable results at a level of understandability adapted to the users.

Princeton CASS: Content-Aware Search Systems

The Content-Aware Similarity Search (CASS) project investigates research issues in searching, clustering, and classification, and management for feature-rich, non-text data types.

Knowledge Graphs

Search is not only on string but also things. Knowledge graphs are large networks of entities and their semantic relationships. They are a powerful tool that changes the way we do data integration, search, analytics, and context-sensitive recommendations. Knowledge graphs have been successfully utilized by the large Internet tech companies, with prominent examples such as the Google Knowledge Graph. Open knowledge graphs such as Wikidata make community-created knowledge freely accessible.

WordNet


Mag[i]

MagiBot (project name Matarael, hereinafter referred to as MagiBot) is Magi’s web crawling program (also known as “spider”). Crawling may be used to refer to MagiBot’s extracting and/or updating process of webpages.

Magi is a machine-learning-based information extraction and retrieval system developed by Peak Labs. Magi can summarize knowledge from natural language texts in any field into structured data, and provide human users as well as other AI an interpretable, retrievable, and traceable knowledge system that can automatically gather and amend the information through lifelong learning.

Labs and Resources

Labs on Search and Information Retrieval

Conferences on Information Retrieval

Courses on Information Retrieval and Search