Search Deep Dive - Query Expansion

Kranthi Kode
25 min read · Jan 27, 2019

If you are new to the search space, you will benefit from this post about the basics of Search.

Motivation

The original query from the user is often inadequate for fulfilling their information need when Search engines use keyword matching. This is because of the following aspects of our languages:

  • Synonymy: different words can convey the same meaning. This includes
    - traditional synonyms such as “smart” and “clever”
    - abbreviations (eg: tv = television)
    - related terms (eg: “atomic bomb” and “Manhattan project”).
    The synonymy aspect of the language can lead to an issue with Search engines called vocabulary mismatch. When different words are used to describe the same concept in queries and documents, Search engines have trouble matching them. For example, the query “cold medicine” wouldn’t be able to retrieve a document with the title “remedies for runny nose”.
  • Polysemy: a word can mean different things depending on context. The meaning of “book” is different in “text book” and “book a hotel room”. Polysemy leads to ambiguous queries.
    Short queries can often be devoid of context. For example, the user’s intention is not immediately clear from “Lincoln”. It probably relates to “Abraham Lincoln” in politics and “Lincoln cars” in the automobile category.

The issues resulting from both synonymy and polysemy are more prevalent in shorter queries. Given that users performing web search use 2.4 words per query on average, addressing vocabulary mismatch and intent ambiguity elegantly is really important in providing a good experience.

Spelling mistakes and different forms of words (eg: “walk”, “walking”, “walker”, etc.) also contribute to issues in keyword matching. But these are easier to address using query refinement, which involves spelling correction, stemming and lemmatization. In this post, we will cover the harder aspects of vocabulary mismatch and ambiguity of queries.

Lexical approach

Using a thesaurus would be one way to deal with the issue of synonymy.

Along with the query words, we can also use their synonyms to retrieve and rank the documents.

For example, given the query “bike”, we can identify “bicycle” as a synonym and add it to the query, matching both bikes and bicycles in the corpus. This process of adding new words to the original query is called query expansion. The added words are called expansion words.

In practice, we use lexical databases / knowledge bases (eg: WordNet, ConceptNet and Cyc) that describe more interesting relationships between words than just synonyms. WordNet and ConceptNet are open source (and much more popular) while Cyc is a commercial database.

These knowledge bases can be visualized as a graph of words connected by relationships between them. For example, WordNet contains nouns, verbs, adjectives and adverbs. It groups similar words together into sets called synsets. Each synset represents one sense (meaning), which is described by a definition. A word with multiple meanings will be part of multiple synsets, and WordNet has frequency scores that indicate how commonly the word is used in each of the senses it is associated with. Synsets are connected by relationships such as “is a”, “part of”, “member of”, etc.
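
For a concrete feel of how synsets look, here is a minimal sketch using NLTK’s WordNet interface; it assumes NLTK is installed and the WordNet corpus has been downloaded, and the query word “bike” is just an illustrative choice.

```python
# A sketch of listing WordNet synsets for a query term with NLTK.
# Each synset is one sense; its lemmas are candidate expansion words, and
# lemma counts indicate how commonly the word is used in that sense.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

query_term = "bike"
for synset in wn.synsets(query_term):
    lemmas = [lemma.name().replace("_", " ") for lemma in synset.lemmas()]
    sense_count = sum(lemma.count() for lemma in synset.lemmas()
                      if lemma.name() == query_term)
    print(f"{synset.name()} ({sense_count}): {synset.definition()}")
    print("  candidate expansions:", [w for w in lemmas if w != query_term])
```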

Senses can be used to disambiguate the query words that contribute to the polysemy issue. For example, “plane” is associated with multiple senses like airplane, sheet, and carpenter’s plane. If the query is “woodworking plane”, we can find “woodworking” in the definition of the carpenter’s plane synset. This can help us resolve the meaning of “plane” in the given context.

We can then add “planer” as an expansion term instead of “airplane”.

Another approach would be to compare the synsets of all query words and pick the dominant synset as the intended meaning for the ambiguous words.
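
NLTK also ships a simplified Lesk implementation that applies the definition-overlap idea described above; here is a small sketch (again assuming the WordNet corpus is available). Which sense wins depends entirely on WordNet’s definitions, so results should be sanity-checked before expansion.

```python
# A sketch of resolving the sense of an ambiguous query word from the other
# query words, using NLTK's simplified Lesk algorithm (definition overlap).
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

query = "woodworking plane".split()
sense = lesk(query, "plane", pos=wn.NOUN)   # synset whose definition best overlaps the context
if sense:
    print(sense.name(), "-", sense.definition())
    print("expansion candidates:", [lemma.name() for lemma in sense.lemmas()])
```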

The idea behind ConceptNet is to capture commonsense knowledge: the millions of basic facts and understandings most people possess. ConceptNet contains concepts (words or phrases) connected to each other, similar to synset connections in WordNet. One difference is that ConceptNet has many more relationship types (eg: effect of, location of, property of, capable of, etc.).

The most popular way to choose expansion words using knowledge bases is called Spreading activation. It involves graph traversal starting from the nodes corresponding to words / concepts from the original query. Edge weights (called activations) are multiplied to get the total activation score of a given path. Words from paths with highest activation scores are considered for expansion. Note that ConceptNet connections don’t have weights, but we can borrow the weights from WordNet (ConceptNet relationships form a superset of WordNet relationships). It turns out that spreading activation on WordNet and ConceptNet results in different expansion words that are complementary.
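
As a rough illustration of the traversal, here is a toy sketch of spreading activation; the graph, edge weights and thresholds are made up for the example rather than taken from WordNet or ConceptNet.

```python
# A toy sketch of spreading activation: traverse outward from the query terms,
# multiply edge weights (activations) along each path, and keep the
# highest-scoring reachable words as expansion candidates.
from heapq import nlargest

# Hypothetical weighted relationship graph: node -> {neighbor: edge weight}.
graph = {
    "bike": {"bicycle": 0.9, "motorcycle": 0.6, "wheel": 0.4},
    "bicycle": {"cycling": 0.8, "pedal": 0.5},
    "motorcycle": {"engine": 0.7},
}

def spreading_activation(seeds, graph, max_depth=2, min_score=0.1):
    scores = {}
    frontier = [(seed, 1.0, 0) for seed in seeds]
    while frontier:
        node, activation, depth = frontier.pop()
        for neighbor, weight in graph.get(node, {}).items():
            score = activation * weight          # total activation of this path
            if score < min_score or neighbor in seeds:
                continue
            if score > scores.get(neighbor, 0.0):
                scores[neighbor] = score
                if depth + 1 < max_depth:
                    frontier.append((neighbor, score, depth + 1))
    return scores

print(nlargest(3, spreading_activation({"bike"}, graph).items(), key=lambda kv: kv[1]))
```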

These databases only cover relationships between generic / common words. If your corpus is specialized (eg: medical documents) or you want to increase coverage beyond common words, you can use a specialized knowledge base custom-built for your domain.

Global analysis

Various statistical approaches have been proposed to build custom thesauri using data from the corpus.

The idea of leveraging the whole corpus for query expansion is called Global analysis.

In contrast, if we only use a select few documents from corpus for query expansion, it is called Local analysis. We will cover local analysis in subsequent sections.

One way to gather related words from the corpus is by looking at term co-occurrence. If two terms often appear together within a sentence (or a fixed-size window of words), you can deem them related. For example, “home” and “car” frequently appear with “rental”. Based on this, one can infer that (“home”, “rental”) and (“car”, “rental”) are more closely related than, say, (“food”, “rental”).

If only one of these related words appears in the query, the others can be considered as expansion candidates. We can rank all such candidates by how frequently they co-occur with the query terms and choose the top K words for expansion. Term co-occurrence is one way to generate a correlation between pairs of terms, but it’s not the only one; several other measures (eg: mutual information, Dice coefficient) have also been used in the area of query expansion.
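
Here is a minimal sketch of windowed co-occurrence counting; the toy corpus, window size and raw-count scoring are simplifications of what a production thesaurus builder would use.

```python
# A sketch of building a co-occurrence "thesaurus": count how often term pairs
# appear within a fixed-size window, then rank expansion candidates for a
# query term by how often they co-occur with it.
from collections import defaultdict

corpus = [
    "affordable car rental near the airport",
    "home rental listings in the city",
    "weekly car rental deals",
]

WINDOW = 3  # look at the next WINDOW - 1 tokens after each position
cooccurrence = defaultdict(int)
for doc in corpus:
    tokens = doc.lower().split()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + WINDOW, len(tokens))):
            cooccurrence[tuple(sorted((tokens[i], tokens[j])))] += 1

def expansion_candidates(term, top_k=3):
    scores = {(a if b == term else b): count
              for (a, b), count in cooccurrence.items() if term in (a, b) and a != b}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

print(expansion_candidates("rental"))   # e.g. [('car', 2), ...]
```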

Methods that leverage any of these approaches to capture the association / correlation between word pairs are known as association methods or correlation models. We will use the latter name. Please remember this, because correlation models are very popular and they will show up multiple times throughout this post.

Correlation models suffer from polysemy. If one of the terms in the query has multiple meanings, we will likely have multiple related words for this term in our thesaurus, one set for each meaning of the term. Using all of the related words for expansion without resolving the intended meaning of the term can result in irrelevant results. One way to resolve the intended meaning of the term is by looking at the other words in the query. For example, “herb” and “remedies” co-occur frequently with “medicine”, while “herb” and “cooking” co-occur with “spices”. So, depending on what we have along with “herb” in the query, we can expand the query with either “medicine” or “spices”.

Using phrase co-occurrence whenever we can will yield better results. For example, consider the phrase “waterfall glass”, which refers to a type of glass. Taken separately, “waterfall” refers to a very different entity and “glass” can mean a few different things. So, using each word independently for query expansion may result in adding terms that are not relevant to the original intent. Using phrase co-occurrence may expand the query to “decorative waterfall glass panes”, which preserves the intent.

But using phrases is too strict, and coverage can be very low. This is because users’ queries are not necessarily well structured - the words may not be in the order we expect (eg: “book hotel rooms inexpensive Paris”). So, a workaround is to use bag of words for determining co-occurrence. Another approach would be to consider term co-occurrence not across the whole corpus, but at the topic / category level. Be careful with the bag-of-words approach though; sometimes the word order is very important. For example, user intent is very different between “watch harry potter” and “harry potter watch”.

In general, query expansion using global analysis works better than no expansion. But on its own, it is one of the least effective expansion strategies. One of the reasons is its difficulty in bridging the vocabulary gap, given that it only works in document space and doesn’t consider query space. Also, global analysis can be seen as unsupervised learning, and supervised learning strategies, whenever applicable, usually yield better results when you have a sufficient amount of training data.

But, don’t discount global analysis completely yet. Global analysis is often used alongside many of the other approaches we are going to discuss below.

Relevance Feedback

Relevance Feedback is one of the first frameworks used to carry out query expansion. It consists of the following steps.

  • The results from the user’s original query are presented to the user.
  • The user marks relevant documents in the result set.
  • The words (called expansion words) that help distinguish between the relevant and non-relevant documents, as marked by the user, are identified.
  • New results using the original query + expansion words are presented to the user.

This approach is extremely effective, but has largely fallen out of favor because it requires explicit feedback from users. It has been superseded by techniques, called pseudo relevance feedback, that make use of implicit feedback.

I do see variations of this appear in experiences other than Search from time to time. For example, a popular e-commerce website presented the module below on the product page of a table to receive explicit feedback from users.

As the user “upvotes” round tables in that module, the results are updated with more round tables. Note that I included this only for illustration purposes. This particular module is most likely using a similar-items approach (“more like this”) rather than query expansion based on relevance feedback.

Before moving on to the query expansion techniques that are currently used in Search, I will briefly review popular methods in Relevance Feedback for completeness. Feel free to jump to the next section (pseudo relevance feedback).

The magic of Relevance Feedback is in identifying the expansion words based on the user’s feedback. Intuitively, these are the words that occur more frequently in relevant documents compared to non-relevant ones. That’s the idea behind the Rocchio algorithm (1971), one of the earliest popular algorithms used to identify expansion words.

The Rocchio algorithm was introduced when the Vector Space Model was the popular way to carry out Search. In the vector space model, both queries and documents are represented by vectors. For example, the vector of a document may contain the tf-idf score of each term in the vocabulary for that document.

To get a vector representation of words that occur more frequently in relevant documents compared to non-relevant ones, you can simply subtract the centroid of the non-relevant document vectors from the centroid of the relevant document vectors.

C_r is the set of relevant documents and C_nr is the set of non-relevant documents. d is the document vector.

Adding the resulting vector to the original query vector gives you the expanded query vector. You probably want to give more weight to positive feedback (relevant documents) than negative feedback, so Rocchio is often parameterized with weights (typically, α = 1.0, β = 0.75, and γ = 0.15).
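
Putting this together, the standard parameterized form of Rocchio (in the notation above, with q as the original query vector) is:

\vec{q}_{expanded} = \alpha \, \vec{q} \;+\; \frac{\beta}{|C_r|} \sum_{\vec{d} \in C_r} \vec{d} \;-\; \frac{\gamma}{|C_{nr}|} \sum_{\vec{d} \in C_{nr}} \vec{d}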

Don’t worry if the vector space model seems very abstract. The rest of the post covers techniques that are much more intuitive.

Language models

Natural languages are not designed. They evolve over a long period of time and hence, they are hard to describe with a set of rules. Statistical language models attempt to capture the essence of a language by looking at examples of text.

Given a sequence of words, language models can estimate how likely that sequence is to form a valid (sub)sentence. Also, given a bunch of words, we can obtain the probability that a particular word would appear along with those words. These capabilities lend language models a central role in a wide range of NLP tasks.

A number of schemes based on language models (called Relevance Models, or RMs in short) have been proposed for information retrieval / Search ranking as well. It turns out that they are very effective in identifying expansion words from relevant documents. The most popular scheme among them is known as RM3.

In RM schemes, we build a language model using the text from relevant documents. This language model is called the relevance model (RM). The idea is to use this language model to compute the probability of occurrence for each expansion word candidate and choose the few candidates with the highest probabilities for expansion. We don’t treat all relevant documents equally. We weight them based on their relevance to the original query, as determined by a measure called the query likelihood model, P(Q|D_i).

D_i is the i’th relevant document.
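
Written out, the standard relevance model estimate assigns each candidate word w a probability proportional to its likelihood in each relevant document, weighted by that document’s query likelihood:

P(w \mid RM) \;\propto\; \sum_{i} P(w \mid D_i) \, P(Q \mid D_i)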

To build the query likelihood model, let’s use the unigram model, the simplest language model out there. While we are at it, let’s also assume that words appear independently of each other.

P(q_j|D_i) is the probability of seeing the j’th word of the query in the i’th relevant document.

In the unigram model, this probability is equal to the number of times the word appears in the document divided by the total number of words in the document, as per the maximum likelihood estimate.
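
Under these assumptions, the query likelihood factorizes into per-word probabilities, each estimated by maximum likelihood from the document:

P(Q \mid D_i) = \prod_{j} P(q_j \mid D_i), \qquad P(q_j \mid D_i) = \frac{count(q_j, D_i)}{|D_i|}

where |D_i| is the total number of words in the document.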

If a word from the query is not in the document, i.e. P(q_j|D_i) = 0 for any word, P(Q|D_i) becomes 0 as well. To avoid this, we assign some probability even to the words that don’t appear in the document. This is called smoothing, and the most popular strategy is known as Dirichlet smoothing. In this approach, instead of just using the word count from the document, we also use the number of times the word appears in the entire corpus. Both counts are combined using suitable weighting factors.

The weighting factor μ typically ranges from 1000 to 2000. P(w | D_i) also follows the same pattern.
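
Concretely, the standard Dirichlet-smoothed estimate combines the document count with the corpus-level probability P(w | C), where C is the entire corpus:

P(w \mid D_i) = \frac{count(w, D_i) + \mu \, P(w \mid C)}{|D_i| + \mu}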

We can now obtain the probabilities to rank the expansion word candidates by substituting these in P(w | RM). This is the first version of relevance models, called RM1. If we also consider the “relevance” between the expansion word candidates and the original query, we get better results. Combining RM1 and the original query model in a linear interpolation is called RM3.
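
The RM3 interpolation, with a mixing weight λ that is typically tuned on held-out queries, is:

P_{RM3}(w) = \lambda \, P(w \mid Q) + (1 - \lambda) \, P(w \mid RM)

where P(w | Q) is the maximum likelihood estimate of w from the original query terms.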

One of the shortcomings of the relevance feedback approach is that it suffers from presentation bias. If our Search algorithm was only able to present mediocre results to users and completely missed relevant documents, users wouldn’t have a chance to identify the truly relevant documents in the result set. There is a good chance that the documents selected by users from the mediocre result pool do not yield useful expansion words.

But the major hurdle in using Relevance Feedback in practice is that it requires explicit feedback from users. Since it is unreasonable to expect users to take time to provide feedback for every Search query, researchers came up with various strategies to “approximate” the feedback. One approach, called pseudo relevance feedback, simply assumes that the top K results obtained using the original query are relevant, compared to the rest of the documents. Another approach uses user logs / query logs to mine clicked documents for past queries and treats the clicked documents as relevant ones.

Pseudo Relevance Feedback

This approach also involves a two-stage retrieval process like relevance feedback, with the following steps.

  • The results from the user’s original query are obtained.
  • The top R results from the result set are assumed to be relevant documents.
  • The words (called expansion words) that help distinguish between the “relevant” and non-relevant documents are identified.
  • New results using the original query + expansion words are presented to the user.

As you can see, the only difference compared to relevance feedback is the second step, where we now assume the top R results to be relevant, without requiring users to provide feedback. We can use the same algorithms covered in Relevance Feedback (Rocchio, RM3) for extracting expansion words from the top results.
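
Here is a toy end-to-end sketch of pseudo relevance feedback on top of a naive keyword retrieval function; the corpus, the stopword list, R, K and the simple frequency-based term selection are all simplifications standing in for Rocchio / RM3.

```python
# A toy pseudo relevance feedback loop: retrieve with the original query,
# assume the top R results are relevant, pick the K most frequent non-stopword
# terms from those results as expansion words, and retrieve again.
from collections import Counter

corpus = [
    "city bikes and bicycles for commuting",
    "folding bikes and bicycles on sale",
    "mountain bicycles with front suspension",
    "rental cars near the airport",
]
STOPWORDS = {"and", "on", "for", "with", "the", "near"}

def retrieve(query_terms, top_n=None):
    # Naive keyword matching: score a document by how many query terms it contains.
    scores = [(sum(doc.split().count(t) for t in query_terms), doc) for doc in corpus]
    ranked = [doc for score, doc in sorted(scores, reverse=True) if score > 0]
    return ranked[:top_n] if top_n else ranked

def expand(query_terms, R=2, K=2):
    top_docs = retrieve(query_terms, top_n=R)            # assume the top R are relevant
    feedback_freq = Counter(w for doc in top_docs for w in doc.split()
                            if w not in STOPWORDS and w not in query_terms)
    return query_terms + [w for w, _ in feedback_freq.most_common(K)]

expanded = expand(["bikes"])
print("expanded query:", expanded)       # picks up 'bicycles' from the top results
print("results:", retrieve(expanded))    # now also matches the 'mountain bicycles' document
```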

Since we are only relying on the top R results to carry out query expansion (instead of the whole corpus), pseudo relevance feedback is classified under local analysis. In particular, this approach is also referred to as local feedback in the literature.

In terms of relevance of results, local analysis is more effective than global analysis. Combining local and global analysis provides better results than either of them alone. One attempt to combine them (called local context analysis) involves using the correlation models that we have seen in global analysis, not only on the query, but also on words from the top relevant documents.

We formulated the pseudo relevance feedback / local analysis method as a two-stage retrieval process. But it has been noticed that simply re-ranking the results from the first retrieval using the expanded query is often enough. It produces good enough results compared to carrying out a full second retrieval, which is computationally more expensive. However, this may not be universally true and can very well depend on the corpus you are dealing with.

While pseudo relevance feedback improves the relevance of results for many queries, it also hurts relevance for many others. In particular, pseudo relevance feedback doesn’t work well for difficult queries that return non-relevant documents in the first phase. Even worse, we can’t use it at all when a query returns 0 results despite the corpus containing many relevant documents (a possible scenario if the vocabulary mismatch between the query and the corpus is large).

So, the effectiveness of pseudo relevance feedback depends a lot on the validity of our assumption (that the top R documents obtained in the first phase are relevant). In other words, success with this approach depends on 1) how good our Search algorithm is (at fetching relevant documents in the first phase of retrieval), as well as 2) how well structured our corpus is. Improving the Search algorithm is in our hands, but how about the quality of the corpus?

Here is a (very) hypothetical scenario to demonstrate (2). A user is looking for “bikes”, but every product in your database has either “bikes” or “bicycles” in its title, description, etc., but not both. Pseudo relevance feedback will not be able to expand “bikes” to “bikes or bicycles”, because only the documents containing “bikes” will be retrieved in the first phase. Note that in this simple case a thesaurus can help, since both bikes and bicycles are common words, but the issue is with related words that are not common enough to be covered by thesauri. What can you do if your corpus is skewed in terms of vocabulary, or if the corpus is too small to cover all the terms that users might use in their queries?

Encyclopedia based approach

You can use a different, large, well-structured corpus (eg: Wikipedia) for query expansion. The idea is that you use the external encyclopedia during the first phase of retrieval. Everything else stays the same. The top R documents retrieved from the external encyclopedia are assumed to be relevant, and expansion words are determined based on them. The expanded query is then used to perform Search on your corpus and show the results to users. If your domain is specialized, say the medical domain, you can use a large corpus in your area (eg: something like PubMed).

Note that the lexical approaches we used earlier can also be applied to encyclopedias such as Wikipedia, since entities in encyclopedias are also typically connected by various relationships (eg: hierarchical, synonyms, etc).

Supervised learning

So far, none of the above approaches uses supervised learning. The initial attempts at introducing supervised learning into query expansion involved building a classifier or ranker that helps choose expansion words from a set of candidates. The expansion word candidates are generated using unsupervised approaches as before; only the selection is done using the ML model. One of the common approaches to building such an ML model was to use features around the query and the expansion word candidates to predict the impact of using a candidate word for expansion on a relevance metric such as MAP (Mean Average Precision). These approaches were definitely more effective than just relying on unsupervised learning. However, they require a lot of labeled data for training. To train a good model, we need to identify all of the relevant documents in the corpus for tens of thousands of queries, if not more. This is not an easy undertaking.

How about “approximating” that labeled data with something that we already have? That’s where past user behavior data (called query logs or user logs) comes in. It turns out that such data is invaluable in a wide range of tasks, including query expansion. I briefly alluded to an unsupervised approach that leverages this data in the context of pseudo relevance feedback: we can treat the clicked documents for similar past queries as relevant documents (instead of the top R results) and carry out pseudo relevance feedback. There are many other effective unsupervised as well as supervised learning methods that leverage this data.

Query log based approaches

The correlation models that we saw in global analysis have also been tried on query logs. One approach is to look at term co-occurrence across all “successful” historical queries; queries that resulted in clicks could perhaps be considered successful. The central idea behind this approach is that users, motivated by their information need, formulate queries that are of high quality, i.e. queries that offer a better description of documents than the document titles do. This assumption is certainly questionable, and in certain domains the authors of the documents have more expertise or motivation. For example, in e-commerce, sellers can be expected to provide good-quality titles for the products they list, since they are highly motivated to sell them.

Similar queries

Even if the queries are of high quality, looking for co-occurrence across all queries will lead to the polysemy issue that we saw in global analysis. To get around that, methods that leverage similar queries from user logs have been proposed. The idea is that using correlation models (and other approaches) on only similar queries, instead of the full set of queries, mitigates the polysemy issue.

Grouping similar queries together involves clustering queries based on a similarity measure. To check similarity, we can use any of the following approaches, ordered by strictness of comparison.

  • words from the two queries are compared
  • documents in the top results of both queries are compared
  • clicked documents for both queries are compared
  • similarity of the content of clicked documents is compared
  • categories of clicked documents are compared (if documents in corpus are organized hierarchically)

If we are comparing sets (of words, for example), Jaccard similarity or the Dice coefficient can be used. If we are comparing vector representations, cosine similarity is commonly used, and edit distance is common for comparing queries as sequences of words. These measures are often modified for more effectiveness. For example, when using edit distance to compare sequences of words, the cost of each edit operation can be calculated based on term co-occurrence statistics from document space, query space or both. This modification incorporates word / semantic similarity into the measurement of syntactic similarity done by edit distance.
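
For instance, here is a minimal sketch of comparing queries by the Jaccard overlap of their clicked documents; the click data is made up.

```python
# A sketch of measuring query similarity by the Jaccard overlap of the
# documents users clicked for each query (hypothetical click data).
clicks = {
    "cheap flights paris": {"doc1", "doc2", "doc5"},
    "low cost airfare to paris": {"doc2", "doc5", "doc7"},
    "paris museums": {"doc9"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

query = "cheap flights paris"
for other, docs in clicks.items():
    if other != query:
        print(other, "->", round(jaccard(clicks[query], docs), 2))
```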

Following the success of restricting the scope to similar queries, some methods went one step further by incorporating user sessions. One approach was to limit the correlation models to similar query sessions, instead of all queries or similar queries. Another carried out expansion based on the current session, instead of just the current query.

Session based reformulations

The most effective approaches that leverage user session data are the ones that look at how users reformulated their queries to get satisfying results. For example, let’s take a user session that involved these queries: “AJ1”, “Airjordan”, “Nike Air Jordan 1 Shoes”. Based on the user logs, only the last query was successful (resulted in a click). From this, we can infer that “Nike Air Jordan 1 Shoes” is a better formulation of the user’s intent than “AJ1”, and hence the latter should be expanded to the former. So, when we see “AJ1” again, we know what to replace it with. This can be a simple dictionary lookup, or we can build a supervised ML model based on this training data.
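
A minimal sketch of mining such reformulations from session logs into a lookup table follows; the session format and the “last clicked query in the session” heuristic are simplifying assumptions.

```python
# A toy sketch of mining session based reformulations: within each session, map
# every unsuccessful query to the later query in that session that got a click.
from collections import defaultdict, Counter

# Hypothetical session logs: (query, clicked) pairs in the order they occurred.
sessions = [
    [("AJ1", False), ("Airjordan", False), ("Nike Air Jordan 1 Shoes", True)],
    [("AJ1", False), ("Nike Air Jordan 1 Shoes", True)],
]

reformulations = defaultdict(Counter)
for session in sessions:
    successful = [q for q, clicked in session if clicked]
    if not successful:
        continue
    target = successful[-1]                      # last successful query in the session
    for query, clicked in session:
        if not clicked and query != target:
            reformulations[query][target] += 1   # count how often this rewrite succeeded

# Lookup: expand / rewrite "AJ1" to its most frequent successful reformulation.
print(reformulations["AJ1"].most_common(1))
```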

Query click data

A few methods that use (query, clicked document) pairs were proposed, and they turned out to be even more successful. These approaches leverage information about the clicked documents, such as their titles (unlike the previous methods that only used whether a query resulted in a click or not).

Correlation models were also used on (query, clicked document) pairs. These term co-occurrence methods, which use a bag-of-words approach, are good at deriving word associations. For example, we will be able to learn that the word “mortgage” is associated with “home”. But they are not effective at identifying synonyms, eg: “auto” in “auto wash” can be substituted by “car”. Although we can determine that “auto” and “car” frequently co-occur with “wash”, using co-occurrence alone to tackle synonymy will result in noisy expansion words (“laundry” also frequently co-occurs with “wash”). We also need to take the syntactic structure of the sentences into account for that, which can be accomplished through Statistical Translation Models. Translation models have another advantage over correlation models. Correlation models are purely based on frequency statistics and don’t “learn” from data the way a Machine Learning model does. So, they can suffer from data sparsity issues, unlike ML-based translation models that estimate probabilities which generalize beyond the observed pairs.

Translation methods

Translation-based methods have been very effective in bridging the vocabulary mismatch. Most of the approaches train translation models using (query, clicked document) pairs as training data. It has been shown that translation models outperform correlation models even when the exact same data is used to build both.
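
To make the idea concrete, here is a compact sketch of estimating a word translation table from (query, clicked title) pairs with IBM Model 1 style EM; the training pairs are made up, and a real system would train on millions of pairs with additional smoothing.

```python
# A compact sketch of learning P(query word | document word) from
# (query, clicked title) pairs with IBM Model 1 style EM (toy data).
from collections import defaultdict

pairs = [
    ("car wash near me".split(),    "auto wash and detailing center".split()),
    ("cheap car insurance".split(), "affordable auto insurance quotes".split()),
    ("bike shop".split(),           "bicycle store and repairs".split()),
]

doc_vocab = {w for _, title in pairs for w in title}
t = defaultdict(lambda: 1.0 / len(doc_vocab))        # uniform initialization of t[(q, d)]

for _ in range(20):                                  # EM iterations
    counts, totals = defaultdict(float), defaultdict(float)
    for query, title in pairs:
        for q in query:                              # E-step: fractional alignment counts
            norm = sum(t[(q, d)] for d in title)
            for d in title:
                c = t[(q, d)] / norm
                counts[(q, d)] += c
                totals[d] += c
    for (q, d), c in counts.items():                 # M-step: renormalize per document word
        t[(q, d)] = c / totals[d]

# Document words d with high P("car" | d) are candidate expansions for "car".
print(sorted(((round(t[("car", d)], 3), d) for d in doc_vocab), reverse=True)[:3])
```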

There is one caveat in using existing translation models as is - user queries are often not well structured (words can be jumbled), as pointed out earlier. So, we need “relaxed” translation models for query expansion. Different types of translation models have been tried and, interestingly, some of them tend to produce complementary expansion words. The best results are obtained when the outputs of different translation models are combined using a ranker.

Using (query, clicked document) pairs alone in the above methods can result in a presentation bias similar to the one we discussed in the relevance feedback section. If a set of relevant documents was never presented to users for a query, those documents would never amass clicks. Another issue is that we may not have enough click data for queries with low traffic volume (tail queries). These limitations can be addressed by grouping together the clicked documents of similar queries.

In general, user log based approaches have been highly successful and I think there are three main reasons for that.

  1. Strong user signal:
    - User clicks provide a reasonably strong signal of relevance.
    - A user’s in-session reformulations, followed by a click on the subsequent query, also indicate a successful reformulation.
  2. Bridging query space and document space, i.e. the vocabulary mismatch
    - We are using data from queries and the corresponding documents to build the models (eg: translation models).
    - Although we don’t directly use document titles in session based reformulation, we only use a successful query, which would have been similar to the title of the document that was clicked on.
  3. Supervised learning, which typically outperforms unsupervised methods as training data gets bigger.
    - Query, clicked document pairs can be used for training translation models.
    - Original and in-session reformulated queries can be used for training a generative model.

The best part is that the session based reformulations and the query click data based approaches are complementary. The former methods are good at identifying indirectly related refinements, while the latter are effective in identifying synonyms (substitutions). Together, they form one of the most effective approaches used in the field of query expansion to date.

Query click bipartite

The query click data can be modeled as a bipartite graph: a graph where nodes are either queries or documents, and there is an edge between a query node and a document node if that document was clicked by a user for the query. The edges typically have weights that correspond to the number of clicks the document received. The expansion methods that leverage query click data can be formulated as random walks on this graph.
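
Here is a small sketch of one step of such a walk on a toy click bipartite graph: go from the query to its clicked documents and back to the other queries that clicked them, then treat the queries reached this way (or their terms) as expansion candidates. The graph and click counts are made up.

```python
# A sketch of propagating scores on a query-click bipartite graph:
# query -> clicked documents -> other queries that clicked the same documents.
from collections import defaultdict

# Hypothetical click graph: query -> {document: click count}.
clicks = {
    "bikes":         {"doc_bicycle_store": 5, "doc_mountain_bikes": 2},
    "bicycles":      {"doc_bicycle_store": 7},
    "kids bicycles": {"doc_bicycle_store": 1, "doc_kids_toys": 4},
    "rental cars":   {"doc_car_rental": 9},
}

def related_queries(query, steps=1):
    scores = {query: 1.0}
    for _ in range(steps):
        next_scores = defaultdict(float)
        for q, score in scores.items():
            total = sum(clicks[q].values())
            for doc, n in clicks[q].items():
                p_doc = score * n / total                      # query -> document step
                co_clickers = {q2: c[doc] for q2, c in clicks.items() if doc in c}
                denom = sum(co_clickers.values())
                for q2, n2 in co_clickers.items():             # document -> query step
                    next_scores[q2] += p_doc * n2 / denom
        scores = next_scores
    scores.pop(query, None)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(related_queries("bikes"))   # 'bicycles' should come out on top
```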

We can generalize the graph further by connecting nodes with different types of edges. For example, we can connect nodes of similar queries. With enough edge types, we can implement any method that uses query logs as a constrained graph traversal. Such graphs can be very useful for researching new methods, but they are not popular in practice.

Word Embeddings

We started off this post with the following statement.

The original query from the user is often inadequate for fulfilling their information need when Search engines use keyword matching.

How about tackling vocabulary mismatch by getting rid of the root cause i.e. keyword matching? That’s where word embeddings come in.

Word embeddings are dense vectors of real numbers used to represent words. These vectors have been shown to capture various relationships between words, and we can use vector operations to exploit these relationships. For example, the vector operation (King - Man + Woman) produces a vector closest to “Queen”. This indicates that the word embeddings were able to capture that the difference between King and Queen is the same as the difference between Man and Woman (gender).

A number of commercial search engines have started using word embeddings in Search. There are many approaches to extend word embeddings to sentences, which can be leveraged to represent queries and document titles as vectors. We can then rank documents in the corpus by the cosine similarity of the query vector with the document vectors. Such a scheme is expected to obviate the need for query expansion. However, reality is a little different, and I will cover the challenges with this approach in a separate post.

If you have a keyword-based Search engine, you can still leverage word embeddings in various sub-components of that platform. For example, you can use the cosine similarity between query and document vectors as an additional signal in your ranking scheme. We can also leverage word embeddings to carry out query expansion. One of the simplest approaches is to choose words from the vocabulary that are similar to the query terms, based on the cosine similarity of their word embeddings. The effectiveness of this strategy depends on the methodology behind the construction of the embeddings.
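
A small sketch using gensim’s pre-trained vectors to pick expansion words by cosine similarity; the model name here is one of gensim’s downloadable vector sets and can be swapped for embeddings trained on your own corpus, and the thresholds are arbitrary.

```python
# A sketch of embedding-based query expansion: for each query term, add its
# nearest neighbors by cosine similarity of the word vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # any pre-trained KeyedVectors works here

def expand_query(query_terms, per_term=2, min_similarity=0.6):
    expanded = list(query_terms)
    for term in query_terms:
        if term not in vectors:
            continue
        for neighbor, similarity in vectors.most_similar(term, topn=per_term):
            if similarity >= min_similarity and neighbor not in expanded:
                expanded.append(neighbor)
    return expanded

print(expand_query(["bike", "rental"]))
```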

There are parallels between the construction of word embeddings and many of the query expansion methods we have discussed before. Using pre-trained word embeddings such as word2vec or GloVe vectors for query expansion is akin to using lexical approaches. Generating custom embeddings for your corpus is similar to carrying out global analysis. Using query clicked document pairs for generating word embeddings in a supervised learning setup can be expected to yield results similar to query log based approaches. We can also use word embeddings in conjunction with any of the earlier approaches. For example, we can carry out pseudo relevance feedback / local analysis on word embeddings of top R results.

Sequence to Sequence Models

We are certainly in deep learning territory now, if we weren’t already (with word embeddings). Approaches that map a sequence of words to another sequence, such as session based reformulations and translation methods, can be implemented using sequence to sequence models.

Conclusion

It is understandable that the effectiveness of a query expansion method is tied to the nature of the search engine and the corpus it is used on. What is less obvious is that its success also depends on the query itself. In other words, any query expansion method can be expected to improve the results of some queries while worsening those of a few others. For this reason, along with precision and recall measures averaged across thousands of queries, researchers started using robustness (the ratio of the number of queries that were positively impacted by expansion to the number that were impacted negatively) to measure the effectiveness of query expansion methods.

Interestingly, the queries where one method performs poorly may be handled well by a different method. So, the ideal query expansion framework calls for a classifier that invokes the best expansion method for a given query. Better yet, let all methods generate expansion candidates and have a ranker weigh them based on the query. Luckily, with recent advancements in deep learning, we don’t have to hand-design each part of the query expansion framework. We can build a neural network that serves as an end-to-end query expansion framework.

We will cover such a design in the post about query guidance. Query guidance and query expansion have many common elements. I think that this framework can be used to address both expansion and query recommendation.

Here is the timeline of papers describing some of the methods discussed in this post. Each dot represents one of the methods appearing in a paper. The lines connecting dots indicate ensembling.

References

  1. Search Engines: Information Retrieval in Practice, book by Prof. Croft et al.
  2. Relevance Feedback Lecture
  3. Introduction to Statistical Language Modeling and Neural Language Models
  4. Query Expansion Using Lexical-Semantic Relations
  5. Query Expansion Using Local and Global Document Analysis
  6. Assessing Thesaurus-Based Query Expansion Using the UMLS Metathesaurus
  7. Relevance-Based Language Models
  8. Query Clustering Using User Logs
  9. Probabilistic query expansion using query logs
  10. Query Expansion using Associated Queries
  11. Relevant Term Suggestion in Interactive Web Search Based on Contextual Information in Query Session Logs
  12. ConceptNet — a practical commonsense reasoning tool-kit
  13. An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases
  14. Concept-Based Interactive Query Expansion
  15. Improving the Estimation of Relevance Models Using Large External Corpora
  16. Query Expansion with ConceptNet and WordNet: An Intrinsic Comparison
  17. Multi-term Web Query Expansion Using WordNet
  18. PFP: Parallel FP-Growth for Query Recommendation
  19. Selecting Good Expansion Terms for Pseudo-Relevance Feedback
  20. Mining Term Association Patterns from Search Logs for Effective Query Reformulation
  21. Translating Queries into Snippets for Improved Query Expansion
  22. Query expansion with conceptnet and wordnet: An intrinsic comparison
  23. A Term Dependency-Based Approach for Query Terms Ranking
  24. Clustering Query Refinements by User Intent
  25. Generalized Syntactic and Semantic Models of Query Reformulation
  26. Query Rewriting using Monolingual Statistical Machine Translation
  27. A Boosting Approach to Improving Pseudo-Relevance Feedback
  28. Tapping into Knowledge Base for Concept Feedback: Leveraging ConceptNet to Improve Search Results for Difficult Queries
  29. Effective Query Formulation with Multiple Information Sources
  30. Improving Search via Personalized Query Expansion using Social Media
  31. A Survey of Automatic Query Expansion in Information Retrieval
  32. Towards Concept-Based Translation Models Using Search Logs for Query Expansion
  33. Learning Lexicon Models from Search Logs for Query Expansion
  34. Word Sense Disambiguation Improves Information Retrieval
  35. Diversified query expansion using ConceptNet
  36. Query Expansion Using Path-Constrained Random Walks
  37. Thesaurus-Based Automatic Query Expansion for Interface-Driven Code Search
  38. Improving Query Expansion Using WordNet
  39. Query Expansion with Freebase
  40. Query Understanding through Knowledge-Based Conceptualization
  41. Condensed List Relevance Models
  42. Improving Query Expansion for Information Retrieval Using Wikipedia
  43. Learning to Rewrite Queries
  44. Learning for Efficient Supervised Query Expansion via Two-stage Feature Selection
  45. A Comparison of Deep Learning Based Query Expansion with Pseudo-Relevance Feedback and Mutual Information
  46. Query Expansion with Locally-Trained Word Embeddings
  47. Using Word Embeddings for Automatic Query Expansion
  48. Query Expansion Using Word Embeddings
  49. Task-Oriented Query Reformulation with Reinforcement Learning
  50. Query Expansion Techniques for Information Retrieval: a Survey
  51. Conversational Query Understanding Using Sequence to Sequence Modeling
  52. Improving Retrieval Performance Based on Query Expansion with Wikipedia and Text Mining Technique
  53. A Prospect-Guided global query expansion strategy using word embeddings
  54. Word Embedding Based Query Expansion

