SearchBetter: helping DART find even more search results

Neel Mehta, Harvard CS 2018
 

There are numerous open-source packages that enable text search over custom corpora. A major challenge for these implementations is the zero-result search: users accustomed to Google and other Internet search engines expect every query to return results, but a small corpus often cannot deliver that. Harvard has recently implemented search across a large body of educational content from HarvardX and Harvard YouTube channels. The new Digital Assets for Reuse in Teaching (DART) project, available at dart.harvard.edu, catalogs over 43,000 videos, HTML pages, and assessments used in HarvardX’s online courses. With an open-source implementation of search, DART allows faculty and instructional staff to find and reuse these teaching assets in their on-campus courses.

The problem: zero search results

However, although DART’s 43,000 resources are a large amount in the realm of educational content, Daniel Seaton and Selen Türkay caution that this is a “limited amount of content compared to searching the Internet at large” and that DART’s search engine does “not always return a result, an experience people are unfamiliar with due to the extensive use of Google, which always provides some result.” In fact, we tested DART’s search engine on 1718 example search terms, from alkane to xylem, and got zero results for 23% of them.
The zero-results problem is not limited to educational content search tools like DART. Any small text corpus (a body of text to search over) can have this problem. To demonstrate this, we looked at course description data from other MOOC providers. A dataset of course descriptions from Udacity, which has 172 courses, returned zero hits for 14% of the 138 terms we searched for (we used a subset of our computer-science-related search terms, since Udacity is primarily CS). A similar dataset for edX, with 2,444 courses across many disciplines, returned zero hits for 92% of our 1718 search terms.

So the problem of getting zero search results for certain terms is common across many small corpuses. What can we do about it?

As part of a Harvard undergraduate research-for-credit course, I teamed up with Professor Dustin Tingley, the faculty director of the Vice Provost for Advances in Learning research team (VPAL-Research), and Daniel Seaton, a Senior Research Scientist at VPAL-Research, to tackle the zero-results problem. We built an open-source Python tool — SearchBetter — that DART and other platforms can use to help users find more, and more relevant, search results when searching for specific topics.

Introducing query rewriting

One popular way of getting around the zero-result problem is called query rewriting. Instead of searching for just one term, you replace the term with a set of related terms and search for all of those terms, then combine the results. This broadens the list of search results. More specifically, query rewriting surfaces content that is closely related to the original term, which might be useful to the user but would not show up without query rewriting.
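
As a rough illustration of the pattern (the rewrite and search functions below are hypothetical stand-ins, not SearchBetter’s actual API), a query rewriter can sit in front of any search function that returns a list of hits:

    # Minimal sketch of query rewriting: expand a query into related terms,
    # search for each term, and merge the results (deduplicated, in order).
    # `rewrite` and `search` are hypothetical stand-ins, not SearchBetter's API.

    def rewrite(term):
        # A real rewriter would return lexically or semantically related terms.
        return [term, "sofa", "furniture"] if term == "couch" else [term]

    def search(term):
        # Stand-in for any search engine that returns a list of result IDs.
        return []

    def search_with_rewriting(term):
        results, seen = [], set()
        for related_term in rewrite(term):
            for hit in search(related_term):
                if hit not in seen:
                    seen.add(hit)
                    results.append(hit)
        return results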

Many search engine software packages, from Oracle’s to open-source ones like Whoosh, perform “lexical query rewriting”, which rewrites a given word into a list of words with similar spellings or roots. One example is stemming, which adds alternative grammatical forms of a word; for instance, stemming could rewrite sink to {sink, sinks, sank, sunk}. Lexical query rewriting is useful and easy to implement, but it won’t necessarily make large improvements.
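
Whoosh, for example, ships a helper that generates morphological variations of an English word. A minimal sketch of lexical rewriting built on it (assuming a standard Whoosh installation) might look like this:

    # Lexical query rewriting via Whoosh's built-in English morphology helper.
    # variations() returns a set of alternative spellings/forms for a word;
    # the exact set depends on Whoosh's heuristics.
    from whoosh.lang.morph_en import variations

    def lexical_rewrite(term):
        return sorted(variations(term) | {term})

    # lexical_rewrite("sink") should include forms such as "sink" and "sinks".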

To make larger improvements, we propose using “semantic query rewriting”, which finds entirely new words with similar meanings. For instance, semantic query rewriting could rewrite couch to {couch, sofa, chair, furniture} — which would probably give you many more results than couch alone. Huge search engines like Google and Yahoo already do this, but they rely on troves of user data, which many small search engines, including DART, don’t have.

Instead, our research explored how to implement semantic query rewriting by gathering lists of similar words from external resources such as Wikipedia. This way, even small search engines like DART’s could use semantic query rewriting to reduce the number of search terms that get zero results.

SearchBetter: a free semantic query rewriting tool

We built two semantic query rewriters in Python and released them to the public as the free, open-source Python package SearchBetter:

  • The first rewriter uses Wikipedia’s Categories API to find the terms that Wikipedia says are related to the user’s original search term, or “seed term.” You can find the categories of any Wikipedia article by scrolling all the way to the bottom of the page. For instance, the Wikipedia categories for inertia include classical mechanics, mass, and velocity. (A sketch of how to query this API appears after this list.)

  • Our second rewriter uses a natural language processing framework called Word2Vec to find terms that are semantically related to a seed term. Word2Vec reads over a large body of text (in our case, the first 100 megabytes of text from the English Wikipedia) and builds a multi-dimensional vector for each word it sees. Words with similar meanings, or words used in similar contexts, end up with similar vectors. (A sketch of this approach also follows below.)
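
To give a feel for the Categories-based approach, here is a minimal sketch that pulls an article’s non-hidden categories from the public MediaWiki API and uses them as rewrite candidates. It mirrors the idea, not SearchBetter’s exact implementation:

    # Sketch: fetch a Wikipedia article's (non-hidden) categories and use them
    # as rewrite candidates for the seed term. Mirrors the idea behind the
    # Categories-based rewriter, not SearchBetter's exact code.
    import requests

    def wikipedia_category_rewrite(seed_term):
        response = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={
                "action": "query",
                "prop": "categories",
                "clshow": "!hidden",   # skip maintenance/tracking categories
                "titles": seed_term,
                "format": "json",
            },
        )
        terms = [seed_term]
        for page in response.json()["query"]["pages"].values():
            for category in page.get("categories", []):
                # Titles look like "Category:Classical mechanics"
                terms.append(category["title"].replace("Category:", "").lower())
        return terms

    # wikipedia_category_rewrite("inertia") should include terms such as
    # "classical mechanics", "mass", and "velocity".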
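
And a minimal sketch of the Word2Vec-based approach, using the gensim library and its bundled text8 corpus (roughly the first 100 MB of English Wikipedia text) as a stand-in for our exact training setup:

    # Sketch: train Word2Vec on the text8 corpus and use nearest neighbors in
    # vector space as rewrite candidates. Training parameters are gensim's
    # defaults here, not necessarily the ones we used.
    import gensim.downloader as api
    from gensim.models import Word2Vec

    corpus = api.load("text8")    # downloads the corpus on first use
    model = Word2Vec(corpus)      # training takes a few minutes on a laptop

    def word2vec_rewrite(seed_term, n=5):
        if seed_term not in model.wv:
            return [seed_term]
        neighbors = [word for word, _ in model.wv.most_similar(seed_term, topn=n)]
        return [seed_term] + neighbors

    # word2vec_rewrite("meiosis") should return related cell-biology vocabulary.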

We also built sample search engines that searched over the DART corpus, as well as Udacity’s course list and the edX course list. The code for these search engines is also freely available in SearchBetter, as is code for a generic search engine that others can reuse in their own apps. Note that Udacity’s course data is public, while DART’s and edX’s are proprietary and were only available for our research use.

Testing our rewriters

For DART and edX, we compiled a list of 1718 academic terms from Wikipedia, covering fields including biology, chemistry, physics, and economics; this list is available in SearchBetter as well. For Udacity, which focuses only on computer science courses, we made a separate list of 138 technology terms, which is also freely available.

We ran the search engines on the corresponding lists of terms to see how many search results we got without query rewriting. Then we attached our query rewriters to our search engines and determined how many search results we got with both forms of query rewriting (Categories-based and Word2Vec-based).
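
In outline, the benchmark is just a counting loop; the search and rewrite helpers below are the hypothetical stand-ins sketched earlier, not SearchBetter’s actual API:

    # Sketch of the evaluation: for each term, count hits from the plain search
    # engine and from the same engine with a query rewriter attached.
    def count_hits(terms, rewriter=None):
        counts = {}
        for term in terms:
            queries = rewriter(term) if rewriter else [term]
            hits = set()
            for query in queries:
                hits.update(search(query))    # `search` as sketched earlier
            counts[term] = len(hits)
        return counts

    # Zero-result rate without rewriting:
    # sum(1 for c in count_hits(terms).values() if c == 0) / len(terms)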

Results: SearchBetter effectively fights the zero-result problem

In many cases, we found that the Wikipedia and Word2Vec rewriters can find many more search results for a term that normally gets very few hits. For instance, meiosis only gets three hits on the DART search engine (namely, videos on human reproduction) without rewriting. With the Word2Vec rewriter, meiosis (a technical term for sexual cell division) is rewritten to a set of highly related biology terms, including meiosis, cell division, fertilization, DNA replication, and endocytosis. When we connect the Word2Vec rewriter to the DART search engine, we get 93 hits that cover topics including DNA replication, genetic material, RNA, cell structure, and evolution — material that would likely be useful to someone interested in meiosis. The Wikipedia rewriter rewrites meiosis to {meiosis, cell cycle, cellular processes}, yielding 103 hits.

For each pair of search engine (of which we had three) and query rewriter (of which we had two), we graphed the number of hits before and after rewriting for each term. Generally, the more hits that the query rewriter added, the more effective it was. The graphs are shown in Figure 1 below.

[Figure 1: SearchBetter comparison grid]
Figure 1: for each query rewriter (Wikipedia Categories on the left, Word2Vec on the right), we measured the number of hits before and after rewriting for each search term. We compared these numbers to the control case (dashed line), where there was no rewriting and hence exactly as many hits before as after. This was repeated across all three tested search engines.

Note first that, since our search engines rank results by relevance, finding more results is always better than finding fewer. Even if a query rewriter adds totally useless results, those results would sink to the bottom of the ranking, doing no harm to the user.

The Wikipedia Categories-based rewriter adds a few results on top of what the unaided search engines found, especially for terms that normally got very few results (consider the left side of the Wikipedia/DART graph in Figure 1). The trend lines are barely higher than the control line, which suggests that the Wikipedia rewriter was not that effective. The trend lines are somewhat misleading, though, since they are skewed by some outliers to the right.

Meanwhile, the Word2Vec rewriter shows many more added search results, as shown by the trend lines that are far above the dashed control line. This indicates that it was generally more successful than the Wikipedia-based rewriter. However, future work would have to determine how relevant these additional terms are.

[Figure 2: comparison of SearchBetter hits]

Figure 2: the average number of search results that the DART search engine provides, with and without rewriters. The orange series considers only terms that get zero search results without rewriting and indicates how well the query rewriters address the zero-result problem.

We took a close look at DART to see if we were truly solving the zero-result problem. As Figure 2 shows, DART found an average of 83 results without rewriting, 184 with the Wikipedia Categories rewriter, and 354 with the Word2Vec rewriter. And for those search terms that yielded zero results without rewriting, the Wikipedia rewriter found an average of 106 results, and the Word2Vec rewriter found an average of 75. This is a massive improvement and an indication that our query rewriters genuinely mitigate the zero-result problem. Of course, there is always room for improvement.

Code and documentation available online

As we mentioned earlier, our search engines and query rewriters are free and open-source; you can find them on GitHub at https://github.com/hathix/searchbetter or install them via Python’s “pip” tool. For documentation, see http://searchbetter.readthedocs.io/.

Thanks and looking ahead

I had the privilege of working with Daniel Seaton and Professor Dustin Tingley on this project as part of CS 91r, Harvard’s computer science research-for-credit course. I thank Daniel and Professor Tingley for providing the vision for the project and lots of subsequent technical help. Thanks also to the HarvardX and DART teams for graciously providing the data and resources needed to build and test my libraries.

It’s not very common that undergraduate research gets to make such a big impact in so little time. But DART, as a large, new, user-facing product, makes it easy for undergraduates like me to contribute both theoretical findings and practical software. There are plenty more opportunities for anyone, especially undergraduates, to work at the forefront of educational technology and machine learning with DART.

Disclaimer: Responsibility for the information and views expressed in this blog lies entirely with the author; they are not endorsed in any way by the organizations mentioned in the blog.