This article is one in a series on designing the search experience.
If the search interface is the most visible and familiar element of the search experience, then the quality of the indexed content is the most overlooked. Good quality content is critical for creating amazing search experiences. The challenge, it seems, is first being aware of the correlation.
Search still has a ‘magic box’ appeal to many in the enterprise. They feel that search should work with any content, and they often cite Google as proof. Little do they know that web publishers these days work hard to create good quality, web-friendly content so that it can rank high in Google search results.
For example, Google rewards pages that follow good HTML markup practices such as using the <title> and <h1> tags for structure and those that use schema.org tags for marking up data in the pages such as <Place> and <PostalAddress> for describing locations. In fact, there are over 500 types listed in schema.org with new ones regularly added.
Comparatively, in the enterprise, much of the content is messy. Structure and markup are nonexistent, and content is hidden away in PDF and Word documents. Users looking for revenue projections for 2017 may see an Annual Report PDF file suggested, not because the search algorithm has gone bonkers, but because there is a table in the PDF document listing the next year’s revenue projection figures.
A chart with an accompanying data table is a much better way of responding to the query on revenue projections. But you will have to draw the data out from the PDF or connect directly to the financial system housing the data. You would go through this trouble if getting revenue projections was a designated top task—important information that many people in the organisation need. You would ensure that the content correctly represents such tasks.
The process of optimising content to meet search tasks is sometimes referred to as content relevancy modelling. It involves cleaning and enriching the content.
You can think of search content as a table of rows and columns. Each row is a resource. It could be a single document such as the Annual Report, or a piece of information such as the Revenue Projections for 2017. The columns are metadata such as Publish date or Financial year. The entire table is called a collection. You will have many such collections in a search project.
Collection of Annual Reports
| Name | Financial year | Publish date | URL | ... |
|---|---|---|---|---|
| Annual Report 2016 | 2016 | 01 Dec 2016 | http |
Collection of revenue projections
| Revenue projection ($) | Line of business | Financial year | ... |
|---|---|---|---|
The collections can be metadata-rich or metadata-poor. Metadata-poor collections are particularly problematic, especially for document or text collections. They present walls of text without any hooks to latch on to. However, it is not about just the number of metadata columns; it’s about having the right metadata columns to meet search tasks. A content relevancy pipeline becomes necessary in such situations. The source content is transformed in the pipeline before it goes into the search index.
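To make the pipeline idea concrete, here is a minimal sketch in Python. The field names and transform rules are illustrative assumptions, not taken from any particular search platform:

```python
# A minimal content relevancy pipeline: each step takes a record (a dict)
# and returns a cleaned or enriched copy. Field names are illustrative.

def standardise_year(record):
    # Cleaning: normalise "FY2016" / "2016" variants to a plain integer year.
    raw = str(record.get("financial_year", "")).upper().replace("FY", "").strip()
    record["financial_year"] = int(raw) if raw.isdigit() else None
    return record

def add_display_name(record):
    # Enriching: derive a new column from existing ones.
    record["display_name"] = f"{record['name']} ({record['financial_year']})"
    return record

PIPELINE = [standardise_year, add_display_name]

def run_pipeline(record):
    for step in PIPELINE:
        record = step(dict(record))  # copy so the raw source stays untouched
    return record

doc = {"name": "Annual Report", "financial_year": "FY2016"}
indexed = run_pipeline(doc)
```

Only the transformed copy goes into the search index; the source system keeps the original record.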
Cleaning a collection involves filling empty cells, adding new rows, removing redundant rows, or standardising the format of the data, such as telephone number and address fields.
314 Tanglin Road, Singapore 247977
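Standardisation rules like these are straightforward to script. A sketch for normalising Singapore-style phone numbers follows; the +65 prefix and eight-digit rule are illustrative assumptions:

```python
import re

def standardise_phone(raw, country_code="+65"):
    """Normalise a Singapore phone number to '+65 XXXX XXXX' (illustrative rule)."""
    digits = re.sub(r"\D", "", raw)          # strip spaces, dashes, brackets
    if digits.startswith("65"):              # drop a leading country code
        digits = digits[2:]
    if len(digits) != 8:
        return None                          # flag for manual review, don't guess
    return f"{country_code} {digits[:4]} {digits[4:]}"
```

Returning `None` for malformed values lets the pipeline route them to a review queue instead of indexing bad data.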
Enriching a collection involves adding new columns based on search needs. For example, adding a % change YoY column on the Revenue projections table to report percentage change year-on-year. This addition requires some computation or processing.
| Revenue projection ($) | % change YoY | Line of business | Financial year | ... |
|---|---|---|---|---|
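The computation behind the new column is simple arithmetic. A sketch, with invented figures:

```python
def pct_change_yoy(actual_last_year, projected_next_year):
    """% change YoY: (projection - last actual) / last actual * 100."""
    return round((projected_next_year - actual_last_year) / actual_last_year * 100, 1)

# Illustrative row from the revenue projections collection.
row = {"line_of_business": "Retail", "financial_year": 2017,
       "revenue_projection": 1_150_000}
actual_2016 = 1_000_000
row["pct_change_yoy"] = pct_change_yoy(actual_2016, row["revenue_projection"])
```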
Cleaning a collection is pretty straightforward but enriching one is challenging. There are two parts to it:
- Figuring out which columns to add
- Ensuring the terms in these columns are accurate and standardised
The good news is that we don’t have to do all the enriching by hand; there are well-developed algorithms or modules to do it automatically. These capabilities fall under the growing discipline of Text Analytics (Tom Reamy's book, Deep Text, offers a good foundation on the subject). The more well-formed your content gets, the easier it becomes to use text analytics to automatically enrich new content. The effort you put in now will pay off again and again later on.
Some text analytics modules include:
- Rules-based processing
- Named Entity Recognition
- Summarisation
- Auto-classification
Rules-based processing

This module carries out computations based on predefined rules. For example, the % change YoY column above was computed by a rule that compared last year’s actual revenue figures with the next year’s projections. You can create such rules for both data and text. For instance, a rule can scan free-text comments to isolate those that mention "charging" or "powering up" problems.
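The free-text rule just mentioned could be sketched as a regular-expression filter; the pattern is an illustrative assumption, not an exhaustive one:

```python
import re

# Rule: flag free-text comments that mention charging / powering-up problems.
CHARGING_RULE = re.compile(r"\b(charg\w*|power(ing)?[- ]?up)\b", re.IGNORECASE)

def flag_charging_issues(comments):
    """Return only the comments that match the charging-problem rule."""
    return [c for c in comments if CHARGING_RULE.search(c)]

comments = [
    "Battery stopped charging after a week",
    "Screen is too dim outdoors",
    "Trouble powering up after the update",
]
flagged = flag_charging_issues(comments)
```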
Named Entity Recognition
This module can analyse text for named entities such as names of people, places, organisations, monetary values, etc. This process is called Named Entity Recognition or NER.
But how do we know that the extracted entities are accurate? For example, is the term USA the same as the United States of America? To check, you look it up in a Reference Store.

A Reference Store houses taxonomies (relationships) and dictionaries (values) that you can use to validate the terms the NER module finds in the text. Look up 'USA' in the Reference Store and it will tell you that the preferred term is 'United States of America'. This way, you can ensure that new terms are accurate and standardised.
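A minimal sketch of such a lookup, with a hard-coded dict standing in for a real taxonomy or dictionary service:

```python
# Toy Reference Store: variant term -> preferred term. In practice this would
# be backed by a managed taxonomy service, not a dict literal.
REFERENCE_STORE = {
    "usa": "United States of America",
    "u.s.a.": "United States of America",
    "united states": "United States of America",
    "united states of america": "United States of America",
}

def validate_entity(term):
    """Return (preferred_term, is_known). Unknown terms get flagged for review."""
    preferred = REFERENCE_STORE.get(term.lower().strip())
    if preferred is None:
        return term, False   # new entity: submit to the Reference Store owners
    return preferred, True
```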
Summarisation

This module can create a summary of long text. It uses the most informative sentences to build an abstract that is representative of the full document. Summarisation is handy for long legacy documents that lack an extract or abstract that search can leverage.
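A naive extractive summariser can be sketched in a few lines: score each sentence by how many frequent words it contains, then keep the top scorers in their original order. This is a toy illustration, not a production algorithm:

```python
import re
from collections import Counter

def summarise(text, max_sentences=2):
    """Naive extractive summary: keep the sentences richest in frequent words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)   # crude short-word filter

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in top)  # keep original order

text = ("Search quality depends on content quality. "
        "The weather was pleasant. "
        "Content quality improves search relevance and search satisfaction.")
summary = summarise(text)
```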
Auto-classification

This module can create topics to describe the 'aboutness' of the text. This process is called auto-classification. There are two types of auto-classification: supervised and unsupervised.
In supervised auto-classification, an expert uses terms from the Reference Store to tag a sample of the collection. This sample is called the training set. The module looks for patterns in the training-set text, correlates them with the assigned taxonomic terms, and creates a classification model. This model is then used to auto-classify the rest of the collection.
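The supervised flow can be sketched with a toy keyword-overlap classifier. Real modules use far more sophisticated models, and the training examples below are invented:

```python
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train(training_set):
    """training_set: (text, label) pairs tagged by an expert with Reference
    Store terms. Builds a per-label word-frequency profile as the 'model'."""
    model = {}
    for text, label in training_set:
        model.setdefault(label, Counter()).update(tokens(text))
    return model

def classify(model, text):
    """Score each label by word overlap with its profile; return the best."""
    words = tokens(text)
    scores = {label: sum(profile[w] for w in words)
              for label, profile in model.items()}
    return max(scores, key=scores.get)

model = train([
    ("quarterly revenue and profit figures", "Finance"),
    ("annual revenue projections and budgets", "Finance"),
    ("new hire onboarding and leave policy", "Human Resources"),
    ("staff appraisal and leave entitlements", "Human Resources"),
])
label = classify(model, "revenue projections for next year")
```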
In unsupervised auto-classification, the module relies on lexical and statistical correlations to identify topics automatically. However, the topics will always be words or phrases present in the text itself: it cannot tag a collection of cancer case studies with 'negligence' if that term never appears in the text. In this sense, unsupervised auto-classification is more arbitrary and less standardised. It is better suited to content discovery, where you want to explore possible correlations.
There are other modules in the Text Analytics stack like Sentiment Analysis (inferring if the text conveys positive or negative feelings) that can be used to enrich the collection. The key takeaway here is that you may have to work the content to meet search tasks; you can’t assume that the raw source version will be search-ready.
Let’s take an example and see how content relevancy modelling works in practice.
Let’s say we have a news collection (shown below). Let’s also assume our research found that users look for specific things, like TV shows, celebrities and companies. The source format is too flat to answer such queries, so we need to enrich it.
| Field | Value |
|---|---|
| title | ‘Late Show with Stephen Colbert’: When it debuts, and why we (and Stephen) can't wait |
| content | Tuesday night brings the long-awaited debut of \"The Late Show with Stephen Colbert,\" as the host drops his \"Colbert Report\" persona, and welcomes his first guests, George Clooney and Jeb Bush, and musical director Jon… \r \nThe debut of \"The Late Show with Stephen Colbert\" finally arrives Tuesday night, after what seems like an eternity of Colbert teasing us with clips… |
Based on the requirements we create new columns using text analytics (shown below). The extracted values are first checked with the Reference Store to ensure that they are valid and correctly formatted. If the NER algorithm finds an entity that is not available in the Reference Store (such as ‘Beverly Hilton’), then this can be submitted to the people managing the Reference Store, informing them of new entities discovered in the content.
| Field | Value |
|---|---|
| title | Late Show with Stephen Colbert: When it debuts, and why we (and Stephen) can't wait |
| content | Tuesday night brings the long-awaited debut of "The Late Show with Stephen Colbert," as the host drops his "Colbert Report" persona, and welcomes his first guests, George Clooney and Jeb Bush, and musical director Jon… The debut of "The Late Show with Stephen Colbert" finally arrives Tuesday night, after what seems like an eternity of Colbert teasing us with clips… |
| people | Stephen Colbert, George Clooney, Jeb Bush, Marshall Mathers, David Letterman, Stephen Sondheim, Beverly Hilton * |
| televisionShow | The Late Show |
| facility | Beverly Hilton hotel ballroom |
| organization | Television Critics Association |

\* New entity not yet in the Reference Store.
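Putting the pieces together, the enrichment step for a record like this might look like the sketch below. The extracted entities and the Reference Store contents are hard-coded stand-ins for a real NER module and taxonomy service:

```python
# Illustrative: entities as an NER module might return them, keyed by type.
extracted = {
    "people": ["Stephen Colbert", "George Clooney", "Jeb Bush"],
    "televisionShow": ["The Late Show"],
    "facility": ["Beverly Hilton hotel ballroom"],
}

# Toy Reference Store of known, preferred terms per entity type.
REFERENCE_STORE = {
    "people": {"Stephen Colbert", "George Clooney", "Jeb Bush"},
    "televisionShow": {"The Late Show"},
    "facility": set(),   # 'Beverly Hilton hotel ballroom' is not known yet
}

def enrich(record, extracted):
    """Add entity columns to the record; collect terms missing from the store."""
    enriched, new_terms = dict(record), []
    for field, terms in extracted.items():
        enriched[field] = terms
        new_terms += [t for t in terms
                      if t not in REFERENCE_STORE.get(field, set())]
    return enriched, new_terms   # new_terms go to the Reference Store owners

record = {"title": "Late Show with Stephen Colbert: When it debuts…"}
enriched, new_terms = enrich(record, extracted)
```

The `new_terms` list is how newly discovered entities get submitted to the people managing the Reference Store, as described above.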
As you can see, search now has many hooks it can leverage to answer specific queries. The enrichments help improve search relevancy and satisfaction.
Search is only as good as the quality of the available content. If the content is messy, search can’t magically make sense of it. You need to model the content to meet specific needs. The repertoire of methods described in this article offers an opportunity to design amazingly effective search experiences.