When starting a new search project, the more you know about the current state of search and about the user and business goals to be achieved, the better the outcome for everyone.
Some clients do not know exactly what they need or want when they say “just make search better”. Here are some questions to think about when meeting stakeholders and the users of the search application to really understand what we should be aiming for.
Understand the current state
The following is a guide to understanding the current state of the search system and to learning how the existing system can be improved.
Understanding your business and user requirements
What business objectives are to be achieved?
At the end of this project, how can you prove that your search project is successful? Consider the following examples:
- Save money? How much or how much more?
- Save time? How much or how much more?
- Increase revenue? How much or how much more?
- Increase user satisfaction? For which users?
- Create advantages over competitors?
- Decrease risk? How much or how much more?
- Any others?
Select the top three, describe in more detail, and rank them in order.
What objectives are not being met with the current state?
A system for finding information is most likely already in place. Why is it unsatisfactory? Describe the current pain points of the existing system. If you were to improve on these, which items in the previous question would be affected?
Which improvements in search behaviour contribute to improved business results?
Select the properties that will have the most impact on the business results:
- Speed at which new content becomes available (near real-time indexing).
- Increased precision (likelihood that the result the user wants is in the top n results).
- Increased recall (completeness of result set returned to the user).
- Speed of result set returned to user.
- Flexibility of handling different types of queries.
- Ability of the system to never deliver zero results.
- Ranking of results for particular queries. For example, when a bank website displays the result set for credit cards, give more weight to promo cards so that they rank higher.
- Reduced effort required for users to find previously unknown content, such as finding similar documents based on the original query.
- Likelihood the user will return to use the search system repeatedly.
Select the top five and rank them in order.
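To make precision and recall measurable goals rather than impressions, they can be computed from a small set of judged queries. The sketch below assumes the common definitions (precision at n as the share of the top n results that are relevant, recall as the share of all relevant documents that were returned); the document IDs and relevance judgments are hypothetical.

```python
def precision_at_n(ranked_ids, relevant_ids, n):
    """Fraction of the top n returned results that are judged relevant."""
    top = ranked_ids[:n]
    if not top:
        return 0.0
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / len(top)


def recall(ranked_ids, relevant_ids):
    """Fraction of all relevant documents that appear anywhere in the result set."""
    if not relevant_ids:
        return 0.0
    found = len(set(ranked_ids) & set(relevant_ids))
    return found / len(relevant_ids)


# Hypothetical judged query: five results returned, three documents known to be relevant.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}

p_at_3 = precision_at_n(ranked, relevant, 3)  # one of the top three is relevant
r = recall(ranked, relevant)                  # two of the three relevant docs were returned
```

Tracking these two numbers for a fixed set of queries before and after a change gives a simple way to show whether the ranked properties actually improved.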
How much (enhanced) control do you need over the results?
The previous question asks for general changes in search behaviour. Here, consider how important direct control of the result set is to the success of the application.
- Context — should the results change based on specific context? For example: time of year, location, device, user profile, other factors?
- Access control — should some documents be available to some, and hidden from others?
- Custom function queries — custom algorithms written to influence the ranking of result set.
End users’ familiarity with content
The behaviour of your search application will be judged by your users. How much do you know about them? And how are your users building their queries? You will need access to your query logs, either the search engine log files or third-party analytics tools, to answer these questions. Consider the following:
- Are users expressing queries in very specific terms or phrases? Or broad, general terms that retrieve broad results?
- Are users spelling the terms correctly?
- Are users searching for known documents (e.g., “find that project document with the terms ‘Cambodia bridge’ in it from May last year”)?
- How many zero-result sets are returned? What terms or phrases are users trying to find? Are there alternative ways to find the same document? Why would they be searching for that particular term?
- If filtering is available, is it being used? If not, what could be the reason?
- Are users specifying quantitative parameters such as distance, time, price, location, as part of their search query?
- Are users familiar with logic-oriented search (e.g., boolean queries such as AND or OR, or wildcard characters) or is it natural language, or a set of keywords?
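Several of the questions above can be answered with a quick pass over the query log. The following is a minimal sketch that assumes the log has already been reduced to (query, result count) pairs; the sample rows and function names are hypothetical, and real log formats will differ by search engine or analytics tool.

```python
from collections import Counter

# Hypothetical log rows: (query string, number of results returned).
query_log = [
    ("credit card", 42),
    ("credt card", 0),
    ("cambodia bridge", 3),
    ("credt card", 0),
    ("loan AND fixed rate", 7),
    ("mortage", 0),
]


def zero_result_report(rows):
    """Return the zero-result rate and the failed queries, most frequent first."""
    failed = Counter(query.lower() for query, hits in rows if hits == 0)
    rate = sum(failed.values()) / len(rows) if rows else 0.0
    return rate, failed.most_common()


def logic_query_share(rows):
    """Share of queries that use an uppercase AND/OR/NOT operator or a * wildcard."""
    if not rows:
        return 0.0

    def uses_logic(query):
        return any(tok in {"AND", "OR", "NOT"} for tok in query.split()) or "*" in query

    return sum(1 for query, _ in rows if uses_logic(query)) / len(rows)


rate, failed_queries = zero_result_report(query_log)
logic_share = logic_query_share(query_log)
```

In this sample, half of the queries return nothing and most of those are misspellings, which points toward spelling support rather than new content.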
Technical characteristics of your search application
This section explores some of the key inputs to consider before architecting and designing your search application. The answers to these questions may change over time as we learn more about the content.
Often, a working prototype is built to validate the assumptions, and it is recommended to have two versions of the set of questions that follow: one for the prototype and one for production. Building a prototype will help you accumulate more experience and familiarity with your content and uncover a fuller range of possibilities.
What format are your documents in?
Data comes packaged in different formats based on where it originated and who created it. Different formats require different levels of interpretation, and different parsing techniques are used to extract the raw text data. Which of the following document format types will you be indexing and sorting?
- JSON documents
- XML documents
- HTML documents
- Microsoft Office documents (please specify MS Office version)
- PDF documents
- CSV or TSV documents
- Open Office documents
- Others (engineering drawings, audio, video)?
How and where are the data and documents stored, independent of format?
Which of the following data repositories is your data stored in? How easy is it to extract the raw data from the original source if we need to clean and enhance the data for search?
- Relational databases (MySQL, etc.)
- Non-relational data stores (NoSQL databases, Hadoop, etc.)
- Open source content management systems (Drupal, WordPress, etc.)
- Proprietary enterprise CMSs (Documentum, OpenText, etc.)
- Traditional directory-oriented filesystems
- Web servers (HTML files)
- REST API (JSON format)
- Triple stores (Virtuoso, etc.)
- XML data stores
How big are your documents?
Configuring your search application requires an understanding of your document sizes, as performance and throughput depend heavily on accounting for the size of documents to be indexed. What is the average file size of your documents?
- < 1KB
- 1KB to 100KB
- 100KB to 500KB
- 500KB to 1MB
- 1MB to 5MB
- 5MB to 10MB
- 10MB to 50MB
- 50MB to 100MB
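If the size distribution is not known up front, it can be estimated from a sample of file sizes. The sketch below sorts sizes in bytes into the ranges listed above, reading the first bucket as "under 1KB"; the "100MB or more" bucket and the sample sizes are assumptions added for illustration.

```python
from bisect import bisect_right
from collections import Counter
from statistics import mean

KB = 1024
MB = 1024 * KB

# Exclusive upper bound of each bucket, mirroring the ranges in the list above.
BOUNDS = [1 * KB, 100 * KB, 500 * KB, 1 * MB, 5 * MB, 10 * MB, 50 * MB, 100 * MB]
LABELS = [
    "< 1KB", "1KB to 100KB", "100KB to 500KB", "500KB to 1MB",
    "1MB to 5MB", "5MB to 10MB", "10MB to 50MB", "50MB to 100MB",
    "100MB or more",  # assumption: catch-all bucket not present in the original list
]


def size_bucket(size_bytes):
    """Map a size in bytes to the label of the range it falls into."""
    return LABELS[bisect_right(BOUNDS, size_bytes)]


# Hypothetical sample of document sizes in bytes.
sizes = [512, 20 * KB, 300 * KB, 2 * MB, 2 * MB, 120 * MB]

histogram = Counter(size_bucket(s) for s in sizes)
average_bucket = size_bucket(mean(sizes))
```

A histogram is usually more informative than the average alone, because a handful of very large files can pull the average far away from the typical document.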
How much new content is added, or existing documents updated, per unit time?
The quality of your search results can be affected by the interval between when a document is complete or ready and when it appears in the index for searching. Describe your content in further detail. Below are some examples:
- Millions of new, very small documents, such as tweets or log files, are added by users or systems when they are created.
- Existing documents are updated either by users or systems.
- New documents are added on a regular schedule.
The second part of this question is the length of the interval between indexing runs for new or updated documents. Select from the options below:
- From minute to minute
- Approximately every 15 minutes
- Up to twice per hour
- Once every four hours
Define the end goals
Can your content use faceting or a taxonomy to support productive navigation and discovery?
Faceted search provides an effective way to drill down and refine your search results. Consider the following when thinking about faceted navigation.
- What metadata currently exists in the documents that can help users narrow down results?
- Are there rules that can be created and applied to derive appropriate attributes? (enhancement)
- Can you use third-party tools to extract entities or build a taxonomy to identify appropriate attributes?
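A quick way to judge whether existing metadata is useful for faceting is to count the values of each candidate field. This is a minimal sketch with hypothetical documents and field names; a search engine would normally compute these counts itself at query time.

```python
from collections import Counter

# Hypothetical documents with metadata that could drive facets.
docs = [
    {"title": "Platinum card", "type": "credit card", "promo": True},
    {"title": "Student card", "type": "credit card", "promo": False},
    {"title": "Home loan", "type": "loan", "promo": False},
]


def facet_counts(documents, field):
    """Count how many documents carry each value of the given metadata field."""
    return Counter(str(doc[field]) for doc in documents if field in doc)


def apply_facet(documents, field, value):
    """Narrow the result set to documents whose field equals the chosen value."""
    return [doc for doc in documents if doc.get(field) == value]


type_counts = facet_counts(docs, "type")
loans_only = apply_facet(docs, "type", "loan")
```

A field whose counts are dominated by a single value, or that is missing from most documents, is a weak facet and may need the rule-based or third-party enrichment mentioned above.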
How will/can the following features benefit your search application?
How will/can the following improve the search experience? Will these features help you get closer to your user or business goals?
Autosuggest
Which fields should autosuggest look at to offer suggestions? Most of the time, it’s the title, but you can add others as well, such as product ID numbers. Think about how users search in the search bar and target those fields for a better autosuggest experience.
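The idea of drawing suggestions from more than one field can be sketched as a sorted prefix lookup. The records, the field names `title` and `product_id`, and both functions are hypothetical; production engines provide dedicated suggester components that handle this at scale.

```python
from bisect import bisect_left

# Hypothetical product records; suggestions come from both title and product ID.
products = [
    {"title": "Travel Rewards Card", "product_id": "TR-1001"},
    {"title": "Travel Insurance", "product_id": "TI-2002"},
    {"title": "Student Card", "product_id": "SC-3003"},
]


def build_suggestions(records, fields):
    """Collect (lowercased, original) pairs from the chosen fields, sorted for prefix lookup."""
    entries = {
        (str(rec[field]).lower(), str(rec[field]))
        for rec in records
        for field in fields
        if field in rec
    }
    return sorted(entries)


def suggest(entries, prefix, limit=5):
    """Return up to `limit` original values whose lowercase form starts with the prefix."""
    key = prefix.lower()
    start = bisect_left(entries, (key,))  # first entry that is >= the prefix
    matches = []
    for lowered, original in entries[start:]:
        if not lowered.startswith(key):
            break
        matches.append(original)
        if len(matches) == limit:
            break
    return matches


entries = build_suggestions(products, ["title", "product_id"])
for_tr = suggest(entries, "tr")
for_trav = suggest(entries, "trav")
```

Here the prefix "tr" matches both a product ID and two titles, which shows how adding a field changes what users see as they type.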
Spell checking
There are several ways to respond when a term, or several terms, are misspelled, including a “did you mean” suggestion.
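One of these scenarios, "did you mean", can be sketched with Python's standard `difflib` module. The vocabulary, the cutoff value, and the function name are assumptions for illustration; search engines typically build their spelling suggestions from the index itself.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; in practice this would come from indexed terms.
vocabulary = ["credit", "card", "mortgage", "loan", "insurance"]


def did_you_mean(query, vocab, cutoff=0.75):
    """Return a corrected query string, or None when no term needed correcting."""
    corrected = []
    changed = False
    for term in query.lower().split():
        if term in vocab:
            corrected.append(term)
            continue
        # Pick the single closest vocabulary term above the similarity cutoff.
        matches = get_close_matches(term, vocab, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])
            changed = True
        else:
            corrected.append(term)
    return " ".join(corrected) if changed else None


suggestion = did_you_mean("credt card", vocabulary)
no_change = did_you_mean("loan", vocabulary)
```

Whether to show the suggestion as a prompt or to rerun the corrected query automatically is a product decision that depends on how often users misspell terms.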
More like this
Can providing documents similar to those matching the user’s query be helpful? This can also be used in recommendation systems.
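As a rough illustration of the idea, the sketch below ranks other documents by how many terms they share with one the user is already looking at, using Jaccard similarity on word sets. This is a deliberately simple stand-in, not how a search engine's "more like this" feature is implemented; the corpus and function names are hypothetical.

```python
# Hypothetical corpus keyed by document ID.
corpus = {
    "a": "fixed rate home loan",
    "b": "variable rate home loan",
    "c": "travel rewards credit card",
}


def tokens(text):
    """Lowercase a text and split it into a set of whitespace-separated terms."""
    return set(text.lower().split())


def more_like_this(doc_id, documents, limit=3):
    """Rank other documents by Jaccard similarity (shared terms / all terms)."""
    source = tokens(documents[doc_id])
    scored = []
    for other_id, text in documents.items():
        if other_id == doc_id:
            continue
        other = tokens(text)
        union = source | other
        score = len(source & other) / len(union) if union else 0.0
        if score > 0:
            scored.append((other_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


similar = more_like_this("a", corpus)
```

Documents with no terms in common are dropped, so only the other home loan product is offered as similar to document "a".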
Hit highlighting
Can emphasizing terms from the query in the search snippets help users find what they’re looking for? This can be helpful when documents are large and you want to show users how the terms are used in a sentence. Hit highlighting can also be used to bring subtitles out of a document.
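A snippet with emphasized query terms can be sketched with a regular expression. The `<em>` markers, the window size, and the function name are assumptions; the sketch matches whole words only, so it is not a substitute for an engine's analyzer-aware highlighter.

```python
import re


def highlight(text, query, pre="<em>", post="</em>", window=30):
    """Return a snippet around the first hit with every query term in it wrapped."""
    terms = [re.escape(term) for term in query.split()]
    if not terms:
        return text[: 2 * window]
    pattern = re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)
    matches = list(pattern.finditer(text))
    if not matches:
        return text[: 2 * window]  # no hit: fall back to the start of the text

    # Snippet boundaries: `window` characters either side of the first hit.
    start = max(0, matches[0].start() - window)
    end = min(len(text), matches[0].end() + window)

    parts, cursor = [], start
    for match in matches:
        if match.end() > end:
            break
        parts.append(text[cursor:match.start()])
        parts.append(pre + match.group(0) + post)
        cursor = match.end()
    parts.append(text[cursor:end])
    return "".join(parts)


body = "The survey covers the Cambodia bridge project and its funding."
snippet = highlight(body, "cambodia bridge")
```

Matching is done on the full text before the snippet is cut, so a word split at the snippet boundary is never highlighted by mistake.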
Geospatial search
If documents contain geo-coordinates, can we create more personal searches by letting distance from the user influence the ranking of documents?
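One way to let distance influence ranking is to blend the engine's relevance score with a proximity score. The sketch below uses the haversine great-circle formula; the blending weights, the decay function, the `score`, `lat` and `lon` field names, and the coordinates are all assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))


def rerank_by_distance(results, user_lat, user_lon, weight=0.5, scale_km=10.0):
    """Blend each result's relevance score with a proximity score 1 / (1 + d / scale_km)."""
    rescored = []
    for res in results:
        dist = haversine_km(user_lat, user_lon, res["lat"], res["lon"])
        proximity = 1.0 / (1.0 + dist / scale_km)
        blended = (1 - weight) * res["score"] + weight * proximity
        rescored.append({**res, "distance_km": dist, "score": blended})
    return sorted(rescored, key=lambda res: res["score"], reverse=True)


# Hypothetical results: the far branch has the higher text relevance score.
results = [
    {"id": "far_branch", "score": 0.7, "lat": 13.36, "lon": 103.86},
    {"id": "near_branch", "score": 0.6, "lat": 11.56, "lon": 104.93},
]
reranked = rerank_by_distance(results, user_lat=11.55, user_lon=104.92)
```

With equal weighting, the nearby branch overtakes the more textually relevant but distant one; the `weight` parameter controls how strongly location overrides relevance.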
As you can see, many questions need to be asked before a project starts, and the answers will help you steer the project in the right direction.
The more you understand about your users and business objectives, the better the search project will be. You will also have metrics you can use to measure progress and improvements.