Full-Text Search Fundamentals and Terminology
What is it?
Full-Text Searching (or just text search) is a set of SQL techniques for identifying natural-language documents that satisfy a query and, optionally, sorting them by their similarity to it.
Query and similarity are flexible notions that depend on the application. The simplest search treats a query as a set of words and similarity as the frequency of the query words in the document.
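For a concrete taste, here’s a minimal sketch of what this looks like in PostgreSQL’s built-in syntax (other databases expose the same ideas differently):

```sql
-- to_tsvector parses and normalizes the document text;
-- to_tsquery does the same for the query; @@ tests for a match.
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat')
       @@ to_tsquery('english', 'fat & rat');
-- returns true: both query words occur in the document
```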
Kind of like ‘LIKE’?
Nope! The LIKE predicate only matches text against simple wildcard patterns. This type of textual searching has existed in databases for years, but it lacks the essential properties required of modern information systems.
LIKE
- Insufficient handling of words with similar derivatives (e.g. satisfies, satisfy), because there is no linguistic support
- No ranking of results, which is ineffective when there are thousands of matches
- Slow, because there is no index support, so every search must process all documents
Full-Text Search
- Performs linguistic analysis, operating on words and phrases according to a particular language’s rules.
- Preprocesses words and builds an index for later rapid searching. The sketch below contrasts the two approaches.
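Here’s a short sketch in PostgreSQL syntax; the articles table and its columns are hypothetical:

```sql
-- LIKE: a raw substring scan. '%satisfy%' will not match rows that only
-- contain "satisfies", and every row must be examined.
SELECT title FROM articles WHERE body LIKE '%satisfy%';

-- Full-text search: "satisfy" and "satisfies" normalize to the same lexeme,
-- an index over to_tsvector(...) can serve the query, and results are ranked.
SELECT title,
       ts_rank(to_tsvector('english', body),
               to_tsquery('english', 'satisfy')) AS rank
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'satisfy')
ORDER BY rank DESC;
```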
Preprocessing Capabilities
Parsing documents into tokens — Identifying the exact class of each token (e.g. numbers, words, email addresses, complex words) is important to how it is processed. While token classes can be specific to the application, most of the time a predefined set of classes is adequate.
Databases use parsers to pinpoint these: standard parsers are provided for the predefined classes, and custom parsers can be created for esoteric ones.
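PostgreSQL, for instance, exposes its default parser’s decisions through the ts_debug function; this sketch shows the class it assigns to each token:

```sql
SELECT alias, description, token
FROM ts_debug('english', 'Contact admin@example.com about issue 42');
-- Typical classes here: asciiword ("Contact", "about", "issue"),
-- email ("admin@example.com"), and uint ("42").
```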
Converting tokens into lexemes — A lexeme is a string, just like a token, except it has been normalized. Normalization means treating different forms of the same word identically. For example, “Koala” is folded into “koala”. Suffix removal is also common in normalization: “blackmailing” converts to “blackmail”, and “spaceships” to “spaceship”. As you can imagine, this is a big benefit: you can search for a word without having to enter all its potential variants, which can sometimes number in the thousands.
So tokens are raw fragments of the document text, while lexemes are words considered useful for indexing and searching. The conversion is done using dictionaries: standard dictionaries are provided, and custom dictionaries can be created for esoteric lexemes.
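In PostgreSQL, for example, to_tsvector runs this whole pipeline, parsing the text into tokens and normalizing them into lexemes:

```sql
SELECT to_tsvector('english', 'Koala spaceships blackmailing');
-- 'blackmail':3 'koala':1 'spaceship':2
-- Each lexeme is lowercased and suffix-stripped, and carries its word position.
```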
Storing preprocessed documents optimized for searching — For example, representing each document as a sorted array of normalized lexemes. Other helpful information, such as word positions, is often stored as well. Proximity ranking places a document with a dense region of query words at a higher rank than one where the same words are scattered.
So, if an ichthyologist’s PDF compendium on fish classification happens to mention “eels” across multiple sections, it would be ranked lower than a niche article focused on the American eel’s life cycle with the same number of text hits.
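PostgreSQL models this idea with ranking functions; ts_rank_cd (“cover density”) in particular rewards query words that sit close together. A sketch with two made-up snippets:

```sql
-- Both texts contain "american" and "eel" once, but adjacently in the first.
SELECT ts_rank_cd(to_tsvector('english', 'The American eel has a complex life cycle'),
                  to_tsquery('english', 'american & eel')) AS dense,
       ts_rank_cd(to_tsvector('english', 'American rivers host many species; the eel is one'),
                  to_tsquery('english', 'american & eel')) AS scattered;
-- dense scores higher than scattered: the query words form a tighter "cover".
```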
Dictionary Capabilities
Dictionaries allow fine-grained control over token normalization. They can define stopwords that shouldn’t be indexed:
From the English list of stopwords: a, about, above, after, again, against, all, am, an, and, any, are, aren’t, as, at, be, because, been, before
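In PostgreSQL, for example, stopwords are dropped from the resulting lexeme list, though they still count toward word positions:

```sql
SELECT to_tsvector('english', 'in the list of stop words');
-- 'list':3 'stop':5 'word':6
-- "in", "the", and "of" are stopwords: skipped, but positions 1, 2, 4 stay reserved.
```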
They can use Ispell, a spell-checking program that lets typographical errors still resolve to the proper word (like Google’s “did you mean …?”). Ispell also maps synonyms to a single word, as well as different variations of a word to its canonical form.
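In PostgreSQL, for instance, an Ispell dictionary is wired in with CREATE TEXT SEARCH DICTIONARY; this sketch assumes the English Ispell dictionary files are already installed in the server’s tsearch_data directory:

```sql
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = english,   -- english.dict: the word list
    AffFile   = english,   -- english.affix: prefix/suffix rules
    StopWords = english    -- english.stop: stopwords to drop
);
```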
Query expansion is the reformulation of a seed query to improve information-retrieval performance. Using small string-processing languages, dictionaries reduce derived or inflected words back to their word stem. This process, called stemming, does not require that the root word be identical; it may even be a synonym. It simply locates all morphological forms related to the original query text. This equalization of similar word forms is called conflation.
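Stemming is easy to observe in PostgreSQL, where the english configuration uses a Snowball stemmer; note that the stem need not be a real word:

```sql
SELECT to_tsquery('english', 'satisfies'), to_tsquery('english', 'satisfying');
-- both reduce to the stem 'satisfi', so either query matches documents
-- containing "satisfy", "satisfies", "satisfied", or "satisfying".
```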
What is a Document?
Documents are the units of searching in a full-text search. They can be articles, blog posts, emails, and so on. The text-searching engine must be able to parse documents and store associations between lexemes (key words) and their parent document. Later, these associations are used to find documents that contain query words.
Side Note: depending on the database you’re using, a document may be stored as a textual field within a row of a table. Several such fields may even be concatenated together, or spread across multiple tables and assembled by reference. So in some circumstances the document may never actually be stored as a whole in the database.
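In PostgreSQL, for example, such a document can be assembled on the fly from several columns (the table and columns here are hypothetical; coalesce keeps a NULL field from nulling out the whole string):

```sql
-- The "document" is title + body, built at query time and never stored whole.
SELECT title
FROM articles
WHERE to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ to_tsquery('english', 'eel');
```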
Test!
Here’s a quick six-question quiz on full-text searching if you want to gauge your understanding and fortify what you read:
1 — In what ways might LIKE be insufficient for querying?
2 — What is normalization?
3 — What is the difference between tokens and lexemes?
4 — Why is proximity ranking important?
5 — How does query expansion return more results?
6 — Define a Document
I like to test myself at the end to pinpoint exactly what I didn’t truly absorb. It’s easy to think you’ve got something down, but applying it in practice measures exactly what you’ve taken away.