Notes on Search Indexing
An advanced description of the hows and whys of nvUltra’s search index.
- Doing a full text search of all notes in real time is feasible, but increases memory requirements and drive access proportional to size and numbers of files. For users with really large collections of notes, this may become problematic, particularly on iOS devices. I wanted a search system that had lower memory requirements, and was more suitable for other devices in the future, even in use cases that consist of a large number of large documents.
- nvUltra indexes files when a folder is added, or when files within a folder change. This process is relatively fast, but can take a bit of time for large folders, or large files (especially large PDF files).
- The index is cached on disk to speed up opening a folder that has been previously indexed.
- The index is loaded into memory while the folder is in use in order to speed up search.
- In the General preferences, you can clear the cached indexes if you need to for any reason.
Search is focused on trying to balance speed, accuracy, and functionality:
- Speed is accomplished by using a fast search algorithm, as well as using SQLite for the index to speed access. The goal is for search to be effectively “instantaneous” as you type.
- Speed of search is O(n) for number of documents, but does not increase with length of documents (searching long documents is just as fast as searching short documents). Searching 2 documents takes twice as long as searching 1 document.
- The index is designed to have 0% false negative rate (searching for a term will always identify all files that include the term).
- However, one trade-off is that the indexing algorithm does have a non-zero false positive rate (very rarely, you may get a match that does not include the search term.). The indexing parameters are tuned so that this false-positive rate is estimated to be less than 2 in 10,000 documents for a given query.
- Keep in mind that boolean “NOT” inverts the meaning of a query, and could effectively cause a false negative result. For example, searching for “foo” may have 2 in 10,000 false positive rate. This means that “NOT foo” may have a 2 in 10,000 false negative rate.
- Incremental search is supported – for example, “foo” is indexed as “f”, “fo”, and “foo”. This means that as you type “foo”, the search is accurate for each keystroke along the way.
- Text is case folded, and Unicode text is processed prior to indexing and searching. In other words, “Foo” and “foo” are treated as identical. “Bücher” and “bucher” are treated as identical.
- Sequences of Arabic numerals are indexed as words (e.g. searching for “123” will match “1234”)
- Boolean functionality is supported using the keywords “AND”, “OR”, “NOT”, or the corresponding symbols “&&”/“&”, “||”/“|”, or “!”.
- Search terms can be grouped using parenthesis (“(foo OR bar) AND bat”)
- “Multipart” words containing ASCII hyphens or underscores are also split into sub-words for indexing – “foo-bar” and “foo_bar” are indexed in their entirety, as well as indexed by “foo” and “bar” separately
- Tags are supported, both in terms of macOS tags (“Finder” tags), and MultiMarkdown metadata tags. You can search for a specific tag using “tag:foo” or “#foo”. Searching for “foo” will also match documents that are tagged with “foo”.
- Because indexing is performed at the word level, searching for the middle of words is not supported. Searching for “oo” will not match “foo”.
- Regular expressions are not supported. True regular expression support would require access to the full text of a document, which defeats the purpose of using an index.
- Punctuation is not indexed – in order to minimize index sizes, indexing is focused on “words”