Electronic Searching in Discovery
If you use a computer, smartphone, or home IoT device, electronic searches are part of your daily routine. To “search” simply means to look for something, but we more often use the word for the specific, logical procedure a tool executes when we type a word or phrase into a search engine like Google. This is how we search through files in eDiscovery, as opposed to searching by hand or physically reading through paper documents.
When you type in a word or phrase, you expect the search tool to return exactly the results you want. In practice, search tools do not work that uniformly. Certain characteristics of digital information can impede your access to relevant documents. This article explains those characteristics and may help you improve your methods for searching documents for litigation.
Indexed Searches vs. Live Searches
Some searches are fast and some are slow – this can be due in part to the power of the computer or the search method used, but indexed searches are generally faster than live searches. Electronic indexes operate similarly to paper indexes. Instead of reading an entire cookbook front to back to find recipes that use potatoes, you would consult the cookbook index for the word “potato”. A search index will look similar – every word in a set of indexed documents will be stored in the index along with a reference to which documents include that word. The index will also show where in the document the word appears.
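To make the cookbook analogy concrete, here is a minimal sketch of how such an index might be built. The function name `build_index` and the sample documents are hypothetical, and real eDiscovery tools store their indexes in far more elaborate formats.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each word to {document name: [positions where the word appears]}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        # Lowercase the text and split it into simple alphanumeric words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(name, []).append(position)
    return index

# Hypothetical sample documents
documents = {
    "soup.txt": "Peel one potato and simmer the potato in broth",
    "salad.txt": "Toss lettuce with tomato",
}
index = build_index(documents)
print(index.get("potato"))  # {'soup.txt': [2, 6]}
```

Looking up “potato” now takes a single dictionary lookup instead of a read through every document.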
The Windows 10 operating system indexes file names and properties to help users locate files. Apple computers with recent versions of macOS include an indexing tool called Spotlight, which also adds document contents to its index, further improving search performance.
A live search does not use an index but instead reads through every document in the set to find a word or phrase. Live searches are advantageous when you cannot create a thorough content index or only need to search through a few files. Live searches will take longer than indexed searches because they must read through full documents instead of locating documents by their index entry. When searching many files, it is better to index the documents before searching to speed things up. Both live and indexed searches can be run with tools like dtSearch, Elasticsearch, or other common eDiscovery and forensic software.
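By contrast, a live search can be sketched as a loop that reads every document in full each time a search is run. The function name `live_search` and the in-memory documents are assumptions for illustration; a real tool would read files from disk or from a forensic image.

```python
def live_search(documents, keyword):
    """Read every document in full and return the names that contain the keyword."""
    keyword = keyword.lower()
    # Simple substring match: "secret" would also match "secretary".
    return [name for name, text in documents.items() if keyword in text.lower()]

# Hypothetical sample documents
documents = {
    "memo.txt": "Secret - Do Not Distribute",
    "lunch.txt": "Sandwiches at noon",
}
print(live_search(documents, "secret"))  # ['memo.txt']
```

Every search repeats the full read, which is why the cost grows with the size of the document set.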
Noise Words
When you use a search tool, consider whether it ignores “noise words.” Noise words vary by tool, but often a search tool will ignore two-letter keywords, the word “the,” and conjunctions. If you search for the phrase “Catherine the Great,” an indexed search may ignore the word “the” and return all documents containing “Catherine” and “Great.” In most cases this does not cause problems, but it is worth checking to ensure that a search has returned all relevant documents.
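The effect of noise words can be sketched in a few lines. The noise word list below is hypothetical, since each tool ships its own, and the matching logic is a simplification of what an indexed search does.

```python
import re

# Hypothetical noise word list; real tools define their own.
NOISE_WORDS = {"the", "and", "or", "of", "a", "an"}

def index_terms(text):
    """Split text into lowercase words and drop the noise words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in NOISE_WORDS]

def matches(query, document):
    """True when every non-noise query word appears somewhere in the document."""
    document_terms = set(index_terms(document))
    return all(term in document_terms for term in index_terms(query))

print(index_terms("Catherine the Great"))  # ['catherine', 'great']
print(matches("Catherine the Great", "Catherine made a great impression"))  # True
```

The second result is the kind of false hit to watch for: the document never mentions “Catherine the Great” as a phrase.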
Optical Character Recognition, or Searching Unsearchable Text
Optical Character Recognition (OCR) is a procedure used to extract text from images. Computers review the shapes contained in scanned documents and make their best guess at the text contents. The accuracy of the text output from OCR depends primarily on the clarity of the file being processed. OCR of handwriting is often highly inaccurate unless the OCR engine has been trained on a large amount of text in the same handwriting. Printed documents that are poorly scanned, or photos of paper documents, may also produce inaccurate OCR text.
It is crucial to consider OCR before searching documents. For instance, let’s say that an employee stole a critical secret document by scanning a printed copy and then attaching it to an email. The email says “Document attached.” The document itself says “Secret – Do Not Distribute.” If you search the employee’s email account for the term “Secret” without running OCR on the documents, you will not find that scanned document because its text has not been extracted yet. OCR processing may take some time to complete, but its results can greatly improve the accuracy of your searches.
Keywords, Operators, and Conditions
Search terminology is loose and varies in conversation, but I like to consider search terms as being a combination of keywords, operators, and conditions.
Keywords are the words or phrases of a search. Examples of keywords are “John Doe,” “electric scooter,” “damage,” or “insurance claim.”
Operators are used to combine keywords in a way that limits the search results to more relevant items. Examples of operators are “AND,” “OR,” and “NOT.” These are also called “Boolean operators.” You would often use a search operator to combine someone’s name (keyword) with a relevant term (keyword). Using the keywords above, we may only be interested in documents pertaining to John Doe’s involvement in an electric scooter crash. An appropriate search term might be “John Doe AND (Electric Scooter OR Damage).”
You might be wondering why I inserted parentheses. The parentheses allow us to create more complex logic. In this case, a document that contains the phrase “electric scooter” but not “damage” will be responsive if it also contains the keyword “John Doe.” It will also be responsive if it contains all three keywords. However, a document that mentions “electric scooter” and “damage” will not be responsive if it does not contain the keyword “John Doe.”
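The logic of the parentheses can be checked with a short sketch. The helper names are hypothetical, and the version without parentheses follows Python’s own rule of evaluating “and” before “or”; a given search tool may order its operators differently.

```python
def contains(text, phrase):
    """Case-insensitive check that a phrase appears in the text."""
    return phrase.lower() in text.lower()

def with_parentheses(text):
    # John Doe AND (Electric Scooter OR Damage)
    return contains(text, "John Doe") and (
        contains(text, "electric scooter") or contains(text, "damage")
    )

def without_parentheses(text):
    # John Doe AND Electric Scooter OR Damage
    # Python evaluates "and" before "or"; a search tool may differ.
    return (
        contains(text, "John Doe") and contains(text, "electric scooter")
        or contains(text, "damage")
    )

no_name = "The electric scooter showed damage to the front wheel."
print(with_parentheses(no_name))     # False
print(without_parentheses(no_name))  # True
```

Without the parentheses, any document mentioning “damage” becomes responsive, whether or not John Doe appears in it.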
While not technically a Boolean operator, a proximity search such as “John w/4 Doe” is also available in many tools. This translates to “the word ‘John’ within four words of ‘Doe.’”
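A proximity search can be sketched by comparing word positions. This version assumes that “w/4” means the two words sit at most four word positions apart; tools count distance in slightly different ways, and the function name `within` is hypothetical.

```python
import re

def within(text, first, second, distance):
    """True when `first` and `second` occur at most `distance` word positions apart."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_positions = [i for i, word in enumerate(words) if word == first.lower()]
    second_positions = [i for i, word in enumerate(words) if word == second.lower()]
    return any(
        abs(i - j) <= distance
        for i in first_positions
        for j in second_positions
    )

print(within("John Q. Doe signed the form", "John", "Doe", 4))                 # True
print(within("John left early and nobody told Mr Doe", "John", "Doe", 4))     # False
```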
Finally, conditions can be used to limit the set of documents searched based on their characteristics. These characteristics could include the date a document was created, the owner of the document, or the size of a document. We may create a search term “John Doe AND (Electric Scooter OR Damage)” where the document is newer than January 1st, 2019. We might also be interested in photos of the damage and therefore exclude documents that are not JPEG or PNG files.
Often conditions are applied to a set of search terms. They can be applied differently to each keyword and operator phrase. We might complement the prior search term with another search term “insurance claim” where documents are dated between 03/14/2019-03/24/2019 if we know that a claim was filed within that date range.
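A condition can be pictured as a filter on document metadata applied alongside the keyword match. The `Document` class, its fields, and the sample data below are hypothetical; real tools expose their own metadata fields.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    name: str
    text: str
    created: date

def search_with_date_range(documents, phrase, start, end):
    """Return names of documents created in [start, end] that contain the phrase."""
    return [
        doc.name
        for doc in documents
        if start <= doc.created <= end and phrase.lower() in doc.text.lower()
    ]

# Hypothetical sample documents
documents = [
    Document("claim.eml", "Insurance claim submitted", date(2019, 3, 18)),
    Document("renewal.eml", "Insurance claim history", date(2018, 11, 2)),
]
hits = search_with_date_range(
    documents, "insurance claim", date(2019, 3, 14), date(2019, 3, 24)
)
print(hits)  # ['claim.eml']
```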
The search terms I have included here may look different depending on the search tool you use – syntax refers to the specific rules that a tool may provide for entering terms. This allows a tool to understand what you are looking for based on how it is programmed. In a hypothetical search tool, I might have to format my search term as “(body contains ‘John Doe’) & (body contains ‘electric scooter’ | body contains ‘damage’) & (date>=01012019).”
Other Terminology
Here are some other phrases you may have heard or may consider when running electronic searches.
Cascading Searches are searches run after an initial set of search terms uncovers new relevant phrases or keyword variations. This is why it’s always good to preserve as much as possible – you never know if an individual went by a pseudonym or username, or if slang was used in a series of email communications.
Stemming or stemmed searches will match the base of a keyword and any variations. For example, a stemmed search for the keyword “steal” would also search for “stealing” or “steals.”
Fuzzy searches, specifically fuzzy string searches, will match words similar to a keyword in structure. For example, a fuzzy string search for “dig” could return “digs,” “dog,” “dug,” or “ding.” This is a great search option if you suspect parties in your case are poor spellers.
Phonic searches identify words that sound like a keyword. A phonic search for “witch” could return “which” or even “wish.”
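Stemmed and fuzzy searches can be sketched as follows. Both functions are simplified assumptions for illustration: `naive_stem` only strips two common suffixes, and the fuzzy match uses edit distance (the number of single-character insertions, deletions, or substitutions), which is one common way to measure similarity but not necessarily what a given tool uses.

```python
def naive_stem(word):
    """Strip a trailing 'ing' or 's' when at least three letters remain."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def edit_distance(a, b):
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]

def fuzzy_matches(keyword, words, max_distance=1):
    """Return the words within max_distance edits of the keyword."""
    return [word for word in words if edit_distance(keyword, word) <= max_distance]

print([naive_stem(w) for w in ["steal", "stealing", "steals"]])
# ['steal', 'steal', 'steal']
print(fuzzy_matches("dig", ["digs", "dog", "dug", "ding", "dagger"]))
# ['digs', 'dog', 'dug', 'ding']
```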
Text Encoding and Symbols
What do you do if your search terms are in a language such as French or Bulgarian that uses characters not found in English? If you’re a native English speaker, you may not even think of the possibility of dealing with these languages in your search. Non-English searches will require non-English search terms, because search tools do not (usually) search for every possible translation of a word. While this is a concern when creating search terms, it is also a concern when executing them. You will need to consider encoding options for searching.
Documents in languages that use Cyrillic script or accented characters may be encoded differently than standard English ESI (electronically stored information). Search tools may allow you to search for a keyword in multiple text encodings, which helps prevent you from missing this data. Even English may be encoded differently: UTF-8, an encoding of Unicode, is a common standard, but native SMS text messages are commonly encoded in GSM-7. These issues are often taken care of easily by using search tools equipped to handle eDiscovery and by processing forensic acquisitions with appropriate software.
Different encodings support different symbols, like stars and hyphens. That doesn’t necessarily mean that your search tool does too. Search indexes may require custom settings to index symbols, or else the symbols will be treated as spaces. Symbols may also conflict with tool syntax. If a search tool uses the character “&” as an operator, but you have a company name keyword “Doe & Sons, LLC.,” this search term may not be executed correctly as is. Symbols can often be converted into a numeric code for machines to understand. For example, “&” is character code 38 in ASCII, which some tools and languages write as “Chr(38).”
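To see why encoding matters, consider searching raw bytes for a Cyrillic keyword. This is a minimal sketch: the function name `find_in_encodings`, the sample text, and the list of encodings tried are assumptions, and real tools support many more encodings.

```python
def find_in_encodings(raw_bytes, keyword, encodings=("utf-8", "utf-16-le", "cp1251")):
    """Return the encodings under which the keyword's bytes occur in raw_bytes."""
    return [enc for enc in encodings if keyword.encode(enc) in raw_bytes]

# Hypothetical file contents, stored in the Windows-1251 (Cyrillic) encoding
raw = "Документ: Секрет".encode("cp1251")

print(find_in_encodings(raw, "Секрет"))                        # ['cp1251']
print(find_in_encodings(raw, "Секрет", encodings=("utf-8",)))  # []
```

A tool that only looks for the UTF-8 form of the keyword finds nothing, even though the word is plainly present in the document.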
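The following sketch shows the numeric code behind “&” and how an index that treats symbols as spaces loses the character entirely. The `tokenize` function and its `keep_symbols` option are hypothetical stand-ins for a tool’s indexing settings.

```python
import re

print(ord("&"))  # 38
print(chr(38))   # &

def tokenize(text, keep_symbols=False):
    """Split text into index terms, optionally keeping '&' as part of a term."""
    pattern = r"[A-Za-z0-9&]+" if keep_symbols else r"[A-Za-z0-9]+"
    return re.findall(pattern, text)

print(tokenize("Doe & Sons, LLC."))                     # ['Doe', 'Sons', 'LLC']
print(tokenize("Doe & Sons, LLC.", keep_symbols=True))  # ['Doe', '&', 'Sons', 'LLC']
```

With the default setting, a search for the exact company name cannot tell “Doe & Sons” apart from “Doe Sons.”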
Emoticons and emoji are a recent and unique challenge for investigators and document review personnel. Smileys, hearts, and mermaids may be skipped or interpreted as pictures during eDiscovery searches. Be sure to look out for these trouble items during your document review.