The Relevant Needle in the ESI Haystack

Make the needle easier to find by making the haystack smaller

By Daniel Rupprecht, Director, eDiscovery Consulting

An analogy often invoked when discussing data interrogation and how eDiscovery technology can be used to manage large volumes of data is that of finding the “needle in a haystack”.

A misconception about eDiscovery methods is that they can be used to find so-called needles directly. With luck this is sometimes true, but the majority of eDiscovery technologies and techniques are put in place to manage and reduce the size of the haystack you start with. This, in turn, creates a far more relevance-rich set of documents for review, making it easier to find the needles we are after.

I have been in the field of digital investigations and eDiscovery for close to 15 years. In that time, the volume of electronically stored information (ESI) has expanded exponentially. Over the same period, however, there has not been a matching explosion in the volume of evidence (i.e., we are not finding more information from which to build our arguments). We are therefore now being tasked with sifting through far more irrelevant data than ever before to find what is important to our investigations. All indications are that this trend will only intensify in the future.

The legal field has historically relied on search term-based analysis when dealing with the growing haystack of ESI. Logic dictates that if data volumes are increasing exponentially each year and search terms alone are applied, then the volume of results from this approach will grow at the same exponential rate. On this basis, simply doing things the way they have historically been done is not the most efficient approach. While we will never turn our backs completely on search terms, there is a need for us to think differently and deploy data analytics to augment our current strategies. The aim is to be smarter, faster – and, importantly, more cost-efficient.

Over the course of this and the following two articles, it is my goal to introduce some concepts, technologies, and techniques that are driving the industry to do things better and more efficiently. These approaches allow legal teams to access the information they need faster, minimising the cost of reviewing documents that have nothing to do with the matter at hand.

  • Part 1 – Discussions of how reductions in volumes can be realised without setting eyes on a single document.
  • Part 2 – Technologies that can help focus the review and place legal teams in relevant rich data environments.
  • Part 3 – Continuous Active Learning, an updated form of predictive coding.

Through the three-part series, it is my intention to provide context to some of the strategies that can be used to reduce mountains of data into manageable molehills.

Part 1: Reducing data volumes with your eyes closed

Reducing the volume of data, without performing any review: simple approaches that can be used to reduce data volumes, without having to look into the content of the data itself.

Data Source and Custodian Identification

On any matter, whether litigation or investigation, understanding the data landscape is one of the most important elements of an effective and measured eDiscovery project. Advising clients on proactive methods to understand their digital environment is key when trying to assess what might be at issue in a particular matter. Sound information governance allows organisations to be more selective when issues arise.

Take, for example, an investigation into anticompetitive behaviour related to the market allocation activities of a large multinational organisation. It might be assumed that the marketing or sales team is involved. This could amount to hundreds of individuals being identified as a potential source of illegal activity. From a data perspective, not knowing who or what might be involved could require the collection of a vast quantity of email and other documentation.

Knowing how teams operate and who they interact with within their digital environments might allow for a better understanding of what needs to be collected rather than just taking a broad stroke at all information stored by the company. Questions to consider may be:

  • Is there a particular product at issue?
  • Are there regional distinctions that can preclude consideration of any illegal action?
  • Are there subsets of teams within the larger group that can be focused on?

In a similar manner, understanding the data sources which could contain relevant information is also key. Again, this is linked to sound information governance and there are proactive steps that companies can take to make this step easier for them in the event of litigation or investigation, prior to there being an issue.

As an example of this, consider a company that uses a single network shared area, which all users have access to. If there is no structure, then relevant information could fall anywhere – which could in turn lead to having to consider all the data in the network share.

Alternatively, if the network share is broken down in a structured manner (for example by department or project) and access is limited to those who require it, then it will be easier to identify the locations where relevant data may be stored. Again, this can help immediately reduce data volumes.

By gaining a better understanding, or a clear data map, of who and what could potentially be involved, large swaths of data can be set aside at the collection stage, which will have an immediate effect in reducing data volumes. Understanding this is essential and should be discussed in detail with legal counsel at the earliest stage to prevent over-collection of data.
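
To make the idea of a data map concrete, the following is a minimal sketch of scoping a collection by department and region before anything is collected. The `Source` record, its fields, the custodian names and the share paths are all hypothetical, chosen only for illustration; a real data map would be built with the client and legal counsel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """One entry in a hypothetical data map: who holds the data and where."""
    custodian: str
    department: str
    region: str
    path: str


def in_scope(source, departments, regions):
    """Keep a source only if both its department and region are at issue."""
    return source.department in departments and source.region in regions


# Illustrative data map (all values are invented for this example).
sources = [
    Source("a.jones", "Sales", "EMEA", "//share/sales/emea/a.jones"),
    Source("b.smith", "Sales", "APAC", "//share/sales/apac/b.smith"),
    Source("c.wong", "Finance", "EMEA", "//share/finance/emea/c.wong"),
]

# Assume the matter concerns only the EMEA sales team.
to_collect = [s for s in sources if in_scope(s, {"Sales"}, {"EMEA"})]
print([s.path for s in to_collect])
```

Two of the three sources are set aside here without a single document being opened, which is the effect described above.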

Date Ranges

Date ranges can be used in much the same manner. A clear understanding of the timeframe at issue can be very helpful in filtering down data. In an investigation context, it is usually helpful to allow some room at both ends of the date range to account for slight variances. Data volumes are typically reduced in line with the length of the relevant time period: the narrower the scope, the less information needs to be interrogated. While date ranges can be key in reducing data volumes, it is important to understand how they are applied, for example which date field (such as sent date or last modified date) the filter is run against.
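
As a rough sketch of a date filter with room at both ends, the snippet below keeps documents whose date falls inside the relevant period plus a buffer. The 30-day default buffer, the `sent` field and the document IDs are assumptions made for the example, not recommended values.

```python
from datetime import date, timedelta


def in_date_range(doc_date, start, end, buffer_days=30):
    """True if doc_date lies within [start - buffer, end + buffer]."""
    pad = timedelta(days=buffer_days)
    return (start - pad) <= doc_date <= (end + pad)


# Hypothetical documents, filtered on their sent date.
docs = [
    {"id": "DOC-1", "sent": date(2017, 12, 15)},  # just before the period
    {"id": "DOC-2", "sent": date(2018, 6, 1)},    # inside the period
    {"id": "DOC-3", "sent": date(2019, 3, 1)},    # well after the period
]

start, end = date(2018, 1, 1), date(2018, 12, 31)
kept = [d["id"] for d in docs if in_date_range(d["sent"], start, end)]
print(kept)  # ['DOC-1', 'DOC-2']
```

DOC-1 falls outside the stated period but inside the buffer, so it is retained; widening or narrowing `buffer_days` is a decision to agree with legal counsel.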


Deduplication

It was not that long ago that reading the same document over and over was a prevalent source of frustration for reviewers. Today, however, technology can help identify duplicative documents across vast amounts of material with great accuracy and speed, and allow for the removal of duplicative content.

Upon collection and processing of digital information, each document is assigned a hash value (commonly an MD5 hash, often referred to as an MD5#). This is effectively a fingerprint for each document found in the universe of documentation to be considered. Emails are given a hash value in a different way – by considering many of the pieces of information available, such as To, From and Date Sent, as well as the content of the email itself. This approach allows for deduplication of the same email from various parties – e.g., the sender and the receiver.

The hash values are designed such that they will only match if the documents are the same – the likelihood of two different documents having the same hash value is mathematically incredibly small. This means that the system can identify duplicates by considering just the hash value and thereby exclude them.
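
The hashing approach described above can be sketched as follows. This is an illustration only: the choice of email fields, the normalisation (lower-casing addresses, sorting recipients) and the function names are assumptions, and processing platforms each define their own rules.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """MD5 fingerprint of a file's raw bytes."""
    return hashlib.md5(content).hexdigest()


def email_hash(sender, recipients, sent, subject, body) -> str:
    """MD5 over selected email fields (assumed set), lightly normalised."""
    parts = [
        sender.strip().lower(),
        ";".join(sorted(r.strip().lower() for r in recipients)),
        sent,
        subject.strip(),
        body.strip(),
    ]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()


def deduplicate(records):
    """records: (doc_id, hash) pairs. Returns unique IDs and (dup, original) pairs."""
    first_seen, unique, duplicates = {}, [], []
    for doc_id, digest in records:
        if digest in first_seen:
            duplicates.append((doc_id, first_seen[digest]))
        else:
            first_seen[digest] = doc_id
            unique.append(doc_id)
    return unique, duplicates


# The same email collected from the sender's and the recipient's mailbox.
sender_copy = email_hash("Alice@Example.com", ["bob@example.com"],
                         "2018-06-01T09:30:00Z", "Pricing", "See attached.")
recipient_copy = email_hash("alice@example.com", ["Bob@Example.com"],
                            "2018-06-01T09:30:00Z", "Pricing", "See attached.")

records = [
    ("DOC-1", sender_copy),
    ("DOC-2", recipient_copy),
    ("DOC-3", file_hash(b"Draft contract")),
]
unique, duplicates = deduplicate(records)
print(unique)      # ['DOC-1', 'DOC-3']
print(duplicates)  # [('DOC-2', 'DOC-1')]
```

Only the hash values are compared; the content of the documents is never opened for review.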

There are some considerations to take into account. The slightest of anomalies between two files will mean they have different hash values. For example, if you save a copy of a Word document, then use “save as” to save a copy of the same document again without changing the content, the last modified date stored within the file will change. This means that the hash value will differ from the initial saved copy, and while the reviewable content is the same, the files will not deduplicate. Similarly, if you have ever printed a Word document to a PDF, although the files appear identical in content, the file type will differ, and the new file will have a new hash value. Even with these considerations, an average reduction in total volume of 25% or more is common.
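
A small demonstration of this sensitivity, under a deliberately simplified assumption: each "file" below is modelled as a stored timestamp followed by the body text, which is not how a real Word file is laid out.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 fingerprint of raw bytes."""
    return hashlib.md5(data).hexdigest()


body = b"Quarterly pricing memo. Same text in both copies."

# Toy file layout (an assumption): embedded timestamp, then the body.
original = b"modified=2018-06-01T09:30:00;" + body
save_as_copy = b"modified=2018-06-02T14:05:00;" + body

# What a reviewer would read is identical in both files...
same_text = original.split(b";", 1)[1] == save_as_copy.split(b";", 1)[1]
# ...but the file-level hashes differ, so the copies will not deduplicate.
same_hash = md5_hex(original) == md5_hex(save_as_copy)

print(same_text)  # True
print(same_hash)  # False
```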

The other consideration is the importance of tracking the documents that have been removed from the system (or rather workflow). Most investigations require a level of flexibility. Rarely do investigative teams have a clear understanding of exactly who and what is at issue on day one. When deploying deduplication, once a hash value is set, all other versions of that document from other custodians will be identified and dropped out of the workflow as duplicates. Should that initial custodian later be deemed as no longer a focus and downgraded from consideration, it is essential that you consult with your service provider to ensure that any duplicates removed in this way are reintroduced into the document review cycle. This is typically an easy process, but one that is not automatic and should be identified upon any adjustment in approach.

There are different types of deduplication and it can be applied in many ways, making it important to understand the type of deduplication in use. In litigation, global deduplication is typically used so that only one copy of any file is in the review set. In an investigation, it may be key to understand which users had copies of which documents, so custodian-level deduplication may be more appropriate. In some extreme cases, it may be appropriate to apply no deduplication, though this is rare.
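
The difference between global and custodian-level deduplication, and the need to track what has been removed, can be sketched as below. The record layout, the `scope` parameter and the approach of simply re-running deduplication when a custodian is dropped are assumptions for illustration; your service provider's platform will have its own mechanism.

```python
from collections import defaultdict


def deduplicate(docs, scope="global"):
    """docs: (doc_id, custodian, hash) tuples in processing order.

    Returns (kept, suppressed), where suppressed maps each kept doc_id to
    the duplicate doc_ids that were dropped from the workflow in its favour.
    """
    first_seen = {}
    kept = []
    suppressed = defaultdict(list)
    for doc_id, custodian, digest in docs:
        # Global: one copy overall. Custodian: one copy per custodian.
        key = digest if scope == "global" else (custodian, digest)
        if key in first_seen:
            suppressed[first_seen[key]].append(doc_id)
        else:
            first_seen[key] = doc_id
            kept.append(doc_id)
    return kept, dict(suppressed)


def drop_custodian(docs, custodian, scope="global"):
    """Re-run deduplication without a custodian so others' copies re-enter review."""
    remaining = [d for d in docs if d[1] != custodian]
    return deduplicate(remaining, scope)


docs = [
    ("A-1", "alice", "h1"),
    ("B-1", "bob", "h1"),    # same file as A-1, held by bob
    ("B-2", "bob", "h1"),    # second copy held by bob
    ("C-1", "carol", "h2"),
]

print(deduplicate(docs, "global"))     # (['A-1', 'C-1'], {'A-1': ['B-1', 'B-2']})
print(deduplicate(docs, "custodian"))  # (['A-1', 'B-1', 'C-1'], {'B-1': ['B-2']})
print(drop_custodian(docs, "alice"))   # (['B-1', 'C-1'], {'B-1': ['B-2']})
```

In the last call, bob's copy B-1 returns to the review set once alice is no longer in scope; without the re-run, that document would have disappeared from review along with her data.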

Email Threading

Like deduplication, email threading is an eDiscovery methodology that is realising reductions of upwards of 25% in total data volumes when applied. Rather than reviewing individual emails, reviewers work through complete conversations, with any email whose content is already fully represented later in the conversation removed from review. Not only are data volumes instantaneously reduced, but huge efficiencies are also realised, as threading makes the data easier to review: all emails are read within the context in which they were written.
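
The core idea can be sketched in a few lines, under a strong simplifying assumption: each email has already been split into its constituent messages, oldest first. Real threading tools have to work this out from headers and quoted text, and they also account for attachments, which this sketch ignores. The function name and sample emails are hypothetical.

```python
def emails_to_review(emails):
    """emails: dict of email_id -> tuple of message segments, oldest first.

    Returns the IDs of emails whose content is not wholly contained in a
    longer email from the same conversation.
    """
    keep = []
    for email_id, segments in emails.items():
        contained = any(
            other_id != email_id
            and len(other) > len(segments)
            and other[:len(segments)] == segments
            for other_id, other in emails.items()
        )
        if not contained:
            keep.append(email_id)
    return keep


thread = {
    "E1": ("Can we meet?",),
    "E2": ("Can we meet?", "Yes, Tuesday."),
    "E3": ("Can we meet?", "Yes, Tuesday.", "Confirmed."),
    "E4": ("Can we meet?", "No, I am away."),  # a separate branch of the thread
}

print(emails_to_review(thread))  # ['E3', 'E4']
```

Four emails become two for review. E4 is kept because it branches off the conversation and its reply appears nowhere else.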

Consistency of review is a natural advantage recognised through the implementation of email threading. Many years ago, to ensure consistency across large volumes of data, teams of lawyers would be required to pass through already reviewed material to ensure the correct call was made across email chains that were divided.

By way of example, take privilege redactions. One lawyer might view a single sentence as warranting a redaction; another a paragraph or page of an email chain. A third could mark the entire document as privileged. All three are effectively reviewing the same conversation – as such, inconsistencies could lead to waiving privilege altogether. With email threading, one team member makes the decision. The use of email threading has greatly reduced the risk of conflicting calls muddying the judgment of the subject matter expert. As with deduplication, its activation and tracking are not automatic and should be discussed with your service providers and legal counsel at the earliest opportunity.


Conclusion

I recently attended an IBA conference on “Communications and Competition”, focusing in part on consumer protections in the telecommunications industry. One of my many takeaways related to the upcoming advent of 5G networks. Not just another “G”, I was told. In essence, the speed and rate at which we create data will explode to a level we have not considered before. In eDiscovery terms, the haystack is about to get even bigger.

There are ways we can manage these volumes in the early stages. Tools and analytic approaches found in virtually all review platforms can reduce data without the need to open a single document.  This can, however, only take you so far.  At some point, human decision making needs to be employed to help make sense of what is left – to find a way to search through the reduced haystack.

In Part 2, I will discuss how technology is allowing us to be more strategic in our approach to eDiscovery and its application. Through the creative use of tools and techniques which are already familiar, legal teams are now able to find more while reviewing less.  This is something we desperately need today but will need even more in the very near 5G future.