Sometimes it’s good to get back to the basics of eDiscovery. This post will focus on how to deal with duplicate documents.
Deduplication – or deduping – is the process of identifying identical documents in a collection and then suppressing those copies from a document review. It’s a crucial eDiscovery function, and choosing the right deduplication strategy can positively impact the speed, cost and consistency of your review.
How are documents identified as duplicates? When case documents are first loaded into an eDiscovery application, an algorithm calculates a hash value, or “digital fingerprint”, for each one. Depending on the type of file, the hash is based on the document’s binary content, its metadata, its text, or a combination of these. Since identical documents generate the same digital fingerprint, the software can flag them as duplicates.
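To make the fingerprint idea concrete, here is a minimal Python sketch. It assumes SHA-256 over the raw file bytes only; the hash algorithm, the function names and the sample files are illustrative assumptions, not a description of how Servient or any other platform computes its hashes.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a hex digest of the raw bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by fingerprint and keep only groups with 2+ members."""
    groups: dict[str, list[str]] = {}
    for name, data in files.items():
        groups.setdefault(fingerprint(data), []).append(name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}


# Hypothetical sample collection: two identical files and one revised version.
files = {
    "memo_v1.docx": b"Quarterly results attached.",
    "memo_copy.docx": b"Quarterly results attached.",
    "memo_v2.docx": b"Quarterly results attached (revised).",
}
duplicates = find_duplicates(files)
```

Here `duplicates` holds a single group containing `memo_v1.docx` and `memo_copy.docx`, while the revised memo stands alone because even a small change in content produces a different fingerprint.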
eDiscovery platforms will vary, but a Servient client has four paths to choose from in deciding how their documents ought to be deduped.
1) Global Deduplication – Exact: With this option, all duplicates of a document are suppressed (meaning held back and retained in an archive) and only a single copy is submitted for review. This is the most widely used deduplication strategy since it typically offers the greatest cost savings. (In our experience, a cost reduction of 30% to 40% is not uncommon.)
In this case, the retained version of a document carries a “Global Custodian” field that lists the names of all other custodians who possessed a duplicate of it. This lets you keep track of the multiple owners of a given document throughout the review process.
2) Global Deduplication – Exact & Content: The only difference here is that content duplicates are held back in addition to exact duplicates. Content duplicates have different digital fingerprints, but their text content is precisely the same. An example would be a PDF created from a Word document: because the two files are in different formats, they will not share the same digital fingerprint, yet analysis by the review software will find that their text content is identical.
3) Custodian Deduplication – Exact: With this option, only exact duplicates within a single custodian’s data set are removed. For example, if custodian John has three exact copies of Document X, two of those will be suppressed and only one will be promoted to the document review set. But if any other custodian has their own copy of Document X, that copy would also be promoted even though it is identical. (Servient has a “cascade calls to duplicates” feature that will ensure that all copies of a document are coded in the same way.)
4) Custodian Deduplication – Exact & Content: Here, documents are also deduped only within custodian groups, but again, content duplicates are suppressed in addition to exact duplicates.
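The four options above differ along only two axes: the scope of the comparison (whole collection or single custodian) and whether identical text counts as a duplicate. The Python sketch below illustrates that under stated assumptions. The `Document` class, the `dedupe` function, its parameters and the sample data are all hypothetical, SHA-256 is an assumed hash, and a real platform will handle metadata, email families and text extraction in ways this sketch ignores.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    custodian: str
    raw: bytes   # original file bytes
    text: str    # extracted text ("" if none could be extracted)


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def dedupe(docs, scope="global", include_content=False):
    """Return (promoted_ids, suppressed_ids, other_custodians).

    scope: "global" compares across all custodians; "custodian" compares
           only within each custodian's own documents.
    include_content: if True, documents with identical extracted text are
           treated as duplicates even when their bytes differ.
    """
    if scope not in ("global", "custodian"):
        raise ValueError("scope must be 'global' or 'custodian'")
    retained = {}          # dedup key -> first Document seen with that key
    promoted, suppressed = [], []
    other_custodians = {}  # retained doc_id -> other custodians with a copy
    for doc in docs:
        # Fall back to the binary hash when there is no text to compare.
        if include_content and doc.text:
            fp = "text:" + _sha256(doc.text.encode("utf-8"))
        else:
            fp = "bin:" + _sha256(doc.raw)
        key = fp if scope == "global" else (doc.custodian, fp)
        first = retained.get(key)
        if first is None:
            retained[key] = doc
            promoted.append(doc.doc_id)
            continue
        suppressed.append(doc.doc_id)
        # Sketch of a "Global Custodian" style field for the retained copy.
        if scope == "global" and doc.custodian != first.custodian:
            names = other_custodians.setdefault(first.doc_id, [])
            if doc.custodian not in names:
                names.append(doc.custodian)
    return promoted, suppressed, other_custodians


# Hypothetical data: three byte-identical Word files and one PDF rendition.
docs = [
    Document("1", "John", b"word-file-bytes", "Contract terms"),
    Document("2", "John", b"word-file-bytes", "Contract terms"),
    Document("3", "Mary", b"word-file-bytes", "Contract terms"),
    Document("4", "Mary", b"pdf-file-bytes", "Contract terms"),
]
global_exact = dedupe(docs)
global_content = dedupe(docs, include_content=True)
custodian_exact = dedupe(docs, scope="custodian")
custodian_content = dedupe(docs, scope="custodian", include_content=True)
```

With this sample data, global exact deduplication promotes documents 1 and 4 and records Mary as an additional custodian of document 1; adding content matching leaves only document 1. Custodian-level deduplication promotes 1, 3 and 4 in exact mode, and 1 and 3 once content duplicates are included, which shows why the global options usually produce the smallest review set.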
This is only a cursory glance at the deduping alternatives available to the eDiscovery customer. Never hesitate to ask the professionals for more detail – the choice will be yours, and it could have significant consequences for your cost savings and for how your document review process unfolds.