
The goal of processing a large quantity of data is to defensibly cull the data set to one that encompasses all of the potentially relevant documents, yet is small enough to reasonably review within the given timeline and other parameters of the case.
Unlike most eDiscovery vendors, UHY Advisors starts the culling activities before the data is uploaded into our system (long before the “meter” starts running).
- We start by extracting the Class A & B data set (see below) from the larger quantity of data. This isolates the user-created files from the larger set.
- When restoring data from tape, we selectively restore only those directories that contain user-created data, leaving behind operating system and other related files.
- In the case of an email server, our process allows us to selectively extract only the mailboxes belonging to key individuals in the case.
Once the Class A & B data is extracted, it is uploaded to our processing framework, where we can deduplicate the user-created data within a single custodian or across custodians, which can decrease the data set by over 75%. In some cases, we can also deduplicate the data set across backup instances, further decreasing the size of the data set by over 90%.
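The difference between deduplicating within a single custodian and across custodians can be illustrated with a short sketch. This is only an illustration under stated assumptions: the `Doc` record, the `deduplicate` function, and the choice of SHA-256 as the content hash are hypothetical and do not describe the actual processing framework.

```python
import hashlib
from collections import namedtuple

# Hypothetical record: custodian name, file path, and raw file bytes.
Doc = namedtuple("Doc", ["custodian", "path", "content"])


def deduplicate(docs, across_custodians=False):
    """Keep the first copy of each distinct file content.

    across_custodians=False: identical files are collapsed only when
    they belong to the same custodian.
    across_custodians=True: identical files are collapsed globally.
    """
    seen = set()
    unique = []
    for doc in docs:
        # Assumption: content identity is decided by a SHA-256 digest.
        digest = hashlib.sha256(doc.content).hexdigest()
        key = digest if across_custodians else (doc.custodian, digest)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


docs = [
    Doc("alice", "budget.xls", b"same bytes"),
    Doc("alice", "budget_copy.xls", b"same bytes"),
    Doc("bob", "budget.xls", b"same bytes"),
    Doc("bob", "memo.doc", b"different bytes"),
]
within = deduplicate(docs)
across = deduplicate(docs, across_custodians=True)
print(len(docs), len(within), len(across))
```

Deduplicating across custodians removes more data, but it also means a given document is retained under only one custodian, so the choice depends on how the review will be organized.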
Rather than just blindly applying search terms, our iterative searching process allows us to work with our clients to establish an initial list of terms, apply those terms to the data set, and create a frequency report. Frequency reports allow our clients to review their list of terms, the number of documents each term hit on, and the percentage of the entire data set represented by those hits. With our frequency report in hand, our clients have the information they need to modify the search term set, eliminating over-inclusive terms and adjusting others as the constantly changing parameters of the case require. Finally, once a final list of terms is identified, we encourage our clients to review a small statistical sample of the documents that did not receive search term hits, to confirm that none of them are relevant to the case. All of these searching efforts culminate in one of the most airtight and defensible searching processes available today.
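A frequency report of the kind described above can be sketched in a few lines. This is a simplified illustration, not the actual search engine: the function names `frequency_report` and `sample_non_hits` are hypothetical, documents are assumed to be plain text keyed by an ID, and a "hit" is assumed to be a case-insensitive substring match.

```python
import random


def frequency_report(documents, terms):
    """Return one row per term: (term, hit_count, percent_of_data_set)."""
    total = len(documents)
    rows = []
    for term in terms:
        needle = term.lower()
        # Assumption: a hit is a case-insensitive substring match.
        hits = sum(1 for text in documents.values() if needle in text.lower())
        percent = 100.0 * hits / total if total else 0.0
        rows.append((term, hits, percent))
    return rows


def sample_non_hits(documents, terms, sample_size, seed=0):
    """Randomly sample IDs of documents that matched none of the terms."""
    needles = [t.lower() for t in terms]
    misses = [
        doc_id
        for doc_id, text in documents.items()
        if not any(n in text.lower() for n in needles)
    ]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(misses, min(sample_size, len(misses)))


documents = {
    "d1": "Merger agreement draft",
    "d2": "Lunch menu",
    "d3": "merger timeline",
    "d4": "holiday schedule",
}
report = frequency_report(documents, ["merger", "menu"])
for term, hits, percent in report:
    print(f"{term}: {hits} documents ({percent:.1f}% of data set)")
unreviewed_sample = sample_non_hits(documents, ["merger", "menu"], 5)
```

A term that hits a very large percentage of the data set is a candidate for narrowing, and the sample of non-hit documents supports the final check that nothing relevant was left behind.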
In addition to our deduplication and search culling mechanisms, we offer the ability to cull the data set:
- by custodian
- by date restriction, focusing on only those documents created within a select date range
- by file types that are key to the investigation at hand
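The three culling options above can be combined into a single filter. The sketch below is illustrative only: the `cull` function and the record fields (`custodian`, `path`, `created`) are assumed names, and file type is approximated by file extension.

```python
import os
from datetime import date


def cull(records, custodians=None, start=None, end=None, extensions=None):
    """Keep records that pass every filter supplied; None disables a filter."""
    kept = []
    for rec in records:
        if custodians is not None and rec["custodian"] not in custodians:
            continue
        if start is not None and rec["created"] < start:
            continue
        if end is not None and rec["created"] > end:
            continue
        # Assumption: file type is judged by the lower-cased extension.
        ext = os.path.splitext(rec["path"])[1].lower()
        if extensions is not None and ext not in extensions:
            continue
        kept.append(rec)
    return kept


records = [
    {"custodian": "alice", "path": "q1.XLSX", "created": date(2008, 3, 1)},
    {"custodian": "alice", "path": "notes.txt", "created": date(2008, 3, 2)},
    {"custodian": "bob", "path": "q2.xlsx", "created": date(2008, 4, 1)},
    {"custodian": "alice", "path": "old.xlsx", "created": date(2005, 1, 1)},
]
culled = cull(
    records,
    custodians={"alice"},
    start=date(2008, 1, 1),
    end=date(2008, 12, 31),
    extensions={".xlsx"},
)
print([rec["path"] for rec in culled])
```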
After the data set has been culled in a defensible manner, we can export the culled set into the proper format so that it can be uploaded to one of our hosted review solutions or delivered to a third-party vendor for alternative hosting. The key to using a hosted data review service is to find the right tool for the job. Not all hosted platforms are created equal; the right one comes complete with all the “bells and whistles” needed to enable a fast, efficient and, most importantly, high-quality document review.
Data Classifications
Over the last several years, there has been a lot of confusion in the industry about the definition of user-created data. The definition is simple: it is any electronic file type that a given computer user can modify with substantive content. Some would argue that all electronic files meet this standard; however, the spirit of the definition restricts the eDiscovery process to a grouping of about 250 file types that have the highest reasonable probability of containing user data. This definition applies mostly to the electronic discovery process, and even more so to those services as they apply to civil litigation. Outside the parameters of civil litigation, where a forensic evaluation is required or criminal activity may have occurred, all bets are off and every file on a given piece of media is considered suspect.
UHY Advisors has gone further and divided user-created data into five primary groupings, which range from Class A to Class E. Class A contains the document types with the highest probability of containing user content; Class E sits at the other end of the spectrum and consists mostly of files created by the computer operating system.
Class A Data:
Class A data includes email and user-file types that have the highest probability of containing user content. Included in this group are the usual email suspects like Microsoft Exchange/Outlook and Lotus Notes, along with others, and standard user-created files like those generated from the Microsoft Office suite of utilities.
Class B Data:
Class B data is similar to Class A in that it has a reasonable probability of containing user-created content, but within the definition of Class B, we expand our investigation to include some of the more obscure email data types and some of the less common file types, as well as commonly used compressed archives, database files, files created by some of the more common financial programs, internet material, and graphics/multimedia files.
Class C Data:
Class C data consists of files that are initially identified as an unknown file type, or that cannot easily be classified as system or application files, and that are unlikely to contain user-created content. Automated processes cannot determine whether these files are system/application files or files that contain user-created content.
Class D Data:
Class D data consists of all files that are identified as system or application files not containing user-created content and that can safely be excluded from a data collection. UHY Advisors will report Class D data to the client and will do no further processing of this data unless specifically requested by the client.
Class E Data:
Class E data consists of system or application files that do not contain user-created content. These are files that match the MD5 hash value of a known list of over ten million system and application files maintained and updated by the National Software Reference Library. UHY Advisors will not report Class E data to the client unless specifically asked to do so.
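The ordering of the Class A through Class E checks can be sketched as follows. This is a hedged illustration, not the actual classification logic: the `classify` function is hypothetical, the extension sets are small made-up samples rather than the roughly 250 file types mentioned above, and `known_md5` stands in for a set of MD5 values loaded from a known-file list such as the one maintained by the National Software Reference Library.

```python
import hashlib
import os

# Illustrative samples only; real lists would be far longer.
CLASS_A_EXT = {".pst", ".ost", ".nsf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}
CLASS_B_EXT = {".zip", ".mdb", ".qbw", ".html", ".jpg", ".mp3"}
CLASS_D_EXT = {".dll", ".exe", ".sys"}


def classify(path, content, known_md5):
    """Return 'A' to 'E' for a file, given a set of known-file MD5 hex digests."""
    # Class E: the file's MD5 matches a known system/application file.
    if hashlib.md5(content).hexdigest() in known_md5:
        return "E"
    ext = os.path.splitext(path)[1].lower()
    if ext in CLASS_A_EXT:
        return "A"  # email and common user-created file types
    if ext in CLASS_B_EXT:
        return "B"  # less common types: archives, databases, media, etc.
    if ext in CLASS_D_EXT:
        return "D"  # recognized system/application file, not on the hash list
    return "C"      # unknown: automated processing cannot decide


# Hypothetical known-file set containing a single hash.
known_md5 = {hashlib.md5(b"known system file").hexdigest()}

print(classify("ntoskrnl.exe", b"known system file", known_md5))
print(classify("report.DOCX", b"quarterly numbers", known_md5))
print(classify("mystery.xyz", b"???", known_md5))
```

Checking the hash list first means a known system file is set aside as Class E regardless of its name or extension.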