Is Your Machine Learning Review Tool Accurately Tagging Your Documents?
What are we trying to benchmark?
Every civil court case includes a discovery phase, in which each side gives the other all of the documents (these days, usually in electronic form) that may or may not be relevant to the case. Then each side has its own lawyers read each and every one of these documents, looking for something to help them or, better yet, the smoking gun. These days, law firms can use machine learning programs, which they train to find the documents closest to what they call relevant.
The problem is, how can they tell which software is better for their case? They can’t get a second opinion on how well their documents were coded, because the documents are privileged. So how can they decide which of two machine learning programs that classify text documents into two piles (relevant and not relevant) is doing the better job? That’s what this article is about, and it’s a big issue for lawyers.
The legal field needs a public data set of messages that is large enough and complex enough to classify documents against different and distinct definitions of relevant. At least one such public data set does exist: postings from Usenet, the public newsgroup network.
What is TAR, and what does it have to do with machine learning?
By “TAR” I mean Technology Assisted Review, a term used in the electronic discovery world of litigation. The EDRM group released a detailed paper regarding Technology Assisted Review (TAR) Guidelines, in which the definition given is
“Technology assisted review (referred to as “TAR,” and also called predictive coding, computer assisted review, or supervised machine learning) is a review process in which humans work with software (“computer”) to train it to identify relevant documents.”
This is an excellent resource, and goes into detail (readable by a layperson) about what TAR can do to improve a review, and what the pros and cons are with regard to many aspects of the procedures. However, one important area that is not addressed is how to differentiate between the different software options. Every software company claims that their TAR product is the best, and, for each of the ways litigators use TAR, one of them is right. We just don’t know which one. Yet.
What do I mean by Benchmarking TAR?
When I was Director of Innovation at the legal services provider Discovia, the first products I tested were TAR systems. To do this, we needed a tagged set of documents that we could use to see how well the TAR systems performed compared to human reviewers. One of our clients let us use documents from a settled case, and we also created our own set of documents based on internal mailing lists, but neither was entirely satisfactory. One program, Equivio’s Relevance (now part of Microsoft’s Office 365), shipped with a subset of the Usenet data set, and that gave the clearest comparison. We felt comfortable with our evaluations of the products, but our tests weren’t ideal, and we couldn’t discuss the tagged documents from the first two data sets because they were privileged.
What we really needed was a public data set that had already been tagged by professionals based on the different ways legal teams used TAR. The results of using this public data set could be used to compare two TAR systems side by side, to provide a benchmark for users to help them decide which product would best suit their needs.
What data sets do we have?
The only complete public set of documents from an actual case that was publicly tried is from the Enron scandal. This document set has 1.3 million email messages and attachments from former Enron staff, and is about 40 GB of data. The size and complexity of this case make it a poor choice for a benchmarking data set.
The 20 Newsgroups Data Set
The 20 Newsgroups Data Set was the set packaged with Equivio’s Relevance as a sample test set, and at Discovia we used it as our third testing data set. This data set was created in 1997, and consists of close to 20,000 newsgroup documents partitioned across 20 different newsgroups, each on a different topic, including politics, religion, sports, and computers. A newsgroup was a public discussion area on the internet devoted to a particular subject, where people could post their opinions and ideas. Newsgroups were pretty much the equivalent of the comments sections of webpages, with a good mix of intelligent discussion, arguments, and flame wars.
For example, here is a sample post from the group comp.sys.mac.hardware, file #50423:

From: firstname.lastname@example.org (Charles Holden Winstead)
Subject: ftp site for Radius software???
Organization: Electrical and Computer Engineering, Carnegie Mellon

Hey All,

Does anyone know if I can ftp to get the newest version of Radiusware
and soft pivot from Radius? I bought a pivot monitor, but it has an
old version of this software and won't work on my C650, and Radius said
it would be 4-5 weeks until delivery.

Thanks!
-Chuck
Note that because these were publicly posted, with the understanding that all of this information would be publicly visible, names, organizations, and similar information do not need to be redacted or removed. However, so that TAR systems don’t match documents based on obvious metadata similarities, I created a version of the Usenet data set that does not contain the Path:, Newsgroups:, References:, and Message-ID: headers.
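As a sketch of what that header stripping might look like (the directory layout, file handling, and simple one-header-per-line assumption here are mine, not the actual script I used):

```python
from pathlib import Path

# Headers a TAR system could trivially match on. Assumes headers sit above
# the first blank line and that dropped headers don't wrap onto extra lines.
DROP_HEADERS = ("Path:", "Newsgroups:", "References:", "Message-ID:")

def strip_headers(raw_post: str) -> str:
    """Return the post with the obvious-metadata headers removed."""
    headers, sep, body = raw_post.partition("\n\n")
    kept = [line for line in headers.splitlines()
            if not line.startswith(DROP_HEADERS)]
    return "\n".join(kept) + sep + body

# Apply to every file in a local copy of the data set (path is hypothetical):
# for path in Path("20_newsgroups").rglob("*"):
#     if path.is_file():
#         path.write_text(strip_headers(path.read_text(errors="ignore")))
```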
The newsgroups available are:

- alt.atheism
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- misc.forsale
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- soc.religion.christian
- talk.politics.guns
- talk.politics.mideast
- talk.politics.misc
- talk.religion.misc
For example purposes, Equivio suggested that people testing their product mark anything in the religion groups as relevant, and everything else not relevant. This meant that about 20% of the documents were relevant, and the resulting model was very accurate. While these groups had some long, detailed discussions on the group topic, not every posting within them was strictly relevant. Also, other groups, such as misc.forsale, were broad enough that the articles within them were too varied to classify as a single topic.
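For readers who want to try that baseline themselves, here is a minimal sketch using scikit-learn’s packaged copy of the 20 Newsgroups data. The TF-IDF plus logistic regression pipeline is illustrative only, not what Equivio’s Relevance actually uses:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Binary labels in the spirit of the Equivio test: religion groups = relevant.
RELIGION = {"alt.atheism", "soc.religion.christian", "talk.religion.misc"}

# Stripping headers/footers/quotes keeps the model from keying on metadata,
# the same concern the header-stripped version of the data set addresses.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

y_train = [train.target_names[t] in RELIGION for t in train.target]
y_test = [test.target_names[t] in RELIGION for t in test.target]

vec = TfidfVectorizer(stop_words="english")
X_train = vec.fit_transform(train.data)
X_test = vec.transform(test.data)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```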
So how do we use this data set for benchmarking?
This data set contains some proportionally large sets of documents with detailed discussions around a particular subject. There are also smaller sets of documents on other subjects, in the sports and science groups. Multiple sets of documents relevant to different subjects can be tagged to make a “gold standard” of relevance tags that can be used to compare how well a TAR system performs.
For this data set to be used as a benchmark, each document obviously needs to be tagged relevant or not relevant for topics chosen as useful by the legal community. Obvious topics are politics and religion as general topics, or broken down into specific subsets, such as the Branch Davidians. Other, smaller topics around sports, computers, and science can be found. The documents were collected in 1997, so the topics involved reflect both what was happening at the time and timeless discussions, such as:

- space shuttle schedules
- the best snow tires
- the existence of God
- the difference between Windows 3.0 and 3.1
- the Atlanta Braves vs. the Houston Astros
- GEICO’s extended warranty plan
Note that just because something was posted in a particular newsgroup does not mean it will have anything to do with the newsgroup topic.
The Gold Standard
To benchmark, a gold standard document should be created that lists several different subjects, along with whether each document in the data set is relevant or not relevant. In order to test how well the TAR systems handle those situations, this gold standard document should contain relevance tags for
- Subjects with quite a few relevant documents (aspects of religion are obvious choices)
- Subjects that match fewer documents in total
- Subjects that match fewer relevant documents
A gold standard file should be created that lists the definitive tags for these different subjects, so that every file in the Usenet data set has a tag for each of the subjects. As a simplistic example, take some documents pulled up at random, with the standard for relevance being the answer to the question, “Does this document mention politics or motorcycles?” (Obviously, more realistic questions should be developed that more closely match the types of topics used in civil cases.)
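As a sketch of one possible layout for such a gold standard file, with one column per subject. The file names and tags below are invented for illustration:

```python
import csv

# Hypothetical rows: (document id, tag for the politics-or-motorcycles subject).
rows = [
    ("rec.motorcycles/104399",    "relevant"),
    ("rec.autos/103005",          "not relevant"),
    ("talk.politics.misc/178394", "relevant"),
    ("misc.forsale/76057",        "not relevant"),
]

with open("gold_standard.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_id", "politics_or_motorcycles"])
    writer.writerows(rows)
```

A real gold standard would add one column per benchmark subject, so every file in the data set carries a definitive tag for each subject.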
Who decides what’s actually relevant?
Who should do the tagging? Ideally, for the legal community, one of the groups in EDRM could take ownership, letting members of the legal community choose the topics. Using groups of three to decide relevance works well: if there is a question about an edge case, the document’s relevance will be whatever tag was given by two of the three reviewers.
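A minimal sketch of that majority-vote rule (the function and tag names are hypothetical):

```python
from collections import Counter

def majority_tag(tags):
    """Resolve an edge case by majority vote among three reviewers.

    tags: three strings, each "relevant" or "not relevant".
    """
    winner, _count = Counter(tags).most_common(1)[0]
    return winner

# Two of three reviewers tagged the document relevant:
print(majority_tag(["relevant", "not relevant", "relevant"]))  # relevant
```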
And who watches the watchmen?
Once the benchmarking group has decided on the topics and tagged the documents, they will need to use this new Gold Standard list to test the different TAR systems. While I’m certain that the TAR companies will offer to take the list and do the testing themselves, that’s like letting the fox into the henhouse. The final step of this project is for the members of the EDRM group to test the TAR software themselves.
Note that this isn’t necessarily a strict pass/fail test. There are different ways to use TAR depending on the client and the use case. While the default test would be to train the TAR system to rank all the documents from most relevant to least relevant, then compare that to the Gold Standard, there are other tests that can be done. These can be determined at testing time, based on how the TAR system is designed to perform.
The legal community needs a way to validate the TAR software they use, and this data set is a good way to provide it.