Making a Lawyer out of a Computer

Turning technical legal documents into technical data science fields

River Bellamy
May 27, 2021 · 5 min read

The Project

For the last month, I have been working on a team of about 14 Lambda School (LS) students on a product for Human Rights First (HRF). Immigration judges make decisions about granting or denying asylum all the time, but these opinions are not public and are not systematically available in databases the way other court opinions often are. So we are building a database where lawyers can upload cases, access cases, and, importantly from a data science perspective, analyze patterns in cases.

In order to analyze the cases in a large-scale, systematic way, our code has to look at these natural-language documents and extract a number of different variables from them. Where in the US was the case filed? When was it decided? Where did the asylum seeker come from, what is their gender, and on what basis did they request asylum? The list goes on.

The lawyers who submit these cases to our database are very busy people. Some of them do this full time, and will probably have more than a hundred cases going at any given time. Others work at big firms and do this pro bono. Either way, they aren’t going to be spending a lot of time breaking the cases down for us. If our code makes an error, they might notice and correct it, but our code has to make a first and fairly accurate pass at extracting all of this information from the legal opinions.

The One-Year Field

One of the pieces of information that HRF asked us to extract is about a filing deadline. There is a rule in immigration law that a person who wants to apply for asylum in the US must do so within one year of entering the US, unless there are some “changed” or “extraordinary” circumstances that justify a delay. HRF wants to know which cases involve an argument about an asylum applicant not meeting that one year deadline, and wants to be able to analyze patterns in those cases. I was tasked with improving the accuracy of the method that extracts this information.

As the code came to me from the previous cohort of LS students, it was 94.8% accurate, though I did not know this yet. We had manually scraped data on more than a hundred cases, but some of this manual scraping was incorrect. So my first step was to go through all of the cases where the code disagreed with the manual scraping, read those opinions, and either correct the manual data or make a note of why the code was getting the wrong answer and how it might be fixed.

The rule that had come to me was essentially to look for the phrases “changed circumstances” or “extraordinary circumstances” in the same sentence as a word about time (year, delay, time, period, or deadline). I had to adjust the rule to allow the words changed or extraordinary to appear in quotation marks in the opinions, and to also look for the phrases “untimely application” and “within one year” (with or without a hyphen, and with the number either spelled out or as a digit).

As the code came to me, it looked like this. Our code uses spaCy to tokenize and process the opinions; self.doc comes from that pipeline.

It gets the job done, but it doesn’t use the Matcher class built into spaCy for this purpose. Instead, it relies on Python lists and loops to process the entire document, with a lot of indexes and compound conditionals that take extra time to read and update.

My first pass at implementing my rules came out a bit cleaner, and substituted sets for term lists in an attempt to improve efficiency. It did get the accuracy up to 100% on our training data. (We didn’t have enough cases to set aside separate test data.) But it still used a lot of indexes and sets rather than the spaCy matcher. And because of that, I needed separate checks against different slices based on whether “within 1-year” had a hyphen in it, as spaCy sees the hyphen as a separate token. Here it is:

My next revision of the code uses the spaCy matcher. This results in a slight improvement in efficiency, and a definite improvement in readability. And it allows all four variants of “within 1-year” to be specified as a single pattern, making the structure of the code better match the conceptual structure of what we are searching for. Here it is:

As you’ll note, one of the phrases that I added, “within one year”, was returning True whether or not there was another time-based word in the sentence, as it seemed inefficient to check for more things and I already had 100% accuracy. (My other phrase, “untimely application”, did look for another time-based word. I don’t recall a reason for this difference).

Almost on a whim, I decided to check what happened if I just checked for time-based words regardless of which phrase was found. That change simplified the code, maintained the 100% accuracy on the training data, and was slightly more efficient. So I am making a final update to the code:

Conclusions

Some of the lessons I have taken from this experience are these:

  • Sometimes a simpler rule that checks slightly more things may be more efficient; try it and see.
  • Digging deeply into the data you are working with is essential. Understanding the thing in the world you are studying is at least as important to a data scientist as programming skills.
