Leveraging GPT-3 to read legal documents

Jan 16, 2023

Introduction

At Memorelab, we believe that machine learning and data science do not need to stay in the "R&D" stage - with the experience to choose the right problem to tackle and the right modern tools, one can build powerful applications fast and grow a business.

In this article, we will showcase our process for developing a proof-of-concept A.I. app in just two days, using modern tools such as Budibase, Dagster, and the OpenAI API.

Budibase is a low-code platform that allows us to easily build and deploy web applications without the need for extensive coding experience. Dagster is an open-source platform for building, testing, and deploying data pipelines, which is useful for cleaning and preprocessing data before it is fed into the A.I. model. Finally, the OpenAI API gives us access to powerful, pre-trained language models that we can use without training or hosting models ourselves.

By utilizing these modern tools, we were able to build the proof of concept in just two days - allowing us to test and validate the solution before investing a significant amount of resources in the project.

Problem

We are tackling the problem of identifying the “true price” of a real estate auction:

  • Real estate auctions can be very profitable business opportunities if the buyer has all the information about the property being auctioned

  • The problem is that obtaining all the necessary information can be a daunting task. The buyer needs to read through all the legal documents associated with the auction and the property itself. This can be a time-consuming and tedious process, especially for properties with complex legal documents.

  • Furthermore, the user interfaces (UIs) available in the market today only provide the auction's starting prices and do not factor in important information like debts that are carried by the property and should be paid by the buyer after the auction session. This lack of information can lead to an uninformed decision-making process and ultimately, a loss for the buyer.

Proof-of-concept solution

The solution involves using a simple data pipeline to extract basic information about the auctions and using the GPT-3 API to "read" the legal documents' important sections and extract the carry-over debt information. This will allow us to gather all the necessary information about the property being auctioned and present it to the buyer in an easy-to-understand format.

In addition to this, we will also create a basic front-end user interface (UI) that can be used to parse new auctions and show the results. This UI will allow users to:

  • quickly and easily find the properties they are interested in

  • view the relevant information about the property and the auction

This will make the decision-making process much more efficient for the buyer and increase their chances of making a profitable purchase (i.e. reducing the risk of uninformed decision-making).

Tech stack

For the front end, we tested Budibase (https://budibase.com/) as our main tool. Budibase is a low-code tool that allows us to build internal tools. It promises to create CRUD (create, read, update, delete) applications in minutes instead of days, which sounds like exactly what we need for our fast proof of concept.

For the backend, we experimented with Dagster for workflow execution. Dagster is an open-source platform for “building modern data applications”. We also used PostgreSQL as our main database - a powerful, open-source object-relational database system that is battle-tested and easy to use.

For the "A.I." component, we leveraged the OpenAI API GPT-3 endpoints together with the langchain framework. Langchain is a framework for building large language model applications through composability, which enables more effortless operability with large language models. This allowed us to easily build our LLM (Large language model) interactions.

Description

The data

Our proposed solution is targeting a specific website, MegaLeiloes (https://www.megaleiloes.com.br/), which is a Brazilian platform for real estate auctions.

We are focusing on "Leilões Judiciais" (i.e. judicial auctions), where the largest discounts are applied and where the property usually carries debt. These types of auctions are known for having a lot of legal documentation and detailed information about debts, which is crucial for our solution.

From the MegaLeiloes website, we aim to extract the main fields that are already structured:

  • First auction price (i.e. 1ª praça)

  • Second auction price (i.e. 2ª praça)

  • Property address

These fields are essential for our solution, as they allow us to present the buyer with all the necessary information about the property and the auction in an easy-to-understand format. The only information lacking is the debts, which we obtain using the A.I. component described below.
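As a rough illustration of that basic extraction, a scraping step for these fields could look like the sketch below. The CSS selectors are made-up placeholders, since the real MegaLeiloes markup has to be inspected, and requests/BeautifulSoup are simply one possible choice of tooling.

import requests
from bs4 import BeautifulSoup


def scrape_auction_page(url: str) -> dict:
    # Fetch the auction page and pull out the already-structured fields.
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    def text_or_none(selector: str):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    return {
        "url": url,
        "first_auction_price": text_or_none(".first-auction .price"),    # "1ª praça" (placeholder selector)
        "second_auction_price": text_or_none(".second-auction .price"),  # "2ª praça" (placeholder selector)
        "address": text_or_none(".property-address"),                    # placeholder selector
    }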

The final component of our pipeline persists the data in the database. This allows us to store the information that has been extracted and enriched by the pipeline so that it can later be consumed by our user interface (UI).

The pipeline is triggered by a sensor that checks every 30 seconds for URLs to scrape in a separate table called scraped_pool. This sensor is responsible for checking whether there are any new URLs that need to be processed by the pipeline.

The scraped_pool table is where our front-end app creates records to be processed by the pipeline. This allows for seamless integration between the front-end and the pipeline, making it easy for the user to add new URLs for processing and for the pipeline to pick them up and process them.
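To make this concrete, a Dagster sensor for the polling step might look like the minimal sketch below. The job name (auction_pipeline), the op name in the run config (scrape_auction, matching the pipeline sketch further down), the scraped_pool columns (id, url, processed), and the connection string are all illustrative assumptions, not the actual implementation.

import psycopg2
from dagster import RunRequest, SkipReason, sensor

from .jobs import auction_pipeline  # hypothetical module holding the job sketched in "The pipeline"


@sensor(job=auction_pipeline, minimum_interval_seconds=30)
def scraped_pool_sensor(context):
    # Poll the scraped_pool table for rows the front-end has added but the
    # pipeline has not processed yet (column names are illustrative).
    conn = psycopg2.connect("dbname=auctions user=app")  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute("SELECT id, url FROM scraped_pool WHERE processed = false")
        rows = cur.fetchall()

    if not rows:
        yield SkipReason("No new URLs in scraped_pool")
        return

    for row_id, url in rows:
        # The run_key makes the request idempotent: the same row never triggers two runs.
        yield RunRequest(
            run_key=f"scraped-pool-{row_id}",
            run_config={"ops": {"scrape_auction": {"config": {"url": url}}}},
        )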

The pipeline

For the pipeline workflow, we experimented with Dagster. From our limited experience with Dagster, we found it easy to get started with and to integrate with existing Python code.

We found that there is definitely a learning curve to understand how things work and how you are supposed to organize your code. Packaging all the Docker files to deploy to a (self-hosted) production environment is also a bit of a challenge.

There is a bare-bones deployment repo, but it needs some undocumented modifications to adapt it to your needs. We imagine that using their hosted service would make the move to production much easier.

Our main pipeline is very simple. The first job gets the information from the auctioneer’s website with minimal information extraction. The second job (i.e. parse_scraped_data) gets the document information and uses the OpenAI API to extract the debt information.

This job takes the minimal information and enriches it with additional data, which is then passed on to the next step in the pipeline. This simple pipeline is able to effectively extract the necessary information from the auctioneer’s website and enrich it with the output of the OpenAI API.
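As a condensed illustration, the two steps could be written as a single Dagster job with two ops, as sketched below. The real pipeline splits the steps into separate jobs; the op names, the scrape_auction_page function (sketched in "The data" section), and the extract_debts_with_gpt3 helper are illustrative placeholders.

from dagster import job, op


@op(config_schema={"url": str})
def scrape_auction(context) -> dict:
    # First step: pull the structured fields (prices, address) from the auction page.
    url = context.op_config["url"]
    record = scrape_auction_page(url)  # basic extraction sketched in "The data" section
    context.log.info(f"Scraped {url}")
    return record


@op
def parse_scraped_data(context, record: dict) -> dict:
    # Second step: enrich the record with the carry-over debt extracted by the LLM.
    record["debts"] = extract_debts_with_gpt3(record)  # hypothetical helper, see "The A.I. component"
    return record


@job
def auction_pipeline():
    parse_scraped_data(scrape_auction())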



The A.I. component

This is the most “experimental” component of our stack. The usual way of dealing with the problem of extracting information from legal documents would mean using tons of regex with ad-hoc rules and would require a lot of maintenance after deployment. This is because these ad-hoc rules often break and the pipeline stops working, requiring constant monitoring and maintenance.

To overcome this challenge, we decided to formulate the problem as a Question and Answer language problem. In this approach, we pass the document information and instruct the A.I. model on which information we are looking for. The model then uses its natural language understanding capabilities to extract the required information. This approach is more robust and requires less maintenance as the model can adapt to changes in the documents.

However, this approach also comes with its own set of challenges. One of the biggest challenges we faced was that the legal documents are in Portuguese, which is not the language GPT-3 performs best in. Additionally, the documents are too big to be sent straight to the model, and sending a lot of information you do not need also has a cost. To overcome these challenges, we had to develop strategies for processing the documents, such as breaking them down into smaller chunks and using other natural language processing techniques to extract the most important parts of the documents.

Solution

We used langchain to build our A.I. process together with the OpenAI "text-davinci-003" model. Langchain is an open-source tool that allows developers to easily build applications on top of LLMs such as GPT-3. We experimented with the different models that OpenAI makes available, but the best results came from text-davinci-003, both in terms of accuracy and output consistency.

The prompt is the standard Q&A formula provided by langchain, but with a twist at the end. We ask the GPT model to format its answer as a JSON object, using the TAG and TYPE marks provided in the context for each question. This enables us to parse the model's answer consistently in the pipeline and use the results programmatically.

The prompt used to query the model is:

""" Use the following pieces of context to answer the question at the end. {context} Question: {question} Format your answer to a JSON object using the question TAG as the key value in the JSON object. The value should be the TYPE available at the question or null if there is not enough information. Return only the JSON object as the answer """

The {context} is restricted since there is a limit to how much text we can send to the GPT model. Thus, we needed to do a basic extraction of excerpts that contained useful information for answering the questions. We applied a keyword search using words like “debt” (“débitos”) and “R$” (monetary values) and extracted the neighboring words to define our excerpts of interest. Those excerpts were the ones passed to the model in the context area.
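A minimal sketch of this excerpt extraction is shown below; the keyword list and the window size are illustrative and would need tuning against real documents.

KEYWORDS = ["débito", "débitos", "R$"]  # illustrative keyword list


def extract_excerpts(document: str, window: int = 40) -> str:
    # Keep a window of neighboring words around every debt-related keyword.
    words = document.split()
    excerpts = []
    for i, word in enumerate(words):
        if any(keyword.lower() in word.lower() for keyword in KEYWORDS):
            start, end = max(0, i - window), min(len(words), i + window)
            excerpts.append(" ".join(words[start:end]))
    return "\n---\n".join(excerpts)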

The {question} entries are structured as follows (a sketch of how they are assembled and parsed appears after the list):

  • TAG: "q1" - TYPE: text - Question: What is the price of '1ª praça'?

  • TAG: "q2" - TYPE: text - Question: What is the price of '2ª praça'?

  • TAG: "q3" - TYPE: text - Question: What are the debts values?

  • TAG: "q4" - TYPE: text - Question: What is the address?

Langchain enabled us to iterate quickly and experiment with different prompts until we found one that worked reliably. This iterative approach was key to tuning the prompt and achieving the best results.


The frontend

For our front-end stack, we experimented with Budibase. From our limited experience, we found it to be very straightforward to use. To start, all we had to do was connect to our database and jump into building our components. Deployment was also a breeze, as it only required the press of a button. We didn't have to spend a lot of time reading the documentation to be able to use the app effectively.

One of the great features of Budibase is its "real-time" connectivity with the database, which made it easy for us to try things out and see the results immediately. Additionally, we were able to encapsulate our components in a "front-end'ish" style, although it is still missing some configurations to adjust component sizes and display. For example, max-width/height is not available and support for responsive displays is still lacking. Despite this, we were still able to create a simple but effective UI.

Our main component for checking property auctions (PoC Home UI)

Another great thing about Budibase is the out-of-box themes it offers. We applied the black theme and it looked great. Overall, we found Budibase to be a great tool for creating simple and effective front-end interfaces.

Final remarks

The team was able to quickly spin up an end-to-end natural language processing (NLP) powered application within two days of work. We consider the current stack sufficient to accommodate additional features and robust enough to support a better user interface in the future. We also showed that it is possible to employ large language models (LLMs) for complex data processing.

As an additional note, we consider that tools like Budibase and other low-code user interface builders can be extremely useful for rapidly developing graphical interfaces that are easily connected to data. They enable fast iteration of ideas and can be a valuable addition to the development process.