Affordable AI for Clinical Trials, the story of João Nabais

João proved that small organisations can train effective AI models to save time and reduce costs. His work opens possibilities for open science and clinical research.


It's strange that the web we created doesn't help bring people who do great things into the spotlight. We see the same in academia, and it probably pre-dates the web. How many theses are archived by libraries, never to be read again? How many bright ideas never reached the right people?

I am hoping that João Nabais won't be another unsung, forgotten hero. João finished his Master's in Electrical and Computer Engineering at the University of Lisbon. His father and sister are both doctors, and fortunately for us, he took a different path.

For his Master's thesis, he made it possible for small and medium-sized organisations to train their own Large Language Model (LLM) for clinical trials.

We had a chat where he explained that these models are usually trained for specific uses on dedicated Graphics Processing Units (GPUs), which can be purchased for about 15,000€. It is also possible to rent a GPU by the hour and run the training on it.

But it can take a long time to train an LLM on a specific task.

"The way I did it, using a smaller model, it can take one or two days to train the LLM, or more. It depends on other factors, and I had to keep an eye on the process to stop the training if something wasn't working."

Even if you are renting a GPU to handle the processing, the cost will grow with each iteration.

"It's hard to be precise. Training a model depends on the size of the data, how much of it is annotated with the expected output, and the required level of confidence. And clinical trials are harder to process than articles. Looking at current prices, a GPU for this training can cost $3.39 per hour and take something like 48 to 72 hours for a dataset of 5,000 clinical trials."
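Taking the figures quoted above at face value, a back-of-the-envelope estimate of what one rented training run would cost looks like this (the hourly rate and the 48-to-72-hour window are the numbers João mentioned; everything else is illustration):

```python
# Rough cost sketch for one LLM training run on a rented GPU,
# using the quoted price of $3.39/hour and 48-72 hours of training.
HOURLY_RATE = 3.39  # USD per hour, from the quote above

def training_cost(hours: float, rate: float = HOURLY_RATE) -> float:
    """Total rental cost, in USD, for a single training run."""
    return round(hours * rate, 2)

low, high = training_cost(48), training_cost(72)
print(f"one training run: ${low} to ${high}")
# → one training run: $162.72 to $244.08
```

So a single pass lands somewhere between roughly $160 and $250, and the bill grows with every iteration that has to be restarted.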

Once trained, the model can be run locally to infer information from the clinical trials, at no cost.

For Gregory-MS, we use Machine Learning (ML) to filter articles using only their title and abstract. An ML model is easier to train and requires less computing power; it can be done well enough on a laptop.
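To make that concrete, here is a minimal sketch of the kind of laptop-scale classifier this describes; it is not Gregory-MS's actual pipeline, and the toy articles and labels are invented for illustration:

```python
# A minimal sketch of filtering articles by title + abstract with a
# classic ML classifier: TF-IDF features plus logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 1 = relevant, 0 = not relevant.
articles = [
    "Ocrelizumab in relapsing multiple sclerosis: phase 3 results",
    "Gut microbiome changes in multiple sclerosis patients on therapy",
    "Deep learning for traffic sign recognition",
    "A survey of graph databases for social networks",
]
labels = [1, 1, 0, 0]

# Fit the whole pipeline on raw text; no GPU required.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(articles, labels)

# Score a new title: probability that it belongs to the relevant class.
proba = model.predict_proba(
    ["Natalizumab efficacy in progressive multiple sclerosis"]
)[0][1]
print(f"relevance score: {proba:.2f}")
```

A real deployment would train on thousands of labelled title-abstract pairs, but the shape of the task is the same: one input, one yes/no decision.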

Training the LLM is one step; running it daily is another. But using the trained model requires less processing, and therefore less infrastructure.

The main difference is that an ML model does a single task, while an LLM can potentially be applied to multiple tasks and plugged into a workflow of several AI Agents.

A practical and important use for João's model is matching patients to clinical trials. This saves time and tackles an important barrier in recruiting participants. Another would be to take data from different clinical trial registries and clearly identify which disease is under study without requiring uniform and structured data.

Is it safe?

Unlike other uses of AI in health research, there isn't much danger for participants, because the model works by inference on data that was already screened for quality, rather than generating text from nothing.

This is similar to what happens when we run Google's NotebookLM on a PDF, or tell ChatGPT to respond only in accordance with a set of files.

Science is only finished when you tell people about it

João's thesis will be available on the university's website, but fortunately for us, he also published the code on GitHub.

GitHub is where most developers, and other enthusiasts like me, gather to share open-source code and collaborate on improving the software that comes out of it.

With the code available at the link below, a competent developer should be able to run the training on a new dataset of clinical trials.

GitHub - joaoassisnabais/SemEval2024-Task2: A repository that contains public code which was used to submit runs to SemEval2024, specifically in the context of Task 2: Safe Biomedical Natural Language Inference for Clinical Trials

The other piece of the puzzle comes from Hugging Face, "the platform where the machine learning community collaborates", where the open-source LLM he used can be downloaded for training and deployment.

meta-llama/Meta-Llama-Guard-2-8B · Hugging Face

GregoryAi joins in

If you have been following me for a while, you knew this was coming. João's work is the missing piece for GregoryAi, for three reasons.

Clinical trials are one area where we still aren't adding much value.
We simply consolidate data from the three main registries and send out email alerts whenever there is a new trial.

João's approach to training an LLM doesn't compromise independence.
Most AI apps and software I find rely on OpenAI and other third-party services to run their AI tasks. That is a liability for a project that wants to be technically sustainable and needs to keep operational costs under control.

Both works are open-source software.
Packaging the LLM training into GregoryAi will let anyone click one button to gather training data from real-world sources, and another to start the training. There will still be some infrastructure requirements, but those are out of our control.

I don't know how, or even if, I will manage to put these two pieces together, but not trying would be a disservice to the work.