Can LLMs be a tool for Science Writers?

ChatGPT can't turn science papers into plain English—at least not well enough for the Science Magazine team. After testing it for a year, science writers found it missed nuance, overhyped results, and defaulted to jargon. Where does that leave projects trying to democratise medical research?

Abigail Eisenstadt and her colleagues published an interesting white paper titled "Can ChatGPT Help Science Writers?"

This topic is important to me because I have been trying to achieve a good translation of technical abstracts into plain English summaries for the Gregory-MS project.

Gregory-MS runs on an open-source project called GregoryAi. It already does a decent job at filtering out the papers with hope for patients, but fails miserably at producing a decent summary. We'll come back to it later; first, the interesting findings from Abigail's white paper.

In short, the SciPak team wanted to know if ChatGPT could take a paper and produce a summary that followed their writing framework: a variation of the inverted pyramid style, to which they added their "5 bits outline".

The 5 bits outline is:
1. What is the very specific question these papers are trying to answer?

2. What methods did these papers use to explore the question? (If many methods, pls give a brief recap of those most important to getting the conclusions—not seeking insight on “validation” steps.)

3. What’s been the hold-up in the field?

4. How does this work move the field forward?

5. What lede do you imagine The New York Times would write for these papers?

The white paper also shares the 3 prompts used, and the last one seemed to be the most promising:

3. News-style: Prepare a news-style summary according to the below instructions:
a. The summary must consist of a single paragraph between 200 and 250 words in length and a clear and compelling title. It must open with an AP-style breaking news opening sentence that quickly and accurately conveys the main finding.

b. The summary should convey how the finding advances what’s been done in this field, even if others in the field have done very similar work and briefly describe the methods.

c. The summary must explain the implications of the work, communicating whether they are near or distant.

d. The summary must be free of jargon and include definitions for acronyms.

e. It is optional but advised that the summary includes some additional details, for example a sentence regarding a unique effort the authors made in their methods or analyses, if truly illustrative of the study.

The team spent a year trying out the results of these prompts, and ChatGPT didn't shine.

The tone of the writing was too hyped; it missed important highlights from multifaceted studies; and it fell back on jargon for complex studies. The effort required to fact-check and improve the output was as high as writing from scratch, possibly even higher.

The best conclusion from their test is that ChatGPT can be a good teaching tool. Scientists can use it to see examples of non-technical language and improve their communication.

What I learned from developing GregoryAi

One of the project's requirements is that it should be as self-contained as possible. This means we rarely depend on external APIs, limiting ourselves to services like CrossRef and ORCID, plus the open APIs we use to fetch papers and clinical trials. All of these services are free, so that everyone can install and run their own GregoryAi for Multiple Sclerosis, Alzheimer's, Parkinson's, or any other field.

We are using a module that tries to summarise the abstract into a paragraph, which we then post on Bluesky and Mastodon. It's called philschmid/bart-large-cnn-samsum, and you can find more details about it on Hugging Face. It fulfils the requirement of keeping API dependencies low, but the results are not great.
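For anyone curious, loading that model takes only a few lines with the Hugging Face transformers library. This is a minimal sketch, not GregoryAi's actual code; the length limits are illustrative assumptions.

```python
# Minimal sketch: summarise an abstract with philschmid/bart-large-cnn-samsum.
# The max/min lengths below are illustrative assumptions.
from transformers import pipeline

summariser = pipeline("summarization", model="philschmid/bart-large-cnn-samsum")

abstract = "..."  # the technical abstract fetched from one of the open APIs
result = summariser(abstract, max_length=130, min_length=30, do_sample=False)
print(result[0]["summary_text"])
```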

So, of course, I have played around with ChatGPT and other LLMs to see if I could get better results. None of them has given me confidence, mostly because I lack the neurology background to evaluate the quality of the output. And so do the LLMs. I also don't consider them an option to include in GregoryAi, because it would mean new users would need a paid account to access ChatGPT through the API. That would be a barrier to adoption.

Improving the output for the SciPak team

There are things that I haven't tried that could improve the output. The first would be to set up an agent workflow. One agent could extract the six WH questions (who, what, when, where, why, and how); a second could take that output together with the paper, identify any field-specific topics, and query a knowledge base for better context.

These two agents would be building up a prompt for two others: one that would draft the summary using the inverted pyramid, and another that would rewrite it using the 5 bits.
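To make the idea concrete, here is a minimal sketch in Python. Everything in it is an assumption: the function names, the prompts, and the call_llm callable that stands in for whichever model backend you wire up (a local open-source model, an API, and so on).

```python
# A rough sketch of the four-agent pipeline described above. All function
# names and prompts are hypothetical; call_llm stands in for any backend.
from typing import Callable

LLM = Callable[[str], str]  # takes a prompt, returns the model's reply

def extract_wh(paper: str, call_llm: LLM) -> str:
    """Agent 1: answer the six WH questions about the paper."""
    return call_llm(
        "Answer who, what, when, where, why, and how for this paper, "
        "one line each:\n\n" + paper
    )

def gather_context(paper: str, wh: str, kb: dict[str, str], call_llm: LLM) -> str:
    """Agent 2: spot field-specific terms and look them up in a curated knowledge base."""
    terms = call_llm(
        "List the field-specific terms a lay reader would not know, "
        "one per line:\n\n" + paper
    )
    notes = [kb[t] for t in terms.splitlines() if t in kb]
    return wh + "\n\nBackground notes:\n" + "\n".join(notes)

def draft_pyramid(paper: str, context: str, call_llm: LLM) -> str:
    """Agent 3: first draft, inverted-pyramid style."""
    return call_llm(
        "Context:\n" + context +
        "\n\nWrite an inverted-pyramid summary of this paper:\n\n" + paper
    )

def rewrite_five_bits(draft: str, call_llm: LLM) -> str:
    """Agent 4: rewrite the draft against the 5 bits outline."""
    return call_llm(
        "Rewrite this draft so it answers the 5 bits outline (specific "
        "question, methods, hold-up in the field, advance, NYT-style lede):\n\n"
        + draft
    )

def summarise(paper: str, kb: dict[str, str], call_llm: LLM) -> str:
    wh = extract_wh(paper, call_llm)
    context = gather_context(paper, wh, kb, call_llm)
    draft = draft_pyramid(paper, context, call_llm)
    return rewrite_five_bits(draft, call_llm)
```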

This is a multi-step approach, and the white paper mentions that ChatGPT wasn't given the option to rewrite its output. My approach doesn't guarantee quality, and each agent would need its own prompt tuned for the best results.

Not all LLMs are equal

I am not sure ChatGPT would be the best LLM for this job, but I haven't been benchmarking LLMs. Simon Willison and Ethan Mollick are better suited to answer that question. If I had to go with my gut, I would bet on Claude.

And then there is the fine-tuning step. The SciPak team used the generic version of ChatGPT, but given the data they already have (original papers paired with human-written summaries), wouldn't it be valuable to fine-tune an open-source LLM for this specific task?
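For the sake of illustration, fine-tuning a summarisation model on such pairs could look something like the sketch below, using Hugging Face's Seq2SeqTrainer with a BART checkpoint. The file name, column names, and hyperparameters are all my assumptions, not the SciPak team's setup.

```python
# Hedged sketch: fine-tune a BART checkpoint on paper/summary pairs.
# "scipak_pairs.csv" and its "paper"/"summary" columns are hypothetical.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

dataset = load_dataset("csv", data_files="scipak_pairs.csv")["train"]

def tokenize(batch):
    # Truncation limits are illustrative; papers may need chunking instead.
    inputs = tokenizer(batch["paper"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=256,
                       truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="scipak-bart",
                                  per_device_train_batch_size=2,
                                  num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```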

As a side note and reminder, João Nabais proved it is feasible to train an open-source LLM to help with tasks related to clinical trials without breaking the bank.

Context matters

I wonder if an LLM can identify the innovation published in a science paper at all. Not because of the corpus we are analysing, but because an LLM's training has a cut-off date, and it is sometimes hard to pass on fresh training data or up-to-date context.

My workaround for this comes from the 2,000 or so notes that I keep in Obsidian. I can group notes to attach context to my prompt, and since these notes are in Markdown, they are easy for the AI to process. That's why, in an agentic workflow, I would dedicate one step to figuring out what context may be missing from the article and trying to fetch it from a curated knowledge base. (Internal wikis could make a comeback thanks to the new LLM workflows.)
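That retrieval step can start out very simple. The sketch below scans a folder of Markdown notes and pulls in any note whose title appears in the article; the folder name, the input file, and the crude matching rule are assumptions for illustration, not how GregoryAi works today.

```python
# Toy sketch of the "fetch missing context" step: collect Markdown notes
# whose titles appear in the article text. The "vault" path is hypothetical.
from pathlib import Path

def fetch_context(article: str, notes_dir: str = "vault") -> str:
    matched = []
    for note in sorted(Path(notes_dir).glob("**/*.md")):
        if note.stem.lower() in article.lower():
            matched.append(f"## {note.stem}\n{note.read_text()}")
    return "\n\n".join(matched)

article = Path("paper.md").read_text()  # hypothetical input file
prompt = (fetch_context(article) + "\n\n---\n\n" + article +
          "\n\nSummarise the paper above in plain English.")
```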

Let's not give up

ChatGPT isn't ready to match a high-quality standard like the one guiding the SciPak team, but they have identified the main issues to solve: the over-hyped discourse, the difficulty in understanding multifaceted topics, and the inability to understand new concepts it wasn't trained on.

I think it's worth the effort to try an agentic workflow, or to see if other LLMs perform better than ChatGPT.

Finding a way to generate plain English summaries from science papers without depending on paid APIs would be a major win for the GregoryAi project. And it would be valuable for everyone, because we are trying to get a new website off the ground, this time dedicated to all areas of brain regeneration.

For patients like me, an accessible Brain Regeneration Knowledge Hub would be crucial to keep track of the latest research and take that burden from our healthcare teams. Can you imagine the empowerment given to a patient who is finally able to understand technical abstracts? This is a very different task from the one performed by the SciPak team.

For science writers, I would ask a different question: is this the AI workflow model we want? I would prefer to see use cases of AI-assisted refinement, or retrieval of related information, that could help increase the quality of science writers' work, not just their output.

Side Note: A piece of this blog post was written by AI. Can you find it?