Flattery and pitfalls of AI in science

Illustration of an AI algorithm

What are the possibilities and limitations of AI in scientific research? Data scientist Andres Algaba (32) closely follows the rapid developments in AI, contributes to the university’s AI policy, and focuses on the reliability and transparency of large language models. “What was new last year is now considered normal.”

There’s hardly a student or researcher today who hasn’t asked ChatGPT a question. Some scientists go a step further and outsource aspects of their research, such as literature reviews, to save time. Others dream of a future where AI can conduct their entire research process—from hypothesis to publication. Is that possible? And what should we be cautious about before we get there? As an FWO postdoctoral researcher, Andres Algaba focuses on the transparency of large language models. He is acutely aware of AI’s impact on scientific practice.

What is the great promise of AI for scientific research?
Andres Algaba: “The promise is, of course, to automate scientific research—at least partially. Unlike humans, computers don’t need to eat or sleep, they can run 24/7 and operate multiple systems simultaneously. Automation through AI could lead to a massive acceleration of scientific research.

We’ve been dreaming of that acceleration for a while, but recent developments in large language models (LLMs) offer new possibilities. Algorithms can now suggest innovative hypotheses within a scientific field—hypotheses that are comparable to those proposed by real scientists.

With the advent of AI agents, full automation is no longer far off. One system generates a hypothesis, another provides feedback, and yet another conducts the research. Two years ago, you’d have called me crazy for predicting this, but things are moving incredibly fast. What was new last year is now standard.”

"The model has no bad intentions, it's simply been trained that way"

Suppose we automate research—what’s the major risk?
“Large language models are trained with human feedback. Before a model works well, it must learn our values, norms, and expectations of an assistant. This training involves presenting the model with questions or tasks and ranking its responses from least to most helpful.

Now, imagine that the human trainer rewards hypotheses and results that are most publishable. The model will then do everything it can to achieve that goal. It might manipulate data and select only the most significant results. It’s not that the model has bad intentions—it’s simply trained that way.”
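The incentive problem Algaba describes can be made concrete with a toy simulation (this is an illustration of selective reporting, not any real training setup): if the "reward" scores results by publishability, an agent maximising it learns to report only the significant subset.

```python
import random

random.seed(0)

def publishability_reward(p_value):
    # Hypothetical reward for illustration: 1 if the result clears
    # the conventional p < 0.05 threshold, else 0.
    return 1.0 if p_value < 0.05 else 0.0

# Simulate 20 experiments with random p-values (pure noise).
p_values = [random.random() for _ in range(20)]

# An agent trained to maximise this reward is incentivised to report
# only the "significant" results: classic selection bias.
reported = [p for p in p_values if publishability_reward(p) == 1.0]

print(f"ran {len(p_values)} experiments, reported {len(reported)}")
print(f"every reported result is 'significant': {all(p < 0.05 for p in reported)}")
```

Even though the simulated data contain no real effect at all, everything that survives the reward filter looks publishable, which is exactly the failure mode Algaba warns about.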

Most research isn’t automated yet. Are there major pitfalls when researchers test hypotheses using AI?
“People like to be right, and that’s something the models have learned from human feedback. We tend to rate answers that begin positively (‘This is a good hypothesis, here are a few improvements’) higher than those that start critically (‘This is a poor hypothesis, here’s how to improve it’), even if the latter is more accurate.

Large language models have learned that a little dishonesty is acceptable if it pleases the user. That sycophancy (flattery by a language model, ed.) is a major problem for science, because science is fundamentally about truth. As a scientist, you need to learn how to frame your questions in a way that doesn't reveal which answer you're hoping for, otherwise you'll only reinforce confirmation bias.”
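The framing advice can be sketched with a crude, purely illustrative heuristic (the cue list is an assumption, not a validated detector): a "leading" prompt signals the answer the asker hopes for, while a neutral prompt asks for arguments on both sides.

```python
# Illustrative only: two ways of asking a model about the same hypothesis.
leading = (
    "I think my hypothesis that X causes Y is strong. "
    "Can you confirm it?"
)
neutral = (
    "Here is a hypothesis: X causes Y. "
    "List the strongest arguments for and against it."
)

def signals_preference(prompt):
    # Crude demonstration heuristic: look for wording that telegraphs
    # the preferred answer. Real sycophancy is subtler than this.
    cues = ("i think", "confirm", "agree", "am i right")
    return any(cue in prompt.lower() for cue in cues)

print("leading prompt signals a preference:", signals_preference(leading))
print("neutral prompt signals a preference:", signals_preference(neutral))
```

A sycophantic model tends to echo back whatever preference the prompt telegraphs, so stripping those cues, as in the neutral version, is the practical habit Algaba recommends.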

You helped develop algorithms at VUB that interrogate other algorithms. Can these expose biases?
“To some extent, yes. We developed an algorithm that examined the citation behaviour of large language models. What we found was that when you ask an LLM for a literature review, it tends to cite papers with shorter titles and fewer authors. So there are systematic biases in citation behaviour, which is problematic because those criteria don’t necessarily lead to the most relevant papers.”
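The kind of check Algaba describes can be sketched in a few lines (the numbers below are invented for illustration; the real study's data and methodology are not reproduced here): compare simple surface features of LLM-cited papers against a reference corpus.

```python
# Hypothetical example data: title length (in words) and author count
# for a reference corpus versus papers an LLM cited.
corpus = [
    {"title_words": 14, "n_authors": 6},
    {"title_words": 11, "n_authors": 4},
    {"title_words": 16, "n_authors": 8},
    {"title_words": 9,  "n_authors": 3},
]
llm_cited = [
    {"title_words": 7, "n_authors": 2},
    {"title_words": 8, "n_authors": 3},
    {"title_words": 6, "n_authors": 2},
]

def mean(papers, key):
    # Average of one feature across a set of papers.
    return sum(p[key] for p in papers) / len(papers)

for key in ("title_words", "n_authors"):
    print(key, round(mean(llm_cited, key), 1),
          "vs corpus", round(mean(corpus, key), 1))
```

In this toy data the LLM-cited set skews toward shorter titles and fewer authors, the same systematic pattern the interview describes.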

"Just because an an algorithm makes the selection, does not mean you don't discriminate anymore"

What’s the solution? A ‘fairer’ tool?
“Technically, it’s easy to fix. But there’s no clear answer to the question: ‘What is desirable citation behaviour in science?’ I don’t think the solution lies in another algorithm. It’s about awareness. Students and researchers need to understand that if they use AI to compile a literature list, they’ll likely get papers with shorter titles and fewer authors. The message is: be cautious. Know that if AI creates your literature review, you might miss important papers.”

AI is here to stay in research. What skills will students and researchers need in the future?
“Some processes will be partially automated. That will change the nature of our jobs. We need to think about this—not just as a university, but as a society.

It’s important to learn how to collaborate effectively with these models. That’s not straightforward, and I think it partly explains why adoption rates aren’t very high yet. Sometimes it seems easier to let the model do things on its own than to collaborate with it as a human.

For example, there was an experiment where three groups received the same patient files: one group of doctors with access to GPT-4, one without, and GPT-4 itself. All three had to make diagnoses based on the files.

What happened? GPT-4 alone performed better than the doctors without GPT-4. But surprisingly, the doctors with GPT-4 performed slightly worse than their colleagues without it.

Why? Those doctors made their own diagnosis first and then checked it against GPT-4. That opened the door to sycophancy—the system reinforced incorrect assumptions. To collaborate better with these models, we’ll need to learn how to avoid sycophancy.”

"A model doesn't write or interview like a professional would"

If science becomes automated, what role remains for the scientist?
“We might need more scientists to keep up with the pace at which LLMs produce scientific output. AI generates far more results that we, as scientists, must assess and process.

Scientists will also need to decide which questions we want to answer. You can’t physically conduct all experiments at once. Some argue that smarter models should decide which research to pursue. But is that what we want?”

And science communication—will that become AI’s job?
“You can ask AI to communicate your research to policymakers or specific audiences. But if you give a generic prompt, you’ll get a generic response. A model doesn’t write or interview like a professional would.

How could you expect that from a model trained on the entire internet? Hopefully, someone professionally engaged in science communication doesn’t sound like the average of the internet.

But if you know exactly what you want and clearly define what paths you don’t want to take, AI can be very useful. If you just want to get science communication over with quickly, you’ll get the first generic text it produces.”

AI is on a roll

Since this interview was published in August 2025, a great deal has changed in the world of AI. For instance, a robust European LLM has suddenly appeared on the market, developed by the French start-up Mistral. “But the gap with their American competitors is still very wide,” Algaba qualifies. “To begin with, they have a much smaller budget, which means they cannot train their model as intensively. Furthermore, the ecosystem surrounding this small company is not of the same calibre as the one in which firms such as Google, OpenAI and Anthropic operate. In terms of infrastructure and energy supplies, the latter have a significant lead.”

According to Algaba, we are already seeing much better results from agentic AI, although it isn’t always called that. “If you work in ChatGPT’s Thinking mode and give the tool access to certain parts of your computer rather than just the web (think, for example, of certain apps and some of your own data), you can have the tool perform a sequence of steps. For example, I saw that someone had asked his AI tool to convert a complex paper into a PowerPoint presentation in order to understand it better. I’ve done that myself as well recently.”

The results produced by AI tools are also improving all the time; according to Algaba, there are fewer hallucinations and even the sycophancy is decreasing, at least if you use the right tools. He himself has recently been working more and more with Claude Code from Anthropic and Codex from OpenAI. Another noticeable difference compared to last summer is that the systems generate results much faster than ever before. According to him, this is currently the challenge of working with AI responsibly: “That data still needs to be validated by humans, but that is becoming increasingly difficult due to the speed at which it comes at us. At some point, choices have to be made: do you thoroughly check whether the result has been produced in the correct way, or do you have sufficient confidence in the system and limit yourself to the basics?”

Andres Algaba  is an FWO postdoctoral researcher at the Data Analytics Lab of VUB. His main research interests include automated science and innovation with large language models, reliability and transparency of LLMs, and the science of science. He is also a member of the Young Academy of Belgium.
