Keeping up with the latest research is vital for scientists, but with millions of scientific papers published every year, that can prove difficult. Artificial intelligence systems show promise for quickly synthesizing seas of information, but they still tend to make things up, or hallucinate. For instance, when a team led by researchers at the University of Washington and the Allen Institute for AI (Ai2) studied a recent OpenAI model, GPT-4o, they found it fabricated 78% to 90% of its research citations. And general-purpose AI models like ChatGPT often can't access papers that were published after their training data was collected.
So the UW and Ai2 research team built OpenScholar, an open-source AI model designed specifically to synthesize current scientific research. The team also created the first large multi-domain benchmark for evaluating how well models can synthesize and cite scientific research. In tests, OpenScholar cited sources as accurately as human experts, and 16 scientists preferred its responses to those written by subject experts 51% of the time. The team published its findings Feb. 4 in Nature. The project's code, data and a demo are publicly available and free to use.
"After we started this work, we put the demo online, and quickly we got a lot of queries, far more than we'd expected," said senior author Hannaneh Hajishirzi, a UW associate professor in the Paul G. Allen School of Computer Science & Engineering and senior director at Ai2. "When we started looking through the responses, we realized our colleagues and other scientists were actively using OpenScholar. It really speaks to the need for this sort of open-source, transparent system that can synthesize research."
Researchers trained the model and then created a set of 45 million scientific papers for OpenScholar to pull from to ground its answers in established research. They coupled this with a technique called retrieval-augmented generation, which lets the model search for new sources, incorporate them and cite them after it's been trained. "Early on, we experimented with using an AI model with Google's search data, but we found it wasn't very good on its own," said lead author Akari Asai, a research scientist at Ai2 who completed this research as a UW doctoral student in the Allen School.
"It might cite some research papers that weren't the most relevant, or cite just one paper, or pull from a blog post randomly. We realized we needed to ground this in scientific papers. We then made the system flexible so that it could incorporate emerging research results."

To test their system, the team created ScholarQABench, a benchmark against which to test systems on scientific search. They gathered 3,000 queries and 250 long-form answers written by experts in computer science, physics, biomedicine and neuroscience.
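The retrieval-augmented generation approach described above can be illustrated with a toy sketch. This is not OpenScholar's actual pipeline: a real system uses a learned retriever over the 45-million-paper datastore and a trained language model for generation, while this example substitutes naive word-overlap ranking and a string-formatting stand-in for the generator. All names and the tiny corpus here are invented for illustration.

```python
# Toy sketch of retrieval-augmented generation (RAG):
# (1) retrieve the most relevant documents for a query,
# (2) generate an answer grounded in, and citing, those documents.

def retrieve(query, corpus, k=2):
    """Rank papers by naive word overlap with the query.
    (Stand-in for a real dense or sparse retriever.)"""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda paper: len(q_words & set(paper["abstract"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate_with_citations(query, papers):
    """Stand-in for the language model: compose an answer that
    explicitly cites the retrieved papers by ID."""
    cites = ", ".join(f"[{p['id']}]" for p in papers)
    return f"Answer to {query!r}, grounded in retrieved sources {cites}."

# Invented three-paper corpus for the demo.
corpus = [
    {"id": "P1", "abstract": "transformer models for protein structure prediction"},
    {"id": "P2", "abstract": "retrieval augmented generation reduces hallucination"},
    {"id": "P3", "abstract": "graph neural networks for molecule design"},
]

query = "how does retrieval augmentation reduce hallucination"
top = retrieve(query, corpus)
print(generate_with_citations(query, top))
```

Because the retrieval step happens at query time rather than training time, the same mechanism lets a system pull in papers published after the model was trained, which is the limitation of general-purpose chatbots noted earlier.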
"AI is getting better and better at real-world tasks," Hajishirzi said. "But the big question ultimately is whether we can trust that its answers are correct."

The team compared OpenScholar against other state-of-the-art AI models, such as OpenAI's GPT-4o and two models from Meta. ScholarQABench automatically evaluated the AI models' answers on metrics such as their accuracy, writing quality and relevance.
OpenScholar outperformed all the systems it was tested against. The team had 16 scientists review answers from the models and compare them with human-written responses. The scientists preferred OpenScholar's answers to human answers 51% of the time, but when the team combined OpenScholar's citation methods and pipelines with GPT-4o, a much bigger model, the scientists preferred the AI-written answers to human answers 70% of the time. They picked answers from GPT-4o on its own only 32% of the time.
"Scientists see so many papers coming out every day that it's impossible to keep up," Asai said. "But the existing AI systems weren't designed for scientists' specific needs. We've already seen a lot of scientists using OpenScholar, and because it's open source, others are building on this research and already improving on our results. We're working on a follow-up model, DR Tulu, which builds on OpenScholar's findings and performs multi-step search and information gathering to produce more comprehensive responses."
