The following is an incomplete list of research questions for methods research that would help enable the responsible use of AI in evidence synthesis.
General
Dealing with unreliable human annotations
See Dealing with imperfections in validation data
Can we correct disagreements between human-annotated and AI-annotated data without biasing our evaluations of the AI? Conversely, if we leave disagreements uncorrected, do we bias our evaluations? How should we approach this?
Under what conditions do more human annotators improve the fidelity of gold-standard data? Does accuracy always approach 1 with more annotators?
Can we infer the number of annotators needed to achieve a given level of fidelity to ground-truth data from measures of inter-rater reliability? (See the sketch below.)
Does a ground truth exist? How do we deal with this epistemologically? What would superhuman performance mean?
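For the questions about annotator numbers and inter-rater reliability, a small simulation offers some intuition: under the strong assumption that annotators err independently and each is right more often than not, majority-vote accuracy does approach 1 as annotators are added, whereas correlated errors or genuinely ambiguous items make it plateau. The per-annotator accuracy used below is an illustrative assumption, not an empirical estimate.

```python
import math

def majority_vote_accuracy(p_correct: float, n_annotators: int) -> float:
    """Probability that the majority vote of n independent annotators is correct,
    given each annotator labels a binary item correctly with probability p_correct.
    Requires an odd number of annotators to avoid ties."""
    if n_annotators % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    k_min = n_annotators // 2 + 1  # smallest possible majority
    return sum(
        math.comb(n_annotators, k) * p_correct**k * (1 - p_correct) ** (n_annotators - k)
        for k in range(k_min, n_annotators + 1)
    )

# Illustrative per-annotator accuracy of 0.8 with growing annotator panels:
for n in (1, 3, 5, 7, 9):
    print(f"{n} annotators -> majority-vote accuracy {majority_vote_accuracy(0.8, n):.3f}")
```

Linking an assumed per-annotator accuracy to observed inter-rater reliability measures such as Cohen's kappa is itself part of the open question above.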
Managing uncertainty
Are binomial confidence intervals for recall and precision a good baseline for quantifying uncertainty around performance metrics? (See the sketch below.)
Can confidence intervals be narrowed?
By using Bayesian statistics and priors based on performance on similar tasks?
By estimating jointly across predicted categories / tasks?
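As a concrete baseline for the questions above, recall can be treated as a binomial proportion (true positives out of all relevant records) and given a Wilson score interval; one possible way of narrowing it is a Beta prior loosely informed by performance on similar tasks. The sketch below is only illustrative: the prior parameters and the example counts are assumptions, and SciPy is used for the Beta quantiles.

```python
import math
from scipy.stats import beta  # used only for the Bayesian credible interval

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion,
    e.g. recall = true positives / all relevant records."""
    if trials == 0:
        return (0.0, 1.0)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def beta_credible_interval(successes: int, trials: int,
                           prior_a: float = 8.0, prior_b: float = 2.0,
                           level: float = 0.95) -> tuple[float, float]:
    """Equal-tailed credible interval under a Beta(prior_a, prior_b) prior.
    The default prior (roughly: 'recall tended to be high on similar tasks')
    is an illustrative assumption, not a recommendation."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    tail = (1 - level) / 2
    return (float(beta.ppf(tail, a, b)), float(beta.ppf(1 - tail, a, b)))

# Example: the tool found 45 of 50 relevant records in a labelled test set.
print(wilson_interval(45, 50))
print(beta_credible_interval(45, 50))
```

The same construction applies to precision by swapping the denominator for the number of predicted positives.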
Automation and the results of systematic reviews
How do errors, and uncertainty about those errors, compound across tasks when automation is used for multiple stages of a systematic review?
How do errors affect the results of systematic reviews?
How can we incorporate uncertainty around the accuracy of specific tasks (screening, data extraction, critical appraisal) into the overall uncertainty in our results? (See the sketch below.)
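As a toy illustration of how such uncertainty might be propagated, the sketch below multiplies per-stage accuracy draws from Beta posteriors, which assumes the stages err independently; that independence assumption, the uniform priors, and the evaluation counts are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative (made-up) evaluation counts per automated stage:
# (records handled correctly, records evaluated).
stages = {
    "screening":          (92, 100),
    "data extraction":    (85, 100),
    "critical appraisal": (88, 100),
}

n_draws = 10_000
end_to_end = np.ones(n_draws)
for correct, total in stages.values():
    # Beta(1 + correct, 1 + errors) posterior for per-stage accuracy,
    # assuming a uniform prior and independence between stages.
    end_to_end *= rng.beta(1 + correct, 1 + (total - correct), size=n_draws)

print("median end-to-end accuracy:", round(float(np.median(end_to_end)), 3))
print("95% interval:", np.round(np.quantile(end_to_end, [0.025, 0.975]), 3))
```

In a real review the effect of a missed or mis-extracted study on the pooled estimate depends on which study it is, so multiplying accuracy-like quantities is only a crude proxy for the question above.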
Optimising the distribution of labour between humans and machines
How can we quantify the costs and benefits of conducting evidence synthesis tasks by hand versus with different AI approaches?
How can we manage trade-offs and allocate resources efficiently?
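One possible starting point is an expected-cost model that prices per-record effort alongside a penalty for missed relevant studies; every figure below, including the hypothetical cost of a missed study, is an illustrative assumption rather than an estimate.

```python
def expected_cost(n_records: int, cost_per_record: float, miss_rate: float,
                  cost_per_missed_study: float, prevalence: float) -> float:
    """Expected total cost of a screening approach: per-record effort plus the
    expected penalty for relevant studies that are missed. All parameters are
    placeholders; eliciting realistic values is part of the research question."""
    effort = n_records * cost_per_record
    expected_misses = n_records * prevalence * miss_rate
    return effort + expected_misses * cost_per_missed_study

# Hypothetical comparison for 10,000 records, 2% of which are relevant.
human = expected_cost(10_000, cost_per_record=0.50, miss_rate=0.03,
                      cost_per_missed_study=500, prevalence=0.02)
llm = expected_cost(10_000, cost_per_record=0.02, miss_rate=0.08,
                    cost_per_missed_study=500, prevalence=0.02)
print(f"human-only: {human:.0f}, LLM-assisted: {llm:.0f}  (arbitrary cost units)")
```

Eliciting realistic values for these parameters, and extending the model beyond screening, is itself part of the research question.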
Screening
What stopping criteria are reliable (Repke et al., 2026)? A naive illustrative stopping rule is sketched at the end of this section.
How do stopping criteria work in a living review context?
Is prioritised screening with stopping criteria the best paradigm for screening with LLMs, or are other methods more appropriate? What would replace stopping criteria as a way of providing confidence scores?
Under what circumstances does LLM-based screening outperform supervised learning with prioritised screening?
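For orientation, one of the simplest heuristic families of stopping rules halts priority screening after a long run of consecutive irrelevant records. The sketch below is a naive illustration of that idea only; it is not one of the methods evaluated by Repke et al. (2026), the window size is arbitrary, and the rule carries no formal recall guarantee.

```python
def consecutive_irrelevant_stop(labels, window: int = 200) -> int:
    """Number of records screened before stopping under a naive rule: stop once
    `window` consecutive records in the priority-ranked list have been judged
    irrelevant. `labels` holds one judgement per record (1 = include, 0 = exclude)
    in ranked order. Illustrative only; there is no formal recall guarantee."""
    screened = 0
    run = 0
    for label in labels:
        screened += 1
        run = 0 if label == 1 else run + 1
        if run >= window:
            break
    return screened

# Toy ranked sequence in which relevant records cluster near the top.
toy_labels = [1, 1, 0, 1, 0, 0, 1] + [0] * 500
print(consecutive_irrelevant_stop(toy_labels, window=100))  # stops after 107 records
```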
Data extraction
How do we set up processes to develop and evaluate automated data extraction approaches without overfitting during prompt development?
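One way to operationalise this is to borrow the train/development/test discipline from machine learning: iterate on prompts against a development split and evaluate once on a locked held-out split, splitting at the review level so tuned prompts are never scored on reviews they were tuned on. The sketch below shows only the split; the review identifiers are hypothetical placeholders.

```python
import random

def split_reviews(review_ids: list[str], seed: int = 0,
                  dev_fraction: float = 0.3) -> tuple[list[str], list[str]]:
    """Split reviews (not individual records) into a development set used for
    prompt iteration and a held-out set reserved for a single final evaluation,
    so that prompt tweaks cannot quietly overfit to the test data."""
    rng = random.Random(seed)
    ids = sorted(review_ids)
    rng.shuffle(ids)
    n_dev = max(1, int(len(ids) * dev_fraction))
    return ids[:n_dev], ids[n_dev:]

# Hypothetical review identifiers; in practice these would be real included reviews.
dev, held_out = split_reviews([f"review_{i:02d}" for i in range(20)])
print(len(dev), "reviews for prompt development;", len(held_out), "held out and locked")
```

Pre-registering the final prompt before touching the held-out split would further limit overfitting.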
- Repke, T., Tinsdeall, F., Danilenko, D., Graziosi, S., Müller‐Hansen, F., Schmidt, L., Thomas, J., & van Valkenhoef, G. (2026). Don't Stop Me Now, 'Cause I'm Having a Good Time Screening: Evaluation of Stopping Methods for Safe Use of Priority Screening in Systematic Reviews. Cochrane Evidence Synthesis and Methods, 4(1). https://doi.org/10.1002/cesm.70068