Dealing with imperfections in validation data - Evaluation for Automated Evidence Synthesis

Introduction

The standard way of assessing the accuracy of AI outputs is to compare them with “ground truth” data produced by humans. However, we know that humans make mistakes when producing this data. Beyond clear mistakes, there are also areas of judgement and interpretation where a differing answer is not necessarily wrong, but simply another valid and justifiable representation of the data.

If we want to evaluate well, this raises a few issues and questions.