Here is a summary of the risks we identified with System Pro v1.0 and our work to mitigate them:
Risk 1: The information presented in a synthesis is inaccurate or unreliable.
We have reviewed our pipeline and identified a few areas in which inaccuracies could occur or be allowed to persist into the synthesis.
- Starting with the extraction process, we use large language models (LLMs) to pull key statistical findings from the source. This process is 84%-90% accurate depending on the model used, so it can pull inaccurate information, such as flipped variables, wrong statistic types, or wrong p-values. These errors could ultimately result in information being incorrectly presented in data tables or syntheses.
Mitigation: To address these inaccuracies, we employ human labelers to continuously evaluate the performance of our extraction models, and we manually assess a random sample of relationships from each statistic type before approving the extraction. We also conduct post-processing and validation on all extracted relationships to remove clear errors and inaccuracies, and we have set up a feedback process for users to report data extraction errors to our team.
- Similarly, during the extraction process, our models may not pick up every relationship or statistic type presented in the underlying source, which means information may be missing from the syntheses.
Mitigation: We routinely evaluate where and why relationships may not have been picked up from a source to improve our models and extract this information.
- While we extract information at the highest resolution, the interpretation of that information may not always be clear. For example, the variable name “treatment versus control group” may be accurate to the source, but may not have enough context for our synthesis to correctly interpret the findings.
Mitigation: We are implementing steps to get more information from abstracts on unclear variable names.
- Once the statistical relationships are extracted, we use semantic search to find relevant studies and findings to synthesize. This process has known limitations.
Mitigation: To ensure search relevance, we chose a large embedding model and capture the abstract, variable name, and study title in our vectors.
- Once we have returned the relevant information through search, we cluster the statistical relationships together. Issues that arise in clustering include labeling a cluster with a term that is not representative of its entire contents or wrongly clustering variables that are semantically similar, but not necessarily referring to the same thing.
Mitigation: We use benchmarking to evaluate how the end product of search relevance, clustering, and synthesis changes as search and clustering parameters change.
- The final synthesis is generated from the underlying statistical evidence using LLMs, which may misinterpret or exclude some of that evidence.
Mitigation: We limit the universe of information the LLM can use to the information contained in SystemDB. We generate three versions of the synthesis for each question and, in post-processing, select the one that is most grounded in the evidence in SystemDB, as measured by the number of citations and the ratio of citations to sentences. We have also created a user-reported feedback mechanism for synthesis quality, and we continuously review user-reported issues with syntheses.
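As an illustration only, the selection among candidate syntheses described above could be sketched as follows. This is not System's actual implementation: the bracketed citation format (`[1]`), the naive sentence splitter, the function names, and the choice to rank by ratio first and break ties by citation count are all assumptions made for the example.

```python
import re

# Assumption: citations appear as bracketed numbers such as [3];
# the real citation marker format is not specified in this document.
CITATION_RE = re.compile(r"\[\d+\]")
# Naive sentence splitter: break on whitespace that follows ., !, or ?
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def grounding_score(text):
    """Return (citations per sentence, citation count) for one candidate."""
    sentences = [s for s in SENTENCE_RE.split(text.strip()) if s]
    citations = len(CITATION_RE.findall(text))
    ratio = citations / len(sentences) if sentences else 0.0
    return ratio, citations


def select_most_grounded(candidates):
    """Pick the candidate with the highest ratio; ties go to more citations.

    How the two measures are combined is an assumption of this sketch.
    """
    if not candidates:
        raise ValueError("no candidate syntheses provided")
    return max(candidates, key=grounding_score)


# Three hypothetical candidate syntheses for one question.
candidates = [
    "Exercise lowers blood pressure. It may also improve sleep.",
    "Exercise lowers blood pressure [1]. It may also improve sleep [2].",
    "Exercise lowers blood pressure [1][2]. It may also improve sleep. Effects vary.",
]
best = select_most_grounded(candidates)
print(best)
```

In this toy example the second candidate is selected, because every sentence carries a citation. Note that a citation-density measure of this kind checks only that citations are present, not that each cited finding actually supports the sentence it is attached to.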
Risk 2: The information presented in a synthesis is biased.
We have also identified potential areas of bias both in the corpus we extract from and how the underlying research is reported.
- Today, System has extracted statistical findings from PubMed. While PubMed is the leading index for health and biomedical researchers, PubMed may not index every study that could be relevant for a synthesis.
Mitigation: To address any corpus bias, we are actively expanding beyond PubMed to include full-text articles from PMC, as well as additional corpora covering the social sciences, environmental sciences, economics, and more.
- Additionally, authors may only report the significant findings from an experiment, and not report findings of no association. This could bias the synthesis toward those papers that only show a statistically significant finding.
Mitigation: To address research biases, we are expanding to full-text articles to extract more findings, including findings of no association.
- Lastly, findings are highly contextual and may only apply to specific populations or settings. The synthesis may report findings in a generalized way, when they are specific to certain populations or settings.
Mitigation: To address this bias, we have started to extract and write population data for all studies containing at least one statistical relationship, which will ultimately be incorporated into our synthesis for improved understanding of the context for any statistical finding.