
Evaluating the performance of System's research synthesis

Mehdi Jamei

October 17, 2023

As we continuously improve System, each feature undergoes a thorough evaluation designed to measure its robustness and comprehensiveness, using methodologies tailored to the specific attributes and performance of each function within the platform.

Today, we are pleased to present new results from an in-depth, randomized, blinded analysis of the performance of System's AI-assisted synthesis against other products.

Results

You can see the detailed results of the study here. We will regularly update this page with new findings.

Here are the two main takeaways:

  1. 70% of experts prefer System's synthesis over other options.
  2. System's synthesis stands out for its unparalleled accuracy and thoroughness.
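As a quick illustration of the first takeaway: with 50 paired evaluations (the Task 1 count against Commercial Product #1 in the table below), one can ask whether a 70% preference rate is distinguishable from a coin flip. The binomial test sketched here, using scipy, is our illustrative choice and not necessarily the analysis used in the study.

    # Illustrative significance check: is a 35/50 (70%) preference rate
    # distinguishable from chance (p = 0.5)? Assumes scipy is installed.
    from scipy.stats import binomtest

    result = binomtest(k=35, n=50, p=0.5, alternative="greater")
    print(f"p-value: {result.pvalue:.4f}")  # ~0.003, significant at alpha = 0.05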

Performance Measurement Methodology

Survey Design

The pipeline that generates System's synthesis of scientific literature is a complex multi-step process. The overall task can be summarized as follows:

Given a biomedical search query, find all relevant research studies and create an overall summary of the findings of those studies.

The specific nature of this task does not allow a direct comparison with LLM-specific industry benchmarks such as PubMedQA, MedQA, and MedMCQA. To measure the accuracy, comprehensiveness, and potential harmfulness of our syntheses, we therefore conducted a study with subject-matter experts (SMEs). To maintain objectivity and reduce bias in the assessment process, the study was randomized and blinded.

Participants were randomly allocated to one of two general tasks:

Task 1: Participants were presented with two syntheses, one generated by System and one by a competitor. They were asked to evaluate the two and choose between them, taking into account dimensions including accuracy, comprehensiveness, clarity, relevance, and helpfulness.

Task 2: Participants were provided with a single, randomly selected synthesis along with all associated citations, presented in the respective product's interface. They were asked to rate it on a scale of 1 to 10 along each of the following dimensions:

  1. Accuracy: Do the summaries contain factual errors, and do they provide accurate information about the topic?
  2. Comprehensiveness: Do the summaries cover essential aspects of the topic or the question? Is there any key information missing from the summaries?
  3. Relevance: Are the summaries relevant to what you expect to see for the topic?

Statistical Power

Data collection continued until the results reached a statistical power of at least 0.8, using a two-sample t-test for each comparison between product pairs across all evaluated dimensions. Details on the number of participants and data points collected are given in the table below.

            System Synthesis vs           System Synthesis vs
            Commercial Product #1         Commercial Product #2

Task 1      25 users / 50 evaluations     8 users / 56 evaluations
Task 2      14 users / 56 evaluations     14 users / 56 evaluations
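For readers who want to see the mechanics, here is a minimal sketch of the kind of power analysis and two-sample t-test described above, in Python, assuming scipy and statsmodels. The effect size, alpha level, and sample ratings are placeholder assumptions, not values from the study.

    # Sketch of the power analysis and per-dimension comparison described
    # above. Effect size, alpha, and the ratings are illustrative only.
    import numpy as np
    from scipy.stats import ttest_ind
    from statsmodels.stats.power import TTestIndPower

    # Ratings needed per product for power 0.8, assuming a medium
    # standardized effect size (Cohen's d = 0.5) and alpha = 0.05.
    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print(f"ratings needed per product: {int(np.ceil(n_per_group))}")  # 64

    # Two-sample t-test on hypothetical 1-10 accuracy ratings, 56 per
    # product to match the Task 2 counts in the table above.
    rng = np.random.default_rng(0)
    system_ratings = rng.normal(loc=8.0, scale=1.5, size=56).clip(1, 10)
    competitor_ratings = rng.normal(loc=7.2, scale=1.5, size=56).clip(1, 10)
    t_stat, p_value = ttest_ind(system_ratings, competitor_ratings)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

In practice, solve_power is run up front to set the target number of ratings for each product pair and dimension, with the t-test applied once that target is reached.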

Filed Under: Tech