[Evaluation setup panel: # of evaluations, # of annotators, metrics (Metric A, Metric B, Metric C), models (Model A, Model B), duration]
[Interactive per-example results table with columns: Task, Targets, Expert Judges prediction, GPT-4 Judge prediction]
Cohen's kappa | Interpretation |
---|---|
0 | No agreement |
0.10-0.20 | Slight agreement |
0.21-0.40 | Fair agreement |
0.41-0.60 | Moderate agreement |
0.61-0.80 | Substantial agreement |
0.81-0.99 | Near perfect agreement |
1 | Perfect agreement |
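As a hedged illustration (not the authors' actual pipeline), Cohen's kappa between two judges can be computed with scikit-learn's `cohen_kappa_score`; the per-item labels below are invented for the example.

```python
# Illustrative sketch only: agreement between two judges on the same items.
# The labels are made up; in practice they would be the per-item ratings
# from the GPT-4 Judge and the experts' majority vote.
from sklearn.metrics import cohen_kappa_score

gpt4_labels   = [3, 4, 2, 5, 3, 4, 1, 4]   # hypothetical 1-5 ratings
expert_labels = [3, 4, 3, 5, 2, 4, 1, 4]   # hypothetical 1-5 ratings

kappa = cohen_kappa_score(gpt4_labels, expert_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # read the value against the table above
```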
Model | Winner | Shown first | Shown second | |
---|---|---|---|---|
GPT-4 Judge | 2.79 (1) | 3.36 (1) | 3.56 (1) | 3 (1) |
Expert Judges | 2.49 ± 0.14 (2) | 2.98 ± 0.19 (2) | 2.92 ± 0.19 (2) | 6 (2) |
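A rough sketch of how "Shown first" / "Shown second" averages like those above could be derived; the record layout and values are assumptions for illustration, not the study's data.

```python
# Hypothetical evaluation records: each entry notes which judge scored a
# response and whether that response was shown first or second in the pair.
from statistics import mean

records = [
    {"judge": "GPT-4 Judge",   "position": "first",  "score": 3.5},
    {"judge": "GPT-4 Judge",   "position": "second", "score": 3.6},
    {"judge": "Expert Judges", "position": "first",  "score": 3.0},
    {"judge": "Expert Judges", "position": "second", "score": 2.9},
    # ... one entry per rated response ...
]

for judge in ("GPT-4 Judge", "Expert Judges"):
    for position in ("first", "second"):
        scores = [r["score"] for r in records
                  if r["judge"] == judge and r["position"] == position]
        print(f"{judge}, shown {position}: {mean(scores):.2f}")
```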
Model | Longer response | |
---|---|---|
Expert Judges | 3.84 (1) | 1 (1) |
GPT-4 Judge | 3.84 (1) | 1 (2) |
- Majority of annotators selected different values for a given metric.
- Majority of annotators selected the same value for a given metric.
- Majority of annotators selected the same value for a given metric, and the most common and second most common values were less than 2 units apart.
- All annotators selected the same value for a given metric.
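A minimal sketch, assuming "majority" means strictly more than half of the annotators, of how one metric's annotator values could be mapped to the four agreement levels above; the function name and return strings are made up for illustration.

```python
from collections import Counter

def agreement_level(values: list[int]) -> str:
    """Map one metric's annotator values to one of the four levels above."""
    counts = Counter(values).most_common()
    top_value, top_count = counts[0]
    if top_count == len(values):
        return "all annotators selected the same value"
    if top_count <= len(values) / 2:
        return "majority selected different values"
    # A majority agreed; check whether the runner-up value is close by.
    if len(counts) > 1 and abs(counts[1][0] - top_value) < 2:
        return "majority agreed, runner-up within 2 units"
    return "majority selected the same value"

print(agreement_level([3, 4, 5]))   # majority selected different values
print(agreement_level([4, 4, 1]))   # majority selected the same value
print(agreement_level([4, 4, 5]))   # majority agreed, runner-up within 2 units
print(agreement_level([4, 4, 4]))   # all annotators selected the same value
```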