[Evaluation setup panel: # of evaluations, # of annotators, metrics (Metric A, Metric B, Metric C), models (Model A, Model B), duration]
[Interactive per-example results table with columns: Task, Targets, Expert Judges prediction, GPT-4 Judge prediction]
Cohen's kappa | Interpretation |
---|---|
0 | No agreement |
0.10-0.20 | Slight agreement |
0.21-0.40 | Fair agreement |
0.41-0.60 | Moderate agreement |
0.61-0.80 | Substantial agreement |
0.81-0.99 | Near perfect agreement |
1 | Perfect agreement |
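As a hedged illustration (not the authors' actual pipeline), Cohen's kappa between two judges can be computed with scikit-learn's `cohen_kappa_score`; the per-item labels below are invented for the example.

```python
# Illustrative sketch only: agreement between two judges on the same items.
# The labels are made up; in practice they would be the per-item ratings
# from the GPT-4 Judge and the experts' majority vote.
from sklearn.metrics import cohen_kappa_score

gpt4_labels   = [3, 4, 2, 5, 3, 4, 1, 4]   # hypothetical 1-5 ratings
expert_labels = [3, 4, 3, 5, 2, 4, 1, 4]   # hypothetical 1-5 ratings

kappa = cohen_kappa_score(gpt4_labels, expert_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # read the value against the table above
```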
Model | Winner | Shown first | Shown second | |
---|---|---|---|---|
GPT-4 Judge | 2.79 (1) | 3.36 (1) | 3.56 (1) | 3 (1) |
Expert Judges | 2.49 ± 0.14 (2) | 2.98 ± 0.19 (2) | 2.92 ± 0.19 (2) | 6 (2) |
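A rough sketch of how "Shown first" / "Shown second" averages like those above could be derived; the record layout and values are assumptions for illustration, not the study's data.

```python
# Hypothetical evaluation records: each entry notes which judge scored a
# response and whether that response was shown first or second in the pair.
from statistics import mean

records = [
    {"judge": "GPT-4 Judge",   "position": "first",  "score": 3.5},
    {"judge": "GPT-4 Judge",   "position": "second", "score": 3.6},
    {"judge": "Expert Judges", "position": "first",  "score": 3.0},
    {"judge": "Expert Judges", "position": "second", "score": 2.9},
    # ... one entry per rated response ...
]

for judge in ("GPT-4 Judge", "Expert Judges"):
    for position in ("first", "second"):
        scores = [r["score"] for r in records
                  if r["judge"] == judge and r["position"] == position]
        print(f"{judge}, shown {position}: {mean(scores):.2f}")
```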
Model | Longer response | |
---|---|---|
Expert Judges | 3.84 (1) | 1 (1) |
GPT-4 Judge | 3.84 (1) | 1 (2) |
- Majority of annotators selected different values for a given metric.
- Majority of annotators selected the same value for a given metric.
- Majority of annotators selected the same value for a given metric, and the most common and second most common values were less than 2 units apart.
- All annotators selected the same value for a given metric.
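A minimal sketch, assuming "majority" means strictly more than half of the annotators, of how one metric's annotator values could be mapped to the four agreement levels above; the function name and return strings are made up for illustration.

```python
from collections import Counter

def agreement_level(values: list[int]) -> str:
    """Map one metric's annotator values to one of the four levels above."""
    counts = Counter(values).most_common()
    top_value, top_count = counts[0]
    if top_count == len(values):
        return "all annotators selected the same value"
    if top_count <= len(values) / 2:
        return "majority selected different values"
    # A majority agreed; check whether the runner-up value is close by.
    if len(counts) > 1 and abs(counts[1][0] - top_value) < 2:
        return "majority agreed, runner-up within 2 units"
    return "majority selected the same value"

print(agreement_level([3, 4, 5]))   # majority selected different values
print(agreement_level([4, 4, 1]))   # majority selected the same value
print(agreement_level([4, 4, 5]))   # majority agreed, runner-up within 2 units
print(agreement_level([4, 4, 4]))   # all annotators selected the same value
```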