🏆 Medical Classification Leaderboard - Beta

Goal:

The goal of Classification Medical NLP Leaderboard is to track, rank and evaluate the performance of large language models (LLMs) on medical classification tasks. It evaluates LLMs across a diverse array of medical datasets, starting with radiological reports dataset as follows:

S.No	Dataset Name	About the Dataset	Type of Classification	Link to the Dataset
1	Chexpert Plus	CXR reports classified into 14 different pathologies	Multi-label Multi-class classification	Chexpert Plus
2	CT-RATE	CT scan radiological reports classified into 18 pathologies	Multi-label Multiclass classification	CT-RATE

The leaderboard offers a comprehensive assessment of each model's classification aspects.

Evaluation Criteria:

The primary metric used for evaluation is accuracy, which measures the proportion of correct predictions made by the model. We used two levels of accruacy
1. Label-level accuracy : Accuracy is measured in terms of total labels.
2. Record-level accuracy : Accuracy is measured if a report is classified accurately across all labels.

Different Parameters:

The leaderboard displays the different type of settings explored to get various results
1. Different shots prompting : 0 shot, 1 shot, 5 shots.
2. Different roles : Some models like llama allows to assign different content to different roles like system or user in this case.
3. Chain of thought prompting: A kind of prompting technique where, a complex task is broken down into simple steps. It is proven to be better than simple prompting. Refer COT prompting
4. Active Prompt: In progress, it will be used along with different shots prompting

Submit your model or dataset:

If you want to check the performance of your model, you can submit your model to the leaderboard by clicking the "Submit Here" below.
If you want to suggest or collaborate with a new dataset, please fill out this google form.

Note:

The leaderboard is currently in beta. We are working on adding more datasets and improving the evaluation process. Your feedback is welcome!
The primary role of this leaderboard is to assess and compare the performance of the models. It does not facilitate the distribution, deployment, or clinical use of these models. The models on this leaderboard are not approved for clinical use and are intended for research purposes only.

🏅 Leaderboard

Deepseek llama 3.1	14B	0.8744	0.2386	0.9534	0.4965	0.9376	0.2377	null	null	0.7857	0.0749	0.8365	0.0963	0.8586	0.1622


LLama 3.1	8B	0.8744	0.2386	0.9534	0.4965	0.9376	0.363	null	null	0.7857	0.0749	0.8365	0.0963	0.8586	0.1622
Deepseek llama 3.1	8B	0.7841	0.1817	0.9391	0.4722	0.8983	0.2377	null	null	0.5286	0.04	0.7684	0.0699	0.7861	0.0889
Qwen 2.5	7B	0.7514	0.2376	0.9486	0.4797	0.9523	0.4774	null	null	0.2125	0	0.8085	0.1195	0.8349	0.1113
Qwen 2.5	14B	0.818	0.2838	0.9609	0.5459	0.9573	0.5163	null	null	0.4354	0.0228	0.8607	0.1677	0.8757	0.1661
Qwen 2.5	32B	0.828	0.2943	0.9618	0.55	0.9569	0.5145	null	null	0.5125	0	0.8153	0.1309	0.8933	0.2759
Gemma 2	9B	0.8613	0.2536	0.9429	0.4712	0.9457	0.4369	null	null	0.7042	0.0726	0.8498	0.1121	0.8637	0.175
Gemma 2	27B	0.8254	0.2526	0.9506	0.4828	0.9583	0.5698	null	null	0.5916	0.0161	0.7673	0.0417	0.8592	0.1524
Med42 Llama 3.1	8B	0.8658	0.2756	0.9513	0.546	0.9353	0.432	null	null	0.7338	0.0752	0.8619	0.166	0.8468	0.1589

To do

Launching the leaderboard
Add link to the huggingface models
Enhance the table layouts
Add more datasets
Add more prompting techniques