🏆 Medical Classification Leaderboard - Beta
Goal:
The goal of the Medical Classification Leaderboard is to track, rank, and evaluate the performance of large language models (LLMs) on medical classification tasks. It evaluates LLMs across a diverse set of medical datasets, starting with the following radiology report datasets:
| S.No | Dataset Name | About the Dataset | Type of Classification | Link to the Dataset |
|---|---|---|---|---|
| 1 | CheXpert Plus | CXR reports classified into 14 different pathologies | Multi-label, multi-class classification | CheXpert Plus |
| 2 | CT-RATE | CT scan radiology reports classified into 18 pathologies | Multi-label, multi-class classification | CT-RATE |
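Both datasets are distributed through the Hugging Face Hub. A minimal loading sketch with the `datasets` library is shown below; the repository IDs are illustrative assumptions, so check the links in the table above for the actual identifiers and access terms.

```python
# Minimal sketch: loading the report datasets with the `datasets` library.
# NOTE: the repo IDs below are assumptions for illustration; use the dataset
# links in the table above for the actual identifiers and access terms.
from datasets import load_dataset

chexpert = load_dataset("StanfordAIMI/CheXpert-Plus", split="train")  # hypothetical ID
ct_rate = load_dataset("ibrahimhamamci/CT-RATE", split="train")       # hypothetical ID

print(chexpert[0])  # one radiology report with its pathology labels
```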
The leaderboard offers a comprehensive assessment of each model's classification performance on these datasets.
Evaluation Criteria:
The primary metric used for evaluation is accuracy, the proportion of correct predictions made by the model. We report accuracy at two levels:
1. Label-level accuracy: the fraction of individual label predictions that are correct, pooled over all labels and all reports.
2. Record-level accuracy: the fraction of reports for which every label is classified correctly.
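As a concrete illustration, the sketch below computes both accuracies for a toy batch of multi-label predictions; the binary label-matrix layout is an assumption about how predictions are stored.

```python
import numpy as np

# Toy example: 3 reports x 4 pathology labels (1 = present, 0 = absent).
y_true = np.array([[1, 0, 1, 0],
                   [0, 0, 1, 1],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 1, 0],   # all 4 labels correct
                   [0, 1, 1, 1],   # 3 of 4 labels correct
                   [1, 1, 0, 0]])  # all 4 labels correct

# Label-level accuracy: fraction of individual label predictions that are correct.
label_acc = (y_true == y_pred).mean()               # 11/12 ~ 0.917

# Record-level accuracy: fraction of reports with *all* labels correct.
record_acc = (y_true == y_pred).all(axis=1).mean()  # 2/3 ~ 0.667

print(f"label-level: {label_acc:.3f}, record-level: {record_acc:.3f}")
```

Record-level accuracy is the stricter metric: a single wrong label makes the whole report count as incorrect, which is why it sits well below label-level accuracy in the results.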
Different Parameters:
The leaderboard reports results under several prompting settings (a combined prompt sketch follows this list):
1. Few-shot prompting: 0-shot, 1-shot, and 5-shot.
2. Different roles: some models, such as Llama, allow content to be assigned to different roles, in this case system or user.
3. Chain-of-thought (CoT) prompting: a technique in which a complex task is broken down into simple intermediate steps; it generally outperforms direct prompting. Refer to CoT prompting.
4. Active Prompt: in progress; it will be used together with few-shot prompting.
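Here is a minimal sketch of how these settings combine into a single prompt, using the common chat-message format with system and user roles. The label subset, report text, and JSON output format are invented for illustration and are not the leaderboard's exact prompts.

```python
# Sketch: 1-shot + role-based + chain-of-thought prompt in chat-message form.
# The labels, report text, and JSON output format are hypothetical examples.
LABELS = ["Cardiomegaly", "Edema", "Pneumonia"]  # illustrative subset

messages = [
    {"role": "system",
     "content": "You are a radiologist. Classify the report for each pathology "
                f"in {LABELS}. Think step by step, then answer with a JSON "  # CoT instruction
                "object mapping each label to 0 or 1."},
    # One demonstration pair = the 1-shot setting; add more pairs for 5-shot,
    # or drop the pair entirely for 0-shot.
    {"role": "user", "content": "Report: Enlarged cardiac silhouette, clear lungs."},
    {"role": "assistant",
     "content": '{"Cardiomegaly": 1, "Edema": 0, "Pneumonia": 0}'},
    # The report to classify.
    {"role": "user", "content": "Report: Bilateral interstitial opacities."},
]
```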
Submit your model or dataset:
- If you want to check the performance of your model, submit it to the leaderboard by clicking "Submit Here" below.
- If you want to suggest a new dataset or collaborate on one, please fill out this Google Form.
Note:
- The leaderboard is currently in beta. We are working on adding more datasets and improving the evaluation process. Your feedback is welcome!
- The primary role of this leaderboard is to assess and compare the performance of the models. It does not facilitate the distribution, deployment, or clinical use of these models. The models on this leaderboard are not approved for clinical use and are intended for research purposes only.
🏅 Leaderboard
(Interactive results table rendered in the app. Sample row: Deepseek llama 3.1, 14B, with paired accuracy scores across the evaluated prompting settings.)
Submit Your Model or Prompt
Collaborate with Us
- If you are interested in collaborating with us to explore more datasets, prompts, or models, feel free to reach out at E-mail.
To do
- Launch the leaderboard
- Add links to the Hugging Face models
- Enhance the table layouts
- Add more datasets
- Add more prompting techniques