🏆 Medical Classification Leaderboard - Beta

Goal:

The goal of Classification Medical NLP Leaderboard is to track, rank and evaluate the performance of large language models (LLMs) on medical classification tasks. It evaluates LLMs across a diverse array of medical datasets, starting with radiological reports dataset as follows:

S.No Dataset Name About the Dataset Type of Classification Link to the Dataset
1 Chexpert Plus CXR reports classified into 14 different pathologies Multi-label Multi-class classification Chexpert Plus
2 CT-RATE CT scan radiological reports classified into 18 pathologies Multi-label Multiclass classification CT-RATE

The leaderboard offers a comprehensive assessment of each model's classification aspects.

Evaluation Criteria:

The primary metric used for evaluation is accuracy, which measures the proportion of correct predictions made by the model. We used two levels of accruacy
1. Label-level accuracy : Accuracy is measured in terms of total labels.
2. Record-level accuracy : Accuracy is measured if a report is classified accurately across all labels.

Different Parameters:

The leaderboard displays the different type of settings explored to get various results
1. Different shots prompting : 0 shot, 1 shot, 5 shots.
2. Different roles : Some models like llama allows to assign different content to different roles like system or user in this case.
3. Chain of thought prompting: A kind of prompting technique where, a complex task is broken down into simple steps. It is proven to be better than simple prompting. Refer COT prompting
4. Active Prompt: In progress, it will be used along with different shots prompting

Submit your model or dataset:

  1. If you want to check the performance of your model, you can submit your model to the leaderboard by clicking the "Submit Here" below.
  2. If you want to suggest or collaborate with a new dataset, please fill out this google form.

Note:

  1. The leaderboard is currently in beta. We are working on adding more datasets and improving the evaluation process. Your feedback is welcome!
  2. The primary role of this leaderboard is to assess and compare the performance of the models. It does not facilitate the distribution, deployment, or clinical use of these models. The models on this leaderboard are not approved for clinical use and are intended for research purposes only.
📊 Select Columns to Display - Dataset
Select Shot Type
Filter by Parameter Count

🏅 Leaderboard

🏅 Leaderboard
Deepseek llama 3.1
14B
0.8744
0.2386
0.9534
0.4965
0.9376
0.2377
null
null
0.7857
0.0749
0.8365
0.0963
0.8586
0.1622

To do

  • Launching the leaderboard
  • Add link to the huggingface models
  • Enhance the table layouts
  • Add more datasets
  • Add more prompting techniques