Evaluation And Metrics - Foundation Model Prompting for Medical Image Classification


Evaluation

1. Validation Phase

Note: (1) The order of filenames in every CSV file must follow the order in the provided colon_val.csv, chest_val.csv and endo_val.csv. (2) The CSV files in result.zip must use exactly the names "xxx_N-shot_submission.csv" listed below. (3) Each participant can submit predictions once per day during the validation phase.

In the validation phase, participants only need to submit result.zip, which contains the predictions on the validation sets of the three datasets under the 1-shot, 5-shot and 10-shot settings, laid out as follows:

result/
├── endo_1-shot_submission.csv
├── endo_5-shot_submission.csv
├── endo_10-shot_submission.csv
├── colon_1-shot_submission.csv
├── colon_5-shot_submission.csv
├── colon_10-shot_submission.csv
├── chest_1-shot_submission.csv
├── chest_5-shot_submission.csv
├── chest_10-shot_submission.csv

Each .csv file contains the predictions on the validation set from a model trained on the corresponding few-shot samples.
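The packaging step can be sketched as follows. This is a minimal sketch, not the organizers' tooling: the helper names `expected_names` and `pack_results` are hypothetical, and placing the files under a top-level `result/` folder inside the archive is an assumption read off the layout above.

```python
import zipfile
from pathlib import Path

DATASETS = ("endo", "colon", "chest")
SHOTS = (1, 5, 10)

def expected_names():
    """Return the nine CSV file names required inside result.zip."""
    return [f"{d}_{n}-shot_submission.csv" for d in DATASETS for n in SHOTS]

def pack_results(result_dir, zip_path="result.zip"):
    """Zip the nine submission CSVs, failing early if any of them is missing."""
    result_dir = Path(result_dir)
    missing = [n for n in expected_names() if not (result_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing submission files: {missing}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in expected_names():
            # Assumption: the archive keeps the files under a "result/" folder.
            zf.write(result_dir / name, arcname=f"result/{name}")
    return zip_path
```

Checking the file names before zipping catches a misnamed CSV locally, which matters when only one submission per day is allowed.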

Take endo_N-shot_submission.csv for example:

  13333_2021.12_0005_56232354.png,0.66523576,0.2536478,0.71171623,0.0001186
  13333_2021.12_0005_56232349.png,0.95211565,0.19268665,0.18163556,5.008e-05
  13333_2021.12_0005_56231540.png,0.24712594,0.00869319,0.6545939,0.00035897
  13333_2021.12_0005_56232343.png,0.11398823,0.08247166,0.25634992,0.00771732

  ...

Each line contains the filename of a validation set image "xxx.png" followed by the prediction values for the four classification labels: Ulcer, Erosion, Polyp and Tumor.

Take colon_N-shot_submission.csv for example:

  2019-08895-1-1-1_2019-05-29 00_16_03-lv1-18024-33741-5509-2953p0013.png,0.9971521,0.00284789
  2019-09674-1-1-1_2019-05-28 20_06_49-lv1-53513-29465-4870-3788p0024.png,0.9965821,0.00341796
  1903326001_2019-06-11 12_07_06-lv1-18648-16700-3610-3417p0009.png,0.9675946,0.0324054
  2019-11827-1-1-1_2019-05-28 14_14_04-lv1-38890-14353-5119-7083p0018.png,0.5542637,0.4457363

  ...

Each line contains the filename of a validation set image "xxx.png" followed by the prediction values for the two classification labels: non-lesion and lesion.

Take chest_N-shot_submission.csv for example:

  5CB12B9D24ECD3C.png,0.09374527,0.6396048,0.2512434,0.07615143,0.19636974,0.07490193,0.10598115,0.0217598,0.03501937,0.4071378,0.1378933,0.10016607,0.06029703,0.11276143,0.10651199,0.00124831,0.02003378,0.07383478,0.03063678
  5DF6CF624624B52.png,0.53989255,0.2971295,0.25088623,0.14487624,0.09833297,0.02330205,0.12636295,0.06603294,0.15085685,0.12354513,0.13233066,0.10814447,0.3340448,0.05048932,0.0677163,0.00836432,0.02691899,0.01357477,0.01547637
  5D02ECF513CCD61.png,0.8987339,0.1854706,0.55019003,0.30659977,0.5688816,0.10258165,0.1074397,0.12406027,0.33897737,0.1348909,0.02611324,0.23083255,0.01009721,0.3117113,0.00199115,0.03368886,0.01741716,0.1703031,0.0032014
  5C7C7ACB1A881A8.png,0.56226075,0.29110807,0.25753835,0.12572725,0.26798493,0.12900156,0.10164467,0.05507286,0.06495975,0.29712105,0.14845078,0.0485659,0.12877719,0.13042593,0.02545512,0.0189645,0.01440432,0.03248649,0.02154987

...

Each line contains the filename of a validation set image "xxx.png" followed by the prediction values for the 19 classification labels.
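The row format described above can be produced with a small writer such as the one below. It is a sketch under stated assumptions: `write_submission` and `NUM_CLASSES` are hypothetical names, the absence of a header row is inferred from the examples, and the caller is responsible for passing filenames in the order of the provided `<dataset>_val.csv`.

```python
import csv

# Number of score columns per dataset, taken from the examples above.
NUM_CLASSES = {"endo": 4, "colon": 2, "chest": 19}

def write_submission(path, dataset, filenames, scores):
    """Write one header-less row per image: filename followed by its class scores.

    `filenames` must already be in the order of the provided <dataset>_val.csv.
    """
    n_cls = NUM_CLASSES[dataset]
    if len(filenames) != len(scores):
        raise ValueError("filenames and scores differ in length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        for name, row in zip(filenames, scores):
            # Reject rows whose score count does not match the dataset.
            if len(row) != n_cls:
                raise ValueError(f"{name}: expected {n_cls} scores, got {len(row)}")
            writer.writerow([name, *row])
```

For instance, `write_submission("colon_1-shot_submission.csv", "colon", names, probs)` would write two score columns per image and raise if any row has a different count.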

2. Testing Phase

The final evaluation will be performed on a reserved private dataset: a small number of samples (1-Shot, 5-Shot and 10-Shot) are randomly selected from the public part, and the reserved part is used for testing. For each dataset (ColonPath, ChestDR and Endo), the 1-Shot, 5-Shot and 10-Shot evaluations will each be run 5 separate times under the same prompt/test conditions, and the results will be averaged to give the final evaluation metric. This challenge aims to demonstrate the important role of the foundation model in medical downstream tasks, i.e., reducing the dependence on high-quality annotations and improving the classification accuracy of tail categories.

Evaluation metric

To evaluate performance, we compute the overall accuracy (Acc) and the area under the receiver operating characteristic curve (AUC) for the multi-class classification task on the ColonPath dataset, and the mean average precision (mAP) and AUC for the multi-label classification tasks on the ChestDR and Endo datasets. The average of all Acc, mAP and AUC values is the final evaluation metric used for ranking (see below).

Accuracy is the proportion of correct predictions among all test images. In the multi-class classification task, the predicted label is the class with the maximum softmax output. AUC is computed for each class and measures the ability to distinguish between positive and negative samples across threshold settings. AP is the weighted average of the precisions obtained at each threshold, and mAP is the mean of the per-class AP scores.
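To make these definitions concrete, here is a dependency-free sketch of the three metrics. It is illustrative only and not the organizers' evaluation code: all function names are hypothetical, per-class values are assumed to be macro-averaged, tied scores in `average_precision` are ranked in sort order, and each class is assumed to have at least one positive and one negative sample. The leaderboard values appear to be these fractions expressed as percentages.

```python
def accuracy(true_labels, probs):
    """Fraction of samples whose highest-scoring class equals the true class index."""
    hits = sum(max(range(len(p)), key=p.__getitem__) == t
               for t, p in zip(true_labels, probs))
    return hits / len(true_labels)

def binary_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Pairwise comparison is O(P * N); fine for a sketch, slow for large sets.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def macro_average(metric, label_matrix, score_matrix):
    """Apply a binary metric to every class column and average the results."""
    columns = range(len(label_matrix[0]))
    per_class = [metric([row[c] for row in label_matrix],
                        [row[c] for row in score_matrix]) for c in columns]
    return sum(per_class) / len(per_class)
```

With these helpers, mAP for a multi-label dataset would be `macro_average(average_precision, labels, scores)` and the class-averaged AUC would be `macro_average(binary_auc, labels, scores)`.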

In the validation phase, the mAP/Acc and AUC results of every 1/5/10-shot Endo/Colon/Chest task are averaged. For example, for a certain submission the full results are:

  {
    "task": {
      "endo_1-shot": {
        "AUC_metric": "71.38597732085242",
        "mAP_metric": "24.799103463247953"
      },
      "endo_5-shot": {
        "AUC_metric": "65.99679511172724",
        "mAP_metric": "20.54613367576475"
      },
      "chest_1-shot": {
        "AUC_metric": "58.20866053495631",
        "mAP_metric": "13.45018286252051"
      },
      "chest_5-shot": {
        "AUC_metric": "63.89120494491268",
        "mAP_metric": "15.194459029060967"
      },
      "colon_1-shot": {
        "AUC_metric": "82.91080234149541",
        "Acc_metric": "74.28243398392652"
      },
      "colon_5-shot": {
        "AUC_metric": "97.61714910229762",
        "Acc_metric": "92.05510907003445"
      },
      "endo_10-shot": {
        "AUC_metric": "73.84727863236574",
        "mAP_metric": "25.70903935120376"
      },
      "chest_10-shot": {
        "AUC_metric": "67.89146588105439",
        "mAP_metric": "17.599631470006315"
      },
      "colon_10-shot": {
        "AUC_metric": "97.05886129153455",
        "Acc_metric": "91.61882893226176"
      }
    },
    "aggregates": "58.55906205551241"
  }

Thus, the result shown on the leaderboard is

  (71.38597732085242 + 24.799103463247953 + 65.99679511172724 + 20.54613367576475 + 58.20866053495631 + 13.45018286252051 + 63.89120494491268 + 15.194459029060967 + 82.91080234149541 + 74.28243398392652 + 97.61714910229762 + 92.05510907003445 + 73.84727863236574 + 25.70903935120376 + 67.89146588105439 + 17.599631470006315 + 97.05886129153455 + 91.61882893226176)/18

= 58.55906205551241
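The same average can be reproduced programmatically from the JSON report. The sketch below assumes the report has exactly the structure shown in the example (a `"task"` object whose entries map metric names to numeric strings); the function name `aggregate` is hypothetical.

```python
import json

def aggregate(report_json):
    """Average every metric value across all tasks in a leaderboard report."""
    report = json.loads(report_json)
    # Metric values are stored as strings in the report, so convert to float.
    values = [float(v) for task in report["task"].values() for v in task.values()]
    return sum(values) / len(values)
```

Applied to the example report, the function averages the same 18 values as the manual calculation (9 tasks with 2 metrics each).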

In the testing phase, the 1-Shot, 5-Shot and 10-Shot experiments on each dataset (ColonPath, ChestDR and Endo) will be executed 5 separate times.