Evaluation
1. Validation Phase
result/
├── endo_1-shot_submission.csv
├── endo_5-shot_submission.csv
├── endo_10-shot_submission.csv
├── colon_1-shot_submission.csv
├── colon_5-shot_submission.csv
├── colon_10-shot_submission.csv
├── chest_1-shot_submission.csv
├── chest_5-shot_submission.csv
└── chest_10-shot_submission.csv
Each .csv file contains the predictions on the validation set from a model trained on the corresponding few-shot learning samples.
Take endo_N-shot_submission.csv for example:
13333_2021.12_0005_56232354.png,0.66523576,0.2536478,0.71171623,0.0001186
13333_2021.12_0005_56232349.png,0.95211565,0.19268665,0.18163556,5.008e-05
13333_2021.12_0005_56231540.png,0.24712594,0.00869319,0.6545939,0.00035897
13333_2021.12_0005_56232343.png,0.11398823,0.08247166,0.25634992,0.00771732
...
Each line contains the filename of a validation set image ("xxx.png") followed by the predicted values for the four classification labels: Ulcer, Erosion, Polyp and Tumor.
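As a minimal sketch of how such a file could be read and sanity-checked before submission, the snippet below parses rows of the form filename,score_1,...,score_k. It assumes the file has no header row and that the columns follow the label order given above; the helper name read_submission and the constant ENDO_LABELS are hypothetical and not part of any official toolkit.

```python
import csv
import io

# Assumed column order, taken from the label list in the text above.
ENDO_LABELS = ["Ulcer", "Erosion", "Polyp", "Tumor"]


def read_submission(lines, num_labels):
    """Parse header-less rows: filename followed by num_labels scores."""
    predictions = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        filename = row[0]
        scores = [float(value) for value in row[1:]]
        if len(scores) != num_labels:
            raise ValueError(
                f"{filename}: expected {num_labels} scores, got {len(scores)}"
            )
        predictions[filename] = scores
    return predictions


# Two rows copied from the endo example above, used here as in-memory input.
sample = io.StringIO(
    "13333_2021.12_0005_56232354.png,0.66523576,0.2536478,0.71171623,0.0001186\n"
    "13333_2021.12_0005_56232349.png,0.95211565,0.19268665,0.18163556,5.008e-05\n"
)
endo_predictions = read_submission(sample, num_labels=len(ENDO_LABELS))
```

The same helper would apply to the other tasks by changing num_labels, for instance 2 for the colon files.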
Take colon_N-shot_submission.csv for example:
2019-08895-1-1-1_2019-05-29 00_16_03-lv1-18024-33741-5509-2953p0013.png,0.9971521,0.00284789
2019-09674-1-1-1_2019-05-28 20_06_49-lv1-53513-29465-4870-3788p0024.png,0.9965821,0.00341796
1903326001_2019-06-11 12_07_06-lv1-18648-16700-3610-3417p0009.png,0.9675946,0.0324054
2019-11827-1-1-1_2019-05-28 14_14_04-lv1-38890-14353-5119-7083p0018.png,0.5542637,0.4457363
...
Each line contains the filename of a validation set image ("xxx.png") followed by the predicted values for the two classification labels: non-lesion and lesion.
Take chest_N-shot_submission.csv for example:
...
Each line contains the filename of a validation set image ("xxx.png") followed by the predicted values for the 19 classification labels.
2. Testing Phase
Evaluation metric
To evaluate the experimental results, we compute the overall accuracy (Acc) and the area under the receiver operating characteristic curve (AUC) for the multi-class classification task on the ColonPath dataset, and the mean average precision (mAP) and AUC for the multi-label classification tasks on the ChestDR and Endo datasets. The average of all Acc, mAP and AUC results is the final evaluation metric used for ranking (see below).
Accuracy is the fraction of test images that are predicted correctly. In the multi-class classification task, the predicted label is the class with the maximum softmax output. AUC is computed for each class and measures how well the model distinguishes between positive and negative samples across threshold settings. The AP is the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight, while the mAP is the mean of the AP scores over all classes.
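The metric definitions above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the challenge's official scorer: labels are assumed to be 0/1 per class, tied scores in the AP ranking are not treated specially, and the function names and toy data are made up for the example.

```python
def accuracy(y_true, scores):
    """Fraction of samples whose arg-max score equals the true class index."""
    correct = 0
    for label, row in zip(y_true, scores):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == label
    return correct / len(y_true)


def auc(y_true, y_score):
    """AUC for one class as the probability that a positive outranks a negative.

    Ties count as half a win. Requires at least one positive and one negative.
    """
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """AP for one class: precision at each positive hit, averaged over positives."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(label_matrix, score_matrix):
    """mAP: mean of the per-class AP scores (one column per class)."""
    num_classes = len(label_matrix[0])
    aps = [
        average_precision(
            [row[c] for row in label_matrix], [row[c] for row in score_matrix]
        )
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


# Toy data, invented for illustration only.
toy_acc = accuracy([0, 1, 1, 0], [[0.99, 0.01], [0.45, 0.55], [0.97, 0.03], [0.55, 0.45]])
toy_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Averaging precision over the positive hits is equivalent to weighting each precision by the step in recall, since recall rises by the same amount at every positive sample.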
In the validation phase, the mAP/Acc and AUC results of each 1/5/10-shot Endo/Colon/Chest experiment are summed and averaged. For example, for a certain submission, the full results are:
{
  "task": {
    "endo_1-shot": { "AUC_metric": "71.38597732085242", "mAP_metric": "24.799103463247953" },
    "endo_5-shot": { "AUC_metric": "65.99679511172724", "mAP_metric": "20.54613367576475" },
    "chest_1-shot": { "AUC_metric": "58.20866053495631", "mAP_metric": "13.45018286252051" },
    "chest_5-shot": { "AUC_metric": "63.89120494491268", "mAP_metric": "15.194459029060967" },
    "colon_1-shot": { "AUC_metric": "82.91080234149541", "Acc_metric": "74.28243398392652" },
    "colon_5-shot": { "AUC_metric": "97.61714910229762", "Acc_metric": "92.05510907003445" },
    "endo_10-shot": { "AUC_metric": "73.84727863236574", "mAP_metric": "25.70903935120376" },
    "chest_10-shot": { "AUC_metric": "67.89146588105439", "mAP_metric": "17.599631470006315" },
    "colon_10-shot": { "AUC_metric": "97.05886129153455", "Acc_metric": "91.61882893226176" }
  },
  "aggregates": "58.55906205551241"
}
Thus, the result shown on the leaderboard is the mean of these 18 values:
(71.38597732085242 + 24.799103463247953 + 65.99679511172724 + 20.54613367576475 + 58.20866053495631 + 13.45018286252051 + 63.89120494491268 + 15.194459029060967 + 82.91080234149541 + 74.28243398392652 + 97.61714910229762 + 92.05510907003445 + 73.84727863236574 + 25.70903935120376 + 67.89146588105439 + 17.599631470006315 + 97.05886129153455 + 91.61882893226176) / 18 = 58.55906205551241
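The same averaging can be reproduced with a few lines of Python. This is only a sketch: the dictionary copies the per-task metrics from the example results above (stored as strings, as in the original), and the variable names are chosen for illustration.

```python
# Per-task metrics copied from the example submission above.
task_results = {
    "endo_1-shot": {"AUC_metric": "71.38597732085242", "mAP_metric": "24.799103463247953"},
    "endo_5-shot": {"AUC_metric": "65.99679511172724", "mAP_metric": "20.54613367576475"},
    "chest_1-shot": {"AUC_metric": "58.20866053495631", "mAP_metric": "13.45018286252051"},
    "chest_5-shot": {"AUC_metric": "63.89120494491268", "mAP_metric": "15.194459029060967"},
    "colon_1-shot": {"AUC_metric": "82.91080234149541", "Acc_metric": "74.28243398392652"},
    "colon_5-shot": {"AUC_metric": "97.61714910229762", "Acc_metric": "92.05510907003445"},
    "endo_10-shot": {"AUC_metric": "73.84727863236574", "mAP_metric": "25.70903935120376"},
    "chest_10-shot": {"AUC_metric": "67.89146588105439", "mAP_metric": "17.599631470006315"},
    "colon_10-shot": {"AUC_metric": "97.05886129153455", "Acc_metric": "91.61882893226176"},
}

# Flatten every metric of every task into one list of floats.
metric_values = [
    float(value)
    for metrics in task_results.values()
    for value in metrics.values()
]

# The leaderboard score is the plain mean over all 18 metric values.
aggregate = sum(metric_values) / len(metric_values)
```

Up to floating-point rounding, aggregate agrees with the "aggregates" field reported in the example results.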
In the testing phase, the 1-shot, 5-shot and 10-shot experiments on each dataset (ColonPath, ChestDR and Endo) will be executed 5 separate times.