Summary

Current research in medical image segmentation focuses heavily on proposing new architectures, and sub-optimal scientific practices are often employed to demonstrate the feasibility of these new methods. The authors argue that, with proper parameter selection and scaling of the model to current hardware, the U-Net remains the gold standard in the field.

Validation pitfalls

1. Coupling “innovation” with performance boosters

  1. Comparison with outdated baselines: U-Mamba compared its novel approach against a U-Net without residual connections.
  2. Using different datasets for the baseline and the proposed method
    1. UNETR and SwinUNETR (same authors) used a larger dataset for their own model than for the baseline.


    2. Coupling the innovation with self-supervised pre-training while training the baseline from scratch.

  3. Using a different computational scale than the baselines
  4. Comparing an ensemble of 20 models to standalone models [https://arxiv.org/abs/2103.10504, https://arxiv.org/abs/2111.14791]

Recommendation: Isolate each proposed change so that any gain in performance can be attributed to it (see the sketch below).
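
A minimal sketch of how a single change can be isolated before a performance gain is attributed to it. The `train_and_evaluate` helper and all config keys are hypothetical placeholders, not part of any real library:

```python
# Minimal sketch of a one-factor-at-a-time ablation; train_and_evaluate and
# the config keys are hypothetical placeholders, not a real API.
BASE_CONFIG = {
    "architecture": "unet",
    "residual_connections": False,  # factor under test
    "pretraining": "none",          # performance booster, held fixed
    "ensemble_size": 1,             # performance booster, held fixed
}

def ablation(train_and_evaluate, factor, values):
    """Vary exactly one factor while every other setting stays identical."""
    results = {}
    for value in values:
        config = dict(BASE_CONFIG, **{factor: value})
        results[value] = train_and_evaluate(config)
    return results

# Usage: attribute any gain to residual connections alone, not to pre-training
# or ensembling bundled into the same comparison.
# scores = ablation(my_train_and_evaluate, "residual_connections", [False, True])
```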

2. Lack of well-configured baseline

  1. Many authors claim methodological superiority without transparent and careful parameter selection for the baseline, e.g., without relying on nnU-Net or other self-configuring methods. [[2401.13560] SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation (arxiv.org), [2103.03024] CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation (arxiv.org)]

Recommendation: Provide adaptation instructions along with the method to improve reproducibility and comparability (an example of configuring a strong baseline is sketched below).
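
As a hedged example of what a well-configured baseline can look like, the sketch below drives nnU-Net's self-configuration through its v2 command-line interface. The dataset id (501) is a placeholder, and the exact command names follow the nnU-Net v2 documentation and may differ across versions:

```python
# Sketch: obtaining a self-configured nnU-Net baseline via the nnU-Net v2 CLI.
# Dataset id 501 is a placeholder; command names may differ in other versions.
import subprocess

DATASET_ID = "501"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Fingerprint extraction, experiment planning, and preprocessing.
run(["nnUNetv2_plan_and_preprocess", "-d", DATASET_ID, "--verify_dataset_integrity"])

# Standard 5-fold cross-validation on the full-resolution 3D configuration.
for fold in range(5):
    run(["nnUNetv2_train", DATASET_ID, "3d_fullres", str(fold)])
```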

3. Lack of efficient benchmarking dataset

  1. Many datasets used as benchmarks for demonstrating the usefulness of a novel architecture have either high intra-method variance (between different folds of the same method) or low inter-method variance (between different methods).


Recommendation: Use a more diverse set of tasks and datasets to support the claimed advancement (a simple variance check is sketched below).
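
To make the variance argument concrete, the sketch below compares the gap between two methods against the fold-to-fold noise of either method. The per-fold Dice values are invented purely for illustration:

```python
# Sketch: does fold-to-fold noise swamp the difference between methods?
# The per-fold Dice values below are invented purely for illustration.
import numpy as np

fold_dice = {
    "baseline":        [0.842, 0.851, 0.838, 0.847, 0.844],
    "proposed method": [0.845, 0.853, 0.840, 0.849, 0.846],
}

means = {m: np.mean(v) for m, v in fold_dice.items()}
stds = {m: np.std(v, ddof=1) for m, v in fold_dice.items()}

inter_method_gap = abs(means["proposed method"] - means["baseline"])
intra_method_sd = max(stds.values())

print(f"gap between methods:   {inter_method_gap:.4f}")
print(f"largest fold-level sd: {intra_method_sd:.4f}")
if inter_method_gap < intra_method_sd:
    print("The gap is within fold noise: this dataset cannot separate the methods.")
```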

4. Lack of systematic reporting practices

  1. New algorithms are usually evaluated on public leaderboards (such as the BraTS challenge, ImageNet, etc.), but these leaderboards rarely impose specific rules on pre-processing, post-processing, dataset sub-splitting, ensembling, etc. [[2107.08623] LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation (arxiv.org), [2201.00462] D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation (arxiv.org)]
  2. Sometimes results are reported only for a subset of the dataset's classes without proper justification [[2102.04306] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation (arxiv.org), Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation | SpringerLink]
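
A brief sketch of reporting Dice for every labeled class instead of a hand-picked subset; the class ids in the commented usage are placeholders:

```python
# Sketch: compute and report Dice for every foreground class, not a subset.
# The class ids in the commented usage are placeholders.
import numpy as np

def dice_per_class(pred, gt, class_ids):
    """Per-class Dice over integer label maps; no class is silently dropped."""
    scores = {}
    for c in class_ids:
        p, g = pred == c, gt == c
        denom = p.sum() + g.sum()
        scores[c] = 2.0 * np.logical_and(p, g).sum() / denom if denom else float("nan")
    return scores

# Usage with the full, documented class list from the dataset specification:
# ALL_CLASSES = list(range(1, 14))  # e.g., all 13 labeled organs
# print(dice_per_class(prediction, ground_truth, ALL_CLASSES))
```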