Summary
Current research in medical image segmentation focuses heavily on proposing new architectures, and sub-optimal scientific practices are often employed to demonstrate the feasibility of new methods. The authors argue that, with proper parameter selection and scaling to modern hardware, the U-Net remains the gold standard in the field.
Validation pitfalls
1. Coupling “innovation” with performance boosters
- Comparison with outdated baselines: U-Mamba compared its novel approach against a U-Net without residual connections.
- Using different datasets for baseline and proposed methods: UNETR and SwinUNETR (by the same authors) used a bigger dataset for their model than for the baseline.
- Coupling innovation with self-supervised pre-training while training the baseline from scratch.
- Using a different computational scale than the baselines: e.g. comparing an ensemble of 20 models to standalone models [arXiv:2103.10504, arXiv:2111.14791].
Recommendation: Isolate individual changes so that performance differences can be attributed to them.
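The recommendation can be sketched as an ablation grid in which each run differs from a fixed baseline in exactly one factor, so any performance delta is attributable to that factor alone. All names and factor values below are hypothetical, not from the paper:

```python
# Hypothetical sketch: isolate one change at a time against a fixed baseline.

BASELINE = {
    "architecture": "unet_residual",
    "pretraining": None,
    "ensemble_size": 1,
}

# Each ablation toggles exactly one factor relative to the baseline.
ABLATIONS = [
    {"architecture": "mamba_block"},
    {"pretraining": "self_supervised"},
    {"ensemble_size": 5},
]

def build_configs(baseline, ablations):
    """Return (name, config) pairs: the baseline plus one config per ablation."""
    configs = [("baseline", dict(baseline))]
    for change in ablations:
        cfg = dict(baseline)
        cfg.update(change)                # apply exactly one change
        name = ",".join(f"{k}={v}" for k, v in change.items())
        configs.append((name, cfg))
    return configs

for name, cfg in build_configs(BASELINE, ABLATIONS):
    print(name, cfg)
```

Comparing each run against the baseline (same data, same training budget, same seeds) avoids the coupling problem described above, where a performance booster such as pre-training or ensembling is silently bundled with the architectural novelty.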
2. Lack of well-configured baseline
- Many authors claim methodological superiority without transparent and effective parameter selection for the baseline, e.g. via nnU-Net or other self-configuring methods [SegMamba, arXiv:2401.13560; CoTr, arXiv:2103.03024].
Recommendation: Provide configuration and adaptation instructions alongside new methods to enhance reproducibility and comparability.
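For reference, a well-configured baseline is largely automated by nnU-Net's self-configuring pipeline; a minimal invocation (dataset ID 1 is a placeholder) might look like the following, assuming nnU-Net v2 is installed as per its public documentation:

```shell
# Automatically derive preprocessing and network configuration for dataset 1
nnUNetv2_plan_and_preprocess -d 1 --verify_dataset_integrity

# Train fold 0 of the automatically derived 3d_fullres configuration
nnUNetv2_train 1 3d_fullres 0
```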
3. Lack of adequate benchmarking datasets
- Many datasets used as benchmarks to demonstrate the usefulness of a novel architecture either have high intra-method variance (between different folds of the same method) or low systemic variance (between different methods), so genuine improvements are hard to detect.
Recommendation: Use a more diverse set of tasks and datasets to substantiate the claimed advancement.
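The two variance notions can be illustrated with a toy computation (the Dice numbers below are invented, not from the paper): if the gap between method means is smaller than the per-fold spread of either method, the dataset cannot distinguish the methods reliably.

```python
# Sketch: intra-method variance (spread across cross-validation folds of one
# method) vs. systemic variance (gap between mean scores of different methods).
from statistics import mean, stdev

fold_dice = {                      # Dice per fold, per method (hypothetical)
    "nnU-Net":   [0.84, 0.86, 0.83, 0.85, 0.84],
    "NewMethod": [0.85, 0.83, 0.86, 0.84, 0.85],
}

intra = {m: stdev(scores) for m, scores in fold_dice.items()}
method_means = [mean(s) for s in fold_dice.values()]
systemic = max(method_means) - min(method_means)

for m in fold_dice:
    print(f"{m}: mean={mean(fold_dice[m]):.3f}, fold std={intra[m]:.4f}")
print(f"between-method gap: {systemic:.4f}")
```

Here the between-method gap (~0.002 Dice) is well below each method's fold-to-fold standard deviation (~0.011), so the apparent ranking on a single split is essentially noise.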
4. Lack of systematic reporting practices
- New algorithms are usually evaluated on public leaderboards (such as the BraTS challenge, ImageNet, etc.), but these leaderboards rarely specify rules for pre-processing, post-processing, dataset sub-splitting, ensembling, etc. [LeViT-UNet, arXiv:2107.08623; D-Former, arXiv:2201.00462]
- Sometimes, results are reported only for a subset of the dataset's classes, without proper justification [TransUNet, arXiv:2102.04306; Swin-Unet (Springer)].
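Reporting every class avoids this pitfall; a minimal sketch of a per-class Dice computation (toy 1-D label arrays in place of real 3-D volumes, function name is illustrative):

```python
# Sketch: report Dice for *every* foreground class, not a favorable subset.
import numpy as np

def dice_per_class(pred, gt, num_classes):
    """Return {class_id: Dice score}; NaN where a class is absent from both."""
    scores = {}
    for c in range(1, num_classes):          # skip background class 0
        p, g = (pred == c), (gt == c)
        denom = p.sum() + g.sum()
        inter = np.logical_and(p, g).sum()
        scores[c] = 2.0 * inter / denom if denom else float("nan")
    return scores

pred = np.array([0, 1, 1, 2, 2, 3])
gt   = np.array([0, 1, 2, 2, 2, 3])
print(dice_per_class(pred, gt, num_classes=4))
```

A table listing all classes (including the poorly segmented ones) makes selective reporting immediately visible to reviewers.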