PaW-ViT Achieves Robust Ear Verification with Vision Transformers and Accurate Alignment

Scientists are tackling the challenge of reliably identifying individuals using ear biometrics, a field hampered by variations in ear shape, size, and pose. Deeksha Arun, Kevin W. Bowyer, and Patrick Flynn, all from the University of Notre Dame's Department of Computer Science and Engineering, present a novel solution in their work on PaW-ViT, a Patch-based Warping Vision Transformer. This preprocessing technique normalises ear images by aligning image tokens to key anatomical features, mitigating the morphological differences that often plague visual recognition systems. By creating more consistent representations, PaW-ViT demonstrably improves the robustness of Vision Transformer models and offers a promising step towards more secure and accurate ear-based authentication schemes.

Datasets like the Unconstrained Ear Recognition Challenge (UERC) have facilitated progress in this field. Applying standard ViTs directly to ear images presents challenges, as conventional patch tokenization can fragment anatomical structures and ViTs may focus on irrelevant background regions.

Consequently, naive ViT adoption is insufficient for robust ear recognition in unconstrained scenarios. To address these issues, the researchers propose PaW-ViT, which standardizes ear images before they are input to a ViT. PaW-ViT employs a patch-based warping strategy to generate structurally consistent, ear-centric representations. The process begins with boundary extraction and convex hull refinement, followed by uniform sampling of equally spaced boundary points and computation of their centroid. These points define a triangular fan that partitions the ear interior; each triangle is then transformed into a fixed-size square patch via affine warping, and the resulting patches are stitched into a 112 × 112 anatomy-preserving canvas. This preprocessing preserves critical geometric relationships, reduces misalignment inconsistencies, and suppresses irrelevant background regions. By providing ViTs with warped, geometry-aware representations, PaW-ViT enhances their ability to learn stable and discriminative embeddings for ear verification.

The researchers evaluated PaW-ViT on four datasets (OPIB, AWE, WPUT, and EarVN1.0) using four ViT configurations: Tiny (ViT-T), Small (ViT-S), Base (ViT-B), and Large (ViT-L). Results revealed that ViT-B and ViT-L consistently outperformed the smaller configurations, demonstrating the benefits of larger model capacity when combined with warped representations.
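To make the partition-and-warp step concrete, here is a minimal sketch in Python with OpenCV and NumPy. It assumes an ear silhouette mask is already available; the point count, patch size, grid layout, and the choice of mapping each triangle onto a right triangle filling half of its square patch are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of PaW-ViT-style patch-based warping (illustrative,
# not the authors' exact settings). Requires OpenCV and NumPy.
import cv2
import numpy as np

def warp_ear(image, mask, n_points=49, patch=16):
    """Partition the ear into a triangular fan and warp each triangle
    into a fixed-size square patch, then stitch a 112x112 canvas."""
    # 1. Boundary extraction + convex hull refinement (mask: uint8, 1 channel).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea).squeeze(1)   # (N, 2)
    hull = cv2.convexHull(boundary).squeeze(1)                 # (M, 2)

    # 2. Sample approximately equally spaced boundary points + centroid.
    idx = np.linspace(0, len(hull) - 1, n_points, endpoint=False).astype(int)
    pts = hull[idx].astype(np.float32)
    centroid = pts.mean(axis=0)

    # 3. Triangular fan (centroid, p_i, p_{i+1}); each triangle is affinely
    #    warped onto a right triangle covering half of a square patch.
    dst_tri = np.float32([[0, 0], [patch - 1, 0], [patch - 1, patch - 1]])
    patches = []
    for i in range(n_points):
        src_tri = np.float32([centroid, pts[i], pts[(i + 1) % n_points]])
        M = cv2.getAffineTransform(src_tri, dst_tri)
        patches.append(cv2.warpAffine(image, M, (patch, patch)))

    # 4. Stitch into a 112x112 canvas (7x7 grid of 16x16 patches
    #    for this choice of n_points = 49).
    rows = [np.hstack(patches[r * 7:(r + 1) * 7]) for r in range(7)]
    return np.vstack(rows)
```

Because the warped canvas is built from boundary-anchored triangles rather than a fixed rectangular grid, every ear, regardless of its original shape or pose, lands in roughly the same canonical layout before tokenization.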

Patch-based warping yielded the greatest benefits on challenging datasets like EarVN1.0, where union and intersection maps surpassed baseline performance, highlighting the value of combining structural cues. The key contributions of the study are anatomy-aware preprocessing via radial partitioning and warping, which improves ear structure representation, and robustness across diverse ear shapes, poses, and occlusions.

The use of the human ear for biometric recognition is appealing due to its distinctive morphology, relative stability, and non-intrusive acquisition. Unlike facial features, the ear is less affected by expressions and age-related changes, making it suitable for long-term identification.

The emergence of convolutional neural networks (CNNs) transformed ear recognition by enabling end-to-end feature learning. However, CNN-based approaches faced challenges due to limited ear-specific training data. Recently, transformer architectures have been introduced for ear recognition. Emeršič et al. showed that ViTs and DeiTs could achieve competitive performance compared to CNNs under unconstrained settings, with reduced reliance on heavy augmentation.

However, these works rely on patch tokenization, which fragments continuous ear structures and can discard fine-grained features that span patches. This is problematic for ear biometrics, where subtle anatomical patterns contribute strongly to uniqueness; background regions also distract attention mechanisms, diluting ear-specific cues. PaW-ViT targets exactly this limitation of standard ViT architectures: rectangular patch tokenization disrupting continuous ear structures. Experiments confirm that PaW-ViT effectively normalises ear images by aligning token boundaries with detected ear feature boundaries, improving robustness to variations in shape, size, and pose. On the challenging EarVN1.0 dataset, training regimes based on the union and intersection of segmentation and landmark maps frequently surpassed baseline performance, highlighting the value of combining complementary structural cues: while segmentation- and landmark-based warping alone yielded limited gains, integrating the two through union and intersection maps consistently enhanced recognition accuracy, as sketched below.
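As a rough illustration of how two such structural cues can be fused, the following sketch combines a segmentation map and a landmark-derived map via union and intersection; the mask names are hypothetical, and both maps are assumed to be same-size binary masks.

```python
# Illustrative fusion of segmentation- and landmark-derived binary maps.
# Mask names are hypothetical; both inputs are assumed to be uint8 masks
# of identical size, nonzero where the cue marks the ear.
import numpy as np

def combine_maps(seg_mask: np.ndarray, lmk_mask: np.ndarray):
    seg = seg_mask > 0                    # ear region from segmentation
    lmk = lmk_mask > 0                    # region spanned by landmarks
    union = np.logical_or(seg, lmk)       # everything either cue covers
    inter = np.logical_and(seg, lmk)      # only pixels both cues agree on
    return union.astype(np.uint8) * 255, inter.astype(np.uint8) * 255
```

Intuitively, the union map is permissive and tolerates one cue missing part of the ear, while the intersection map is strict and keeps only mutually confirmed pixels, suppressing background leakage.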

The study recorded that PaW-ViT maintains accuracy across diverse ear shapes, poses, and occlusions, demonstrating its robustness in real-world scenarios. The method addresses the disconnect between the morphological variation of ear biometrics and the positional sensitivity of transformer architectures, presenting a potential avenue for advanced authentication schemes. Further analysis showed that the approach does not introduce overlapping tokenization within the transformer itself; all ViT models used a standard non-overlapping patchification setting with a patch size equal to the stride (see the sketch after this paragraph). By leveraging anatomical knowledge to normalise ear images and aligning token boundaries with detected ear feature boundaries, PaW-ViT creates more consistent token representations, facilitating the learning of more distinctive features and enhancing robustness and accuracy. Experiments across multiple ViT models (ViT-T, ViT-S, ViT-B, and ViT-L) demonstrate the effectiveness of PaW-ViT on various datasets, notably achieving strong results on AWE (0.9760 ± 0.0040) and WPUT (0.9272 ± 0.0048) with ViT-L.
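For reference, that standard non-overlapping patchification can be sketched as a strided convolution whose kernel size equals its stride, so tokens never overlap; the PyTorch framing and the specific sizes (a 112 × 112 input split into 16 × 16 patches) are assumptions for illustration.

```python
# Standard non-overlapping ViT patchification: kernel_size == stride.
# Sizes are illustrative: 112x112 input, 16x16 patches -> 7x7 = 49 tokens.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)  # patch size == stride

x = torch.randn(1, 3, 112, 112)             # one warped 112x112 ear canvas
tokens = patch_embed(x)                     # (1, 768, 7, 7)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 49, 768) token sequence
print(tokens.shape)
```

Because the normalization happens entirely in preprocessing, the transformer itself stays unmodified, which keeps PaW-ViT compatible with off-the-shelf ViT backbones.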

The authors acknowledge limitations related to datasets containing numerous ear accessories, which introduce occlusions and distortions that reduce the effectiveness of the anatomical boundary-driven warping, particularly on the OPIB and WPUT datasets. Future work could explore methods to mitigate the impact of such occlusions or investigate alternative preprocessing strategies for images with significant accessory-related distortions. Notably, union-map-based warped images consistently delivered the largest improvements.

👉 More information
🗞 PaW-ViT: A Patch-based Warping Vision Transformer for Robust Ear Verification
🧠 ArXiv: https://arxiv.org/abs/2601.19771

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
