PaW-ViT Achieves Robust Ear Verification with Vision Transformers and Accurate Alignment

Scientists are tackling the challenge of reliably identifying individuals using ear biometrics, a field hampered by variations in ear shape, size, and pose. Deeksha Arun, Kevin W. Bowyer, and Patrick Flynn, all from the University of Notre Dame's Department of Computer Science and Engineering, present a novel solution in their work on PaW-ViT, a Patch-based Warping Vision Transformer. This preprocessing technique normalises ear images by aligning image tokens to key anatomical features, mitigating the morphological differences that often plague visual recognition systems. By creating more consistent representations, PaW-ViT demonstrably improves the robustness of Vision Transformer models and offers a promising step towards more secure and accurate ear-based authentication schemes.

Datasets like the Unconstrained Ear Recognition Challenge (UERC) have facilitated progress in this field. Applying standard ViTs directly to ear images presents challenges, as conventional patch tokenization can fragment anatomical structures and ViTs may focus on irrelevant background regions.

Consequently, naive ViT adoption is insufficient for robust ear recognition in unconstrained scenarios. To address these issues, the researchers propose PaW-ViT, which standardizes ear images before they are input to a ViT. PaW-ViT employs a patch-based warping strategy to generate structurally consistent, ear-centric representations. The process begins with boundary extraction and convex hull refinement, followed by uniform sampling of equally spaced boundary points and computation of their centroid. These points define a triangular fan that partitions the ear interior; each triangle is then transformed into a fixed-size square patch via affine warping, and the resulting patches are stitched into a 112 × 112 anatomy-preserving canvas. This preprocessing preserves critical geometric relationships, reduces misalignment inconsistencies, and suppresses irrelevant background regions. By providing ViTs with warped, geometry-aware representations, PaW-ViT enhances their ability to learn stable and discriminative embeddings for ear verification.

The researchers evaluated PaW-ViT on four datasets (OPIB, AWE, WPUT, and EarVN1.0) using four ViT configurations: Tiny (ViT-T), Small (ViT-S), Base (ViT-B), and Large (ViT-L). Results revealed that ViT-B and ViT-L consistently outperformed the smaller configurations, demonstrating the benefits of larger model capacity when combined with warped representations.
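To make the partition-and-warp step concrete, here is a minimal sketch in Python with OpenCV and NumPy. It assumes an ear silhouette mask is already available; the point count, patch size, grid layout, and the choice of mapping each triangle onto a right triangle filling half of its square patch are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of PaW-ViT-style patch-based warping (illustrative,
# not the authors' exact settings). Requires OpenCV and NumPy.
import cv2
import numpy as np

def warp_ear(image, mask, n_points=49, patch=16):
    """Partition the ear into a triangular fan and warp each triangle
    into a fixed-size square patch, then stitch a 112x112 canvas."""
    # 1. Boundary extraction + convex hull refinement (mask: uint8, 1 channel).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea).squeeze(1)   # (N, 2)
    hull = cv2.convexHull(boundary).squeeze(1)                 # (M, 2)

    # 2. Sample approximately equally spaced boundary points + centroid.
    idx = np.linspace(0, len(hull) - 1, n_points, endpoint=False).astype(int)
    pts = hull[idx].astype(np.float32)
    centroid = pts.mean(axis=0)

    # 3. Triangular fan (centroid, p_i, p_{i+1}); each triangle is affinely
    #    warped onto a right triangle covering half of a square patch.
    dst_tri = np.float32([[0, 0], [patch - 1, 0], [patch - 1, patch - 1]])
    patches = []
    for i in range(n_points):
        src_tri = np.float32([centroid, pts[i], pts[(i + 1) % n_points]])
        M = cv2.getAffineTransform(src_tri, dst_tri)
        patches.append(cv2.warpAffine(image, M, (patch, patch)))

    # 4. Stitch into a 112x112 canvas (7x7 grid of 16x16 patches
    #    for this choice of n_points = 49).
    rows = [np.hstack(patches[r * 7:(r + 1) * 7]) for r in range(7)]
    return np.vstack(rows)
```

Because the warped canvas is built from boundary-anchored triangles rather than a fixed rectangular grid, every ear, regardless of its original shape or pose, lands in roughly the same canonical layout before tokenization.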

Patch-based warping yielded the greatest benefits on challenging datasets like EarVN1.0, where union and intersection maps surpassed baseline performance, highlighting the value of combining structural cues. The key contributions of the study are anatomy-aware preprocessing via radial partitioning and warping, which improves ear structure representation, and robustness across diverse ear shapes, poses, and occlusions.

The use of the human ear for biometric recognition is appealing due to its distinctive morphology, relative stability, and non-intrusive acquisition. Unlike facial features, the ear is less affected by expressions and age-related changes, making it suitable for long-term identification.

The emergence of convolutional neural networks (CNNs) transformed ear recognition by enabling end-to-end feature learning. However, CNN-based approaches faced challenges due to limited ear-specific training data. Recently, transformer architectures have been introduced for ear recognition. Emeršič et al. showed that ViTs and DeiTs could achieve competitive performance compared to CNNs under unconstrained settings, with reduced reliance on heavy augmentation.

However, these works rely on patch tokenization, which fragments continuous ear structures and can discard fine-grained features that span patches. This is problematic for ear biometrics, where subtle anatomical patterns contribute strongly to uniqueness; background regions also distract attention mechanisms, diluting ear-specific cues. PaW-ViT targets exactly this limitation of standard ViT architectures: rectangular patch tokenization disrupting continuous ear structures. Experiments confirm that PaW-ViT effectively normalises ear images by aligning token boundaries with detected ear feature boundaries, improving robustness to variations in shape, size, and pose. On the challenging EarVN1.0 dataset, training regimes based on the union and intersection of segmentation and landmark maps frequently surpassed baseline performance, highlighting the value of combining complementary structural cues: while segmentation- and landmark-based warping alone yielded limited gains, integrating the two through union and intersection maps consistently enhanced recognition accuracy, as sketched below.
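As a rough illustration of how two such structural cues can be fused, the following sketch combines a segmentation map and a landmark-derived map via union and intersection; the mask names are hypothetical, and both maps are assumed to be same-size binary masks.

```python
# Illustrative fusion of segmentation- and landmark-derived binary maps.
# Mask names are hypothetical; both inputs are assumed to be uint8 masks
# of identical size, nonzero where the cue marks the ear.
import numpy as np

def combine_maps(seg_mask: np.ndarray, lmk_mask: np.ndarray):
    seg = seg_mask > 0                    # ear region from segmentation
    lmk = lmk_mask > 0                    # region spanned by landmarks
    union = np.logical_or(seg, lmk)       # everything either cue covers
    inter = np.logical_and(seg, lmk)      # only pixels both cues agree on
    return union.astype(np.uint8) * 255, inter.astype(np.uint8) * 255
```

Intuitively, the union map is permissive and tolerates one cue missing part of the ear, while the intersection map is strict and keeps only mutually confirmed pixels, suppressing background leakage.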

The study recorded that PaW-ViT maintains accuracy across diverse ear shapes, poses, and occlusions, demonstrating its robustness in real-world scenarios. The method addresses the disconnect between the morphological variation of ear biometrics and the positional sensitivity of transformer architectures, presenting a potential avenue for advanced authentication schemes. Further analysis showed that the approach does not introduce overlapping tokenization within the transformer itself; all ViT models used a standard non-overlapping patchification setting with a patch size equal to the stride (see the sketch after this paragraph). By leveraging anatomical knowledge to normalise ear images and aligning token boundaries with detected ear feature boundaries, PaW-ViT creates more consistent token representations, facilitating the learning of more distinctive features and enhancing robustness and accuracy. Experiments across multiple ViT models (ViT-T, ViT-S, ViT-B, and ViT-L) demonstrate the effectiveness of PaW-ViT on various datasets, notably achieving strong results on AWE (0.9760 ± 0.0040) and WPUT (0.9272 ± 0.0048) with ViT-L.
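For reference, that standard non-overlapping patchification can be sketched as a strided convolution whose kernel size equals its stride, so tokens never overlap; the PyTorch framing and the specific sizes (a 112 × 112 input split into 16 × 16 patches) are assumptions for illustration.

```python
# Standard non-overlapping ViT patchification: kernel_size == stride.
# Sizes are illustrative: 112x112 input, 16x16 patches -> 7x7 = 49 tokens.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)  # patch size == stride

x = torch.randn(1, 3, 112, 112)             # one warped 112x112 ear canvas
tokens = patch_embed(x)                     # (1, 768, 7, 7)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 49, 768) token sequence
print(tokens.shape)
```

Because the normalization happens entirely in preprocessing, the transformer itself stays unmodified, which keeps PaW-ViT compatible with off-the-shelf ViT backbones.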

The authors acknowledge limitations related to datasets containing numerous ear accessories, which introduce occlusions and distortions that reduce the effectiveness of the anatomical boundary-driven warping, particularly on the OPIB and WPUT datasets. Future work could explore methods to mitigate the impact of such occlusions or investigate alternative preprocessing strategies for images with significant accessory-related distortions. Notably, union-map-based warped images consistently delivered the largest improvements.

👉 More information
🗞 PaW-ViT: A Patch-based Warping Vision Transformer for Robust Ear Verification
🧠 ArXiv: https://arxiv.org/abs/2601.19771

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
