The lack of automatic pose evaluation metrics is a major obstacle in the development of sign language generation models.
The primary objective of this repository is to house a suite of automatic evaluation metrics specifically tailored for sign language poses. This includes metrics proposed by Ham2Pose[^1] as well as custom metrics developed for our approach. We recognize the distinct challenges in evaluating single signs versus continuous signing, and our methods reflect this differentiation.
Given an isolated sign corpus such as ASL Citizen[^2], we repeat the Ham2Pose[^1] evaluation with our metrics, ranking distance metrics by retrieval performance.
Evaluation is conducted on a combined dataset of ASL Citizen, Sem-Lex[^3], and PopSign ASL[^4].
For each sign class, we use all available samples as targets and sample four times as many distractors, yielding a 1:4 target-to-distractor ratio.
For instance, for the sign HOUSE with 40 samples (11 from ASL Citizen, 29 from Sem-Lex), we add 160 distractors and compute pairwise metrics from each target to all 199 other examples. We consistently discard scores for pose files where either the target or the distractor could not be embedded with SignCLIP.
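In pseudocode, the per-class trial construction looks roughly like the sketch below; the `gloss`/`path` columns and the function names are illustrative, not the toolkit's actual API:

```python
import random

import pandas as pd


def build_trials(df: pd.DataFrame, gloss: str, ratio: int = 4, seed: int = 0):
    """Collect all targets for `gloss` and sample `ratio` times as many distractors."""
    rng = random.Random(seed)
    targets = df.loc[df["gloss"] == gloss, "path"].tolist()
    others = df.loc[df["gloss"] != gloss, "path"].tolist()
    distractors = rng.sample(others, k=ratio * len(targets))
    return targets, distractors


def score_pool(metric, targets, distractors):
    """Score each target against every other example in the pool (lower = closer)."""
    pool = targets + distractors
    return {t: [(o, metric(t, o)) for o in pool if o != t] for t in targets}
```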
Retrieval quality is measured using Mean Average Precision (mAP↑) and Precision@10 (P@10↑). The complete evaluation covers 5,362 unique sign classes and 82,099 pose sequences.
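For reference, here is a minimal sketch of how these two scores can be computed for a single query from a candidate list ranked by increasing distance; it illustrates the standard definitions, not the toolkit's internal implementation:

```python
from typing import Sequence


def average_precision(is_relevant: Sequence[bool]) -> float:
    """Average precision over a candidate list sorted by increasing distance."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def precision_at_k(is_relevant: Sequence[bool], k: int = 10) -> float:
    """Fraction of relevant items among the top-k ranked candidates."""
    return sum(list(is_relevant)[:k]) / k


# Example: candidates ranked by distance; True marks the same sign class.
ranked = [True, False, True, True, False, False, True, False, False, False]
print(average_precision(ranked))   # AP for this query; mAP averages over queries
print(precision_at_k(ranked, 10))  # P@10 = 0.4
```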
After several pilot runs, we finalized a subset of 169 sign classes with at most 20 samples each, ensuring consistent metric coverage. On this subset, we evaluated 1,200 distance-based metric variants as well as SignCLIP models using the different checkpoints provided by the authors.
The overall results show that DTW-based metrics outperform padding-based baselines. Embedding-based methods, particularly SignCLIP models fine-tuned on in-domain ASL data, achieve the strongest retrieval scores.
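To make the contrast concrete, the sketch below compares a naive zero-padding baseline with a DTW-style distance over variable-length keypoint sequences. It illustrates the general idea rather than the toolkit's implementation, and assumes array shapes of `(frames, keypoints, xyz)`:

```python
import numpy as np


def frame_dist(a: np.ndarray, b: np.ndarray) -> float:
    """Mean Euclidean distance between two frames of shape (keypoints, xyz)."""
    return float(np.linalg.norm(a - b, axis=-1).mean())


def padded_distance(seq1: np.ndarray, seq2: np.ndarray) -> float:
    """Baseline: zero-pad the shorter sequence, then average frame distances."""
    length = max(len(seq1), len(seq2))
    pad = lambda s: np.concatenate([s, np.zeros((length - len(s), *s.shape[1:]))])
    a, b = pad(seq1), pad(seq2)
    return float(np.mean([frame_dist(a[i], b[i]) for i in range(length)]))


def dtw_distance(seq1: np.ndarray, seq2: np.ndarray) -> float:
    """DTW: align frames with dynamic programming before accumulating distances."""
    n, m = len(seq1), len(seq2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_dist(seq1[i - 1], seq2[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])


# Two versions of the same sign performed at different speeds:
fast = np.random.rand(30, 33, 3)          # 30 frames, 33 keypoints, (x, y, z)
slow = fast[np.repeat(np.arange(30), 2)]  # same motion at half speed
print(padded_distance(fast, slow), dtw_distance(fast, slow))
```

In this toy example, the padded distance penalizes the same sign performed at a different speed, while the DTW distance stays near zero, which matches the ranking behaviour described above.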
For the study, we evaluated over 1,200 pose distance metrics, recording mAP and other retrieval performance characteristics.
We find that the top metric
Please make sure to run `black pose_evaluation` before submitting a pull request.
If you use our toolkit in your research or projects, please consider citing the work.
```bibtex
@misc{pose-evaluation2025,
  title={Meaningful Pose-Based Sign Language Evaluation},
  author={Zifan Jiang and Colin Leong and Amit Moryossef and Anne Göhring and Annette Rios and Oliver Cory and Maksym Ivashechkin and Neha Tarigopula and Biao Zhang and Rico Sennrich and Sarah Ebling},
  howpublished={\url{https://github.com/sign-language-processing/pose-evaluation}},
  year={2025}
}
```
- Zifan, Colin, and Amit developed the evaluation metrics and tools: Zifan ran the correlation and human evaluations, while Colin ran the automated meta-evaluation, KNN experiments, etc.
- Colin and Amit developed the library code.
- Zifan, Anne, and Lisa conducted the qualitative and quantitative evaluations.
[^1]: Rotem Shalev-Arkushin, Amit Moryossef, and Ohad Fried. 2022. Ham2Pose: Animating Sign Language Notation into Pose Sequences.

[^2]: Aashaka Desai, Lauren Berger, Fyodor O. Minakov, Vanessa Milan, Chinmay Singh, Kriston Pumphrey, Richard E. Ladner, Hal Daumé III, Alex X. Lu, Naomi K. Caselli, and Danielle Bragg. 2023. ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition. ArXiv, abs/2304.05934.

[^3]: Lee Kezar, Elana Pontecorvo, Adele Daniels, Connor Baer, Ruth Ferster, Lauren Berger, Jesse Thomason, Zed Sevcikova Sehyr, and Naomi Caselli. 2023. The Sem-Lex Benchmark: Modeling ASL Signs and Their Phonemes. Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility.

[^4]: Thad Starner, Sean Forbes, Matthew So, David Martin, Rohit Sridhar, Gururaj Deshpande, Sam S. Sepah, Sahir Shahryar, Khushi Bhardwaj, Tyler Kwok, Daksh Sehgal, Saad Hassan, Bill Neubauer, Sofia Anandi Vempala, Alec Tan, Jocelyn Heath, Unnathi Kumar, Priyanka Mosur, Tavenner Hall, Rajandeep Singh, Christopher Cui, Glenn Cameron, Sohier Dane, and Garrett Tanzer. 2023. PopSign ASL v1.0: An Isolated American Sign Language Dataset Collected via Smartphones. Neural Information Processing Systems.