Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition. However, existing methods rely on data augmentations that predominantly mask high-motion frames and high-degree joints, yielding biased and incomplete feature representations. We propose ASMa (Asymmetric Spatio-temporal Masking), which learns the full spectrum of spatio-temporal dynamics in human actions through two complementary masking strategies: one selectively masks high-degree joints and low-motion frames, while the other masks low-degree joints and high-motion frames. Together, these ensure balanced and comprehensive skeleton representation learning.
We introduce a learnable feature alignment module to effectively align representations learned from both masked views. To facilitate deployment in resource-constrained settings, we compress the learned representation into a lightweight model using knowledge distillation. Extensive experiments demonstrate that our approach outperforms existing SSL methods with an average improvement of 2.7–4.4% in fine-tuning and up to 5.9% in transfer learning. Our distilled model achieves 91.4% parameter reduction and 3× faster inference on edge devices while maintaining competitive accuracy.
Key Observation: Low-degree joints (e.g., hands, feet) exhibit high motion across actions, while high-degree joints (e.g., spine) act as structural stabilizers with minimal motion. A balanced approach should incorporate both types of motion dynamics.
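This observation can be turned into a concrete masking rule: score joints by their degree in the skeleton graph and frames by their motion magnitude, then mask from opposite ends of each ranking. The sketch below is illustrative only; the toy skeleton, joint counts, and mask budgets `k_j`/`k_f` are our assumptions, not the paper's configuration.

```python
import numpy as np

# Hypothetical toy skeleton: 5 joints, bones as (joint, joint) pairs.
# Joint degree = number of attached bones; a spine-like hub has high
# degree, end effectors (hands, feet) have degree 1.
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]  # joint 1 acts as the hub
num_joints, num_frames = 5, 8

degree = np.zeros(num_joints, dtype=int)
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Random 3D joint positions over time: (frames, joints, xyz).
rng = np.random.default_rng(0)
seq = rng.normal(size=(num_frames, num_joints, 3))

# Per-frame motion = mean joint displacement between consecutive frames.
motion = np.linalg.norm(np.diff(seq, axis=0), axis=-1).mean(axis=1)

k_j, k_f = 2, 3  # mask budgets for joints / frames (illustrative)

# View 1: mask high-degree joints and low-motion frames.
high_deg_joints = np.argsort(degree)[-k_j:]
low_motion_frames = np.argsort(motion)[:k_f]

# View 2 (complementary): mask low-degree joints and high-motion frames.
low_deg_joints = np.argsort(degree)[:k_j]
high_motion_frames = np.argsort(motion)[-k_f:]
```

Because the two views mask from opposite ends of both rankings, every joint/frame regime is hidden in one view but visible in the other.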
ASMa Framework: (a) Asymmetric masking strategies for joints and frames. (b) Three-stream architecture with anchor, spatial, and temporal streams. (c) Encoders trained with Barlow Twins loss. (d) Feature alignment module for downstream tasks. (e) Knowledge distillation for efficient deployment.
Two encoders learn complementary representations, one from each masked view.
Why asymmetric? Different body parts exhibit different motion patterns. By training separate encoders on complementary views, the model learns a more complete representation of human actions.
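Per the framework overview, the two encoders are trained with a Barlow Twins loss. A minimal NumPy sketch of that objective, assuming the standard formulation (an invariance term on the diagonal of the cross-correlation matrix plus a redundancy-reduction term on the off-diagonal); the batch size, embedding width, and λ here are illustrative:

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins objective on two batches of embeddings (N, D):
    push the cross-correlation matrix toward the identity."""
    n = z1.shape[0]
    # Standardize each embedding dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = z1.T @ z2 / n                          # (D, D) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()  # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 16))
# Identical views: the cross-correlation diagonal is ~1, so the
# invariance term (and hence the loss) is near zero.
loss_same = barlow_twins_loss(z, z)
```

Driving the cross-correlation toward the identity makes the two encoders agree on each feature while decorrelating features from one another, without needing negative pairs.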
Bi-directional cross-attention mechanism fuses diverse representations from both encoders for downstream classification tasks.
Key insight: Rather than simply averaging features, cross-attention allows each encoder to selectively attend to relevant information from the other, creating a richer combined representation.
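A minimal sketch of bi-directional cross-attention fusion along these lines, assuming single-head dot-product attention and concatenation as the final fusion step (both are our simplifications, not necessarily the paper's exact alignment module):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats):
    """One attention direction: tokens of one stream query the other.
    Shapes: (tokens, dim) -> (tokens, dim)."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d))
    return attn @ kv_feats

def fuse(f_a, f_b):
    # Bi-directional: each encoder's features attend over the other's,
    # then the two attended views are concatenated.
    a2b = cross_attend(f_a, f_b)
    b2a = cross_attend(f_b, f_a)
    return np.concatenate([a2b, b2a], axis=-1)

rng = np.random.default_rng(0)
fa = rng.normal(size=(10, 64))  # tokens from encoder A
fb = rng.normal(size=(10, 64))  # tokens from encoder B
fused = fuse(fa, fb)            # (10, 128)
```

Unlike averaging, the attention weights are input-dependent, so each encoder can emphasize whichever of the other encoder's tokens are informative for the current action.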
Lightweight student model learns from the combined teacher representation, achieving 91.4% parameter reduction with minimal accuracy drop.
Practical benefit: Enables deployment on resource-constrained devices like Raspberry Pi with 3× faster inference while maintaining competitive accuracy.
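A common way to realize such distillation is soft-label matching between teacher and student outputs. The sketch below assumes the standard temperature-scaled KL formulation (Hinton-style knowledge distillation), which may differ in detail from the paper's objective; the temperature, batch, and class count are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 to keep gradient magnitudes stable."""
    p = softmax(teacher_logits / T)  # soft teacher targets
    q = softmax(student_logits / T)  # student predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (T ** 2) * kl.mean()

rng = np.random.default_rng(0)
t = rng.normal(size=(8, 60))            # teacher logits, e.g. 60 classes
s = t + 0.1 * rng.normal(size=(8, 60))  # a student close to the teacher

loss_perfect = distill_loss(t, t)  # exactly matching student -> 0
loss_student = distill_loss(s, t)  # imperfect student -> positive
```

The high temperature exposes the teacher's full output distribution (its "dark knowledge" about class similarities), which is what lets a 0.54M-parameter student approach a 6.3M-parameter teacher.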
ASMa demonstrates that what you mask matters. By strategically masking complementary spatio-temporal patterns, we achieve better representations than random or uniform masking strategies.
We evaluate ASMa on three standard skeleton action recognition benchmarks:

**NTU RGB+D 60**
- 56,578 sequences, 60 action classes, 40 subjects
- Evaluated on cross-subject (xsub) and cross-view (xview) splits

**NTU RGB+D 120**
- 113,945 sequences, 120 action classes, 106 subjects
- Evaluated on cross-subject (xsub) and cross-setup (xset) splits

**PKU-MMD**
- 28,443 sequences, 51 action classes
- Part I: clean data for pretraining
- Part II: noisy data for transfer learning evaluation
| Method | NTU-60 xsub (Lin.) | NTU-60 xsub (FT.) | NTU-60 xview (Lin.) | NTU-60 xview (FT.) | NTU-120 xsub (Lin.) | NTU-120 xsub (FT.) | NTU-120 xset (Lin.) | NTU-120 xset (FT.) |
|---|---|---|---|---|---|---|---|---|
| 3s-PSTL | 79.1 | 87.1 | 83.8 | 93.9 | 69.2 | 81.3 | 70.3 | 82.6 |
| SCD-Net | 86.6 | - | 91.7 | - | 76.9 | - | 80.1 | - |
| STJD-CL | - | 89.3 | - | 94.8 | - | 83.5 | - | 86.8 |
| 3s-ASMa (Ours) | 87.3 | 92.0 | 91.9 | 96.8 | 80.1 | 87.9 | 81.0 | 88.8 |
| 3s-ASMa-Distill | - | 91.7 | - | 95.9 | - | 86.9 | - | 88.3 |
| Model | Params (M) | FLOPs (G) | Time (ms) | Memory (MB) | FPS |
|---|---|---|---|---|---|
| 3s-ASMa (Teacher) | 6.3 | 4.5 | 59.16 | 314.47 | 16.90 |
| 3s-ASMa-Distill | 0.54 | 1.26 | 21.38 | 173.28 | 46.52 |
| Improvement | ↓ 91.4% | ↓ 72% | ↓ 63.8% | ↓ 44.8% | ↑ 2.75× |
Performance on Raspberry Pi 4B (2GB RAM, CPU only)
| Method | NTU60 → PKU-II | NTU120 → PKU-II | PKU-I → PKU-II |
|---|---|---|---|
| SkeletonMAE | 58.4 | 61.0 | 62.5 |
| S-JEPA | 71.4 | 74.2 | 70.9 |
| 3s-PSTL | 72.4 | 70.1 | 69.1 |
| 3s-ASMa (Ours) | 77.2 | 77.0 | 76.8 |
Complementary masking strategies (high-degree + low-motion vs. low-degree + high-motion) consistently outperform symmetric or random masking by 2-4%.
Cross-attention based feature alignment improves accuracy by 1-2% over simple averaging, showing the importance of adaptive fusion.
Knowledge distillation achieves 91.4% parameter reduction and 3× speedup with only 0.3-1% accuracy drop, enabling real-time edge inference.
Students distilled from linear-probed teachers outperform their teachers by 5.2%, producing more compact and generalizable representations.
ASMa achieves 4-6% improvement on noisy datasets (PKU-MMD Part II), demonstrating superior generalization to challenging real-world scenarios.
ASMa masking improves existing SSL methods (MS²L, AimCLR, CrossCLR) by 1-2.3%, showing it works as a plug-and-play augmentation.
Asymmetric pairings (high-degree + low-motion, low-degree + high-motion) consistently outperform symmetric or random masking.
Left: Temperature sensitivity. Right: Model depth sensitivity. 5 ST-GCN layers with τ=8-9 provides optimal trade-off.
Student distilled from linear-probed teacher shows more compact clusters and outperforms its teacher by 5.2% on average.
@article{anand2026asma,
title={{ASM}a: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning},
author={Aman Anand and Amir Eskandari and Elyas Rashno and Farhana Zulkernine},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2026},
url={https://openreview.net/forum?id=kIFo1q3VMS},
}