Skeleton Action Recognition (SAR) has attracted significant interest for its efficient representation of the human skeletal structure. Despite these advances, recent studies have raised security concerns about SAR models, particularly their vulnerability to adversarial attacks. However, such attacks have so far been confined to digital scenarios and are ineffective in the physical world, which restricts their real-world applicability. To investigate the vulnerabilities of SAR in the physical world, we introduce Physical Skeleton Backdoor Attacks (PSBA), the first exploration of physical backdoor attacks against SAR. Considering the practicalities of physical execution, we propose a novel trigger-implantation method that integrates infrequent and imperceivable actions as triggers into the original skeleton data. By incorporating a minimal amount of this manipulated data into the training set, PSBA causes the system to misclassify any skeleton sequence into the target class whenever the trigger action is present. We examine the resilience of PSBA in both poisoned-label and clean-label scenarios, demonstrating its efficacy across a range of datasets, poisoning ratios, and model architectures. Additionally, we introduce a trigger-enhancing strategy to strengthen attack performance in the clean-label setting. The robustness of PSBA is tested against three distinct backdoor defenses, and its stealthiness is evaluated with two quantitative metrics. Furthermore, using a Kinect V2 camera, we compile a dataset of real-world human actions to mimic physical attack scenarios, and our findings confirm the effectiveness of the proposed attacks.
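To make the poisoning protocol concrete, the sketch below illustrates the two label settings described above. It is a minimal illustration, not the paper's implementation: the `(N, T, J, 3)` clip layout, the `implant_trigger` callable, and the `poison_ratio` default are all assumptions for demonstration.

```python
import numpy as np

def poison_dataset(sequences, labels, implant_trigger, target_class,
                   poison_ratio=0.05, clean_label=False, rng=None):
    """Inject trigger-implanted samples into a skeleton training set.

    sequences       : (N, T, J, 3) array of N skeleton clips (assumed layout)
    labels          : (N,) array of class labels
    implant_trigger : callable that blends the trigger action into one clip
    """
    rng = rng or np.random.default_rng(0)
    sequences, labels = sequences.copy(), labels.copy()

    if clean_label:
        # Clean-label setting: only samples already in the target class
        # receive the trigger; their labels stay untouched.
        candidates = np.flatnonzero(labels == target_class)
    else:
        # Poisoned-label setting: any sample may be selected, and its
        # label is flipped to the attacker's target class.
        candidates = np.arange(len(labels))

    n_poison = int(poison_ratio * len(labels))
    chosen = rng.choice(candidates, size=min(n_poison, len(candidates)),
                        replace=False)
    for i in chosen:
        sequences[i] = implant_trigger(sequences[i])
        if not clean_label:
            labels[i] = target_class
    return sequences, labels
```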
We used a Kinect V2 camera to record skeletal data from 3,400 action instances performed in five real-world settings, including expansive indoor and outdoor environments. Each participant was asked to perform three specific actions (nodding, bending sideways, and crossing hands in front) that were previously identified as trigger actions in our research. To simulate physical attack scenarios, these trigger actions were performed in combination with other distinct actions. We incorporated 17 common action categories from datasets such as NTU RGB+D, NTU RGB+D 120, and PKU-MMD, including sitting down, jumping up, and kicking; these actions are detailed in Table 1. Each action was performed with varying intensity, covering both minimal and exaggerated motion to reflect the natural variability of human movement. The recorded trigger actions were set aside as a test suite for assessing the model's recognition capabilities.
You can download our preprocessed data from Google Drive. For data processing, we first recorded raw .xef files with a Kinect V2 camera. We then extracted the skeleton data into skeleton.txt files using KinectXEFTools. Finally, we consolidated and normalized all skeleton.txt files and stored them in .npy format.
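The final consolidation step might look like the following sketch. It assumes a whitespace-separated skeleton.txt layout with one joint per line (x, y, z columns) and the 25 Kinect V2 joints; the actual KinectXEFTools export format, our normalization details, and the file paths here are placeholders.

```python
import glob
import numpy as np

def load_skeleton_txt(path, num_joints=25):
    """Parse one skeleton.txt (assumed: one joint per line, x y z columns)."""
    coords = np.loadtxt(path, usecols=(0, 1, 2), dtype=np.float32)
    return coords.reshape(-1, num_joints, 3)  # (T, 25, 3)

def normalize(frames, root_joint=0):
    """Center each frame on the root joint, then scale to unit range."""
    frames = frames - frames[:, root_joint:root_joint + 1, :]
    scale = np.abs(frames).max() or 1.0
    return frames / scale

clips = [normalize(load_skeleton_txt(p))
         for p in sorted(glob.glob("data/**/skeleton.txt", recursive=True))]

# Pad every clip to a common length before stacking into one array.
T = max(c.shape[0] for c in clips)
batch = np.stack([np.pad(c, ((0, T - c.shape[0]), (0, 0), (0, 0)))
                  for c in clips])
np.save("skeletons.npy", batch)
```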
We provide demo code here showing how to generate triggers.
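As a rough illustration of what such a trigger could look like, the sketch below blends a subtle nodding motion into a clip by adding a small sinusoidal vertical offset to the head joint (index 3 in the Kinect V2 skeleton). This is a hypothetical stand-in for the released demo, not our actual trigger-generation code.

```python
import numpy as np

HEAD = 3  # Kinect V2 joint index for the head

def implant_nod_trigger(clip, amplitude=0.02, cycles=2):
    """Blend a subtle nodding motion into a skeleton clip.

    clip : (T, 25, 3) array of joint coordinates. The nod is modeled as
    a small sinusoidal vertical displacement of the head joint, keeping
    it imperceptible next to the clip's main action.
    """
    clip = clip.copy()
    num_frames = clip.shape[0]
    phase = np.sin(np.linspace(0.0, 2.0 * np.pi * cycles, num_frames))
    clip[:, HEAD, 1] += amplitude * phase  # y-axis = vertical
    return clip
```

A function like this could serve as the `implant_trigger` argument in the poisoning sketch above.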
@inproceedings{zheng2024psba,
author = {Zheng, Qichen and Yu, Yi and Yang, Siyuan and Liu, Jun and Lam, Kwok-Yan and Kot, Alex C.},
title = {Towards Physical World Backdoor Attacks against Skeleton Action Recognition},
booktitle = {European Conference on Computer Vision},
year = {2024},
organization = {Springer}
}