Cross-Modal Attack Detection and Adaptive Reconstruction Method Based on Uncertainty Estimation
Abstract
As multimodal fusion applications integrating vision, speech, and language models become widespread in critical domains such as healthcare, transportation, and national defense, the vulnerability of these models to cross-modal adversarial attacks poses a significant threat to system security. Traditional detection methods are typically confined to single-modality signal analysis and struggle to capture subtle inconsistencies across multi-source information. This paper proposes an uncertainty-based cross-modal attack detection and adaptive reconstruction method that achieves real-time detection and repair through joint modeling of multimodal consistency. The approach embeds a Bayesian inference module within the Transformer fusion layer to estimate joint uncertainty across modalities, enabling dynamic monitoring of semantic consistency. Upon detecting an anomalous uncertainty distribution, the system automatically activates a lightweight reconstruction subnetwork that regenerates the perturbed features from cross-modal correlations, thereby repairing the compromised regions. Experiments on the COCO-Multimodal QA and AVSpeech datasets show that the method improves detection accuracy by 34% and 29% against FGSM and PGD attacks, respectively, and that post-attack repair raises model accuracy by 22% with less than a 6% increase in inference latency. These results demonstrate that uncertainty-driven modal-consistency estimation effectively enhances the security and reliability of multimodal learning systems in real-world scenarios. The work provides a deployable defense mechanism for multimodal AI systems, applicable to defense surveillance, autonomous driving, and medical image analysis; it aligns with the technical direction of the U.S. Department of Defense's AI Security Assurance Program and has practical significance for strengthening the security of critical national AI infrastructure.
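The abstract does not include an implementation, but the detect-then-repair loop it describes can be illustrated with a minimal toy sketch. The sketch below assumes Monte Carlo dropout as the Bayesian inference module (a common approximation; the paper may use a different estimator), uses predictive variance across stochastic forward passes as the joint uncertainty score, and stands in for the reconstruction subnetwork with a fixed, hypothetical cross-modal projection. All function names, the threshold calibration, and the perturbation model are illustrative assumptions, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

def fused_forward(vis, txt, drop_p=0.3):
    """One stochastic pass through a toy fusion layer.
    Dropout is kept active at inference (MC dropout), so repeated
    passes sample from an approximate Bayesian posterior."""
    h = np.concatenate([vis, txt])          # naive feature-level fusion
    mask = rng.random(h.shape) > drop_p     # random dropout mask
    return (h * mask) / (1.0 - drop_p)      # inverted-dropout scaling

def joint_uncertainty(vis, txt, n_samples=50):
    """Predictive variance across MC-dropout samples, averaged over
    feature dimensions, as a scalar joint-uncertainty score."""
    samples = np.stack([fused_forward(vis, txt) for _ in range(n_samples)])
    return samples.var(axis=0).mean()

def reconstruct(vis, txt):
    """Stand-in for the reconstruction subnetwork: regenerate the
    (assumed perturbed) visual features from the intact text features
    via a fixed cross-modal projection (hypothetical learned map)."""
    W = np.ones((txt.size, vis.size)) / txt.size
    return txt @ W

# Toy data: clean aligned features, plus an FGSM-like perturbation
# simulated here as additive noise on the visual modality.
vis_clean = np.ones(8)
txt = np.ones(8)
vis_attacked = vis_clean + rng.normal(0.0, 5.0, 8)

u_clean = joint_uncertainty(vis_clean, txt)
u_attacked = joint_uncertainty(vis_attacked, txt)

# Threshold calibrated on clean inputs (factor 3 is an assumption);
# exceeding it triggers the repair path.
tau = 3.0 * u_clean
if u_attacked > tau:
    vis_repaired = reconstruct(vis_attacked, txt)
```

Under MC dropout the per-feature variance grows with the feature magnitude (roughly `x**2 * p / (1 - p)`), so the large adversarial perturbation inflates the joint-uncertainty score well past the clean-input baseline, which is what makes the simple threshold test work in this toy setting.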