The ease with which short videos can be produced and rapidly disseminated has made copyright infringement an increasingly significant problem. Existing techniques struggle to handle the diversity and complexity of short video content accurately, and they are further limited in sensitivity, in meeting real-time requirements, and in resistance to circumvention. We propose a short video copyright authentication method that uses spatial–temporal feature fusion to improve detection efficiency and accuracy. Spatial features are extracted from the key frames of a short video using a deep residual learning model, and dynamic temporal features are obtained from the changes between consecutive frames. A unique fingerprint is then built from a similarity matrix. In addition, the model is trained on adversarial samples so that plagiarized content is still identified accurately under perturbation. Finally, knowledge distillation is used to transfer the "teacher" model into a lighter "student" model. Experimental results show that the proposed model generalizes well and achieves a mean average precision (mAP) of 0.946, outperforming other video detection models, which verifies the effectiveness and feasibility of the method.
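To make the fusion and fingerprinting steps concrete, the following is a minimal sketch only, not the implementation evaluated here. It assumes that per-key-frame spatial feature vectors have already been extracted (for example by a residual network), that temporal features are plain differences between consecutive frame vectors, that fusion is concatenation, and that two clips are compared through a cosine similarity matrix. All function names (`fuse`, `similarity_matrix`, `match_score`) and the scoring rule are hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(frames):
    """Concatenate each frame's spatial vector with its change from the previous frame.

    `frames` is a non-empty list of equal-length spatial feature vectors
    (assumed to be precomputed). The first frame has no predecessor, so its
    temporal part is a zero vector.
    """
    zero = [0.0] * len(frames[0])
    diffs = [[b - a for a, b in zip(prev, cur)]
             for prev, cur in zip(frames, frames[1:])]
    temporal = [zero] + diffs
    # List concatenation (spatial + temporal) is the assumed fusion step.
    return [l2_normalize(s + t) for s, t in zip(frames, temporal)]


def similarity_matrix(a, b):
    """Cosine similarities between two lists of unit-length vectors."""
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


def match_score(query, reference):
    """Average, over query frames, of the best similarity to any reference frame."""
    sim = similarity_matrix(fuse(query), fuse(reference))
    return sum(max(row) for row in sim) / len(sim)


# Toy 2-dimensional "features" standing in for real network outputs.
clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
static = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

print(match_score(clip, clip))    # close to 1.0 (identical clips)
print(match_score(clip, static))  # noticeably lower
```

In this sketch the similarity matrix plays the role of the fingerprint comparison: a copied clip should produce rows whose maximum is near 1, while unrelated content should not. The adversarial training and knowledge distillation stages are deliberately left out.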
Li et al. (Mon,) studied this question.