What type of study is this?

This is an experimental study.

September 18, 2025 | Open Access

Multimodal neural network processing of video lectures using multi-agent systems

Key Points

An architecture for a multi-agent system has been developed to facilitate effective processing of video lectures.
The orchestrator-worker pattern, using a large language model, enhances coordination and fault tolerance during processing.
Video lectures are categorized into three types, influencing the approach to processing and summarizing information.
Integration of adaptive pipelines allows for flexible data handling according to the specific formats of lecture videos.
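
To illustrate the last two points, here is a minimal sketch of how a pipeline might be chosen per lecture type. The three types come from the abstract; the step names (`crop_slide_region`, `slide_ocr`, `handwriting_ocr`, `speech_to_text`) and the `select_pipeline` helper are hypothetical and are not taken from the paper.

```python
from enum import Enum


class LectureType(Enum):
    """The three video lecture formats named in the abstract."""
    LECTURER_AND_PRESENTATION = "lecturer_and_presentation"
    PRESENTATION_AND_VOICEOVER = "presentation_and_voiceover"
    LECTURER_AND_BLACKBOARD = "lecturer_and_blackboard"


# Hypothetical step names: the paper's actual pipeline stages are not
# listed in the abstract, so these are illustrative assumptions only.
PIPELINES = {
    LectureType.LECTURER_AND_PRESENTATION: [
        "crop_slide_region", "slide_ocr", "speech_to_text",
    ],
    LectureType.PRESENTATION_AND_VOICEOVER: [
        "slide_ocr", "speech_to_text",
    ],
    LectureType.LECTURER_AND_BLACKBOARD: [
        "handwriting_ocr", "speech_to_text",
    ],
}


def select_pipeline(lecture_type: LectureType) -> list[str]:
    """Return a copy of the ordered processing steps for a lecture format."""
    return list(PIPELINES[lecture_type])


if __name__ == "__main__":
    for lecture_type in LectureType:
        print(lecture_type.name, "->", select_pipeline(lecture_type))
```

Keeping the mapping in one table means a new lecture format only needs a new entry rather than changes to the processing code.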

Abstract

Subject of research: multimodal processing of video lectures using multi-agent systems. The article presents intermediate results of the research, including an overview of the concepts of multimodality, multi-agent systems, and multi-model systems, as well as the development of approaches to processing video data from lectures.
Purpose of research: to convert all relevant information from a video lecture into a text document that serves as an accompanying lecture summary. The goal is to develop an effective data processing cycle that accounts for differences in video lecture formats.
Research methods: selection of the "Orchestrator-Performer" (Orchestrator-Worker) pattern with a large language model (LLM) in the role of the orchestrator. Alternative approaches, namely the decentralized peer-to-peer pattern and the hybrid pattern, are reviewed, and the orchestrator approach is justified as the one that ensures consistent processing and fault tolerance. Pipeline processing of the video stream is integrated into the multi-agent system (hybrid approach).
Objects of research: video lectures of three main types, which serve as sources of multimodal data for analysis and processing. The first type, "Lecturer and Presentation", includes recordings in which the lecturer stands to the left or right of the accompanying presentation, so the frame combines the human figure and the slides. The second type, "Presentation and Voiceover", shows theoretical material on presentation slides while the explanation is given off-screen on the audio track. The third type, "Lecturer and Blackboard", covers recordings in which the lecturer writes the material by hand on a chalkboard or marker board.
Research findings: an architecture for a multi-agent system has been developed and justified. It is based on the "Orchestrator-Performer" pattern with a hybrid approach that integrates pipeline video processing into a multi-agent environment for effective task distribution and load management. Models and tools (orchestrators, audio processing models, and OCR) have been selected and described, with lecture types taken into account to build adaptive pipelines. The functioning of the agents is described, including initialization, interaction with the orchestrator, parallel audio and video processing, and aggregation of results into a text document that can be downloaded or printed.
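
The orchestrator-worker flow described in the findings (parallel audio and video processing, fault tolerance, and aggregation into one text document) can be sketched roughly as follows. This is an assumption-laden illustration, not the author's implementation: the LLM orchestrator is replaced by a plain Python function, and `audio_worker`, `video_worker`, `run_with_retry`, and `orchestrate` are hypothetical names with placeholder bodies where real speech-to-text and OCR models would go.

```python
from concurrent.futures import ThreadPoolExecutor


def audio_worker(video_path: str) -> str:
    # Placeholder for a speech-to-text model (assumption, not the paper's model).
    return f"[transcript of {video_path}]"


def video_worker(video_path: str) -> str:
    # Placeholder for frame extraction followed by OCR (assumption).
    return f"[slide text of {video_path}]"


def run_with_retry(worker, video_path: str, attempts: int = 2) -> str:
    """Call a worker, retrying on failure; return an error marker if all attempts fail."""
    last_error = None
    for _ in range(attempts):
        try:
            return worker(video_path)
        except Exception as error:  # a failing worker must not stop the whole run
            last_error = error
    return f"[{worker.__name__} failed: {last_error}]"


def orchestrate(video_path: str, workers=(audio_worker, video_worker)) -> str:
    """Run all workers in parallel and aggregate their outputs into one document."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(run_with_retry, w, video_path) for w in workers]
        # Collect in submission order so the document layout is deterministic.
        sections = [future.result() for future in futures]
    return "\n".join(sections)


if __name__ == "__main__":
    print(orchestrate("lecture.mp4"))
```

In this sketch a failed worker produces a marked section instead of aborting the run, which is one simple way to read the fault tolerance attributed to the central orchestrator.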

Authors

Milan E. Ismagulov

Journals

Yugra State University Bulletin
