Subject of research: multimodal processing of video lectures using multi-agent systems. The article focuses on intermediate results of the research, including an overview of the concepts of multimodality, multi-agent systems, and multi-model systems, as well as the development of approaches to processing video data from lectures. Purpose of research: transformation of all relevant information from a video lecture into a text document to form an accompanying lecture summary. The goal is to develop an effective data processing cycle, taking into account differences in video lecture formats. Researchmethods: selection of the «Orchestrator-Performer» pattern (Orchestrator-Worker Pattern) with a large language model (LLM) in the role of the orchestrator. Overview of alternative approaches, namely the peer-to-peer decentralized pattern and the hybrid pattern, with justification for choosing the orchestrator approach to ensure consistent processing and fault tolerance. Integration of pipeline video stream processing into a multi-agent system (hybrid approach). The objects of research in this article are video lectures of three main types, serving as sources of multimodal data for analysis and processing. The first type – «Lecturer and Presentation» – includes video recordings where the lecturer is positioned to the left or right of the accompanying presentation, with an emphasis on the visual combination of the human figure and slides. The second type – «Presentation and Voiceover» – focuses on theoretical material presented on the presentation slides, with explanation off-screen through the audio track. The third type – «Lecturer and Blackboard» – covers recordings where the lecturer writes material on a classic chalk or marker board, emphasizing handwritten input of information. Research findings: An architecture for a multi-agent system has been developed and justified based on the «Orchestrator-Performer» pattern with a hybrid approach, integrating pipeline video processing into a multi-agent environment for effective task distribution and load management. Models and tools have been selected and described, namely orchestrators, audio processing models, OCR, taking into account lecture types for adaptive pipelines. The functioning of agents is described, including initialization, interaction with the orchestrator, parallel audio/video processing, and aggregation of results into a text document with the possibility of downloading/printing.
Building similarity graph...
Analyzing shared references across papers
Loading...
Milan E. Ismagulov
Yugra State University Bulletin
Building similarity graph...
Analyzing shared references across papers
Loading...
Milan E. Ismagulov (Wed,) studied this question.
www.synapsesocial.com/papers/68d463e931b076d99fa6351c — DOI: https://doi.org/10.18822/byusu20250340-47