May 16, 2026

How Do Large Language Models Perform in Deep Learning Code Generation? An Empirical Study

Abstract

Deep learning (DL) has become a cornerstone of modern artificial intelligence, driving significant advances in fields such as computer vision and natural language processing. However, DL programming remains challenging for less-experienced developers due to the inherent complexity of graph-based structures and implicit semantic constraints, which differ significantly from those in general-purpose code. In light of these challenges, large language models (LLMs) have recently demonstrated strong potential in code generation, leading to considerable efforts to evaluate their performance. However, existing benchmarks, such as HumanEval, primarily focus on general code and fail to capture the distinct challenges inherent in DL code generation. As a result, the capabilities of LLMs for DL code generation remain unclear. To fill this gap, we conducted an empirical study to evaluate the performance of eight LLMs under four prompting methods. This study focuses on three critical dimensions: code syntax, semantics, and executability, which correspond to the syntactic quality, semantic accuracy, and runtime executability of the generated code, respectively. To support this study, we constructed DeepEval, a benchmark consisting of 100 DL code generation tasks. Based on the code generated by LLMs on DeepEval, we further built a taxonomy of failure cases and analyzed them to uncover common weaknesses of LLMs across these dimensions. Our analysis yielded 14 findings, with representative examples as follows: (i) in syntax, LLMs frequently fail to manage imports and variables due to limited awareness of global context and a tendency to hallucinate the elements of DL APIs; (ii) in semantics, LLMs struggle to organize API calls into corresponding DL model structures, particularly when multiple API call sequences are required; (iii) in executability, LLMs are more prone to inter-layer errors (e.g., tensor shape mismatch) than intra-layer errors. These errors primarily stem from the numerical, type, and structural constraints that are inherently implicit in DL code. Building on these findings, we propose a set of implications and guidelines for DL practitioners and LLM developers to support best practices and uncover potential opportunities.
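
To make the notion of an inter-layer error concrete, the following is a minimal sketch of a tensor shape mismatch between two consecutive dense layers. It is an illustration only and is not taken from the paper or from DeepEval: the `Linear` class and the `find_shape_mismatches` helper are hypothetical stand-ins written in plain Python, and they model only the feature dimension that one layer hands to the next.

```python
from dataclasses import dataclass


@dataclass
class Linear:
    """Hypothetical stand-in for a dense layer mapping in_features -> out_features."""
    in_features: int
    out_features: int


def find_shape_mismatches(layers, input_features):
    """Return (layer_index, expected, received) for every inter-layer mismatch."""
    mismatches = []
    current = input_features
    for index, layer in enumerate(layers):
        if layer.in_features != current:
            mismatches.append((index, layer.in_features, current))
        # Propagate the declared output size so later layers are still checked.
        current = layer.out_features
    return mismatches


# Each layer is valid on its own (no intra-layer error), but the second layer
# expects 128 features while the first one produces 256.
model = [Linear(784, 256), Linear(128, 10)]
print(find_shape_mismatches(model, 784))  # [(1, 128, 256)]
```

The constraint that one layer's output size must equal the next layer's input size appears nowhere in the syntax of either layer definition, which is one way to read the abstract's remark that such constraints are implicit in DL code.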

Cite this study

Ma et al. studied this question.

www.synapsesocial.com/papers/6a080a29a487c87a6a40bfe0 — DOI: https://doi.org/10.1145/3816024

Authors

Xiangyue Ma

Xiaoting Du

Chenglong Li

Journals

ACM Transactions on Software Engineering and Methodology

Institutions

Beihang University

Beijing University of Technology

Beijing University of Posts and Telecommunications
