Deep learning (DL) has become a cornerstone of modern artificial intelligence, driving significant advances in fields such as computer vision and natural language processing. However, DL programming remains challenging for less-experienced developers because of the inherent complexity of graph-based structures and implicit semantic constraints, which differ significantly from those in general-purpose code. Large language models (LLMs) have recently demonstrated strong potential in code generation, prompting considerable efforts to evaluate their performance. Existing evaluations, such as those based on HumanEval, focus primarily on general-purpose code and fail to capture the distinct challenges of DL code generation. As a result, the capabilities of LLMs for DL code generation remain unclear. To fill this gap, we conducted an empirical study evaluating eight LLMs under four prompting methods. The study focuses on three critical dimensions: code syntax, semantics, and executability, which correspond to the syntactic quality, semantic accuracy, and runtime executability of the generated code, respectively. To support the study, we constructed DeepEval, a benchmark consisting of 100 DL code generation tasks. Based on the code the LLMs generated for DeepEval, we further categorized and analyzed numerous failure cases to uncover common weaknesses of LLMs across these dimensions. Our analysis yielded 14 findings, with representative examples as follows: (i) in syntax, LLMs frequently fail to manage imports and variables because of limited awareness of global context and a tendency to hallucinate elements of DL APIs; (ii) in semantics, LLMs struggle to organize API calls into the corresponding DL model structures, particularly when multiple API call sequences are required; (iii) in executability, LLMs are more prone to inter-layer errors (e.g., tensor shape mismatch) than to intra-layer errors. These errors primarily stem from the numerical, type, and structural constraints that are inherently implicit in DL code. Building on these findings, we propose a set of implications and guidelines for DL practitioners and LLM developers to support best practices and uncover potential opportunities.
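To make the notion of an inter-layer error in finding (iii) concrete, the following is a minimal sketch of a tensor shape mismatch between adjacent layers. It is not taken from the paper or from DeepEval: the `Linear` class and the `find_inter_layer_mismatches` helper are hypothetical names introduced here for illustration, and plain Python stands in for a real DL framework.

```python
from dataclasses import dataclass


@dataclass
class Linear:
    """Hypothetical stand-in for a dense layer mapping in_features to out_features."""
    in_features: int
    out_features: int


def find_inter_layer_mismatches(layers, input_features):
    """Return (layer_index, expected, received) for each adjacent-layer mismatch."""
    mismatches = []
    current = input_features
    for index, layer in enumerate(layers):
        if layer.in_features != current:
            mismatches.append((index, layer.in_features, current))
        # Continue from this layer's declared output so later mismatches are also found.
        current = layer.out_features
    return mismatches


# Every layer is well-formed on its own (no intra-layer error), but layer 1
# expects 128 input features while layer 0 produces 256: an inter-layer error.
model = [Linear(784, 256), Linear(128, 64), Linear(64, 10)]
print(find_inter_layer_mismatches(model, input_features=784))  # [(1, 128, 256)]
```

The constraint that one layer's output width must equal the next layer's input width appears nowhere in the code of either layer, which is the sense in which such structural constraints are implicit.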
Ma et al. studied this question.
www.synapsesocial.com/papers/6a080a29a487c87a6a40bfe0 — DOI: https://doi.org/10.1145/3816024
Xiangyue Ma
Xiaoting Du
Chenglong Li
ACM Transactions on Software Engineering and Methodology
Beihang University
Beijing University of Technology
Beijing University of Posts and Telecommunications