January 9, 2025 · Open Access

On-line Policy Improvement using Monte-Carlo Search

Abstract

We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is measured statistically, using the initial policy to make decisions at each step of the simulation. The action that maximizes the measured expected reward is then taken, which yields an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers. We have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which the environment can be simulated.
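The rollout procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' backgammon implementation. The toy environment (`step`), the always-"safe" `base_policy`, the action names, and the helper names `rollout_return` and `improved_action` are all hypothetical and were chosen only to keep the example small. For each candidate action, the code simulates many episodes that start with that action and then follow the base policy, averages the returns, and picks the action with the highest estimate.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

State = Tuple[int, int]  # (position, steps_left) in the hypothetical toy environment


def step(state: State, action: str, rng: random.Random) -> Tuple[State, float, bool]:
    """Toy simulator (an assumption): returns (next_state, reward, done)."""
    position, steps_left = state
    if action == "safe":
        gain = 1  # deterministic progress
    else:  # "risky": expected gain is 0.6 * 2 = 1.2
        gain = 2 if rng.random() < 0.6 else 0
    next_state = (position + gain, steps_left - 1)
    return next_state, float(gain), next_state[1] == 0


def base_policy(state: State) -> str:
    """Initial policy to be improved; here it always plays 'safe'."""
    return "safe"


def rollout_return(state: State, first_action: str,
                   policy: Callable[[State], str], rng: random.Random) -> float:
    """Play one simulated episode: take first_action, then follow the policy."""
    total = 0.0
    action = first_action
    while True:
        state, reward, done = step(state, action, rng)
        total += reward
        if done:
            return total
        action = policy(state)


def improved_action(state: State, actions: Sequence[str],
                    policy: Callable[[State], str], n_rollouts: int,
                    rng: random.Random) -> Tuple[str, Dict[str, float]]:
    """Estimate each action's expected return by rollouts and pick the best."""
    estimates: Dict[str, float] = {}
    for action in actions:
        returns = [rollout_return(state, action, policy, rng)
                   for _ in range(n_rollouts)]
        estimates[action] = sum(returns) / n_rollouts
    best = max(actions, key=lambda a: estimates[a])
    return best, estimates


if __name__ == "__main__":
    rng = random.Random(0)
    best, estimates = improved_action((0, 5), ["safe", "risky"],
                                      base_policy, 2000, rng)
    print(best, estimates)
```

Because the rollouts for different actions (and different trials of the same action) do not depend on one another, they could be distributed across workers, which is consistent with the abstract's remark that the algorithm is easily parallelizable. The sketch itself runs serially.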

Authors

Gerald Tesauro

Gregory R. Galperin

Journals

Neural Information Processing Systems

Institutions

Massachusetts Institute of Technology

Cite this study

Tesauro et al. (Thu,) studied this question.

www.synapsesocial.com/papers/6a0a541e5b6facdebcb4e780 — DOI: https://doi.org/10.48550/arxiv.2501.05407

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

High-Performance Job-Shop Scheduling With A Time-Delay TD(λ) Network · 1995 · 92 citations
Dynamic Programming and Optimal Control · 1995 · 10,929 citations
Learning to Predict by the Methods of Temporal Differences · 1988 · 3,943 citations
Connectionist Learning of Expert Preferences by Comparison Training · 1988 · 77 citations
Programming a computer for playing chess · 1950 · 777 citations
