:::info Authors:
(1) Sirui Hong, DeepWisdom and these authors contributed equally to this work;
(2) Yizhang Lin, DeepWisdom and these authors contributed equally to this work;
(3) Bang Liu, Université de Montréal & Mila and these authors are listed in alphabetical order;
(4) Bangbang Liu, DeepWisdom and these authors contributed equally to this work;
(5) Binhao Wu, DeepWisdom and these authors contributed equally to this work;
(6) Danyang Li, DeepWisdom and these authors contributed equally to this work;
(7) Jiaqi Chen, Fudan University and these authors contributed equally to this work;
(8) Jiayi Zhang, Renmin University of China and these authors contributed equally to this work;
(9) Jinlin Wang, DeepWisdom and these authors contributed equally to this work;
(10) Li Zhang, Fudan University and these authors contributed equally to this work;
(11) Lingyao Zhang, these authors contributed equally to this work;
(12) Min Yang, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences and these authors contributed equally to this work;
(13) Mingchen Zhuge, AI Initiative, King Abdullah University of Science and Technology and these authors contributed equally to this work;
(14) Taicheng Guo, University of Notre Dame and these authors contributed equally to this work;
(15) Tuo Zhou, The University of Hong Kong and these authors contributed equally to this work;
(16) Wei Tao, Fudan University and these authors contributed equally to this work;
(17) Wenyi Wang, AI Initiative, King Abdullah University of Science and Technology and these authors contributed equally to this work;
(18) Xiangru Tang, Yale University and these authors contributed equally to this work;
(19) Xiangtao Lu, DeepWisdom and these authors contributed equally to this work;
(20) Xiawu Zheng, Xiamen University and these authors contributed equally to this work;
(21) Xinbing Liang, DeepWisdom, East China Normal University and these authors contributed equally to this work;
(22) Yaying Fei, Beijing University of Technology and these authors contributed equally to this work;
(23) Yuheng Cheng, The Chinese University of Hong Kong, Shenzhen and these authors contributed equally to this work;
(24) Zongze Xu, DeepWisdom, Hohai University and these authors contributed equally to this work;
(25) Chenglin Wu, DeepWisdom and a corresponding author.
:::
:::tip Editor's Note: This is Part 4 of 5 of a research study detailing the development of Data Interpreter, a solution for various data science and real-world tasks. Read the rest below.
:::
Table of Links
3 Methodology and 3.1 Dynamic planning with Hierarchical Structure
3.2 Tool utilization and generation
3.3 Enhancing reasoning with verification and experience
4.1 Experimental Setup
4.2 Main Result
4.3 Ablation Study
A. Additional Results
B. Implementation Results
C. Details of Datasets
4.1.1 DATASET
\ MATH dataset
\ The MATH dataset (Hendrycks et al., 2021) comprises 12,500 problems, with 5,000 designated as the test set, covering various subjects and difficulty levels. These subjects include Prealgebra (Prealg), Algebra, Number Theory (N.Theory), Counting and Probability (C.Prob), Geometry, Intermediate Algebra, and Precalculus (Precalc), with problems categorized from levels “1” to “5” based on difficulty. Following the setting of Wu et al. (2023b), we evaluated four typical problem types (C.Prob, N.Theory, Prealg, Precalc), excluding level-5 geometry problems from the test set.
\ ML-Benchmark
\ Given the absence of datasets and evaluation metrics for assessing capabilities in the machine learning domain, we developed a benchmark dataset and corresponding evaluation method known as ML-Benchmark. This dataset encompassed eight representative machine learning tasks categorized into three difficulty levels, ranging from easy (level 1) to most complex (level 3). Each task was accompanied by data, a concise description, standard user requirements, suggested steps, and metrics (see Table 8 in the Appendix). For tasks labeled as “toy”, the data was not divided into training and test splits, which required the framework to perform data splitting during modeling.
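\ For the “toy” tasks, where no predefined split exists, the framework has to hold out part of the data itself. A minimal sketch of such a split, using only the standard library, might look as follows. The function name `holdout_split`, the 80/20 ratio, and the fixed seed are illustrative assumptions, not settings reported in the paper.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    rows: Sequence[T], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle rows reproducibly and hold out a fraction for testing.

    The default ratio and seed are assumptions made for illustration.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to form both splits")
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")

    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # local RNG, no global state touched

    # Keep at least one row on each side of the split.
    n_test = min(len(rows) - 1, max(1, round(len(rows) * test_ratio)))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


if __name__ == "__main__":
    train_rows, test_rows = holdout_split(list(range(10)))
    print(len(train_rows), len(test_rows))
```

\ Fixing the seed keeps the split identical across repeated runs, which matters when several frameworks are compared on the same task.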
\ Open-ended task benchmark
\ To evaluate the ability to generalize to real-world tasks, we developed the Open-ended task benchmark, comprising 20 tasks. Each task required the framework to understand user needs, break down complex tasks, and execute code. Each task specified its requirements, foundational data or sources, steps for completion, and specific metrics. The scope was broad, encompassing common needs like Optical Character Recognition (OCR), web search and crawling (WSC), automated email replies (ER), web page imitation (WPI), text-to-image conversion (T2I), image-to-HTML code generation (I2C), image background removal (IBR), and mini-game generation (MGG). We showcase these tasks in Figure 13, Figure 15, and Figure 16 in the Appendix.
\ 4.1.2 EVALUATION METRICS
\ In the MATH benchmark (Hendrycks et al., 2021), accuracy served as the chosen evaluation metric, aligning with the setting proposed in (Wu et al., 2023b; Hendrycks et al., 2021). Considering the variability in interpreting test results, we manually reviewed the outputs generated by all methods to determine the count of accurate responses. For the ML-Benchmark, three evaluation metrics were utilized: completion rate (CR), normalized performance score (NPS), and comprehensive score (CS). These metrics provided comprehensive insights into the model’s performance and were defined as follows:
\
\
\ Normalized performance score (NPS): In our ML-Benchmark, each task had its own evaluation metric, such as accuracy, F1, AUC, or RMSLE. For metrics such as accuracy, F1, and AUC, we presented the raw values to facilitate comparison across identical data tasks. We normalized all performance values s as follows:
\ NPS = 1 / (1 + s) if a smaller s is better; NPS = s otherwise.
\ This transformation ensured that loss-based metrics like RMSLE were scaled from 0 to 1, with higher normalized performance score values indicating better performance.
\ Comprehensive score (CS): To simultaneously assess both the completion rate of task requirements and the performance of generated machine learning models, we calculated the weighted sum of CR and NPS as follows:
\ CS = 0.5 × CR + 0.5 × NPS.
\ Considering the lack of unified performance standards for open-ended tasks, we default to NPS = 0 and directly equate CS to CR.
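\ The scoring rules above can be sketched in a few lines. The function names are hypothetical, and the `1 / (1 + s)` mapping for loss-type metrics is assumed here as the normalization; it is consistent with the stated property that such metrics land between 0 and 1 with higher being better. Open-ended tasks are modeled by passing no NPS, in which case CS is set equal to CR as described.

```python
from typing import Optional


def normalized_performance_score(s: float, lower_is_better: bool) -> float:
    """Map a raw metric value to an NPS where higher is always better.

    Assumption: loss-type metrics (e.g. RMSLE) use 1 / (1 + s), which maps
    [0, inf) onto (0, 1]. Accuracy, F1, and AUC pass through unchanged.
    """
    if lower_is_better:
        if s < 0:
            raise ValueError("loss-type metrics are expected to be non-negative")
        return 1.0 / (1.0 + s)
    return s


def comprehensive_score(cr: float, nps: Optional[float] = None) -> float:
    """CS = 0.5 * CR + 0.5 * NPS; without an NPS (open-ended tasks), CS = CR."""
    if nps is None:
        return cr
    return 0.5 * cr + 0.5 * nps


if __name__ == "__main__":
    # Illustrative inputs only, not values taken from the paper's tables.
    nps = normalized_performance_score(0.25, lower_is_better=True)
    print(nps, comprehensive_score(1.0, nps))
```

\ Because both CR and NPS lie in the unit interval, CS does too, so scores remain comparable across tasks that use different underlying metrics.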
\ 4.1.3 BASELINES AND IMPLEMENTATION DETAILS
\ GPT-4-Turbo (gpt-4-1106-preview) was used in all frameworks to ensure an impartial performance evaluation. To ensure a fair comparison with other frameworks, we kept our experience pool empty to eliminate any prior knowledge. The effect of experience learning is reported in Table 3.
\ MATH dataset: We adopted zero-shot baselines, including MathChat (Wu et al., 2023b) and AutoGen (Wu et al., 2023a), with GPT-4-Turbo for a fair comparison. We set N = 3 for ACV in the MATH dataset. Considering the variability in interpreting test results, we manually reviewed the outputs generated by all methods to determine the count of accurate responses (Wu et al., 2023b).
\ ML-Benchmark: We selected four typical open-source LLM-based agent frameworks that support data analysis and modeling as baselines: XAgent (Team, 2023), AutoGen (Wu et al., 2023a), OpenInterpreter (Lucas, 2023), and TaskWeaver (Qiao et al., 2023). By default, we set N = 1 for ACV in ML-Benchmark and conducted the experiments before January 2024 for all baseline frameworks.
\ Open-ended task benchmark: We employed AutoGen (Wu et al., 2023a) and OpenInterpreter (Lucas, 2023) as baseline models. Each framework underwent three experiments per task, and we reported the average completion rate. We also set N = 1 for ACV in the open-ended task benchmark by default.
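\ The averaging over three runs per task can be sketched as below. The task names and run values are hypothetical placeholders, not results from the paper, and taking an unweighted mean across tasks for the overall figure is an assumption.

```python
from statistics import mean
from typing import Dict, List


def average_completion_rates(runs: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the completion rate of the repeated runs for each task."""
    return {task: mean(scores) for task, scores in runs.items()}


# Hypothetical placeholder values, NOT results reported in the paper.
example_runs = {
    "task_a": [1.0, 0.5, 1.0],
    "task_b": [0.75, 0.75, 1.0],
}

per_task = average_completion_rates(example_runs)
# Assumption: the overall figure is an unweighted mean over tasks.
overall = mean(list(per_task.values()))

if __name__ == "__main__":
    print(per_task, overall)
```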
4.2 MAIN RESULT
\ Performance on math problem solving
\ As illustrated in Figure 7, the Data Interpreter achieved the best results across all tested categories, reaching 0.81 accuracy in the N.Theory category, a 0.15 improvement over AutoGen. In the most challenging category, Precalc, the Data Interpreter obtained an accuracy of 0.28, an increase of 0.16 compared to AutoGen. Notably, the inclusion of ACV resulted in significant improvements across all task categories, with an average relative improvement of 17.29% compared to the version without ACV. On average, the ACV strategy showed a 26% relative improvement compared to AutoGen.
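\ The paragraph reports both absolute gains (accuracy points) and relative gains (percentages). A small sketch of the distinction follows, using only the N.Theory and Precalc figures quoted above. The AutoGen baselines are back-computed from the stated gains rather than read from Figure 7, and the helper names are illustrative.

```python
def absolute_gain(new: float, baseline: float) -> float:
    """Difference in accuracy points."""
    return new - baseline


def relative_gain(new: float, baseline: float) -> float:
    """Gain expressed as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new - baseline) / baseline


# Accuracy and absolute gain as quoted in the text.
quoted = {"N.Theory": (0.81, 0.15), "Precalc": (0.28, 0.16)}

if __name__ == "__main__":
    for category, (accuracy, gain) in quoted.items():
        baseline = accuracy - gain  # back-computed AutoGen accuracy (assumption)
        print(
            f"{category}: +{absolute_gain(accuracy, baseline):.2f} absolute, "
            f"{relative_gain(accuracy, baseline):.1%} relative"
        )
```

\ Because the back-computed Precalc baseline is low, a similar absolute gain translates into a much larger relative gain there than in N.Theory.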
\ Performance on machine learning
\ In Table 1, the Data Interpreter achieved a comprehensive score of 0.95 across the seven tasks, compared to an average score of 0.86 by AutoGen, a 10.3% improvement. It was the only framework with a comprehensive score exceeding 0.9 on Titanic, House Prices, SCTP, and ICR. It also held a clear advantage on individual datasets, improving over AutoGen by 24.7% on ICR and 21.2% on SVPC. The Data Interpreter completed all mandatory processes on every dataset and consistently maintained superior performance; more details can be found in Table 5 in the Appendix.
\ Performance on open-ended tasks
\ Table 2 shows that the Data Interpreter achieved a completion rate of 0.97, a 112% improvement compared to AutoGen. For the IBR task, all three frameworks achieved a 1.0 completion score. In OCR-related tasks, the Data Interpreter achieved an average completion rate of 0.85, outperforming AutoGen and OpenInterpreter by 26.8% and 70.0%, respectively. In tasks requiring multiple steps and multimodal tools or interfaces, such as WPI, I2C, and T2I, the Data Interpreter was the only method to execute all steps. AutoGen and OpenInterpreter failed to log in and obtain the status for the ER task, resulting in a lower completion rate. In contrast, the Data Interpreter dynamically adjusted the task and achieved a completion rate of 0.98.
4.3 ABLATION STUDY
\ Ablation on core modules
\ To assess the performance of various modules, we conducted ablation experiments with three additional configurations on the ML-Benchmark. The first configuration was the ReAct framework (Yao et al., 2022) with simplified prompt phrases that allow code execution. The second configuration integrated dynamic planning, encompassing hierarchical planning and dynamic plan management following each step to facilitate real-time adjustments. The third configuration added tool utilization and generation, which corresponds to the default Data Interpreter.
\ As indicated by Table 3, dynamic planning yielded a significant improvement of 0.48. It helped prepare the dataset and track changes to the data in real time, resulting in better performance, especially in terms of completion rate. Furthermore, using tools resulted in an additional improvement of 9.84%, bringing the comprehensive score to 0.94.
\ Ablation on LLM backbones
\ In machine learning tasks, larger LLM backbones such as Qwen-72B-Chat (Bai et al., 2023) and Mixtral-8x7B (Jiang et al., 2024) exhibited performance comparable to GPT-3.5-Turbo, while smaller LLMs experienced performance degradation.
\ As shown in Figure 8, our Data Interpreter, when paired with smaller models such as Yi-34B-Chat (01-ai, 2023), Qwen-14B-Chat (Bai et al., 2023), Llama2-13B-Chat (Touvron et al., 2023), and even DeepSeek-7B-Chat (Bi et al., 2024), effectively handled tasks such as data loading and analysis. However, these models faced limitations when executing tasks requiring advanced coding proficiency, which can lead to incomplete processes. In open-ended tasks, Mixtral-8x7B achieved high completion rates in three tasks but encountered challenges in the WSC task because it had difficulty accurately outputting complete results to CSV files. As in the machine learning tasks, smaller LLMs encountered execution failures while acquiring images or parsing webpage results, owing to their restricted coding abilities (see Figure 8).
\ Ablation on experience learning
\ To evaluate experience learning, we conducted experiments on five tasks with varying experience pool sizes, measuring task efficiency by debugging attempts and cost. Increasing the pool size from 0 to 200 significantly reduced debugging attempts from 1.48 to 0.32 per task, with costs decreasing from $0.80 to $0.24. This highlights substantial efficiency gains from experience learning. Notably, at a pool size of 80, debugging attempts decreased, especially in ER, Titanic, and House Prices tasks, by 1.25, 1, and 1, respectively. This underscored the efficiency enhancement even with a modest pool size, indicating the sensitivity of LLMs to context and effectiveness in code-centric problem-solving.
:::info This paper is available on arXiv under the CC BY 4.0 DEED license.
:::