Bibliographic Details
Main Authors: Falcão, Rodrigo; Schweitzer, Stefan; Calvet, Emily
Format: Digital resource
Published: Zenodo 2025
Online Access: https://doi.org/10.5281/zenodo.15913264
Table of Contents:
  • <p>This experimentation package contains all materials used and results generated in an empirical study on the effectiveness of LLM-based interoperability strategies. In this experiment, LLMs are used to implement two strategies (DIRECT and CODEGEN) to convert data from an unknown source format to a desired target format.</p> <p><em>Note: An alternative version of this experimentation package is provided <a href="https://doi.org/10.5281/zenodo.18263326" target="_blank" rel="noopener">here</a>. The new version simplifies the operational setup for reproducing the experiment.</em></p> <h1>Organization</h1> <p>The package is organized as follows:</p> <table style="border-collapse: collapse; width: 100%; height: 666.188px;"><colgroup><col style="width: 50.0473%;"><col style="width: 50.0473%;"></colgroup> <tbody> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;"><strong>Directory name</strong></td> <td style="height: 19.5938px;"><strong>Description</strong></td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;"> /experimentation-package</td> <td style="height: 19.5938px;">Root directory</td> </tr> <tr style="height: 39.1875px;"> <td style="height: 39.1875px;">      /datasets</td> <td style="height: 39.1875px;">Datasets used in the experiment. They concern field data of an agricultural scenario.</td> </tr> <tr style="height: 39.1875px;"> <td style="height: 39.1875px;">           /v1</td> <td style="height: 39.1875px;">Version 1 of the dataset. In this version, the target representation has field boundaries in GeoJSON.</td> </tr> <tr style="height: 39.1875px;"> <td style="height: 39.1875px;">           /v2</td> <td style="height: 39.1875px;">Version 2 of the dataset. In this version, the target representation has field boundaries in GeoJSON and the field id.</td> </tr> <tr style="height: 58.7812px;"> <td style="height: 58.7812px;">           /v3</td> <td style="height: 58.7812px;">Version 3 of the dataset. 
In this version, the target representation has field boundaries in GeoJSON, the field id, and the field area in hectares (the source representation also gives the field area in hectares).</td> </tr> <tr style="height: 58.7812px;"> <td style="height: 58.7812px;">           /v4</td> <td style="height: 58.7812px;">Version 4 of the dataset. In this version, the target representation has field boundaries in GeoJSON, the field id, and the field area in acres (whereas the source representation gives the field area in hectares).</td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;">      /procedure</td> <td style="height: 19.5938px;">Experimentation procedure directory</td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;">           /step-1_call-llms</td> <td style="height: 19.5938px;">First step</td> </tr> <tr style="height: 58.7812px;"> <td style="height: 58.7812px;">               ⚙️1_call-llms.bat</td> <td style="height: 58.7812px;">Batch script that uses the evaluation program to call the selected LLMs using the two implemented strategies. 
It generates multiple files with the raw results of each LLM call.</td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;">           /step-2_post-process-llm-results</td> <td style="height: 19.5938px;">Second step</td> </tr> <tr style="height: 39.1875px;"> <td style="height: 39.1875px;">               ⚙️2_post-process-llm-results.bat</td> <td style="height: 39.1875px;">Batch script that uses the evaluation program to consolidate all result files generated by step 1 into csv files.</td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;">           /step-3_analyze-post-processed_results.bat</td> <td style="height: 19.5938px;">Third step</td> </tr> <tr style="height: 39.1875px;"> <td style="height: 39.1875px;">                /aux_dataset_model_strategy_script</td> <td style="height: 39.1875px;">Directory with auxiliary scripts for each combination "dataset version"-"model"-"strategy"</td> </tr> <tr style="height: 39.1875px;"> <td style="height: 39.1875px;">                /aux_scripts</td> <td style="height: 39.1875px;">Directory with scripts containing auxiliary functions to load the data and calculate the results.</td> </tr> <tr style="height: 39.1875px;"> <td style="height: 39.1875px;">                main.R</td> <td style="height: 39.1875px;">R script that does the complete statistical analysis. 
At the end, it generates a file named "output_YYYY-MM-DD-HH-mm-SS.txt" containing all the results.</td> </tr> <tr> <td>           /template-results-directory</td> <td>Empty directory; when the procedure described in the folder "procedure" is followed, the newly generated results are placed here.</td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;">      /program</td> <td style="height: 19.5938px;">Evaluation program directory</td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;">           /grain-eval.jar</td> <td style="height: 19.5938px;">Evaluation program executable</td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;">      /program-src</td> <td style="height: 19.5938px;">Source code directory (anonymized)</td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;">           grain-anonymous-main.zip</td> <td style="height: 19.5938px;">Compressed archive of the anonymized source code of the evaluation program.</td> </tr> <tr> <td>      /prompts</td> <td>Prompt templates that were used to interact with the LLMs.</td> </tr> <tr style="height: 19.5938px;"> <td style="height: 19.5938px;">      /results</td> <td style="height: 19.5938px;">Directory with the raw results of the experiment we conducted.</td> </tr> <tr> <td>      /util</td> <td>Additional information</td> </tr> <tr> <td>          checking_llms_previous_knowledge_about_JD_representation.md</td> <td>Results of our investigation into whether the models had prior knowledge of the John Deere API's representation for field boundaries.</td> </tr> </tbody> </table> <p>    </p> <h1>Our results</h1> <p>To see the raw results of our experiments, navigate to the folder /experimentation-package/results. 
It contains subfolders organizing the results per dataset, strategy, and model.</p> <p> </p> <h1>The evaluation program</h1> <p>The evaluation program "grain-eval.jar" implements both strategies (DIRECT and CODEGEN). The typical usage includes two steps:</p> <ol> <li>Call a certain LLM using a certain strategy and a certain dataset.</li> <li>Export the results of the previous step into a consolidated format (e.g., CSV or Markdown).</li> </ol> <p>To see all options provided by the program, run it with the <code>--help</code> option:</p> <blockquote> <p><code>java -jar grain-eval.jar --help</code></p> </blockquote> <p> </p> <h1>How to reproduce the experiment</h1> <p>The simplest way to reproduce the experiment is to follow the steps described in the folder /experimentation-package/procedure, adjusting the scripts as needed. Alternatively, you can call the selected LLMs manually, consolidate the collected output data, compute the effectiveness of each model (pass@1), and compare the results using the two-proportion Z-test.</p> <p>Steps 1 and 2 can be run from the command line:</p> <blockquote> <p><code>1_call-llms.bat</code></p> <p><code>2_post-process-llm-results.bat</code></p> </blockquote> <p>For Step 3, the script main.R can be run in <a href="https://posit.co/products/open-source/rstudio/?sid=1">RStudio</a>, an open-source IDE for executing scripts in the <a href="https://www.r-project.org/">R language</a>. Disclaimer: some R scripts have been generated with the help of AI-based tools.</p> <h2>Requirements</h2> <p>Because the scripts provided in steps 1 and 2 are batch files, they run on Windows-based computers. On Unix-like systems, they must first be converted into shell scripts.</p> <p>Scripts 1 and 2 call the evaluation program "grain-eval.jar". Running this program requires JDK 21 and Python 3 to be installed on the machine and available on the system path. 
For step 3, R is required, as previously mentioned. The evaluation program depends on a running instance of <a href="https://ollama.com/">Ollama</a> to execute the models; the URL of that instance is passed to the program as an input parameter. If this parameter is omitted, the evaluation program expects to find a running instance of Ollama on the local machine.</p>
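<p>The manual analysis route mentioned under "How to reproduce the experiment" (pass@1 per configuration, then a two-proportion Z-test to compare two configurations) can be sketched as follows. This is a minimal Python illustration and not the package's main.R: it assumes the pooled-variance form of the test without continuity correction, and the function names and the counts in the demo are hypothetical, not results from the study.</p>

```python
import math


def pass_at_1(passed: int, total: int) -> float:
    """Share of runs whose single generated answer was correct."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: p1 == p2.

    Assumption: pooled standard error, no continuity correction.
    x1, x2 are success counts; n1, n2 are the numbers of runs.
    """
    p1 = pass_at_1(x1, n1)
    p2 = pass_at_1(x2, n2)
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero (all runs passed or all failed)")
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal, expressed via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts for two strategies on the same dataset version.
    z, p = two_proportion_z_test(45, 100, 60, 100)
    print(f"z = {z:.3f}, p = {p:.4f}")
```

<p>With the consolidated CSV files from step 2, the success counts and run totals per dataset, model, and strategy would be the inputs to such a function. Results may differ slightly from an R analysis that applies a continuity correction.</p>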