
Agricultural University of Athens
School of Applied Economics and Social Sciences
Department of Agricultural Economics and Rural Development
Evaluation of the Performance of Artificial Intelligence Tools ChatGPT 3.5, Copilot and Gemini in Beekeeping
A thesis submitted to the
AGRICULTURAL UNIVERSITY OF ATHENS
In partial fulfillment of the requirements for the Integrated Master’s Degree of
AGRICULTURAL ECONOMICS AND RURAL DEVELOPMENT
By
IRINI KONTALI
Student ID: 416020
Committee Members
Costopoulou Constantina, Professor (Supervisor)
Karetsos Sotirios, Assistant Professor
Malliapis Michael, Teaching Staff
Athens, September 2024
ABSTRACT
The rapid development of artificial intelligence (AI), particularly of conversational agents widely known as "chatbots" built on large language models (LLMs), has piqued the interest of the research community regarding how these tools, at their current level of maturity, can serve as auxiliary instruments for research. These applications excel at understanding and generating linguistic content and could potentially transform university education and the way research is conducted in every field, including agricultural science. However, it is essential to evaluate the performance of such AI models across a variety of topics in order to highlight their capabilities and identify their errors and possible limitations. This study therefore aims to evaluate the performance of ChatGPT (GPT-3.5), Copilot, and Gemini. Specifically, twenty (20) examination questions were selected from the university beekeeping course taught by the corresponding laboratory of the Department of Plant Production Science of the Agricultural University of Athens. The AI tools were asked to answer the questions in two languages, Greek and English. The validity of the answers was judged by the research staff of the beekeeping laboratory. For the answers in Greek, Copilot achieved a score of 16/20 (80%), followed by Gemini (10/20, 50%) and GPT-3.5 (7/20, 35%). For the answers in English, all three applications achieved an equal score (14/20, 70%), although each answered a different set of questions correctly.
Keywords: Artificial intelligence, Large language models, Evaluation, Examination questions.