EMPIRICAL SOFTWARE ENGINEERING

Prioritizing test cases for deep learning-based video classifiers
Li Y, Dang X, Ma L, Klein J and Bissyandé TF
The widespread adoption of video-based applications across various fields highlights their importance in modern software systems. However, in comparison to images or text, labeling video test cases for the purpose of assessing system accuracy can lead to increased expenses due to their temporal structure and larger volume. Test prioritization has emerged as a promising approach to mitigate the labeling cost: it prioritizes potentially misclassified test inputs so that such inputs can be identified earlier with limited time and manual labeling effort. However, applying existing prioritization techniques to video test cases faces certain limitations: they do not account for the unique temporal information present in video data. Unlike static image datasets that only contain spatial information, video inputs consist of multiple frames that capture the dynamic changes of objects over time. In this paper, we propose VRank, the first test prioritization approach designed specifically for video test inputs. The fundamental idea behind VRank is that video-type tests with a higher probability of being misclassified by the evaluated DNN classifier are considered more likely to reveal faults and will be prioritized higher. To this end, we train a ranking model with the aim of predicting the probability of a given test input being misclassified by a DNN classifier. This prediction relies on four types of generated features: temporal features (TF), video embedding features (EF), prediction features (PF), and uncertainty features (UF). We rank all test inputs in the target test set based on their misclassification probabilities. Videos with a higher likelihood of being misclassified will be prioritized higher. We conducted an empirical evaluation to assess the performance of VRank, involving 120 subjects with both natural and noisy datasets. The experimental results reveal that VRank outperforms all compared test prioritization methods, with an average improvement of 5.76% to 46.51% on natural datasets and 4.26% to 53.56% on noisy datasets.
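A minimal sketch of the prioritization idea described above, assuming each video test case has already been reduced to a feature vector combining the temporal, embedding, prediction, and uncertainty features; the gradient-boosting model and function names are illustrative choices, not VRank's actual implementation.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_ranking_model(features: np.ndarray, misclassified: np.ndarray):
    # features: (n_videos, n_features); misclassified: 0/1 labels from a labeled split
    model = GradientBoostingClassifier()
    model.fit(features, misclassified)
    return model

def prioritize(model, target_features: np.ndarray) -> np.ndarray:
    # Return indices of target videos, most-likely-misclassified first,
    # so they can be labeled first under a fixed labeling budget.
    prob_misclassified = model.predict_proba(target_features)[:, 1]
    return np.argsort(-prob_misclassified)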
Impact of log parsing on deep learning-based anomaly detection
Khan ZA, Shin D, Bianculli D and Briand LC
Software systems log massive amounts of data, recording important runtime information. Such logs are used, for example, for log-based anomaly detection, which aims to automatically detect abnormal behaviors of the system under analysis by processing the information recorded in its logs. Many log-based anomaly detection techniques based on deep learning models include a pre-processing step called log parsing. However, understanding the impact of log parsing on the accuracy of anomaly detection techniques has received surprisingly little attention so far. Investigating which key properties log parsing techniques should ideally have to support anomaly detection is therefore warranted. In this paper, we report on a comprehensive empirical study on the impact of log parsing on anomaly detection accuracy, using 13 log parsing techniques and seven anomaly detection techniques (five based on deep learning and two based on traditional machine learning) on three publicly available log datasets. Our empirical results show that, despite what is widely assumed, there is no strong correlation between log parsing accuracy and anomaly detection accuracy, regardless of the metric used for measuring log parsing accuracy. Moreover, we experimentally confirm existing theoretical results showing that it is a property that we refer to as distinguishability in log parsing results, as opposed to their accuracy, that plays an essential role in achieving accurate anomaly detection.
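One informal way to read the distinguishability property mentioned above, sketched under the simplifying assumption that each log message carries a known ground-truth message type: a parsing result keeps two message types distinguishable as long as it does not collapse them into the same template. The paper's formal definition is richer than this check.

from collections import defaultdict

def distinguishability_violations(messages):
    # messages: iterable of (ground_truth_type, parsed_template) pairs
    types_per_template = defaultdict(set)
    for gt_type, template in messages:
        types_per_template[template].add(gt_type)
    # templates that merge several ground-truth types make those types indistinguishable
    return {t: types for t, types in types_per_template.items() if len(types) > 1}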
The effect of data complexity on classifier performance
Eberlein J, Rodriguez D and Harrison R
The research area of Software Defect Prediction (SDP) is both extensive and popular, and is often treated as a classification problem. Improvements in classification, pre-processing, and tuning techniques (together with the many factors that can influence model performance) have encouraged this trend. However, no matter the effort in these areas, there seems to be a ceiling in the performance of the classification models used in SDP. In this paper, the issue of classifier performance is analysed from the perspective of data complexity. Specifically, data complexity metrics are calculated using the Unified Bug Dataset, a collection of well-known SDP datasets, and then checked for correlation with the defect prediction performance of machine learning classifiers (in particular, C5.0, Naive Bayes, Artificial Neural Networks, Random Forests, and Support Vector Machines). In this work, different domains of competence and incompetence are identified for the classifiers. Similarities and differences between the classifiers and the performance metrics are found, and the Unified Bug Dataset is analysed from the perspective of data complexity. We found that certain classifiers work best in certain situations and that all data complexity metrics can be problematic, although certain classifiers did excel in some situations.
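As an illustration of the kind of analysis performed, the sketch below correlates data complexity metrics with classifier performance across datasets; the DataFrame layout, column names, and the choice of Spearman correlation are assumptions of this sketch, not the study's artifacts.

import pandas as pd
from scipy.stats import spearmanr

def complexity_vs_performance(df: pd.DataFrame, metric_cols, perf_col="f1"):
    # df has one row per (dataset, classifier) run: complexity metric columns plus a
    # performance column; returns the correlation and p-value per complexity metric.
    results = {}
    for metric in metric_cols:
        rho, p_value = spearmanr(df[metric], df[perf_col])
        results[metric] = (rho, p_value)
    return results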
App review driven collaborative bug finding
Tang X, Tian H, Kong P, Ezzini S, Liu K, Xia X, Klein J and Bissyandé TF
Software development teams generally welcome any effort to expose bugs in their code base. In this work, we build on the hypothesis that mobile apps from the same category (e.g., two web browser apps) may be affected by similar bugs in their evolution process. It is therefore possible to transfer the experience of one historical app to quickly find bugs in its new counterparts. This has been referred to as collaborative bug finding in the literature. Our novelty is that we guide the bug finding process by considering that existing bugs have been hinted at within app reviews. Concretely, we design the BugRMSys approach to recommend bug reports for a target app by matching historical bug reports from apps in the same category with user app reviews of the target app. We experimentally show that this approach enables us to quickly expose and report dozens of bugs for targeted apps such as Brave (a web browser app). BugRMSys's implementation relies on DistilBERT to produce natural language text embeddings. Our pipeline considers similarities between bug reports and app reviews to identify relevant bugs. We then focus on the app review as well as potential reproduction steps in the historical bug report (from a same-category app) to reproduce the bugs. Overall, after applying BugRMSys to six popular apps, we were able to identify, reproduce, and report 20 new bugs: among these, 9 reports have already been triaged, 6 have been confirmed, and 4 have been fixed by official development teams.
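A hedged sketch of the matching step: embed app reviews of the target app and historical bug reports from same-category apps, then recommend the most similar reports per review. The sentence-transformers checkpoint name below is an assumption; BugRMSys relies on DistilBERT embeddings, but its exact pipeline is not reproduced here.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")  # assumed checkpoint

def recommend_bug_reports(app_reviews, historical_reports, top_k=5):
    # For each review of the target app, return the top-k most similar historical reports.
    review_emb = model.encode(app_reviews, convert_to_tensor=True)
    report_emb = model.encode(historical_reports, convert_to_tensor=True)
    similarity = util.cos_sim(review_emb, report_emb)   # shape: (n_reviews, n_reports)
    top = similarity.topk(k=top_k, dim=1)
    return [[(historical_reports[int(j)], float(s)) for s, j in zip(scores, idx)]
            for scores, idx in zip(top.values, top.indices)]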
Search-based Automatic Repair for Fairness and Accuracy in Decision-making Software
Hort M, Zhang JM, Sarro F and Harman M
Decision-making software mainly based on Machine Learning (ML) may contain fairness issues (e.g., providing favourable treatment to certain people rather than others based on sensitive attributes such as gender or race). Various mitigation methods have been proposed to automatically repair fairness issues to achieve fairer ML software and help software engineers create responsible software. However, existing bias mitigation methods trade accuracy for fairness (i.e., they accept a reduction in accuracy in exchange for better fairness). In this paper, we present a novel search-based method for repairing ML-based decision-making software to simultaneously increase both its fairness and accuracy. As far as we know, this is the first bias mitigation approach based on multi-objective search that aims to repair fairness issues without trading accuracy, for binary classification methods. We apply our approach to two ML models widely studied in the software fairness literature (i.e., Logistic Regression and Decision Trees), and compare it with seven publicly available state-of-the-art bias mitigation methods using three different fairness measurements. The results show that our approach successfully increases both accuracy and fairness in 61% of the cases studied, while the state-of-the-art methods always decrease accuracy when attempting to reduce bias. With our proposed approach, software engineers who were previously concerned about accuracy losses when considering fairness can now improve the fairness of binary classification models without sacrificing accuracy.
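To make the two-objective idea concrete, the sketch below scores candidate models on accuracy and on a simple group-fairness measure and keeps the non-dominated ones; this generic Pareto filter stands in for the paper's actual multi-objective search, and the fairness measure used here is an assumption.

import numpy as np

def evaluate(model, X, y, sensitive):
    # Accuracy plus a toy fairness score (1 - absolute difference in positive prediction
    # rates between the two sensitive groups); higher is better for both objectives.
    preds = model.predict(X)
    acc = float((preds == y).mean())
    fairness = 1.0 - abs(float(preds[sensitive == 0].mean()) -
                         float(preds[sensitive == 1].mean()))
    return acc, fairness

def pareto_front(candidates):
    # candidates: list of (model, (accuracy, fairness)); keep non-dominated entries.
    front = []
    for m, (a, f) in candidates:
        dominated = any(a2 >= a and f2 >= f and (a2, f2) != (a, f)
                        for _, (a2, f2) in candidates)
        if not dominated:
            front.append((m, (a, f)))
    return front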
AI support for data scientists: An empirical study on workflow and alternative code recommendations
Ramasamy D, Sarasua C and Bernstein A
Despite the popularity of AI assistants for coding activities, there is limited empirical work on whether these coding assistants can help users complete data science tasks. Moreover, in data science programming, exploring alternative paths has been widely advocated, as such paths may lead to diverse understandings and conclusions (Gelman and Loken 2013; Kale et al. 2019). Whether existing AI-based coding assistants can support data scientists in exploring the relevant alternative paths remains unexplored. To fill this gap, we conducted a mixed-methods study to understand how data scientists solved different data science tasks with the help of an AI-based coding assistant that provides explicit alternatives as recommendations throughout the data science workflow. Specifically, we quantitatively investigated whether the users accept the code recommendations, including alternative recommendations, by the AI assistant and whether the recommendations are helpful when completing descriptive and predictive data science tasks. Through the empirical study, we also investigated if including information about the data science step (e.g., data exploration) they seek recommendations for in a prompt leads to helpful recommendations. In our study, we found that including the data science step in a prompt had a statistically significant improvement in the acceptance of recommendations, whereas the presence of alternatives did not lead to any significant differences. Our study also shows a statistically significant difference in the acceptance and usefulness of recommendations between descriptive and predictive tasks. Participants generally had positive sentiments regarding AI assistance and our proposed interface. We share further insights on the interactions that emerged during the study and the challenges that our users encountered while solving their data science tasks.
Assessing the adoption of security policies by developers in terraform across different cloud providers
Verdet A, Hamdaqa M, Silva LD and Khomh F
Cloud computing has become popular thanks to the widespread use of Infrastructure as Code (IaC) tools, which allow the community to manage and configure cloud infrastructure using scripts. However, the scripting process does not automatically prevent practitioners from introducing misconfigurations, vulnerabilities, or privacy risks. As a result, ensuring security relies on practitioners' understanding and the adoption of explicit policies. To understand how practitioners deal with this problem, we perform an empirical study analyzing the adoption of scripted security best practices present in Terraform files, applied to AWS, Azure, and Google Cloud. We assess the adoption of these practices by analyzing a sample of 812 open-source GitHub projects. We scan each project's configuration files, looking for policy implementation through static analysis (Checkov and Tfsec). One policy category emerges as the most widely adopted across all providers, while another contains the most neglected policies. Regarding the cloud providers, we observe that AWS and Azure exhibit similar behavior in terms of adopted and neglected policies. Finally, we provide guidelines for cloud practitioners to limit infrastructure vulnerability and discuss further aspects associated with policies that have yet to be extensively embraced within the industry.
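A small sketch of the scanning step, assuming Checkov is installed and using its documented -d/--output flags; the handling of the JSON report layout and the field names used are simplifying assumptions and may need adjustment across Checkov versions.

import json
import subprocess
from collections import Counter

def failed_policies(project_dir: str) -> Counter:
    # Run Checkov on a Terraform project and count failed checks per policy ID.
    proc = subprocess.run(["checkov", "-d", project_dir, "--output", "json"],
                          capture_output=True, text=True)
    report = json.loads(proc.stdout)
    reports = report if isinstance(report, list) else [report]
    failed = Counter()
    for r in reports:
        for check in r.get("results", {}).get("failed_checks", []):
            failed[check["check_id"]] += 1
    return failed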
Machine learning-based test smell detection
Pontillo V, Amoroso d'Aragona D, Pecorelli F, Di Nucci D, Ferrucci F and Palomba F
Test smells are symptoms of sub-optimal design choices adopted when developing test cases. Previous studies have proved their harmfulness for test code maintainability and effectiveness. Therefore, researchers have been proposing automated, heuristic-based techniques to detect them. However, the performance of these detectors is still limited and dependent on tunable thresholds. We design and experiment with a novel test smell detection approach based on machine learning to detect four test smells. First, we develop the largest dataset of manually-validated test smells to enable experimentation. Afterward, we train six machine learners and assess their capabilities in within- and cross-project scenarios. Finally, we compare the ML-based approach with state-of-the-art heuristic-based techniques. The key findings of the study report a negative result: although the performance of the machine learning-based detector is significantly better than that of heuristic-based techniques, none of the learners is able to exceed an average F-Measure of 51%. We further elaborate and discuss the reasons behind this negative result through a qualitative investigation into the current issues and challenges that prevent the appropriate detection of test smells, which allowed us to catalog the next steps that the research community may pursue to improve test smell detection techniques.
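The sketch below illustrates a cross-project evaluation loop of the kind described above (train on all projects but one, test on the held-out project); the feature columns, the random forest learner, and the dataset layout are placeholders, not the study's artifacts.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def cross_project_f1(df: pd.DataFrame, feature_cols, label_col="smelly", project_col="project"):
    # Leave-one-project-out: train on the other projects, evaluate on the held-out one.
    scores = {}
    for project in df[project_col].unique():
        train, test = df[df[project_col] != project], df[df[project_col] == project]
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(train[feature_cols], train[label_col])
        scores[project] = f1_score(test[label_col], clf.predict(test[feature_cols]))
    return scores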
On the effects of program slicing for vulnerability detection during code inspection
Papotti A, Tuma K and Massacci F
Slicing is a fault localization technique that has been proposed to support debugging and program comprehension. Yet, its empirical effectiveness during code inspection by humans has received limited attention. The goal of our study is two-fold. First, we aim to define what it means for a code reviewer to identify the vulnerable lines correctly. Second, we investigate whether reducing the number of to-be-inspected lines by method-level slicing supports code reviewers in detecting security vulnerabilities. We propose a novel approach based on the notion of a neighborhood around the ground-truth lines (intuitively based on the idea of the context size of the command git diff) to define correctly identified lines. Then, we conducted a multi-year controlled experiment (2017-2023) in which MSc students attending security courses were tasked with identifying vulnerable lines in original or sliced Java files from Apache Tomcat. We provide perfect seed lines for a slicing algorithm to control for confounding factors. Each treatment differs in the pair (Vulnerability, Original/Sliced), with a balanced design over vulnerabilities from the OWASP Top 10 2017: A1 (Injection), A5 (Broken Access Control), A6 (Security Misconfiguration), and A7 (Cross-Site Scripting). To generate smaller slices for human consumption, we used a variant of intra-procedural thin slicing. We report the results both for a strict neighborhood, which corresponds to exactly matching the vulnerable ground-truth lines, and for a relaxed neighborhood, which represents the scenario of identifying the vulnerable area. For both cases, we found that slicing helps in 'finding something' (the participant has found at least some vulnerable lines) as opposed to 'finding nothing'. For the strict case, analyzing a slice and analyzing the original file are statistically equivalent from the perspective of the lines found by those who found something. With the relaxed case, slicing helps to find more vulnerabilities compared to analyzing an original file, as we would normally expect. Given the type of population, additional experiments are necessary before generalizing to experienced developers.
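The neighborhood notion can be illustrated as follows: an answer line counts as correct if it falls within k lines of some ground-truth vulnerable line, much like the context lines shown by git diff. The parameter name k and its values are placeholders here, since the paper's exact symbol and settings are not reproduced.

def correctly_identified(answer_lines, ground_truth_lines, k=0):
    # k=0 requires exact matches with the vulnerable lines; a larger k also accepts
    # lines in the surrounding area (the "vulnerable area" scenario).
    return {a for a in answer_lines
            if any(abs(a - g) <= k for g in ground_truth_lines)}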
Simplifying software compliance: AI technologies in drafting technical documentation for the AI Act
Sovrano F, Hine E, Anzolut S and Bacchelli A
The European AI Act has introduced specific technical documentation requirements for AI systems. Compliance with them is challenging due to the need for advanced knowledge of both legal and technical aspects, which is rare among software developers and legal professionals. Consequently, small and medium-sized enterprises may face high costs in meeting these requirements. In this study, we explore how contemporary AI technologies, including ChatGPT and an existing compliance tool (DoXpert), can aid software developers in creating technical documentation that complies with the AI Act. We specifically demonstrate how these AI tools can identify gaps in existing documentation according to the provisions of the AI Act. Using open-source high-risk AI systems as case studies, we collaborated with legal experts to evaluate how closely tool-generated assessments align with expert opinions. Findings show partial alignment, important issues with ChatGPT (3.5 and 4), and a moderate (and statistically significant) correlation between DoXpert and expert judgments, according to the Rank Biserial Correlation analysis. Nonetheless, these findings underscore the potential of AI to combine with human analysis and alleviate the compliance burden, supporting the broader goal of fostering responsible and transparent AI development under emerging regulatory frameworks.
An empirical study of fault localisation techniques for deep neural networks
Humbatova N, Kim J, Jahangirova G, Yoo S and Tonella P
With the increased popularity of Deep Neural Networks (DNNs), the need for tools that assist developers in the DNN implementation, testing, and debugging process also increases. Several approaches have been proposed that automatically analyse and localise potential faults in DNNs under test. In this work, we evaluate and compare existing state-of-the-art fault localisation techniques, which operate based on both dynamic and static analysis of the DNN. The evaluation is performed on a benchmark consisting of both real faults obtained from bug reporting platforms and faulty models produced by a mutation tool. Our findings indicate that the usage of a single, specific ground truth (e.g. the human-defined one) for the evaluation of DNN fault localisation tools results in rather low performance (maximum average recall of 0.33 and precision of 0.21). However, such figures increase when considering alternative, equivalent patches that exist for a given faulty DNN. The results indicate that DeepFD is the most effective tool, achieving an average recall of 0.55 and a precision of 0.37 on our benchmark.
A comprehensive study of machine learning techniques for log-based anomaly detection
Ali S, Boufaied C, Bianculli D, Branco P and Briand L
Growth in system complexity increases the need for automated techniques dedicated to different log analysis tasks such as Log-based Anomaly Detection (LAD). The latter has been widely addressed in the literature, mostly by means of a variety of deep learning techniques. However, despite their many advantages, that focus on deep learning techniques is somewhat arbitrary, as traditional Machine Learning (ML) techniques may perform well in many cases, depending on the context and datasets. In the same vein, semi-supervised techniques deserve the same attention as supervised techniques since the former have clear practical advantages. Further, current evaluations mostly rely on the assessment of detection accuracy. However, this is not enough to decide whether or not a specific ML technique is suitable to address the LAD problem in a given context. Other aspects to consider include training and prediction times as well as the sensitivity to hyperparameter tuning, which in practice matters to engineers. In this paper, we present a comprehensive empirical study in which we evaluate a wide array of supervised and semi-supervised, traditional and deep ML techniques w.r.t. four evaluation criteria: detection accuracy, time performance, and the sensitivity of both detection accuracy and time performance to hyperparameter tuning. Our goal is to provide much stronger and more comprehensive evidence regarding the relative advantages and drawbacks of alternative techniques for LAD. The experimental results show that supervised traditional and deep ML techniques fare similarly in terms of their detection accuracy and prediction time on most of the benchmark datasets considered in our study. Moreover, overall, the sensitivity analysis with respect to detection accuracy shows that supervised traditional ML techniques are less sensitive to hyperparameter tuning than deep learning techniques. Further, semi-supervised techniques yield significantly worse detection accuracy than supervised techniques.
Analyzing and mitigating (with LLMs) the security misconfigurations of Helm charts from Artifact Hub
Minna F, Massacci F and Tuma K
Helm is a package manager that allows defining, installing, and upgrading applications with Kubernetes (K8s), a popular container orchestration platform. A Helm chart is a collection of files describing all dependencies, resources, and parameters required for deploying an application within a K8s cluster. This study aimed to mine and empirically evaluate the security of Helm charts, comparing the performance of existing tools in terms of the misconfigurations reported by their default policies, and measuring to what extent LLMs could be used to remove misconfigurations. To this end, we proposed a pipeline to mine Helm charts from Artifact Hub, a popular centralized repository, and analyze them using state-of-the-art open-source tools such as Checkov and KICS. First, the pipeline runs several chart analyzers and identifies the common and unique misconfigurations reported by each tool. Secondly, it uses LLMs to suggest a mitigation for each misconfiguration. Finally, the previously generated LLM-refactored chart is analyzed again by the same tools to see whether it satisfies the tools' policies. We also performed a manual analysis on a subset of charts to evaluate whether there are false positives in the tools' reports and in the LLM refactoring. We found that (i) there is a significant difference between LLMs, (ii) providing a snippet of the YAML template as input might be insufficient compared to providing all resources, and (iii) even though LLMs can generate correct fixes, they may also delete other, unrelated configurations and thereby break the application.
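A high-level sketch of the scan-fix-rescan loop, assuming Checkov's Helm framework support; ask_llm_for_fix is a hypothetical placeholder for the LLM call, and the JSON field names used are assumptions rather than a documented contract.

import json
import pathlib
import subprocess

def scan_chart(chart_dir: str):
    # Collect the failed checks reported by Checkov for a Helm chart.
    proc = subprocess.run(["checkov", "-d", chart_dir, "--framework", "helm",
                           "--output", "json"], capture_output=True, text=True)
    report = json.loads(proc.stdout)
    reports = report if isinstance(report, list) else [report]
    return [c for r in reports for c in r.get("results", {}).get("failed_checks", [])]

def ask_llm_for_fix(template_text: str, finding: dict) -> str:
    raise NotImplementedError("hypothetical LLM call: prompt with template and finding")

def mitigate(chart_dir: str):
    for finding in scan_chart(chart_dir):
        template = pathlib.Path(chart_dir) / finding["file_path"].lstrip("/")
        template.write_text(ask_llm_for_fix(template.read_text(), finding))
    return scan_chart(chart_dir)   # misconfigurations remaining after refactoring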
Test-based patch clustering for automatically-generated patches assessment
Martinez M, Kechagia M, Perera A, Petke J, Sarro F and Aleti A
Previous studies have shown that Automated Program Repair (APR) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug or it introduces a new defect that is not covered by the test suite. Therefore, the patches generated by APR tools need to be validated by human programmers, which can be very costly and prevents APR tool adoption in practice. Our work aims to minimize the number of plausible patches that programmers have to review, thereby reducing the time required to find a correct patch. We introduce a novel lightweight test-based patch clustering approach called xTestCluster, which clusters patches based on their dynamic behavior. xTestCluster is applied after the patch generation phase in order to analyze the generated patches from one or more repair tools and to provide more information about those patches, facilitating patch assessment. The novelty of xTestCluster lies in using information from the execution of newly generated test cases to cluster patches generated by multiple APR approaches. A cluster is formed of patches that fail on the same generated test cases. The output of xTestCluster gives developers a way of reducing the number of patches to analyze, as they can focus on a sample of patches from each cluster, together with the additional information (new test cases and their results) attached to each patch. After analyzing 902 plausible patches from 21 Java APR tools, our results show that xTestCluster is able to reduce the number of patches to review and analyze by a median of 50%. xTestCluster can save a significant amount of time for developers who have to review the multitude of patches generated by APR tools, and provides them with new test cases that expose the behavioral differences between generated patches. Moreover, xTestCluster can complement other patch assessment techniques that help detect patch misclassifications.
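The clustering criterion can be sketched in a few lines: patches that fail exactly the same generated tests land in the same cluster. Running the generated test suites against each patch is abstracted away behind the failing_tests mapping, which is an assumed input of this sketch.

from collections import defaultdict

def cluster_patches(failing_tests):
    # failing_tests: dict mapping patch id -> set of generated test cases that fail on it.
    clusters = defaultdict(list)
    for patch_id, failures in failing_tests.items():
        clusters[frozenset(failures)].append(patch_id)
    return dict(clusters)   # reviewers can inspect one representative patch per cluster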
Understanding security tactics in microservice APIs using annotated software architecture decomposition models - a controlled experiment
Genfer P, Serbout S, Simhandl G, Zdun U and Pautasso C
While microservice architectures have become a widespread option for designing distributed applications, designing secure microservice systems remains challenging. Although various security-related guidelines and practices exist, these systems' sheer size, complex communication structures, and polyglot tech stacks make it difficult to manually validate whether adequate security tactics are applied throughout their architecture. To address these challenges, we have devised a novel solution that involves the automatic generation of security-annotated software decomposition models and the use of security-based metrics to guide software architects through the assessment of the security tactics employed within microservice systems. To evaluate the effectiveness of our artifacts, we conducted a controlled experiment in which we asked 60 students from two universities and ten experts from industry to identify and assess the security features of two microservice reference systems. During the experiment, we tracked the correctness of their answers and the time they needed to solve the given tasks to measure how well they could understand the security tactics applied in the reference systems. Our results indicate that the supplemental material significantly improved the correctness of the participants' answers without requiring them to consult the documentation more. Most participants also stated in a self-assessment that their understanding of the security tactics used in the systems improved significantly because of the provided material, with the additional diagrams considered very helpful. In contrast, the perception of the architectural metrics varied widely. We could also show that novice developers benefited most from the supplementary diagrams, whereas senior developers could rely on their experience to compensate for the lack of additional help. Contrary to our expectations, we found no significant correlation between the time spent solving the tasks and the overall correctness score achieved, meaning that participants who took more time to read the documentation did not automatically achieve better results. As far as we know, this empirical study is the first analysis that explores the influence of security annotations in component diagrams to guide software developers when assessing microservice system security.
On Refining the SZZ Algorithm with Bug Discussion Data
Rani P, Petrulio F and Bacchelli A
Researchers testing hypotheses related to factors leading to low-quality software often rely on historical data, specifically on details regarding when defects were introduced into a codebase of interest. The prevailing techniques to determine the introduction of defects revolve around variants of the SZZ algorithm. This algorithm leverages information on the lines modified during a bug-fixing commit and finds when these lines were last modified, thereby identifying bug-introducing commits.
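For context, a minimal sketch of the baseline SZZ step being refined: blame each line touched by the bug-fixing commit in its parent revision to find the commits that last modified those lines. This is the textbook algorithm, not the paper's refinement based on bug discussion data.

import subprocess

def blame_line(repo: str, fix_commit: str, path: str, line_no: int) -> str:
    # Blame a single line in the parent of the fix commit; the first token of the
    # porcelain output is the hash of the commit that last modified that line.
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--porcelain", "-L", f"{line_no},{line_no}",
         f"{fix_commit}^", "--", path],
        capture_output=True, text=True, check=True).stdout
    return out.split()[0]

def bug_introducing_candidates(repo, fix_commit, modified_lines):
    # modified_lines: iterable of (path, line_no) pairs changed by the fix commit.
    return {blame_line(repo, fix_commit, path, n) for path, n in modified_lines}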
Semantic matching in GUI test reuse
Khalili F, Mariani L, Mohebbi A, Pezzè M and Terragni V
Reusing test cases across apps that share similar functionalities reduces both the effort required to produce useful test cases and the time needed to offer reliable apps to the market. The main approaches to reusing test cases across apps combine different semantic matching and test generation algorithms to migrate test cases across Android apps. In this paper we define a general framework to evaluate the impact and effectiveness of different choices of semantic matching within Test Reuse approaches when migrating test cases across Android apps. We offer a thorough comparative evaluation of the many possible choices for the components of test migration processes. We propose an approach that combines the most effective choices for each component of the test migration process. We report the results of an experimental evaluation on 8,099 GUI events from 337 test configurations. The results attest to the prominent impact of semantic matching on test reuse. They indicate that sentence-level embedding techniques perform better than word-level ones. They surprisingly suggest a negligible impact of the corpus of documents used for building the word embedding model for the Semantic Matching Algorithm. They provide evidence that semantic matching of events of selected types performs better than semantic matching of events of all types. They show that the effectiveness of the overall Test Reuse approach depends on the characteristics of the test suites and apps. The replication package that we make publicly available online (https://star.inf.usi.ch/#/software-data/11) allows researchers and practitioners to refine the results with additional experiments and to evaluate other choices for test reuse components.
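The matching step at the core of such approaches can be sketched as picking, for each GUI event descriptor of the source test, the semantically closest descriptor in the target app; embed stands for any word- or sentence-level embedding model and is an assumed input of this sketch.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_event(source_descriptor: str, target_descriptors, embed):
    # Return the target-app event descriptor most similar to the source one.
    src = embed(source_descriptor)
    scores = [(cosine(src, embed(t)), t) for t in target_descriptors]
    return max(scores)[1]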
HyperPUT: generating synthetic faulty programs to challenge bug-finding tools
Felici R, Pozzi L and Furia CA
As research in automatically detecting bugs grows and produces new techniques, having suitable collections of programs with known bugs becomes crucial to reliably and meaningfully compare the effectiveness of these techniques. Most of the existing approaches rely on collecting manually curated real-world bugs, or synthetic bugs seeded into real-world programs. Using real-world programs entails that extending the existing benchmarks or creating new ones remains a complex, time-consuming task. In this paper, we propose a complementary approach that automatically generates programs with seeded bugs. Our technique, called HyperPUT, builds C programs from a "seed" bug by incrementally applying program transformations (introducing programming constructs such as conditionals, loops, etc.) until a program of the desired size is generated. In our experimental evaluation, we demonstrate how HyperPUT can generate buggy programs that challenge in different ways the capabilities of modern bug-finding tools, and some of whose characteristics are comparable to those of bugs in existing benchmarks. These results suggest that HyperPUT can be a useful tool to support further research in bug-finding techniques, in particular their empirical evaluation.
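A toy illustration of the generation idea, growing a buggy C program by wrapping a seeded crash in successively nested constructs; HyperPUT's actual transformation catalogue and reachability conditions are more elaborate than this sketch.

import random

SEED_BUG = '    *(int *)0 = 42;  /* seeded crash */\n'

def wrap_in_if(body: str, depth: int) -> str:
    return f'    if (argc > {depth}) {{\n{body}    }}\n'

def wrap_in_loop(body: str, depth: int) -> str:
    return f'    for (int i{depth} = 0; i{depth} < argc; i{depth}++) {{\n{body}    }}\n'

def generate_program(n_transformations: int, seed: int = 0) -> str:
    # Apply random transformations around the seed bug until the desired size is reached.
    random.seed(seed)
    body = SEED_BUG
    for depth in range(1, n_transformations + 1):
        body = random.choice([wrap_in_if, wrap_in_loop])(body, depth)
    return 'int main(int argc, char **argv) {\n' + body + '    return 0;\n}\n'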
Reinforcement learning for online testing of autonomous driving systems: a replication and extension study
Giamattei L, Biagiola M, Pietrantuono R, Russo S and Tonella P
In a recent study, Reinforcement Learning (RL), used in combination with many-objective search, has been shown to outperform alternative techniques (random search and many-objective search) for online testing of Deep Neural Network-enabled systems. The empirical evaluation of these techniques was conducted on a state-of-the-art Autonomous Driving System (ADS). This work is a replication and extension of that empirical study. Our replication shows that RL does not outperform pure random test generation in a comparison conducted under the same settings as the original study, but with no confounding factor coming from the way collisions are measured. Our extension aims at eliminating some of the possible reasons for the poor performance of RL observed in our replication: (1) the presence of reward components providing contrasting feedback to the RL agent; (2) the usage of an RL algorithm (Q-learning) that requires discretization of an intrinsically continuous state space. Results show that our new RL agent is able to converge to an effective policy that outperforms random search. Results also highlight other possible improvements, which open the way to further investigations on how to best leverage RL for online ADS testing.
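For reference, the kind of agent used in the original study can be sketched as tabular Q-learning over a discretized state, which is exactly the design choice the extension questions; the hyperparameters and the environment interface are placeholders of this sketch.

import random
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q = defaultdict(float)   # (discretized_state, action) -> estimated value
        self.actions, self.alpha, self.gamma, self.epsilon = actions, alpha, gamma, epsilon

    def act(self, state):
        # Epsilon-greedy action selection over the discretized state.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])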
Lightweight precise automatic extraction of exception preconditions in java methods
Marcilio D and Furia CA
The condition under which a method throws an exception, its exception precondition, is a crucial element of the method's documentation that clients should know in order to use the method properly. Unfortunately, exceptional behavior is often poorly documented, and it is sensitive to changes in a project's implementation details that can be onerous to keep synchronized with the documentation. We present wit, an automated technique that extracts the exception preconditions of Java methods and constructors. wit uses static analysis to analyze the paths in a method's implementation that lead to throwing an exception. wit's analysis is precise, in that it only reports exception preconditions that are correct and correspond to feasible exceptional behavior. It is also lightweight: it only needs the source code of the class (or classes) to be analyzed, without building or running the whole project. To this end, its design uses heuristics that give up some completeness (wit cannot infer all exception preconditions) in exchange for precision and ease of applicability. We ran wit on the JDK and 46 Java projects, where it discovered 30,487 exception preconditions in 24,461 methods, taking less than two seconds per analyzed public method on average. A manual analysis of a significant sample of these exception preconditions confirmed that wit is 100% precise, and demonstrated that it can document the exceptional behavior of Java methods.
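A crude approximation of the underlying idea (not wit's path analysis): when a method guards a throw with a condition, the negation of that condition is a precondition for avoiding the exception. The regex below only catches the simplest single-line guards and is purely illustrative.

import re

GUARDED_THROW = re.compile(
    r'if\s*\((?P<cond>[^)]*)\)\s*\{?\s*throw\s+new\s+(?P<exc>\w+)')

def simple_exception_preconditions(java_source: str):
    # Example: `if (arg == null) throw new IllegalArgumentException();`
    # yields ("IllegalArgumentException", "!(arg == null)").
    return [(m.group("exc"), f'!({m.group("cond").strip()})')
            for m in GUARDED_THROW.finditer(java_source)]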
Tools and benchmarks evolve: what is their impact on parameter tuning in SBSE experiments?
Golmohammadi A, Zhang M and Arcuri A
In this article, we explore the impact of tool development and its evolution in Search-Based Software Engineering (SBSE) research. As a research tool evolves throughout the years, experiments with novel techniques might require reevaluation of previous studies, especially regarding parameter tuning. These reevaluations also give the opportunity to address the threats to external validity of these previous studies by employing a larger selection of artifacts. To conduct the replicated experiments in this study, the search-based fuzzer EvoMaster is chosen. This SBSE tool has been developed and extended throughout several years (since 2016) and tens of scientific studies. Among the chosen tool's parameters, 6 were carefully selected based on 5 previous studies that we replicate in this article with the latest version of EvoMaster. The replication is applied across an expanded set of artifacts compared to the original replicated studies. Our objective is to validate the robustness and validity of previous findings and to determine the need for parameter tuning in response to the tool's continuous development. Beyond replication, we explored parameter tuning by testing 729 different configurations to find a more performant parameter set, which is later validated through additional rounds of experiments. Additionally, we analyzed the impact of individual parameters on test generation performance using machine learning models, providing insights into their relative effects. Our findings indicate that, although most parameters maintain their efficacy, 2 of them require adjustment. Furthermore, the investigation into the effects of combining different parameter values reveals that carefully optimized configurations can outperform default settings. These findings highlight the importance of regularly reevaluating parameter settings to enhance tool performance in SBSE research.
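The size of the tuning space follows directly from the setup described above: assuming three candidate values for each of the six selected parameters gives 3**6 = 729 configurations, matching the number reported; the parameter names and values below are placeholders, not EvoMaster's actual settings.

from itertools import product

candidate_values = {f"param_{i}": [0.1, 0.5, 0.9] for i in range(1, 7)}  # 6 parameters, 3 values each

configurations = [dict(zip(candidate_values, combo))
                  for combo in product(*candidate_values.values())]
assert len(configurations) == 3 ** 6 == 729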