BLEU it All Away! Refocusing SE ML on the Homo Sapiens
Leonhard Applis
TU Delft
[email protected]
Abstract—Many tasks in machine learning for software engineering rely on prominent NLP metrics, such as the BLEU score. These metrics are under heavy criticism within the NLP community itself, but the SE community adapted them for lack of better alternatives. In this paper, we summarize some of the problems with common metrics using examples from code and look for alternatives. We argue that our only hope is the worst of all possible options: Humans.
I. INTRODUCTION
In ancient Greece, Hephaistos was accompanied by servant automatons that helped around his forge, freeing him to spend his time on true masterpieces. This Hellenic ideal of automation lives on to this day and has its renaissance with software engineers: tedious tasks such as writing tests [1] or documentation [2] are shifted towards automation to give room for the developers' creativity. The narrative is great; the results are often humbling. Presentations of GitHub's Copilot [3] cherry-pick their examples, but thorough investigations usually lead to disturbing or amusing results. How did we end up here?
One issue is the metrics. In this paper we focus on Documentation Generation [2], which is lately often interpreted as a translation task from source code to human language (i.e., English) and draws heavily on NLP research, such as sequence-to-sequence models [4] but also the most common metric, BLEU [5]. In recent work, Gehrmann et al. [6] criticized the metric-driven approaches and publications in NLP (specifically generation tasks). Among their primary findings are that (a) people blindly use existing datasets without manual inspection, sampling, etc., (b) people rarely inspect output manually or involve end-users, and (c) all publications use BLEU for lack of better options or for acceptance at a venue. Gehrmann et al.'s proposition is as compelling as it is easy: instead of using big data and arguably weak metrics, center the evaluation around a group of expert users.
The remainder of this paper first highlights some flaws of BLEU in documentation generation in Section II and elaborates on the proposed solution in Section III. While we cover only one domain briefly, we consider this a general critique applicable to most domains. We close in Section IV by arguing that we need to change the course of SE-ML metrics, and while the proposal might not be perfect, it is one we haven't tried in a long time.
II. THE FLAWS
BLEU [5] is a metric to evaluate the quality of translation and text-generation techniques. It compares the overlap of n-grams in a produced text against one or more reference texts, where commonly four-grams are used, as they correlate most closely with human acceptance [7].
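For reference, BLEU as defined by Papineni et al. [5] combines the modified n-gram precisions p_n (commonly up to N = 4 with uniform weights w_n = 1/N) with a brevity penalty BP that punishes candidates shorter than the reference:

```latex
% Standard BLEU definition [5]; c is the candidate length, r the reference length.
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
  1 & \text{if } c > r,\\
  e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```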
There is wide criticism of BLEU [8], [9], but we highlight issues specific to the domain of software engineering:
1) While BLEU takes n-grams into account, many pieces of programming language and documentation will produce a solid score despite sometimes contrary meaning. With common tokenization, return ( a + b ) == ( b - a ); and return ( a - b ) == ( b + a ); score near-perfectly in BLEU, as the sketch below illustrates. Some publications opt for one-gram BLEU, for which the above example yields an optimal match.
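To make this concrete, here is a minimal sketch using NLTK's sentence_bleu with naive whitespace tokenization; the exact numbers depend on tokenizer and smoothing choices, but the flipped operators barely register:

```python
# Issue 1): semantically contrary snippets still score highly under BLEU.
from nltk.translate.bleu_score import sentence_bleu

reference = "return ( a + b ) == ( b - a ) ;".split()
candidate = "return ( a - b ) == ( b + a ) ;".split()

# One-gram BLEU: both snippets contain exactly the same bag of tokens,
# so the score is a perfect 1.0 despite the opposite semantics.
bleu_1 = sentence_bleu([reference], candidate, weights=(1,))

# Default four-gram BLEU: still a "solid" score for code that computes
# something entirely different.
bleu_4 = sentence_bleu([reference], candidate)

print(f"BLEU-1: {bleu_1:.2f}  BLEU-4: {bleu_4:.2f}")
```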
2) Eghbali et al. [10] investigated the BLEU scores of randomly chosen samples from within different corpora. Two random elements from a corpus of English literature score a BLEU of ≈20%; two random elements from a Java corpus score ≈40%. This is stunning insofar as these numbers form the expected baseline if we could produce random elements that follow the same distribution. In its initial publication, CodeBERT reports a BLEU score of 17.65% for documentation generation [2], which is roughly 2.4 percentage points worse than drawing random elements.
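A baseline in this spirit is cheap to estimate for any corpus; the following sketch (corpus, sample size, and smoothing are placeholders) averages the BLEU of randomly paired, unrelated documents:

```python
# Issue 2): the "random draw" BLEU baseline, in the spirit of Eghbali et al. [10].
import random
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def random_pair_baseline(corpus, pairs=1000, seed=0):
    """Average BLEU of randomly paired, unrelated documents from one corpus."""
    rng = random.Random(seed)
    smooth = SmoothingFunction().method1  # keep short texts from scoring zero
    scores = []
    for _ in range(pairs):
        ref, hyp = rng.sample(corpus, 2)  # two distinct random elements
        scores.append(
            sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
        )
    return sum(scores) / len(scores)

# Any model whose reported BLEU falls below this value performs worse than
# handing back an arbitrary element of the corpus.
```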
3) Unlike natural language, programming languages (and their documentation) invent new words frequently. This is known as the open vocabulary problem [11] and is addressed in SE mostly by encodings; prominent are BytePairEncoding [11] and Subword-Splitting [12]. Both increase the number of tokens and thereby benefit the BLEU score, as the sketch below illustrates. This poses two primary issues: (a) the metric becomes harder to evaluate and compare as ever more sub-tokens are strung together, and (b) the research field overburdens itself with experiment complexity only for the sake of metrics.
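The inflation is easy to reproduce. In the sketch below, a naive camelCase splitter stands in for BPE or subword encoders; the identifiers and the helper function are made up for illustration:

```python
# Issue 3): subword splitting lets a *different* identifier contribute
# partially matching tokens, nudging the BLEU score upwards.
import re
from nltk.translate.bleu_score import sentence_bleu

def camel_split(tokens):
    """Split camelCase identifiers into lower-cased subword tokens (illustrative only)."""
    parts = []
    for tok in tokens:
        parts.extend(p.lower() for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\S", tok))
    return parts

reference = "return getUserName ( ) ;".split()
candidate = "return getUserId ( ) ;".split()  # calls a different method

# One-gram BLEU on whole identifiers: getUserId does not match getUserName.
raw = sentence_bleu([reference], candidate, weights=(1,))
# One-gram BLEU on subwords: 'get' and 'user' suddenly count as matches.
split = sentence_bleu([camel_split(reference)], camel_split(candidate), weights=(1,))

print(f"identifier-level BLEU-1: {raw:.2f}")    # 4 of 5 tokens match
print(f"subword-level BLEU-1:    {split:.2f}")  # 6 of 7 tokens match
```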
III. THE OPTIONS
One approach to address the issues is to blame the metric; the BLEU is dead, long live the BLEU. One can easily stitch together a "MetaBLEU" that combines normal BLEU for language representation with stopword-cleaned BLEU for content coverage, as sketched below. Similar fixes for BLEU have been proposed [13], [10], [14], mostly duct-taping over the underlying problems. These are not made in bad faith; on the contrary, they fit perfectly into the current paradigm of ML publications: more data, more features, and better-tuned models can be used with the same benchmark and promise a safe academic voyage. But as a research field, we will hit dead ends through the need for ever more data, a cacophony of metrics, and hungry computation.
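How little effort such stitching takes is part of the point. The sketch below is purely hypothetical: the name meta_bleu, the hand-rolled stopword list, and the equal weighting are made up, not a proposal:

```python
# Hypothetical "MetaBLEU": average plain BLEU (language representation) with
# stopword-cleaned BLEU (content coverage). Stopword list and weights are
# illustrative only; the point is how cheaply such duct tape is applied.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

STOPWORDS = {"the", "a", "an", "of", "to", "this", "that", "is", "and", "or"}
_smooth = SmoothingFunction().method1  # keep short sentences from scoring zero

def meta_bleu(reference: str, candidate: str) -> float:
    ref, cand = reference.split(), candidate.split()
    plain = sentence_bleu([ref], cand, smoothing_function=_smooth)
    ref_clean = [t for t in ref if t.lower() not in STOPWORDS]
    cand_clean = [t for t in cand if t.lower() not in STOPWORDS]
    content = sentence_bleu([ref_clean], cand_clean, smoothing_function=_smooth)
    return 0.5 * plain + 0.5 * content

print(meta_bleu("returns the sum of a and b",
                "returns the sum of both arguments"))
```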