Automatic Syllabification in European Languages: A Comparison of Data-driven Methods

Although automatic syllabification is an important component of several natural language tasks, little has been done to compare the results of data-driven methods across a wider set of languages. This thesis compares four data-driven syllabification algorithms (IB1, the Look-up Procedure, Liang's algorithm, and Syllabification by Analogy) on nine European languages (Basque, Dutch, English, French, Frisian, German, Italian, Norwegian, and Spanish). Three questions are investigated: which algorithm performs best, which domain (spelling or pronunciation) is easier for automatic syllabification, and which languages are more straightforward to syllabify. First, the findings show that Syllabification by Analogy outperforms the other algorithms tested, with a mean word accuracy of 96.84\%. Second, contrary to claims in the field, no significant difference was found between automatic syllabification performance in the two domains. Finally, the ranking of the languages in terms of syllabic complexity matches the results of previous work using alternative approaches.