The term “machine translation” can mean many things: free online tools, software plugin services or highly specialized translation engines. But what do these technologies actually entail? What progress has taken place in machine translation and why does it affect digital markets in Europe and worldwide? Katrin Marheinecke explains.
Humanity’s desire to overcome language barriers and effortlessly understand foreign languages is almost as old as humanity itself. While people in ancient times held out hope for a miracle, from the mid-19th century right up until the Second World War, language experts (some of them self-proclaimed) tried to establish various artificial languages globally. During the Cold War, military interests in the USA, the Soviet Union and Japan put machine translation back on the agenda and provided the necessary funding. It was, however, a long time before machine translation produced viable results. Only in the last 15 years have we observed a true breakthrough in machine translation quality. By now, machine translation (MT) has become part of everyday life for the lay user as well.
Google Translate and free translation apps have made the technology easy to use for non-specialists: translation is now available to everyone at the push of a button.
But how does machine translation actually work? And can it be used for any language or text type?
First of all, we need to differentiate between rule-based and statistical translation systems. The older of the two is based on grammatical rules, as the name suggests. A set of rules for every language, together with a dictionary that holds detailed grammatical information on each entry, forms the core of the translation system. In addition, more general, language-independent rules can be specified. When translating, the engine analyzes the source sentence and “deconstructs” it into its grammatical components. Then, the MT system “looks up” the words and phrases in the dictionary and combines them with the rules stored for the target language. From these elements the target language translation is pieced together. So far, so good. However, rule-based machine translation can only deal with phenomena for which it has rules in place. Languages do not always conform to their own rules, though, so these translations often sound too literal, if not altogether wrong. Aside from that, preparing the translation engines with rules and dictionaries is extremely tedious and time-consuming. One advantage of the rule-based method is that the translation results are mostly grammatically well-formed and require fewer follow-up corrections (post-editing).
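The look-up-and-combine process described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real MT engine is built: the lexicon, the single English-to-Spanish rule and the function name `translate_noun_phrase` are all assumptions made for this example.

```python
# Hypothetical toy lexicon: every entry carries grammatical information
# (part of speech, and gender or gender-dependent forms).
LEXICON = {
    "the":   {"pos": "det",  "es": {"m": "el", "f": "la"}},
    "red":   {"pos": "adj",  "es": {"m": "rojo", "f": "roja"}},
    "car":   {"pos": "noun", "es": "coche", "gender": "m"},
    "house": {"pos": "noun", "es": "casa", "gender": "f"},
}


def translate_noun_phrase(phrase):
    """Translate an English 'determiner adjective noun' phrase into Spanish."""
    # Step 1: analyze the source phrase by looking up every word.
    entries = []
    for word in phrase.lower().split():
        if word not in LEXICON:
            raise KeyError(f"no dictionary entry for {word!r}")
        entries.append(LEXICON[word])

    # Step 2: check that a rule exists for this grammatical pattern.
    pattern = [entry["pos"] for entry in entries]
    if pattern != ["det", "adj", "noun"]:
        raise ValueError(f"no rule for pattern {pattern}")

    # Step 3: apply the target-language rule. The determiner and adjective
    # agree with the noun's gender, and the adjective follows the noun.
    det, adj, noun = entries
    gender = noun["gender"]
    return " ".join([det["es"][gender], noun["es"], adj["es"][gender]])


print(translate_noun_phrase("the red car"))    # el coche rojo
print(translate_noun_phrase("the red house"))  # la casa roja
```

A phrase such as "the car" fails with "no rule for pattern", which mirrors the limitation mentioned above: the system can only handle what its rules and dictionary cover.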
The second method is based on statistics and needs huge amounts of high-quality data. To prepare the engine adequately, MT experts need millions of words of text that exist in both the source and the target language. The machine is trained to recognize which groups of words generally correspond to each other. In this way it also learns the order in which words usually appear. A general rule of thumb for statistical machine translation is: the larger the amount of quality training data, and the more closely that data resembles the text to be translated, the better the translation results. The translation is then carried out on the basis of how frequently words, word groups and sentences correspond to each other in the training data.
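To make the idea of learning correspondences from data more concrete, here is a minimal sketch in Python. It estimates word translation probabilities from a tiny, made-up German-English corpus using the expectation-maximization procedure of IBM Model 1, a classic word alignment model. It is an assumption chosen for illustration, not a description of how any particular production engine works; real systems use far more data, whole phrases, and a model of target-language word order. The corpus and the function names are hypothetical.

```python
from collections import defaultdict

# Tiny made-up parallel corpus (German -> English), for illustration only.
CORPUS = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]


def train_word_translation(corpus, iterations=10):
    """Estimate t(target word | source word) with IBM Model 1 EM (no NULL word)."""
    pairs = [(src.split(), tgt.split()) for src, tgt in corpus]
    target_vocab = {word for _, tgt in pairs for word in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference at all

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for e in tgt:
                # Share each target word among the source words of the
                # same sentence, in proportion to the current estimates.
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    fraction = t[(e, f)] / norm
                    count[(e, f)] += fraction
                    total[f] += fraction
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t


def best_translation(t, source_word):
    """Return the most probable target word, or None if the word is unknown."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    return max(candidates, key=candidates.get) if candidates else None


table = train_word_translation(CORPUS)
for word in ["das", "haus", "buch", "ein"]:
    print(word, "->", best_translation(table, word))
```

No grammar or dictionary was supplied here: the pairing of "haus" with "house" emerges purely from which words occur together across sentence pairs. With only three sentences the estimates are fragile, which hints at why this approach depends so heavily on the amount of bilingual data.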
The key benefit of this method is that texts can be translated without theoretical knowledge of the languages involved. Moreover, the results are often very similar to natural language. The main disadvantage is the need for huge amounts of bilingual data. This is the reason why Google and Microsoft are ahead of the competition: they have gigantic, multilingual volumes of data available which can be easily prepared for machine translation. The results for certain language pairs are, therefore, often surprisingly good, especially for combinations with very large bilingual text corpora. Problems arise with language pairs for which the number of translated texts is limited, as is the case for some Slavic and Baltic languages. What’s more, these languages are among the so-called inflectional languages: their words change form according to case, number and grammatical gender. Traditionally, statistical MT has had trouble handling this. A further disadvantage of the statistical approach is that it is more difficult to intervene in a system once it has been trained and to change its results.
This is exactly where the research project QT21 comes in. This project, financed by the EU and supervised by the German Research Center for Artificial Intelligence (DFKI), aims to develop statistically based translation models for the “difficult” language groups as well as to build technology for automatic pre-processing and post-processing of data sets. By means of systematic error analyses and evaluation methods, the availability of data and the scientific knowledge on these languages will be sustainably improved. The project consortium is composed of renowned research institutes, university departments and language service providers.
text&form takes care of the evaluation and quality assessment of the translation results in this project. Language experts examine thousands of machine-translated sentences, improve them if necessary, and classify the errors according to linguistic criteria. These analyses, it is hoped, will give scientists information on where MT systems work well and where optimization is needed. Typical problems encountered with statistical machine translation can thus be zeroed in on. A further goal of this extensive post-processing is the creation of a systematically evaluated text corpus that can also be used outside the boundaries of this project for further research and development purposes.
But why is the EU committed to this research field in the first place? A barrier-free Digital Single Market in Europe is one of the 10 priorities devised by the EU. One of the main obstacles to equal Internet trade is still the language barrier. Markets such as Great Britain and France are at an obvious linguistic advantage. The key to more equal opportunities in the digital field could lie in machine translation, provided its performance is scientifically supported and it is readily available. The availability of accessible and affordable translation solutions for all European language pairs can not only open up new markets but also increase business volume on the Internet. It will grant access to key information for all Europeans, not only for speakers of the most widespread languages. Projects like QT21 make an important contribution to the harmonization of the Digital Single Market, to balanced participation in e-commerce and to the use of information services.
As Quality Manager at text&form, Katrin Marheinecke focuses on enhancing the quality of translations and continuously improving translation processes. For her, observing the latest trends in the translation landscape is both business and pleasure.