Machine Translation as Applicable to Russian Companies

April 18, 2015

Before discussing any specific MT tools, I would like to briefly describe the current state of MT technology and its general applicability to technical translation services.

Current State of MT Technology

Thanks to advances in high-speed processing of large data volumes, including statistical analysis of big data, machine translation (MT) is finally moving from the domain of linguistic research, where it has dwelled since the early days of digital computers, into the mainstream translation business.

Furthermore, MT has grown in popularity with the widespread adoption of CAT (computer-aided translation) tools based on TM (translation memory), such as DejaVu, MemoQ and SDL Trados, which integrate MT at the “pre-translate” stage through a connection to a web-based engine.

Almost all actively supported MT engines are now available online, many of them for free (at least, for trial purposes).

Statistical and Rule-Based MT Engines

In general, there are two types of MT engines – statistical (SMT) and rule-based (RBMT).

The rule-based approach to MT, also known as “classical”, is based on processing linguistic information about the source and target languages retrieved from dictionaries and grammars. The principal assumption here is that translation can be reduced to a set of rules applied in the morphological, syntactic and semantic analysis of both the source and the target language involved in a given translation task.

The best-known rule-based systems that support Russian are Systran and PROMT; the latter is developed by a Russian company with over 15 years of experience in the industry and is well suited to translation from and into Russian (see online-translator.com).

Rule-based systems provide full control over translations, as the rules are programmed manually and can be adjusted when debugging. They offer good reusability, are domain-independent and flexible in use.

However, they have several shortcomings. Building new dictionaries and rule sets is extremely labor-intensive and, hence, expensive. Any natural language is intrinsically ambiguous, so hand-written rules struggle with idiomatic expressions, contextual meanings of words and phrases, and stylistic devices such as metaphor or irony.
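To make the rule-based idea concrete, here is a minimal sketch in Python. It is an illustration only and not how Systran or PROMT are implemented: the dictionaries, the single agreement rule and the function name translate_noun_phrase are all invented for this example. It translates two-word English noun phrases into Russian by looking up each word and choosing the adjective ending that agrees with the gender of the noun.

```python
# Toy bilingual dictionary: English noun -> (Russian noun, grammatical gender).
# All entries are illustrative assumptions, not taken from any real MT system.
NOUNS = {
    "pump": ("насос", "m"),
    "pipe": ("труба", "f"),
    "equipment": ("оборудование", "n"),
}

# English adjective -> Russian adjective stem (hard-stem adjectives only).
ADJECTIVE_STEMS = {"new": "нов", "old": "стар"}

# Morphological rule: nominative singular ending selected by noun gender.
ENDINGS = {"m": "ый", "f": "ая", "n": "ое"}


def translate_noun_phrase(phrase: str) -> str:
    """Translate an 'adjective noun' phrase using dictionary lookup plus one rule."""
    words = phrase.lower().split()
    if len(words) != 2:
        raise ValueError("expected exactly two words: adjective and noun")
    adjective, noun = words
    if adjective not in ADJECTIVE_STEMS or noun not in NOUNS:
        # A rule-based system cannot translate what its dictionaries do not cover.
        raise ValueError(f"no dictionary entry for: {phrase!r}")
    russian_noun, gender = NOUNS[noun]
    stem = ADJECTIVE_STEMS[adjective]
    return f"{stem}{ENDINGS[gender]} {russian_noun}"
```

Calling translate_noun_phrase("old pipe") applies the feminine ending and returns "старая труба". The sketch also shows where the cost comes from: every new word, case, number or exception needs another hand-written dictionary entry or rule.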

In practice, owing to advances in statistical MT (described below), all current RBMT engines are in fact hybrid SMT/RBMT systems.

In statistical MT engines, translations are generated from statistical models built by analyzing text corpora. The best-known systems in use today, such as Google Translate and Microsoft Translator (part of the Bing service), are statistical.

Such systems have several advantages over RBMT. They are much easier to train because no rules have to be written by hand: the translation model is derived automatically from the available input (parallel text corpora). The system is not tailored to a specific language: any language can be translated into any other, provided that parallel texts are available. Translations are also less literal and more natural.

However, the translated output requires considerable post-editing, as statistical matching can often result in nonsensical texts and obvious errors, especially for inflected languages like Russian. In addition, translation between languages with different word orders can be problematic.
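As an illustration of how a statistical engine can learn from parallel texts without hand-written rules, the sketch below trains IBM Model 1, a classic word-alignment model, with the EM algorithm. This is a swapped-in textbook technique, not the method used by Google Translate or Microsoft Translator, and it omits refinements such as the NULL word, phrase tables and language models. The three-sentence corpus and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Invented toy parallel corpus (English source, Russian target).
TOY_CORPUS = [
    ("big house", "большой дом"),
    ("big table", "большой стол"),
    ("old table", "старый стол"),
]


def train_ibm_model1(pairs, iterations=10):
    """Estimate word-translation probabilities t(target | source) with EM."""
    corpus = [(src.split(), tgt.split()) for src, tgt in pairs]
    target_vocab = {word for _, tgt in corpus for word in tgt}
    uniform = 1.0 / len(target_vocab)
    # t[(target_word, source_word)] starts out uniform for every co-occurring pair.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in corpus:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] += fraction
        # M-step: re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)
```

After training on TOY_CORPUS, best_translation(model, "house") picks "дом" even though "house" co-occurs equally often with "большой" and "дом": the model learns that "большой" is better explained by "big". Nothing in the code is specific to English or Russian, which is the language independence described above. It also hints at the weakness with inflected languages: every inflected form ("большой", "большая", "большое") is a separate word that needs its own evidence in the corpus.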

Non-disclosure Considerations

In an enterprise setting, there is another important issue to consider. Statistical translation systems often use cloud services that, in theory, make translated texts available to a third party (the engine supplier, such as Google or Microsoft) for data mining. This may be undesirable when handling sensitive information or working under an NDA with a client.

To avoid this situation, a company can purchase an SMT engine and train it on its own texts in a contained environment, or integrate such an engine into proprietary software. Most statistical translation technologies are open source and publicly available, so a proprietary system of this kind can be implemented by an expert in natural language processing, as was done at eBay, among others.

Another option is to triage texts for translation and select those that have to be translated manually, with the remainder being machine-translated and post-edited.
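The triage step can be sketched as a simple routing function. This is a hedged illustration under stated assumptions: the sensitivity markers, the Document fields and the route names are hypothetical, and a real policy would be defined by the company and its client agreements rather than by keyword matching alone.

```python
import re
from dataclasses import dataclass

# Hypothetical markers of sensitive content; a real list would come from
# company policy and the terms of client NDAs.
SENSITIVE_PATTERNS = [
    re.compile(r"\bconfidential\b", re.IGNORECASE),
    re.compile(r"\bNDA\b"),
    re.compile(r"конфиденциально", re.IGNORECASE),
]


@dataclass
class Document:
    name: str
    text: str
    client_under_nda: bool = False  # assumed to be set by a project manager


def route(doc: Document) -> str:
    """Return 'manual' for sensitive documents, 'mt_post_edit' for the rest."""
    if doc.client_under_nda:
        return "manual"
    if any(pattern.search(doc.text) for pattern in SENSITIVE_PATTERNS):
        return "manual"
    return "mt_post_edit"
```

For example, route(Document("manual.txt", "Pump maintenance schedule")) returns "mt_post_edit", while any document from a client flagged as being under an NDA is routed to "manual" regardless of its text. Erring on the side of manual translation is the safer default for such a filter.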

Implementation Considerations

Based on the above, two general approaches can be used – short term and medium/long term, depending on the sensitivity of the documents to be translated and the availability of natural language processing talent.

First of all, if technical translation is a company’s main translation need, it is one of the text types best suited to MT, owing to its strict adherence to terminology and its avoidance of stylistic devices and ambiguity. Hence, MT can greatly increase the productivity of translators, who can post-edit much larger volumes of text than they can translate from scratch, provided that they are willing to master this new approach and that the necessary tools are available.

In the short term, it is possible to implement a translation memory tool with access to web-based machine translation, such as MemoQ or Trados. These tools provide a repository of previously translated, automatically reusable texts and can call web services (APIs, application programming interfaces) to pre-translate texts using either Microsoft Translator (Bing) or Google Translate.

NB: Owing to the availability of open-source statistical engines, dozens of SMT services are now available and may be tested on a company’s texts. However, it is advisable to look first at the two industry leaders mentioned above, as they possess the most extensive text corpora, which greatly improve the results of statistical machine translation.

Translators must be aware, however, that highly sensitive documents must not be machine-translated via web services; these have to be translated manually, though still within the CAT tool environment.
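The pre-translation workflow described above can be sketched as follows. This is not the logic of MemoQ or Trados: the similarity measure (Python’s difflib), the 0.75 threshold, the function names and the mt_engine callback are assumptions chosen for illustration. The callback stands in for a web MT service and is simply left out for sensitive documents, so that nothing is sent to a third party.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional, Tuple


def best_tm_match(segment: str, memory: Dict[str, str]) -> Tuple[float, Optional[str]]:
    """Return (similarity, stored translation) for the closest TM entry."""
    best_score, best_target = 0.0, None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best_score, best_target = score, target
    return best_score, best_target


def pre_translate(
    segment: str,
    memory: Dict[str, str],
    mt_engine: Optional[Callable[[str], str]] = None,
    threshold: float = 0.75,
) -> Tuple[str, str]:
    """Return (draft translation, origin) for one source segment.

    Origin is 'tm_exact', 'tm_fuzzy', 'mt' or 'manual'.
    """
    score, target = best_tm_match(segment, memory)
    if target is not None and score >= threshold:
        return target, ("tm_exact" if score == 1.0 else "tm_fuzzy")
    if mt_engine is not None:
        # Hypothetical callback wrapping a web MT service.
        return mt_engine(segment), "mt"
    # No usable TM match and MT not allowed: leave the segment to the translator.
    return "", "manual"
```

With memory = {"Close the valve.": "Закройте клапан."}, an identical segment is returned as "tm_exact", "Close the valves." is offered as a "tm_fuzzy" match for editing, and an unrelated segment goes to the MT callback if one is supplied or is marked "manual" otherwise. Tagging each draft with its origin lets the post-editor know how much scrutiny it needs.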

As an alternative, the PROMT offline translation suite may be purchased (an enterprise LAN-based version with client-server architecture is available). It is an example of the combined rule-based and statistical approach, which is best adapted for translation into Russian. The system also offers translation memory functions.

NB: Owing to the inflectional nature of Russian, some experts consider rule-based tools such as Systran and PROMT to be better-suited niche solutions, especially when combined with statistical models in a hybrid system, as is the case with PROMT.

In the medium to long term, an open-source statistical engine such as Moses SMT may be implemented and trained on a company text corpus. Ultimately, this solution may not only provide the best-quality translations but also become the basis for a proprietary MT and TM tool that other companies working in the same field could purchase from the company.
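Training a statistical engine on company texts presupposes a clean, sentence-aligned parallel corpus. The sketch below shows one common preparation step: dropping sentence pairs that are empty, too long or suspiciously mismatched in length. It is a standalone Python illustration and not part of Moses; the function name and the default limits are assumptions that would need tuning for a real English-Russian corpus.

```python
from typing import Iterable, List, Tuple


def clean_parallel_corpus(
    pairs: Iterable[Tuple[str, str]],
    max_tokens: int = 80,
    max_ratio: float = 3.0,
) -> List[Tuple[str, str]]:
    """Keep only sentence pairs that look like plausible translations.

    A pair is dropped if either side is empty, either side exceeds
    max_tokens, or one side is more than max_ratio times longer than
    the other (a typical sign of misalignment).
    """
    cleaned = []
    for source, target in pairs:
        src_tokens, tgt_tokens = source.split(), target.split()
        n_src, n_tgt = len(src_tokens), len(tgt_tokens)
        if n_src == 0 or n_tgt == 0:
            continue  # empty side: nothing to learn from
        if n_src > max_tokens or n_tgt > max_tokens:
            continue  # very long sentences slow down alignment
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths too different: probably misaligned
        # Normalise whitespace so every token is separated by one space.
        cleaned.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return cleaned
```

In practice the input pairs could be exported from the company’s existing translation memories, which already store aligned source and target segments.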

Conclusion

Implementing machine translation tools is recommended for companies whose translations are primarily technical, given the current availability of and advances in machine translation technology. Statistical and/or hybrid MT tools, preferably integrated into translation memory tools (MemoQ, Trados), should be considered and tested, with the most popular web-based engines (Google Translate and Microsoft Translator) being the first choice for the MT part of the solution, subject to the sensitivity of the translated information. At the same time, PROMT Translation Suite may be considered a viable alternative that can work offline.

In the medium to long term, Moses SMT or any other open-source statistical engine may be used as a basis for a proprietary machine translation tool to be deployed by a company and possibly supplied to other enterprises operating in the same or similar industries.