Whether you’re logging on from the US, Brazil, Borneo, or France, Facebook can translate virtually any written content published on its platform into the local language using automated machine translation. In fact, Facebook provides around 20 billion translations every day for its News Feed alone. However, these systems typically use English as an intermediary step. That is, translating from Chinese to French actually goes from Chinese to English to French. This is done because data sets of translations to and from English are massive and widely available, but putting English in the middle reduces overall translation accuracy while making the entire process more complex and cumbersome than it needs to be. That’s why Facebook AI has developed a new machine translation (MT) model that can translate directly between two languages in both directions (Chinese to French and French to Chinese) without ever using English as a crutch, and which outperforms the English-centric model by 10 points on the BLEU metric.
“The major challenge is really, how do we take the translation systems we have, and then actually meet the demand of people around the world,” Angela Fan, a research associate at Facebook AI, told Engadget. “So you are translating into all of the languages and across all of the directions that people actually want. For example, there’s plenty of regions in the world where people speak multiple languages, none of which are English, but the existing translation systems rely heavily on English-only data.” Of the billions of posts published daily in 160 languages on Facebook’s platform, two-thirds are in a language other than English, she noted.
Facebook claims that the new model, dubbed M2M-100, is the first multilingual machine translation (MMT) model that can directly translate back and forth between any pair out of a set of 100 languages. In all, FBAI has constructed an enormous data set consisting of 7.5 billion sentences for 100 languages. Using that, the research team trained a universal translation model with more than 15 billion parameters “that captures information from related languages and reflects a more diverse script of languages and morphology,” according to a Facebook blog post Monday.
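For a sense of scale, "any pair out of a set of 100 languages" can be turned into a count of translation directions. The short Python sketch below is plain arithmetic rather than a figure published by Facebook, and the four-language list is only a hypothetical sample used to sanity-check the formula.

```python
from itertools import permutations


def count_directions(num_languages: int) -> int:
    """Number of ordered (source, target) pairs where source != target."""
    return num_languages * (num_languages - 1)


# Cross-check the formula by enumerating pairs for a small, hypothetical set.
sample_langs = ["zh", "fr", "en", "ta"]
sample_pairs = list(permutations(sample_langs, 2))
assert len(sample_pairs) == count_directions(len(sample_langs))

# Every direction among 100 languages.
directions_100 = count_directions(100)

# Directions that have English on one side (into English plus out of English).
english_centric = 2 * (100 - 1)

print(directions_100, english_centric)
```

Under that counting, the overwhelming majority of possible directions never touch English, which is the gap a many-to-many model is meant to cover.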
To do this, Facebook had to collect a whole slew of publicly available data from around the world using a variety of novel techniques. “A lot of this is really building upon work that we’ve done for many years at research at Facebook, which are like all of the different Lego pieces that we kind of put together to build the system today,” Fan explained.
To start, the team employed CommonCrawl, which maintains an open repository of web crawl data, to collect text examples from around the web. Then they set about identifying the language each text is written in using FastText, a text classification system Facebook developed and open sourced a few years back. “It basically looks at some text and it tries to decide what language it’s written in,” Fan said. “So we partition a bunch of texts from the web into all of these different languages and then our goal is to identify sentences that would be translation.”
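The partitioning step Fan describes can be illustrated with a toy example. The sketch below does not use FastText or its API; it swaps in a deliberately crude stopword-overlap heuristic, and the stopword lists, function names, and sample sentences are all hypothetical. The point is only the shape of the step: label each crawled sentence, then group by label.

```python
from collections import defaultdict

# Tiny hand-picked stopword lists: a toy stand-in for a trained
# language-identification model (this is NOT how FastText works internally).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "la", "et", "est", "les"},
    "es": {"el", "los", "y", "es", "las"},
}


def guess_language(sentence: str) -> str:
    """Return the language whose stopwords overlap most with the sentence."""
    tokens = set(sentence.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # No overlap at all means we have no evidence for any language.
    return best if scores[best] > 0 else "unknown"


def partition_by_language(sentences):
    """Group sentences into buckets keyed by their guessed language."""
    buckets = defaultdict(list)
    for sentence in sentences:
        buckets[guess_language(sentence)].append(sentence)
    return dict(buckets)


crawl = [
    "the cat is on the roof",
    "le chat est sur le toit",
    "el gato es negro y los perros no",
    "12345 67890",
]
buckets = partition_by_language(crawl)
for lang, texts in buckets.items():
    print(lang, texts)
```

A real classifier would be trained on far more data and handle short or mixed-language text much better; the grouping logic around it would look broadly similar.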
“Traditionally, people use human translators to create translation data,” she continued. “This is difficult at scale because it’s hard, for example, to find someone who speaks English and Tamil, but it’s even harder to find someone who speaks French and Tamil together, because non-English translation is still an area that needs improvement.”
To mine that necessary data at scale, Fan’s team relied heavily on the LASER system. “It reads sentences, takes the text and creates a mathematical representation of that text, such that sentences that have the same meaning map to the same thought,” she said. “So if I have one sentence in Chinese and French, and they’re saying the same thing, they will kind of overlap like a Venn diagram. The overlapping area is the kind of text that we think are aligned sentences.”
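The idea of sentences with the same meaning landing close together can be sketched in a few lines of Python. This is not LASER itself: the three-dimensional vectors are made up by hand, the function names are hypothetical, and a plain cosine-similarity threshold stands in for whatever scoring the production mining pipeline uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(src, tgt, threshold=0.9):
    """For each source sentence, keep its nearest target if similar enough."""
    pairs = []
    for s_text, s_vec in src.items():
        best_text, best_score = None, -1.0
        for t_text, t_vec in tgt.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_score >= threshold:
            pairs.append((s_text, best_text, best_score))
    return pairs


# Hypothetical embeddings; a real multilingual encoder would produce
# much higher-dimensional vectors from the sentence text itself.
zh = {
    "我喜欢猫": [0.9, 0.1, 0.0],      # "I like cats"
    "今天下雨": [0.0, 0.2, 0.9],      # "It is raining today"
}
fr = {
    "J'aime les chats": [0.88, 0.15, 0.02],
    "Le train est en retard": [0.1, 0.95, 0.1],
}

mined = mine_pairs(zh, fr)
for zh_text, fr_text, score in mined:
    print(zh_text, "<->", fr_text, f"{score:.3f}")
```

Here only the two "cats" sentences clear the threshold, so they are kept as a candidate Chinese-French training pair, while the rain sentence finds no close match and is dropped.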
Of course, not all languages have a large amount of written content available on the internet. In those situations, Fan’s team turned to monolingual data, which is just data written in a single language. Using the Chinese to French example, Fan explained, “So if my goal is to translate from Chinese to French, but for some reason, I don’t get good quality, then I’m going to try and improve this by taking texts monolingual data in French. And what I do is train a reverse of the system: I go from French to Chinese. I take all of my French, for example, from Wikipedia, and I translate it into Chinese.”
Doing so produces a slew of machine-generated “synthetic” data, Fan continued. “So I’ve created this synthetic Chinese based on my back-translated French, then I’m going to add it again to the forward model. So instead of going from Chinese to French, I have Chinese plus my supplemented synthetic Chinese, all going into French. And because this adds a bunch of new examples on both the input side and the output side, the model will be much stronger.”
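The back-translation recipe Fan outlines can be sketched as a small data-augmentation routine. Everything here is illustrative: the function names are hypothetical, and a lookup table stands in for a trained French-to-Chinese model, which in practice would be a full neural translation system.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (Chinese source sentence, French target sentence)


def back_translate(
    mono_target: List[str],
    reverse_model: Callable[[str], str],
) -> List[Pair]:
    """Pair each real French sentence with machine-generated Chinese."""
    return [(reverse_model(sentence), sentence) for sentence in mono_target]


def augment(
    real_pairs: List[Pair],
    mono_target: List[str],
    reverse_model: Callable[[str], str],
) -> List[Pair]:
    """Combine human-aligned pairs with synthetic back-translated pairs."""
    return real_pairs + back_translate(mono_target, reverse_model)


# Hypothetical stand-in for a trained French-to-Chinese model.
TOY_FR_TO_ZH = {"Bonjour": "你好", "Merci": "谢谢"}


def toy_fr_to_zh(sentence: str) -> str:
    return TOY_FR_TO_ZH.get(sentence, "<unk>")


real = [("我喜欢猫", "J'aime les chats")]
french_mono = ["Bonjour", "Merci"]

training_set = augment(real, french_mono, toy_fr_to_zh)
for zh_text, fr_text in training_set:
    print(zh_text, "->", fr_text)
```

Note that the French side of every synthetic pair is genuine human-written text; only the Chinese input is machine-generated, so the forward Chinese-to-French model still learns to produce real French.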
Whether this will lead to a digital Babel Fish capable of losslessly translating between the world’s 6,200-odd spoken languages remains to be seen. Fan notes that the ultimate success of this project depends on the amount of resources the AI can leverage. For major languages like French, Chinese, German, Spanish, and Hindi, those resources are vast. “People write tons of text on the web in these languages,” Fan noted. “They were really able to help a lot of data, and our models can use this data to get better.”
“I personally identify a lot of areas that we might need improvement in for the very low resource languages,” she continued. “For African languages, we’re pretty good at Swahili and Afrikaans; we could use a lot of improvement on languages like Zulu, and these languages have additional research challenges that we need to confront.”
Facebook is releasing the data set, model, and training and evaluation setups as open source to the research community to help spur further advancements. The company also plans to continue developing the system independently and eventually to work the technology into its daily operations.