
Facebook Is Developing Single Encoder-Based Technology to Translate 93 Languages

Bipasha Mandal
Bipasha Mandal is a writer at TechGenyz.


Facebook researchers recently published a paper, building on Schwenk (2018a), that proposes an architecture for learning joint multilingual sentence representations for 93 languages using a single BiLSTM encoder and a BPE vocabulary shared across all languages. There has been other research in this area, but its performance has been limited, largely because each language is handled by a separate model, which prevents any cross-lingual transfer between languages.
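To illustrate the shared byte-pair encoding idea, here is a minimal toy sketch of how BPE learns merges from a corpus. The corpus, frequencies, and three-merge loop are invented for illustration; the paper's actual vocabulary is learned jointly over the concatenation of all 93 languages.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across all words and return the most frequent."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Merge every occurrence of the pair into a single new symbol."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in corpus.items()}

# Toy corpus: words split into characters, with made-up frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):  # learn three merge operations
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)  # frequent character sequences ("es", "est", "lo") become single symbols
```

Because the merges are learned over all languages at once, character sequences shared across related languages end up as common subword units, which is what lets a single encoder process all of them.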

Facebook researchers are interested in sentence-vector representations that are universal with respect to both the input language and the NLP task. The aim of this research is to help languages with limited resources, to achieve zero-shot transfer of NLP models, and to handle code-switching. What sets this work apart is that it studies joint sentence representations across 93 different languages, whereas most comparable NLP research focuses on two languages at most.

Figure 1: 75 out of 93 languages used to train the proposed model

The study covers 93 languages spanning 34 language families and 28 different writing systems. The model is evaluated on zero-shot cross-lingual natural language inference (the XNLI dataset), document classification (MLDoc), bitext mining (BUCC), and multilingual similarity search (Tatoeba). The new test set derived from the Tatoeba corpus provides baseline results for 122 languages.
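The similarity-search evaluation can be sketched as a nearest-neighbor lookup by cosine similarity in the shared embedding space. The 4-dimensional vectors below are made up for illustration; real LASER-style embeddings are much higher-dimensional and produced by the trained encoder.

```python
import numpy as np

def nearest_neighbors(queries, candidates):
    """For each query embedding, return the index of the candidate
    embedding with the highest cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return (q @ c.T).argmax(axis=1)

# Toy embeddings: in a well-trained joint space, a sentence and its
# translation should lie close together regardless of language.
english = np.array([[0.9, 0.1, 0.0, 0.1],
                    [0.0, 0.8, 0.2, 0.1]])
french  = np.array([[0.1, 0.9, 0.1, 0.0],   # translation of english[1]
                    [0.8, 0.2, 0.1, 0.0]])  # translation of english[0]
print(nearest_neighbors(english, french))  # → [1 0]
```

Bitext mining on BUCC works on the same principle: candidate translation pairs are the mutual nearest neighbors across the two languages.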

The system works in an encoder-decoder fashion. Once a sentence is embedded, the embedding is linearly transformed to initialize the LSTM decoder. There is a single encoder and a single decoder shared by all languages, and the researchers use a joint byte-pair encoding vocabulary, which encourages the encoder to learn language-independent representations. The encoder uses between one and five BiLSTM layers, each limited to 512 dimensions per direction. The decoder, which has a 2048-dimensional layer, is told which language to generate via a language ID.
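The data flow from encoder outputs to decoder initialization can be sketched with plain NumPy. This is a shape-level illustration only: the weights are random stand-ins, and the max-pooling step over the BiLSTM outputs reflects how the paper collapses a variable-length sentence into one fixed-size vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs: a BiLSTM with 512 units per direction
# yields one 1024-dimensional vector per token (forward + backward states).
seq_len, enc_dim, dec_dim = 7, 1024, 2048
bilstm_outputs = rng.standard_normal((seq_len, enc_dim))

# Max-pooling over the time axis gives a fixed-size sentence embedding,
# independent of sentence length.
sentence_embedding = bilstm_outputs.max(axis=0)   # shape (1024,)

# A learned linear transform (random here, for illustration) maps the
# embedding to the decoder's 2048-dimensional initial state.
W = rng.standard_normal((dec_dim, enc_dim)) * 0.01
decoder_init = W @ sentence_embedding             # shape (2048,)

print(sentence_embedding.shape, decoder_init.shape)
```

Because the decoder sees the sentence only through this single vector (plus the language ID), the encoder is pushed to pack everything needed for generation into a language-neutral embedding.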

Figure 2: Architecture of the system for learning multilingual sentence embeddings

The Moses statistical machine translation toolkit is used for pre-processing, except for Chinese and Japanese texts, which are segmented with Jieba and MeCab respectively.
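A per-language pre-processing pipeline like this amounts to dispatching on the language code. The sketch below uses simplified stand-in tokenizers (real Moses scripts, Jieba, and MeCab are far more sophisticated); only the dispatch structure is the point.

```python
def moses_style_tokenize(text):
    """Stand-in for Moses tokenization: lowercase and split on whitespace.
    (The real Moses scripts also handle punctuation, escaping, etc.)"""
    return text.lower().split()

def char_segment(text):
    """Stand-in for Jieba/MeCab segmentation: split into characters.
    (Real segmenters use dictionaries and statistical models.)"""
    return [ch for ch in text if not ch.isspace()]

# Hypothetical dispatch table: most languages go through Moses-style
# pre-processing; Chinese and Japanese need dedicated word segmenters
# because their scripts do not mark word boundaries with spaces.
TOKENIZERS = {"zh": char_segment, "ja": char_segment}

def preprocess(text, lang):
    return TOKENIZERS.get(lang, moses_style_tokenize)(text)

print(preprocess("Hello World", "en"))  # → ['hello', 'world']
print(preprocess("你好世界", "zh"))       # → ['你', '好', '世', '界']
```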
