Interpreting Hinglish Conversations

Sayan Biswas
inspiringbrilliance
10 min read · Oct 12, 2020



Recently, we came across an interesting use case for interpreting conversational data in Hinglish. We were given sample chat transcripts between two transacting parties, and we had to extract insights from these conversations, e.g. the items under discussion, the delivery address, the prices of the items and so on. Now, if the language were pure English or pure Hindi, many of these would be solved problems, but for a mixed language like Hinglish, things are not so trivial.

So, what is Hinglish? Hinglish is a common tongue found in casual conversations where Hindi and English phrases are combined in the same context. In NLP parlance, this is called code-mixing. An example would be: jaldi karo guys, or we’ll be late for the movie, which means: hurry up guys, or we’ll be late for the movie. The first clause (jaldi karo guys) is in Hindi, while the rest is in English. Code-mixing is very common in multilingual societies, especially in India, and it is steadily becoming the norm thanks to social media. Unfortunately, handling a code-mixed language is usually harder than handling a pure language, because the former is a melting pot of vocabularies, sentence structures, grammatical rules etc. borrowed from multiple languages. Also, because of the casual nature of the conversations, correct spelling is not always the highest priority in code-mixed languages like Hinglish, and that calls for special handling, such as spelling normalization, when building a language model on top of it.

So why not just call the Google Translation API from our application and be done with it? In this case that was not possible, because some organizations are conservative about calling third-party services, especially from secure applications that would then need to be exposed to the internet. Also, in our experience Google Translate is not always accurate on a code-mixed corpus; small spelling variations can easily throw it off. So, that’s not a foolproof solution.

Okay, now that we have some idea of what we’re up against, let me walk you through our journey, show you where we failed, and share what we learned from those mistakes.

Approach 1: Translate Hinglish to Hindi

Almost all the core problems that needed solving could be broken down into sub-problems such as classification, Named Entity Recognition (NER), Co-reference Resolution and so on. The catch was that all of these models were trained on English, and very few were available for Indic languages like Hindi. Naturally, our first intuition was to convert the Hinglish sentences to Hindi, translate the Hindi to English, and perform the rest of the analysis in English. If we could pull off the transliteration of Hinglish to the Devanagari script, the rest would be comparatively straightforward: we could leverage existing Hindi-to-English Neural Machine Translation (NMT) modules. The logical flow would’ve looked something like this: Hinglish text → transliteration to Devanagari (Hindi) → Hindi-to-English NMT → downstream analysis in English.

One such NMT library that we tried out was MarianNMT, developed by the Microsoft Translator team. We even saw amazing results when we fine-tuned the model on our Hindi-English parallel dataset. Unfortunately, the Hinglish-to-Hindi transliteration phase gave us quite a lot of trouble, and eventually we had to take a different route; more on that later.
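
For the Hindi-to-English leg, a Marian-family model is easy to try out through the Hugging Face transformers library. The sketch below is illustrative only: it loads the publicly available Helsinki-NLP/opus-mt-hi-en checkpoint as a stand-in, not our fine-tuned model.

from transformers import MarianMTModel, MarianTokenizer

# Public Marian checkpoint for Hindi -> English; a stand-in for our
# fine-tuned MarianNMT model, which isn't publicly released.
model_name = "Helsinki-NLP/opus-mt-hi-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["वह मेरा इंतज़ार कर रही है"], return_tensors="pt", padding=True)
output_ids = model.generate(**batch)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# expected: something like "she is waiting for me"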

The major challenge we faced was catching the spelling nuances of Hinglish. A single vowel difference in a Hindi word can mean something totally different. E.g. the two words kahane and kahana are quite close in meaning, but they are used in totally different contexts (kahane ke lie kuch baki nahin hai, “there is nothing left to say”, vs. kahana kya chahte ho?, “what do you want to say?”). To make things even more complicated, a single word can be written in multiple ways. E.g. the word janamdin means birthday, but people might write it as jamndin, janmadin etc. Handling such variations is tricky even with good spell checkers such as Hunspell.

Approach 2: Treat Hinglish as a standalone language

When the first approach didn’t work so well, we decided to take an alternate route: treat Hinglish as a standalone language. With that, we could train a transformer-based NMT model with Hinglish as the input and English as the output, completely bypassing the conversion from Hinglish to Hindi. Again, the flow would look something like this: Hinglish text → Hinglish-to-English NMT → downstream analysis in English.

Although this approach has less overhead than the previous one, the real problem is the lack of a Hinglish-English parallel corpus in the public domain. Even after spending a considerable amount of time searching the internet, we could barely find any sizable clean corpus. To work around that, we got hold of some Hindi-English parallel corpora, which are fortunately less scarce in the public domain. We then ran a small Python script that called the Google Translate API on the Hindi sentences in Devanagari script and extracted their romanized versions.
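
We haven’t shown our exact script here. As an offline alternative, rule-based transliteration libraries can also do the Devanagari-to-roman conversion; below is a minimal sketch using the indic_transliteration package. The ITRANS scheme is an illustrative choice, and rule-based schemes won’t reproduce Google’s phonetic spellings exactly.

from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

# Two sample Devanagari sentences to romanize.
hindi_sentences = ["वह मेरा इंतज़ार कर रही है", "मुझे काम करना होगा"]

for sent in hindi_sentences:
    # DEVANAGARI -> ITRANS gives a deterministic roman form
    # (its conventions differ from Google's phonetic output).
    print(transliterate(sent, sanscript.DEVANAGARI, sanscript.ITRANS).lower())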

We gathered about 1.5 lakh (150,000) parallel sentences in this process. But then a different issue came to our notice: many of the romanized Hindi words contained redundant vowels. E.g. the word usako (him/her) contains an ‘a’ which is redundant. Similarly, words like usane (usne), jabaki (jabki) and badalate (badalte) pick up extra vowels when transliterated from the Devanagari script, and in actual conversation people won’t use these extra vowels. Also, proper nouns like names of cities and countries, which don’t always have a Hindi version, get phonetically transliterated with incorrect spellings: Germany becomes jarmanee, Spain becomes spen, police station becomes pulis steshan and so on. All these mistakes need correction, and that’s a time-consuming process. So, we’re in the process of crowd-sourcing it within our organization, and hopefully we’ll have a clean dataset soon enough. We’ll most likely make the dataset publicly available through Kaggle, so that anybody can use it to train their own NMT models, or for any other relevant NLP task.

What we’ve got so far

While the cleaned dataset is on its way, we managed to run some experiments to validate our hypothesis on the messy vanilla dataset. We took a subset of 100k sentence pairs from the full Hinglish-English parallel corpus and trained an attention-based NMT model using TensorFlow. The original notebook can be found on the TensorFlow GitHub page; we made some minor changes to it to make Hinglish our input language instead of Spanish. Training the model for 50 epochs took around 3.5 hours on a p3.2xlarge node on AWS. The results are quite far from satisfactory, but at least they show that we’re moving in the right direction. Let’s look at some of the predictions made by our model.

Input: <start> fon bohot baar baja . <end>
Predicted translation: phone times . <end>
Input: <start> vo mera intazaar kar rahee hai <end>
Predicted translation: she s going to go to my life . <end>
Input: <start> yahaan par ek hotal hai <end>
Predicted translation: there is a hotel . <end>
Input: <start> mujhe kaam karna hoga <end>
Predicted translation: i care for work <end>
Input: <start> pulis ne us ko pakad liya . <end>
Predicted translation: the police agreed . <end>
Input: <start> ye film sahee hai ! <end>
Predicted translation: this is easy to the movie ! <end>

Usually an NMT model needs to be trained on millions of parallel sentences to achieve human-level performance, so this was expected. On the bright side, we think the base model will perform much better once we get the cleaned-up data. We had also trained the same model on 25k and 50k subsets, and what we saw was reassuring: translation quality improved quite a bit with increased variation and volume of the input sentences.
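
For reference, here is a condensed sketch of the encoder-attention-decoder architecture from the TensorFlow tutorial notebook we adapted. The vocabulary sizes and layer widths below are illustrative, not the exact values from our run.

import tensorflow as tf

# Illustrative hyperparameters.
VOCAB_INP, VOCAB_TGT = 20000, 15000
EMBED_DIM, UNITS = 256, 1024

class Encoder(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(VOCAB_INP, EMBED_DIM)
        self.gru = tf.keras.layers.GRU(UNITS, return_sequences=True,
                                       return_state=True)

    def call(self, x):
        # x: (batch, src_len) -> outputs (batch, src_len, UNITS), final state.
        return self.gru(self.embedding(x))

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(UNITS)
        self.W2 = tf.keras.layers.Dense(UNITS)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # Score each encoder position against the current decoder state.
        score = self.V(tf.nn.tanh(
            self.W1(tf.expand_dims(query, 1)) + self.W2(values)))
        weights = tf.nn.softmax(score, axis=1)             # (batch, src_len, 1)
        context = tf.reduce_sum(weights * values, axis=1)  # (batch, UNITS)
        return context, weights

class Decoder(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(VOCAB_TGT, EMBED_DIM)
        self.gru = tf.keras.layers.GRU(UNITS, return_sequences=True,
                                       return_state=True)
        self.fc = tf.keras.layers.Dense(VOCAB_TGT)
        self.attention = BahdanauAttention()

    def call(self, x, hidden, enc_output):
        # One step: attend over the source, then predict the next token.
        context, weights = self.attention(hidden, enc_output)
        x = self.embedding(x)  # (batch, 1, EMBED_DIM)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x)
        logits = self.fc(tf.reshape(output, (-1, output.shape[2])))
        return logits, state, weights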

Data augmentation

Once we get the cleaned-up data, it’ll have to be augmented to incorporate more variation in the sentence structures, and the resulting increase in training data volume should lead to a better translation model. One possible way to achieve the augmentation is synthetic text generation with GPT-2, a natural language generation model. So we went ahead and tried that in a Google Colab notebook with a Tesla T4 GPU, using the ready-made GPT-2 implementation from Hugging Face. We selected 75k Hinglish sentences, each between 3 and 20 words long; training the model for 3 epochs took around 10 minutes. Below are some examples generated by the model, none of which were part of the training dataset:

Seed: usake jaise hee
Generated text: usake jaise hee nazar hain.
Seed: kal yahaan
Generated text: kal yahaan banaane ke lie kahate hain.
Seed: aapko kaise
Generated text: aapko kaise jaana jaata hai?
Seed: mujhe bhookh
Generated text: mujhe bhookh ho gaya.

While the generated texts are not perfect, they’re not bad either; in the context of casual conversation, I think they’re quite passable. Hopefully, this model will also get better with increased volume and variation in the training data.
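
For anyone who wants to try this, here is a minimal sketch of the fine-tuning setup with the transformers library. The file path and hyperparameters are illustrative; we assume one Hinglish sentence per line in a plain-text file.

from transformers import (GPT2LMHeadModel, GPT2Tokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# hinglish_sentences.txt is a hypothetical path: one sentence per line.
dataset = TextDataset(tokenizer=tokenizer,
                      file_path="hinglish_sentences.txt", block_size=64)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-hinglish",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()

# Generate a continuation from a seed phrase.
input_ids = tokenizer.encode("mujhe bhookh", return_tensors="pt")
output_ids = model.generate(input_ids, max_length=20,
                            do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))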

Spelling Normalization

We briefly touched upon spelling normalization at the beginning of this article; let’s dig a little deeper. In the context of this problem, we’re more interested in phonetic similarity than edit-distance-based similarity. E.g. if an incoming word is spelled apake, we need to change it to apke; similarly, steshan needs to change to station and so on. To solve this problem, we used the pyphonetics Python library. It implements quite a few well-established algorithms, such as Soundex, Metaphone and the Matching Rating Approach, that are useful for detecting phonetic similarity.

The algorithm we designed is pretty simple. First, we create a dictionary from the words in our training corpus: keys are phonetic codes, and values are sets of words that share the same phonetic code. A sample dictionary built with the FuzzySoundex algorithm looks something like the following:

'Z537': {'zindagee'},
'J4': {'jal', 'jel', 'jheel', 'jhela'},
'C33': {'chadhate', 'chhodadee', 'chhootata', 'chhootate'},
'S395': {'sadasyon', 'steshan'},
'E7396':{'ek-doosare'},
'U55': {'unhen', 'unhonne'},
'P695': {'parason', 'pareshaan', 'pareshaanee', 'prasann', 'prashn'},
'B719': {'bakavaas'},
'K37': {'kaatoge', 'khidakee'},
'L56': {'lomri'}

It’s easy to see that words under the same key sound quite similar to one another compared to words under different keys. But this isn’t an ideal world, so even within the same key there are differences in pronunciation. For the most part, we’ve found that an incorrect word and its normal form usually end with the same letter.

So, for a given word that needs correction, we first find its phonetic code. Then we pull out all the words in the dictionary that have the same phonetic code and end with the same letter. Out of these candidates, we keep the one with the smallest Levenshtein distance. With this approach, steshan maps to station, apako to apko and so on. The algorithm will probably need more fine-tuning to handle complex cases, but as a baseline it works quite well. One thing to note: the sample dictionary above was built on the vanilla dataset as a PoC, so it contains quite a few ‘bad’ words that won’t be present in the cleaned dataset.
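
Here is a minimal sketch of that lookup, using pyphonetics for the phonetic codes and the python-Levenshtein package for edit distance. The vocab list is an illustrative stand-in for our training-corpus vocabulary.

from collections import defaultdict
from pyphonetics import FuzzySoundex
from Levenshtein import distance as levenshtein

fuzzy = FuzzySoundex()
vocab = ["apko", "station", "zindagee", "jaldi", "pareshaan"]  # illustrative

# Phonetic dictionary: code -> set of known words sharing that code.
phonetic_dict = defaultdict(set)
for word in vocab:
    phonetic_dict[fuzzy.phonetics(word)].add(word)

def normalize(word):
    # Candidates share the word's phonetic code and its final letter;
    # among them, keep the one with the smallest Levenshtein distance.
    candidates = [w for w in phonetic_dict.get(fuzzy.phonetics(word), ())
                  if w[-1] == word[-1]]
    return min(candidates, key=lambda w: levenshtein(word, w)) if candidates else word

print(normalize("apako"))  # expected: 'apko' (soundex-style codes drop vowels)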

Handling clauses in a single language

In regular conversations, we’ve seen that people often use a single language within a clause and then switch languages in the following one. E.g. take the sentence: Sky is so cloudy, laagta hai baarish hogee (looks like it’s going to rain). Sentences like this are quite common in daily conversation, and they look like a simpler variant of the generic Hinglish-to-English translation problem. If we can figure out the boundary word where one language ends and the other begins, the problem reduces to translating only the Hindi clause and then joining the two clauses at the end. We tried this approach on some example sentences and the results were quite encouraging.

There’s a Python library called textblob that returns a predicted language for an input phrase (internally it calls Google’s API, but let’s set that aside for now). We found it making occasional mistakes when detecting the language of a single word, which seems fair: without any context, it’s sometimes impossible to be 100% accurate. E.g. the word chale could be either Hindi or Bengali, so from that word alone an accurate prediction isn’t possible. What we ended up doing was to take a sentence and feed the words to the module both one by one and as trigrams, giving preference to the outcomes obtained from the trigrams. This turned out to be quite accurate at detecting language boundaries. Let’s look at the following example:

Input: guys, chale? or we’ll be late. zyaada time nahin hain.
Tokens: ['en', 'punc', 'hi', 'punc', 'en', 'en', 'en', 'en', 'punc', 'hi', 'hi', 'hi', 'hi', 'punc']
Boundaries: [(2, 2), (9, 12)]
Translation: guys , let’s go ? or we’ll be late . not much time .

Again, the translation is not perfect, but it’s good enough to get the message across.
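
For the curious, here is a rough reconstruction of that detection pass; our exact voting logic differed slightly. Note that textblob’s detect_language calls Google’s endpoint under the hood, so this sketch needs internet access.

import re
from textblob import TextBlob

def detect(text):
    # Returns an ISO language code such as 'en' or 'hi'.
    try:
        return TextBlob(text).detect_language()
    except Exception:
        return "unk"

def tag_sentence(sentence):
    tokens = re.findall(r"\w+'?\w*|[^\w\s]", sentence)
    # First pass: tag each word on its own; punctuation gets 'punc'.
    tags = [detect(t) if t[0].isalnum() else "punc" for t in tokens]
    # Second pass: trigram-level detection overrides single-word guesses.
    word_idx = [i for i, t in enumerate(tags) if t != "punc"]
    for a, b, c in zip(word_idx, word_idx[1:], word_idx[2:]):
        lang = detect(" ".join(tokens[i] for i in (a, b, c)))
        if lang in ("en", "hi"):
            tags[a] = tags[b] = tags[c] = lang
    return tokens, tags

tokens, tags = tag_sentence("guys, chale? or we'll be late. zyaada time nahin hain.")
print(tags)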

Conclusion and Future Work

We’ve covered quite a lot of ground in this article. Almost everything discussed here is a work in progress, so things will break and, hopefully, improve over time as we gain more experience. I’ll update this article with a link to the cleaned training data once we’re done with the preparation and reasonably satisfied with our model’s performance on it. Thanks for staying till the end.

