Do you have trouble remembering, in the blink of an eye, whether jälkeen is Swedish or Finnish? As a small-town kid who’s only ever been exposed to English and French, do you feel anxious when your friends ask whether بائسکل is written in Urdu?

What if I told you that you could solve all of these problems in like, an hour, tops?

Well, believe it or not, you can do all of this and more using a statistical language model (SLM)!

An SLM forms a probability distribution over a set of variable-length sequences, and can be used to characterise those sequences based on the traits that they exhibit.

For the purpose of language identification, we’re concerned primarily with the distribution of characters within a particular sequence. Since each language tends to favour certain characters, and combinations of characters, over others, we can use this as a heuristic to tell languages apart.
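
For instance, here’s a toy comparison of letter counts in an English sentence and a Finnish one (both borrowed from the pick-up lines we’ll classify later); this isn’t part of the classifier, just an illustration of the heuristic:

from collections import Counter

# The letter 'ä' is common in Finnish text, but essentially absent from
# English text.
english = "You’re like a candy bar: half sweet and half nuts."
finnish = "Silmäsi ovat kuin tähdet, yhtä kaukana toisistaan"

print(Counter(english.lower())['ä'])  # 0
print(Counter(finnish.lower())['ä'])  # 3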

Sure, you could use a dictionary-based approach to language identification, but that wouldn’t scale very well. More importantly, what would happen if you encountered a newly coined word that wasn’t found in the dictionary? Would you just skip the word? What if you encountered a portmanteau, or some other lexical hodgepodge?

If you were to completely ignore newly coined words and portmanteaus, your language identification model wouldn’t generalise very well. It would treat the jargon and neologisms adopted by different generations of people as artifacts of entirely different languages.

That’s where the statistical language model (SLM) comes in.

Rather than cross-referencing lists of every word in every language, you can simply make note of how words in each language tend to be structured. That is, how sequences of characters are sliced, diced, and stitched together to make words.

For example, if you had access to several of Aesop’s Fables in English, French, and Spanish, you could probabilistically differentiate between input vectors in each of those languages, with increasing certainty, as you read each and every fable to your computer.

Before we dive off the deep end, let’s talk a little bit about n-grams. An n-gram is a contiguous, fixed-length slice of items taken from a sequence of text or speech. These items may be letters, words, phonemes, or syllables, and can be used to probabilistically differentiate between languages if they’re used the right way.

For example, given the string bonne soirée, we have the following n-grams (computed per word; a small generator sketch follows the list):

  • 1-grams: ['b', 'o', 'n', 'n', 'e', 's', 'o', 'i', 'r', 'é', 'e']
  • 3-grams: ['bon', 'onn', 'nne', 'soi', 'oir', 'iré', 'rée']
  • 5-grams: ['bonne', 'soiré', 'oirée']
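
Here’s a rough sketch of how those per-word character n-grams can be generated. This is purely illustrative; the helper library used later in this post operates on whole chunks of text and also accepts a translation table, so its behaviour may differ:

def char_ngrams(text, n):
  # Yield every contiguous slice of n characters from each word in the text.
  for word in text.split():
    for i in range(len(word) - n + 1):
      yield word[i:i + n]

print(list(char_ngrams('bonne soirée', 1)))  # ['b', 'o', 'n', 'n', 'e', 's', ...]
print(list(char_ngrams('bonne soirée', 3)))  # ['bon', 'onn', 'nne', 'soi', ...]
print(list(char_ngrams('bonne soirée', 5)))  # ['bonne', 'soiré', 'oirée']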

While the traits of different languages may be difficult to spot with unigrams (i.e. 1-grams), they become readily apparent in sets of bigrams, trigrams, and quadgrams. That is, when looking at bigger building blocks of each language.

As an example, let’s consider how one could distinguish between English, French, Spanish, German, Swedish, and Finnish using the Leipzig Corpora Collection from the University of Leipzig. This collection consists of multilingual corpora compiled from various sources across the Internet and spans dozens of languages.

While we could follow our initial hypothesis and construct a naïve classifier using nothing more than Aesop’s Fables, we might as well use a bigger and better dataset to improve the accuracy of our classifier.

To distinguish between these six languages, we’ll be using datasets consisting of approximately ten thousand sentences in each language.

$ tree ~/src/tylerjfisher.github.io/_data/
/home/tyler/src/tylerjfisher.github.io/_data/
├── english
│   ├── eng_news_2006_10K-sentences.txt
...
├── german
│   ├── deu_news_2010_10K-sentences.txt
...
├── finnish
│   ├── fin_newscrawl_2011_10K-sentences.txt
...
├── french
│   ├── fra_news_2005-2008_10K-sentences.txt
...
├── spanish
│   ├── spa_news_2010_10K-sentences.txt
...
└── swedish
    ├── swe_news_2007_10K-sentences.txt
...

Each dataset exists as a single tab-delimited flatfile and consists of sentences aggregated from news sources such as The New York Times, The Huffington Post, and Aftonbladet. These datasets are far from pristine, but they can nonetheless be used for statistical natural language processing after fiddling with a few characters here and there.

$ head english/*sentences*
1 !,
2 ❤
3 ⬆️⬆️⬇️⬇️⬅️➡️⬅️➡️
4 0.0000 0.00% Asian shares rose on Wednesday, taking early cues from overnight Wall Street gains, while investors' sharper risk appetite lifted U.S. debt yields and supported the dollar.
5 0.0000 0.00% But instead a wide-range of investors collectively spent about $24 billion over the past 18 months trying–and failing–to call a bottom in oil.
6 0.0000 0.00% For sure, a solid majority of the 8.3 million illegal immigrants continue to work in low-skilled service, construction and production jobs.
7 0.0000 0.00% Since Lehman Brothers went bankrupt in September 2008, the world’s central banks have injected more than $12 trillion under QE (Quantitative Easing) programs into financial markets.
...
2789428 サミット開催報告/デルが今ここまで注力する理由とは The Japanese edition of 'CNET' is published under license from CBS Interactive, Inc., San Francisco, CA, USA.
2789429 開発者がオンラインサービスでやりがちなセキュリティ上の3つの失敗 アップル、12.9インチ「iPad Pro」を発表--企業向けの性能をアピール ストレージ「HP 3PAR StoreServ」に新製品--遅延を0.5ミリ秒まで設定 2025年から2045年の世界と情報システム部門の役割 The Japanese edition of 'CNET' is published under license from CBS Interactive, Inc., San Francisco, CA, USA.
2789430 雪人在阿拉伯受到冷眼對待 A 2m tall wooden snowman that appears every year at Christmasland in New Taipei City sits at the venue on Nov. 11 last year.
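
Reading these flatfiles boils down to splitting each line on its first tab and dropping the numeric index. Here’s a rough sketch of that step (purely illustrative; the helper library introduced below takes care of this for us):

def read_sentences(path):
  # Each line of a Leipzig flatfile looks like "<index>\t<sentence>", so we
  # split on the first tab and discard the numeric index.
  with open(path, encoding='utf-8') as handle:
    for line in handle:
      _, _, sentence = line.partition('\t')
      yield sentence.strip()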

We can translate each of these flatfiles into a collection of n-grams, and subsequently into a series of statistical language models, with the help of hodgepodge, a little helper library that I’ve written. As a first step, we can construct a naïve linear classifier:

import hodgepodge.common.files
import hodgepodge.common.iterables

import itertools
import concurrent.futures
import collections
import math


class LanguageClassifier:
  def __init__(self, n):
    self.n = n
    self.models = collections.defaultdict(lambda: collections.defaultdict(int))

  def train(self, language, data, ignore):
    # Build a translation table that strips ignored characters (digits,
    # punctuation, whitespace) before the text is sliced into n-grams.
    translator = str.maketrans({key: None for key in ignore})

    # Slice each chunk of the training data into n-grams in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as e:
      ngrams = itertools.chain(*e.map(
        lambda chunk: hodgepodge.common.iterables.ngrams(
          chunk,
          self.n,
          translator
        ), hodgepodge.common.files.read(data)
      ))

      # Tally how often each n-gram appears in the training data.
      for ngram, frequency in collections.Counter(ngrams).items():
        self.models[language][ngram] += frequency

    # Convert the raw counts into relative frequencies, represented in log
    # probability space so that per-n-gram scores can be summed rather than
    # multiplied.
    total = sum(self.models[language].values())
    for ngram, frequency in self.models[language].items():
      self.models[language][ngram] = math.log10(frequency / total)

  def score(self, fdist, data):
    # Unseen n-grams receive a small floor probability rather than zero, so a
    # single out-of-vocabulary n-gram can't zero out the whole score.
    floor = math.log10(0.01 / len(fdist))
    return sum(map(
      lambda ngram: fdist.get(ngram, floor), hodgepodge.common.iterables.ngrams(data, self.n)
    ))

The purpose of this linear classifier is to learn what each language looks like from a series of labeled training sets. That is, a series of labeled datasets which clearly exhibit a feature, or set of features, that we would like to be able to recognise.

In our case, we would like to be able to differentiate between languages without knowing anything but how the rules of each language tend to be applied (e.g. orthographic rules such as I before E except after C in English).

In the example above, the train function builds a language model from a labeled training set. The score function then translates an input vector into a numerical measure of how well that model fits the input vector.

As you can see, the train function simply builds a collection of n-grams from the training data and associates a relative frequency with each n-gram. That is, a measure of how frequently the n-gram appears in the training set. This relative frequency is represented in log probability space, which lets us sum per-n-gram scores rather than multiply tiny probabilities together, and thereby avoids floating-point underflow.
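
To see why log space helps, consider what happens when you multiply a few hundred tiny probabilities together in double precision:

import math

probabilities = [1e-4] * 100  # e.g. one hundred fairly rare trigrams

# Multiplying the raw probabilities underflows to zero...
product = 1.0
for p in probabilities:
  product *= p
print(product)  # 0.0

# ...whereas summing their base-10 logarithms stays perfectly representable.
print(sum(math.log10(p) for p in probabilities))  # -400.0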

Similarly, score is a fitness function which translates an input vector into a series of n-grams, and subsequently into a numerical representation of how well the training set models the input vector. If the input vector exhibits the same traits as the training data, the score will be higher; if not, the score will be lower. In other words, the score measures how well a given language model fits the input.
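
As a toy illustration (with made-up log probabilities rather than values from the real corpora), the word jälkeen contains trigrams that a Finnish model will actually have seen, so it scores far better under that model than under a Swedish one:

# Hypothetical log-probability tables; real models contain thousands of entries.
finnish_fdist = {'jäl': -2.1, 'älk': -2.3, 'lke': -1.9, 'kee': -1.7, 'een': -1.5}
swedish_fdist = {'och': -1.2, 'det': -1.4, 'att': -1.3}
floor = -5.0  # penalty applied to any trigram a model has never seen

trigrams = ['jäl', 'älk', 'lke', 'kee', 'een']
print(sum(finnish_fdist.get(t, floor) for t in trigrams))  # roughly -9.5
print(sum(swedish_fdist.get(t, floor) for t in trigrams))  # -25.0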

In order to differentiate between multiple languages, we can compare the fitness of a particular input vector under each language model. That is, we could use the following function to say that بائسکل is likely Urdu (assuming we also trained an Urdu model), and that jälkeen is likely Finnish:

def guess(self, data):
  # Score the input vector under every language model, and return the
  # language whose model fits it best.
  best_language, best_score = None, float('-inf')
  for language, model in self.models.items():
    score = self.score(fdist=model, data=data)
    if score > best_score:
      best_language, best_score = language, score
  return best_language

As an example, let’s use this linear classifier to determine which languages the following pick-up lines were written in:

  • You’re like a candy bar: half sweet and half nuts.

  • Je me suis perdu dans tes yeux.

  • Si el agua fuese belleza, tú serías el océano entero.

  • Du är jävligt vacker!

  • Silmäsi ovat kuin tähdet, yhtä kaukana toisistaan

  • Entschuldigung, aber auf welchen Anmachspruch würdest du denn am positivsten reagieren?

import hodgepodge.classifiers.language
import string


if __name__ == "__main__":
  ignore = string.digits + string.punctuation + string.whitespace

  classifier = hodgepodge.classifiers.language.LanguageClassifier(n=3)
  for language, training_dataset in (
      ('english', 'tests/data/english/eng_news_2006_10K-sentences.txt'),
      ('swedish', 'tests/data/swedish/swe_news_2007_10K-sentences.txt'),
      ('finnish', 'tests/data/finnish/fin_newscrawl_2011_10K-sentences.txt'),
      ('spanish', 'tests/data/spanish/spa_news_2010_10K-sentences.txt'),
      ('german', 'tests/data/german/deu_news_2010_10K-sentences.txt'),
      ('french', 'tests/data/french/fra_news_2005-2008_10K-sentences.txt'),
  ):
    classifier.train(language, training_dataset, ignore)

  input_vectors = (
    ('english', 'You’re like a candy bar: half sweet and half nuts.'),
    ('french', 'Je me suis perdu dans tes yeux'),
    ('spanish', 'Si el agua fuese belleza, tú serías el océano entero.'),
    ('swedish', 'Du är jävligt vacker'),
    ('finnish', 'Silmäsi ovat kuin tähdet, yhtä kaukana toisistaan'),
    ('german', 'Entschuldigung, aber auf welchen Anmachspruch würdest du denn am positivsten reagieren?'),
  )
  for language, vector in input_vectors:
    guess = classifier.guess(vector)
    print("{} ({}) - guess: {}".format(language, vector, guess))
$ time python3 ngrams.py
INFO:hodgepodge.classifiers.language:Training english language model on /home/tyler/src/tylerjfisher.github.io/_data/english/eng_news_2006_10K-sentences.txt with n=3
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| ...
INFO:hodgepodge.classifiers.language:Training french language model on /home/tyler/src/tylerjfisher.github.io/_data/french/fra_news_2005-2008_10K-sentences.txt with n=3
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1269093/1269093 [00:00<00:00, 378912432.11it/s]
...
english (You’re like a candy bar: half sweet and half nuts.) - guess: english
french (Je me suis perdu dans tes yeux) - guess: french
spanish (Si el agua fuese belleza, tú serías el océano entero.) - guess: spanish
swedish (Du är jävligt vacker) - guess: swedish
finnish (Silmäsi ovat kuin tähdet, yhtä kaukana toisistaan) - guess: finnish
german (Entschuldigung, aber auf welchen Anmachspruch würdest du denn am positivsten reagieren?) - guess: german

real  0m6.844s
user  0m6.808s
sys 0m0.028s

And just like that, you can probabilistically differentiate between six different languages.

The ability to differentiate between different languages, and more generally between structured and unstructured text, is of particular importance in the field of cryptanalysis. But that’s a story for another time.