This article explores how we can use simple statistics to learn these visual fingerprints, the character sequences that most strongly signal which language you’re looking at, across 20 different European languages.
How to Learn Linguistic Fingerprints With Statistics
To learn the visual “fingerprints” of a language, we first need a way to measure how distinctive a given character pattern is. A natural starting point might be to look at the most common character patterns within each language. However, this approach quickly falls short: a character pattern might be very common in one language yet appear just as frequently in others. Frequency alone does not capture uniqueness. Instead, we want to ask:
“How much more likely is this pattern to appear in one language compared to all others?”
This is where statistics comes in! Formally, let:
- L be the set of all 20 studied languages
- S be the set of all observed character patterns across those languages
To determine how strongly a given character pattern s∈S identifies a language l∈L, we compute the likelihood ratio:
\[LR_{s,l} = \frac{P(s|l)}{P(s|\neg l)}\]
This compares the probability of seeing character pattern s in language l, versus in any other language. The higher the ratio, the more uniquely tied that pattern is to that language.
Calculating the Likelihood Ratio in Practice
To compute the likelihood ratio for each character pattern in practice, we need to translate the conditional probabilities into quantities we can actually measure. Here’s how we define
the relevant counts:
- cl(s): the number of times character pattern s appears in language l
- c¬l(s): the number of times character pattern s appears in all other languages
- Nl: the total number of character pattern occurrences in language l
- N¬l: the total number of pattern occurrences in all other languages
Using these, the conditional probabilities become:
\[P(s|l) = \frac{c_l(s)}{N_l},~P(s|\neg l)=\frac{c_{\neg l}(s)}{N_{\neg l}}\]
and the likelihood ratio simplifies to:
\[LR_{s,l} = \frac{P(s|l)}{P(s|\neg l)} = \frac{c_l(s)\cdot N_{\neg l}}{c_{\neg l}(s)\cdot N_l}\]
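Expanded into counts, the ratio is straightforward to compute. Here is a minimal Python sketch; the counts below are hypothetical, chosen purely for illustration:

```python
def likelihood_ratio(c_l: int, c_not_l: int, N_l: int, N_not_l: int) -> float:
    """LR(s, l) = P(s|l) / P(s|not l) = (c_l(s) * N_not_l) / (c_not_l(s) * N_l)."""
    return (c_l * N_not_l) / (c_not_l * N_l)

# Hypothetical counts: pattern s seen 50 times out of 10,000 occurrences
# in language l, and 5 times out of 190,000 in all other languages.
print(likelihood_ratio(50, 5, 10_000, 190_000))  # → 190.0
```

A score of 190 means the pattern is 190 times more likely to occur in language l than elsewhere.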
This gives us a numeric score that quantifies how much more likely a character pattern s is to appear in language l versus all others.
Handling Zero Counts
There is sadly a problem with our likelihood ratio formula: what happens when c¬l(s) = 0?
In other words, what if a certain character pattern s appears only in language l and in no others? This leads to a divide-by-zero in the denominator, and a likelihood ratio of infinity.
Technically, this means we’ve found a perfectly unique pattern for that language. But in practice, it’s not very helpful: a character pattern that appears just once in one language and never anywhere else would receive an infinite score, which hardly makes it a strong “fingerprint” of the language.
To avoid this issue, we apply a technique called additive smoothing. This method adjusts the raw counts slightly to eliminate zeros and reduce the impact of rare events.
Specifically, we add a small constant α to every count in the numerator, and α|S| to the denominator, with |S| being the total number of observed character patterns. This has the effect of assuming that every character pattern has a tiny chance of occurring in every language, even if it hasn’t been seen yet.
With smoothing, the adjusted probabilities become:
\[P'(s|l) = \frac{c_l(s) + \alpha}{N_l + \alpha|S|},~P'(s|\neg l)=\frac{c_{\neg l}(s) + \alpha}{N_{\neg l} + \alpha|S|}\]
And the final likelihood ratio to be maximized is:
\[LR_{s,l} = \frac{P'(s|l)}{P'(s|\neg l)} = \frac{(c_l(s) + \alpha)\cdot(N_{\neg l} + \alpha|S|)}{(N_l + \alpha|S|)\cdot(c_{\neg l}(s) + \alpha)}\]
This keeps things stable and ensures that a rare pattern doesn’t automatically dominate just because it’s exclusive.
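To see how the smoothed ratio behaves, here is a short Python sketch of the formula above (the counts are again hypothetical):

```python
def smoothed_likelihood_ratio(c_l, c_not_l, N_l, N_not_l, s_size, alpha=0.5):
    """Smoothed LR: add alpha to each count and alpha * |S| to each total,
    so a pattern unseen outside language l no longer divides by zero."""
    p_l = (c_l + alpha) / (N_l + alpha * s_size)
    p_not_l = (c_not_l + alpha) / (N_not_l + alpha * s_size)
    return p_l / p_not_l

# A pattern that appears once in l and never elsewhere now gets a
# finite, modest score instead of infinity:
lr = smoothed_likelihood_ratio(1, 0, 10_000, 190_000, s_size=180_000)
print(round(lr, 1))  # → 8.4
```

With α = 0.5, the one-off exclusive pattern scores about 8.4 rather than ∞, so genuinely frequent exclusive patterns can still outrank it.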
The Dataset
Now that we’ve defined a metric to identify the most distinctive character patterns (our linguistic “fingerprints”), it’s time to gather actual language data to analyze.
For this, I used the Python library wordfreq, which compiles word frequency lists for dozens of languages based on large-scale sources like Wikipedia, books, subtitles, and web text.
One particularly useful function for this analysis is top_n_list(), which returns a sorted list of the top n highest frequency words in a provided language. For example, to get the top 40 most common words in Icelandic, we would call:
```python
import wordfreq

wordfreq.top_n_list("is", 40, ascii_only=False)
```
The argument ascii_only=False ensures that non-ASCII characters — like Icelandic’s “ð” and “þ” — are preserved in the output. That’s essential for this analysis, since we’re specifically looking for language-unique character patterns, which includes single characters.
To build the dataset, I pulled the top 5,000 most frequent words in each of the following 20 European languages:
Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Spanish, Swedish, and Turkish.
This yields a large multilingual vocabulary of 100,000 total words, rich enough to extract meaningful statistical patterns across languages.
To extract the character patterns used in the analysis, all possible substrings of length 1 to 5 were generated from each word in the dataset. For example, the word language would contain patterns such as l, la, lan, lang, langu, a, an, ang, and so on. The result is a comprehensive set S of over 180,000 unique character patterns observed across the 20 studied languages.
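This extraction step can be sketched in a few lines of Python (the function name is mine, not from the original code):

```python
def extract_patterns(word: str, max_len: int = 5) -> set[str]:
    """All unique substrings of the word with lengths 1..max_len."""
    return {word[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(word) - n + 1)}

# "language" yields 30 substring windows of lengths 1-5; after duplicates
# like the repeated "a" and "g" collapse, 28 unique patterns remain.
patterns = extract_patterns("language")
print(sorted(patterns)[:5])  # → ['a', 'ag', 'age', 'an', 'ang']
```

Applying this to all 100,000 words and taking the union of the resulting sets produces the pattern vocabulary S.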
Results
For each language, the top five most distinctive character patterns, ranked by the likelihood ratio, are shown. The smoothing constant was chosen to be α=0.5.
Because the raw likelihood ratios can be quite large, I’ve reported the base-10 logarithm of the likelihood ratio (log10(LR)) instead. For example, a log likelihood ratio of 3 means that the character pattern is 10³ = 1,000 times more likely to appear in that language than in any other. Note that due to smoothing, these likelihood ratios are approximate rather than exact, and the extremeness of some scores may be dampened.
Each cell shows a top-ranked character pattern and its log likelihood ratio.
| Language | #1 | #2 | #3 | #4 | #5 |
|---|---|---|---|---|---|
| Catalan | ènc 3.03 | ènci 3.01 | cions 2.95 | ència 2.92 | atge 2.77 |
| Czech | ě 4.14 | ř 3.94 | ně 3.65 | ů 3.59 | ře 3.55 |
| Danish | øj 2.82 | æng 2.77 | søg 2.73 | skab 2.67 | øge 2.67 |
| Dutch | ijk 3.51 | lijk 3.45 | elijk 3.29 | ijke 3.04 | voor 3.04 |
| English | ally 2.79 | tly 2.64 | ough 2.54 | ying 2.54 | cted 2.52 |
| Finnish | ää 3.74 | ään 3.33 | tää 3.27 | llä 3.13 | ssä 3.13 |
| French | êt 2.83 | eux 2.78 | rése 2.73 | dép 2.68 | prése 2.64 |
| German | eich 3.03 | tlic 2.98 | tlich 2.98 | schl 2.98 | ichen 2.90 |
| Hungarian | ő 3.80 | ű 3.17 | gye 3.16 | szá 3.14 | ész 3.09 |
| Icelandic | ð 4.32 | ið 3.74 | að 3.64 | þ 3.63 | ði 3.60 |
| Italian | zione 3.41 | azion 3.29 | zion 3.07 | aggi 2.90 | zioni 2.87 |
| Latvian | ā 4.50 | ī 4.20 | ē 4.10 | tā 3.66 | nā 3.64 |
| Lithuanian | ė 4.11 | ų 4.03 | ių 3.58 | į 3.57 | ės 3.56 |
| Norwegian | sjon 3.17 | asj 2.93 | øy 2.88 | asjon 2.88 | asjo 2.88 |
| Polish | ł 4.13 | ś 3.79 | ć 3.77 | ż 3.69 | ał 3.59 |
| Portuguese | ão 3.73 | çã 3.53 | ção 3.53 | ação 3.32 | açã 3.32 |
| Romanian | ă 4.31 | ț 4.01 | ți 3.86 | ș 3.64 | tă 3.60 |
| Spanish | ción 3.51 | ación 3.29 | ión 3.14 | sión 2.86 | iento 2.85 |
| Swedish | förs 2.89 | ställ 2.72 | stäl 2.72 | ång 2.68 | öra 2.68 |
| Turkish | ı 4.52 | ş 4.10 | ğ 3.83 | ın 3.80 | lı 3.60 |
Discussion
Below are some interesting interpretations of the results. This is not meant to be a comprehensive analysis, just a few observations I found noteworthy:
- Many of the character patterns with the highest likelihood ratios are single characters unique to their language, such as the previously mentioned Icelandic “ð” and “þ”, Romanian’s “ă”, “ț”, and “ș”, or Turkish’s “ı”, “ş”, and “ğ”. Because these characters are essentially absent from all other languages in the dataset, they would have produced infinite likelihood ratios if not for the additive smoothing.
- In some languages, most notably Dutch, many of the top results are substrings of one another. For example, the top pattern “ijk” also appears within the next highest-ranking patterns: “lijk”, “elijk”, and “ijke”. This shows how certain combinations of letters are reused frequently in longer words, making them even more distinctive for that language.
- English has some of the least distinctive character patterns of all the languages analyzed, with a maximum log likelihood ratio of only 2.79. This may be due to the presence of English loanwords in many other languages’ top 5,000 word lists, which dilutes the uniqueness of English-specific patterns.
- There are several cases where the top character patterns reflect shared grammatical structures across languages. For example, the Spanish “-ción”, Italian “-zione”, and Norwegian “-sjon” all function as nominalization suffixes, similar to the English “-tion” — turning verbs or adjectives into nouns. These endings stand out strongly in each language and highlight how different languages can follow similar patterns using different spellings.
Conclusion
This project started with a simple question: What makes a language look like itself? By analyzing the 5,000 most common words in 20 European languages and comparing the character patterns they use, we uncovered unique “fingerprints” for each language — from accented letters like “ş” and “ø” to recurring letter combinations like “ijk” or “ción”. While the results aren’t meant to be definitive, they offer a fun and statistically grounded way to explore what sets languages apart visually, even without understanding a single word.
See my GitHub Repository for the full code implementation of this technique.
Thank you for reading!
References
wordfreq python library:
- Robyn Speer. (2022). rspeer/wordfreq: v3.0 (v3.0.2). Zenodo. https://doi.org/10.5281/zenodo.7199437