2024-09-03

\(n\)-gram frequencies as multi-dimensional Markov chains?

Why measure \(n\)-gram frequencies when you could just guess them instead?


When carrying out elementary cryptanalysis on some text, you might look at the individual letter frequencies to see what type of cipher it might be - if you've got a bit of a spiky distribution, it's probably a monoalphabetic substitution cipher, unless "e" is the most common followed by "t" and so on, in which case it's probably a transposition cipher. If it's flat, you've probably got yourself a polyalphabetic substitution cipher.
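To make that first step concrete, here's a tiny Python sketch (with a made-up Caesar-shifted sample, shift of 3) that tallies letter frequencies so you can eyeball whether the distribution looks spiky, English-shaped, or flat:

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each letter, ignoring anything non-alphabetic."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: counts[c] / total for c in sorted(counts)}

# A spiky profile suggests monoalphabetic substitution; "e" then "t" at the
# top suggests transposition; a flat profile suggests a polyalphabetic cipher.
freqs = letter_frequencies("WKLV LV MXVW VRPH VDPSOH FLSKHUWHAW")
for letter, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{letter}: {f:.3f}")
```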

You might also look at bigram frequencies, since they provide statistically richer information than letter frequencies, which can be useful when carrying out some sort of automated attack that aims to maximise the textual fitness of a decryption. You might look at trigrams too, and so on. But to do any of this, you need to measure these frequencies, which generally means writing code to do it for you. It's not difficult, but it is a bit tedious.
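It really is only a few lines - something along these lines (a hypothetical helper, counting overlapping \(n\)-grams of any order) does the job:

```python
from collections import Counter

def ngram_frequencies(text, n):
    """Relative frequencies of the overlapping n-grams in a text."""
    letters = "".join(c for c in text.lower() if c.isalpha())
    grams = [letters[i:i + n] for i in range(len(letters) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

bigrams = ngram_frequencies("attack the east wall of the castle at dawn", 2)
trigrams = ngram_frequencies("attack the east wall of the castle at dawn", 3)
```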

So, what happens if you're a mathematician looking for some fun in these preliminary stages of cryptanalysis? I think I may have a solution.

Let's say we have an alphabet \(a\) of length \(k\) (for English, \(k=26\)). We might represent the letter frequencies of a text as a column vector \(V\) of height \(k\), where \(V_i\) is the relative frequency of the \(i\)th letter of the alphabet (sadly not zero-indexed). Note that \(\sum_{i=1}^kV_i=1\). Similarly, we might represent bigram frequencies as a \(k\times k\) matrix \(M\), where \(M_{i,j}\) is the relative frequency of the bigram \(a_ja_i\), i.e. \(a_j\) followed by \(a_i\). We can then divide each element of the matrix by the sum of the elements in its column, giving a new \(k\times k\) matrix \(T_2\), \(\;(T_2)_{i,j}=\frac{M_{i,j}}{\sum_{p=1}^kM_{p,j}}\). Note that the sum of the elements in each column of \(T_2\) is 1, so this is a left-stochastic matrix.
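Concretely, building \(V\) and \(T_2\) might look something like this (my own helpers, assuming plain a-z text; columns for letters that never appear are left as all zeros, a wrinkle I'm quietly ignoring):

```python
import numpy as np

def letter_vector(text):
    """V: relative letter frequencies as a length-26 vector."""
    letters = [ord(c) - ord('a') for c in text.lower() if c.isalpha()]
    V = np.bincount(letters, minlength=26).astype(float)
    return V / V.sum()

def bigram_stochastic_matrix(text):
    """T2: left-stochastic matrix with T2[i, j] = P(letter i follows letter j)."""
    letters = [ord(c) - ord('a') for c in text.lower() if c.isalpha()]
    M = np.zeros((26, 26))
    for prev, nxt in zip(letters, letters[1:]):
        M[nxt, prev] += 1        # column j = preceding letter, row i = following letter
    col_sums = M.sum(axis=0)
    col_sums[col_sums == 0] = 1  # avoid dividing by zero for unseen letters
    return M / col_sums          # broadcasting divides each column by its sum
```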

We can therefore think of these bigram frequencies as a Markov chain, where each letter has a particular probability of moving to some other letter. If we begin with some arbitrary letter distribution \(V\), then we can plug it into this Markov chain to predict the distribution of the letters that follow. If we keep applying this, then we'll eventually reach a stationary state which we can plug into the network and get exactly the same distribution back out. The stationary state of a Markov chain is typically denoted \(\pi\), so we will adopt that notation too. Note that \(T_2\pi=\pi\). This stationary state must be the letter distribution of our text. \(\pi\) is also any one column of \(\lim_{s\to\infty}T_2^s\).
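One way to see this in action is to just iterate: keep applying \(T_2\) to some starting distribution until it stops changing. A minimal power-iteration sketch (assuming a left-stochastic `T2` like the one built above):

```python
import numpy as np

def stationary_state(T2, tol=1e-12, max_iters=10_000):
    """Find pi with T2 @ pi == pi by repeatedly applying T2 (power iteration)."""
    k = T2.shape[0]
    pi = np.full(k, 1.0 / k)     # start from a uniform (arbitrary) distribution
    for _ in range(max_iters):
        nxt = T2 @ pi
        if np.allclose(nxt, pi, atol=tol):
            break
        pi = nxt
    return pi

# pi should come out close to the measured letter frequencies V,
# up to quirks like letters that never occur in the sample text.
```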

Now let's reconsider \(T_2\) as a function (since multiplying by a fixed matrix is just a linear function, which is why it distributes over addition):

\[ T_2\!:\ \mathbb{R}^k\to\mathbb{R}^k,\ T_2{(v)}_i=\left\langle v,\left(\begin{matrix} (T_2)_{i,1} \\ \vdots \\ (T_2)_{i,k} \end{matrix}\right)\right\rangle, \]

where \(\langle a,\ b\rangle\) denotes the inner product of two vectors. We can extend this to some \(n\)-dimensional stochastic tensor \(T_n\in\mathbb{R}^{k^n}\),

\[ T_n\!:\ \mathbb{R}^{k^{(n-1)}}\to\mathbb{R}^k,\ T_n{(\gamma)}_i=\frac{1}{k^{n-2}}\cdot\left\langle \gamma,\left(\begin{matrix} (T_n)_{i,\ \ldots,\ 1} & \cdots & \\ \vdots & \ddots & \\ & & (T_n)_{i,\ \ldots,\ k} \end{matrix}\right)\right\rangle, \]

whereby some input \((n-1)\)-dimensional tensor \(\gamma\) is matched against the remaining \((n-1)\) dimensions of \(T_n\) for each value of the first index, the inner product calculated and placed into the output vector, and the output vector divided by \(k^{n-2}\) in order to maintain the property that the sum of all elements in the vector is equal to 1. If we define a 'stretching' function \(\Gamma_q:\ \mathbb{R}^k\to\mathbb{R}^{k^q}\) which repeats the input vector over the 'columns' of a tensor, then we define the stationary state \(\pi\) of the \(n\)-tensor as being the vector for which \(T_n\circ\Gamma_{n-1}{(\pi)}=\pi\) (note that it's \(\Gamma_{n-1}\) that produces an input of the right shape for \(T_n\)).
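For \(n=3\) the map and the stretching function are easier to follow in code than in index notation. Here's a sketch with my own helper names, assuming a trigram tensor `T3` whose 'columns' (slices over the first index) each sum to 1; I haven't proved this iteration converges either, it just seems to behave:

```python
import numpy as np

def stretch(pi, q):
    """Gamma_q: repeat a length-k vector over the remaining axes of a k^q tensor."""
    k = len(pi)
    return np.broadcast_to(pi.reshape(k, *([1] * (q - 1))), (k,) * q)

def apply_T3(T3, gamma):
    """T_3(gamma)_i = (1/k) * <gamma, T3[i, :, :]> for a k-by-k input tensor gamma."""
    k = T3.shape[0]
    return np.einsum('ijl,jl->i', T3, gamma) / k

def tensor_stationary_state(T3, tol=1e-12, max_iters=10_000):
    """Look for pi with T_3(Gamma_2(pi)) == pi by fixed-point iteration."""
    k = T3.shape[0]
    pi = np.full(k, 1.0 / k)
    for _ in range(max_iters):
        nxt = apply_T3(T3, stretch(pi, 2))
        if np.allclose(nxt, pi, atol=tol):
            break
        pi = nxt
    return pi
```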

A proof that such a solution exists, and is unique, will be left as an exercise for the reader (mainly since I don't have such a proof, but I'm somewhat confident that it works). Such a tensor may be constructed in a similar fashion to how the stochastic matrix for bigrams was constructed. We could think of this tensor as a multi-dimensional Markov chain, where we have a network that represents the probability of \(a_\alpha a_\beta\) being followed by \(a_\gamma\). I'm not sure how such a network could be represented graphically.
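Constructing the tensor itself is as mechanical as the bigram case; something like the following, where the index conventions are my guess at what 'a similar fashion' means (later letters come first):

```python
import numpy as np

def trigram_stochastic_tensor(text):
    """T3[i, j, l] ~ P(letter i follows the pair a_l a_j); each column sums to 1."""
    letters = [ord(c) - ord('a') for c in text.lower() if c.isalpha()]
    counts = np.zeros((26, 26, 26))
    for l, j, i in zip(letters, letters[1:], letters[2:]):
        counts[i, j, l] += 1
    col_sums = counts.sum(axis=0)
    col_sums[col_sums == 0] = 1   # contexts that never occur stay as zero columns
    return counts / col_sums      # broadcast: divide each (j, l) column by its sum
```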

But what if we're not interested in letter frequencies? What if we have a burning desire to derive bigram frequencies from a 5-gram frequency distribution? Or, indeed, any \(p\)-gram distribution from some \(n\)-gram distribution? Don't worry, we can do that. Just define \(T_{n,p}\in\mathbb{R}^{k^n}\) as

\[ T_{n,p}\!:\ \mathbb{R}^{k^{(n-p)}}\to\mathbb{R}^{k^p},\ T_{n,p}{(\gamma)}_{i_1,\ \ldots,\ i_p}=\frac{1}{k^{n-p-1}}\cdot\left\langle \gamma,\left(\begin{matrix} (T_{n,p})_{i_1,\ \ldots,\ i_p,\ \ldots,\ 1} & \cdots & \\ \vdots & \ddots & \\ & & (T_{n,p})_{i_1,\ \ldots,\ i_p,\ \ldots,\ k} \end{matrix}\right)\right\rangle,\quad p\leq n-p \]

and then \(T_n=T_{n,1}\). Then adapt our stretching function to \(\Gamma_{r,q}:\ \mathbb{R}^{k^r}\to\mathbb{R}^{k^q},\ \Gamma_q=\Gamma_{1,q}\), and we find that the \(p\)-gram distribution \(\pi_p\) implied by some \(n\)-gram distribution is the one satisfying \(T_{n,p}\circ\Gamma_{p,(n-p)}(\pi_p)=\pi_p\).
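To make that slightly less abstract, here's roughly what the \(n=4\), \(p=2\) case (bigram frequencies from 4-gram statistics) might look like. \(\Gamma_{2,2}\) is just the identity there, I'm again assuming the tensor is normalised over its first index (the only convention I can see that makes the \(\frac{1}{k^{n-p-1}}\) factor keep the total at 1), and whether the fixed point really is the bigram distribution is exactly the bit I haven't verified:

```python
import numpy as np

def fourgram_stochastic_tensor(text):
    """T4[i1, i2, j, l]: 4-gram counts, normalised over the first index."""
    letters = [ord(c) - ord('a') for c in text.lower() if c.isalpha()]
    counts = np.zeros((26,) * 4)
    # Later letters come first, mirroring the trigram guess above.
    for l, j, i2, i1 in zip(letters, letters[1:], letters[2:], letters[3:]):
        counts[i1, i2, j, l] += 1
    sums = counts.sum(axis=0)
    sums[sums == 0] = 1
    return counts / sums

def apply_T42(T4, gamma):
    """T_{4,2}(gamma)_{i1,i2} = (1/k) * <gamma, T4[i1, i2, :, :]>."""
    k = T4.shape[0]
    return np.einsum('abjl,jl->ab', T4, gamma) / k

def bigram_fixed_point(T4, tol=1e-12, max_iters=10_000):
    """Look for pi_2 with T_{4,2}(Gamma_{2,2}(pi_2)) == pi_2 (Gamma_{2,2} = identity)."""
    k = T4.shape[0]
    pi2 = np.full((k, k), 1.0 / k**2)
    for _ in range(max_iters):
        nxt = apply_T42(T4, pi2)
        if np.allclose(nxt, pi2, atol=tol):
            break
        pi2 = nxt
    return pi2
```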

Or rather, I think it works. I haven't actually tried that last bit out properly, but I'm reasonably certain it works. Certainly more than 50% certain, but I'll commit to no more than that. I think it looks pretty reasonable, although I will admit that the dimensions of each bit aren't particularly clear and I should really find a nice way of notating that... Actually, all of the notation in this post has been a bit unclear, I think! Well done if you've managed to decipher my inane babbling.

So, what was the point of all this again? I'm not too sure, but I think it's kind of fun. I initially came up with this concept when doing some research into zero-plaintext attacks on Hill ciphers, but couldn't find a practical use for it. Let me know if you do, or if you manage to figure out whether any of the described techniques definitely works or doesn't.