Predictive text and language models have transformed human-machine interaction by enabling machines to predict and generate human language, enhancing communication, accessibility, and efficiency. Mathematical techniques and algorithms form the basis of how machines understand and generate text in an "intelligent" manner. This essay discusses the mathematical and computational principles behind predictive text and language models, drawing on probability theory, neural networks, attention mechanisms, and language modeling techniques.
The core idea of language modeling, which underlies predictive text, is to assign a probability to a sequence of words. Given a sequence of preceding words, a language model predicts the probability of the next word. Concretely, it computes the probability of each word conditioned on the words that come before it, and the probability of the whole sequence is the product of these conditional probabilities.
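As a minimal illustration of this decomposition (the chain rule of probability), the sketch below multiplies hand-picked conditional probabilities for a toy four-word sequence; the numbers are made up purely for demonstration.

```python
# Chain rule behind language modeling:
# P(w1, ..., wn) = P(w1) * P(w2 | w1) * ... * P(wn | w1, ..., wn-1).
# The conditional probabilities below are illustrative values, not model outputs.

conditional_probs = [
    0.20,  # P("the")
    0.05,  # P("cat" | "the")
    0.30,  # P("sat" | "the cat")
    0.60,  # P("on"  | "the cat sat")
]

sequence_prob = 1.0
for p in conditional_probs:
    sequence_prob *= p

print(f"P(sequence) = {sequence_prob:.6f}")  # 0.001800
```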
Early models made this problem computationally feasible with the Markov assumption, which simplifies the model by assuming that the probability of each word depends only on a fixed number of previous words. In an n-gram model, the probability of a word is conditioned only on the n − 1 words before it, not on the entire history of the sequence. This simplifying assumption makes the resulting language models relatively efficient to compute, but it leaves them with serious shortcomings in capturing long-range dependencies and broader context. These limitations motivated more sophisticated approaches such as neural networks.
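A bigram model (n = 2) is the simplest non-trivial case. The sketch below estimates P(current word | previous word) from raw counts in a toy corpus; the corpus and the resulting probabilities are purely illustrative.

```python
from collections import Counter, defaultdict

# Toy corpus used only for illustration.
corpus = "the cat sat on the mat . the cat ran .".split()

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate P(curr | prev) = count(prev, curr) / count(prev, *)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" twice and "mat" once
```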
Neural networks, particularly RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks), are designed to process sequential data, which makes them well-suited for language modeling. These models learn representations of sequences by carrying information about previous words in their hidden states, which improves their predictions of future words.
RNNs are a family of neural networks whose connections between units contain directed cycles, which allows information to persist across steps in a sequence. Given a sequence of inputs, an RNN updates its hidden state at every time step as it consumes each token. However, standard RNNs suffer from the vanishing gradient problem: gradients shrink as they are propagated back through many time steps, so the network struggles to preserve information across long sequences. This shortcoming created the need for LSTMs.
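The sketch below shows a single vanilla RNN step, h_t = tanh(W_xh·x_t + W_hh·h_{t−1} + b), applied over a short sequence; the dimensions and random weights are arbitrary placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_xh = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights (placeholder)
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights (placeholder)
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One recurrent update: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(hidden_dim)
sequence = [rng.normal(size=input_dim) for _ in range(5)]
for x_t in sequence:
    h = rnn_step(x_t, h)  # the hidden state carries information across time steps
print(h)
```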
Long Short-Term Memory networks are a variant of RNNs that use a gating mechanism to control the flow of information, allowing them to retain relevant information for longer. An LSTM cell has three gates: the forget gate, which determines what information to discard from the cell state; the input gate, which selects which values to update in the cell state; and the output gate, which selects what part of the cell state to expose. This gating mitigates the vanishing gradient problem, letting LSTMs capture dependencies across longer spans of text.
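A simplified single LSTM step, with each gate annotated, might look like the sketch below; the weight matrices are random placeholders and the equations follow the standard LSTM formulation rather than any particular library's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
input_dim, hidden_dim = 4, 3
# One weight matrix and bias per gate (f, i, o) plus the candidate cell state (c).
W = {g: rng.normal(size=(hidden_dim, input_dim + hidden_dim)) for g in ("f", "i", "o", "c")}
b = {g: np.zeros(hidden_dim) for g in ("f", "i", "o", "c")}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate: what to discard from the cell state
    i = sigmoid(W["i"] @ z + b["i"])        # input gate: which values to update
    o = sigmoid(W["o"] @ z + b["o"])        # output gate: what part of the cell state to expose
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c = f * c_prev + i * c_tilde            # new cell state
    h = o * np.tanh(c)                      # new hidden state
    return h, c

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim), h, c)
```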
Representing words in a machine-readable format is an integral part of modern language modeling. Word embeddings are dense vector representations of words that capture semantic similarities between them. Methods like Word2Vec and GloVe place semantically similar words closer together in vector space, which facilitates better modeling of word relationships.
Word2Vec generates word embeddings using one of two models: the Continuous Bag of Words (CBOW) model or the Skip-Gram model. In CBOW, a target word is predicted from the context words that surround it; Skip-Gram, conversely, predicts the context words for a given target word. Both use neural networks to produce vectors that capture semantic relationships, which allows the model to perform analogical reasoning such as "king" − "man" + "woman" ≈ "queen".
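The sketch below reproduces the analogy with hand-crafted three-dimensional toy vectors and cosine similarity; real Word2Vec or GloVe embeddings are learned from data and typically have hundreds of dimensions.

```python
import numpy as np

# Hand-picked toy embeddings for illustration only.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.2]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "king" - "man" + "woman" should land closest to "queen".
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max((w for w in embeddings if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, embeddings[w]))
print(best)  # queen
```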
Recent advances in language modeling are built on the Transformer architecture, which introduced attention mechanisms. Transformers overcome the limitations of RNNs and LSTMs through self-attention, which lets the model weigh the importance of each word in a sequence against every other word, regardless of position.
Self-attention computes a weight for every word with respect to all other words in the sequence, enabling the model to account for the relative importance of every word as it processes each one. For a given input sequence, it derives three representation vectors for each word: a Query, a Key, and a Value, and computes an attention score for each pair of words. These scores determine how much emphasis each word receives relative to the rest of the sequence, which better captures the context.
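A minimal scaled dot-product self-attention in NumPy might look like the following sketch; the projection matrices and input vectors are random placeholders, and this single-head form omits the multi-head and masking machinery of a full Transformer.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors X (seq_len x d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V                                  # weighted sum of value vectors

# Random placeholder inputs and projection matrices for illustration.
rng = np.random.default_rng(2)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 4)
```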
In the Transformer architecture, self-attention and feed-forward neural networks are stacked in alternating layers. With multiple layers, each layer can attend to different features of the input sequence, capturing a more nuanced understanding of the language. Transformers are also highly parallelizable and therefore efficient to train on large datasets, which has led to breakthroughs in language modeling.
With the arrival of large-scale data and computational resources, pre-trained models came to dominate the field, including GPT (Generative Pretrained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and T5 (Text-To-Text Transfer Transformer). These large models are pre-trained on massive datasets and then fine-tuned on downstream tasks.
GPT is an autoregressive model that predicts the next word in a sequence, generating text one token at a time from left to right.
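Conceptually, one step of this generation process looks like the sketch below; `next_token_distribution` is a hypothetical stand-in for a trained model's output distribution, not a real API.

```python
def next_token_distribution(context):
    # Hypothetical toy rule standing in for a real language model's softmax output.
    return {"mat": 0.6, "rug": 0.3, "dog": 0.1} if context[-1] == "the" else {"the": 1.0}

context = ["the", "cat", "sat", "on", "the"]
probs = next_token_distribution(context)
next_token = max(probs, key=probs.get)  # greedy decoding: pick the most likely token
context.append(next_token)
print(" ".join(context))  # "the cat sat on the mat"
```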
Because BERT is a deep bidirectional model that considers context from both sides of a word, it is remarkably effective for a wide range of tasks such as question answering and sentiment analysis.
T5 takes a unified view of NLP: it rephrases every task as a text-to-text problem, so a single architecture can handle many different tasks.
Training Objectives and Loss Functions
These models are optimized during training with respect to specific loss functions. For predictive text, cross-entropy loss is typically used; it measures the difference between the predicted probability distribution over the vocabulary and the actual next word. Minimizing this loss gradually aligns the model's predictions with the training data.
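For a single prediction, the cross-entropy loss reduces to the negative log of the probability the model assigned to the true next word, as in the toy sketch below (the vocabulary size and probabilities are made up).

```python
import numpy as np

def cross_entropy(predicted_probs, target_index):
    """Cross-entropy loss for one prediction: -log of the probability assigned to the true word."""
    return -np.log(predicted_probs[target_index])

# Toy distribution over a 4-word vocabulary; the true next word is at index 2.
predicted = np.array([0.1, 0.2, 0.6, 0.1])
print(cross_entropy(predicted, target_index=2))  # ~0.51; a perfect prediction would give 0
```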
While large language models have enjoyed tremendous success, they still face challenges around bias, interpretability, and data efficiency. They can absorb societal biases present in their training data, and true interpretability remains hard to achieve with deep neural networks. Few-shot learning, model interpretability, and ethical AI are active areas of research aimed at addressing these issues.
References
- Vaswani, A., et al. (2017). “Attention is All You Need.” Advances in Neural Information Processing Systems, 30, 5998-6008.
- Mikolov, T., et al. (2013). “Efficient Estimation of Word Representations in Vector Space.” arXiv preprint arXiv:1301.3781.
- Hochreiter, S., & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation, 9(8), 1735-1780.
- Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT 2019, 4171-4186.