temperature, top_p, etc., that allow users to tweak probabilities and alter their outputs. Temperature controls how deterministic the LLM outputs are.
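As a rough sketch of what these parameters do, the snippet below rescales a next-token distribution with temperature and trims its low-probability tail with top_p (nucleus sampling); the vocabulary and logits are made up purely for illustration.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Toy next-token sampler: temperature rescales the logits, top_p keeps
    only the smallest set of tokens whose cumulative probability reaches top_p."""
    rng = rng or np.random.default_rng()

    # Lower temperature -> sharper (more deterministic) distribution;
    # higher temperature -> flatter (more random) distribution.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Nucleus (top_p) filtering: zero out the low-probability tail.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()

    return rng.choice(len(probs), p=filtered)

# Hypothetical 4-token vocabulary and logits, purely for illustration.
vocab = ["the", "cat", "sat", "flew"]
logits = [2.0, 1.0, 0.5, -1.0]
print(vocab[sample_next_token(logits, temperature=0.7, top_p=0.9)])
```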
The field of Natural Language Processing (NLP) involves finding mathematical representations of language to capture statistical regularities in text. These representations underpin many well-known language-related tasks.
There are different ways of mathematically representing text depending on the smallest unit of a sequence one chooses to model. This unit can be a letter, a word, or a sequence of words; such units are also known as “tokens”.
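To make the idea of units concrete, here is a toy comparison of character-level and word-level units for the same sentence (the sentence and the naive whitespace split are illustrative; real tokenizers are more sophisticated):

```python
text = "the cat sat on the mat"

chars = list(text)        # character-level units
words = text.split()      # word-level units (naive whitespace split)

print(len(set(chars)), "distinct character units")
print(len(set(words)), "distinct word units")
```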
Some definitions around language models:
Autoregressive language models are trained on a corpus of text and use word representations to predict the next best thing to say, based on the underlying distribution of words.
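A minimal sketch of the autoregressive loop, assuming a hypothetical next_word_distribution function that stands in for a trained model (count-based or neural): generation just repeats predict, sample, append.

```python
import random

def next_word_distribution(context):
    """Hypothetical stand-in for a trained model: maps the words so far
    to a probability for each candidate next word."""
    # Hard-coded toy distribution that ignores the context, purely for illustration.
    return {"cat": 0.5, "dog": 0.3, "<end>": 0.2}

def generate(prompt, max_words=10):
    words = prompt.split()
    for _ in range(max_words):
        dist = next_word_distribution(words)
        # Sample the next word in proportion to its probability.
        next_word = random.choices(list(dist), weights=dist.values())[0]
        if next_word == "<end>":
            break
        words.append(next_word)
    return " ".join(words)

print(generate("the"))
```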
A count-based language model is the simplest way to build an autoregressive language model. It involves creating a giant lookup table of the words in a text, with their locations and frequencies stored in it. This lookup table is then used to calculate the probabilities of the next best thing to say.
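A minimal sketch of that counting idea, assuming two-word (bigram) units and a tiny made-up corpus: count which word follows which, then turn the counts into probabilities.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Lookup table: for each word, count the words that follow it.
follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def next_word_probs(word):
    """Probability of each possible next word, from raw counts."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))   # e.g. {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(next_word_probs("dog"))   # {} -- the table has nothing to say about unseen words
```

Note how the table has nothing to say about a word it never saw; this is the generalization problem discussed below.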
Neural language models use neural networks to map text onto mathematical representations, such that text that occurs together or has similar meaning is encoded into representations that lie close to one another.
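A toy sketch of this “nearby representations” idea, using made-up two-dimensional vectors and cosine similarity; a real neural language model learns much higher-dimensional representations from data.

```python
import numpy as np

# Made-up 2-D vectors, purely for illustration; a real model learns these.
embeddings = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "economy": np.array([0.1, 0.9]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))      # high: similar meaning
print(cosine_similarity(embeddings["cat"], embeddings["economy"]))  # low: unrelated
```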
The “counting words” approach to language models runs into two issues: the curse of dimensionality and lack of generalizability.
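A back-of-the-envelope calculation makes the curse of dimensionality concrete: the number of possible word sequences an exact lookup table would need to cover grows exponentially with sequence length (the vocabulary size below is an illustrative assumption).

```python
vocab_size = 50_000              # illustrative vocabulary size
for n in (1, 2, 3, 4):           # length of the word sequence being counted
    print(f"{n}-word sequences: {vocab_size ** n:.2e} possible entries")
```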
Language models today use neural networks and do not attempt to learn the exact distribution of words in a corpus of text. Rather, they learn an approximate distribution in a computationally efficient manner.
Language models that use neural networks are able to generalize to unseen instances in the text. Because they rely on the underlying semantic representations, they can assign non-zero probabilities to text they haven’t been exposed to.
Language models that use neural networks are effective in mitigating the curse of dimensionality because they compress the text they’re trained on into a smaller set of parameters. This compression is lossy, which means they can at times assign very low (near-zero) probabilities to text that exists in the training corpus.