
Exploring Transformers with Hugging Face

Hugging Face’s Transformers Library

  • The goal of the Hugging Face Transformers library is to provide a single Python API through which any transformer model can be loaded, trained, fine-tuned and saved.
  • The Hugging Face Transformers library provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. It’s backed by the three most popular deep learning libraries – JAX, PyTorch and TensorFlow.

The pipeline() function

  • The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.
  • Some of the text-related tasks available in the pipeline() function are feature extraction, named entity recognition, sentiment analysis, summarization and text generation.
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
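Calling the pipeline on a piece of text then returns a prediction. A minimal sketch (the first call downloads a default checkpoint, and the exact label and score depend on which checkpoint that is):

```python
from transformers import pipeline

# With no model specified, a default sentiment-analysis
# checkpoint is downloaded on first use.
classifier = pipeline("sentiment-analysis")

# The pipeline handles preprocessing, inference, and postprocessing.
result = classifier("I love using the Transformers library!")
print(result)  # a list of dicts with "label" and "score" keys
```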

Model Hub

  • “Model” is a general term that can mean either an architecture or a checkpoint. An architecture is the specification of the neural network itself (its layers and operations), while a checkpoint is a set of trained weights for a given architecture. For example, BERT is an architecture, while bert-base-uncased is a checkpoint.
  • The Hugging Face Hub is an online platform with over 350k models, 75k datasets, and 150k demo apps, all open source and publicly available, from which models can be downloaded and to which they can be uploaded.

Model Cards

  • Model cards are markdown files that accompany transformer models; they provide essential documentation and are key to discoverability, reproducibility, and sharing in AI research.
  • Model cards contain additional metadata about the model, including (but not limited to) its training parameters, training datasets, performance evaluation results, intended uses, and potential limitations.


Tokenizers

  • Tokenization is an essential step in preprocessing text in Natural Language Processing (NLP) tasks. Tokenizers break down a stream of textual data into words, subwords, or symbols (like punctuation) that can then be converted into numbers or vectors to be processed by algorithms.
  • A tokenizer in a transformer model is responsible for splitting the input text into tokens, mapping each token to an integer and adding additional inputs that may be useful to the model.
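As a simplified, hypothetical illustration of that split-then-map step (real tokenizers use learned subword vocabularies such as WordPiece or BPE rather than whitespace splitting):

```python
# Simplified sketch of tokenization: split text into tokens,
# then map each token to an integer id via a vocabulary.
def tokenize(text):
    # Split on whitespace and separate trailing punctuation.
    tokens = []
    for word in text.lower().split():
        if word[-1] in ".,!?":
            tokens.extend([word[:-1], word[-1]])
        else:
            tokens.append(word)
    return tokens

vocab = {}

def encode(tokens):
    # Map each token to an integer id, growing the vocabulary as needed.
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]

tokens = tokenize("Transformers are great!")
ids = encode(tokens)
print(tokens)  # ['transformers', 'are', 'great', '!']
print(ids)     # [0, 1, 2, 3]
```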

The from_pretrained() method

The from_pretrained() method can be used to load a pretrained transformer model (its counterpart, save_pretrained(), saves one). The AutoTokenizer, AutoProcessor and AutoModel classes allow one to load tokenizers, processors and models respectively for any supported model architecture.

from transformers import AutoModel, AutoTokenizer

# Replace with any checkpoint name from the Hugging Face Hub
checkpoint = 'pretrained-model-you-want'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
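A runnable sketch of the same pattern with a concrete checkpoint (bert-base-uncased is one example; any Hub model id works, and the first call downloads the weights):

```python
from transformers import AutoModel, AutoTokenizer

# bert-base-uncased is one example checkpoint; any Hub model id works.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Tokenize a sentence and run it through the model.
inputs = tokenizer("Hello, Transformers!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```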

Token selection strategies

Decoder models employ various strategies in next token generation, which can be adjusted by the user. These include n-gram penalties, which prevent token sequences of n length from repeating; sampling, which chooses the next token at random from among a collection of likely next tokens; and temperature, which adjusts the predictability of the randomly selected next token, with higher temperatures producing less predictable output.
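The effect of temperature on sampling can be sketched in plain Python with a toy next-token distribution (the logits below are hypothetical numbers; a real model produces logits over a vocabulary of tens of thousands of tokens):

```python
import math
import random

# Toy logits for a 4-token vocabulary (hypothetical numbers).
logits = {"the": 2.0, "a": 1.0, "cat": 0.5, "dog": 0.2}

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing logits by the temperature before the softmax flattens
    # the distribution when temperature > 1 (less predictable picks)
    # and sharpens it when temperature < 1 (more predictable picks).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}

def sample_next_token(logits, temperature=1.0):
    # Sampling: draw the next token at random, weighted by probability.
    probs = softmax_with_temperature(logits, temperature)
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(sample_next_token(logits, temperature=0.7))
```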

When decoder models simply select the most probable next token for their output, it’s called “greedy search.” When the model projects several tokens further into the potential output and chooses the most probable multi-token sequence from among several candidates, it’s known as “beam search.”
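The difference between the two searches can be sketched with a toy lookup table standing in for the model (the prefixes and log-probabilities below are hypothetical, chosen so that the greedy and beam results differ):

```python
# Toy "language model": a table mapping each prefix to the
# log-probabilities of its possible next tokens.
table = {
    (): {"the": -0.4, "a": -0.6},
    ("the",): {"cat": -1.5, "dog": -1.2},
    ("a",): {"cat": -0.1, "dog": -2.0},
}

def greedy_search(table, steps=2):
    # Greedy search: at each step, take the single most probable token.
    seq = ()
    for _ in range(steps):
        seq = seq + (max(table[seq], key=table[seq].get),)
    return seq

def beam_search(table, steps=2, beam_width=2):
    # Beam search: keep the beam_width highest-scoring partial
    # sequences at each step, then return the best complete one.
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), score + logprob)
            for seq, score in beams
            for tok, logprob in table[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

print(greedy_search(table))  # ('the', 'dog')
print(beam_search(table))    # ('a', 'cat')
```

Here greedy search locks in "the" (the most probable first token) and ends with a total log-probability of -1.6, while beam search keeps "a" alive and finds the higher-scoring sequence ("a", "cat") at -0.7.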
