# Transformer

> The neural network architecture that powers modern AI language models, using self-attention mechanisms to process sequences of data in parallel.

Category: Architecture

Source: https://chipp.ai/ai/glossary/transformer

The transformer is the neural network architecture that powers virtually all modern large language models. Introduced in the landmark 2017 paper "Attention Is All You Need" by Google researchers, it replaced recurrence with self-attention, which lets a model process all positions of a sequence in parallel.

Before transformers, language models typically used recurrent neural networks (RNNs), which process text sequentially, one token at a time. Because each step depends on the previous one, training could not be parallelized across a sequence, and information linking distant words had to survive many intermediate steps. Transformers instead process an entire input sequence in parallel, using attention to relate every token to every other token directly.

Key transformer components:

- Self-attention: each token attends to every token in the sequence to build a context-aware representation.
- Multi-head attention: several attention computations run in parallel, each free to capture a different kind of relationship.
- Positional encoding: information about token order added to the input, since attention by itself does not distinguish one ordering of the tokens from another.
- Feed-forward layers: a small network applied to each token's representation independently after attention.
- Layer normalization: rescales activations to stabilize the training of deep networks.
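
As a rough illustration of the self-attention step, the sketch below computes scaled dot-product attention in plain Python. It is a minimal sketch under simplifying assumptions: the function names `softmax` and `self_attention` and the toy vectors are illustrative, and the same vectors serve as queries, keys, and values, whereas a real transformer first passes the input through learned projection matrices.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy input (made-up numbers): 3 tokens, each a 2-dimensional vector.
# Assumption: no learned projections, so x is used as Q, K, and V directly.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of all value vectors. No token's output depends on another token's output, which is why the rows can be computed in parallel.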

The transformer architecture comes in three variants: encoder-only (for example BERT, suited to understanding and classification), decoder-only (for example GPT and Claude, suited to generation), and encoder-decoder (for example T5 and the original transformer, suited to translation and summarization). Modern LLMs predominantly use the decoder-only architecture.
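
One concrete difference between the encoder-only and decoder-only variants is which positions a token may attend to: encoders attend in both directions, while decoders apply a causal mask so a token sees only itself and earlier tokens. The sketch below shows that masking step in isolation; `attention_weights` is a hypothetical helper name and the all-zero scores are placeholder values.

```python
import math

def attention_weights(scores, causal):
    """Convert a square matrix of raw attention scores into weights.

    causal=False: every position attends to every position (encoder style).
    causal=True:  position i attends only to positions j <= i (decoder style).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Masked positions get -inf, which softmax turns into weight 0.
        row = [scores[i][j] if (not causal or j <= i) else float("-inf")
               for j in range(n)]
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Placeholder scores: all equal, so only the mask shapes the result.
scores = [[0.0] * 3 for _ in range(3)]
bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)
```

With equal scores, every row of `bidirectional` spreads its weight evenly over all three positions, while the first row of `causal` puts all of its weight on the first token because later positions are masked out.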

Why transformers dominate AI: they scale efficiently (performance improves predictably with more data and compute), they parallelize well (training on thousands of GPUs simultaneously), they learn rich representations (capturing nuanced language patterns), and they transfer well (pre-trained models adapt to many tasks).

Conversations with LLM-based AI agents run on transformer inference: the model passes the conversation context through its transformer layers to generate the response one token at a time.
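
The token-by-token loop can be sketched as follows. This is an illustration only: `toy_next_token_scores` is a hypothetical stand-in for a real transformer forward pass, the four-entry vocabulary is made up, and the loop always picks the highest-scoring token (greedy decoding), whereas deployed systems often sample from the predicted probabilities instead.

```python
VOCAB = ["<eos>", "hello", "world", "!"]  # made-up toy vocabulary

def toy_next_token_scores(context_ids):
    """Hypothetical stand-in for a transformer forward pass.

    A real model would score every vocabulary entry from the full context;
    this toy simply prefers the entry after the last token.
    """
    preferred = (context_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == preferred else 0.0 for i in range(len(VOCAB))]

def generate(prompt_ids, max_new_tokens=10):
    """Greedy autoregressive decoding: append one token per iteration."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(ids)  # one pass over the whole context
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if VOCAB[next_id] == "<eos>":  # stop at the end-of-sequence token
            break
        ids.append(next_id)  # the new token becomes part of the context
    return ids

result = [VOCAB[i] for i in generate([1])]  # prompt is the single token "hello"
```

Each generated token is appended to the context before the next step, which is why generation stays sequential even though the model reads the existing context in parallel.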

## Related Terms

- [Attention Mechanism](https://chipp.ai/ai/glossary/attention-mechanism.md): A technique in neural networks that allows the model to focus on the most relevant parts of input data when generating each part of the output.
- [Neural Network](https://chipp.ai/ai/glossary/neural-network.md): A computing system inspired by the human brain, consisting of interconnected nodes (neurons) organized in layers that process information and learn patterns.
- [Large Language Model (LLM)](https://chipp.ai/ai/glossary/large-language-model.md): A neural network trained on massive text datasets that can understand and generate human-like language, powering modern AI assistants and agents.
- [Deep Learning](https://chipp.ai/ai/glossary/deep-learning.md): A subset of machine learning using neural networks with many layers (deep networks) to learn complex patterns from large amounts of data.

---

This term is part of the [Chipp AI Glossary](https://chipp.ai/ai/glossary), a reference of AI concepts written for builders and businesses.
