2026-02-07

Core Concepts Behind Modern AI Systems

Summary:

Modern AI systems may look diverse on the surface, but under the hood they rely on a small set of recurring architectural and training ideas. This article distills foundational concepts—ranging from tokenization and decoding to RAG, diffusion models, and LoRA—that every ML engineer should understand to design, debug, and reason about real-world AI systems.

Most modern AI products are built on the same small set of foundational ideas. If you understand these ideas, you’ll start to recognize them everywhere — from large language models to image and video generation systems.

This post is a conceptual summary for readers who want a compact mental model of how today’s AI systems are assembled.


1. Tokenization

Neural networks like large language models (LLMs) cannot work with raw text directly. Instead, text is first broken down into smaller units called tokens.

A tokenizer:

  • Splits text into pieces (tokens)
  • Maps each token to an integer ID
  • Produces a sequence of integers that the model can process

Raw Text
   |
   v
+------------------+
|    Tokenizer     |
+------------------+
   |
   v
["walk", "ing"]          → Tokens
   |
   v
[  5231,   987 ]         → Token IDs

One of the most common tokenization algorithms is Byte Pair Encoding (BPE).

How BPE works

  • Starts with very small units (often bytes or characters)
  • Repeatedly merges the most frequent adjacent pairs
  • Over time, common fragments like "ing" or "tion" become single tokens

For example, the word walking might be split as:

walk + ing

This approach balances vocabulary size, generalization, and efficiency.
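
To make the merge loop concrete, here is a toy sketch of BPE training in Python. It is a deliberate simplification (real tokenizers start from bytes and use heavily optimized counting), and the corpus and merge count below are purely illustrative:

from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # the most frequent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["walking", "walked", "talking"], num_merges=5))

Even on this tiny corpus, frequent fragments fuse first; run long enough on real text, and pieces like "ing" become single vocabulary entries.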

Product intuition

  • Models don’t "see" words; they see pieces of words.
  • Pricing, latency, and context limits are all based on tokens, not characters or sentences.

Why this matters

  • Long prompts and verbose outputs increase cost.
  • Poorly structured text (logs, JSON, markup) can consume tokens faster than expected.
  • Different languages tokenize very differently—this can impact multilingual products.

2. Text Decoding

An LLM does not output text directly. At each step, it produces a probability distribution over the vocabulary for the next token.

Previous Tokens
["The", "cat", "is"]
        |
        v
+------------------------+
|        LLM             |
+------------------------+
        |
        v
Probability Distribution
{ "sleeping": 0.42,
  "running": 0.21,
  "hungry":  0.10,
  ... }
        |
        v
Decoding Strategy
(Greedy / Top-p / Temp)
        |
        v
Next Token  "sleeping"

A decoding algorithm:

  • Chooses one token from that distribution
  • Appends it to the sequence
  • Repeats until a full response is produced

Common decoding strategies

Greedy decoding

  • Always selects the most likely token
  • Useful for deterministic tasks
  • Produces bland or repetitive output for creative tasks

Sampling-based decoding

  • Introduces controlled randomness
  • Improves diversity and expressiveness

A widely used method is top-p (nucleus) sampling:

  • Selects the smallest set of tokens whose probabilities sum to p
  • Samples the next token from that set
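
A minimal version can be sketched in a few lines of Python with NumPy (the toy distribution below is invented for illustration):

import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Nucleus sampling: sample from the smallest set of tokens
    whose probabilities sum to at least p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                  # most likely first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]                         # smallest qualifying set
    weights = probs[nucleus] / probs[nucleus].sum()  # renormalize within it
    return rng.choice(nucleus, p=weights)

# Toy next-token distribution over a 5-token vocabulary.
probs = np.array([0.42, 0.21, 0.10, 0.17, 0.10])
print(top_p_sample(probs, p=0.85))   # samples among the 4 most likely tokens

Temperature is a complementary knob: it reshapes the distribution (by rescaling logits) before top-p truncates it.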

Decoding is a critical design choice that directly shapes model behavior.

Product intuition

  • Decoding is not about accuracy; it's about personality.
  • Low temperature / greedy decoding → cautious, repetitive, safe.
  • Higher temperature / sampling → creative, diverse, but less predictable.

Why this matters

  • Customer support bots usually want boring and correct.
  • Brainstorming or marketing tools want diverse and surprising.

3. Prompt Engineering

Vague prompts usually lead to vague answers.

Prompt engineering is the practice of shaping instructions and context to steer a model’s behavior without modifying its weights.

+----------------------+
|      Prompt          |
|----------------------|
| Instructions         |
| Constraints          |
| Examples (optional)  |
+----------------------+
           |
           v
+----------------------+
|        LLM           |
+----------------------+
           |
           v
Structured / Targeted Output

A strong prompt:

  • Clearly states the task
  • Defines constraints
  • Specifies the expected output format

Common techniques

Few-shot prompting

  • Provide a small number of examples
  • The model imitates the demonstrated style and structure
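
For example, a few-shot prompt for sentiment tagging might look like this (the task and reviews are invented for illustration):

Classify each review as POSITIVE or NEGATIVE.
Answer with the label only.

Review: "The battery lasts all day."
Label: POSITIVE

Review: "It stopped working after a week."
Label: NEGATIVE

Review: "Setup took five minutes."
Label:

The two solved examples demonstrate the format; the model completes the third in the same style.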

Chain-of-thought prompting

  • Ask the model to reason step by step
  • Improves performance on multi-step tasks like math and coding

Prompt engineering is popular because it is fast, cheap, and highly iterative compared to training or fine-tuning.

Product intuition

Prompting is product design.

For many AI products, the prompt is the business logic.

It encodes:

  • Rules
  • Brand voice
  • Safety constraints
  • Output structure

This is why prompt changes often have outsized impact compared to model changes. Prompts are versioned assets.


4. Multi-Step AI Agents

An LLM by itself can only generate text. It cannot:

  • Browse the web
  • Call APIs
  • Run code
  • Interact with external systems

Multi-step agents wrap an LLM in a control loop with access to:

  • Tools
  • Memory
  • External environments

        +----------------+
        |      Goal      |
        +----------------+
                 |
                 v
        +----------------+
        |      LLM       |
        |   (Planner)    |
        +----------------+
                 |
        +--------+--------+
        |                 |
        v                 v
   Call Tool         Generate Text
        |
        v
+--------------------+
| External System    |
| (API / Code / DB)  |
+--------------------+
        |
        v
   Observation
        |
        +-----> back to LLM

The agent:

  1. Plans a step
  2. Calls a tool
  3. Observes the result
  4. Decides what to do next

This loop continues until the agent reaches a goal, exhausts its budget, or determines it cannot make further progress.
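
In code, the control loop is surprisingly small. Here is a schematic sketch in Python; llm, tools, and the action format are placeholders standing in for a real model API, not any particular framework:

def run_agent(llm, tools, goal, max_steps=10):
    """Schematic agent loop: plan -> act -> observe -> repeat."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):                 # the budget is a hard stop
        action = llm(history)                  # model plans the next step
        if action["type"] == "final_answer":   # goal reached: stop
            return action["content"]
        # The model requested a tool call: run it, then feed the
        # observation back into the context for the next iteration.
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": str(result)})
    return "Stopped: step budget exhausted."

Note how the stop conditions (final answer, step budget) are explicit; in production loops they are usually the most carefully engineered part.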

Product intuition

  • Agents trade predictability for capability.
  • More autonomy → more things can go wrong.
  • Tool access introduces security and cost considerations.

Debugging failures often requires replaying the agent's reasoning steps.

  • Agents are best for bounded workflows, not open-ended autonomy.
  • Clear stop conditions and budgets are critical.

5. Retrieval-Augmented Generation (RAG)

A plain LLM answers questions using only what is stored in its weights. This makes it prone to:

  • Hallucinations
  • Outdated knowledge
  • Missing private or proprietary data

Retrieval-Augmented Generation (RAG) pairs an LLM with a retrieval system backed by an external knowledge store.

How RAG works

  • A retriever fetches relevant passages from documents, PDFs, or databases
  • The LLM uses those passages as context to generate an answer

User Query
    |
    v
+----------------+
|   Retriever    |
+----------------+
    |
    v
Relevant Chunks
[Doc A, Doc C]
    |
    v
+----------------+
|      LLM       |
| (Query + Docs) |
+----------------+
    |
    v
Grounded Answer

RAG grounds model outputs in verifiable external evidence, making systems more reliable and auditable.
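
A minimal sketch of the retrieval step, in Python with NumPy, assuming documents and the query have already been embedded into vectors (embedding models, chunking, and vector databases are abstracted away here):

import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose embeddings are most similar
    to the query embedding (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    top = np.argsort(d @ q)[::-1][:k]   # highest-scoring doc indices
    return [docs[i] for i in top]

def build_prompt(question, passages):
    """Ground the model: instruct it to answer only from the context."""
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}")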

Product intuition

RAG is how organizations:

  • Keep proprietary data private.
  • Update knowledge without retraining models.
  • Provide audit trails ("this answer came from these documents").

RAG failures often look like reasoning failures, but are actually retrieval failures.


6. Reinforcement Learning from Human Feedback (RLHF)

The success of ChatGPT was driven in large part by Reinforcement Learning from Human Feedback (RLHF).

RLHF works as follows:

Prompt
  |
  v
+----------------+
|   Base LLM     |
+----------------+
  |
  v
Multiple Outputs
(A, B, C)
  |
  v
+----------------+
| Reward Model   |
| (Human prefs)  |
+----------------+
  |
  v
Scores
(A: 0.9, B: 0.4)
  |
  v
Policy Update
(Higher reward ↑)

  • The model generates multiple candidate responses
  • A separate reward model scores them
  • The training algorithm updates the model’s weights
  • Higher-scoring behaviors become more likely

Why RLHF matters

The reward model is trained using human preference data, often from pairs of responses where annotators choose the better one.

This pushes models toward outputs that humans consistently rate as:

  • More helpful
  • Clearer
  • Safer

Not merely statistically probable.
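
The preference data enters through a simple pairwise objective: the reward model should score the chosen response above the rejected one. A sketch in Python (reward_model is a placeholder for any network that maps a response to a scalar score):

import math

def preference_loss(reward_model, chosen, rejected):
    """Pairwise (Bradley-Terry) preference loss.
    Minimizing it pushes the chosen response's score above the rejected one."""
    margin = reward_model(chosen) - reward_model(rejected)
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

The policy update itself (e.g. PPO) then optimizes the language model against this learned reward.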

Product intuition

RLHF is why models:

  • Refuse certain requests.
  • Explain themselves.
  • Apologize or hedge.

These behaviors are learned preferences—not hard-coded rules.


7. Variational Autoencoders (VAEs)

A Variational Autoencoder (VAE) is a generative model that learns a probability distribution over data.

A VAE consists of:

  • An encoder that maps inputs to a low-dimensional latent space
  • A decoder that reconstructs data from latent vectors

Input Data
   |
   v
+-----------+
| Encoder   |
+-----------+
   |
   v
Latent Space (z)
(mean, variance)
   |
   v
+-----------+
| Decoder   |
+-----------+
   |
   v
Reconstructed / Generated Data

Training and generation

  • Training optimizes a reconstruction objective plus a KL regularization term that keeps the latent space smooth and well-structured
  • New data can be generated by sampling from the latent space
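
Concretely, the training loss combines a reconstruction term with a KL term that pulls the latent distribution toward a standard Gaussian. A per-example sketch in NumPy, where x_hat, mu, and log_var would come from the decoder and encoder:

import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO: reconstruction error plus the KL divergence
    from q(z|x) = N(mu, var) to the prior N(0, I)."""
    recon = np.sum((x - x_hat) ** 2)                            # reconstruction
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)  # KL term
    return recon + kl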

In modern text-to-image and text-to-video systems (e.g. OpenAI’s Sora), VAEs are often used as latent compressors, enabling downstream models to operate efficiently in smaller, structured spaces.

Product intuition

A VAE learns a compressed language for images or videos. Instead of working with pixels, downstream models work with concepts.


8. Diffusion Models

Diffusion models generate data by learning to reverse a gradual noising process.

Training

Real Image
    |
    v
x₀ → x₁ → x₂ → ... → xₜ
(add noise)

  • Start with real samples (such as images)
  • Add noise over many time steps
  • Train a model to predict the noise given:
      • The noisy input
      • The time step
      • Optional conditioning (e.g. text)
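
A useful property: the noisy sample at any step t can be produced in one shot from the clean sample, which keeps training cheap. A NumPy sketch (the noise schedule follows common DDPM conventions; the exact values are illustrative):

import numpy as np

rng = np.random.default_rng(0)

# Noise schedule: alpha_bar[t] is the cumulative product of (1 - beta).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t):
    """Sample x_t ~ q(x_t | x0) directly:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise   # the model learns to predict `noise` from (xt, t)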

Inference

Pure Noise
    |
    v
xₜ → xₜ₋₁ → xₜ₋₂ → ... → x₀
(remove noise)
    |
    v
Generated Image / Video

  • Start from pure noise
  • Iteratively apply the learned denoising process
  • Converge toward a clean sample
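
Schematically, the reverse loop looks like this (reusing the schedule above; predict_noise stands in for a trained denoising network):

def sample(predict_noise, shape):
    """Schematic DDPM reverse loop: denoise from pure noise to a sample."""
    xt = rng.standard_normal(shape)            # start from pure noise
    for t in range(len(betas) - 1, -1, -1):
        eps = predict_noise(xt, t)             # model's noise estimate
        # Remove the predicted noise contribution at this step.
        xt = (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) \
             / np.sqrt(1.0 - betas[t])
        if t > 0:                              # re-inject sampling noise
            xt += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return xt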

Diffusion models power many state-of-the-art image and video generation systems.

Product intuition

Diffusion models are like sculptors:

  • Start with a block of noise.
  • Gradually remove randomness.
  • Shape the output step by step.

9. Low-Rank Adaptation (LoRA)

Large models are general-purpose but often underperform in specialized domains.

Low-Rank Adaptation (LoRA) is an efficient fine-tuning technique that adapts a pre-trained model without updating all of its parameters.

How LoRA works

Frozen Base Weights (W)
        |
        +----------------------+
        |                      |
        v                      v
   Low-Rank A             Low-Rank B
   (trainable)            (trainable)
        \                      /
         \                    /
          +---- Added Delta ---+
                   |
                   v
             Adapted Behavior

  • Base model weights remain frozen
  • Small low-rank trainable matrices are injected into specific layers
  • Domain-specific behavior is learned with far fewer parameters

LoRA enables fast, cheap, and scalable specialization.
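
The core trick fits in a few lines. A NumPy sketch (shapes and the alpha/r scaling follow the LoRA paper; the dimensions here are illustrative):

import numpy as np

d, k, r = 512, 512, 8              # layer shape (d x k), low rank r << d
W = np.random.randn(d, k)          # frozen pre-trained weights
A = np.random.randn(r, k) * 0.01   # trainable, initialized small
B = np.zeros((d, r))               # trainable, initialized to zero
alpha = 16                         # scaling hyperparameter

def adapted_forward(x):
    """Forward pass with the LoRA delta: W x + (alpha / r) * B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

Only A and B are trained: 8,192 values here versus 262,144 in W. Swapping in a different (A, B) pair swaps the specialization while the base model stays untouched.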

Product intuition

LoRA enables:

  • Faster experiments.
  • Lower infrastructure cost.
  • Multiple specialized versions of the same base model.

This is why teams can ship domain-specific AI features without training from scratch.


How these concepts fit together in real products

Most production AI systems are composites, not single models:

  • Prompting + decoding shape behavior.
  • RAG supplies trusted knowledge.
  • Agents orchestrate tools.
  • LoRA customizes domain behavior.

Understanding these building blocks helps product teams:

  • Ask better questions.
  • Diagnose failures faster.
  • Make informed tradeoffs between cost, quality, and speed.

Final Thoughts

These concepts form the backbone of most modern AI systems. They are the recurring building blocks behind today’s most capable AI models — and very likely tomorrow’s.

Nothing you read here should be considered advice or recommendation. Everything is purely and solely for informational purposes.