Core Concepts Behind Modern AI Systems
Most modern AI products are built on the same small set of foundational ideas. If you understand these ideas, you’ll start to recognize them everywhere — from large language models to image and video generation systems.
This post is written as a conceptual summary for readers who want a compact mental model of how today’s AI systems are assembled.
1. Tokenization
Neural networks like large language models (LLMs) cannot work with raw text directly. Instead, text is first broken down into smaller units called tokens.
A tokenizer:
- Splits text into pieces (tokens)
- Maps each token to an integer ID
- Produces a sequence of integers that the model can process
Raw Text
|
v
+------------------+
| Tokenizer |
+------------------+
|
v
["walk", "ing"] → Tokens
|
v
[ 5231, 987 ] → Token IDs
One of the most common tokenization algorithms is Byte Pair Encoding (BPE).
How BPE works
- Starts with very small units (often bytes or characters)
- Repeatedly merges the most frequent adjacent pairs
- Over time, common fragments like "ing" or "tion" become single tokens
For example, the word walking might be split as:
walk + ing
This approach balances vocabulary size, generalization, and efficiency.
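To make the merge loop concrete, here is a toy BPE trainer in Python. It is a minimal sketch: real tokenizers start from bytes, train on enormous corpora, and handle edge cases this version ignores.

from collections import Counter

# Toy BPE: learn merges from a tiny corpus of words.
def learn_bpe(words, num_merges):
    vocab = Counter(tuple(w) for w in words)   # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        # Replace the winning pair with a single merged symbol everywhere.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["walking", "talking", "walked"], num_merges=8))
# Frequent fragments such as "alk" and "ing" emerge as single tokens.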
Product intuition
- Models don’t "see words", they see pieces of words.
- Pricing, latency, and context limits are all based on tokens, not characters or sentences.
Why this matters
- Long prompts and verbose outputs increase cost.
- Poorly structured text (logs, JSON, markup) can consume tokens faster than expected.
- Different languages tokenize very differently—this can impact multilingual products.
2. Text Decoding
An LLM does not output text directly. At each step, it produces a probability distribution over the vocabulary for the next token.
Previous Tokens
["The", "cat", "is"]
|
v
+------------------------+
| LLM |
+------------------------+
|
v
Probability Distribution
{ "sleeping": 0.42,
"running": 0.21,
"hungry": 0.10,
... }
|
v
Decoding Strategy
(Greedy / Top-p / Temp)
|
v
Next Token → "sleeping"
A decoding algorithm:
- Chooses one token from that distribution
- Appends it to the sequence
- Repeats until a full response is produced
Common decoding strategies
Greedy decoding
- Always selects the most likely token
- Useful for deterministic tasks
- Produces bland or repetitive output for creative tasks
Sampling-based decoding
- Introduces controlled randomness
- Improves diversity and expressiveness
A widely used method is top-p (nucleus) sampling:
- Selects the smallest set of tokens whose cumulative probability reaches p
- Samples the next token from that set
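Here is a minimal sketch of nucleus sampling, assuming NumPy; the vocabulary and probabilities are illustrative, echoing the diagram above.

import numpy as np

# Nucleus (top-p) sampling over a toy next-token distribution.
def top_p_sample(probs, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]              # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest set reaching p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()       # renormalize inside nucleus
    return rng.choice(keep, p=kept)

vocab = ["sleeping", "running", "hungry", "outside"]
probs = np.array([0.42, 0.21, 0.10, 0.27])
print(vocab[top_p_sample(probs, p=0.7)])         # samples from the top tokens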
Decoding is a critical design choice that directly shapes model behavior.
Product intuition
- Decoding is not about accuracy - it's about personality.
- Low temperature / greedy decoding → cautious, repetitive, safe.
- Higher temperature / sampling → creative, diverse, but less predictable.
Why this matters
- Customer support bots usually want boring and correct.
- Brainstorming or marketing tools want diverse and surprising.
3. Prompt Engineering
Vague prompts usually lead to vague answers.
Prompt engineering is the practice of shaping instructions and context to steer a model’s behavior without modifying its weights.
+----------------------+
| Prompt |
|----------------------|
| Instructions |
| Constraints |
| Examples (optional) |
+----------------------+
|
v
+----------------------+
| LLM |
+----------------------+
|
v
Structured / Targeted Output
A strong prompt:
- Clearly states the task
- Defines constraints
- Specifies the expected output format
Common techniques
Few-shot prompting
- Provide a small number of examples
- The model imitates the demonstrated style and structure
Chain-of-thought prompting
- Ask the model to reason step by step
- Improves performance on multi-step tasks like math and coding
Prompt engineering is popular because it is fast, cheap, and highly iterative compared to training or fine-tuning.
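As a concrete example, a few-shot prompt is often assembled from a simple template. The classification task and examples below are hypothetical:

# Assemble a few-shot prompt; the task and examples are illustrative.
EXAMPLES = [
    ("The delivery was late and the box was damaged.", "negative"),
    ("Setup took two minutes and it just worked.", "positive"),
]

def build_prompt(review):
    shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in EXAMPLES)
    return (
        "Classify the sentiment of each review as positive or negative.\n"
        "Answer with a single word.\n\n"            # task + constraint
        f"{shots}\n"                                 # few-shot examples
        f"Review: {review}\nSentiment:"              # the new input
    )

print(build_prompt("Battery died after one week."))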
Product intuition
Prompting is product design.
For many AI products, the prompt is the business logic.
It encodes:
- Rules
- Brand voice
- Safety constraints
- Output structure
This is why prompt changes often have outsized impact compared to model changes, and why prompts should be treated as versioned assets.
4. Multi-Step AI Agents
An LLM by itself can only generate text. It cannot:
- Browse the web
- Call APIs
- Run code
- Interact with external systems
Multi-step agents wrap an LLM in a control loop with access to:
- Tools
- Memory
- External environments
+----------------+
| Goal |
+----------------+
|
v
+----------------+
| LLM |
| (Planner) |
+----------------+
|
+--------+--------+
| |
v v
Call Tool Generate Text
|
v
+--------------------+
| External System |
| (API / Code / DB) |
+--------------------+
|
v
Observation
|
+-----> back to LLM
The agent:
- Plans a step
- Calls a tool
- Observes the result
- Decides what to do next
This loop continues until the agent reaches a goal, exhausts its budget, or determines it cannot make further progress.
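A skeletal version of that loop is sketched below. The planner interface and tool registry are assumptions for illustration; real agent frameworks add structured tool schemas, retries, and richer memory.

# Plan -> act -> observe loop with a hard step budget.
def run_agent(plan, tools, goal, max_steps=10):
    history = f"Goal: {goal}"
    for _ in range(max_steps):                       # budget: hard cap
        action = plan(history)                       # LLM decides next step
        if action["type"] == "final":                # stop condition
            return action["text"]
        observation = tools[action["tool"]](**action["args"])
        history += f"\nObservation: {observation}"   # feed result back
    return "Step budget exhausted."

# Toy demo: a scripted "planner" stands in for a real LLM call.
def scripted_plan(history):
    if "Observation" not in history:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "text": history.splitlines()[-1]}

print(run_agent(scripted_plan, {"add": lambda a, b: a + b}, "Add 2 and 3"))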
Product intuition
- Agents trade predictability for capability.
- More autonomy → more things can go wrong.
- Tool access introduces security and cost considerations.
- Debugging failures often requires replaying the agent's reasoning steps.
- Agents are best for bounded workflows, not open-ended autonomy.
- Clear stop conditions and budgets are critical.
5. Retrieval-Augmented Generation (RAG)
A plain LLM answers questions using only what is stored in its weights. This makes it prone to:
- Hallucinations
- Outdated knowledge
- Missing private or proprietary data
Retrieval-Augmented Generation (RAG) pairs an LLM with a retrieval system backed by an external knowledge store.
How RAG works
- A retriever fetches relevant passages from documents, PDFs, or databases
- The LLM uses those passages as context to generate an answer
User Query
|
v
+----------------+
| Retriever |
+----------------+
|
v
Relevant Chunks
[Doc A, Doc C]
|
v
+----------------+
| LLM |
| (Query + Docs) |
+----------------+
|
v
Grounded Answer
RAG grounds model outputs in verifiable external evidence, making systems more reliable and auditable.
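The sketch below shows the retrieve-then-generate flow. A deliberately naive bag-of-words retriever stands in for the embedding model and vector store used in practice, and the documents are made up.

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bags of words.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]                                # top-k relevant chunks

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refund requests require the original receipt.",
]
chunks = retrieve("how long do refunds take", docs)
# The retrieved chunks become context for the LLM prompt.
prompt = ("Answer using only the context below.\n\n"
          + "\n".join(f"- {c}" for c in chunks)
          + "\n\nQuestion: How long do refunds take?")
print(prompt)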
Product intuition
RAG is how organizations:
- Keep proprietary data private.
- Update knowledge without retraining models.
- Provide audit trails ("this answer came from these documents").
RAG failures often look like reasoning failures, but are actually retrieval failures.
6. Reinforcement Learning from Human Feedback (RLHF)
The success of ChatGPT was driven in large part by Reinforcement Learning from Human Feedback (RLHF).
RLHF works as follows:
Prompt
|
v
+----------------+
| Base LLM |
+----------------+
|
v
Multiple Outputs
(A, B, C)
|
v
+----------------+
| Reward Model |
| (Human prefs) |
+----------------+
|
v
Scores
(A: 0.9, B: 0.4)
|
v
Policy Update
(Higher reward ↑)
- The model generates multiple candidate responses
- A separate reward model scores them
- The training algorithm updates the model’s weights
- Higher-scoring behaviors become more likely
Why RLHF matters
The reward model is trained using human preference data, often from pairs of responses where annotators choose the better one.
This pushes models toward outputs that humans consistently rate as:
- More helpful
- Clearer
- Safer
Not merely statistically probable.
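The reward model at the heart of this process is commonly trained with a pairwise preference loss. Here is a minimal sketch, assuming PyTorch and treating responses as pre-computed feature vectors:

import torch
import torch.nn.functional as F

# Pairwise (Bradley-Terry style) loss: the preferred response should
# score higher than the rejected one.
def preference_loss(reward_model, chosen, rejected):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy demo: a linear reward model over 8-dim response "encodings".
reward_model = torch.nn.Linear(8, 1)
chosen, rejected = torch.randn(4, 8), torch.randn(4, 8)
loss = preference_loss(reward_model, chosen, rejected)
loss.backward()   # gradients push chosen scores above rejected ones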
Product intuition
RLHF is why models:
- Refuse certain requests.
- Explain themselves.
- Apologize or hedge.
These behaviors are learned preferences—not hard-coded rules.
7. Variational Autoencoders (VAEs)
A Variational Autoencoder (VAE) is a generative model that learns a probability distribution over data.
A VAE consists of:
Input Data
|
v
+-----------+
| Encoder |
+-----------+
|
v
Latent Space (z)
(mean, variance)
|
v
+-----------+
| Decoder |
+-----------+
|
v
Reconstructed / Generated Data
- An encoder that maps inputs to a low-dimensional latent space
- A decoder that reconstructs data from latent vectors
Training and generation
- Training optimizes a reconstruction objective plus a regularization term that keeps the latent space smooth and well-structured
- New data can be generated by sampling from the latent space
In modern text-to-image and text-to-video systems (e.g. OpenAI’s Sora), VAEs are often used as latent compressors, enabling downstream models to operate efficiently in smaller, structured spaces.
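A minimal VAE sketch in PyTorch, with illustrative sizes; production systems use convolutional encoders and decoders and more careful loss weighting:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(data_dim, 2 * latent_dim)  # -> mean, log-var
        self.decoder = nn.Linear(latent_dim, data_dim)
        self.latent_dim = latent_dim

    def forward(self, x):
        mean, log_var = self.encoder(x).chunk(2, dim=-1)
        # Reparameterization trick: z = mean + sigma * noise.
        z = mean + torch.exp(0.5 * log_var) * torch.randn_like(mean)
        recon = self.decoder(z)
        # Loss = reconstruction error + KL divergence to the N(0, I) prior.
        recon_loss = F.mse_loss(recon, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
        return recon, recon_loss + kl

model = VAE()
recon, loss = model(torch.rand(8, 784))                    # a toy batch
loss.backward()
samples = model.decoder(torch.randn(4, model.latent_dim))  # generation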
Product intuition
A VAE learns a compressed language for images or videos. Instead of working with pixels, downstream models work with concepts.
8. Diffusion Models
Diffusion models generate data by learning to reverse a gradual noising process.
Training
Real Image
|
v
x₀ → x₁ → x₂ → ... → xₜ
(add noise)
- Start with real samples (such as images)
- Add noise over many time steps
- Train a model to predict the noise given:
- The noisy input
- The time step
- Optional conditioning (e.g. text)
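A single simplified training step, sketched in PyTorch; the cosine noise schedule and the tiny denoiser are stand-ins for the schedules and U-Net or transformer backbones used in practice:

import torch

def diffusion_training_step(denoiser, x0, num_steps=1000):
    t = torch.randint(0, num_steps, (x0.shape[0],))         # random timestep
    # Simplified cosine schedule: how much signal survives at step t.
    alpha_bar = torch.cos(t / num_steps * torch.pi / 2).view(-1, 1) ** 2
    noise = torch.randn_like(x0)
    # Forward process: mix the clean sample with noise.
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    predicted = denoiser(x_t, t)                  # model predicts the noise
    return ((predicted - noise) ** 2).mean()      # simple MSE objective

class TinyDenoiser(torch.nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = torch.nn.Linear(dim + 1, dim)  # input: x_t plus timestep
    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t.float().view(-1, 1)], dim=-1))

loss = diffusion_training_step(TinyDenoiser(), torch.randn(8, 32))
loss.backward()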
Inference
Pure Noise
|
v
xₜ → xₜ₋₁ → xₜ₋₂ → ... → x₀
(remove noise)
|
v
Generated Image / Video
- Start from pure noise
- Iteratively apply the learned denoising process
- Converge toward a clean sample
Diffusion models power many state-of-the-art image and video generation systems.
Product intuition
Diffusion models are like sculptors:
- Start with a block of noise.
- Gradually remove randomness.
- Shape the output step by step.
9. Low-Rank Adaptation (LoRA)
Large models are general-purpose but often underperform in specialized domains.
Low-Rank Adaptation (LoRA) is an efficient fine-tuning technique that adapts a pre-trained model without updating all of its parameters.
How LoRA works
Frozen Base Weights (W)
|
+----------------------+
| |
v v
Low-Rank A Low-Rank B
(trainable) (trainable)
\ /
\ /
+---- Added Delta ---+
|
v
Adapted Behavior
- Base model weights remain frozen
- Small low-rank trainable matrices are injected into specific layers
- Domain-specific behavior is learned with far fewer parameters
LoRA enables fast, cheap, and scalable specialization.
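A minimal LoRA layer in PyTorch; the rank, scaling, and layer sizes are illustrative:

import torch
import torch.nn as nn

# Wrap a frozen linear layer with a trainable low-rank update B @ A.
class LoRALinear(nn.Module):
    def __init__(self, base, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # freeze W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # no-op at init
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T + scale * (x A^T) B^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 8,192 LoRA parameters vs 262,656 in the base layer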
Product intuition
LoRA enables:
- Faster experiments.
- Lower infrastructure cost.
- Multiple specialized versions of the same base model.
This is why teams can ship domain-specific AI features without training from scratch.
How these concepts fit together in real products
Most production AI systems are composites, not single models:
- Prompting + decoding shape behavior.
- RAG supplies trusted knowledge.
- Agents orchestrate tools.
- LoRA customizes domain behavior.
Understanding these building blocks helps product teams:
- Ask better questions.
- Diagnose failures faster.
- Make informed tradeoffs between cost, quality, and speed.
Final Thoughts
These concepts form the backbone of most modern AI systems. They are the recurring building blocks behind today’s most capable AI models — and very likely tomorrow's.