Why are Transformers replacing CNNs?
Why does a Transformer classify this cat as a cat… while a ResNet calls it a macaw?
In this video we break down one of the biggest shifts in computer vision: why Transformers are replacing Convolutional Neural Networks (CNNs), even though CNNs were designed for images and Transformers for language.
We’ll compare convolution vs self-attention, explore CNNs’ inductive biases (locality, translation invariance, hierarchical features), and see why self-attention is strictly more expressive than convolution. You’ll also learn how attention can exactly implement convolutional kernels using relative positional encodings.
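To make the "attention can implement convolution" claim concrete, here is a toy NumPy sketch: a simplified 1D version of the paper's 2D, multi-head construction. One hard-attention head per kernel offset stands in for what relative positional encodings achieve in a real Transformer; the sequence and kernel values are illustrative.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1D convolution (cross-correlation) of a sequence with a kernel."""
    k = len(kernel)
    return np.array([np.dot(kernel, x[i:i + k]) for i in range(len(x) - k + 1)])

def conv_as_attention(x, kernel):
    """Reproduce the same convolution with one hard-attention head per
    kernel offset: head d attends deterministically to position i + d
    (a pattern realizable with relative positional encodings), and the
    heads are combined using the kernel weights."""
    n, k = len(x), len(kernel)
    out = np.zeros(n - k + 1)
    for d in range(k):                      # one "head" per relative offset
        A = np.zeros((n - k + 1, n))        # attention matrix of head d
        for i in range(n - k + 1):
            A[i, i + d] = 1.0               # attend only to position i + d
        out += kernel[d] * (A @ x)          # value projection = kernel weight
    return out

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
kernel = np.array([0.25, 0.5, 0.25])
print(conv1d(x, kernel))
print(conv_as_attention(x, kernel))  # identical outputs
```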
📚 Resources:
- On the Relationship between Self-Attention and Convolutional Layers: https://arxiv.org/abs/1911.03584
- Backpropagation Applied to Handwritten Zipcode Recognition: http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf
- AlexNet (the paper that popularized CNNs in deep learning): https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
- The Transformer: https://arxiv.org/abs/1706.03762
00:00 Intro
01:30 The convolution operation
03:34 Convolutional Neural Networks (CNNs)
05:51 The inductive bias in CNNs
07:22 Self-attention
10:39 Self-attention can implement convolutions
14:17 Computational power & multi-modality
16:03 ChatGPT can be funny
I asked them to show me their RAG pipeline...
RAG (Retrieval-Augmented Generation) is a widely adopted technique that gives LLMs access to external documents. We briefly discuss its roots in Information Retrieval and the transition from sparse to dense retrieval.
Continua AI shares how they leverage HyDE (a query augmentation technique) to enhance their RAG pipeline in the context of group conversations / social AI.
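For intuition, here is a minimal sketch of the HyDE idea: answer the query first, then retrieve with the answer. The generate_hypothetical_doc stub and the bag-of-words embedding are placeholders for illustration only; a real pipeline would call an LLM and a dense encoder.

```python
import numpy as np
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words embedding; a real pipeline would use a dense encoder."""
    counts = Counter(text.lower().split())
    v = np.array([counts[w] for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def generate_hypothetical_doc(query):
    """Stand-in for an LLM call: HyDE asks a model to *answer* the query,
    producing a hypothetical document that lives closer to real documents
    in embedding space than the short query does."""
    return query + " retrieval augmented generation gives models access to documents"

docs = [
    "retrieval augmented generation gives language models access to external documents",
    "convolutional networks exploit locality in images",
]
vocab = sorted({w for d in docs for w in d.lower().split()})

query = "how does rag work"
hyde_doc = generate_hypothetical_doc(query)

# Retrieve by cosine similarity against the hypothetical document's embedding
q_emb = embed(hyde_doc, vocab)
scores = [q_emb @ embed(d, vocab) for d in docs]
best = docs[int(np.argmax(scores))]
print(best)
```

Note that the raw query shares almost no words with the target document, while the hypothetical answer does; that vocabulary (or, with dense encoders, semantic) overlap is the whole trick.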
📖 HyDE paper: https://arxiv.org/abs/2212.10496
🔗 Continua AI: https://continua.ai/
🎤 Interview with David Petrou, CEO of Continua: https://www.patreon.com/posts/interview-with-143106016
🔗 Olga Dorabiala on LinkedIn: https://www.linkedin.com/in/olga-dorabiala-140930151/
00:00 What is RAG?
02:02 RAG is rooted in Information Retrieval
03:50 RAG challenges
05:44 Continua AI uses HyDE for their RAG pipeline
Transformers & Diffusion LLMs: What's the connection?
Diffusion-based LLMs are a new paradigm for text generation; they progressively refine gibberish into a coherent response. But what's their connection to Transformers?
In this video, I unpack how Transformers evolved from a simple machine translation tool into the universal backbone of modern AI — powering everything from auto-regressive models like GPT to diffusion-based models like LLaDA.
We’ll go step-by-step through:
• How the Transformer architecture actually works (encoder, decoder, attention)
• Why attention replaced recurrence in natural language processing
• How GPT training differs from diffusion-based text generation
• How BERT’s masked language modeling inspired diffusion LLMs
• A concrete walkthrough of LLaDA’s masked diffusion process
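To preview the LLaDA walkthrough, here is a toy sketch of confidence-based iterative unmasking. The model function is a stub standing in for a masked-prediction Transformer; the token names and confidence scores are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = ["diffusion", "models", "refine", "noise", "into", "text"]
MASK = "<mask>"

def model(tokens):
    """Stub for a masked-prediction Transformer: for each masked position,
    return a 'prediction' and a confidence score. A real model (e.g. LLaDA)
    predicts all masked tokens in parallel from the full context."""
    preds, confs = {}, {}
    for i, t in enumerate(tokens):
        if t == MASK:
            preds[i] = TARGET[i]                 # stand-in prediction
            confs[i] = rng.uniform(0.5, 1.0)     # stand-in confidence
    return preds, confs

# Start from an all-mask sequence and unmask the most confident token per step
tokens = [MASK] * len(TARGET)
steps = 0
while MASK in tokens:
    preds, confs = model(tokens)
    i = max(confs, key=confs.get)   # keep only the highest-confidence prediction
    tokens[i] = preds[i]
    steps += 1
print(tokens, steps)
```

Real samplers unmask many tokens per step, which is where the speed advantage over one-token-at-a-time decoding comes from.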
If you’re new here, check out my previous videos for an intuition-driven introduction to diffusion models and how physical diffusion inspired them: https://youtube.com/playlist?list=PL4bm2lr9UVG3SN79Y6WBe4OOlEiO88vie&si=RcTREWUyVSAZRriv
📚 Free slide deck: https://patreon.com/juliaturc
📚 Papers:
• Original GPT: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
• BERT: https://arxiv.org/abs/1810.04805
• LLaDA: https://arxiv.org/abs/2502.09992
▶️ My previous video on Transformers: https://youtu.be/LE3NfEULV6k?si=SAaHbw6jD14nc7IM
00:00 Intro
01:25 The Transformer origin story
03:52 The alignment problem & attention
06:26 The architecture: encoder vs decoder
11:25 Auto-regressive LLMs & GPT
16:09 Text classification & BERT
18:51 Diffusion LLMs & LLaDA
24:17 Outro
Text diffusion: A new paradigm for LLMs
Text diffusion is a new paradigm for LLMs. As opposed to mainstream auto-regressive models like GPT, Claude or Gemini (which predict one token at a time), diffusion-based LLMs draft an entire response and refine it progressively. This can yield up to 10x faster inference.
Models like Gemini Diffusion, Mercury Coder from Inception Labs and Seed Diffusion from ByteDance are already competitive on coding benchmarks.
Inspired by physical diffusion, such models make use of Markov chains to model data generation as a particle hopping through discrete states.
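As a tiny illustration of that idea, here is a sketch of an absorbing-state Markov chain over discrete tokens, in the spirit of D3PM's forward process. The token ids and masking rate are chosen arbitrarily for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 0
tokens = np.array([5, 9, 2, 7, 4])   # toy token ids; 0 is reserved for [MASK]

def forward_step(x, beta):
    """One step of an absorbing-state Markov chain: each token independently
    jumps to the [MASK] state with probability beta, and once masked it
    stays masked (the absorbing state)."""
    jump = rng.random(x.shape) < beta
    return np.where(jump, MASK_ID, x)

x = tokens.copy()
for t in range(20):                   # after enough steps, (almost) everything is masked
    x = forward_step(x, beta=0.3)
print(x)
```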
📖 Papers:
Full reading list: https://www.patreon.com/c/JuliaTurc
D3PM: https://arxiv.org/abs/2107.03006
LLaDA: https://arxiv.org/abs/2502.09992
Scaling up Masked Diffusion Models on Text: https://arxiv.org/abs/2410.18514
▶️ The physics behind diffusion models: https://youtu.be/R0uMcXsfo2o?si=OqdGg4TPefSNTK3t
00:00 Intro
01:04 Auto-regressive vs diffusion LLMs
02:06 Why bother with diffusion for text?
06:30 The probability landscape
07:57 Diffusion in latent embedding space
11:00 Diffusion in token embedding space
12:13 Diffusion in text token space
13:49 Markov chains
16:46 Paper study: D3PM
19:42 Paper study: LLaDA
22:30 Evaluation
Hierarchical Reasoning Model: Substance or Hype?
📚 Free resources (reading list + visuals): https://www.patreon.com/c/JuliaTurc
📃 HRM paper: https://arxiv.org/abs/2506.21734
▶️ Yacine's YouTube channel: https://www.youtube.com/@deeplearningexplained
In this video, we dive into the Hierarchical Reasoning Model (HRM), a new architecture from Sapient Intelligence that challenges scaling as the only way to advance AI. With only 27M parameters, 1000 training examples, and no pretraining, HRM still manages to place on the notoriously difficult ARC-AGI leaderboard, right next to models from OpenAI and Anthropic.
Together with Yacine Mahdid (neuroscience researcher & ML practitioner), we’ll explore:
• Why vanilla Transformers plateau on tasks like Sudoku and Maze solving
• How latent recurrence and hierarchical loops give HRM more reasoning depth
• The neuroscience inspiration (theta–gamma coupling in the hippocampus 🧠)
• HRM’s controversial evaluation on ARC-AGI: was it a breakthrough or bending the rules?
• What this means for the future of reasoning in AI models
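As a structural sketch only (the toy tanh updates below are not HRM's actual equations), the hierarchical loop looks roughly like this: a fast low-level module takes several steps per single update of a slow high-level module, giving the network more effective depth per parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
Wl = rng.normal(size=(4, 4)) * 0.1   # toy weights for the fast, low-level module
Wh = rng.normal(size=(4, 4)) * 0.1   # toy weights for the slow, high-level module

def low_step(zL, zH, x):
    return np.tanh(Wl @ zL + zH + x)   # fast module sees the input + slow state

def high_step(zH, zL):
    return np.tanh(Wh @ zH + zL)       # slow module updates from the fast module's result

x = rng.normal(size=4)
zL, zH = np.zeros(4), np.zeros(4)
for n in range(8):            # slow, high-level cycles
    for t in range(4):        # fast, low-level steps within each cycle
        zL = low_step(zL, zH, x)
    zH = high_step(zH, zL)    # H updates once per L-cycle: two timescales
print(zH)
```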
Timestamps:
00:00 Introducing HRM
01:23 Why Sudoku breaks Transformers
03:07 Recurrence via Chain-of-Thought
04:22 HRM: bird's eye view
06:30 Latent recurrence
08:23 The neuroscience backing
11:43 The H and L modules
12:32 Backprop-through-time approximation
13:48 The outer loop
19:31 Training data augmentation
22:59 Evaluation on Sudoku
24:07 Evaluation on ARC-AGI
The physics behind diffusion models
Full reading list: https://www.patreon.com/posts/physics-behind-136741238
Diffusion models build on the same mathematical framework as physical diffusion. In this video, we get to the core of the connection between the physics of motion and generative AI.
Topics covered:
• The intuition of probability landscapes (data as peaks, noise as valleys)
• Forward diffusion: how real data is gradually noised into chaos
• Brownian motion, Wiener processes, and the physics of particle motion
• Stochastic differential equations (SDEs) and the noise schedule
• Training a score function model (a “compass” in the probability landscape)
• Reverse diffusion and Anderson’s reverse SDE (sampling from noise to data)
• Probability flow ODEs for faster, deterministic sampling
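As a small companion to the forward-diffusion discussion, here is the closed-form DDPM noising step in NumPy, using the linear beta schedule from the DDPM paper; the toy data vector is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule: the betas control how fast data diffuses into noise
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def noise_to_t(x0, t):
    """Closed-form forward diffusion: sample x_t directly from x_0
    via x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.array([2.0, -1.0, 0.5])
print(noise_to_t(x0, 10))    # early: still close to the data
print(noise_to_t(x0, 999))   # late: essentially pure Gaussian noise
```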
🔗 Main resources:
• DDPM: Denoising Diffusion Probabilistic Models (https://arxiv.org/abs/2006.11239)
• Score-Based Generative Modeling through Stochastic Differential Equations (https://arxiv.org/abs/2011.13456)
00:00 Intro
01:06 Diffusion as a time-variant probability landscape
04:03 Where diffusion fits in the life of a model
04:34 Forward diffusion (training data generation)
06:25 The physics of diffusion
08:23 The forward SDE (Stochastic Differential Equation)
10:24 Case study: DDPM and noise schedules
13:17 The ML model as a local compass
14:43 Reverse diffusion and the reverse SDE
16:15 Samplers
17:27 Probability-flow ODE (Ordinary Differential Equation)
19:26 Outro
Reverse-engineering GGUF | Post-Training Quantization
The first comprehensive explainer for the GGUF quantization ecosystem.
GGUF quantization is currently the most popular tool for Post-Training Quantization. GGUF itself is a binary file format for quantized models, used by GGML (a lean tensor library that serves as a PyTorch alternative) and llama.cpp (an LLM inference engine built on top of GGML).
Due to its ad-hoc open-source nature, GGUF is poorly documented and misunderstood. Currently, information is scattered across Reddit threads and GitHub pull requests.
📌 Main topics covered in this video:
- The ecosystem: GGML, llama.cpp, GGUF
- Legacy quants vs K-quants vs I-quants
- The importance matrix
- Mixed precision (_S, _M, _L, _XL variants)
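To illustrate the "type 0" idea in the simplest possible terms, here is a sketch of per-block, scale-only 4-bit quantization. It mirrors the spirit of GGUF's Q4_0 but is not its exact binary layout or rounding scheme; block size and mapping range are simplified.

```python
import numpy as np

def quantize_blocks(x, block_size=32):
    """Simplified 'type 0' block quantization (in the spirit of GGUF's Q4_0):
    each block of 32 weights stores one float scale d plus 4-bit integers q,
    reconstructed as x ~ d * q. Real GGUF packs these into a binary layout."""
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0   # map into [-7, 7]
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
q, scales = quantize_blocks(w)
w_hat = dequantize_blocks(q, scales)
print(np.max(np.abs(w - w_hat)))   # error bounded by half a quantization step
```

"Type 1" quants add a per-block minimum (x ~ d * q + m), which is what the video contrasts against this scale-only scheme.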
If you enjoyed this video, watch my entire series on model quantization: https://www.youtube.com/playlist?list=PL4bm2lr9UVG0HvePBXvsceO4yuLC8HhUh
📬 Have feedback or spotted an error? Contribute to the GitHub repo or leave a comment!
https://github.com/iuliaturc/gguf-docs
00:00 Intro
01:36 The stack: GGML, llama.cpp, GGUF
04:05 End-to-end workflow
05:29 Overview: Legacy, K-quants, I-quants
06:03 Legacy quants (Type 0, Type 1)
10:57 K-quants
13:43 I-quants
17:42 Importance Matrix
22:51 Recap
23:35 Mixed precision (_S, _M, _L, _XL)
Training models with only 4 bits | Fully-Quantized Training
Can you really train a large language model in just 4 bits? In this video, we explore the cutting edge of model compression: fully quantized training in FP4 (4-bit floating point). While quantization has traditionally focused on inference, new research pushes the limits of training efficiency — reducing memory, compute, and cost.
🧠 We cover:
✅ NVIDIA TensorCores for mixed precision training
✅ Micro-scaling (MX) data formats
✅ Modeling tricks for 4-bit gradients (e.g. Stochastic Rounding)
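Here is a quick sketch of stochastic rounding, the trick that keeps tiny gradient updates from being deterministically rounded away to zero in low-precision training; the grid (integers) and update size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x):
    """Round x down or up at random, with probability proportional to the
    distance to each neighbor, so the rounding is unbiased: E[SR(x)] = x.
    Round-to-nearest would map a small update like 0.1 to 0 every time."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

x = np.full(100_000, 0.1)          # a small update, below half a grid step
print(stochastic_round(x).mean())  # close to 0.1 on average
```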
📎 Resources:
🔵 Main paper: https://arxiv.org/abs/2505.19115
🔵 US congressional report on DeepSeek: https://selectcommitteeontheccp.house.gov/sites/evo-subsites/selectcommitteeontheccp.house.gov/files/evo-media-document/DeepSeek%20Final.pdf
🔵 Slide deck and full reading list: https://www.patreon.com/c/JuliaTurc
Watch the entire quantization series here: https://youtube.com/playlist?list=PL4bm2lr9UVG0HvePBXvsceO4yuLC8HhUh&si=xLu7vxMfNdJxkB0S
00:00 Intro
01:00 Motivation (training is expensive)
03:06 Mixed precision
05:40 Hardware support: FP4 in NVIDIA Blackwell
13:51 Microscaling formats (MXFP4 & NVFP4)
17:45 Why not INT4?
19:51 Modeling tricks: Stochastic Rounding
22:26 Outro
The myth of 1-bit LLMs | Quantization-Aware Training
Are 1-bit LLMs the future of efficient AI? Or just a catchy Microsoft metaphor? In this video, we break down BitNet, the so-called “1-bit LLM” that isn’t really 1-bit, yet still delivers massive speed and memory gains through extreme quantization.
🔍 What you’ll learn:
• What fractional (1.58) bits are
• How BitNet works under the hood (BitLinear, ELUT, TL1/TL2)
• The role of quantization-aware training (QAT) and the Straight-Through Estimator (STE)
• Optimizations for ternary matrix multiplication
• How 1-bit LLMs scale with parameter count
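The "1.58 bits" comes from rounding each weight to one of three values, since log2(3) ≈ 1.585. Here is a sketch of the absmean ternarization described in the BitNet b1.58 paper; the toy weight matrix is arbitrary.

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """BitNet b1.58-style weight quantization: scale by the mean absolute
    value, then round every weight to {-1, 0, +1}."""
    gamma = np.abs(W).mean() + eps          # absmean scale
    Wq = np.clip(np.round(W / gamma), -1, 1)
    return Wq, gamma

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
Wq, gamma = ternarize(W)
print(Wq)   # every entry is -1, 0, or +1
# At inference, x @ (Wq * gamma) needs only additions/subtractions plus one scale,
# which is what makes ternary matrix multiplication so cheap.
```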
📄 Main paper: https://arxiv.org/abs/2402.17764
Get my full paper reading list from here 👉 https://www.patreon.com/posts/130059217
👉 Watch the full Model Quantization Series: https://youtube.com/playlist?list=PL4bm2lr9UVG0HvePBXvsceO4yuLC8HhUh&si=Wd5vK6B2HQNAL67J
00:00 Intro
01:05 Inspiration and motivation
05:20 BitNet model architecture
10:21 Quantization-Aware Training
15:21 Storing fractional bits: bitpacking & ELUT
18:12 Open-weights models on Hugging Face
19:52 Ternary matrix multiplication
21:20 Demo & evaluation
23:59 Outro
How LLMs survive in low precision | Quantization Fundamentals
In this video, we discuss the fundamentals of model quantization, the technique that allows us to run inference on massive LLMs like DeepSeek-R1 or Qwen.
Among others, we'll discuss:
⚆ What quantization really means (hint: it’s more than just rounding)
⚆ Why integers are faster than floats (with a deep dive into their internal structure)
⚆ How quantization preserves model accuracy
⚆ When to quantize: during training vs after training (PTQ vs QAT)
⚆ A hands-on explanation of scale, zero point, clipping ranges, and fixed-point math
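Here is a minimal sketch of the scale/zero-point mechanics discussed in the video, using asymmetric uint8 quantization; real implementations differ in how they pick the clipping range.

```python
import numpy as np

def quantize_uint8(x):
    """Affine (asymmetric) quantization to uint8: pick a scale and zero point
    so the clipping range [min, max] maps onto [0, 255]."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.2, 0.0, 0.4, 1.5], dtype=np.float32)
q, scale, zp = quantize_uint8(x)
x_hat = dequantize(q, scale, zp)
print(q, x_hat)   # round-trip error is on the order of one quantization step
```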
If you enjoyed this, consider subscribing for upcoming videos on:
⚆ Post-training quantization (PTQ)
⚆ Quantization-aware training (QAT)
⚆ Training in low precision (e.g., FP4)
⚆ 1-bit LLMs
#Quantization #MachineLearning #AIOptimization #LLM #NeuralNetworks #QAT #PTQ #DeepLearning #EdgeAI #FixedPoint #BFloat16 #TensorRT #ONNX #AIAccelerators
00:00 Intro
00:50 What
02:10 Why
03:50 Integer vs floating-point formats
06:45 When
09:21 How
14:40 Fixed point arithmetic
18:00 Matrix multiplications
20:07 Outro
Knowledge Distillation: How LLMs train each other
In this video, we break down knowledge distillation, the technique that powers models like Gemma 3, LLaMA 4 Scout & Maverick, and DeepSeek-R1. Distillation was prominently discussed at LlamaCon 2025.
You’ll learn:
• What knowledge distillation really is (and what it’s not)
• How it helps scale LLMs without bloating inference cost
• The origin story from ensembles and model compression (2006) to Hinton’s "dark knowledge" paper (2015)
• Why "soft labels" carry more information than one-hot targets
• How companies like Google, Meta, and DeepSeek apply distillation differently
• The true meaning behind terms like temperature, behavioral cloning, and co-distillation
Whether you’re building, training, or just trying to understand modern AI systems, this video gives you a deep but accessible introduction to how LLMs teach each other.
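For a taste of the "soft labels" and temperature ideas, here is a toy NumPy sketch; the logits and the T=4 choice are illustrative, not taken from any particular model.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

# Teacher logits for classes [cat, dog, car]
teacher_logits = np.array([5.0, 3.0, -2.0])

hard = softmax(teacher_logits)          # T=1: nearly one-hot
soft = softmax(teacher_logits, T=4.0)   # higher T: "dark knowledge" becomes visible,
print(hard, soft)                       # e.g. dog is far more likely than car

def distill_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions (the usual T**2 factor keeps the gradient scale comparable
    to the hard-label loss)."""
    p, q = softmax(teacher_logits, T), softmax(student_logits, T)
    return (T ** 2) * np.sum(p * np.log(p / q))

print(distill_loss(np.array([4.0, 3.5, -1.0]), teacher_logits))
```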
👉 Slide deck and paper list available for free on Patreon: https://www.patreon.com/c/juliaturc
00:00 – Intro
00:45 – Why distillation matters for scaling
02:26 – The 2006 origins: ensembles and model compression
05:45 – Hinton's 2015 paper: soft labels & dark knowledge
08:26 – What temperature really means
09:37 – Distillation in modern LLMs (Gemma, LLaMA, DeepSeek)
10:53 – Proper distillation vs. behavioral cloning
13:18 – Computational costs of distillation
14:16 – Co-distillation explained
15:32 – Outro
Mixture of Experts: How LLMs get bigger without getting slower
Mixture of Experts (MoE) is everywhere: Meta / Llama 4, DeepSeek, Mistral. But how does it actually work? Do experts specialize? Why does this design scale better than dense models?
In this video, we go deep:
🔹 Walk through the full history of MoE—from vowel recognition in 1991 to trillion-parameter models
🔹 Reproduce the original paper live in Colab
🔹 Dissect modern architectures like Switch Transformer, DeepSeek-MoE, and Mixtral
🔹 Explain why sparsity works, how gating networks operate, and whether experts actually specialize
🔹 Explore training tricks like noise injection and load balancing
🔹 Cover how MoE models are parallelized across devices
Whether you’re an ML researcher, engineer, or just LLM-curious—you'll find value in this video.
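For intuition about sparse gating, here is a toy sketch of a top-k MoE layer. The scalar "experts" and the hand-picked router logits are placeholders; in a real MoE each expert is a feed-forward network and the router is learned.

```python
import numpy as np

def expert(i, x):
    """Toy expert: in a real MoE, each expert is a feed-forward network."""
    return (i + 1) * x

def moe_layer(x, router_logits, k=2):
    """Sparse top-k gating: softmax the router's logits, keep only the top-k
    experts, renormalize their weights, and mix the chosen experts' outputs.
    Each token therefore pays the compute cost of k experts, not all of them."""
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]             # indices of the k best experts
    weights = probs[top] / probs[top].sum()  # renormalize over the top-k
    return sum(w * expert(i, x) for w, i in zip(weights, top))

x = np.array([1.0, 2.0])
router_logits = np.array([0.1, 2.0, -1.0, 1.5])  # produced by a learned router
y = moe_layer(x, router_logits, k=2)
print(y)   # only experts 1 and 3 contribute
```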
🧠 Free resources (slides, reading list, Colab) are available on my Patreon for free 👉 https://www.patreon.com/c/juliaturc
00:00 Intro & Motivation
01:00 The Scaling Problem
01:49 The Original MoE Paper (1991)
03:43 Colab Repro of Original Paper
09:54 Sparse MoE Revival (2017)
16:03 Switch Transformer & K=1 (2021)
20:28 Modern Open-Source MoEs (Mixtral, DeepSeek, LLaMA 4)
23:02 Do experts specialize?
25:41 Parallelization
