CS2109S Tutorial 10
Unsupervised Learning with Neural Networks
+ Review (Last Tutorial!)
(AY 25/26 Semester 2)
April 16, 2026
(Prepared by Benson)
1 / 14
Contents
Unsupervised Learning with Neural Networks
Q1. Image Pre-training Strategies
Q2. Next-Token Prediction
Review Questions
Q3. Neural Networks
Q4. Convolutional Neural Networks
Course Review
2 / 14
Q1. Image Pre-training Strategies
We have collected a massive dataset of 1.3 million high-resolution images from the
ImageNet collection. This dataset contains a wide variety of everyday objects, natural
scenes, and vehicles, but it does not have any class labels. Our downstream task is
to build a classifier for the CIFAR-10 dataset. CIFAR-10 is a standard benchmark
containing 60,000 small, low-resolution images sorted into 10 common categories, such
as airplanes, cars, and ships. However, we only have a very small labeled dataset for
these categories.
(a) Is rotation prediction useful here as a pre-training task? How does this benefit the
downstream task?
Yes. The model must learn the natural orientation of the world (e.g. wheels belong
on the ground, trees grow upward, etc.) to correctly predict whether an ImageNet
photo has been rotated.
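As a sketch, the pretext labels come for free from the data: rotate each unlabeled image by a random multiple of 90° and ask the model to predict which multiple. The tiny list-of-lists "image" below is an illustrative stand-in for a real image tensor.

```python
import random

def rot90(img):
    # Rotate a 2D grid 90 degrees clockwise.
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_example(img, rng=random):
    k = rng.randrange(4)        # self-supervised label: 0..3 quarter-turns
    for _ in range(k):
        img = rot90(img)
    return img, k               # (network input, pretext target)

rotated, label = make_rotation_example([[1, 2], [3, 4]], rng=random.Random(0))
```

Training a classifier on these (rotated, k) pairs forces the encoder to recognize object orientation, and the learned features transfer to the downstream CIFAR-10 classifier.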
3 / 14
Q1. Image Pre-training Strategies
(b) Instead of rotation prediction, we decide to use contrastive learning. How do we
construct the training pairs? How does this benefit the downstream task?
Positive pair: Two different augmented views of the same ImageNet photo.
Negative pair: Pair the original image with an image of a completely different object
from the dataset.
The model is trained to pull the embeddings of positive pairs together in the latent
space while pushing negative pairs apart, forcing it to learn augmentation-invariant
features that capture object identity.
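The pull/push objective can be sketched numerically with an InfoNCE-style loss on toy 2-D embeddings. The function names and the temperature value are illustrative, not from the tutorial.

```python
import math

def cos(u, v):
    # Cosine similarity between two 2-D embeddings.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def info_nce(anchor, positive, negatives, tau=0.1):
    # -log softmax of the positive similarity among all candidates:
    # low when the positive pair is close and negatives are far.
    sims = [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    logits = [s / tau for s in sims]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)

# Two augmented views of the same image should give a low loss...
low = info_nce([1.0, 0.1], [0.9, 0.2], [[-1.0, 0.0]])
# ...while a mismatched "positive" gives a high loss.
high = info_nce([1.0, 0.1], [-1.0, 0.0], [[0.9, 0.2]])
```

Minimizing this loss is exactly the pull-together / push-apart behavior described above.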
4 / 14
Q1. Image Pre-training Strategies
(c) Now, we decide to use image inpainting. How is the training data for this task
created, and how does it help the model learn about objects?
“Mask” (cut out) a random patch of each image.
The model predicts the pixel content of the missing area based on the surrounding
context.
It thereby learns the natural arrangement of object features: matching edges, how
parts connect, and how patterns continue.
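Constructing an inpainting training pair can be sketched as follows; the 4 × 4 grid, patch size, and zero fill value are illustrative stand-ins for real tensors.

```python
import random

def make_inpainting_example(img, size=2, rng=random):
    # Pick a random patch location; the erased pixels become the target.
    h, w = len(img), len(img[0])
    r = rng.randrange(h - size + 1)
    c = rng.randrange(w - size + 1)
    target = [row[c:c + size] for row in img[r:r + size]]
    masked = [row[:] for row in img]
    for i in range(r, r + size):
        for j in range(c, c + size):
            masked[i][j] = 0          # the "hole" the model must fill in
    return masked, target, (r, c)

img = [[v + 4 * u for v in range(4)] for u in range(4)]
masked, target, (r, c) = make_inpainting_example(img, rng=random.Random(0))
```

The model is then trained to regress `target` from `masked` (e.g. with an MSE loss on the hole), with no human labels needed.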
5 / 14
Q2. Next-Token Prediction
A tech company trains a massive transformer model entirely on the unsupervised task
of next-token prediction, using a large dataset of unlabelled internet text (articles,
forums, code, books).
(a) The model is trained solely for next-token prediction. Explain in detail how it can
be used to complete a full sentence.
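The model completes a sentence autoregressively: it predicts a distribution over the next token, picks one (e.g. the argmax), appends it to the input, and repeats until an end-of-sequence token or a length limit is reached. A toy sketch, with a hard-coded bigram table standing in for the trained transformer (purely illustrative; the real model scores the whole prefix):

```python
BIGRAMS = {  # toy next-token table playing the role of the model
    "the": "cat", "cat": "sat", "sat": "on",
    "on": "the_mat", "the_mat": "<eos>",
}

def complete(prompt_tokens, max_new=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        nxt = BIGRAMS.get(tokens[-1], "<eos>")  # argmax over the vocabulary
        if nxt == "<eos>":                      # stop token ends decoding
            break
        tokens.append(nxt)                      # feed the prediction back in
    return tokens

print(complete(["the"]))  # ['the', 'cat', 'sat', 'on', 'the_mat']
```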
6 / 14
Q2. Next-Token Prediction
(b) An engineer provides the following input to the Base Model:
Input: “What is the capital of France”
Instead of answering the question with “Paris”, the model outputs:
Output: “and what is the currency of France?”
Explain why this behavior is possible.
The base model is only trained for text completion, so it extends the question
following common textual patterns seen during pre-training.
(c) To increase the probability that the model will output the correct answer without
any fine-tuning, a friend suggests using the following as the input:
“Question: What is the capital of Japan? Answer: Tokyo.
Question: What is the capital of Australia? Answer: Canberra.
Question: What is the capital of France? Answer:”
Discuss why this input format may help the model generate the correct answer.
There is a repeating structural pattern, encouraging the model to continue this
structure. This technique is known as in-context learning.
7 / 14
Q3. Neural Networks
Network (from the slide's diagram): inputs x_1, x_2 feed two hidden units (weighted
sums with activation g^{[1]}) via weights W^{[1]}_{11}, W^{[1]}_{12}, W^{[1]}_{21},
W^{[1]}_{22}; the hidden outputs feed a single output unit (weighted sum with
activation g^{[2]}) via W^{[2]}_{11}, W^{[2]}_{21}, producing \hat{y}.
\[ W^{[1]} = \begin{pmatrix} 2 & 3 \\ 1 & -1 \end{pmatrix}, \qquad W^{[2]} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \qquad g^{[1]}: \text{Identity}, \qquad g^{[2]}: \text{ReLU} \]
Data:
x_1   x_2   y   \hat{y}
 1     0    3     7
 0    -1    2     0
(a) If x_1 = 1, x_2 = 1, find \hat{y}.
\[ f^{[1]} = (W^{[1]})^\top x = \begin{pmatrix} 2 & 1 \\ 3 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 2 \end{pmatrix} \]
\[ \hat{y} = \mathrm{ReLU}\!\left( \begin{pmatrix} 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} \right) = \mathrm{ReLU}(8) = 8 \]
(b) Find the derivative of the loss w.r.t. W^{[1]}_{11}.
\[ \frac{\partial J(W)}{\partial W^{[1]}_{11}} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial f^{[2]}} \cdot \frac{\partial f^{[2]}}{\partial f^{[1]}_{1}} \cdot \frac{\partial f^{[1]}_{1}}{\partial W^{[1]}_{11}} = \frac{1}{n}(\hat{y} - y) \cdot \begin{cases} 1 & \text{if } f^{[2]} > 0 \\ 0 & \text{otherwise} \end{cases} \cdot W^{[2]}_{11} \cdot x_1 \]
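The forward pass can be checked numerically. This sketch hardcodes the slide's weights under the f^{[1]} = (W^{[1]})^T x convention; the negative entry in W^{[1]} is reconstructed from the worked answer, so treat it as an assumption.

```python
def relu(z):
    return max(0.0, z)

def forward(x1, x2, W1=((2, 3), (1, -1)), W2=(2, 1)):
    # Hidden layer: f1_i = sum_j W1[j][i] * x_j, identity activation.
    f1 = [W1[0][0] * x1 + W1[1][0] * x2,
          W1[0][1] * x1 + W1[1][1] * x2]
    # Output layer: ReLU of the dot product with W2.
    f2 = W2[0] * f1[0] + W2[1] * f1[1]
    return f1, relu(f2)

f1, y_hat = forward(1, 1)
print(f1, y_hat)  # [3, 2] 8
```

The same function reproduces the table rows: forward(1, 0) gives ŷ = 7 and forward(0, -1) gives ŷ = 0.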
8 / 14
Q4. Convolutional Neural Networks
Given a 3-channel image of height 3 and width 3.
(a) Using all three channels as input for convolution, apply the 2D convolution
operation as introduced in lecture to produce a 5-channel output using kernels of
height and width 2. What is the total number of weights/parameters across all
kernels in this convolution layer?
2 × 2 × 3 × 5 = 60.
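The count follows because each of the 5 output channels has its own 2 × 2 × 3 kernel. A quick sketch (the helper name is made up; bias terms are excluded, matching the answer):

```python
def conv_params(k_h, k_w, c_in, c_out, bias=False):
    # One k_h x k_w x c_in kernel per output channel.
    weights = k_h * k_w * c_in * c_out
    return weights + (c_out if bias else 0)

print(conv_params(2, 2, 3, 5))  # 60
```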
(b) Describe how to generate an output of shape 3 × 3, where each element in the
output is the average of the corresponding elements across the three input
channels.
1 × 1 × 3 kernel with all 3 weights being 1/3.
Stride: 1. No padding.
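A sketch of the effect of that 1 × 1 averaging convolution on a toy 3-channel 3 × 3 input; pure-Python lists stand in for tensors, and multiplying by the 1/3 kernel weight is written as dividing by the channel count.

```python
def avg_channels(channels):
    # 1x1 conv with every weight equal to 1/len(channels):
    # each output pixel is the mean of that pixel across channels.
    h, w = len(channels[0]), len(channels[0][0])
    return [[sum(ch[i][j] for ch in channels) / len(channels)
             for j in range(w)] for i in range(h)]

x = [[[1] * 3] * 3, [[2] * 3] * 3, [[3] * 3] * 3]   # 3 channels, 3x3
out = avg_channels(x)                # every entry is (1+2+3)/3 = 2.0
```

With stride 1 and no padding, a 1 × 1 kernel leaves the spatial size unchanged, so the output is 3 × 3 as required.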
9 / 14
Course Review
Diagram: an agent perceives the environment through sensors (percepts) and acts on
it through actuators (actions), choosing actions that maximize its performance
measure.
10 / 14
Course Review
Diagram: a learning agent. The critic compares percepts against a performance
standard and sends feedback to the learning element, which makes changes to the
performance element's knowledge and passes learning goals to the problem generator;
the agent interacts with the environment through sensors and actuators.
11 / 14
Course Review
Diagram: learning from examples. A training set (x_1, f(x_1)), …, (x_N, f(x_N)) is
fed to a learning algorithm A, which selects a hypothesis h from a hypothesis class
H; given a new input x, h produces an output ŷ that approximates f(x).
12 / 14
Beyond CS2109S
Introductory
CS2109S Intro to AI and ML
Theoretical Foundations
CS3263 Foundations of AI
CS4246 AI Decision Making
CS5340 Uncertainty Modelling
CS3264 Foundations of ML
Sem 1 (Harold): Math-intensive version of
CS2109S (more linear algebra?)
Sem 2 (Bryan): Focuses on theory,
e.g. proving convergence, Bayesian
inference, reinforcement learning
Applications
CS4243 Computer Vision
CS4248 Natural Language Processing
Cool 5k mods
CS5339 Theory and Algo for ML
Algorithms, e.g. Perceptron, SVM, kernels
ML theory, e.g. concentration measures, VC dimension
Prerequisite: CS3264
CS5275 Algo Designer’s Toolkit
Math tools for algorithms/ML: randomized
algorithms, optimization, information theory, and more
Prerequisite: CS3230
CS4262 Machine Learning Systems
13 / 14
Student Feedback Exercise:
Your Voice Matters!
https://blue.nus.edu.sg/blue/
14 / 14