Advanced Machine Learning Assignment — 2022
Submission
Submit your solution electronically via vUWS. (1) Submit a report as PDF via Turnitin. (2) Create a zip file with your code (use zip, do not use rar) and any other files you want to submit, and upload it to vUWS (to where you got this assignment text). Include the signed and completed cover sheet that you can find at the end of the document.
Miniracer
Figure 1: 4 frames from miniracer. There are three possible values for a pixel: +2 for the car (the dark 2 × 2 square), 0 for drivable track segments (1 × 6 pixels, in white), and +1 for non-drivable terrain (here in grey). When the front of the car bumps into non-drivable terrain, the episode finishes. The rear of the car is allowed to go off road.
In this assignment you will work with data and a simulation of a simple racing game. The car is represented by the black square in the screenshots above.
In this game, the car remains at the bottom of the screen, and can either move left, right, or keep the current position. At every step, the track scrolls down by one, simulating the driving car. The size of the screenshot is 16 × 16 pixels.
Preparation
Download the minirace.py and sprites.py Python files. The class Minirace implements the racing game simulation. Running sprites.py will create datasets of screenshots for your first task.
A new racing game can be created like this:
    from minirace import Minirace
    therace = Minirace(level=1)
Here, level sets the information an RL agent gets from the environment. The car is 2 × 2 pixels, and cannot leave the field. The track segments are 6 pixels wide, and have positions from 1 (left) to 5 (right), and the car has 7 different positions (from 0 to 6). The front of the car (in the second row from the bottom, row 1) must remain on drivable terrain at all times. The rear of the car (in the first row from the bottom, row 0) is allowed to come off road with no penalty.
At each step during a race, the agent will get a reward of +1. Once the front of the car comes off road, the episode finishes.
Task 1: Train a CNN to predict a clear road ahead 15 points
The python program sprites.py creates a training and test set of “minirace” scenes, trainingpix.csv (1024 examples) and testingpix.csv (256 examples). Each row represents a 16 × 16 screenshot (flattened in row-major order), plus an extra value of either 0 or 1 that indicates if the car can safely drive straight without going off-road in the immediate next step (i.e., there are 257 columns).
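The row layout described above can be unpacked as in the following minimal sketch. It assumes that the files are comma-separated, have no header line, and store the 0/1 label as the last of the 257 values; check the output of sprites.py before relying on any of this. The names parse_row and load_rows are illustrative and not part of the provided code.

```python
import csv
import io

SIDE = 16  # screenshots are 16 x 16 pixels


def parse_row(values):
    """Split one 257-value row into a 16x16 grid (list of rows) and a 0/1 label."""
    if len(values) != SIDE * SIDE + 1:
        raise ValueError(f"expected {SIDE * SIDE + 1} values, got {len(values)}")
    pixels = [float(v) for v in values[:-1]]
    label = int(float(values[-1]))  # assumption: the label is the last column
    # Row-major order: the first 16 values form the first image row, and so on.
    grid = [pixels[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]
    return grid, label


def load_rows(text_stream):
    """Read comma-separated rows from an open text stream (assumes no header)."""
    return [parse_row(row) for row in csv.reader(text_stream) if row]


# Tiny synthetic demonstration (not real game data): an all-zero image, label 1.
demo = io.StringIO(",".join(["0"] * 256 + ["1"]) + "\n")
examples = load_rows(demo)
```

For the real data, the same function could be called with an opened file, for example load_rows(open("trainingpix.csv", newline="")).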
Steps
1. Create the datasets by running the sprites.py code.
2. Create a CNN that predicts whether the car can safely remain in the current position (i.e., drive straight) without crashing into non-drivable terrain.
(a) Describe (no programming): what is a good loss function for this problem?
(b) Implement and train the CNN on the training set.
(c) Compute the accuracy of your model on the test data set.
You are free to choose the architecture of your network, but there should be at least one convolutional layer.
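For step (c), accuracy is the fraction of test examples that are classified correctly. The sketch below assumes that your model outputs, for each example, a probability that the road ahead is clear, and that predictions are thresholded at 0.5; both are assumptions about your model, not requirements of the assignment. The function name accuracy is illustrative.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the 0/1 label."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        int(p >= threshold) == int(y) for p, y in zip(probabilities, labels)
    )
    return correct / len(labels)


# Made-up model outputs for four test examples (illustration only).
demo_accuracy = accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
```

In the made-up example, three of the four thresholded predictions match their labels.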
Task 2: Train a convolutional autoencoder 10 points
Create a convolutional autoencoder that compresses the racing game screenshots to a small number of bytes (the encoder), and transforms them back to the original (in the decoder part).
Steps
1. Create an undercomplete convolutional autoencoder and train it using the training data set from the first task.
2. You can choose the architecture of the network and the size of the representation h = f(x). The goal is to learn a representation that is smaller than the original, and still leads to recognizable reconstructions of the original.
3. (No programming): Explain the difference between an undercomplete and a denoising autoencoder.
4. (No programming): The input images are 16 × 16 = 256 pixels. What is the size of your hidden representation h = f(x) (the middle layer size of your autoencoder)? Include your calculation in your report.
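The calculation in step 4 depends on your own architecture. As a minimal sketch, the spatial output size of a convolution (or pooling) layer without dilation is floor((size + 2 × padding − kernel) / stride) + 1, and the size of h is the number of channels times the final height times the final width. The encoder in the example (two 3 × 3 convolutions with stride 2 and padding 1, ending in 4 channels) is purely hypothetical, and the function names are illustrative.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoded_size(input_side, layers, channels):
    """Final side length and number of values in h for square inputs.

    `layers` is a list of (kernel, stride, padding) tuples applied in order;
    `channels` is the number of output channels of the last encoder layer.
    """
    side = input_side
    for kernel, stride, padding in layers:
        side = conv_output_size(side, kernel, stride, padding)
    return side, channels * side * side


# Hypothetical encoder: two 3x3 convolutions, stride 2, padding 1, 4 channels.
# These choices are placeholders, not a recommended architecture.
side, h_size = encoded_size(16, [(3, 2, 1), (3, 2, 1)], channels=4)
```

With these placeholder settings the side length goes from 16 to 8 to 4, so h has 4 × 4 × 4 = 64 values, which is smaller than the 256 input pixels.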
Task 3: Create a RL agent for Minirace (level 1) 15 points
The code in minirace.py provides an environment to create an agent that can be trained with reinforcement learning (a complete description at the end of this sheet).
The following is a description of the environment dynamics:
The square represents the car, which is 2 pixels wide. The car always appears in the bottom row, and at each step of the simulation the track scrolls by one row below the car.
The agent can control the steering of the car, by moving it two pixels to the left or right. The agent can also choose to do nothing, in which case the car drives straight. The car cannot be moved outside the boundaries.
The agent will receive a positive reward at each step where the front part of the car is still on track.
An episode is finished when the front of the car hits non-drivable terrain.
In a level 1 version of the game, the observed state (the information made available to the agent after each step) consists of one number: dx. It is the relative position of the middle of the track right in front of the car (i.e., the piece of track in the third row from the bottom of the image). When the track turns left in front of the car, this value will be negative, and when the track turns right, dx is positive. As the track is six pixels wide, the car can drive either on the left, middle, or right of a piece of track (it does not need to drive in the middle of the road).
For this task, you should initialise the simulation like this: therace = Minirace(level=1)
When you run the simulation, step() returns dx (…, −2, −1, 0, 1, 2, …) for the state.
Steps
1. Manually create a policy (no RL) that successfully drives the car, selecting actions based only on the state information. The minirace.py code contains a function mypolicy() that you should modify for this task.
2. (No programming) How many different values for dx are possible in theory (if you ignore that the car may crash)? If you were to create a tabular reinforcement learning agent, what size is your table for this problem (number of rows and columns)?
3. Create a (tabular or deep) TD agent that learns to drive. If you decide to use ε-greedy action selection, set ε = 1 initially, and reduce it during your training to a minimum of 0.01. (This means: do not stop just because ε reached 0.01. You may want to stop earlier, or you may want to keep going; just do not reduce ε any further.)
4. When you run your training, reset the environment after every episode. Store the sum of rewards. After or during the training, plot the total sum of rewards per episode. This plot (the Training Reward plot) indicates the extent to which your agent is learning to improve its cumulative reward. It is your decision when to stop training. It is not required to submit a perfectly performing agent, but show how it learns.
5. After you decide that the training is complete, run 50 test episodes using your trained policy, but with ε = 0.0 for all 50 episodes. Again, reset the environment at the beginning of each episode. Calculate the average over sum-of-rewards-per-episode (call this the Test-Average), and the standard deviation (the Test-Standard-Deviation). These values indicate how your trained agent performs.
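The building blocks of steps 3 to 5 can be sketched as follows for a tabular agent. This is a sketch under stated assumptions, not the required solution: Q-learning is used as one possible TD method, the action encoding (0, 1, 2), the decay rate, the learning rate alpha and the discount gamma are all illustrative choices, and none of the function names come from minirace.py. The standard deviation is computed as the sample standard deviation; statistics.pstdev would give the population version.

```python
import random
import statistics

ACTIONS = (0, 1, 2)  # hypothetical encoding: left, straight, right


def decayed_epsilon(episode, start=1.0, minimum=0.01, decay=0.995):
    """Multiplicative decay from `start`, never dropping below `minimum`."""
    return max(minimum, start * decay ** episode)


def choose_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))


def td_update(q_table, state, action, reward, next_state, done,
              alpha=0.1, gamma=0.95):
    """One Q-learning step: move Q(s, a) towards the bootstrapped target."""
    best_next = 0.0 if done else max(
        q_table.get((next_state, a), 0.0) for a in ACTIONS)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)


def summarise(episode_returns):
    """Test-Average and Test-Standard-Deviation over per-episode reward sums."""
    return statistics.mean(episode_returns), statistics.stdev(episode_returns)
```

Unvisited state-action pairs default to a value of 0.0 because the table is a dictionary keyed by (state, action). With epsilon set to 0.0, choose_action always returns the greedy action, which is what the test episodes in step 5 require.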
Task 4: Create a RL agent for Minirace (level 2) 10 points
In a level 2 version of the game, the observed state (the information made available to the agent after each step) consists of two numbers: dx1, dx2. The first value (dx1) is the same as dx in level 1 – the relative position of the (middle of the) track in front of the car. The second value (dx2) is the position of the subsequent track (in row 4), relative to the track in front of the car (in row 3).
A second difference is that the track can be more curved: sometimes the track will only overlap on the left or right edge. This means the agent cannot always drive in the middle of the track, because the car can only move one step to the left or right at a time.
For this task, you can initialise like this: therace = Minirace(level=2)
In this level, step() returns two unnormalised pixel difference values (i.e., two values from …, −2, −1, 0, 1, 2, …).
1. Create a RL agent (using a RL method of your choice) that finds a policy using (all) level 2 state information. A suggested discount factor is γ = 0.95.
2. You can choose the algorithm (a tabular approach, deep TD or deep policy gradient).
3. Try to train an agent that achieves a running reward > 50 (the minirace.py file has an example for how to calculate this).
4. If you use a neural network, do not go overboard with the number of hidden layers, as this will significantly increase training time. Try one hidden layer.
5. Write a description explaining how your approach works, and how it performs. If some (or all) of your attempts are unsuccessful, also describe some of the things that did not work, and which changes made a difference.
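Two quantities from the steps above can be sketched as follows. The first function computes discounted returns G_t = r_t + γ G_{t+1} for one episode, which is useful if you choose a policy gradient method. The second shows a running reward as an exponential moving average; this form and its weight of 0.05 are assumptions, and the example in minirace.py takes precedence if it differs. Both function names are illustrative.

```python
def discounted_returns(rewards, gamma=0.95):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Work backwards from the last step, where the return is just the reward.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def update_running_reward(running_reward, episode_reward, weight=0.05):
    """Exponential moving average of per-episode reward sums (assumed form)."""
    return weight * episode_reward + (1.0 - weight) * running_reward


# Three steps with reward +1 each, as in a short Minirace episode.
demo_returns = discounted_returns([1, 1, 1])
```

For the three-step example, the returns are approximately 2.8525, 1.95 and 1.0 with γ = 0.95.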