Each group picks a reinforcement learning research paper with an associated codebase. This paper forms the core of the course: you will try to understand it, replicate its experiments, test it in a new application, and/or extend or improve it with a new idea.
Note that the papers below are merely suggestions: you are always free to come up with your own paper of interest.
Do make sure that your paper comes with a trustworthy public codebase (ideally from the original authors of the paper) and a smaller toy experiment that is certainly computationally feasible, so you can obtain some results. Always discuss your choice with the teachers.
Vanilla model-free RL algorithms still form the core of applied reinforcement learning research. These algorithms are relatively stable and well-studied, and they are your first choice as a benchmark in an applied problem. Consider one of the algorithms below if you want to study one of the Applications.
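As a rough illustration of how little code such a baseline requires, here is a minimal sketch that assumes the Stable-Baselines3 library and the CartPole-v1 environment as a stand-in application (neither is prescribed by the course; substitute your own environment):

```python
# Minimal model-free baseline sketch, assuming Stable-Baselines3 and Gymnasium's
# CartPole-v1 as a stand-in for your actual application environment.
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)  # vanilla model-free PPO baseline
model.learn(total_timesteps=50_000)                 # train for a modest budget
model.save("ppo_cartpole_baseline")                 # keep the baseline for comparison
```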
To deploy reinforcement learning algorithms in the real world, we typically need to satisfy certain safety constraints. But how can we achieve this?
OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research. Paper. Code.
This benchmark repository contains a range of RL algorithms that try to ensure certain safety bounds during deployment.
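Many of these methods build on a constrained MDP formulation: maximise the expected reward return subject to a bound on the expected cost return, typically relaxed with a Lagrange multiplier. The toy sketch below (my own illustration, not an OmniSafe algorithm) shows the primal-dual idea on a two-armed bandit where the risky arm pays more reward but also incurs cost:

```python
import math

# Toy constrained-RL sketch: arm 1 gives reward 2 but cost 1, arm 0 gives reward 1 and
# cost 0. The constraint is that expected cost per step stays below a budget of 0.3.
cost_limit = 0.3
theta, lmbda = 0.0, 0.0          # policy logit and Lagrange multiplier
lr_policy, lr_lambda = 0.1, 0.1

for step in range(2000):
    p = 1.0 / (1.0 + math.exp(-theta))   # probability of pulling the risky arm
    J_r = 1.0 * (1 - p) + 2.0 * p        # expected reward return
    J_c = 1.0 * p                        # expected cost return
    # Primal step: ascend the Lagrangian J_r - lambda * J_c w.r.t. the policy parameter.
    grad_theta = (1.0 - lmbda) * p * (1 - p)
    theta += lr_policy * grad_theta
    # Dual step: raise lambda when the constraint is violated, lower it otherwise.
    lmbda = max(0.0, lmbda + lr_lambda * (J_c - cost_limit))

print(f"risky-arm probability ~= {p:.2f}, multiplier ~= {lmbda:.2f}")  # p -> cost_limit
```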
Traditionally, RL trains on one particular reward/goal specification. However, we ideally want solutions that generalize over tasks, i.e., methods that give good zero-shot or few-shot performance on new task specifications.
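One common route towards such generalization (my own illustration, not tied to a specific paper above) is to condition the policy on a task or goal vector, so that a single network can in principle act zero-shot when handed a new task specification:

```python
import torch
import torch.nn as nn

# Minimal goal-conditioned policy sketch: the task specification is an extra input.
class GoalConditionedPolicy(nn.Module):
    def __init__(self, state_dim=8, goal_dim=4, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, goal):
        # The goal/task vector is simply concatenated to the state observation.
        return self.net(torch.cat([state, goal], dim=-1))

policy = GoalConditionedPolicy()
action = policy(torch.randn(1, 8), torch.randn(1, 4))  # unseen goal at test time
```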
Standard RL focuses on a single agent in a given task. However, some tasks cannot be solved centrally by a single agent (e.g., due to latency in communicating state), or become too large for a single-agent solution. In these cases, multi-agent RL can be a solution.
This benchmark repository contains a range of multi-agent algorithms that you can test on a variety of environments.
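To get a feel for the multi-agent interaction loop, here is a minimal rollout sketch. It assumes the PettingZoo library and its MPE 'simple_spread' task (my assumption for illustration, not necessarily the benchmark referred to above), with random actions standing in for learned per-agent policies:

```python
# Minimal multi-agent rollout sketch, assuming PettingZoo's parallel API.
from pettingzoo.mpe import simple_spread_v3

env = simple_spread_v3.parallel_env()
observations, infos = env.reset(seed=0)
while env.agents:                                  # loop until all agents are done
    # Placeholder for learned policies: each agent samples a random action.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.close()
```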
Diffusion models (covered here) are a very successful approach to generative modelling. However, they can also be applied to reinforcement learning, as surveyed here.
Diffuser: Paper. Website. Code.
An interesting approach could be the Diffuser, where we bias the generative sampling process with a reward function and thereby generate entire trajectories (plans) without ever explicitly performing RL credit assignment. You could first focus on reproducing the 'Maze2D' experiments.
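The sketch below gives the flavour of reward-guided denoising in the spirit of Diffuser; it is heavily simplified and not the authors' implementation. The denoiser and return estimate are toy stand-ins, and at every reverse-diffusion step the predicted trajectory mean is nudged along the gradient of the return estimate:

```python
import torch

horizon, dim = 32, 6                 # trajectory length and per-step (state, action) size

def denoise_model(tau, t):           # stand-in for the trained trajectory denoiser
    return 0.9 * tau                 # (a real model predicts the less-noisy trajectory)

def value_fn(tau):                   # stand-in for the learned return estimate J(tau)
    return -(tau ** 2).sum()         # (a real model is trained to predict returns)

def guided_sample(n_steps=50, guide_scale=0.1, noise_scale=0.1):
    tau = torch.randn(horizon, dim)              # start from pure noise over the plan
    for t in reversed(range(n_steps)):
        mu = denoise_model(tau, t)
        mu_req = mu.detach().requires_grad_(True)
        grad = torch.autograd.grad(value_fn(mu_req), mu_req)[0]
        mu = mu + guide_scale * grad             # bias sampling towards higher return
        noise = torch.randn_like(tau) if t > 0 else torch.zeros_like(tau)
        tau = mu + noise_scale * noise
    return tau                                   # the plan; execute its first action

plan = guided_sample()
```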