AIL8027: Advanced Reinforcement Learning
(Also offered as AIL821: Special Topics in Machine Learning)
Course Overview: This course covers advanced topics in reinforcement learning (RL). RL is a sprawling research area with promising applications in diverse real-world domains such as robotics, autonomous driving, smart transportation, finance, supply-chain logistics, training large language models (LLMs), and games. However, the problems we confront in these domains rarely match the idealized conditions assumed by off-the-shelf RL algorithms: we may encounter multiple learning agents in the environment, sparse reward structures, multiple dynamic goals, or the need to incorporate constraints into policy optimization. We will also cover recent, crucial applications of RL in training LLMs.
Grading Scheme: Minor - 30%, Major - 35%, Assignments - 10%, Quizzes - 10%, Paper Reading - 15%.
Attendance Policy: Institute default (attendance below 75% lowers the grade by one level).
Audit Pass Criteria: Marks equivalent to B- or higher, plus at least 75% attendance.
Prerequisites: A foundational course in AI or ML; proficiency in Python; good knowledge of probability and statistics.
Lecture: Location: TBA, Time: Monday & Thursday, 3:30 PM - 5:00 PM
Office Hours: Location: TBA, Time: TBA
Tentative List of Modules:
- Multi-Agent Reinforcement Learning (MARL)
- Constrained Reinforcement Learning (CRL)
- Hierarchical Reinforcement Learning (HRL)
- Reinforcement Learning for LLMs (RL-LLM)
- Unsupervised Reinforcement Learning (URL)
- Distributional Reinforcement Learning (DRL)
- Meta Reinforcement Learning (MRL)
- Imitation Learning (IL)
- Goal-Conditioned RL (GCRL)
- Human-in-the-loop Reinforcement Learning (HLRL)
Deadlines:
- Assignment 1: TBA
- Quiz 1: TBA
Lecture Schedule:
Week No. | Lecture Dates | Module | Topics | Reference Materials |
---|---|---|---|---|
1 | - | RL | Course logistics; Review of RL Basics - 1: Markov decision processes, value functions, Bellman equations, Monte Carlo RL, TD learning, SARSA. | - |
2 | - | RL | Review of RL Basics - 2: Off-policy learning - Q-learning, DQN; Policy gradient methods - REINFORCE, Actor-Critic (AC), Advantage Actor-Critic (A2C). (A short Q-learning code sketch follows the schedule below.) | - |
3 | - | MARL | Introduction: multi-agent RL, motivation; challenges in MARL; Dec-POMDPs; Solution methods via single-agent RL: centralized learning, independent learning, parameter sharing, experience sharing. | - |
4 | - | MARL | Game-theoretic solutions: Nash Q-learning, no-regret learning; Training & execution paradigms: centralized training & execution, decentralized training & execution, centralized training & decentralized execution. | - |
5 | - | MARL | Multi-agent policy gradient theorem, MADDPG; counterfactual action-value function (COMA); Value decomposition methods: linear value decomposition (VDN), monotonic value decomposition (QMIX); Many-agent training: mean-field RL. | - |
6 | - | CRL | Constrained MDP; Lagrange relaxation technique - Reward Constrained Policy Optimization (RCPO); Trust Region Method - Constrained Policy Optimization (CPO). | - |
7 | - | HRL | State and temporal abstractions in Markov decision processes; Semi-Markov Decision Process; Option framework - value iteration with options, option value and policy learning, Option-Critic architecture, Natural Option-Critic. | - |
8 | - | RL-LLM | RL with Human Feedback (RLHF); Preference-based learning - Direct Preference Optimization (DPO), Reward-aware Preference Optimization (RPO), Group Relative Policy Optimization (GRPO). | - |
9 | - | URL | Reward-Free Pre-Training and Exploration; Intrinsic Motivation; Empowerment; Curiosity Driven Exploration; Unsupervised Skill Discovery; Unsupervised Control. | - |
10 | - | DRL | Learning return distributions; categorical TD learning; distributional Bellman operator; distributional value iteration; distributional RL algorithms with deep neural networks. | - |
11 | - | MRL | Fast RL via slow RL; learning to reinforcement learn; Model-Agnostic Meta-Learning (MAML); meta-gradient RL; successor features for transfer in RL. | - |
12 | - | IL | Imitation learning: behavior cloning, Dataset Aggregation (DAgger); inverse RL; Generative Adversarial Imitation Learning (GAIL). | - |
13 | - | GCRL | Goal-augmented MDPs; notion of goals & subgoals; Hindsight Experience Replay (HER). | - |
14 | - | HLRL | Human-in-the-loop Reinforcement Learning | - |
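To give a sense of the programming expected in the course, here is a minimal sketch of tabular Q-learning from the Weeks 1-2 review, in Python (a course prerequisite). The 5-state chain environment, the hyperparameter values, and the episode count are illustrative assumptions made for this sketch only; assignments may use different environments and settings.

```python
import numpy as np

# Hypothetical 5-state chain MDP for illustration only (not course material):
# actions 0 = left, 1 = right; taking "right" from the last state gives
# reward 1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma, eps = 0.1, 0.99, 0.1  # step size, discount factor, exploration rate

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (a == 1 and s == n_states - 1)
    reward = 1.0 if done else 0.0
    return s_next, reward, done

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value of the next state
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q)  # the learned values should favor action 1 (right) in every state
```

Replacing the max over next-state actions with the value of the action actually chosen by the epsilon-greedy behavior policy turns this off-policy update into SARSA, the on-policy method from the Week 1 review.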
Reference Materials:
- S. V. Albrecht, F. Christianos, and L. Schäfer. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024.
- M. G. Bellemare, W. Dabney, and M. Rowland. Distributional Reinforcement Learning. MIT Press, 2023.
- R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.
- Y. Duan et al. RL²: Fast Reinforcement Learning via Slow Reinforcement Learning. ICLR, 2017.
- C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML). ICML, 2017.