AIL8027: Advanced Reinforcement Learning
(Also offered as AIL821: Special Topics in Machine Learning)
Course Overview: This course covers advanced topics in reinforcement learning (RL). RL is a sprawling research area with promising applications in diverse real-world domains such as robotics, autonomous driving, smart transportation, finance, supply-chain logistics, training large language models (LLMs), and games. However, the problems we confront in these domains rarely match the idealized conditions assumed by off-the-shelf RL algorithms: we may encounter multiple learning agents in the environment, sparse reward structures, multiple dynamic goals, or the need to incorporate constraints into policy optimization. We will also cover recent, crucial applications of RL in training LLMs.
Grading Scheme: Minor - 30%, Major - 35%, Assignments - 10%, Quizzes - 10%, Paper Reading - 15%.
Attendance Policy: Institute default (attendance below 75% lowers the grade by one level).
Audit Pass Criteria: Marks equivalent to B- or higher, plus at least 75% attendance.
Prerequisites: A foundational course in AI or ML; proficiency in Python; good knowledge of probability and statistics.
Lecture: Location: TBA, Time: Monday & Thursday, 3:30 PM - 5:00 PM
Office Hours: Location: TBA, Time: TBA
Tentative List of Modules:
- Multi-Agent Reinforcement Learning (MARL)
- Constrained Reinforcement Learning (CRL)
- Hierarchical Reinforcement Learning (HRL)
- Reinforcement Learning for LLMs (RL-LLM)
- Unsupervised Reinforcement Learning (URL)
- Distributional Reinforcement Learning (DRL)
- Meta Reinforcement Learning (MRL)
- Imitation Learning (IL)
- Goal-Conditioned RL (GCRL)
- Human-in-the-loop Reinforcement Learning (HLRL)
Deadlines:
- Assignment 1: TBA
- Quiz 1: TBA
Lecture Schedule:
Week No. | Lecture Dates | Module | Topics | Reference Materials |
---|---|---|---|---|
1 | - | RL | Course logistics; Review of RL Basics - 1: Markov decision processes, value functions, Bellman equations, Monte Carlo RL, TD learning, SARSA. | - |
2 | - | RL | Review of RL Basics - 2: Off-policy learning - Q-learning, DQN; Policy gradient methods - REINFORCE, Actor-Critic (AC), Advantage Actor-Critic (A2C). (A short Q-learning code sketch follows the schedule below.) | - |
3 | - | MARL | Introduction: multi-agent RL, motivation; challenges in MARL; Dec-POMDPs; Solution methods via single-agent RL: centralized learning, independent learning, parameter sharing, experience sharing. | - |
4 | - | MARL | Game-theoretic solutions: Nash Q-learning, no-regret learning; Training & execution paradigms: centralized training & execution, decentralized training & execution, centralized training & decentralized execution. | - |
5 | - | MARL | Multi-agent policy gradient theorem, MADDPG; counterfactual action-value function (COMA); Value decomposition methods: linear value decomposition (VDN), monotonic value decomposition (QMIX); Many-agent training: mean-field RL. | - |
6 | - | CRL | Constrained MDP; Lagrange relaxation technique - Reward Constrained Policy Optimization (RCPO); Trust Region Method - Constrained Policy Optimization (CPO). | - |
7 | - | HRL | State and temporal abstractions in Markov decision processes; Semi-Markov Decision Process; Option framework - value iteration with options, option value and policy learning, Option-Critic architecture, Natural Option-Critic. | - |
8 | - | RL-LLM | RL with Human Feedback (RLHF); Preference-based learning - Direct Preference Optimization (DPO), Reward-aware Preference Optimization (RPO), Group Relative Policy Optimization (GRPO). | - |
9 | - | URL | Reward-Free Pre-Training and Exploration; Intrinsic Motivation; Empowerment; Curiosity Driven Exploration; Unsupervised Skill Discovery; Unsupervised Control. | - |
10 | - | DRL | Learning return distributions; categorical TD learning; distributional Bellman operator; distributional value iteration; distributional RL algorithms with deep neural networks. | - |
11 | - | MRL | Fast RL via slow RL; learning to reinforcement learn; Model-Agnostic Meta-Learning (MAML); meta-gradient RL; successor features for transfer in RL. | - |
12 | - | IL | Imitation learning: behavior cloning, Dataset Aggregation (DAgger); inverse RL; Generative Adversarial Imitation Learning (GAIL). | - |
13 | - | GCRL | Goal-augmented MDPs; notion of goals & subgoals; Hindsight Experience Replay (HER). | - |
14 | - | HLRL | Human-in-the-loop Reinforcement Learning | - |
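To give a sense of the programming expected in the course, here is a minimal sketch of tabular Q-learning from the Weeks 1-2 review, in Python (a course prerequisite). The 5-state chain environment, the hyperparameter values, and the episode count are illustrative assumptions made for this sketch only; assignments may use different environments and settings.

```python
import numpy as np

# Hypothetical 5-state chain MDP for illustration only (not course material):
# actions 0 = left, 1 = right; taking "right" from the last state gives
# reward 1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma, eps = 0.1, 0.99, 0.1  # step size, discount factor, exploration rate

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (a == 1 and s == n_states - 1)
    reward = 1.0 if done else 0.0
    return s_next, reward, done

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value of the next state
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q)  # the learned values should favor action 1 (right) in every state
```

Replacing the max over next-state actions with the value of the action actually chosen by the epsilon-greedy behavior policy turns this off-policy update into SARSA, the on-policy method from the Week 1 review.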
Reference Materials:
- S. V. Albrecht, F. Christianos, and L. Schäfer. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024.
- M. G. Bellemare, W. Dabney, and M. Rowland. Distributional Reinforcement Learning. MIT Press, 2023.
- R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.
- Y. Duan et al. RL²: Fast Reinforcement Learning via Slow Reinforcement Learning. ICLR, 2017.
- C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML). ICML, 2017.