AIL7022/AIL722: Reinforcement Learning
Course Overview: Reinforcement Learning (RL) is a core area of machine learning focused on how intelligent agents learn to make decisions through interaction with an environment. This course will provide a comprehensive introduction to the principles, algorithms, and applications of reinforcement learning, with an emphasis on both theoretical foundations and practical implementation. Students will learn how agents can optimize long-term rewards by balancing exploration and exploitation, modeling problems using Markov Decision Processes (MDPs), and applying value-based and policy-based learning techniques.
Note: The course is currently offered under two different course IDs.
AIL722: Reinforcement Learning (3 Credit) - For old students only.
AIL7022: Reinforcement Learning (4 Credit) - For new students only.
Grading Scheme (AIL722 - 3 Credit): Minor - 30%, Major - 30%, Assignments - 30%, Quizzes - 10%.
Grading Scheme (AIL7022 - 4 Credit): Minor - 30%, Major - 30%, Assignments - 40%, Quizzes - 10%.
Prerequisites: Basic knowledge of Probability and Statistics.
Attendance Policy: Institute default (attendance below 75% leads to the grade being lowered by one level).
Audit Pass Criteria: Marks equivalent to B- or higher, plus >=75% attendance.
Lecture Hall & Time: TBA
Office Hours: By appointment only.
Class Communication: Moodle.
Tentative List of Topics:
| Week No. | Lecture Dates | Module | Topics |
|---|---|---|---|
| 1-a | - | Introduction | Course Logistics; Motivation; Connection to Psychology and Neuroscience; Sequential Decision-Making Problem; The RL Problem; Key Challenges |
| 1-b | - | Planning Problem | Deterministic Decision Processes; Markov Decision Process; Partially Observable MDP; Value Functions |
| 2 | - | Planning Problem | Planning by Dynamic Programming - Value Iteration; Policy Iteration; Monte-Carlo Tree Search (MCTS) |
| 3 | - | Monte-Carlo (MC) Methods | MC Prediction; MC Control; Off-policy Prediction via Importance Sampling; Off-policy MC Control |
| 4 | - | Temporal Difference (TD) Methods | TD Prediction; SARSA; Expected SARSA; Off-policy Q-Learning; Q-Learning Convergence - Contraction Mapping, Banach's Fixed-Point Theorem |
| 5-a | - | Temporal Difference (TD) Methods | Fitted Q-Learning; Double Q-Learning; n-step TD prediction; n-step SARSA |
| 5-b, 6-a | - | Approximate Prediction and Control | Value Function Approximation; Linear Methods; Tile Coding; Non-Linear Methods; Off-policy Divergence; The Deadly Triad |
| 6-b | - | Eligibility Traces | Forward and Backward View; Lambda-Return; TD-Lambda; SARSA-Lambda |
| 7 | - | Policy Gradient Methods | Stochastic and Deterministic Policy Gradient; Natural Policy Gradient; REINFORCE |
| 8 | - | Policy Gradient Methods | Actor-Critic (AC); A2C, A3C, DDPG; TRPO; PPO; SAC |
| 9 | - | RL as Probabilistic Inference | Graphical Model for Decision-Making; Policy Search as Probabilistic Inference; Maximum Entropy RL |
| 10 | - | Offline RL | Motivation; Distributional Shift; Policy Constraints; Implicit Q-Learning; Conservative Q-Learning |
| 11 | - | Model-Based RL (MBRL) | Model Learning: Planning with Models; MBRL via Policy Gradient; Latent Space Models; Dyna; Dreamer |
| 12 | - | Bandits | Multi-Arm Bandits; Contextual Bandits; Applications |
| 13 | - | RL for Training LLMs | RL with Human Feedback (RLHF); Preference based learning - Direct preference optimization (DPO), Reward-aware preference optimization (RPO), Group Relative Policy Optimization (GRPO). |
| 14 | - | RL Applications | RL for Real-World Applications and Case Studies |
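As a small taste of the planning module (Week 2), the sketch below runs value iteration on a toy two-state MDP. The MDP itself (its transition probabilities, rewards, and discount factor) is made up purely for illustration and is not part of the course material.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative only, not from the course):
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])
R = np.array([
    [1.0, 0.0],                 # rewards in state 0
    [0.0, 2.0],                 # rewards in state 1
])
gamma = 0.9                     # discount factor

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)     # Q[s, a]; P @ V sums over next states s'
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)       # greedy policy w.r.t. the converged values
print("V*:", V, "policy:", policy)
```

Because the Bellman optimality operator is a gamma-contraction (the topic of Week 4's convergence discussion), the loop is guaranteed to converge to the unique fixed point V*.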
Reference Books:
- Sutton and Barto, Reinforcement Learning: An Introduction, Second Edition, MIT Press, 2018 [PDF]
- Dimitri Bertsekas and John Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996 [PDF]
- Shie Mannor, Yishay Mansour and Aviv Tamar, Reinforcement Learning: Foundations [PDF]
- Csaba Szepesvari, Algorithms for Reinforcement Learning [PDF]
- Sergey Levine, Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review [PDF]