AIL7022/AIL722: Reinforcement Learning
Course Overview: Reinforcement Learning (RL) is a core area of machine learning focused on how intelligent agents learn to make decisions through interaction with an environment. This course will provide a comprehensive introduction to the principles, algorithms, and applications of reinforcement learning, with an emphasis on both theoretical foundations and practical implementation. Students will learn how agents can optimize long-term rewards by balancing exploration and exploitation, modeling problems using Markov Decision Processes (MDPs), and applying value-based and policy-based learning techniques.
Note: The course is currently offered under two different course IDs.
AIL722: Reinforcement Learning (3 Credit) - For old students only.
AIL7022: Reinforcement Learning (4 Credit) - For new students only.
Grading Scheme (AIL722 - 3 Credit): Minor - 30%, Major - 30%, Assignments - 30%, Quizzes - 10%.
Grading Scheme (AIL7022 - 4 Credit): Minor - 30%, Major - 30%, Assignments - 40%, Quizzes - 10%.
Prerequisites: Basic knowledge of Probability and Statistics.
Attendance Policy: Institute default (attendance below 75% leads to the grade being lowered by one level).
Audit Pass Criteria: Marks equivalent to B- or higher, plus >=75% attendance.
Lecture Hall & Time: TBA
Office Hours: By appointment only.
Class Communication: Moodle.
Tentative List of Topics:
| Week No. | Lecture Dates | Module | Topics |
|---|---|---|---|
| 1-a | - | Introduction | Course Logistics; Motivation; Connection to Psychology and Neuroscience; Sequential Decision-Making Problem; The RL Problem; Key Challenges |
| 1-b | - | Planning Problem | Deterministic Decision Processes; Markov Decision Process; Partially Observable MDP; Value Functions |
| 2 | - | Planning Problem | Planning by Dynamic Programming - Value Iteration; Policy Iteration; Monte-Carlo Tree Search (MCTS) |
| 3 | - | Monte-Carlo (MC) Methods | MC Prediction; MC Control; Off-policy Prediction via Importance Sampling; Off-policy MC Control |
| 4 | - | Temporal Difference (TD) Methods | TD Prediction; SARSA; Expected SARSA; Off-policy Q-Learning; Q-Learning Convergence - Contraction Mapping, Banach's Fixed-Point Theorem |
| 5-a | - | Temporal Difference (TD) Methods | Fitted Q-Learning; Double Q-Learning; n-step TD prediction; n-step SARSA |
| 5-b, 6-a | - | Approximate Prediction and Control | Value Function Approximation; Linear Methods; Tile Coding; Non-Linear Methods; Off-policy Divergence; The Deadly Triad |
| 6-b | - | Eligibility Traces | Forward and Backward View; Lambda-Return; TD-Lambda; SARSA-Lambda |
| 7 | - | Policy Gradient Methods | Stochastic and Deterministic Policy Gradient; Natural Policy Gradient; REINFORCE |
| 8 | - | Policy Gradient Methods | Actor-Critic (AC); A2C, A3C, DDPG; TRPO; PPO; SAC |
| 9 | - | RL as Probabilistic Inference | Graphical Model for Decision-Making; Policy Search as Probabilistic Inference; Maximum Entropy RL |
| 10 | - | Offline RL | Motivation; Distributional Shift; Policy Constraints; Implicit Q-Learning; Conservative Q-Learning |
| 11 | - | Model-Based RL (MBRL) | Model Learning: Planning with Models; MBRL via Policy Gradient; Latent Space Models; Dyna; Dreamer |
| 12 | - | Bandits | Multi-Arm Bandits; Contextual Bandits; Applications |
| 13 | - | RL for Training LLMs | RL with Human Feedback (RLHF); Preference based learning - Direct preference optimization (DPO), Reward-aware preference optimization (RPO), Group Relative Policy Optimization (GRPO). |
| 14 | - | RL Applications | RL for Real-World Applications and Case Studies |
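As a small taste of the planning module (Week 2), the sketch below runs value iteration on a toy two-state MDP. The MDP itself (its transition probabilities, rewards, and discount factor) is made up purely for illustration and is not part of the course material.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative only, not from the course):
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])
R = np.array([
    [1.0, 0.0],                 # rewards in state 0
    [0.0, 2.0],                 # rewards in state 1
])
gamma = 0.9                     # discount factor

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)     # Q[s, a]; P @ V sums over next states s'
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)       # greedy policy w.r.t. the converged values
print("V*:", V, "policy:", policy)
```

Because the Bellman optimality operator is a gamma-contraction (the topic of Week 4's convergence discussion), the loop is guaranteed to converge to the unique fixed point V*.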
Reference Books:
- Sutton and Barto, Reinforcement Learning: An Introduction, Second Edition, MIT Press, 2018 [PDF]
- Dimitri Bertsekas and John Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996 [PDF]
- Shie Mannor, Yishay Mansour and Aviv Tamar, Reinforcement Learning: Foundations [PDF]
- Csaba Szepesvari, Algorithms for Reinforcement Learning [PDF]
- Sergey Levine, Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review [PDF]