General Information • Project Description • Project Log • Links and Resources
The gold standard for behavior in game theory scenarios is the Nash equilibrium. A set of strategies is in Nash equilibrium if none can achieve higher expected payoffs by changing behavior. In the machine-learning community, learning algorithms typically do not work this way. Instead of evaluating possible complete strategies, they compute estimates of the values of individual decisions and select decisions with high value. Following this idea, we can say a "value equilibrium" is a collection of value estimates for which the payoffs of the corresponding high value decisions are accurately modeled. In this project, we will attempt to make this notion concrete, prove various properties about it, and compare it to the more common Nash equilibrium concept. Are value equilibria easier to find? The two concepts appear to be identical in zero-sum games and games with pure equilibria. Are there games that have Nash equilibria that are not value equilibria and vice versa?
Week 1:I met with my mentor. We established what I knew about game theory, programming, and reinforcement learning. He challenged me to create a program which would model a human player playing against a Q-learner. I made a non-graphically based program in python that accomplishes this task. Read up on the work that has been done in this area already.
Week 2:I debugged my program, which had some shortcomings. Prepared, and gave, a presentation of my proposed project for the summer. Met with my advisor and some of his other graduate students. One of the graduate students, Michael Wunder, practiced a presentation he was going to give at a conference. Continued to read and build my background knowledge of the subject. Dr. Fred Roberts presented an interesting look into voting strategies. I also attended the predeparture presentation on graph thoery. I had no previous knowledge on the subject and, consequently, was delightfully enlightened in the mathematical ways of viewing problems presented in cartography.
Week 3:Continued to debug my program. It now produces lists of Q-values. I found that to reduce the "noise" in my graphs, I had to run many trials(at least five orders of magnatude) to see accurate results. I graphed my findings in mathematica. The main challenge I faced here was how to move massive amounts of data points to mathematica from my python IDE. Now that I had good results, I'm attempting to impliment IQL by applying the epsilon greedy exploration. If I properly impliment the epsilon greedy exploration, I won't need to conduct so many trials.
Week 4:I've successfully implemented IQL into my program and have attained much more accurate graphs of the Q-values. My graphs exhibit non-converging Q-values for the Spoiled Child game. This is to be expected. As I graphed the payoffs I see that this exploration doesn't have any effect on the slope of the payoff line and thus neither player actually has any long term benefit by defecting. The question is posed that maybe the values in my spoiled child game are too low. I altered the payoffs to the column player in the second column of the game matrix to make the "defect strategy" more appealing. This change resulted into a very different game, and thus was not the correct direction for my research.
Week 5:This week I take a closer look at playing the game as a human player against the computer. I found that a Quickly Forgiving Tit-for-Tat strategy resulted in the highest payoffs to both players, and well above the expected Nash payoffs. I programmed a program in Mathematica that changes between the player action choices and the payoffs for each player. On Thursday we made a trip out to Telcordia Technologies.
Week 6:In an attempt to make my payoff graphs complete I was challenged to create a Gradient Accent Learner that would show where this AI learner fit into the mix of payoffs. I found that the Gradient Accent Learner actually did better than expected (Nash was expected), but still worse than the human vs Q or the Q vs Q payoffs. We took a fieldtrip to IBM and were introduced to some of the neat technologies they work on there.
Week 7:I prepared for my final presentation. The suggestion was also made that maybe human or monkeys action choice method is closer to a Q-learner that impliments a historical value based method. I started to work on a algorythm that would impliment one step of history into a Q-learner. We went to the Cancer Institute of New Jersey. It was really cool to see how enthusiastic the scientists were about the work they were doing, and its practical application.
Week 8:With the assistance of my mentors, I created a Q-learner with one step of history, and ran this learner against a regular Q-learner. The Program is almost over, so I dicussed with my mentor the possiblility of continueing my research into the school year.
Presentation 1 (ppt)
Presentation 2 (ppt)
Links and Resources
 Babes, M., Munoz de Cote, E., and Littman, M. Social reward shaping in the prisoner's dilemma. In 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1389-1392, 2008.
 Littman, M. Markov games as a framework for multi-agent reinforcement learning. Proceedings of eleventh international conference on machine learning, pp.157-163 San Francisco. CA. Morgan Koufmann. 1994.
 Singh, S., Kearns, M., and Mansour, Y. Nash convergence of gradient dynamics in general-sum games, Proceedings of the Sixteenth Conference on Uncertainty in Artifcial Intelligence, Morgan Kaufman, 2000.
 Straffin, P. Game Theory and Strategy. Washington DC. The Mathematical Association of America, 2006.
 Wunder, M., Littman, M., and Babes, M. Classes of Multiagent Q-learning Dynamics with epsilon-greedy Exploration. Proceedings of twenty-seventh International Conference on Machine Learning, 2010.