## Last Tickets of NeurIPS 2018

As the cards show, tickets for NeurIPS (NIPS) 2018 sold out while you were in the bathroom. The boom in paper submissions, conference participants, and sponsors indicates that AI is attracting more and more attention from both academia and industry. As a fledgling researcher, I definitely wished to take part in such a grand final of AI in 2018. Fortunately, my paper “Option Discovery from Visual Features in Deep Reinforcement Learning” was accepted by the Deep Reinforcement Learning Workshop at NeurIPS 2018. In this way, I got the chance to present our work on RL, and it became my first journey to a top-tier AI conference.

I focused on the sessions related to reinforcement learning, as well as some attractive computer vision works. In fact, NeurIPS has many theoretical works, such as all kinds of provable bounds and novel non-convex optimization methods. If you are not in these areas, I bet you cannot grasp the gist or understand them through only a few minutes of talks and posters. Even if there are many works you don't understand, the NeurIPS conference is still worthwhile for new AI researchers like me. The following sections are a recap of my NeurIPS journey. Hopefully, it is inspiring for your research work.

## Reinforcement Learning

### Reproducibility, Reusability, and Robustness

This section is a recap of the invited talk by Joelle Pineau. In my view, the most meaningful parts of Joelle’s talk are that it highlights the inconsistencies and pitfalls of current evaluation practice in RL research, and sketches a roadmap to Artificial General Intelligence (AGI) through RL approaches.

At the beginning of her talk, Joelle cited the concepts of reproducibility, reusability, and robustness from the NSF. Here I call them the “3R concepts”, and different disciplines share them. In RL research, we typically reproduce a proposed method by following the paper and training agents to check whether they are able to reach a similar reward level within the same number of training steps. The tough moment is when you try to reproduce works with a customized simulator, like distributed PPO. Some works with customized simulators for specific purposes DON’T publish code, making it hard for others to reproduce and follow their work. For reusability, we generally test the proposed method in many different environments without an extensive hyperparameter search. Some works also evaluate reusability through generalization, testing agents in a few-shot or zero-shot fashion. As for robustness, RL algorithms are notorious in this respect. Experiments show that the network structure, activation functions, reward scale, and random seeds can all influence the convergent reward level, making it hard to tell whether a proposed method truly beats the state-of-the-art methods.

On top of this common understanding of the “3R concepts”, Joelle pointed out that current evaluation metrics in RL research are unfair and hard to interpret. Even if your paper reports all the hyperparameters of the proposed models and baselines, random seeds are crucial to whether the performance difference is significant, as shown in the following.
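A minimal way to make this concrete (my own sketch, not from the talk): report the mean and spread of final returns over several seeds rather than a single best run. Here `final_return` is a hypothetical stand-in for a full training run.

```python
import random
import statistics

def final_return(seed):
    """Stand-in for a full training run; in practice this would train an
    agent with the given seed and return its final evaluation score."""
    rng = random.Random(seed)
    return 100.0 + rng.gauss(0.0, 15.0)   # seed-dependent noise

def evaluate_over_seeds(seeds):
    """Aggregate final returns across seeds instead of cherry-picking one."""
    returns = [final_return(s) for s in seeds]
    return statistics.mean(returns), statistics.stdev(returns)

mean, stdev = evaluate_over_seeds(range(10))
print(f"return over 10 seeds: {mean:.1f} +/- {stdev:.1f}")
```

With only a handful of seeds the standard deviation can easily dwarf the gap between two methods, which is exactly why a single-seed comparison is hard to interpret.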

At the end of the section on reproducibility, Joelle proposed a reproducibility checklist. If you follow it, the reviewers may not argue about your experiments anymore! Moreover, the infrastructure of the training platform also matters for reproducibility. From my own experience, I found that the convergent reward on GPU is lower than on CPU with the same hyperparameters and random seeds (tested on the Baselines ACER code). I am not sure whether the gap is caused by the difference in floating-point precision between CPU and GPU.

In the final section, Joelle defined the generalization error of RL over different seeds in the training and testing phases, and laid out a roadmap from RL to AGI. Her team trains RL agents on natural images for classification, and in a fancy Atari game with real-world video as the background. All of their attempts focus on the realism of observations and the diversity of tasks. However, in my view, the ability to learn general knowledge and to reason, and the flexibility of the policy network against an increasing action space, are indispensable for AGI, because these factors enable possible success in real-world applications within around $10^1\sim10^2$ trials.

### Exploration

#### Exploration in Structured Reinforcement Learning

Honestly, I am not good at reading theoretical RL papers with extensive analytical study. For this one, I hope to convey the gist and the intuition behind the work. As we all know, RL tasks are modeled as Markov Decision Processes (MDPs). The natural question, then, is: do the special properties of an MDP help to address the dilemma of exploration and exploitation? This work is a flagship in this direction. The authors define MDPs with some known properties as structured MDPs, represented by a set $\Phi$. They first analyze general structured MDPs to obtain a regret lower bound, then derive a tight bound for Lipschitz MDPs that does not scale with the sizes of the state and action spaces, thanks to their special properties shown in the following slide. Finally, they propose an algorithm, DEL (Directed Exploration Learning), with optimal exploration in Lipschitz MDPs.

What is the relationship between exploration and regret? By definition, regret is “the expected decrease in reward gained due to executing the learning algorithm instead of behaving optimally from the very beginning”. In other words, regret is the penalty for improper exploration (failing to find the same actions as the optimal policy). Thus, if we can find an algorithm that achieves smaller regret, we achieve better exploration.
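In symbols, for an ergodic MDP this definition is often written as follows (a standard formulation in my own notation, not copied from the paper):

$$R^{\pi}(T) \;=\; T\,g^{*} \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} r(x_t, a_t)\Big],$$

where $g^{*}$ is the long-run average reward (gain) of an optimal policy and $(x_t, a_t)$ is the state-action trajectory generated by the learning algorithm $\pi$.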

To analyze the regret using the properties of MDPs, they introduce a new concept, confusing MDPs, denoted by

in which $\mathcal{O}(x; \phi)$ represents the set of optimal actions in state $x$ under MDP $\phi$, and $\Pi^{*}(\phi)$ represents the set of optimal policies under MDP $\phi$. The KL-divergence condition indicates that optimal actions cannot lead to a notable difference in outcomes (next state and reward) between the two MDPs $\phi$ and $\varphi$ (condition i). In addition, the optimal policies under $\phi$ are not optimal under $\varphi$ (condition ii).
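Putting the two conditions together, the set of confusing MDPs can be written roughly as follows (my reconstruction from the description above; the paper's exact notation may differ):

$$\Lambda(\phi) \;=\; \Big\{ \varphi \in \Phi \;:\; \mathrm{KL}_{\phi \mid \varphi}(x, a) = 0 \;\; \forall x,\; \forall a \in \mathcal{O}(x; \phi) \;\; \text{(i)}, \quad \Pi^{*}(\phi) \cap \Pi^{*}(\varphi) = \emptyset \;\; \text{(ii)} \Big\}.$$

Intuitively, a confusing MDP $\varphi$ looks identical to $\phi$ as long as the agent only plays $\phi$'s optimal actions, yet demands a different optimal policy, so the agent must explore elsewhere to tell them apart.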

The regret lower bound they derive for general structured MDPs is given in Theorem 1.

The exploration rate is defined in the slide above.

Theorem 1 looks complex, as it is an optimization problem of semi-infinite linear programming. The intuition, however, is simple: to minimize regret, we need to try each state-action pair $(x, a)$ as little as possible, while making sure it is tried enough to distinguish confusing MDPs sharing the same structure $\Phi$, so as to avoid settling on suboptimal policies.
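As a loose analogy (a two-armed bandit rather than an MDP, and UCB1 rather than DEL; this toy sketch is my own illustration, not the paper's algorithm), count-based optimism realizes the same principle: the suboptimal arm is pulled only as often as needed to rule it out, roughly logarithmically in the horizon.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit; return the pull count of each arm."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms     # times each arm was pulled
    sums = [0.0] * n_arms     # total reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:       # pull each arm once to initialize
            arm = t - 1
        else:
            # optimistic index: empirical mean + confidence bonus
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                                    + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.9, 0.2], horizon=5000)
print(counts)   # the suboptimal arm gets only a small share of the pulls
```

The confidence bonus shrinks as an arm's count grows, so exploration is spent exactly where uncertainty could still hide a better (confusing) alternative.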

The other theorems on Lipschitz MDPs look interesting, as the regret bound does not scale with the sizes of the state and action spaces, but I did not dig into them any further, because the exploration rate depends on the count of taking action $a$ in state $x$, which fails in uncountable environments. Even more interestingly, a questioner argued in the oral session that this work is not the first to find a regret bound with this notable feature. I am not familiar with this area, so I cannot draw any solid conclusion here.

Pending... Update soon.

## Memorable Moments

### Richard S. Sutton Signing

Richard’s book *Reinforcement Learning: An Introduction* (Second Edition) sold out in minutes!

### Writing Your University

Standing by the wall, I was wondering how the students from Duke and Mila (i.e., U of Montreal) could sign at the top. Were there some expo workers from those schools?

### PyTorch 1.0 Stable Release

Facebook officially released the 1.0 stable version of PyTorch and was offering a tutorial on custom C++ extensions.

### Closing Reception

Half of the hall was for the reception, with beverages and snacks, and the other half was for the music party. Notably, the players on the stage were volunteers, and all of them were AI researchers.

Not only young researchers but also senior professors shared this beautiful moment and danced to the music. This was the end of NeurIPS 2018 in Montreal.