We have Vincent Sitzmann for today’s interview! Vincent is a Postdoctoral Researcher at MIT’s CSAIL and he just completed his Ph.D. at Stanford. Vincent’s research interests lie in the area of neural scene representations, that is, the way neural networks learn to represent information about our world. One of Vincent’s works that stirred the Deep Learning community is Implicit Neural Representations with Periodic Activation Functions, also referred to as SIREN. The results of SIREN speak for its efficacy. Vincent developed a Colab Notebook that is more than enough to get us started with SIREN. Vincent’s work on Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations won the “honorable mention” award at NeurIPS 2019.
An interview with Vincent Sitzmann, Postdoctoral Researcher at MIT
Sayak: Hi Vincent! Thank you for doing this interview. It’s a pleasure to have you here today.
Vincent: The pleasure is all mine, thanks for having me!
Sayak: Maybe you could start by introducing yourself: what your current interests are, what you are currently working on, etc.?
Vincent: Sure! I’m currently a Postdoc with Prof. Joshua Tenenbaum, Prof. Fredo Durand, and Prof. Bill Freeman at MIT. Before that, I finished my Ph.D. at Stanford in Prof. Gordon Wetzstein’s group. My research interest lies in neural scene representations. The best way to explain what that means is to think about your mental model of the room you’re sitting in right now. Your brain has some sort of representation that encodes information about materials, geometry, affordances of objects, movements, weights of objects, and so on. And all from mostly 2D observations (i.e., images). It’s very impressive, really! I try to build machine learning models that learn to infer representations of scenes that are similarly rich from images. Note that I’m not saying that I’m studying how this works in the brain, though of course I’m very interested in that, but rather, I’m building algorithms that hopefully one day can solve the same kinds of problems, but maybe in an entirely different way! My current tool of choice is the class of “implicit neural representations” — these represent a signal or a scene as a function that maps a coordinate to whatever is at that coordinate. This function is parameterized as a neural network, and so finding this representation is equivalent to finding the weights of that neural network!
These representations have a number of properties that make them very attractive for this task. They naturally live in 3D, just like the room you’re sitting in (ignoring time for now). They have “infinite” resolution and global support, meaning that they can be sampled at any 3D coordinate and at arbitrary density. The memory they require is independent of some sampling resolution, as opposed to point clouds or voxel grids. And lastly, they are “compact”, in the sense that they do not distribute information across space, like point clouds for instance, but all the information is encoded in the weights!
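To make the idea of "a function that maps a coordinate to whatever is at that coordinate" concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than Vincent's code: the helper names (`make_mlp`, `query`, `sample_line`, `count_parameters`), the layer sizes, the random untrained weights, and the tanh placeholder nonlinearity. It only shows that the same set of weights can be queried at any 3D coordinate and at any sampling density.

```python
import math
import random


def make_mlp(layer_sizes, seed=0):
    """Create random (untrained) weights for a small fully connected network."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(-bound, bound) for _ in range(n_out)]
        params.append((weights, biases))
    return params


def query(params, coord):
    """Evaluate the representation at one continuous coordinate (x, y, z)."""
    h = list(coord)
    for i, (weights, biases) in enumerate(params):
        h = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            # Placeholder nonlinearity, chosen arbitrarily for this sketch.
            h = [math.tanh(v) for v in h]
    return h


def count_parameters(params):
    """Number of stored values; this is the whole 'memory' of the scene."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


def sample_line(params, n):
    """Sample n points along the x-axis between (-1, 0, 0) and (1, 0, 0)."""
    samples = []
    for i in range(n):
        t = i / max(n - 1, 1)
        samples.append(query(params, (-1.0 + 2.0 * t, 0.0, 0.0)))
    return samples


# A hypothetical scene: 3D coordinate in, one scalar (say, a density) out.
scene = make_mlp([3, 16, 16, 1])
coarse = sample_line(scene, 5)     # 5 samples
fine = sample_line(scene, 500)     # 500 samples from the same weights
print(count_parameters(scene), len(coarse), len(fine))
```

The sampling density changes between the two calls, but the number of stored parameters does not. A voxel grid, in contrast, would need more memory for the finer sampling.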
Sayak: This is very helpful, Vincent. More like a SIREN 101. Thank you!
You have backgrounds in differentiable camera pipelines, VR, and human perception. Would you like to shed some light on some of your favorite works in those areas and also what made you switch to the field of 3D scene representations?
Vincent: Indeed, I’ve worked on a variety of different things during my Ph.D.! I’ve always been excited about everything that relates to computer vision and scene understanding. My favorite takeaway from my time before working on 3D neural scene representations is probably wave optics, because it offers such a drastically different perspective on how light transport and image formation work, compared to modern computer graphics. I can recommend the book “An introduction to Fourier optics” for that! However, fundamentally, I have always been fascinated by how effortlessly we humans can solve problems that involve incredibly complex vision, and how far away we are from that with our modern technology. That is what eventually brought me to artificial intelligence for computer vision, and from there, to neural scene representations.
Sayak: This vocabulary must have been very helpful for you. What kind of challenges did you face when you were starting specifically in the computer vision field? How did you overcome them?
Vincent: Most of the challenges were of a rather practical nature. I didn’t have a pipeline in place that allowed me to easily render out datasets. I never took a computer graphics course, or a course on the fundamentals of computer vision, such as projective geometry, so my start here was a bit rough. Then, I had never written a computer vision paper before. And of course, I was unsure whether my perspective and my research would actually be a worthy contribution. I think I overcame all these challenges for three reasons: One, I was (and still am) very excited about Computer Vision and Machine Learning, so I’d always drift back to that because it’s the most exciting thing I’ve found to work on so far. Next, I had amazing mentors and collaborators like Gordon and Michael Zollhöfer, who were willing to figure things out with me, who are similarly passionate, and who created an incredibly productive environment. And lastly, I’m very persistent, and I can’t really rest until I figure out things I’m getting stuck on ;)
Sayak: Vincent, your tenacity is very inspiring. You never submitted to your insecurities regarding your research and instead, you treated them as opportunities. I am interested to know about some of the capstone projects you did during your formative years?
Vincent: I think some of my most formative projects actually happened way back in my undergrad when I did an exchange in Hong Kong. I took an intro class to robotics there, where we learned about inverse kinematics and Lie groups for parameterizing coordinate transforms. I really, really enjoyed that math! When I came back to Munich, I decided to do my Bachelor’s thesis on a monocular SLAM algorithm that had been developed in Prof. Daniel Cremers’ group. There, I then discovered the machine learning literature, and I was hooked ;) When I found my way back to Computer Vision from Computational Imaging, it really felt like coming back, though I had never done research in that field. My first vision paper, DeepVoxels at CVPR 2019, was when I started building the mental framework for Neural Scene Representations, the necessity for inductive biases (such as 3D structure), etc. With Scene Representation Networks at NeurIPS 2019, we showed for the first time that it is possible to train implicit neural scene representations just from 2D supervision. At that point, I was convinced that implicit scene representations are the way to go, and that’s where I still am :)
Sayak: I am starting to get overwhelmed :D
Let’s switch gears to SIREN now. SIREN is one of the most amazing works in Deep Learning I have seen in recent times. I am interested in learning about the initial grounds that led you to the idea of SIREN.
Vincent: It’s very kind of you to say that! I’m glad that you’re enjoying our work. I’d first like to highlight that SIREN really was a group effort, and that each of the authors — Julien Martel, Alex Bergman, David Lindell, and Gordon — really pulled their weight here!
The initial idea for SIREN was something I had been mulling over with Julien for a long time. We were both interested in solving boundary value problems (BVPs) with implicit representations. After all, it seems like the perfect fit: One problem in solving BVPs is that sometimes, it is extremely hard (maybe impossible) to find a closed-form expression for the function you’re looking for! What would be more natural, then, than to parameterize the solution to a BVP with a neural network? That had been attempted before, but I was bringing a different background, not from physics, but from implicit representations for scenes. The next piece of the puzzle was that multi-layer perceptrons (MLPs) with ReLU, Tanh, and other conventional nonlinearities are really bad at parameterizing high-frequency signals and fine detail. There is a lot to this, but one thing to think about is that conventional nonlinearities are local: They are effectively nonlinear only in a small fraction of their input domain! ReLUs, for instance, are nonlinear only at the kink. Now, if we parameterize a signal implicitly, that means that for something interesting to happen at a certain coordinate, one ReLU has to be activated close to its kink at that coordinate. Another way to say this is that these MLPs aren’t shift-invariant: They do not easily learn to apply the same function at two separate coordinates, exactly because the activations are local. That’s how I came up with the sine as a nonlinearity: It is not local, it’s nonlinear everywhere, and you can convince yourself that it’s very easy for an MLP with a sine nonlinearity to, for instance, create a periodic pattern (think of a grid of black dots on a white background; with two sine nonlinearities in the first layer you’re almost there!). The final insight was that the sine also has this cool property that its derivative is again a sine, just shifted in phase (a cosine).
These three insights convinced us that this had to work, and then we just had to figure out how. That’s how we ended up investigating the initialization and thinking about how we can bring out these properties.
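The three insights above can be illustrated with a small standard-library Python sketch. It is not the authors' implementation. The function names are hypothetical, the biases are set to zero for simplicity, and the frequency scale `OMEGA_0 = 30` together with the uniform initialization bounds are the values we recall from the SIREN paper, so treat them as assumptions to check against the paper and the Colab notebook.

```python
import math
import random

OMEGA_0 = 30.0  # assumed frequency scale, as recalled from the SIREN paper


def init_sine_layer(n_in, n_out, is_first, rng):
    """Draw weights uniformly; bounds are assumed from the SIREN paper."""
    if is_first:
        bound = 1.0 / n_in
    else:
        bound = math.sqrt(6.0 / n_in) / OMEGA_0
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out  # simplification made for this sketch
    return weights, biases


def sine_layer(weights, biases, x):
    """Linear map followed by a sine: sin(omega_0 * (W x + b))."""
    return [math.sin(OMEGA_0 * (sum(w * xi for w, xi in zip(row, x)) + b))
            for row, b in zip(weights, biases)]


def sine_unit(x, w, b):
    """A single one-dimensional sine unit."""
    return math.sin(OMEGA_0 * (w * x + b))


def sine_unit_derivative(x, w, b):
    """Analytic derivative: the same sine, scaled and shifted by pi / 2."""
    return OMEGA_0 * w * math.sin(OMEGA_0 * (w * x + b) + math.pi / 2)


def grid_pattern(x, y, freq=OMEGA_0):
    """Two sine units, one along x and one along y, summed.

    The sum peaks on a regular grid, so thresholding it gives evenly
    spaced dots: a periodic pattern from only two sine nonlinearities.
    """
    return math.sin(freq * x) + math.sin(freq * y)


rng = random.Random(0)
first = init_sine_layer(2, 8, is_first=True, rng=rng)
hidden = init_sine_layer(8, 8, is_first=False, rng=rng)
features = sine_layer(*hidden, sine_layer(*first, [0.25, -0.5]))
print(features)
```

Comparing `sine_unit_derivative` against a finite-difference estimate of `sine_unit` is a quick way to convince yourself that differentiating a sine unit yields another sine unit. Likewise, `grid_pattern` repeats exactly when either coordinate moves by `2 * math.pi / OMEGA_0`.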
Sayak: This is so guided. I really like the way you have attended to the drawbacks of the common non-linearities that we use in neural nets. We have seen how SIREN is a much better choice for introducing non-linearity in Neural Networks than other options such as vanilla positional encodings, ReLU, etc. Technical debt and interpretability often go hand in hand. To that end, what would you say about SIREN with regard to its adaptability to day-to-day learning problems?
Vincent: We have been thinking about this a bit, but fundamentally, I really think that the properties I alluded to above are most useful for implicit representations. Generally, I’d say, the sine nonlinearity is useful if your application requires the three points above: Shift-invariance and global support in the input domain of the sine, and well-behaved gradients / higher-order derivatives. For convnets, for instance, shift-invariance in the feature domain may be useful, but it’s not immediately clear to me how — shift-invariance for images is key in the spatial dimensions, and convnets already have that. That being said, I could imagine that the property of well-behaved gradients and higher-order derivatives may be very useful across a range of applications! Especially, of course, in applications where higher-order derivatives appear, such as Neural ODEs.
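As a short worked illustration of what "well-behaved gradients and higher-order derivatives" can mean, the derivatives of the sine and of the ReLU can be compared directly. This is standard calculus, written out here as a sketch and not taken from the interview:

```latex
\begin{align*}
\frac{d}{dx}\sin(x) &= \cos(x) = \sin\!\left(x + \tfrac{\pi}{2}\right), \\
\frac{d^{n}}{dx^{n}}\sin(x) &= \sin\!\left(x + \tfrac{n\pi}{2}\right), \\
\frac{d}{dx}\operatorname{ReLU}(x) &=
  \begin{cases} 0, & x < 0 \\ 1, & x > 0 \end{cases},
\qquad
\frac{d^{2}}{dx^{2}}\operatorname{ReLU}(x) = 0 \quad \text{for } x \neq 0 .
\end{align*}
```

Every derivative of the sine is again a sine with the same amplitude, while the second derivative of the ReLU vanishes everywhere it is defined. A loss that depends on second-order derivatives therefore gets no signal from a ReLU unit.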
Sayak: Thanks for clarifying this, Vincent. SIREN, in general, gives us a principled system for designing neural architectures that are better capable of approximating implicit signals from the data. How much would it affect the other areas such as Model Pruning, Transfer Learning, and so on?
Vincent: If this was a paper, I’d write “These are interesting directions for future work” — I think there’s a lot of open and exciting questions here!
Sayak: Something to tinker with! Your academic excellence speaks for itself. You received a Fulbright scholarship to pursue your Master’s at Stanford, among many other things. Would you like to talk a bit about your journey to Stanford?
Vincent: My journey to Stanford was very fortunate. Originally, I didn’t want to do a Ph.D., and I only applied for a Master’s degree in Computer Science. As you know, in the US, these degrees are very expensive. Since universities in Germany are generally free, we do not have a system of financial aid in place for tuition on this order of magnitude, so I couldn’t take out a student loan. Stanford does not offer financial aid for internationals either. So, my only way was indeed a scholarship. There are only two scholarship programs in Germany that offer support of this magnitude. Fulbright is one of them, and I was fortunate to receive one. I was also supported by other scholarships, from the DAAD (German Academic Exchange Service) and the German National Merit Foundation. At Stanford, I found that one can also finance a degree by working as a TA or RA, and I did that for the second part of my Master’s. However, it’s not guaranteed that such positions are available, and as you might have heard, the German stereotype is that we’re very risk-averse, so looking back, I probably wouldn’t have come to Stanford if I hadn’t had the scholarship. Today, I’d probably go anyways!
Sayak: You are setting an example here, Vincent. Thank you very much for sharing your story.
Being a practitioner, one thing that I often find myself struggling with is learning a new concept. Would you like to share how you approach that process?
Vincent: I have found that I really do learn and think best when I discuss things with someone. It can be literally anyone who is interested in the same problem, not necessarily someone who knows the answers. For this to be most effective, it has to be a judgment-free zone — there are many things I don’t know anything about, but my collaborators do, and vice versa. If one is embarrassed to bring these things up, and the other side doesn’t have the patience to share their perspective & insight, that doesn’t work. So one has to create a judgment-free environment, where everyone is trying to establish a common ground of knowledge — eventually, we always arrive at a level where everyone has an overview of the knowledge of everyone in the room, and that’s then when it is easiest for me to acquire new concepts or push for new insights.
Sayak: Absolutely on board with this philosophy! Any advice for beginners?
Vincent: Don’t be intimidated by fancy-looking math — in the end, you’ll find an explanation that is intuitive to you. Don’t be afraid to do things that other people don’t agree are a good idea — those are sometimes the best ideas since if everybody already knows it’s a good idea, no-one will be surprised when it ends up working ;) Find a topic you are genuinely curious about, in the sense that you really, really want to understand how it works. And lastly, when reading papers, look for new ideas and concepts you’ve never seen before, not for improvements on some baseline — almost no new idea or concept is competitive with the state-of-the-art in its infancy since it hasn’t seen any tweaking yet!
Sayak: Thank you so much, Vincent, for doing this interview and for sharing your valuable insights. I hope they will be immensely helpful for the community.
Vincent: Thank you so much for having me, Sayak!