In this episode, our host Anna Petrovicheva, CTO of OpenCV.AI, talks to Daniel Cremers, Professor of Informatics and Mathematics and Chair of Computer Vision & Artificial Intelligence at the Technische Universität München and one of the most influential scientists in Germany. It is a thought-provoking conversation about Daniel's career path from physics to artificial intelligence, his experience organizing ECCV, one of the largest computer vision conferences, the research challenges ahead in AI, and the cooperation between science and the IT industry.
The video was recorded in November 2020.
Anna Petrovicheva (AP): Hello and welcome to the OpenCV.AI for Entrepreneurs podcast. My name is Anna Petrovicheva, and today I'm going to be talking with Professor Daniel Cremers from the Technical University of Munich, one of the most prominent researchers in computer vision and artificial intelligence. Hi, Daniel, it's a pleasure to host you on the OpenCV.AI for Entrepreneurs podcast.
Daniel Cremers (DC): Hi Anna, it's a great pleasure to meet you!
AP: My first question is about your path into artificial intelligence and computer vision. Could you please tell us how you decided that this would be the field you would focus your scientific research on?
DC: For me, it was a long journey to come to computer vision and artificial intelligence. I actually started in physics and mathematics because those were the subjects I was strongest in at high school; I was just not so strong at the others. That attracted me to the field, along with the desire to somehow understand the world and be capable of modeling it. I was very happy in physics, but one thing disconcerted me: by and large, people in physics research work on topics that have almost nothing to do with everyday life. You would be working on quantum mechanical models, on string theory, and so on. An eye-opening moment for me was when I came home during my master's thesis in quantum chaos and my mother asked me to tell her what my master's was about. My mother has no background in physics. I said that right at that moment I was computing the energy levels of a hypothetical quantum mechanical system to see how chaotic behavior arises in quantum mechanics, and then there was just dead silence in the room. My mother didn't know what to say, and at some point she said, “Well, as long as you enjoy it...”. And then I realized I didn't at all. I do enjoy math, I do enjoy sophisticated mathematics, but what I want to do is things that impact our world, that affect people's everyday lives.
I should say the first thing I wanted to go into was neural networks. I was fascinated with neural networks at the end of my physics studies, so I even sent an email to Geoffrey Hinton asking if he would be interested in taking me as a Ph.D. student, but he didn't respond. Other people did respond and said that I seemed young, smart, and talented, and wondered why I wanted to do neural networks, which was a completely dead area; this was the end of the 90s. They all told me it was a dead end, but I thought it would be fascinating, so I actually didn't listen. I did go into neural networks, but in biology. So, my first Ph.D. was in neurosciences. I spent a whole year working towards a Ph.D. in neural computation and modeling the visual pathway of mammals. But for some reason, the project didn't move on quickly enough. Maybe I was too young, too impatient. I decided to switch, dropped all of it after a year, and went into computer vision. And I must say that topic has fascinated me from day one. I remember the people who fascinated me were my then advisor Christoph Schnörr, and in particular also the papers of Joachim Weickert on anisotropic diffusion for image processing. To see sophisticated mathematics do things to images – process them, remove noise, enhance structures – was just fascinating for me! That fascination has never left me; it's still very strong today.
AP: It's a very interesting path, and I really enjoyed hearing it! There was a rise of neural networks in 2012 and later, so you were probably able to understand more about how they work inside because you knew how the neural pathways in mammals work, weren't you?
DC: It's a very good question! I would like to tell you about the one year in the neurosciences: at first, I thought it was a wasted year, but very quickly I realized it was not at all. This is also my recommendation to young talents who are going into research careers: even if you take detours, they are not wasted time. You learn a lot, it is often complementary, and it can help you later on. And you raised exactly that question: how did it help me? When neural networks came back in 2012 and subsequent years, the first thing I realized was that the neural networks we tend to use in vision are in many ways different from biological neural networks, which are often much more complex and much more sophisticated. For example, for computational challenges in vision we typically use feedforward networks; increasingly, recurrent networks are also coming into play. But the networks we have in the visual system are much more sophisticated, starting with spiking-type neurons, where the Hodgkin-Huxley equations govern the spiking activity of a single neuron, and then also the significant feedback loops in the visual pathway, which are not fully explored and understood yet, even in mammals. So I think there's a lot more in biological neural networks that we can possibly include in computer vision.
AP: I was actually going to ask you about how you see the future of computer vision and artificial intelligence in, say, upcoming decades, so what do you think would be the areas where we will focus most on? Maybe some new types of data, some new types of algorithms, what do you think will emerge?
DC: One thing I've definitely observed in my career over the last 20 years is that computer vision has become dramatically more relevant to society. When I started, a lot of my friends and colleagues would ask me what I was working on, I would say “computer vision”, and they often had no clue what that was about or why it was of interest. Admittedly, I've always been fascinated with the topic of enabling machines to see, to perceive the world as we humans do. This is one thing I also learned from neurosciences: for humans, the visual sense is one of the most relevant. In macaque monkeys, for example, 50 percent of the cortex neurons are constantly processing visual information, so you could say that in some sense we, as mammals, seem to dedicate 50 percent of our brainpower to processing visual signals. I think that says two things. First of all, it says that these signals are very important for survival, for everything we do. Secondly, it says that this is not an easy problem to solve, otherwise we wouldn't dedicate that much brainpower to it. So, if you want to reproduce the capacities of humans in a machine, maybe ultimately building a robot, possibly even copying the human, one of the most important challenges you need to solve is the perception challenge: in other words, making machines see. That doesn't mean just placing a camera, but processing the visual information that comes from that camera. I think we've come very far in the last 20 years, and some of the popularity we now see in computer vision, the growth in the community, comes from the fact that we've actually achieved quite a lot, that algorithms now actually work. And I think to some extent that is because people put code online, like OpenCV; making whatever works available to the world was a great contribution.
That significantly boosted the performance of computer vision on real-world challenges. With it came the popularity of computer vision and this dramatic growth, where every year the conferences are essentially doubling in the number of participants. When I organized ECCV in Munich in 2018, I was expecting 1500 participants because that was the number at the previous conference, but we were very far from it: we ended up with 5000 registrations, more than triple the number from one conference to the next. So, this is crazy growth in the vision community, and all the seniors who are running these conferences still don't know how it is going to pan out in the long run.
AP: Totally! I remember ECCV in Munich two years ago, when everybody still traveled, unlike now. It was a great conference, by the way, so congratulations on organizing it. You did that well, thank you!
DC: Thank you, it was a lot of work, but I did enjoy it a lot, and I think it was great as I heard a lot of positive feedback. I’m happy people liked it.
AP: I’m thinking about some new types of data that artificial intelligence is going to utilize in upcoming years: you did quite a bit of research in 3D and 3D space understanding, and my feeling is that in the upcoming years, 3D data will become more like a commodity than it is now because of emerging personal devices that have 3D capturing capabilities. So, what do you think about that?
DC: I think in many ways the 3D world will get more into focus for computer vision. As you say, it's not just that more sensors are available, like the Kinect camera and other RGB-D sensors that became popular, with the capacity to scan 3D geometry at fairly good resolution at 30 or 60 frames a second. What has also come into play is much more powerful algorithms to reconstruct the world from even standard cameras. We've worked a lot on that, and on the precision and robustness with which you can track a moving camera. We are now at a level where we can track a stereo-inertial camera system over many kilometers. Even over four or five kilometers, the total drift in tracking is maybe in the range of one or two meters, so the precision of the systems we've been developing is significantly higher than anything we've seen before. The real-time capability is also important. With all of these algorithms, we are now able to recover the 3D world around us in very good detail.
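As a rough back-of-the-envelope reading of the drift figures quoted above, relative drift is the accumulated position error divided by the distance traveled. The short Python sketch below only does that arithmetic. The function name is hypothetical, and pairing the smallest drift with the longest distance (and the largest with the shortest) is an assumption made purely to bracket the range.

```python
def relative_drift_percent(drift_m, distance_km):
    """Accumulated tracking drift as a percentage of the distance traveled."""
    return 100.0 * drift_m / (distance_km * 1000.0)


# Figures quoted in the interview: one or two meters over four or five kilometers.
best_case = relative_drift_percent(1.0, 5.0)   # 1 m of drift over 5 km
worst_case = relative_drift_percent(2.0, 4.0)  # 2 m of drift over 4 km

print(f"relative drift: {best_case:.3f}% to {worst_case:.3f}%")
```

Under these assumptions the drift works out to a few hundredths of a percent of the path length.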
The main question that arises here is how to understand the 3D world, how to distinguish 3D shapes and measure their similarity. For humans these are seemingly simple questions: I have two hands, and I would say they're similar. But how do you teach a machine to measure that similarity? It entails aspects of correspondence, and correspondence is arguably one of the nastiest computational challenges that we typically face in vision. There are a lot of fascinating questions that revolve around 3D shape: processing shape, analyzing shape, comparing shapes, measuring similarity, defining metrics in the space of shapes, and also things like interpolating shapes. If I see one human here and one human there, can I create a family of intermediate shapes? Humans are somehow capable of analyzing their world in this fashion, and I believe we need to make machines reproduce that capacity to understand. With deep networks on ImageNet, we can recognize objects in planar images; I would say it's not a solved problem, but we're close to it. Understanding the 3D world, by contrast, still has a long way to go.
AP: Thank you! My feeling also is that in 2D space many problems are now data-bound: as long as you have a very good data set, you are able to solve the problem in a very good way. But in 3D we are still algorithm-bound, we don't know how to solve the problems, so it will be exciting to see how it is going.
DC: This is a good point! As you said, the data-driven approach became extremely popular with deep networks. Essentially, if you have sufficient training data to cover the space that you want to work in, then deep networks do amazing things and achieve amazing results. But indeed, for 3D understanding, for 3D reconstruction and SLAM, a lot of the deep-learning-based approaches do not perform as well as classical, say, optimization-based techniques. This is one of the most active areas my team is exploring now: how to bring together the performance of classical optimization techniques in SLAM with the predictive power of deep networks.
AP: Exciting area to be in right now!
AP: I also wanted to discuss the following. We already touched on this: my feeling is that, unlike in many other sciences, in software engineering and specifically in computer vision much of the software and many of the papers are open source and available for free to everyone. So, I wonder what you think: why are software engineering and computer vision so different in this sense from other sciences, where papers are mostly available in paid journals? There, journals are the source of innovation, not GitHub or arXiv.
DC: This is a good question. I should say that the other aspect that always attracted me to computer vision is that it is a very active community with people from lots of different backgrounds who are very creative, who want to construct new things, and who are willing to change the world in many ways, including how research is done, how it is communicated, and how it is made publicly available. For example, as you said, putting papers online is standard. From day one in the vision community I've put all my papers on my web page, and that was obvious. Sure, there were some publishing companies that at the time said officially I was not allowed to do that, but I figured nobody was going to stop me. I think the vision community can serve as an example for other communities, and this is not only about papers but also about publishing code, for example. It is very important for the reproducibility of science that people can actually try your code and verify and validate the statements that you make in the paper. Obviously, this also drastically accelerates research and innovation if people don't have to start from scratch every time but can build on the code that we make publicly available.
So, for example, for some of the SLAM algorithms that we've published over the last years, LSD-SLAM or Direct Sparse Odometry, we made the code publicly available, and they have been immensely popular, with thousands of people building upon them. In addition, there are also data sets. I used to work in the neurosciences, and it's super complicated there. There are often also ethical issues that people realize they're facing, because experiments on a monkey or a cat are something you should do as little as necessary. After all, you don't want to hurt any animals in the process; you want to avoid that. But in the neurosciences, when someone records a data set and someone else wants to do research, they have to record their own data set, which means they have to kill yet another mouse, or monkey, or cat, or whatever. These are serious ethical issues that come with not sharing the data that people acquire. Now, a lot of people in other disciplines ask me what my interest is in sharing data: I put so much effort into it, and now someone else can just download it and skip that effort, so what's my benefit? Visibility! In vision, you get a lot of visibility once you publish a data set; the ImageNet data set is the best example of how much impact you can create by just releasing a data set to the public. If there's one thing that researchers are always keen on in an academic career, it is visibility and impact. This is a standard performance measure, and it encourages people to make data sets publicly available. This is something that I try to bring across to other disciplines, so that they follow the example we set in the computer vision community of making things publicly available, because it advances humanity.
AP: One other thing that I think is very positive for the computer vision community and its development is that industry and academia work together on advancing the state of the art, and it's great that many companies, big and small, also donate their studies, their data sets, and their software tools to open source. So, my next question is the following: obviously there are positive sides to industry and academia working together in artificial intelligence, but there are also downsides, right? One downside is that right now salaries in artificial intelligence are so high that many young people go to industry instead of staying in academia, finishing their Ph.D., and staying at the university. So, there are pros and cons to this collaboration between industry and academia. What do you think is the best way for these two important entities to work together so that it benefits both sides?
DC: I think the interaction with industry is a good thing, but it's important to structure it in a meaningful way. As you say, the interest of industry in our field is great, and it advances our field in many ways as well. But indeed, one of the challenges that you mentioned is that there is a certain competition when it comes to recruiting, for senior talents in particular. This is a problem that is becoming increasingly visible and that people take note of. The other day I had a very long Zoom call with our chancellor Angela Merkel. She wanted to know what challenges Germany and Europe are facing in this age of artificial intelligence, and I said that for me one of the biggest challenges from an academic perspective is how to retain talent when industry salaries are skyrocketing. At the same time, we at the university have to offer fixed salaries; the German system doesn't allow me to offer someone a higher salary even if I had the money. It's all fixed and regulated. Angela Merkel took note of many of these points, and of that one in particular, so there are now discussions about how we can improve on it.
One way to improve, I think, is something I often tell my students when they're moving on after getting their Ph.D. I see there are very attractive opportunities in industry, and the salaries are great. But one thing I've seen for myself is that there is only so much money you can spend in a lifetime, and when you work in vision and machine learning you will always have enough money to live on, so this is not the concern. In the end, you have to ask yourself what you want to achieve in this world, what you want to leave behind, what impact you want to have. I can just say from my own side: I realized that I don't need that much money, so what's the point? A question my wife often asks me is what I want to do with all this money that I make. Plus, there is one other opportunity: if you really want to make big money, then startups are a great way to do so. I have a lot of colleagues in the field who've shown amazingly successful examples of startups, for example, Amnon Shashua with Mobileye, or Michael Bronstein and Alex Bronstein, who are constantly creating new startups that often become quite successful. I'm a little bit fond of startups because they give you even more freedom to pursue the things you think should be done, in a way that is often not so easy in a bigger corporation. But there are benefits to all of these career paths, and I think we should just be very happy that in computer vision there are almost endless opportunities for everyone.
AP: Thank you! Speaking about someone young, what would you recommend to someone who is only starting their career path or a scientific path in artificial intelligence and computer vision? What to focus on, what paths to choose, do you have some thoughts regarding that?
DC: In terms of research challenges, I think there are many. The most important question I often get, being one of the more senior experts in the vision community, is where the future challenges are. The truth is it's up to all of us to discover that. I mean, no one has a patent on knowing what the future holds, and I've experienced it myself: I wanted to do neural networks, and they told me this was a dead end, and I guess I should not have listened to the experts at the time. There are many such examples throughout history; one of the most famous is Max Planck. When he wanted to go into physics, senior experts and leaders in the physics community told him that he seemed very talented, but that in physics there were no open challenges, it was a solved problem, and they suggested he go into a different discipline. Fortunately for the world, Max Planck decided not to follow that recommendation. I think all these stories say that ultimately we have to come up with our own ideas of where we think the future lies. That's important because the source of creativity in the community is that everyone develops their own ideas. I think the community, and humanity in general, acts far too much in the spirit of lemmings: one guy says we should all do graph cuts, and then the whole community does graph cuts; then someone says we should all do deep networks, and the whole community does deep networks. I think there is room for more independence in exploring one's own ideas, and this is important. In my lab, for example, sure, we work a lot on deep networks, but we also work on many other topics. Most importantly, when I recruit young talents and they ask me what they should work on, I ask them to tell me what their interests are, what they want to do. It takes time to develop one's own ideas, but it's really worth it.
As for me, I have ideas about where I think fascinating challenges lie. One of the most fascinating challenges, and a largely unexplored one, is how we can not just perceive the 3D world but recover a complete kind of simulation of the world. For example, we've developed algorithms where we can reconstruct an action in space and time, frame by frame, but what we don't recover is the physics of the world, and here I think there is a huge discrepancy between human visual processing and machine visual processing. This has fascinated me ever since I was a child: when someone throws you a tennis ball, it's amazing how in a fraction of a second you can put your hand somewhere and catch it. This capacity has always impressed me quite a lot. I spent most of my childhood playing different ball games – soccer, volleyball, basketball, you name it – and I think one of the fascinations there for me was the capability of humans to predict the future and extrapolate what's going to happen next, to the extent that we can catch a flying ball. It implies that on some level we can reproduce the physics of the world. We can predict how a ball is going to bounce off the wall; we don't need to reconstruct its position at every point until it reaches the hand, because that would be far too slow. I tried it: if someone throws you the ball and, halfway through its flight, you close your eyes, you can still catch it. Not every time, admittedly, but many times. I think this goes back to some capability of inferring the physics that govern our world: the mass, the momentum, etc. that govern the dynamics of the objects around us.
To my mind, in computer vision, we've been far too much focused on reconstructing the geometry of the world, the surface of the object, possibly its albedo, its reflectance properties, but far too little effort has gone into recovering the physics behind the world, behind the dynamic world that we live in, and I think ultimately we need to move towards recovering the physics to make more long-term predictions of what's going to happen next.
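To make the ball-catching example concrete, here is a minimal Python sketch of predicting a ball's future position from just two early observations instead of tracking it at every frame. It assumes idealized ballistic motion in a 2D plane (constant gravity, no air drag or spin). The function names and numbers are hypothetical and are not taken from the interview or from any of the systems mentioned.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed constant, no air drag)


def estimate_velocity(p0, p1, dt):
    """Estimate the velocity at p0 from two (x, y) observations dt seconds apart.

    Uses the ballistic model y1 = y0 + vy*dt - 0.5*G*dt^2, solved for vy.
    """
    x0, y0 = p0
    x1, y1 = p1
    vx = (x1 - x0) / dt
    vy = (y1 - y0) / dt + 0.5 * G * dt
    return vx, vy


def predict_position(p0, v0, t):
    """Extrapolate the ball's (x, y) position t seconds after it was at p0."""
    x0, y0 = p0
    vx, vy = v0
    return x0 + vx * t, y0 + vy * t - 0.5 * G * t * t


# Hypothetical throw: two observations early in the flight, 0.1 s apart.
first_seen = (0.0, 1.5)
true_velocity = (5.0, 4.0)
second_seen = predict_position(first_seen, true_velocity, 0.1)

# "Close your eyes": recover the velocity from the two observations only,
# then predict where the ball will be 0.6 s after the first observation.
estimated_velocity = estimate_velocity(first_seen, second_seen, 0.1)
predicted = predict_position(first_seen, estimated_velocity, 0.6)
actual = predict_position(first_seen, true_velocity, 0.6)
print("predicted:", predicted, "actual:", actual)
```

With noise-free observations the extrapolated position matches the true one. A real system would have to cope with measurement noise, drag, and unknown object properties, which is where inferring the physics becomes the hard part.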
And this is yet another domain where humans excel. I have three small kids, and children are in many ways an inspiration. Especially when it comes to visual processing, it's fascinating to see how they evolve from a very early age, when they constantly watch everything around them with immense fascination. It's very different in vision algorithms: you have an algorithm that classifies airplanes, you give it an image, and it says airplane / no airplane, car, dog, cat, etc. But what the human visual system does is different: when I see my kids looking into the world, it's not like they just classify objects. No, they're trying to predict what happens next. This capability is obviously vital: the better I can predict what the lion in front of me is going to do, the longer I will live. So, clearly, the capability to predict the future is a very important survival skill, and this is yet another topic that I think is to some extent underexplored. Can we train these networks to predict the future? So, there are a lot of fascinating challenges still to be explored, and for me, one of the inspirations is indeed the human, in the sense of not necessarily reproducing faithfully the neurons in the brain or anything like that, but reproducing the capabilities that the human system seems to have.
AP: That's very interesting! When we speak about the emerging industries and what AI made possible, my feeling is that robotics is not where it should be or it could be in terms of its impact on human life. And if we had this understanding of the physics of the world, robotics would emerge way more than it has.
DC: Yes, robotics is an area that computer vision is very close to, especially when it comes to reproducing humans, and it is also an area that I've always been fascinated with: how you can leverage perception to improve control and action. We've worked a lot on autonomous drones, doing obstacle avoidance, etc.; it's also a great testbed for real-time computer vision. So indeed, the human is always an inspiration. From day one it has been an inspiration for computer vision, and even if in certain challenges like object recognition we have by now surpassed the human capability of recognizing and categorizing objects, I think the human is still an inspiration on many levels. For example, one major difference is that when humans process visual data, they use a lot of knowledge about the world, and how to bring all that knowledge into the machine is a challenge. We work a lot on autonomous driving, on how we can get cars to drive autonomously. It starts with reconstructing the 3D world in front of the car, and then autonomous maneuvering, etc. One of the key differences between a machine and a human is that when I drive down the street and a ball comes rolling into the driving corridor, I'm going to expect a child to come running after it, because I have world knowledge about what typically happens in the world and what is likely to happen next. This is one of the things we need to reproduce in machines: the capability to predict what happens next. For driving assistance this is vital: the further I can predict the world into the future, the better I can avoid accidents and save lives.
AP: It seems that we have quite a lot of math and algorithmic studies to do in artificial intelligence yet. So, my next question is going to be about the research and how to keep up. Starting from the artificial intelligence boom, there were so many papers, so many different pieces of research, and books, and new startups that it became really hard to follow what's really happening in the industry. Do you have some advice on how to do that?
DC: This is very tricky. I see that with the growth of the community, with thousands of people flooding into conferences, the number of publications surfacing per year is growing dramatically, to a level where it's hard to keep track of everything. Another thing I notice is that the community is also changing with social media. For example, I don't have a Twitter account, but all my colleagues tell me that I should get one, in particular so that I can tweet every paper we write out into the world. But I'm still a little bit old-fashioned, you could call it, and I am reluctant to believe that to be successful as a researcher you have to shout louder, and that whoever shouts the loudest over the most social network channels gets the most attention. To some extent, this is something I see today, and it disconcerts me a little bit. My own Ph.D. advisor told me that once you put the paper on your webpage, people will find it. I'm not sure this is true any longer: if you don't put it on Facebook, Twitter, and whatever, people may not take note of it. So, in that sense, the world has changed a little bit, and it's a matter of adapting to the new formats of communication: blogging, tweeting, all of these forms. But it still disconcerts me, because the question for me is whether I should spend my time on research or on social media. Maybe it's better to write one paper and spend the rest of the time on social media rather than writing three papers and not having time for Twitter, a blog, and Facebook. I don't know. So, this is tricky. But overall, I'm happy that there is so much interest in this area, so much activity, and seeing the area grow so dramatically, especially over the last 20 years, has been quite a fascinating ride!
AP: Totally! My one last question is the following. What would you recommend to our audience in terms of reading about artificial intelligence? Maybe you have some sources, some newsletters, some blogs, papers, or maybe your lab has some handle that we could follow. Would you recommend something to our followers?
DC: It's a question that I have frequently gotten over the last 15 years of being a professor in computer vision. Generations of students have asked me what I recommend to get into this field, and it's tricky. Admittedly, there are some good books on computer vision, but since the community is moving so fast, any book you buy is invariably outdated by the time it's on your table. Books are useful to get the background, but they are often a little bit more about, let's say, the more classical approaches. But things have gotten a lot better: nowadays you can find a lot of tutorials and lectures online. For example, all my lectures are available on YouTube, and I know that they are extremely popular; people just go to them, click, and watch them. This is true for many lectures nowadays. I think a good way to get into this whole field is through tutorials and lectures that you can find publicly, either on YouTube or on some of the massive online course sites. This works, and a lot of it is accessible. As you said, publications have been online for as long as I can remember, so it's great that with the Internet we have all of this knowledge at our fingertips. Wikipedia often gives good introductions to certain topics. So I think it's all there and accessible, and you don't even need to go to a library to get a solid computer vision and machine learning background.
AP: Thank you! This was my last question for today, thank you again so much for participating in our podcast. It was really insightful! I hope that maybe a year or two later we will be able to meet again on our platform.
DC: I hope so, and I certainly hope that at some point corona is over so that we can all meet physically again, because the chance to meet in person is something the vision community is missing nowadays. Ultimately, despite the growth, the vision community is still a wonderful family!