In this episode, Gary Bradski, the founder of OpenCV, interviews Pietro Perona, Allan E. Puckett Professor of Electrical Engineering and Computation and Neural Systems at Caltech. Pietro Perona is known as a pioneer of computer vision. Currently, he is particularly interested in the visual analysis of behavior. Also, Pietro and Serge Belongie founded Visipedia – a network of people and machines that is designed to harvest and organize visual information and make it accessible to everyone.
You can find the video and audio versions on YouTube, Spotify, SoundCloud, Google Podcasts, Apple Podcasts, and Stitcher.
The video was recorded in October 2020.
Gary Bradski (GB): I’m Gary Bradski. I'm interviewing Pietro Perona; he's a long-time professor at Caltech. He did his Ph.D. at the University of California, Berkeley, followed with postdocs at Berkeley and MIT, and became a professor at Caltech at a young age. He is a foundational figure in computer vision: he won an NSF Young Investigator Award, the Longuet-Higgins Prize in 2013, and the Koenderink Prize for fundamental contributions to computer vision. I first met Pietro when I was getting interested in dynamic processes and anisotropic diffusion, which became a well-known algorithm. He's done work in categorization; he started Visipedia, and Caltech 101 and then Caltech 256, which became foundational image databases. So, welcome Pietro!
I guess we can start with your career – I’ll go through some of that – and then we'll get on to your work, your current research, and speculations about the vision system and how the brain works; and we can go into philosophy, etc. So, you started your Ph.D. at Berkeley – was that in vision? How did you become attracted to the field?
Pietro Perona (PP): It was serendipitous. I came to Berkeley thinking I would work in controls; I liked math, and as an undergrad I had been attracted to controls, where you can see the power of mathematics to describe things that happen in the world. During my first year, my advisor gave me a paper to read in which people were using tools belonging mostly to the theory of Markov Random Fields to analyze images, and it struck me that you could take something as messy as an image and think that you could build a software program to analyze it and understand what is there. That was, by the way, the thesis of Jose Luis Marroquin, who was co-advised by Sanjoy K. Mitter and Tomaso Poggio at MIT. For me, it was a revelation to see how fairly simple formalisms could make sense of fuzzy complicated information that is there in images, and so I got hooked on the computer vision question.
It was a revelation to see how fairly simple formalisms could make sense of fuzzy complicated information that is there in images, and so I got hooked on the computer vision question.
Soon I left that particular method behind – which, by the way, was very nice; Markov random fields have had a lot of good uses in vision – but I stayed hooked on the question of how we can build a system that can analyze images and extract meaning from them. I must say I was always very attracted to images: my mother was very fond of art, and she took me through all possible museums in Europe. I enjoyed it – well, first of all, I suffered through what felt like endless marathons through the Louvre, Uffizi, and so on; nevertheless, I became very interested in art, and so the possibility of working on images was very appealing to me. I’m attracted to images, and I thought that putting mathematics and images together was a terrific idea. Since then I've been very fond of this field.
GB: And then you went on to postdocs at Berkeley, and then MIT. What were you working on then?
PP: At Berkeley, first of all, in my Ph.D. I worked on image segmentation using diffusion equations, so that was another little discovery: with simple PDEs you can do so much to understand an image, and so again it was one of those moments of joy in which you see something simple doing something that appears to be complex. Towards the end of my Ph.D. I started becoming interested in using mechanisms that you see in the visual system, namely Gabor functions, to analyze the content of images. The way to do vision is to first extract simple descriptors, then aggregate those descriptors into more complex descriptors, and finally arrive at the meaning hidden in images – at the time, that felt like the path to visual recognition and categorization.
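The diffusion idea Perona mentions can be illustrated in a few lines of NumPy. This is a minimal sketch of Perona-Malik-style edge-preserving diffusion – illustrative only, not the exact formulation from his thesis:

```python
import numpy as np

def perona_malik(img, n_iter=50, kappa=20.0, dt=0.2):
    """Edge-preserving smoothing: diffuse strongly in flat regions,
    weakly across large gradients (candidate edges)."""
    img = img.astype(float).copy()
    for _ in range(n_iter):
        # Differences toward the four neighbors.
        # Note: np.roll wraps around at the borders; fine for a sketch.
        dn = np.roll(img, 1, axis=0) - img
        ds = np.roll(img, -1, axis=0) - img
        de = np.roll(img, -1, axis=1) - img
        dw = np.roll(img, 1, axis=1) - img
        # Conductance shrinks toward zero near strong gradients,
        # so edges are preserved while noise is averaged away.
        g = lambda d: np.exp(-(d / kappa) ** 2)
        img += dt * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return img
```

Run on a noisy image, this smooths flat regions while leaving sharp intensity steps largely intact, which is why it was attractive for segmentation.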
And so, I started studying the properties of early visual filters and got interested in how physiologists could measure those properties in biological systems, typically in the cortex of a cat or a monkey. When I was at MIT, I was studying how to steer these filters – something I got interested in through Bill Freeman, and Eero Simoncelli was at MIT working on the same topic: how to steer these filters in a space of scales, orientations, parities (odd and even filters), frequency tunings, and so on. That was useful, for example, for finding boundaries in images.
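The steerability idea – synthesizing a filter at any orientation from a small fixed basis rather than storing filters at every angle – can be shown with first derivatives of a Gaussian, the simplest steerable pair (a minimal sketch; the Freeman-Adelson construction extends to higher-order filters):

```python
import numpy as np

def gaussian_derivative_basis(size=9, sigma=1.5):
    """Basis pair (G_x, G_y): first derivatives of a 2-D Gaussian."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    gx = -xx / sigma**2 * g  # horizontal derivative
    gy = -yy / sigma**2 * g  # vertical derivative
    return gx, gy

def steer(gx, gy, theta):
    """The derivative filter at orientation theta is exactly a
    linear combination of the two basis filters."""
    return np.cos(theta) * gx + np.sin(theta) * gy
```

Because steering is linear, you can filter an image once with each basis kernel and then combine the two response maps with the same cos/sin weights to get the response at any orientation.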
Then I came to Caltech and decided that I should not continue with what I had done before, because I risked putting all of my students into the local minimum of whatever I had done for my Ph.D., and so I decided to branch out. I realized that vision was far from being solved, so it was better to lunge out in a million different directions and put each student on a different problem to see what stuck and what came out of it. So, my first few years at Caltech were a period of exploration: I don't know if I did anything particularly useful, but at least I trained myself in many different areas of vision, and I got unstuck from my initial condition. Towards the end of my first decade at Caltech, I realized that visual categorization was really interesting and nobody was interested in it. But I was, and so I decided to study it, and that's why we collected the data sets of Caltech 101 and Caltech 256.
GB: That started with Fei-Fei Li and Rob Fergus, right? I forgot to mention that one of your big contributions is so many fantastic students who have gone everywhere: Silvio Savarese at Stanford, Stefano Soatto at UCLA…
PP: Yes, also Jean-Yves Bouguet, Max Welling, who is now in Amsterdam, and so on… I mean, that's not my contribution, it's their contribution. It's just lucky that if you're in a good university, amazing people come along, you get to work with them, and you benefit from their ideas. As everyone who is a professor knows, the key benefit of being in a good university is young people who are smarter than you: they challenge you all the time and ask you to justify why you believe what you believe. If you believe something that is not true, you discover it pretty quickly, and so they help you debug ideas, challenge you to get unstuck from silly things you believe in, and push you to look at the real questions.
PP: At the end of the 1990s, I think maybe the big realization was that people working in visual recognition were working mostly on individual objects – trying to recognize the logo of Coca-Cola, the box of Rice Krispies, some type of aircraft, and so on. The interesting question was visual categorization: how can you tell apart frogs, and cell phones, and people, and dogs… At the time we were very engineering-oriented and analysis-oriented, but it was clear that for visual categorization one would have to move on to learning; there was no other way to go about it. Saying it now feels obvious, but if you look at the proceedings of conferences at that time, 90 percent was geometry, and the remaining ten percent was recognition of individual objects. People were working on faces, handcrafting detectors for eyes, mouth, ears, and so on. People were handcrafting components and hoping to put them together to analyze images, but it was clear that wouldn't scale. There was an estimate at the time by Irving Biederman that people recognize about 3,000 categories really well – the entry-level categories – and about 30,000 categories in general, and you can see how that would be the case: all the Chinese characters that people can recognize come to about 10,000–20,000, all the faces of people you know come to a few thousand, and so on. So, it's in the thousands, and if you sum up all of these thousands, you get tens of thousands. It would simply not scale to handcraft detectors. Handcrafting makes sense for faces and maybe automobiles, because they're so special and so valuable, but if you have to move on to birds, dogs, cats, mushrooms, etc., how do you do it?
A professor knows that the key benefit of being in a good university is young people who are smarter than you.
So, what we thought was to start collecting data: the Internet was available and you could search for images, so the idea was to start downloading images and build image data sets. I thought that if somebody wants to claim they're working on visual categorization, they should be able to recognize mushrooms and dogs and trees and so on – with the same piece of software, without changing a stitch in the parameters. This sounds obvious today, but it was not at the time, and so that's why we collected first Caltech 4, then Caltech 7. I remember telling Fei-Fei one day that we were probably still overfitting: we were doing automobiles, spotted cats, airplanes, faces… We decided to have more, and she suggested 15. Back then it was a big piece of work to download and sort the images, and I said it should be no more than 15. Then she came to me and suggested 20, and that somehow ticked me off – I thought she was being lazy – so I suggested doing 100. She went off, and two months later she came back with 101; she exceeded expectations. What we did was take a visual dictionary to select categories that have a visual correlate – you couldn't pick a category like the economy, which you cannot depict, but you could pick mushrooms and so on. We went through it, saw which categories had enough representation on the web, downloaded the images, and then hired Caltech students to sort them.
So that's how we collected Caltech 101. That was quite good, and we put it out. As I have said, papers in conferences were mostly about geometry, and within two or three years you could see a whole sea change, with people starting to work on visual categorization. I tell my students now that our initial approach to Caltech 101 was so bad that it was only 18 percent correct – still better than the one percent you'd get by random choice, but not great. We simultaneously put out a data set that people could use for training and testing, and a paper that made people feel they could do much better than us; that probably got people more interested.
GB: It's always good to write a paper that can be exceeded, because you generate a million citations! It's a very good trick for becoming a professor. I think it was foundational in spurring the learning-based approach, which simply overwhelmed any handcrafting – that was an important event.
GB: So, I think you went on and did Visipedia.
PP: I’m happy to talk about it if you want. I can tell you how it was born. Visipedia is a project started by Serge Belongie and me. Serge came on sabbatical to Caltech at the end of 2008, and simultaneously, or almost, I went on sabbatical in Italy. In the brief interlude in which we were both at Caltech, we started talking about the next step in visual categorization. Something clear to us was that just downloading pictures from the Internet doesn't cut it: it's not reality, it's just computer vision people playing. If we wanted to know what the real questions were, we should go and talk to people who need to do visual categorization on a large scale and see what kinds of images they have to deal with and what challenges they face. We thought we had to find a community of people who would engage with us and who have interesting images to classify. We went through 10 or 20 different hypotheses, and underlying all of this was the question of the flow of information between experts and machines: we wanted to understand not only how machines can do visual categorization, but how machines can learn from people. That's why we wanted to find a community that would engage with us, and, in the end, we chose birds and bird watching.
The reason was that, first of all, the pictures are pretty; second, there are tons of them; third, bird watchers go out and buy expensive cameras to take these pictures, and they upload them to servers in order to exchange them, so they must be technically sophisticated and ready to interact with any web page we might put up. We came in touch with the Lab of Ornithology at Cornell. They are a bunch of really good people, and so we started working with them. It was great timing, because they were just about to deploy a service through which birders could upload their pictures to the Cornell site to get help classifying the birds. They were getting maybe 30,000 pictures a week, and soon it became 50,000, 60,000… At the time Grant Van Horn was starting his Ph.D., and he came to Caltech for it. Grant was very hands-on, and he helped us build an iPad tool and a website that would allow people to recognize these birds automatically. So, we could work with Cornell on this – there are 10,000 species of birds – and that was fantastic. We had access to experts willing to label the images, so we could measure how many mistakes experts make, measure on Amazon Mechanical Turk how many mistakes non-experts make, and ask how a machine knows who is an expert and who is not, and how the machine measures how much to trust a given person or a given label.
There were maybe two or three main realizations that came out of this project. One was that the paradigm in which machine learning people and computer vision people work is not correct. We think of the machine as a tool like a plow or a tractor: there is an oracle labeling the images, we have to feed these precisely labeled images into the machine, and the trick is figuring out the right learning algorithm to use. In fact, the real situation is the opposite: the machine is the agent, and nobody is an oracle, because the knowledge is not owned by any single person. Knowledge is typically owned by communities – there is no single birder who knows all the birds, no single pathologist who knows all the types of cancer, and so on. Knowledge is in a community, and the machine has to talk to and socialize with a community of experts; it has to use images as a means, a vehicle, to obtain that knowledge from the experts. It has to figure out who knows what, and it has to develop trust in people. There is no oracle anywhere; there are going to be a lot of mistakes, and yet the machine has to be able to socially interact with people to build up its own knowledge and know how much to trust it. That's the question.
The machine has to be able to socially interact with people to build up its own knowledge and know how much to trust it. That's the question.
GB: Many categories are actually a joint community decision, because when you begin to analyze a category, there's no God-given fact behind it.
PP: That’s right. For birds or animals, we know from genetics that there is a scientific basis for categories, and that's the species; but in many other domains it's just a folksonomy – for example, fashion: types of clothing you may wear, what look you have today, and so on. The names for those things come from people, and so you have to be able to work with folksonomies too, getting information from a completely arbitrary crowd and reconciling different ideas. So, that was one lesson.
The second one is that the world is not a uniform distribution like Caltech 101 or ImageNet, where you have a thousand pictures for each of a thousand categories. The world is a long-tail distribution. Going back to birds – take the bald eagle: you have hundreds of thousands of pictures in the Lab of Ornithology's archives, so you can learn it really well. However, they have some birds, like the pileated woodpecker, for which there is no real photograph of the animal in the wild. So, you have this long-tail distribution, and the machine has to learn common principles from the categories where you have a lot of information, and then become good at recognizing categories for which you have very few training examples. People are really good at that: you can show them three or four pictures of a new species they've never seen before, and they can get going pretty well from that. As you know, machines still need thousands of training examples to get expert-level performance, and so that's another lesson.
GB: This goes into zero-shot or one-shot learning. I remember reading about a mountain biker who was attacked by a mountain lion, and I wondered how the lion even knew where the head was, what this thing was – so there's something very universal about learning: it can generalize to something that never existed in the natural world.
PP: That's right! By now everybody has realized that one of the main problems in visual categorization and machine learning is how you learn from a few examples – in the limit, from n going to one or even zero – and we are still struggling with that. But it's definitely something people are making progress on right now, with self-supervision and some unsupervised techniques that have come up recently. That was another lesson we got out of that collaboration. And the third one, maybe, is how to interact with large communities of annotators or experts: how to infer their abilities and how to use those to get a level of confidence in the deductions the machine may make, namely, what label to assign. Here I should mention a collaboration with the California Academy of Sciences – they have an app called iNaturalist which anybody can download to their cell phone. That's another piece of work that Grant Van Horn did: he embedded himself with them and built both the computer vision system behind their recognition – it handles at the moment about 50,000 or 60,000 species of plants and animals – and the statistical backend that estimates, for each person who interacts with the system, what they are good at (butterflies, lizards, or nothing) and how to combine different people's opinions into a final verdict. Once the verdict is safe enough, the image can become part of the collection from which the machine learns. Somehow it led us to what you might think of as the first online system of this kind that people use for real: the machine is interacting with the crowd, the crowd keeps labeling images, and once the machine reaches sufficient confidence that the labels are correct, it incorporates those images and labels into its training set, and it retrains itself.
So, here you have maybe the first example of a large-scale system that interacts with the crowd and trains itself – it learns by itself.
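The loop Perona describes – inferring consensus labels while simultaneously estimating how reliable each annotator is – can be sketched as a simple alternating scheme. This is a toy stand-in for Dawid-Skene-style models; the real iNaturalist backend is more sophisticated:

```python
from collections import defaultdict

def weighted_consensus(labels, n_rounds=10):
    """labels: dict item -> list of (annotator, label) votes.
    Alternate between (1) a reliability-weighted vote per item and
    (2) re-estimating each annotator's weight as the fraction of
    their votes that agree with the current consensus."""
    weight = defaultdict(lambda: 1.0)  # start by trusting everyone
    consensus = {}
    for _ in range(n_rounds):
        # Step 1: weighted vote per item.
        for item, votes in labels.items():
            tally = defaultdict(float)
            for ann, lab in votes:
                tally[lab] += weight[ann]
            consensus[item] = max(tally, key=tally.get)
        # Step 2: annotator weight = agreement with consensus.
        agree, total = defaultdict(int), defaultdict(int)
        for item, votes in labels.items():
            for ann, lab in votes:
                total[ann] += 1
                agree[ann] += (lab == consensus[item])
        for ann in total:
            weight[ann] = agree[ann] / total[ann]
    return consensus, dict(weight)
```

After a few rounds, annotators who consistently disagree with the emerging consensus are down-weighted, so the system needs no oracle: trust is inferred from the votes themselves.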
GB: That was kind of the inception of crowdsourcing. Looking back a little, people today don't really know how hard and limited it was when I did my Ph.D. I had to share a professor's camera – there weren't digital cameras at the time – and it was hard to record anything; as for image databases, Caltech's was the first one. It was unusual to have more than a dozen images or so to work with. It's hard to appreciate from today, with Google and a billion images available everywhere; it's a very different world. I think the cell phone revolutionized computer vision a lot as well.
PP: In my Ph.D. thesis, all the experiments were done on three images, in fact – two plus one, three images – quite a spectacular difference. The images were acquired the hard way, as you say: somebody in the building had a TV camera and a frame grabber they had put in a computer, and you would have to ask to acquire one image. You would go there with the book containing the picture, they would put it under a pane of glass, and the camera would capture it. Even printing images was not easy; those were the days of the first PostScript printers. I remember that Trevor Darrell at MIT had coded a way of converting images – they were not even JPEGs, they were a different format – into dots of different sizes in PostScript, so that the dots would make a region darker or brighter. I had this piece of software that Trevor had given me that would convert the images into this raster of differently sized dots in PostScript, and that's how we generated the images for our papers; that was quite a lot of work.
GB: Yes, and the field's grown up with the hardware, and now it's of course dominated by these deep learning techniques.
GB: I was wondering what your take is on deep learning: what's missing? How close is that to the visual system?
PP: That's an interesting question. First of all, deep learning has been a godsend: suddenly things started working, and we now have primitives in vision that actually work, whereas our earlier handcrafted primitives didn't work well. Of course, deep learning has been around for a long time – Yann LeCun has been at it since the 80s, and he had this great vision, which I believed in. I had been teaching deep learning in my classes since the late 90s, though nobody knew how far it would go. In some ways, deep learning was there all along, but it took GPUs and large annotated data sets to realize that it was for real and could do a lot of work for us, which we now benefit from.
What's the situation now? First of all, using deep learning we can do a lot in practice, and people are now so giddy with what they can do that they think all the problems are solved – but they are not. There are many things you cannot do, even though you can do a lot. It's still behind biological vision in the efficiency of learning, as we were saying before. If you have enough images, deep learning has shown that in many domains you can do better than human experts – in classifying skin lesions or birds, and many others – but the number of training images needed to achieve a given level of performance still favors humans: humans can do a lot with five or ten. We can pontificate on how they do it, but the point is coming up with a computational approach that will mirror that ability, and I have no doubt that we will get there slowly – in the next ten years we'll manage it. Then there are many things we cannot do. We’ve been talking about iNaturalist, which can learn from the crowd, but it's not a completely open universe: it can learn a lot of things, but, for example, it doesn't by itself separate out the flower, the species of the flower, and the bee that is on top of the flower in a given picture. It will learn the pattern, but it doesn't know whether the pattern is the bee, the flower, or the ensemble of the two. We still need to give our machines the ability to see objects, to separate foreground from background, and to deduce the body plan of animals and the structure of plants – there is a lot of structure that is missing. The learning is holistic: you can think of deep learning as a way to do very smart pattern matching, where you can interpolate between patterns you've seen before, but you cannot extrapolate yet.
The human visual system infers a goal for the action, and this idea of inferring a goal and understanding the cause and the effect in the video of what happened and why so, that's still beyond the reach of the computational infrastructure we have at the moment.
GB: It's a great associational machinery. But many things, I think, are missing. This gets back to some of your early work: deep learning is an equilibrium approach – you are learning with a gradient – but the biological vision system is very dynamic. This goes back to anisotropic diffusion: you get all kinds of optical illusions, color constancy, color bleeding through, so there's a lot that's missing in the dynamics of vision. The question is whether it's a feature or a bug: is it a necessary part, or is it just a biological limitation?
PP: Right! We don't have the right computational tools yet, for example, to understand behavior, and that's something that interests me a lot. I work with colleagues at Caltech – David Anderson, Michael Dickinson, Markus Meister – looking at how, using a camera, you can decode an animal's behavior. Again, we can classify patterns if a behavior is highly reproducible and its time scale is fairly constant. For example, there is one action that flies do when they attack each other: they stand on their back feet and then lunge forward. That always takes the same amount of time, and it's easy to spot and classify using a recurrent network. But if you think of more complex behaviors – like loading the trunk of your car, or foraging, or courtship in animals – those are highly variable, and what you spot in that case is the interaction between multiple agents. Truly, the human visual system infers a goal for the action, and this idea of inferring a goal and understanding cause and effect in a video – what happened and why – is still beyond the reach of the computational infrastructure we have at the moment. We need to think very carefully about how to do it.
GB: Pure association – of course, people are moving beyond this. When you look at a lot of optical illusions, many of them derive from the fact that we're looking at something in 2D but interpreting it in 3D. But I think we also somehow form a model of the world, and the difference between seeing and perception is that seeing is the input, while perception is what we make of it: we understand an animal as a 3D object with certain forms, and I think that also contributes to the speed of learning. It's not clear to me how to represent that in an algorithm.
PP: I suspect that the human visual system has the ability to represent scenes at multiple semantic levels, if you will. So, you could perceive the keys on your desk as a pattern and just pick them up, but at some level you also understand the keys as rigid objects: how they rest on top of each other, the fact that in a different configuration they would fall down, and all of that. So indeed, you could think of an ongoing modification of your model of the world; vision is not just an input-output process but is deeply tied to the body of knowledge that you handle in your brain. Vision is there to modify this body of knowledge, or to instantiate some aspects of it. A visual system is a system that is always on, always digesting things; it is curious, it wants to see more. Look at how children or even animals work: they keep going around, looking at things, picking them up, and so on. That's another place where we are far behind.
GB: When you dream, you're seeing the model, right? When you wake up, the dreams don't end; they just connect to the external data.
PP: True. I don't need to tell this to you, because you would be an even bigger proponent than me, but the whole idea of having an embodied visual system on top of a machine that can go around, manipulate the world, interact with things, and learn – it's a very interesting question. I’m exploring this a little in the context of biology: I have an experiment going on with Markus Meister on the question of an embodied visual system. Markus is a biologist at Caltech, and we have a preparation where a mouse lives in its own box – a typical vivarium for mice – but the box has a hole, and through the hole there is a tunnel that takes the mouse into a maze; the mouse can go and explore the maze whenever it wants. We can put little rewards for the mouse in the maze – like a little fountain with water at some point – and we can see what strategies the mouse uses for exploring the maze, what it knows, what it remembers, and whether it builds up a representation or not. It's very interesting, and it's quite clear – without giving you the whole story – that the mouse is under this immense force, this zest for exploration. It seldom rests; it wants to go and explore, and once it has explored the whole maze, it will patrol the maze to see if anything new comes up. It spends more than 60 percent of its time in the maze and not in its home cage, where it gets food and water. It just wants to explore, and so it's pretty clear that we need to understand how that works as well. Vision is of course a starting point, but it's not the whole story.
GB: Right! A fascinating area. I tend to call it robot philosophy, because I think by building these models you learn some of the constraints of making a mind, and it has a lot to say about the nature of what it means to be us – an animal or a human: what you can see and what you can make out. It also gets back to these visual categories; color is an example – it's something that's in our mind, not in the world – and I could go on and on about that.
GB: I wanted to move on – earlier you mentioned art, and it always interests me: you go to these early paintings, and you can see they understood things about the vision system, maybe just intuitively. I don't know if you can comment on your thoughts about vision and how it comes out in art. I also wonder why everything humans make, for example, a car, has a face – is that just a mechanical thing, or are we reproducing the head of an animal in our technology?
PP: I've studied three things in art. Maybe every five years or so I get some idea and explore it, and typically I don't do it with students, because, first of all, students in engineering are not interested, and I don't want to make them do something that is going to be highly unpopular. Indeed, anything I've written about art and vision has not been very popular, and yet it's one of those things you're really happy you did, whether or not people pay attention. One thing was done with one of my early students, Jennifer Sun, and had to do with shape perception – shape from shading. At the time it was very unclear how shape from shading works, and we wanted to know how quickly people perceive shapes and what the effect of shading is on perceiving them. We're talking about '95 and '96. We looked at the effect of the light-source direction, and we realized that people perceive shapes best if the light comes from the top left of the space in which you are, and we related that to handedness: left-handed people prefer light from the top, a little bit to the right, and right-handed people prefer light from the top left, so it seems to be an effect of learning. That explains why all the little buttons in your GUIs are lit from the top left. We also did a survey of classical paintings in museums, and 85 percent of them were lit from the top left. If you talk to artists, they're not really aware of this; they just use the convention, and it works for them. So, it's a tiny piece of science that explains why art is made the way it is.
If you want the viewer to feel close and almost in a conversation with a portrait, then you've got to take the entire picture from far, but then you have to paste in a face that is viewed from up close
Another thing that I looked at is how portraiture works. Suppose you have a camera with a zoom lens: I can take your picture standing eight feet away and fill the frame with your face, or I can come closer, two feet away, and fill the frame with your face using a short focal length. Why do I choose one or the other? I became curious about this because I realized that there were some portraits made in the Renaissance where people were really fussy about perspective, and yet these portraits are famously incongruous in terms of perspective: the face, the shoulders, the waist, the shoes are portrayed from different points of view; and I was wondering why they would make this mistake if they knew precisely what they were doing. It turns out that if you run psychophysics experiments showing people photographs taken from up close or from far, these photographs appear the same at first glance because the face fills the frame; but in fact, they don't look the same – there are tiny systematic differences. From up close you see the face with the nose looking a little bit bigger, and you don't see the ears very much at all because they're blocked out by the cheeks. If you take a photograph of the face from far, then you see much more of the hair, much more of the ears, much more of what is around the face, and the nose becomes a bit smaller. And people respond with emotion to these faces, so if you ask them to judge the face – how smart this person looks, or how trustworthy, or how approachable – there are systematic differences between faces portrayed from up close and from far.
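The close-versus-far effect described here follows directly from pinhole projection: features at different depths are magnified in proportion to their inverse distance from the camera. A minimal sketch, with made-up head measurements (the 5 cm nose and ear offsets are purely illustrative assumptions, not numbers from the interview):

```python
# Toy pinhole-camera model: projected size scales as 1/depth, so the
# nose (closer to the camera) is magnified relative to the ears.
def nose_to_ear_magnification(camera_distance_m,
                              nose_offset_m=0.05,
                              ear_offset_m=0.05):
    d_nose = camera_distance_m - nose_offset_m   # nose is nearer
    d_ear = camera_distance_m + ear_offset_m     # ears are farther
    return d_ear / d_nose                        # relative magnification

# ~2 feet: the nose is ~18% bigger relative to the ears;
# ~8 feet: the difference shrinks to ~4%.
print(round(nose_to_ear_magnification(0.6), 3))  # 1.182
print(round(nose_to_ear_magnification(2.4), 3))  # 1.043
```

This systematic "big nose" effect at close portrait distances, and its near-disappearance at longer distances, is what the psychophysics subjects appear to be responding to.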
And it turns out that if you want a face to be approachable, and readable, and talking to you, you have to take the photograph from up close, to the point that the person you're photographing feels that the face is almost a little bit too caricatural – but that's how your friends and your family look at you. If you want the face to look more impressive and intelligent, but a bit forbidding, then you take the photo from far. So if you want to make a standing portrait (like they would do in the Renaissance), or even an equestrian portrait in which you have to show the entire horse, and the rider, and the head of the rider, and the armor, etc., and you use true perspective and image the person from far, then the face will be seen from far and will not look approachable – it will look forbidding. If you want the viewer to feel close and almost in conversation with the portrait, then you've got to take the entire picture from far, but you have to paste in a face that is viewed from up close, and that's why you see inconsistencies in perspective. So, that was another thing I realized and worked on for a while.
There is another story which is maybe interesting for now because people are spending so much time on Zoom. Suppose that you want to have a whole lot of people on Zoom – say there is a family dinner and an old aunt needs to be patched in, so you want to show everybody on Zoom. Well, it's impossible: there is no point of view from which you will be able to portray everyone, and the single-viewpoint camera, which we love and use all the time, is singularly inadequate to make somebody feel like they're there with you. Yet, painters have faced this forever. If you think of the Last Supper, it's precisely the problem of "patching in an old aunt" to a dinner, as there are a lot of people around the table. You have to make them all look close, visible, and easy to talk to, and they shouldn't block each other out. So, the painters figured it out: again, you've got to make a mosaic of individual viewpoints, and then you stencil them into a common picture. I think that Zoom and all the videoconferencing tools will have to evolve towards that idea. This is a project I did with Lihi Zelnik: we looked at how to make people feel like they are there, so that you have a single picture, and yet the picture allows you to walk through a room, or to see things from different viewpoints, and so on. We were very inspired by David Hockney's mosaics – he calls those collages "Joiners" – where he looks at the same object from different points of view, takes pictures, and pastes them together. We wondered if there is a way to do this automatically, so we built a system that could automatically take many points of view on the same object and combine them into a single photograph that makes you feel like you're there and can explore without the image changing.
To some extent, it was surprising to realize how robust your visual system is to these inconsistencies: you're perfectly willing to take in multiple points of view – maybe a face will be seen many times with different expressions – and you can use your visual system as a window, taking the picture in as a sequence in time: different moments, different aspects of the same scene, different viewpoints, and so on. Your visual system enables you to construct a faithful model of the world, and it's able to do so even if the image is not a single-viewpoint perspective, which is what we are training algorithms to see. So, that's another place where computer vision systems can improve: how to combine different views into a coherent model.
GB: I wish more students were interested in this. I've noticed that in medieval portraits, when you look at, say, an alleyway, it has its own point of view – it's not as if there was one camera; the person is another point of view. I always thought it was related to visual attention: if you were there in the scene, that's what you would observe – you would look in one direction, then you would look in another, and that would be the focal point of your attention. So it puts you in the scene more than making the portrait as if it were just a camera photo. That's a fascinating area.
GB: I wonder if we could start talking about where you see computer vision going – what's coming next, what are the current uses? Looking back, it's hard for people to realize that when I started, when I first met you, hardly anything actually worked. And now it's gotten to the point where it's a real tool: to most requests we can say, yes, we can do that. But before, I would have had no idea.
PP: Yes, just to recapitulate a few loose ends that you can clearly see now. There are fundamental questions like how computer vision systems can learn more efficiently from fewer training images – how they can become our helpers and not a ball and chain. Right now, when we want to start doing something, we've got to spend months collecting and annotating images, and we would like to get going right away. So, how can you train a computer vision system like you would train a smart child? For example: this is a dog – you see just one, and you've seen them all. That's one fundamental capability that will help us throughout.
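The "see one dog, and you've seen them all" idea is essentially one-shot learning. As an illustration only (the feature vectors below are toy hand-made numbers, not the output of any real vision model), classification from a single labeled example per class can be sketched as nearest-prototype matching:

```python
import math

# One labeled example ("prototype") per class – the single "this is a dog".
prototypes = {
    "dog": [0.9, 0.1, 0.3],
    "car": [0.1, 0.8, 0.7],
}

def classify(features):
    # Assign a new image to the class of the nearest stored example.
    return min(prototypes, key=lambda lbl: math.dist(prototypes[lbl], features))

print(classify([0.8, 0.2, 0.25]))  # dog
print(classify([0.1, 0.9, 0.6]))   # car
```

The hard part in practice is learning a feature space in which one example per class is enough; the matching step itself is trivial.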
The second one, which we've mentioned before, is cause and effect and intent in video, and also in still frames. How can we understand the interactions of objects and agents like humans, animals, and vehicles, and how do we infer intent and agency from sequences and causality? That's another big one, and I should remind people that machine learning is helpful, but machine learning as it is now is good for prediction because it's looking at correlations; it doesn't understand causation, and so it's not good for intervention. If you want to have machines that become active in the world, you've got to understand the effects of an intervention. If you see that every time people have their umbrellas out it rains, it's not true that asking people to close their umbrellas will make the rain stop. You've got to understand which way the causal arrow goes, and vision today doesn't get that from the signal; it's not able to do it.
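The umbrella point can be made concrete with a toy simulation (my own sketch, with assumed numbers): in a world where rain causes umbrellas, conditioning on umbrellas predicts rain perfectly, yet intervening on umbrellas leaves the rain rate untouched.

```python
import random

random.seed(0)

# Toy causal world: rain is the cause, umbrellas are the effect.
def sample_day(force_umbrellas_closed=False):
    rain = random.random() < 0.3
    umbrellas = rain and not force_umbrellas_closed
    return rain, umbrellas

# Observation: P(rain | umbrellas out) – the correlation is perfect.
days = [sample_day() for _ in range(10_000)]
p_rain_given_umbrellas = (
    sum(r for r, u in days if u) / max(1, sum(u for _, u in days))
)

# Intervention: forcing everyone to close their umbrellas.
forced = [sample_day(force_umbrellas_closed=True) for _ in range(10_000)]
p_rain_after_intervention = sum(r for r, _ in forced) / len(forced)

print(p_rain_given_umbrellas)      # 1.0 – great for prediction
print(p_rain_after_intervention)   # ~0.3 – the rain does not stop
```

A purely correlational model is fine for the first question and useless for the second, which is exactly the gap between prediction and intervention.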
And the third one, as we were saying before, is vision in the loop: an agent that trains itself by interacting with the world. It has curiosity, it has volition, and so, going back to causality, it is able to carry out experiments, which are really the only way to establish cause and effect. You would want to see agents that play with objects, let them roll and fall, and all of that, to teach themselves how to interpret the visual signal in order to infer cause and effect, and therefore be able to influence the way the world is going.
Another one is that vision is maybe the most informative sense we have to discover the world, and humans use vision all the time; there are entire branches of medicine that depend on vision, like radiology, pathology, ophthalmology, and dermatology. So, how do we combine the knowledge of humans, images, and algorithms in order to achieve what they call communities of knowledge – enabling people to exchange information more efficiently, to train each other, to let the truth bubble up from the data and from multiple sources of expertise? How can we put a machine at the center of a community and make it more efficient, effective, and knowledgeable, and discover what it doesn't know? Also, how can machines help people distill knowledge? Vision plays a big role there. So, these are my big ones, and I'm sure there are many more.
One thing that worried me, when things started working out, is that we would soon saturate the things we can do with vision, and it would become much more of a routine – and indeed you see a lot of routine papers in conferences right now trying different variants of a certain deep network, and that work is really useful and necessary. But I became a bit concerned that maybe vision would not be as much fun, that there would not be as many open questions. However, the more I think about it, the more open questions I see, and I think it's a field that is going to go on as a lively, interesting intellectual area for quite a while. So we have to allow the boundaries of what we want to do to become bigger, to be hungrier for more. We have to be willing to change the modality of operation: put humans in the loop, build machines that have vision in the loop, and so on. This is really interesting, and it's very difficult, because when we carry out experiments using large batch datasets, it's so simple – we run the experiment, we change a couple of parameters, we burn through a few kilowatt-hours of energy, but it's easy to do, and the results are experimentally solid. If you want to do experiments with people in the loop or machines in the loop, then every experiment is a story of its own, and it's very difficult to see what caused what, so progress will be slower, but it's worth investigating.
Another thing I've realized recently, going back to annotated datasets: what we call an experiment is really an observational study – namely, we collect pictures from the world, we annotate them in different ways, and then we see how well our algorithms perform on them. But that's not a real experiment in the sense that we can control variables and know what the effect of a given variable is on the results. And this is coming out very compellingly now that people want to know, for example, whether vision systems have biases – say, racial bias in recognition of people, or in face classification, or whatever. You see people collecting datasets, annotating them, publishing results, and saying there is bias or there is no bias, but truly we don't know, because those datasets are observational, and there are so many correlations between gender, ethnic background, and everything else. For example, there are these celebrity datasets that people use, and one simple correlation is that the female faces are much younger than the male faces – probably because females start their careers in show business earlier, while for men it takes a bit longer to develop and they remain viable for longer. Anyway, there is a big age difference; so if an algorithm works better or worse for females, we can't say whether it's because of gender or because of age. We have to learn how to carry out experiments that control the variables one at a time, in order to be able to make causal statements like "a gender change will make algorithms perform less well or better" or "a change in age will cause more or fewer errors". We need to learn how to do experiments and, of course, it's very difficult – so, that's another challenge we have in front of us.
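The confound described here can also be illustrated with a toy simulation (all numbers assumed, my own sketch): a dataset in which only age drives the error rate, but the sampled female faces are systematically younger. A naive gender comparison shows a large gap; comparing within a fixed age band makes it largely disappear.

```python
import random

random.seed(1)

def sample_face(gender):
    # The confound: female faces in this toy dataset skew younger.
    age = random.gauss(30, 5) if gender == "F" else random.gauss(45, 5)
    # Only age affects errors here: faces over 40 are harder.
    error = random.random() < (0.10 if age > 40 else 0.04)
    return gender, age, error

faces = [sample_face(g) for g in "FM" * 5000]

def error_rate(rows):
    rows = list(rows)
    return sum(e for _, _, e in rows) / len(rows)

# Naive observational comparison: looks like a "gender effect".
f_rate = error_rate(r for r in faces if r[0] == "F")
m_rate = error_rate(r for r in faces if r[0] == "M")

# Controlling for age: compare only faces aged 30 to 40.
band = [r for r in faces if 30 <= r[1] <= 40]
f_band = error_rate(r for r in band if r[0] == "F")
m_band = error_rate(r for r in band if r[0] == "M")

print(f_rate, m_rate)   # large gap, driven entirely by age
print(f_band, m_band)   # the gap largely disappears within the band
```

Real face datasets entangle many more variables than this, which is exactly why causal claims about bias call for controlled experiments rather than observational comparisons.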
GB: As for bias, it's clear that any embodied or purposeful activity has a bias toward achieving that purpose, and a lot of that is reflected in datasets. If you're taking wedding photos, you're going to center the people and so on – it'll have a bias in that way; if you're attempting to fight crime, you'll get a bias because of what you're trying to do and where you're trying to do it.
PP: That's something we have to learn: how to recognize and dissect the bias in our data sets. It will take time because it's not self-revealing.
GB: We need to somehow make it transparent what the biases are because everything is biased!
PP: Right, no question! However, the field is a lot of fun. I think that young people who enter computer vision will enjoy it a lot – if you're willing to work on problems that are not completely well formulated, you can take on the challenge of cleaning them up and asking the right questions. Working in vision forces you to come in touch with many different fields, from physiology to art, geometry, physics, and medicine, and there are all the fields of application, so it's lots of fun to get your hands dirty and to help other disciplines be successful. The call for new bright people to come in is open, and I think it's a good investment of a lifetime to work in this field, and I hope that people will have as much fun as I've had so far.
GB: Way back, I used to feel guilty – I was attracting people into vision with my research and library, and I thought none of these people would ever have a job because nothing worked. But it turned out just the opposite: none of those people has ever been unemployed.
PP: That's right, luckily!
GB: This has been great talking with you, it's very interesting!
PP: Thanks for having me talk to you, this was fun!