In this episode, our host Anna Petrovicheva, CTO of OpenCV.AI, is talking to Dmitry Petrov who is the co-founder & CEO of a California-based high-tech startup named Iterative.ai. They created DVC (Data Version Control) which is one of the most popular tools to organize ML research and production.
Anna Petrovicheva (AP): Hi, my name is Anna Petrovicheva, and today I’m talking to Dmitry Petrov, who is the co-founder and CEO of a California-based high-tech startup named Iterative.ai. They created DVC, or Data Version Control, which is one of the most popular tools for organizing machine learning research and production. DVC tracks your experiments, manages your data versioning, and helps you do research in a very predictable and organized way.
AP: Welcome to our AI for Entrepreneurs podcast, I’m happy to have you here. My first question is about the tool you develop, the great piece of software called DVC (Data Version Control). We share a common love for open-source solutions: I’m an OpenCV developer, and you develop DVC and other great open-source tools. Could you please tell me what, in your opinion, is the power of open source, the main thing that drives people to build open-source rather than closed-source solutions?
Dmitry Petrov (DP): First of all, thank you for inviting me to your great podcast! Let's talk about open source: why is it important? I think there are a few components. I believe the biggest component of open source is community and user feedback: it's great to work on open source because you have a very quick feedback loop from users to developers, and you also have a variety of users and a variety of clients. There is a huge difference between having 100 clients in the first half-year and having one or two clients when you work in a regular business environment using regular sales strategies. I believe this is a big power. There are many more components of open source, and we could discuss them separately, but the community is definitely the major one.
AP: Comparing open-source and closed-source solutions in terms of reliability, which kind of software is typically more reliable?
DP: I think reliability is quite a different topic; I don't see a clear connection between open source and reliability. In open source, at least as it works today, you still have a core team, you still have core contributors, and they manage the quality of the software. They can prioritize reliability or prioritize feature development, and this is the balance that those people choose. So, I don't think there's a big difference between open source and closed source here.
AP: Got it, thanks! Could you please tell us just a bit about DVC, and why it's important, and why it's open source?
DP: DVC is Data Version Control, a tool that was created to manage data and manage experiments, and a big component of DVC is the connection between data and code. This is why it's DVC: it versions data, meaning all the data artifacts that you use. It might be gigabytes of text data, it might be millions of gigabytes of images, and so on, and it helps you keep track of your code together with your data and version all these components during the experimentation phase. This is the major functionality of DVC.
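To make this connection between code and data concrete, here is a minimal sketch of a DVC pipeline definition. The stage names, scripts, and file paths are hypothetical, but the structure follows DVC's `dvc.yaml` format: each stage declares a command, its dependencies (code and data), and its outputs, so large artifacts are tracked by DVC while Git versions only the small metadata files.

```yaml
# dvc.yaml -- a hypothetical two-stage pipeline; file names are illustrative
stages:
  prepare:
    cmd: python prepare.py data/raw data/prepared
    deps:
      - prepare.py
      - data/raw          # large dataset tracked by DVC, not stored in Git
    outs:
      - data/prepared
  train:
    cmd: python train.py data/prepared model.pkl
    deps:
      - train.py
      - data/prepared
    outs:
      - model.pkl         # model artifact versioned alongside the code
```

With a pipeline like this, `dvc repro` re-runs only the stages whose dependencies changed, and `dvc push` uploads the tracked artifacts to remote storage; Git then versions the small lock and metadata files, keeping every experiment reproducible.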
AP: How did you come up with the idea?
DP: I was a data scientist at Microsoft, and I saw how engineering teams work on ML projects. In a business environment you have a lot of pressure compared to academia, and this pressure pushes teams to use best practices. What I realized is that best practices on ML projects and data projects are very different from those on software engineering projects, and Microsoft built AI tools specifically for this. I was thinking: all right, Microsoft has the tools, an AI platform, and some large companies also have internal AI platforms, but what about the rest of the world? When I started thinking about this, I came to the conclusion that such a solution eventually needs to be open source. Such platforms need to be open, they need to follow a common-sense set of conventions, and when I started thinking about the tools themselves, I realized that any such tool needs a data versioning or data management component under the hood. And this is how DVC was created.
AP: That's an interesting path to creating an open-source tool, coming from an industrial, closed-source background. Could you please tell me how DVC has been growing in terms of the open-source community? There is a core team that develops the main tool, but I’m sure there are a lot of people around the world who contribute functionality to DVC. What do you think is the ratio between the new functionality created by the core team and the functionality created by people around the world?
DP: For DVC, it's more or less the same as for any open-source product today: about 90-95 percent of the functionality is created by the core team, and this is a common trend. The core team develops the majority of the source code and the majority of the functionality, and this is how it works today. People do contribute code: we have a lot of contributors, and each month around ten new contributors bring some code to the tool, but usually these are small fixes, sometimes improvements. Still, the majority of the work is done by the core team.
AP: Got it! What do you think is the ideal collaboration between open-source software and business? Eventually any business has to make money, right? What is the best way to balance doing something for free, for everybody, with making money on it, compared to how it is done now?
DP: So, this is a good question. If you're asking how it should be, I don't know how it should be. What do we need at the end of the day? We need people who work on open source on a daily basis. If you don't have those people, it's really hard to maintain a substantial piece of software, and we are talking about a system that will be part of industrial solutions and industrial teams. So, you need those core contributors, and they need to be paid somehow. Today the industry has converged on the idea that developers build two parts of a project: one part is open source that everyone can take and use, and the second is more enterprise-specific. The general rule is as follows: if some functionality is needed by an individual or a team of three to five people, it should be free; if something is needed by a company and involves complexity around the company structure, then it's about enterprises, and it's fine to provide additional services or additional software specifically for companies. This is when open-source teams charge enterprises. It's about the complexity of the problem you are solving, the organizational complexity, to be exact. This is the usual separation, and in most cases all the security and management features go to the enterprise part, and the rest stays free for everyone.
AP: Specifically for DVC, when we speak about enterprise clients, these are big companies that require complex data management and experiment management systems, right?
DP: Yes, we do have large clients, and our focus is mainly on large clients, but we also have relatively small and mid-sized clients. It's not necessarily an enterprise with thousands of people and many teams of data scientists; in some cases it's a relatively small company, maybe 100-200 people with one team of data scientists, and they still need automation around their work.
AP: It seems that more and more people and companies now understand the importance of MLOps: managing data and managing experiments in a very reliable way. What do you think is the main power of having these reproducible processes in AI and machine learning model development? Why is it important?
DP: It's important for a variety of reasons. One of them is that you need reliable processes around your applications; it's especially important when machine learning and AI components serve online applications. Clear discipline around your ML models and datasets is vital: we should know what to do if something fails, how to revert to a previous version of the model, how to retrain on a new dataset without asking data scientists to step in and spend another two weeks on it. These are the questions people need to answer. The second component here is knowledge sharing, because in machine learning, just as in software engineering, you need tools that allow people to share ideas. It's especially important for companies, as it helps a lot in the long run. So, on the one hand, it's about reliability, about what is happening today and how to solve productization issues; on the other hand, it's about knowledge sharing and long-term effects.
AP: My next question is going to be about comparison: how does academia compare to the industry in terms of these development processes? What do you think is the main difference between them, and where are they similar?
DP: I worked in academia more than ten years ago, and then I moved to Microsoft, where I saw how ML teams in large companies work on ML models in the industry. I think the biggest difference is in how much pressure you have from the business, first of all in terms of the timeline for developing your models. In the industry the pressure is much greater than in academia: you have tighter timelines to deliver a result, you need a process around that result, you're responsible for it, and you should be able to answer the productization questions. And this forces the industry to use all the best practices. In academia the priorities are very different: innovation and sharing ideas are in the limelight, and the pressure from timelines or productization is not as significant. What I have seen is that academia is not very motivated to use those best practices: for academia, it's nice to have reproducible results and reproducible papers, but it's not something a team will spend much time on. And in fact, it takes a lot of time to make those results reproducible, and especially operationalizable, so that you can use them in a real environment; it increases development cost, sometimes five times, sometimes ten times, and to my mind that's not a price academia is ready to pay.
AP: Got it. A recent trend I saw in academia is that from around 2014 all the way to 2017 everything was exponential, everything was growing so fast, and there were these incredible stories about CVPR registration selling out in seconds. Now I see that artificial intelligence is the new normal, so there is no exponential growth anymore, as far as I can tell; the processes are becoming more stable and more mature. Do you have the same view of the industry as I do? If so, what do you think it says about the development of the industry in general?
DP: I think part of this was just a kind of fashion, and this is why people were so eager to join certain conferences. Today the fashion component is definitely smaller, and that's good in some sense, but I still see that this topic is the hottest one in academia, and it is the hottest one in industry. Last year it might just not have been as crazy as in previous years; maybe that can be partially explained by fashion, or it might be due to Covid. I don't know what the global trend is, but yes, today the interest has changed.
AP: Do you think that the field becomes more mature in terms of the processes everybody applies?
DP: Yes, I can say that. I’ve just said that academia is not ready to pay the high price of reproducibility, but in fact they are interested in this topic, and they understand that without it academia won’t survive in the long run. They're doing a lot of work in this sense. I’ve had a few discussions with people who organize large conferences, including NeurIPS, and they're trying to get to a world where every paper presents not just ideas but also code and datasets, so that everyone can try it and make sure it actually works, not only in a very specific environment with very specific settings but more reliably. So, academia is very interested in this subject. My general feeling is that it will take time, probably another five, ten, or fifteen years to get there.
AP: It's a slow process, right.
DP: Yes, in academia, as I said, the pressure is not as big as in the industry, and that is why this process is spread over many years.
AP: So, what do you think about the general future of artificial intelligence, machine learning, say, for the next ten years? What do you think will be the next biggest thing there?
DP: Well, it's a good question! The maturity of the area should be and will be the biggest component; this is what I already see. We are going to be more rigorous about results, about new ideas and knowledge, and we should have better tools and practices for getting to a result and for evaluating it. My big belief is that we need to learn how to reuse the results of others in a very easy way; it can bring the industry, and not only industry but science as well, to the next level. Today we share only ideas, and on the way from an idea to an actual system you sometimes need to spend a week, or more often a month, to reproduce the ideas and make sure that the ideas from a paper work in your setting. Imagine you could do this in one hour, or one day: it would change the dynamics a lot! I believe this is the next level: instead of sharing ideas, we should share more meaningful results, which you can check quickly, and build more complex solutions on top of them. Ideas alone are not enough; we need to take it to the next level.
AP: Cool! My last question is going to be about skills: what skills do you think will be needed in the next 10 years to develop solutions in artificial intelligence, machine learning, or open source?
DP: It is a very interesting question. First of all, I believe that people who work on machine learning and artificial intelligence tools have a somewhat different mindset, and we need more people with this kind of mindset (maybe not new, but at least a little bit different), people who think in a more quantitative way rather than a structured way; I see this as a huge difference between various professionals. I think this is very important, on the one hand. On the other hand, those people need to learn best practices and adopt a set of tools from engineering, first of all software engineering, because this is the closest area; and from engineering in general they need to adopt a framework, a more predictable workflow, because predictability is an important part of this process. You have tons of risks in AI projects, and if we can reduce those risks, it becomes much more interesting for business.
AP: That's for sure! Thank you so much, it was a very interesting discussion, and I hope that in maybe one year or so we will have this discussion again and see what has happened in the industry.
DP: It would be great to have a discussion and see the progress!
AP: See you then, and good luck with the tool that you are developing!