Interview with David Cournapeau, Head of the MLE Team: Working in the open source space


David Cournapeau joined Cogent Labs in August 2017.
As head of the Machine Learning Engineering Team, he plays a pivotal role at the interface between cutting-edge research and product development. Beyond his work at Cogent Labs, David is also well known in the open source community as the original author of scikit-learn, and a major contributor to NumPy and SciPy.

We sat down with David to learn more about his experiences and insight as a scientist, an engineer, a manager and an active member of the open source community.

This interview series will feature four installments spanning three main topics: David’s career to date, his thoughts on developing products from research, and his experience working at Cogent Labs. David’s career will be covered in two installments, with the first focusing on his academic and professional career, and the second addressing his work in the open source space.


Topic 1 (continued): Career – Working in the open source space

Q: Many people know you as the original author of scikit-learn. Could you share the story behind its creation?

DC: First I should explain that even though I created scikit-learn, other people took it and made it what it is today. I was not very involved in that process. It was other people who made it so successful.

At the beginning, I was still a PhD student at Kyoto University. I was already somewhat involved in open source software for science because I was using machine learning for my own studies. There was not much machine learning software available in Python, so I decided that I would just create something for myself. That is often how it starts. You just create something to scratch your own itch, as they say. I needed it so I created it.

Google had this program called the Summer of Code, where they would pay a student, along with a mentor who made sure that the student's work had some value for existing open source projects. That is how it started. I applied successfully for a position at the Summer of Code in 2008 and I created scikit-learn. It was not called scikit-learn actually. I just called it learn. Anyway, I started to bring together a few different pieces of code I had from my PhD and packaged them in a way that was more useful.

Funnily enough, I actually contributed much more to other open source projects that sit underneath scikit-learn, namely NumPy and SciPy. I no longer have the time to work on them, but in terms of engineering effort I spent much more time on those projects than on scikit-learn. Scikit-learn is relatively famous, but many people are still surprised to learn that it was created in Japan during my PhD, especially Japanese people. I think they like the idea that it was created here!


Q: Why do you think scikit-learn has enjoyed so much success and is so widely used?

DC: One thing I did that I think contributed to its success later on, when more people started to contribute their own algorithms, was that I tried to make sure all the algorithms had a very similar interface, so you could easily switch between them. That is still one of the major reasons why so many people use scikit-learn. They can try many algorithms on the same dataset, without needing to do much additional work.

Another reason, which I am definitely not responsible for, is that, from the start, the people who worked on it were very careful about having really good documentation.
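
Editor's illustration: the shared-interface idea David describes can be sketched in a few lines of plain Python. This is a minimal sketch and not scikit-learn's actual code. The two classes, the `accuracy` helper, and the toy data are hypothetical and exist only for this example; the one thing borrowed from scikit-learn is the convention that every model exposes `fit` and `predict`.

```python
# Illustrative sketch only: these classes are hypothetical, not scikit-learn code.
# They mimic the convention that every model exposes fit(X, y) and predict(X).
from collections import Counter


class MajorityClassifier:
    """Always predicts the most common label seen during fit."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestNeighborClassifier:
    """Predicts the label of the closest training point (squared Euclidean distance)."""

    def fit(self, X, y):
        self.X_ = [tuple(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

        return [
            self.y_[min(range(len(self.X_)), key=lambda i: dist(self.X_[i], row))]
            for row in X
        ]


def accuracy(model, X_train, y_train, X_test, y_test):
    """Fit any model that follows the shared interface and score it on test data."""
    predictions = model.fit(X_train, y_train).predict(X_test)
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return correct / len(y_test)


# Toy data, made up for this example.
X_train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
y_train = ["a", "a", "b", "b", "b"]
X_test = [(0.05, 0.1), (1.1, 0.9)]
y_test = ["a", "b"]

# The same loop works for every model because they share one interface.
for model in (MajorityClassifier(), NearestNeighborClassifier()):
    print(type(model).__name__, accuracy(model, X_train, y_train, X_test, y_test))
```

Because the evaluation code only relies on `fit` and `predict`, trying another algorithm on the same dataset means adding one more object to the loop, which is the kind of switching described above.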


Q: You are a proponent of open source software and an active member of open source communities. How did you first get involved with this?

DC: I got involved in the scientific Python community in 2006. The main driver for me was to stop using a piece of proprietary software that was very common in signal processing. Open source software was already well known by then but it was not so commonly used for scientific purposes. There are different reasons for this. The first is that scientists are not very good at writing software, or at least software that can be used by other people. Most scientists write software just for themselves. This is still true today, but it was even more so ten years ago. They just care about the output and after that they simply throw the code away.

Another factor was that it was not so easy to share software. For example, in speech recognition, everyone used the same tool and it was not free. That, for me, was very frustrating. To me there was a lot of appeal in using open source so that we could freely share software resources.


Q: How do you think the rise of open source software has impacted the scientific community, for example in relation to machine learning?

DC: I think deep learning would not be where it is today if it were not for open source. It made it easier for people to share algorithms and build on top of each other, instead of building separately.

More generally, I think open source in science has really grown as well. One reason is the reproducibility crisis in science. This is where people claim they have discovered something, but then it is so complicated that nobody else can actually reproduce the result. Of course, part of it is fraud, but that is only a small minority of cases. In the vast majority of cases, the problem is that when you have a lot of software, it is very complicated and it is very easy to make a mistake. I do not think open source is the whole solution, but it is part of the solution. If you share more code, other people will notice abnormalities and try to understand where they are coming from.

Open source helps us as well. Some of the tools we use at Cogent Labs are based on open source software. That lets us really focus. If it were not for open source, we would have to create everything from scratch. Open source is one of the reasons why there have been many more small software companies and startups in the last 15 years compared to before.


Q: What do you think of the state of the open source community today?

DC: Today, I think there is a major movement. It is very exciting to see how significant it has become. I no longer go to conferences as often, but they used to be almost like family gatherings. Now, conferences are much bigger. Python has become one of the most common programming languages in the world and there are PyData conferences all over the world. They are huge and are attended by thousands, if not tens of thousands, of people. It is exciting to have been a part of that from the beginning. I am quite proud of that and proud to see how it has grown. I like to think that I made some contribution to that.