Modern Libraries and Much More: Results of a Problem Interview with Developers

Pallium
4 min read · Aug 6, 2018


Some time ago we conducted a problem interview with developers working on machine learning.

We wanted to know which libraries, learning resources and frameworks developers use, why they choose that particular stack, and what problems they run into while working with it or training models: infrastructure issues related to computing capacity and to deployment of the final product.

Over 80 specialists from different countries took part in the study, though another 10 didn't find the interview interesting. We are grateful to everyone who shared their opinion, and here we present the results of our survey.

The first question concerned the choice of libraries and frameworks for machine learning. 75% of respondents use TensorFlow/Keras, 54% use Scikit-learn, and third place goes to Pandas, named by 34% of respondents, while NumPy and PyTorch share fourth and fifth places with 29% and 26% respectively. SciPy, Matplotlib and XGBoost each take from 10 to 14%, and Seaborn and Theano come in under 10%. Apart from that, a few dozen less popular resources and frameworks were mentioned, such as CatBoost, Matlab, Deeplearning4j, LightGBM, LIBSVM, H2O, spaCy, Gensim, Caffe and others.

Users explained why they choose TensorFlow/Keras: they are the most widespread, popular and even fashionable tools, with good community support, extensive open documentation, and a handy, practical toolset. They are also considered powerful, flexible, adaptable to specific tasks, rich in features, and to implement everything 'needed for ML'. Respondents describe the benefits of Scikit-learn and PyTorch in much the same way, noting they are mature and among the best community-supported Python resources. Among individual comments, users who prefer PyTorch to TensorFlow as faster and lighter also praise Seaborn for its user-friendly visualization tools.
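To give a sense of the convenience respondents are describing, here is a minimal Keras sketch: defining, compiling and training a small classifier takes just a few lines. The architecture and the synthetic data are our own illustrative choices, not something from the survey.

```python
# A minimal sketch of the Keras workflow respondents call "handy":
# define, compile and train a small binary classifier.
import numpy as np
from tensorflow import keras

# Toy data standing in for a real dataset: 1000 samples, 20 features.
x_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=(1000,))

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=32)
```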

While describing the benefits, most respondents' comments had much in common, but when they answered the question about problems they face with their chosen resources and frameworks, we got a whole bunch of different notes and complaints. Not all tasks can be solved on current stacks, so a large part of the code has to be written by hand; acceptable architecture solutions for a final product are not always available; there are troubles with training datasets; data quality is rather low; resources are used wastefully; some people even openly complain about the unavailability of cloud technologies and the consequent need to work on powerful laptops. Some users struggle with data import and processing, others point out conflicts with operating systems, for example Windows, when installing resources. Some say sarcastically that most of their problems come down to human error rather than technical issues. Still, among all that variety, a few problems are mentioned by almost all respondents: the long time needed to solve big tasks, the cost and volume of computing resources, and difficulties with deploying the final product.

When planning the interview, we expected TensorFlow to be the most popular library, and our expectation was fully confirmed. But we wondered not only why this particular library is so popular (we covered that above, citing our respondents' answers) but also why some developers choose against our rating leader. We didn't find significant common patterns, but the clearest and most distinct answers were: it is unpleasant to develop in C++; PyTorch is more advanced (no clear reasons given) and more productive. There was also a pool of answers (about 20%) whose authors consider TensorFlow a near-term prospect and plan to use it in the future.

However, we were interested not only in troubles with particular libraries but also in the common infrastructure problems developers do or don't face while training their ML models. 11 lucky people said they have no issues at all. Three said they have issues very rarely. The rest did have problems, though fairly usual and predictable ones: library faults, compatibility problems, transferring models from one library to another, and putting models into production. The biggest and most frequent problem is certainly computing capacity: a lot of time is spent training models, especially as models get more complex and datasets grow in volume.
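As one illustration of the model-transfer problem respondents mention, a common approach (our example, not one named in the survey) is exporting through a framework-neutral format such as ONNX. A minimal sketch, assuming PyTorch and a toy model:

```python
# Hedged sketch: moving a model between libraries via ONNX.
# The model here is a toy stand-in for whatever was actually trained.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()  # export in inference mode

# Export to ONNX; another library's ONNX importer can then load "model.onnx".
dummy_input = torch.randn(1, 20)
torch.onnx.export(model, dummy_input, "model.onnx")
```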

The problem of resources and computing capacity is obvious and essential. Where do our respondents get computing capacity, and is it enough for tasks of different scale and complexity? More than half of them (over 51%) use their own hardware: PCs or laptops, sometimes even friends' laptops. 26% rely on local servers, and 17% use AWS and other cloud services. About 5% use capacity provided by the institutes or companies where they work. A few exceptional people say they get capacity from God. We thought how great it would be if anyone could, because the last question of our interview concerned the availability of computing capacity. Only 24% say they have no such problem now or in the near term. 30%, on the contrary, do feel a lack of computing capacity, and 36% are fine now but see it becoming a problem in the future. Only part of that 36% is thinking about possible solutions: renting additional capacity or turning to open resources. Some are unwilling to do that and would rather spend more time training models. Others dream of paying for additional capacity only when it is actually doing work, not while other problems, for instance architectural ones, are being solved.


Pallium

Pallium Network is a distributed computation network for machine learning.