Building vs Buying Training Data Software Considerations

Anthony Chaudhary
Published in Diffgram
Apr 30, 2020

Subject Matter Experts

Annotation is a very human endeavor. A training data tool is a high-touch, high-usage system, and annotators can spend many hours per day in it. That means the system needs to be as responsive as, say, a word processor.

Time to Market

Even if a team and a support budget are put together, it will take years to build an effective, tested, and scalable system. As with software engineering in general, the durable version takes at least 10x longer to create than the prototype. While Diffgram is still new, the product is trending toward durable status, with over 900,000 files created in over 1,000 company projects. You can be fully integrated with Diffgram within 3–6 months, versus the years it would take to build your own.

Total Cost

When you purchase a commercial solution, the research and development costs are shared with other customers who have similar problems. Diffgram has customers from virtually every industry, providing a diverse range of perspectives on ease of use and product quality. The vast majority of functions are shared across these industries, leading to better value and ROI for you.

Focus

The scope of training data software is large enough that building instead of buying is akin to creating a second product. This may take focus away from your primary goal and add risk to the project.

Unknown Unknowns

Creating any type of software carries risk, especially in novel areas like machine learning and deep learning. Expect the scope to evolve more rapidly than in a traditional project.

Compatibility

To make your system reasonably future-proof, you will likely want it to be compatible with multiple cloud providers and machine learning frameworks. Since the team working on this product will need to be separate from the team shipping the main product, good documentation will also be required.

In effect, by the time the team building the training data software has written this documentation, they will have recreated something similar to what is already available from Diffgram.

Rapidly Evolving Area

This is a rapidly growing area, with new standards and requirements being developed. Diffgram provides an evolving standard for approaching these problems. Diffgram is helping put forward and surface the most novel ideas and approaches to improve training data performance, such as active labeling, pre-labeling, auto-grader, and other quality control concepts.

Ongoing Maintenance and Support

Because this is both an end-user-facing and a system-facing product, it will need a significant ongoing support budget.

Open Source Alternative

The primary alternative is to use or customize an open source tool. These tools are generally either designed for a single user or come with a higher administrative overhead. Even getting started requires significant configuration. Ongoing maintenance is required, and feature additions are slower.

Passion

We are passionate about making AI more accessible and practical. This is not just a day job where we come in to meet a single requirements spec. We care about the overall goals and directions that training data can take. We are shipping the best training data software in the world.

Unclear Gain

The primary reason companies create in-house software is to gain a competitive advantage, such as shipping a novel product faster than a competitor. Because Diffgram already exists, recreating a similar version does not directly yield this advantage.

Diffgram offers the ability to deploy the code on your servers, under your control, and to pick and choose which parts of the software to use. For example, an advanced user can use the core service without using the label editor. Diffgram also offers close partnerships where we can work with you to build desired features, or help your technical team build extensions through our SDK.

Conclusion — it’s a lot of work and risk to build it yourself.

Building training data software at scale is a significant undertaking.

Considering training data software? Check out: Diffgram.com or the feature list.

Thanks for reading!
