[Colab] [Github]

By Ken Gu

Transformer-based models are a game-changer when it comes to using unstructured text data. As of September 2020, the top-performing models in the General Language Understanding Evaluation (GLUE) benchmark are all BERT transformer-based models. At Georgian, we find ourselves working with supporting tabular feature information as well as unstructured text data. We found that by using the tabular data in our models, we could further improve performance, so we set out to build a toolkit that makes it easier for others to do the same.

The 9 tasks that are part of the GLUE benchmark

Building on Top of Transformers

The main benefits of using transformers are that they can learn long-range dependencies within text and can be trained in parallel (as opposed to recurrent sequence-to-sequence models), meaning they can be pretrained on large amounts of data. …



A starting point to help you choose the right platform for your ML project.

By Jing Zhang

Introduction

If you’re looking for an end-to-end machine learning (ML) platform, you’re spoiled for choice. There are three main cloud providers to choose from: Google Cloud Platform (GCP), Amazon Web Services (AWS) and Microsoft Azure (Azure). The question is: how do you choose between the three? What functionality do they provide to build ML pipelines? We set out to answer these questions in a recent hackathon.

The R&D team at Georgian, where I work as an ML Engineer, decided to organize a hackathon to explore how each provider can help with the workflow. Our team builds machine learning software components ourselves, but we’re typically working with the growth-stage software companies in the Georgian family to deploy, so we wanted to familiarize ourselves with the different platforms and to be able to adapt to their workflows faster. …


by Akshay Budhkar and Parinaz Sobhani

In this article we make some suggestions on how to use the latest technologies to run your team as lean as possible while creating a differentiated machine learning product at your startup.

Photo by Chris Ried on Unsplash

As a team, you and your machine learning scientists and engineers are focused on driving efficiencies for your customers by automating processes and improving the usability of the product by providing insights to your users.

However, now more than ever, you will have to do so as efficiently as possible yourselves. Here we make some suggestions on how to use the latest technologies to run your team as lean as possible while creating a differentiated machine learning product at your startup. …



In a world where the risks and costs associated with privacy are on the rise, and privacy issues are leading to broader questions around trust and AI, we believe that differential privacy offers a solution. Differential privacy allows machine learning teams to create products without risking the privacy of any individual’s data.

With the right tools, we think that companies of all sizes can take advantage of advanced machine learning techniques such as differential privacy. …


An opinionated guide to tooling in Python covering pyenv, poetry, black, flake8, isort, pre-commit, pytest, coverage, tox, Azure Pipelines, sphinx, and readthedocs

Adithya Balaji


Here’s a link to the TL;DR

Introduction

So you’ve got an awesome idea for a new open-source package that’s going to rock the Python landscape. You might be wondering how to set up the package so that the open-source community can help take your idea to the next level. Or, you might be wondering how you can solve the problems you’re running into when setting up a personal project. For example, you might be running into merge conflicts due to inconsistent styles, or broken refactors because the appropriate tests are not in place. …


by Akshay Budhkar


The International Conference on Learning Representations (ICLR) is one of the biggest machine learning conferences in the world. It’s a competitive conference, with the main conference having an acceptance rate of just 31.4% (500/1591), so the standard of content is high.

I attended the seventh edition of the conference in New Orleans along with Eddie Du, another Applied Research Scientist on the Georgian Impact team. In this post, we share our highlights — papers that we found interesting in areas where we saw the potential to add value in applied projects with the Georgian Partners portfolio.

1. [Contributed Talk] ImageNet-trained CNNs Are Biased Towards Texture; Increasing Shape Bias Improves Accuracy and Robustness

Why we like it: A simple but very efficient approach to improve object detection by relying on shapes (like humans do) as opposed to textures. The approach can be extended to other domains. …


Aaruran Elamurugaiyan

Privacy can be quantified. Better yet, we can rank privacy-preserving strategies and say which one is more effective. Better still, we can design strategies that are robust even against hackers who have auxiliary information. And as if that weren’t good enough, we can do all of these things simultaneously. These solutions, and more, reside in a probabilistic theory called differential privacy.

The Basics

Here’s the context. We’re curating (or managing) a sensitive database and would like to release some statistics from this data to the public. However, we have to ensure that it’s impossible for an adversary to reverse-engineer the sensitive data from what we’ve released [1]. An adversary in this case is a party with the intent to reveal, or to learn, at least some of our sensitive data. Differential privacy can solve problems that arise when these three ingredients — sensitive data, curators who need to release statistics, and adversaries who want to recover the sensitive data — are present (see Figure 1). …
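As a rough illustration of the setting above, one widely used way for a curator to release a statistic is the Laplace mechanism: compute the true answer, then add noise scaled to the query's sensitivity divided by a privacy parameter, usually written epsilon. The sketch below is an assumption on our part, not necessarily the approach the full article goes on to describe. The names `laplace_noise` and `private_count`, and the sample `ages` data, are hypothetical and exist only for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential draws with the same
    rate (1 / scale) follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count of the records matching `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon. A smaller
    epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical sensitive data: the ages of six people.
    ages = [23, 35, 41, 52, 67, 29]
    rng = random.Random()  # unseeded, so each release differs
    noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
    print(f"Noisy count of people aged 40 or over: {noisy:.2f}")
```

The released value is the true count plus random noise, so any single release reveals little about whether one particular person is in the database, while averages over many records remain useful.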


A head to head comparison of four automatic machine learning frameworks on 87 datasets.

Adithya Balaji and Alexander Allen

Introduction

Automatic Machine Learning (AutoML) could bring AI within reach for a much larger audience. It provides a set of tools to help data science teams with varying levels of experience expedite the data science process. That’s why AutoML is being heralded as the solution to democratize AI. Even with an experienced team, you might be able to use AutoML to get the most out of limited resources. While there are proprietary solutions that provide machine learning as a service, it’s worth looking at the current open source solutions that address this need.

In our previous piece, we explored the AutoML landscape and highlighted some packages that might work for data science teams. In this piece we will explore the four “full pipeline” solutions mentioned: auto_ml, auto-sklearn, TPOT, and H2O’s AutoML solution. …


A review of 22 machine learning libraries to help you choose which one might be right for your pipeline.

Alexander Allen and Adithya Balaji

Introduction

At Georgian Partners, our data science team is consistently looking for ways we can improve our efficiency and the efficiency of teams at our portfolio companies. One way is through improving the tooling in our machine learning pipelines. Rather than manually writing code to manipulate datasets, it can be more efficient to draw from the vast collection of libraries available. However, there are so many libraries claiming to improve upon different processes in different ways that it is overwhelming to make a selection. …

About

Georgian

Investors in high-growth business software companies across North America. Applied artificial intelligence, security and privacy, and conversational AI.
