PyTorch 1.0 and PTDC Part 2

James Burton
17 min read · Oct 3, 2018


Continuing from my previous post covering the morning of the event, here is a summary of the afternoon’s session at the PyTorch Developer Conference featuring the launch of PyTorch 1.0.

TL;DR

This session includes some fun examples of CycleGANs generating art and video effects, some advanced PyTorch libraries for probabilistic programming and Gaussian processes, and a couple of people from the fantastic FastAI team, who are doing wonders to democratise AI and have also just released v1 of their wonderful library.

We start the afternoon with Session 3, featuring a range of speakers using PyTorch for research.

There has been interesting and ongoing research into optimising neural networks, with newer techniques finding ways to converge on different shapes of data. One technique to emerge earlier this year is signSGD. This is a variant of Stochastic Gradient Descent (a common optimisation strategy) which transmits only the sign of each mini-batch's gradient, aggregated by majority vote, instead of transmitting entire gradients, reducing communication overhead. This facilitates distributed learning whilst maintaining convergence; the full paper is available on arXiv and was first published in February.

The speakers announced a new pre-release of a PyTorch implementation, published to GitHub here:

https://github.com/PermiJW/signSGD-with-Majority-Vote
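The repository holds the full distributed implementation. Purely as a sketch of the core idea (my own illustration, not the authors' code), a single-worker signSGD update can be written as a tiny custom optimiser; majority voting is then the multi-worker extension, where each worker sends only gradient signs and the server takes an element-wise majority:

    import torch
    from torch.optim import Optimizer

    class SignSGD(Optimizer):
        """Single-worker sketch: step in the direction of the gradient's sign."""
        def __init__(self, params, lr=0.01):
            super().__init__(params, dict(lr=lr))

        @torch.no_grad()
        def step(self, closure=None):
            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is not None:
                        # Only the sign of each gradient component is used,
                        # which is what makes the update cheap to communicate.
                        p.add_(p.grad.sign(), alpha=-group['lr'])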

We are then introduced to Tensorly, a high-level API for tensor algebra. This handy library supports many tensor operations and multiple back-ends, including PyTorch, NumPy and MXNet, allowing tensor operations to be programmed at a more abstract level and used easily across different environments.
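As a small illustration (a sketch against Tensorly's documented API in recent versions; the example tensor is invented), switching back-ends is a one-liner and the same code then operates on PyTorch tensors:

    import tensorly as tl
    from tensorly.decomposition import parafac

    tl.set_backend('pytorch')    # 'numpy' or 'mxnet' work identically

    X = tl.tensor([[1.0, 2.0], [3.0, 4.0]])   # a torch.Tensor under the hood
    cp = parafac(X, rank=1)                   # backend-agnostic CP decomposition
    print(tl.cp_to_tensor(cp))                # reconstruct the approximation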

We then move on to a discussion of universal loss functions, their use with unmatched or unsupervised data, and attempts to implement such routines. There has been particular development here around CycleGANs, where in addition to the individual discriminators there is a comparison of the round-trip error; as this involves checking that the output returns sufficiently closely to the original input, it is amenable to generalisation as a universal loss function.
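In code, the round-trip check is just a reconstruction penalty added to the usual adversarial losses. A minimal sketch (G_xy and G_yx stand for the two generators and are assumed to already exist; the L1 distance and weighting follow the CycleGAN paper):

    import torch.nn.functional as F

    def cycle_loss(G_xy, G_yx, real_x, real_y, lam=10.0):
        # Translate X -> Y -> X and Y -> X -> Y, then penalise how far
        # each round trip lands from its starting point.
        loss_x = F.l1_loss(G_yx(G_xy(real_x)), real_x)
        loss_y = F.l1_loss(G_xy(G_yx(real_y)), real_y)
        return lam * (loss_x + loss_y)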

This development has led to some eye-catching examples of image and video processing, from the seemingly simple conversion of horses to zebras and back again, to turning winter to summer (and back) in Yosemite park.

The model can do many more things too, and some of the art-generation and AI-assistant possibilities are amazing. When these networks are trained on outlines of objects alongside examples of the objects, they become able to generate images from simple drawn outlines. Applying the same trick to image segmentation (marking what is what in an image, such as a map) allows you to train a model to perform segmentation, or to generate images from segmentation details.

This has particularly caught attention when applied to people, and the variations achieved are striking when multi-modal output is used to show a range of results from a single input.

(See the PyTorch multimodal image-to-image translation work.)

The Pix2Pix and CycleGAN PyTorch implementation that I’ve been using was discussed, and is available here: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix.

A clip of Memo Akten's "Learning to See: Gloomy Sunday" video was used to demonstrate applying this technique to video. In it, the artist films cloth being moved to represent sea and clouds, with some additional objects to hand, and then uses a CycleGAN to render moving sky and waves, with rocks forming where objects are placed.

See http://www.memo.tv/portfolio/learning-to-see/ for the full video and more

The final project presented here is OpenPose, which can detect a person's pose from a photo. The pose can then be used with pix2pix to apply the appearance of a different person … now you can transfer one person's movements to another! The conference showed video of a professional ballet dancer performing moves, and two seemingly poor dancers mapped via her pose and pix2pix so that they appear to perform the same moves. Faked dance videos aside, there are many potential applications here, particularly in the virtual avatar space, interaction with more realistic generated characters, and film and game production, amongst others.

Finally, why does the speaker love PyTorch? He says simply that it allows his students to be more creative and productive.

Speaker: Andrew Wilson from Cornell University

Andrew takes us on a tour of intricate developments in the mathematics and implementations underpinning world-class deep learning research built on PyTorch.

The team developed stochastic weight averaging (SWA), which is now available for PyTorch. Their research suggests this significantly improves generalisation. Importantly, it works particularly well with low-precision tensors, dovetailing with the current updates to better support 16-bit floating-point tensor operations.
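SWA shipped via the torchcontrib package at the time and later moved into core PyTorch as torch.optim.swa_utils. A minimal sketch with the core API (the model, data and averaging schedule here are invented purely for illustration):

    import torch
    from torch import nn, optim
    from torch.optim.swa_utils import AveragedModel, SWALR

    model = nn.Linear(10, 2)
    opt = optim.SGD(model.parameters(), lr=0.1)
    swa_model = AveragedModel(model)      # keeps a running average of the weights
    swa_sched = SWALR(opt, swa_lr=0.05)   # learning rate for the SWA phase

    for step in range(100):
        x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
        if step >= 75:                    # average only late in training
            swa_model.update_parameters(model)
            swa_sched.step()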

He then introduces Scalable Gaussian Processes. With these tools it is possible to run Gaussian processes on millions of points in seconds, instead of thousands of points in hours as had been the case until now.

He introduces us to GPyTorch (https://gpytorch.ai/). This library reduces Gaussian-process inference to fast batched matrix multiplies, bringing deep-learning-style hardware acceleration to models with principled uncertainty.

https://github.com/cornellius-gp/gpytorch

  • Training a GP+DenseNet on CIFAR-100 is possible at only 20% overhead.
  • This library makes it possible to train up to 30x faster, generate predictions up to 1000x faster, and sample up to 18,000x faster.
  • Implementing a new model is as easy as writing a matmul.

The library is built around LazyTensor, which stacks up operations without resolving the actual sums (more a plan of operations than a result). This allows shortcuts, fusion and sharing of operations; for example, a sparse lazy tensor will skip the empty values.
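A minimal GPyTorch regression model, following the pattern from the library's own documentation (the data and kernel choices here are illustrative):

    import torch
    import gpytorch

    class ExactGPModel(gpytorch.models.ExactGP):
        def __init__(self, train_x, train_y, likelihood):
            super().__init__(train_x, train_y, likelihood)
            self.mean_module = gpytorch.means.ConstantMean()
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

        def forward(self, x):
            # The kernel matrix is returned lazily; sums and solves are
            # only evaluated when inference actually needs them.
            return gpytorch.distributions.MultivariateNormal(
                self.mean_module(x), self.covar_module(x))

    train_x = torch.linspace(0, 1, 100)
    train_y = torch.sin(train_x * 6.28) + 0.1 * torch.randn(100)
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = ExactGPModel(train_x, train_y, likelihood)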

Complex differentiable layers in deep learning

Speaker: Zico Kolter from Carnegie Mellon

Zico begins by discussing AI in self-driving cars, which he worked on previously. He draws on this experience to present a comparison of deep learning versus traditional programming.

Essentially, such modules can be defined as parameters trained to minimise a loss.

Zico then continues on to discuss the OptNet layer (see the arXiv paper). Essentially, an OptNet layer embeds an optimisation problem (specifically a quadratic program) as a network layer, allowing solution of problems that traditional convolutional and fully-connected layers cannot capture, whilst retaining low overhead and compatibility with standard backpropagation of gradients.

I won’t copy all the code examples provided, but as with most of the contributors they have open-sourced their libraries:

http://locuslab.github.io/qpth
https://github.com/locuslab/qpth
https://github.com/locuslab/optnet
https://github.com/locuslab/e2e-model-learning
https://github.com/locuslab/lcp-physics
https://github.com/locuslab/mpc.pytorch

The qpth library provides an efficient batch solver based upon GPU-accelerated matrix factorisation, effectively giving a "free" backwards pass.
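A hedged usage sketch of qpth's QPFunction (the problem data below is random and purely illustrative): it solves a whole batch of quadratic programs in the forward pass and is differentiable like any other autograd function.

    import torch
    from qpth.qp import QPFunction

    batch, nz, nineq = 16, 10, 20
    Q = torch.eye(nz).repeat(batch, 1, 1)           # positive-definite cost
    p = torch.randn(batch, nz, requires_grad=True)
    G = torch.randn(batch, nineq, nz)
    h = torch.ones(batch, nineq)                    # z = 0 is always feasible
    e = torch.empty(0)                              # no equality constraints

    # Solves argmin_z 0.5 z'Qz + p'z  subject to  Gz <= h, for the whole batch.
    z = QPFunction(verbose=False)(Q, p, G, h, e, e)
    z.sum().backward()                              # gradients flow through the solver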

This has been applied to a range of problems, including the following:

  • Solving Sudoku puzzles (NB: challenging for standard DL, easy in qpth)
  • Power system scheduling
  • Learning zero-sum games (inverse game theory)
  • Differentiable physics (including collisions and other hard-to-model elements) that remains trainable
  • (Released today): differentiable MPC (imitation learning, decision learning, deep RL, planning, etc.)

Zico was surprised at how easy it is to build these and to re-use them as modules in PyTorch, and notes that the object-oriented design of PyTorch is instrumental to this.

Early detection and why it matters

Presented by a researcher from New York University

Statistics on the leading causes of death in the USA show that many are amenable to intervention given early detection.

Large Scale Chronic Disease Prediction using electronic health records.

The team encountered challenges with messy clinical notes: bad handwriting, shorthand and abbreviations, and doctors concerned with key points rather than good writing, all of which make these notes more complex to understand via NLP.

They tested data spanning a 3–9 month window, to see whether the earlier data could predict any of the later data given enough sample information to train from. NB: this forms part of ongoing research at NYU.

Out of over 6 million users in the system, only about 300,000 have enough data to fulfil the time-range and other requirements for inclusion.

Some cross-correlation of structured data against the notes has been possible, and has identified gaps in the structured data where the information was present only in the unstructured notes.

Handling of numbers is especially important in this system.

They have kept to the community spirit by open-sourcing their code.

They needed visualisations for the predictions and recommendations, and have looked at gradient-based and log-odds-based methods to explain decisions, because these systems can't just be black boxes; the recommendations need a basis to drive behaviour and insight.
https://github.com/NYUMedML/deepEHR
https://arxiv.org/abs/1808.04928
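Their visualisation code lives in the repo above; as a generic sketch of the gradient-based family (everything here is my illustration, not their implementation), the idea is to measure how strongly each input position moves the predicted logit:

    import torch

    def gradient_saliency(model, embedded_notes, target_class):
        # embedded_notes: (1, seq_len, dim) embeddings of a clinical note.
        embedded_notes = embedded_notes.clone().requires_grad_(True)
        logits = model(embedded_notes)        # assumed shape (1, num_classes)
        logits[0, target_class].backward()
        # Per-token gradient magnitude serves as a relevance score.
        return embedded_notes.grad.norm(dim=-1)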

There was a break here … and we then resume with the Industry segment of the conference.

Tesla

Speaker: Andrej Karpathy, Director of AI

Andrej is interested to see techniques from software engineering spread more widely into the deep learning and AI community, and discusses some of the techniques and patterns to consider.

CI (Continuous Integration) workflow:

  • Automate the build (automatically kick off neural network training jobs)
  • Run the unit tests immediately
  • Every commit should be built (on every change to training/data)
  • Keep the build fast (use large-scale distributed training)
  • Make it easy to get the latest deliverables
  • Automate deployment

Version Control:

Now data needs to be version controlled too.

  • Labelling documentation will change over time.
  • You will feel the pressure to still use legacy labels
  • Different labellers

Andrej then discusses mono-repos
e.g. Google / Netflix / Twitter / etc.

  • These allow for sharing of code at source level instead of linking pre-built binaries.

Single models are the mono-repos of Software 2.0

  • Build a single model, trained from scratch, every time.

In Summary:

Treat deep nets like code:

  • Test Driven Development
  • Use CI
  • Use Version Control
  • Train 1 model to solve all tasks, from scratch

There is a long way to go before deep learning becomes a discipline rather than a dark art.

Applied Deep Learning

Speaker: Bryan Catanzaro, Deep Learning Research @ NVIDIA

They like PyTorch because it is:

  • Simple
  • Extensible
  • Fast.

Bryan shows a diagram of a looping triangle, from Idea, to Test, to Code, and back to Idea.

PyTorch has been great to reduce the latency in this cycle.

Deep Learning Super Sampling (DLSS):

DLSS, a marquee feature of the new Turing GPU line-up, demonstrates that applying deep learning to real-time graphics is now possible.

DLSS improves image quality and framerates with a neural network, with features including:

  • Anti-aliasing
  • Super-resolution

NB: Tensor Cores enable this functionality to be applied in real-time.

Nvidia Tech Demos

In-painting with partial convolutions
http://research.nvidia.com/inpainting

Image and Video Synthesis
https://github.com/NVIDIA/vid2vid

Goal: render graphics with generative models
Conditioned on high-level input

  • Easy to create and edit
  • Provides control

This creates videos with temporal consistency.

Frame prediction (SDC-Net):
Predicts a per-pixel warp vector and a per-pixel sampling kernel to generate the next frame. This can be used to improve frame-rates by fully computing only every other frame.
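The exact SDC-Net architecture aside, the per-pixel warp step can be sketched with PyTorch's grid_sample, which resamples a frame at displaced coordinates (the shapes and normalisation below are my assumptions):

    import torch
    import torch.nn.functional as F

    def warp(frame, flow):
        # frame: (B, C, H, W); flow: (B, 2, H, W) predicted offsets in [-1, 1] units.
        b, _, h, w = frame.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing='ij')
        base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)   # identity grid
        # Shift every pixel's sampling position by its predicted warp vector.
        return F.grid_sample(frame, base + flow.permute(0, 2, 3, 1),
                             align_corners=True)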

Text-to-Speech: Tacotron2 and WaveNet
https://github.com/NVIDIA/nv-wavenet

The new WaveNet implementation achieves:

  • 320 voices at 16 kHz, using the “Deep Voice” net
  • Maximum sample rate of 48 kHz

The project is open source, with PyTorch bindings and training code.

Tacotron takes text and generates a spectrogram, which WaveNet then uses as input.

Unsupervised Language Modeling
https://github.com/NVIDIA/sentiment-discovery

  • Converged a language model on 40 GB of text in 4 hours, using mixed-precision arithmetic on 128 V100 GPUs
  • Transfer language model to sentiment task
  • Uses Apex AMP to automatically train model in mixed precision
    https://github.com/NVIDIA/apex

NB: For those who haven't seen Apex, it is a great new library supporting lower-precision tensor operations, increasing the speed and size limits of a given model on compatible hardware.
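Per the Apex documentation, adopting AMP means wrapping an existing model and optimiser and scaling the loss (the tiny model below is invented; opt_level "O1" is the usual mixed-precision setting):

    import torch
    from torch import nn, optim
    from apex import amp

    model = nn.Linear(512, 10).cuda()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # "O1" patches common ops to FP16 while keeping FP32 master weights.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    x = torch.randn(64, 512).cuda()
    loss = model(x).sum()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()    # loss scaling avoids FP16 gradient underflow
    optimizer.step()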

The Natural Language Decathlon:
Multitask Learning in PyTorch

Speaker: Bryan McCann, Research Scientist @ Salesforce
bmccann@salesforce.com

Bryan takes us through their multitask research and the decaNLP project, which is designed to evaluate and share techniques for NLP.

From Single-task to Multi-task Learning:

  • Great performance in recent years given a single (dataset, task, model, metric).
  • For more general NLP systems, we need to push ourselves into a multitask setting.

decaNLP: The Natural Language Decathlon

  • NB: includes a leaderboard, which helps show which approaches worked where

Benchmark:

  • Question Answering
  • Machine Translation
  • Summarization
  • Natural Language Inference
  • Sentiment Classification
  • Semantic Role Labelling
  • Relation Extraction
  • Dialogue
  • Semantic Parsing
  • Common Sense Reasoning

Framework:

  • General language understanding
  • Multitask learning
  • Transfer learning
  • Pre-training
  • Fine-tuning
  • Weight sharing
  • Zero-shot learning
  • Domain adaptation
  • Optimization strategies
  • Data augmentation

Their platform supports testing models in a multi-task setting, with tools to support evaluation and pre-trained models for comparison. It also supports custom datasets (JSON format, with "context", "question" and "answer" properties on each object in the set).
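So a custom dataset entry would look something like this (an invented example in the stated format):

    {
      "context": "PyTorch 1.0 was announced at the PyTorch Developer Conference in October 2018.",
      "question": "When was PyTorch 1.0 announced?",
      "answer": "October 2018"
    }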

Approach:

  • Train a single seq2seq/QA model
  • Unifies classification, extraction, generation
  • Works on ten tasks
  • Works on single tasks (SOTA on WikiSQL)
  • No human intervention
  • No task-specific settings
  • Domain adaptation
  • Some zero-shot

Bryan then proceeds with the always-risky live demo … He picks a URL (the Wikipedia page about PyTorch) and asks the question "What is PyTorch?". The system successfully highlights a relevant section from the page.

"Is this sentence negative or positive?", with a URL to the PyTorch 1.0 announcement/PR … the system successfully reads around the meaning of the sentence and answers that the article/post is positive.

"Is this release pessimistic or exciting?", with a PyTorch URL again … even though it hasn't really been trained on pessimistic vs exciting, it can make sense of this and answer that the release is exciting.

The Dream: Multitask Learning + PyTorch; Develop a single multitask model and train on everything.

Pyro / Deep Universal Probabilistic Programming

Speaker: Fritz Obermeyer @ Uber AI Labs

http://pyro.ai

Goals:

  • Universal
  • Scalable
  • Flexible
  • Minimal

Dynamic computation graphs were a key PyTorch feature required for this.

3 layer architecture:

  • Probabilistic Programming
  • Effects Library
  • Inference Algorithm

This approach enables composable inference algorithms.

The approach relies on specifying whether the ranges being looped over are independent of each other, allowing the inference engine to make appropriate optimisation choices.
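Concretely, this is Pyro's plate construct (in today's API): wrapping sampling in a plate declares the iterations conditionally independent, which lets inference vectorise and subsample over them. A minimal sketch with invented data:

    import torch
    import pyro
    import pyro.distributions as dist

    def model(data):
        mu = pyro.sample("mu", dist.Normal(0.0, 10.0))
        # plate marks the observations as conditionally independent,
        # so the inference engine may vectorise or subsample them.
        with pyro.plate("data", len(data)):
            pyro.sample("obs", dist.Normal(mu, 1.0), obs=data)

    model(torch.randn(100))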

Examples shown include:

  • An MNIST digit generator
  • Inferring latent data
  • Stochastic Variational Inference
  • The Pyro-on-PyTorch architecture

AllenNLP

Speaker: Mark Neumann, Research/Engineering

http://allennlp.org

Mark presents a synopsis of the AllenNLP project; here are the key notes:

An open-source library to support deep-learning-infused NLP research, both within AI and beyond, allowing researchers to build on the most useful reusable abstractions.

Reference implementations, live demos and state of the art models.

Major improvements to all models using language model pretraining (ELMo) on raw text

Research on paragraph understanding and semantic parsing.

Originally built on Keras, but rebuilt on top of PyTorch to take advantage of the following:

  • Imperative and easy to debug
  • Dynamic graphs for flexible NLP models
  • Amazing and active community

They found themselves jealous of reinforcement learners, and wanted these features:

  • High quality public demos
  • Batching/Padding with support for nesting and structured data
  • JSON Configurable Higher Level APIs for NLP

Range of modules

Minimal NLP example:
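(The exact code from the slide isn't reproduced here; as a flavour of the era's API, and with the archive URL shown only as an illustration, loading and querying a pretrained reading-comprehension model looked roughly like this:)

    from allennlp.predictors.predictor import Predictor

    predictor = Predictor.from_path(
        "https://allennlp.s3.amazonaws.com/models/bidaf-model-2017.09.15-charpad.tar.gz")
    result = predictor.predict(
        passage="PyTorch 1.0 was announced at the developer conference.",
        question="What was announced?")
    print(result["best_span_str"])   # the extracted answer span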

NB: More examples available on their website.

On to the live demo … dependency parsing.

http://demo.allennlp.org/dependency-parsing/

NB: They show some good outputs, but when prodded the demo reveals some of its breaking points quite well too.

An example of machine comprehension shows that asking who stars in The Matrix picks out the actor list, but asking how many people starred in The Matrix pulls out the number 1999!

PyTorch 1.0 really powers this with the new JIT compilation, avoiding the need for other compiler techniques/packages.

Used by Facebook, Amazon Alexa, Airbnb, UCI, NYU, Harvard, and more.

They are looking to apply these across other domains, such as science journalism.

NB: This is the last Session before the panel.

Democratisation of Deep Learning

Speaker: Stuart Fry @ Udacity

Udacity started with a Stanford Introduction to AI course that they decided to put online for free whilst also running it on campus.

160,000 people signed up (before they stopped enrolment).

More than 23,000 finished the online version, alongside the two hundred or so students at Stanford. This represented a new scale of learning: more students than all the other classrooms teaching AI at the time combined.

They gave everyone the same test at the end.

Udacity keep making online courses/MOOCs.

They then also developed the Nanodegree Programs, which are larger, project based, properly reviewed and assessed.

Stuart shows us an example of their Self-Driving Car course, going from project #1 through to final project #8. The first project is relatively straightforward, building rapidly to a final project whose code is actually run on a self-driving car for testing.

They have a whole “School of AI” set of courses.

PyTorch is an important (and increasingly so) part of their curriculum, woven through 4 of their 8 Nanodegrees.

The Deep Learning course had previously been taught in TensorFlow, and now uses PyTorch.

PyTorch's Pythonic style fits very well with existing Python and NumPy code.

horses2zebras shown again:
https://junyanz.github.io/CycleGAN/

He discusses students building a ragdoll-to-Birman-cat CycleGAN.

He also discusses an automatic caption-generation AI: training a CNN to recognise patterns in an image, and using an RNN to generate a descriptive caption of the image.
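A skeletal version of that CNN-encoder / RNN-decoder pattern (every name and size below is my invention, not the course code) might look like:

    import torch
    from torch import nn
    import torchvision.models as models

    class CaptionNet(nn.Module):
        def __init__(self, vocab_size, embed=256, hidden=512):
            super().__init__()
            cnn = models.resnet18(weights=None)
            self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # CNN minus classifier
            self.project = nn.Linear(512, embed)
            self.embed = nn.Embedding(vocab_size, embed)
            self.rnn = nn.LSTM(embed, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, images, captions):
            feats = self.project(self.encoder(images).flatten(1))
            seq = torch.cat([feats.unsqueeze(1), self.embed(captions)], dim=1)
            hidden, _ = self.rnn(seq)      # image feature seeds the caption RNN
            return self.out(hidden)        # per-step vocabulary logits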

Free Course:
Intro to Deep Learning with PyTorch
by Facebook Artificial Intelligence, Soumith and Udacity.
(Comes out on November 9th)

PyTorch Scholarship Challenge from Facebook

  • Entries open now (live today).
  • Scholarship starts on Nov 9th (alongside the free course)

Speaker: Rachel Thomas, Co-founder @ fast.ai

Rachel discussed the state of the software and the community now building up, along with the direction, changes and ethos.

FastAI brings to PyTorch the same ethos that Keras brought to TensorFlow, and FastAI v1 was announced today (2nd October 2018).

They started using PyTorch 0.1 in Spring 2017 for attentional models with teacher forcing. This is an important NLP technique, but it is tricky in Keras.

FastAI officially switched all of fast.ai to PyTorch in Sept 2017.

Why PyTorch?

  • Dynamic (as opposed to static) computation graphs -> easier to debug
  • Rapid iteration and easier experimentation -> fast models in fewer lines of code

PyTorch also feels more natural from an OO background.

Less code, higher accuracy, faster.
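That claim is literal: fastai v1's launch example trains a competitive image classifier in a handful of lines (sketched from the v1-era API, which has since changed; the dataset and hyper-parameters follow their course example):

    from fastai.vision import *   # fastai v1-era wildcard import style

    path = untar_data(URLs.PETS)  # downloads the Oxford-IIIT Pet dataset
    data = ImageDataBunch.from_name_re(
        path/'images', get_image_files(path/'images'),
        r'/([^/]+)_\d+.jpg$', ds_tfms=get_transforms(), size=224)
    learn = create_cnn(data, models.resnet34, metrics=accuracy)
    learn.fit_one_cycle(4)        # transfer learning with the 1cycle policy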

FastAI is used in engineering at GitHub and in Fortune 500 companies.

Some of their students working on social impact projects were written up in Forbes, for example:

“Artificial Intelligence Education Transforms The Developing World.”

More fun projects are available too.

There are also articles about the FastAI team winning the DawnBench deep learning speed challenges:

  • An AI speed test shows clever coders can still beat tech giants like Google and Intel
  • A small team of student AI coders beats Google’s machine-learning code.

NLP improvements over prior state-of-the-art:

  • Introducing state of the art text classification with universal language models
  • Universal Language Model Fine-tuning for Text Classification

There is a helpful community at forums.fast.ai.

Rachel comments that the "Language Model Zoo" is particularly worth noting; in it, people share tips and models for many languages.

Final Panel, hosted by Soumith

Here we have a panel hosted by Soumith, with some of the top influencers and developers at the cutting edge of AI today. I will not be writing down everything said, and please don’t assume all comments are accurately quoted, but I have tried to provide selected highlights for those who don’t wish to watch it in full.

Speakers:

  • Yangqing Jia (YJ) — creator of Caffe, Caffe2 and ONNX, and part of PyTorch 1.0
  • Noah Goodman (NG) — Associate Professor at Stanford (Computation and Cognition Lab), who worked with Uber on the Pyro project
  • Jeremy Howard (JH) — co-founder of FastAI, previously CEO of Enlitic and President of Kaggle
  • Chris Lattner (CL) — Google; creator of Clang and Swift, and led the developer tools group that built Xcode

Do you say you do AI or ML?
NG: Human and Machine Intelligence
JH: AI
CL: ML
YJ: AI

JH: The bad thing about PyTorch is Python. You can say it's better than R, but it's worse than just about everything else.
CL seemed not to care so much about the language; rather, he thinks the core ideas plus throwing compute power at them solve most problems. He agrees that Python has issues running at scale, and says that Swift for TensorFlow is about seeing what happens if you remove the Python barrier, suggesting that things loved 40 years ago in the Fortran world have been forgotten.

Discussion of C++, Swift, Python

  • Flexibility is a key strength of Python.
  • Soumith refused to answer whether he thinks there will be a SwiftTorch, saying only that he'll be guided by the community.

Soumith: “One of the things that’s happening is that the AI landscape changes every month”

Some people suggest probabilistic programming will be the next wave after deep learning. Do you think this is likely for mainstream applications, and how do you see the stack changing?
NG: Multi-paradigm programming will be more of an end-goal.
… NG also suggests that logic programming will be a part of things.

JH: System dynamics is fairly close to what is now called probabilistic programming. He thinks the current methods are over-complex (as is much of deep learning).
… It should be about you providing the domain information you have and it figuring out what it can from there.
… Domain-specific differentiable layers may help average users solve problems and generate predictions, etc.
CL: I don't have a deep enough knowledge of probabilistic programming, but I generally agree with Noah. The key is being able to pick and choose the right technologies for given problems without needing to completely change technology and tool stacks.
YJ: … This was one of my PhD projects … the field lacks the software engineering rigour, such as source control, testing and documentation, to provide more stability.

There was some discussion on hardware becoming more specialised, and whether this is likely to affect the software stack.

JH: Comments that he's heard these step-change descriptions of hardware for 25+ years
… suggests that it leads developers to reaching for more and larger machines, rather than plan to optimise to see what they can run on their laptop
… ponders what you can manage in JavaScript, as the lowest common denominator that runs everywhere
… It’s all about simple, fast, cheap … and not enough people are working on that problem.
CL: I’m more worried about over-fitting on software, rather than hardware. He feels the hardware is moving faster than the software. He suggests that approaching the end of Moore’s Law leads to greater specialisation of hardware to current software patterns.
… Suggests that the focus on matmul may be missing some generalisations to support other important operations

Discussion between YJ and CL, querying whether there is much hardware focus on ML techniques rather than DL (such as decision trees).

YJ suggests that the slow hardware development cycle may be part of that … CL suggests that in that case you should look to mobile, as they have a “replace every 2 years” type of cycle.
JH refers to Intel's analogue CPU release (and failure) in the 1990s, and wonders whether optical CPUs, approximate CPUs, etc. might be more interesting developments than lower-precision operations to boost throughput.

NG: Says that at some point we will need to focus on smaller power and other requirements, rather than stacking more cores up to run your models.

Mention of AutoML paradigm … Soumith wonders if that combined with a compute spike would change the way people approach that.
YJ: The way we build and train models becomes data.
CL: Ponders whether AutoML is the way. Is DL going to be taken over by the machines?
… He suspects that with an explosion in compute this becomes highly plausible in the long run, but that the current level is far from putting programmers out of work.
NG: If you can get the DL tooling to revise itself …

JH: 25 years ago you could buy a product to develop a model for you, and we don’t use that now.
… AutoML is completely the wrong route, as augmenting and advising in a smart way is preferable to brute-force
… He suspects that the models may be computed with brute force, but that hyper-parameter tuning is more skilled and intelligent, and can lead to better outcomes than brute force by finding variations outside the sample space.
