Deep Learning 2: Part 2 Lesson 13

https://medium.com/@hortonhearsafoo/adding-a-cutting-edge-deep-learning-training-technique-to-the-fast-ai-library-2cd1dba90a49

TrainPhase [2:01]

Notebook

phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-2),
          TrainingPhase(epochs=2, opt_fn=optim.SGD, lr=1e-3)]
learn.fit_opt_sched(phases)

phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-2),
          TrainingPhase(epochs=1, opt_fn=optim.SGD,
                        lr=(1e-2,1e-3), lr_decay=DecayType.LINEAR),
          TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-3)]

Linear decay interpolates from start_lr to end_lr over the n iterations of the phase:

lr_i = start_lr + (end_lr - start_lr) * i/n

phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-2),
          TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=(1e-2,1e-3),
                        lr_decay=DecayType.COSINE),
          TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-3)]

Cosine annealing:

lr_i = end_lr + (start_lr - end_lr)/2 * (1 + np.cos(i * np.pi/n))

Exponential decay:

lr_i = start_lr * (end_lr/start_lr)**(i/n)

Polynomial decay (with exponent p):

lr_i = end_lr + (start_lr - end_lr) * (1 - i/n) ** p

A phase can also take a single lr together with a decay type:

phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-2),
          TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-2,
                        lr_decay=DecayType.COSINE),
          TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-3)]

SGDR [7:43]

So the cool thing is, now we can replicate all of our existing schedules using nothing but these training phases. So here is a function called phases_sgdr which does SGDR using the new training phase API.

def phases_sgdr(lr, opt_fn, num_cycle, cycle_len, cycle_mult):
    phases = [TrainingPhase(epochs=cycle_len/20, opt_fn=opt_fn,
                            lr=lr/100),
              TrainingPhase(epochs=cycle_len*19/20, opt_fn=opt_fn,
                            lr=lr, lr_decay=DecayType.COSINE)]
    for i in range(1, num_cycle):
        phases.append(TrainingPhase(epochs=cycle_len*(cycle_mult**i),
                                    opt_fn=opt_fn, lr=lr,
                                    lr_decay=DecayType.COSINE))
    return phases

1cycle [8:20]

We can now implement the new 1cycle schedule with, again, a single little function.

def phases_1cycle(cycle_len, lr, div, pct, max_mom, min_mom):
    tri_cyc = (1 - pct/100) * cycle_len
    return [TrainingPhase(epochs=tri_cyc/2, opt_fn=optim.SGD,
                          lr=(lr/div,lr), lr_decay=DecayType.LINEAR,
                          momentum=(max_mom,min_mom),
                          momentum_decay=DecayType.LINEAR),
            TrainingPhase(epochs=tri_cyc/2, opt_fn=optim.SGD,
                          lr=(lr,lr/div), lr_decay=DecayType.LINEAR,
                          momentum=(min_mom,max_mom),
                          momentum_decay=DecayType.LINEAR),
            TrainingPhase(epochs=cycle_len-tri_cyc, opt_fn=optim.SGD,
                          lr=(lr/div,lr/(100*div)),
                          lr_decay=DecayType.LINEAR,
                          momentum=max_mom)]

Discriminative learning rates + 1cycle [8:53]

Something I haven’t tried yet, but which I think would be really interesting, is combining discriminative learning rates and 1cycle; as far as I know, no one has tried it. The only paper I’ve come across that uses something like a discriminative learning rate is LARS. It was used to train ImageNet with very, very large batch sizes by looking at the ratio between the norm of the weights and the norm of the gradients at each layer, and using that to set the learning rate of each layer automatically. They found that they could use much larger batch sizes. That’s the only other place I’ve seen this kind of approach used, but there are lots of interesting things you could try by combining discriminative learning rates with different interesting schedules.
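As a rough illustration of the LARS idea (this is a simplified sketch, not the fastai API or the exact LARS update; the trust_coef value and helper name are made up for the example):

import torch

def lars_style_lrs(model, base_lr=0.1, trust_coef=0.001, eps=1e-9):
    # For each parameter, scale the base learning rate by the ratio of the
    # weight norm to the gradient norm (a per-layer "local" learning rate).
    # Assumes .grad has already been populated by loss.backward().
    lrs = {}
    for name, p in model.named_parameters():
        if p.grad is None: continue
        w_norm, g_norm = p.data.norm(), p.grad.data.norm()
        lrs[name] = base_lr * trust_coef * w_norm / (g_norm + eps)
    return lrs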

Your own LR finder [10:06]

You can now write your own LR finders of different types, specifically because there is now this stop_div parameter, which basically means that it’ll use whatever schedule you asked for, but when the loss gets too bad, it’ll stop training.
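For example, a minimal sketch of an exponential LR-finder-style phase using this API; this assumes DecayType.EXPONENTIAL exists alongside LINEAR and COSINE (the exponential formula above suggests it does) and that stop_div is passed to fit_opt_sched, so check the notebook for the exact signature:

# Sweep the learning rate from very small to very large over a short phase;
# training stops as soon as the loss blows up, thanks to stop_div.
phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=(1e-5, 10),
                        lr_decay=DecayType.EXPONENTIAL)]
learn.fit_opt_sched(phases, stop_div=True)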

Changing data [11:49]

Then the bit I find most interesting is that you can change your data. Why would we want to change our data? Because, as you remember from lessons 1 and 2, you could use small images at the start and bigger images later. The theory is that you could use that to train the first bit more quickly with smaller images, and remember, if you halve the height and halve the width, you’ve got a quarter of the activations at every layer, so it can be a lot faster. It might even generalize better. So you can now create a couple of different sizes; for example, the notebook has 28 and 32 pixel images. This is CIFAR10, so there’s only so much you can do. Then if you pass in an array of data in this data_list parameter when you call fit_opt_sched, it’ll use a different dataset for each phase.

data1 = get_data(28,batch_size)
data2 = get_data(32,batch_size)
learn = ConvLearner.from_model_data(ShallowConvNet(), data1)
phases = [TrainingPhase(epochs=1, opt_fn=optim.Adam, lr=1e-2,
                        lr_decay=DecayType.COSINE),
          TrainingPhase(epochs=2, opt_fn=optim.Adam, lr=1e-2,
                        lr_decay=DecayType.COSINE)]
learn.fit_opt_sched(phases, data_list=[data1,data2])
On the DAWNBench ImageNet benchmark, the fast.ai entries ended up with:
  • the fastest GPU result
  • the fastest single machine result
  • the fastest publicly available infrastructure result

CIFAR10 result [15:15]

Our CIFAR10 results are also now up there officially, and you might remember the previous best was a bit over an hour. The trick here was using 1cycle, so all of this stuff in Sylvain’s training phase API is really the stuff we used to get these top results. Another fast.ai student who goes by the name bkj has taken that and done his own version: he took a ResNet18, added on top the concat pooling you might remember we learnt about, and used Leslie Smith’s 1cycle, and so he’s got on the leaderboard. So all of the top 3 are fast.ai students, which is wonderful.

CIFAR10 cost result [16:05]

Same for cost: the top 3 again, and you can see Paperspace there. Brett ran this on Paperspace and got the cheapest result, just ahead of bkj.

1x1 convolution [18:23]

A 1x1 conv is simply saying: for each grid cell in your input, you’ve got basically a vector. A 1 by 1 by number-of-filters tensor is basically a vector, and for each grid cell in your input, you’re just doing a dot product with that tensor. Of course, there’s going to be one of those vectors for each of the 192 activations we are creating, so basically you do 192 dot products with grid cell (1, 1), then 192 with grid cell (1, 2) or (1, 3), and so forth. You will end up with something which has the same grid size as the input and 192 channels in the output. That’s a really good way to either reduce or increase the dimensionality of an input without changing the grid size, and that’s normally what we use 1x1 convs for.

Here, we have a 1x1 conv and another 1x1 conv, and then they are added together. Then there is a third path, and this third path is not added. It is not explicitly mentioned, but this third path is concatenated. There is a form of ResNet which is basically identical to ResNet, except we don’t do plus, we do concat. That’s called a DenseNet: it’s just a ResNet where we do concat instead of plus. That’s an interesting approach because then the identity path is literally being copied, so you get that flow all the way through. As we’ll see next week, that tends to be good for segmentation and things like that, where you really want to keep the original pixels, the first layer of pixels, and the second layer of pixels untouched.
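To make the shapes concrete, here is a minimal PyTorch sketch (the layer sizes are made up for illustration, not taken from the lecture’s network): a 1x1 conv changes the number of channels while keeping the grid size, and the difference between a ResNet-style and a DenseNet-style block is just add versus concat.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)            # batch x channels x height x width

# 1x1 conv: per-grid-cell dot products, so the 28x28 grid is unchanged
# while the channel dimension goes from 64 to 192.
conv1x1 = nn.Conv2d(64, 192, kernel_size=1)
print(conv1x1(x).shape)                    # torch.Size([1, 192, 28, 28])

# ResNet-style: add the identity path to the conv path (shapes must match).
conv3x3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
res_out = x + conv3x3(x)                   # still [1, 64, 28, 28]

# DenseNet-style: concatenate instead of add, so channels accumulate.
dense_out = torch.cat([x, conv3x3(x)], dim=1)   # [1, 128, 28, 28]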

Ethics in AI [35:31]

This is the bit where we talk about what’s most important, which is: now that we can do all this stuff, what should we be doing, and how do we think about that? The TL;DR version is I actually don’t know. Recently, a lot of you saw Matthew and Ines, the founders of Explosion AI (the folks behind spaCy and Prodigy), give a talk, and I went to dinner with them afterwards. We basically spent the entire evening talking, debating, arguing about what it means that companies like ours are building tools that democratize access to technology that can be used in harmful ways. They are incredibly thoughtful people and, I wouldn’t say we didn’t agree, we just couldn’t come to a conclusion ourselves. So I’m just going to lay out some of the questions and point to some of the research, and when I say research, most of the actual literature review and putting this together was done by Rachel, so thanks Rachel.

Unintended consequences [45:04]

Runaway feedback loops [46:10]

Bias in AI [48:09]

Responsibility in hiring [52:46]

IBM & “Death’s Calculator” [54:08]

Style Transfer [1:01:28]

Notebook

https://arxiv.org/abs/1508.06576
%matplotlib inline
%reload_ext autoreload
%autoreload 2
from fastai.conv_learner import *
from pathlib import Path
from scipy import ndimage
torch.cuda.set_device(3)

torch.backends.cudnn.benchmark=True
PATH = Path('data/imagenet')
PATH_TRN = PATH/'train'
m_vgg = to_gpu(vgg16(True)).eval()
set_trainable(m_vgg, False)
img_fn = PATH_TRN/'n01558993'/'n01558993_9684.JPEG'
img = open_image(img_fn)
plt.imshow(img);
sz=288
trn_tfms,val_tfms = tfms_from_model(vgg16, sz)
img_tfm = val_tfms(img)
img_tfm.shape
(3, 288, 288)
opt_img = np.random.uniform(0, 1, size=img.shape).astype(np.float32)
plt.imshow(opt_img);
opt_img = scipy.ndimage.filters.median_filter(opt_img, [8,8,1])
plt.imshow(opt_img);
opt_img = val_tfms(opt_img)/2
opt_img_v = V(opt_img[None], requires_grad=True)
opt_img_v.shape
torch.Size([1, 3, 288, 288])
m_vgg = nn.Sequential(*children(m_vgg)[:37])
  • We’ve taken our bird image
  • Turned it into a variable
  • Stuck it through our model to grab the 37th layer activations which is our target. We want our content loss to be this set of activations.
  • We are going to create an optimizer (we will go back to the details of this in a moment)
  • We are going to step a bunch of times
  • Zero the gradients
  • Call some loss function
  • Call loss.backward()
targ_t = m_vgg(VV(img_tfm[None]))
targ_v = V(targ_t)
targ_t.shape
torch.Size([1, 512, 18, 18])

max_iter = 1000
show_iter = 100
optimizer = optim.LBFGS([opt_img_v], lr=0.5)

Broyden–Fletcher–Goldfarb–Shanno (BFGS) [1:20:18]

A couple of new details here. One is a weird optimizer, optim.LBFGS. Anybody who’s done certain parts of math and computer science courses comes into deep learning, discovers we use all this stuff like Adam and SGD, assumes that nobody in the field knows the first thing about computer science, and immediately asks “have any of you guys tried using BFGS?” There’s a long history of a totally different kind of optimization algorithm that we don’t use to train neural networks. And of course the answer is that the people who have spent decades studying neural networks do know a thing or two about computer science, and it turns out these techniques on the whole don’t work very well. But it’s actually going to work well for this, and it’s a good opportunity to talk about an interesting algorithm for those of you that haven’t studied this type of optimization at school.

BFGS (the initials of its four inventors), where the L stands for limited memory, is an optimizer. That means there’s some loss function, and it’s going to use gradients (not all optimizers use gradients, but all the ones we use do) to find a direction to go, and try to make the loss function go lower and lower by adjusting some parameters. It’s just an optimizer, but it’s an interesting one because it does a bit more work than the ones we’re used to on each step. Specifically, it starts the same way we’re used to: we just pick somewhere to get started, and in this case we’ve picked a random image, as you saw. As per usual, we calculate the gradient. But then we don’t just take a step; as well as finding the gradient, we also try to find the second derivative, which says how fast the gradient changes.
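To illustrate the idea of using the second derivative, here is a generic 1-D Newton’s-method sketch for intuition only; it is not what optim.LBFGS literally does (L-BFGS builds an approximation of this curvature information from recent gradients rather than computing it exactly):

def newton_step_1d(f_grad, f_hess, x):
    # Scale the step by the curvature: where the gradient changes quickly
    # (large second derivative) take a smaller step, otherwise a bigger one.
    return x - f_grad(x) / f_hess(x)

# Example: minimise f(x) = (x - 3)**2, whose gradient is 2*(x - 3)
# and whose second derivative is the constant 2.
x = 0.0
for _ in range(3):
    x = newton_step_1d(lambda x: 2*(x - 3), lambda x: 2.0, x)
print(x)  # 3.0: a quadratic is minimised in a single Newton step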

def actn_loss(x): return F.mse_loss(m_vgg(x), targ_v)*1000

def step(loss_fn):
    global n_iter
    optimizer.zero_grad()
    loss = loss_fn(opt_img_v)
    loss.backward()
    n_iter += 1
    if n_iter%show_iter==0:
        print(f'Iteration: {n_iter}, loss: {loss.data[0]}')
    return loss
n_iter=0
while n_iter <= max_iter: optimizer.step(partial(step,actn_loss))
Iteration: 100, loss: 0.8466196656227112
Iteration: 200, loss: 0.34066855907440186
Iteration: 300, loss: 0.21001280844211578
Iteration: 400, loss: 0.15562333166599274
Iteration: 500, loss: 0.12673595547676086
Iteration: 600, loss: 0.10863320529460907
Iteration: 700, loss: 0.0966048613190651
Iteration: 800, loss: 0.08812198787927628
Iteration: 900, loss: 0.08170554041862488
Iteration: 1000, loss: 0.07657770067453384
x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0]
plt.figure(figsize=(7,7))
plt.imshow(x);

Forward hook [1:29:42]

This is one of those things that almost nobody knows about, so almost any code you find on the internet that implements style transfer will have all kinds of horrible hacks rather than using forward hooks. But a forward hook is really easy.

class SaveFeatures():
    features=None
    def __init__(self, m):
        self.hook = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, input, output): self.features = output
    def close(self): self.hook.remove()

m_vgg = to_gpu(vgg16(True)).eval()
set_trainable(m_vgg, False)

block_ends = [i-1 for i,o in enumerate(children(m_vgg))
              if isinstance(o,nn.MaxPool2d)]
block_ends
[5, 12, 22, 32, 42]

sf = SaveFeatures(children(m_vgg)[block_ends[3]])

def get_opt():
    opt_img = np.random.uniform(0, 1,
                                size=img.shape).astype(np.float32)
    opt_img = scipy.ndimage.filters.median_filter(opt_img, [8,8,1])
    opt_img_v = V(val_tfms(opt_img/2)[None], requires_grad=True)
    return opt_img_v, optim.LBFGS([opt_img_v])
opt_img_v, optimizer = get_opt()
m_vgg(VV(img_tfm[None]))
targ_v = V(sf.features.clone())
targ_v.shape
torch.Size([1, 512, 36, 36])

def actn_loss2(x):
    m_vgg(x)
    out = V(sf.features)
    return F.mse_loss(out, targ_v)*1000
n_iter=0
while n_iter <= max_iter: optimizer.step(partial(step,actn_loss2))
Iteration: 100, loss: 0.2112911492586136
Iteration: 200, loss: 0.0902421623468399
Iteration: 300, loss: 0.05904778465628624
Iteration: 400, loss: 0.04517251253128052
Iteration: 500, loss: 0.03721420466899872
Iteration: 600, loss: 0.03215853497385979
Iteration: 700, loss: 0.028526008129119873
Iteration: 800, loss: 0.025799645110964775
Iteration: 900, loss: 0.02361033484339714
Iteration: 1000, loss: 0.021835438907146454
x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0]
plt.figure(figsize=(7,7))
plt.imshow(x);
sf.close()

Style match [1:39:29]

The next thing we need to do is to create the style loss. We’ve already got a loss which says how much like the bird it is; now we need one that says how much like this painting’s style it is. And we’re going to do nearly the same thing: we’re going to grab the activations of some layer. Now let’s say it was a 5x5 layer (of course there are no 5x5 layers, it’s 224x224, but we’ll pretend). So here are some activations, and we could get these activations both for the image we are optimizing and for our Van Gogh painting. Let’s look at our Van Gogh painting. There it is: The Starry Night.
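The key trick, implemented by the gram function further down, is that we don’t compare these activations pixel by pixel; each channel is flattened and we compare the matrix of dot products between channels, which throws away the spatial layout and keeps only which features tend to fire together. A tiny sketch of what that computation produces (the sizes here are made up for illustration):

import torch

# Pretend activations: 1 image, 4 channels, on a 5x5 grid.
acts = torch.randn(1, 4, 5, 5)

b, c, h, w = acts.size()
flat = acts.view(b*c, -1)                # each row is one channel flattened to 25 values
g = torch.mm(flat, flat.t()) / acts.numel()   # normalise by element count (the notebook's gram also scales by 1e6)
print(g.shape)                           # torch.Size([4, 4]): no spatial dimensions left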

style_fn = PATH/'style'/'starry_night.jpg'
style_img = open_image(style_fn)
style_img.shape, img.shape
((1198, 1513, 3), (291, 483, 3))
plt.imshow(style_img);

def scale_match(src, targ):
    h,w,_ = src.shape
    sh,sw,_ = targ.shape
    rat = max(h/sh, w/sw)
    res = cv2.resize(targ, (int(sw*rat), int(sh*rat)))
    return res[:h,:w]

style = scale_match(img, style_img)
plt.imshow(style)
style.shape, img.shape
((291, 483, 3), (291, 483, 3))
opt_img_v, optimizer = get_opt()
sfs = [SaveFeatures(children(m_vgg)[idx]) for idx in block_ends]
style_tfm = val_tfms(style_img)
m_vgg(VV(style_tfm[None]))
targ_styles = [V(o.features.clone()) for o in sfs]
[o.shape for o in targ_styles]
[torch.Size([1, 64, 288, 288]),
 torch.Size([1, 128, 144, 144]),
 torch.Size([1, 256, 72, 72]),
 torch.Size([1, 512, 36, 36]),
 torch.Size([1, 512, 18, 18])]

def gram(input):
    b,c,h,w = input.size()
    x = input.view(b*c, -1)
    return torch.mm(x, x.t())/input.numel()*1e6

def gram_mse_loss(input, target):
    return F.mse_loss(gram(input), gram(target))

def style_loss(x):
    m_vgg(opt_img_v)
    outs = [V(o.features) for o in sfs]
    losses = [gram_mse_loss(o, s) for o,s in zip(outs, targ_styles)]
    return sum(losses)
n_iter=0
while n_iter <= max_iter: optimizer.step(partial(step,style_loss))
Iteration: 100, loss: 230718.453125
Iteration: 200, loss: 219493.21875
Iteration: 300, loss: 202618.109375
Iteration: 400, loss: 481.5616760253906
Iteration: 500, loss: 147.41177368164062
Iteration: 600, loss: 80.62625122070312
Iteration: 700, loss: 49.52326965332031
Iteration: 800, loss: 32.36254119873047
Iteration: 900, loss: 21.831811904907227
Iteration: 1000, loss: 15.61091423034668
x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0]
plt.figure(figsize=(7,7))
plt.imshow(x);
for sf in sfs: sf.close()

Style transfer [1:57:08]

Style transfer is just adding the content loss and the style loss together with some weighting. So there is not much new to show.
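One thing to note: the cells that re-register the hooks and recompute the two sets of targets before comb_loss aren’t shown above (comb_loss uses sfs and targ_vs, but the hooks were closed after the style-only experiment). A sketch of what they look like, built only from pieces already defined here; treat the exact cells as an assumption and check the notebook:

# Re-attach a SaveFeatures hook at the end of each block, then run the
# content image and the style image through VGG once each to capture targets.
sfs = [SaveFeatures(children(m_vgg)[idx]) for idx in block_ends]

m_vgg(VV(img_tfm[None]))                       # content image
targ_vs = [V(o.features.clone()) for o in sfs]

m_vgg(VV(style_tfm[None]))                     # style image
targ_styles = [V(o.features.clone()) for o in sfs]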

opt_img_v, optimizer = get_opt()

def comb_loss(x):
    m_vgg(opt_img_v)
    outs = [V(o.features) for o in sfs]
    losses = [gram_mse_loss(o, s) for o,s in zip(outs, targ_styles)]
    cnt_loss = F.mse_loss(outs[3], targ_vs[3])*1000000
    style_loss = sum(losses)
    return cnt_loss + style_loss
n_iter=0
while n_iter <= max_iter: optimizer.step(partial(step,comb_loss))
Iteration: 100, loss: 1802.36767578125
Iteration: 200, loss: 1163.05908203125
Iteration: 300, loss: 961.6024169921875
Iteration: 400, loss: 853.079833984375
Iteration: 500, loss: 784.970458984375
Iteration: 600, loss: 739.18994140625
Iteration: 700, loss: 706.310791015625
Iteration: 800, loss: 681.6689453125
Iteration: 900, loss: 662.4088134765625
Iteration: 1000, loss: 646.329833984375
x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0]
plt.figure(figsize=(9,9))
plt.imshow(x, interpolation='lanczos')
plt.axis('off');
for sf in sfs: sf.close()

The Math [2:06:33]

Actually, I think I might work on the math now, and we’ll talk about multi-GPU and super resolution next week. This is from the paper, and one of the things I really do want you to do after we talk about a paper is to read the paper and then ask questions on the forum about anything that’s not clear. There’s a key part of this paper which I wanted to talk about and discuss how to interpret. The paper says we’re going to be given an input image x, and this little notation normally means it’s a vector, but this one is a matrix; I guess it could mean either. Normally a bold lowercase letter, or a lowercase letter with an arrow on top, means a vector, and a capital letter, or a lowercase letter with two arrows on top, means a matrix. In this case, our image is a matrix, but we are going to basically treat it as a vector, so maybe we’re just getting ahead of ourselves.
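For reference, the losses the code above implements come from the Gatys et al. paper (https://arxiv.org/abs/1508.06576). In the paper's notation, F^l are the activations of the generated image at layer l, P^l those of the content image, A^l the Gram matrix of the style image, N_l the number of filters and M_l the spatial size of layer l; the notebook's constant multipliers (the *1000 and *1e6 factors) are just re-scalings of these:

% Content loss: MSE between activations at one layer (actn_loss / cnt_loss in the code)
\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \tfrac{1}{2} \sum_{i,j} \left(F^{l}_{ij} - P^{l}_{ij}\right)^2

% Gram matrix of layer l: dot products between flattened channels (the gram function)
G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}

% Style loss: MSE between Gram matrices, weighted and summed over layers
E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^{l}_{ij} - A^{l}_{ij}\right)^2, \qquad
\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l} w_l \, E_l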
