Addendum: Evaluation of My Model

As a mercifully short addendum, I’d like to quickly address a few questions about my model. Please read my update post for my current beliefs on this situation; I believe the details of exactly how powerful my model is aren’t actually very important to the overall picture.

As described in my technical post, my model is not identical to OpenAI’s, because I simply didn’t have all the details of what they did. The truth is also that the samples and metrics I have shown aren’t fully accurate. For one, my metric code is flawed: I made several rookie mistakes in setting up the evaluation (I let train and eval data mix, used metrics whose math I didn’t fully understand, etc.), and the model I used to generate the samples is in fact not the final trained model, but a checkpoint from roughly halfway through training. I didn’t take the time to properly evaluate the strength of my model; I simply saw that I had the same amount of hardware as OpenAI and code as close to the paper as possible, and went with it. The reason for this is a simple human flaw: I got cold feet once I realized what I was sitting on, and acted rashly. I made a mistake, I did something stupid, that’s all there is to it.
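To make the train/eval mixing mistake concrete: the standard guard against it is to split the corpus at the document level before any tokenization or shuffling, so no document contributes text to both sets. This is a minimal illustrative sketch, not my actual pipeline; the function name and parameters are made up for the example.

```python
import random

def split_documents(documents, eval_fraction=0.05, seed=0):
    """Split a corpus at the document level, so that no single
    document contributes tokens to both the train and eval sets.
    Returns (train_docs, eval_docs)."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # fixed seed for reproducibility
    n_eval = max(1, int(len(docs) * eval_fraction))
    return docs[n_eval:], docs[:n_eval]

train_docs, eval_docs = split_documents([f"doc-{i}" for i in range(100)])
assert not set(train_docs) & set(eval_docs)  # no document appears in both
```

Splitting after chunking or shuffling tokens instead (what my setup effectively did) lets near-duplicate text leak into the eval set and makes the eval numbers look better than they are.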

Thanks to help from OpenAI, it is now safe to say that my model is not as powerful as theirs. The metric results for WikiText2, LAMBADA and PTB are (lower is better):

GPT2: 18.67 / 8.63 / 36.51
Mine: 43.79 / 109.47 / 202.29
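Lower-is-better language modeling scores of this kind are conventionally perplexities: the exponential of the mean per-token negative log-likelihood. A minimal sketch of the relationship (the token count and NLL total below are invented purely for illustration):

```python
import math

def perplexity(total_nll, total_tokens):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats).
    Lower is better; a model that assigns probability 1 to every
    token would score exactly 1.0."""
    return math.exp(total_nll / total_tokens)

# Illustrative numbers: 2000 tokens with a summed NLL of 5857 nats
# gives a perplexity of about 18.7.
print(round(perplexity(5857.0, 2000), 2))
```

So the gap between 18.67 and 43.79 is not a factor-of-two quality gap in any linear sense; it reflects a substantially higher average per-token loss.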

Although I used the same amount of hardware (or more), the differences in my training setup and hyperparameters made a significant difference, an unfortunate reality familiar to anyone who has tried to reproduce a deep learning paper. I don’t think my model in its current state is even as dangerous as the 117M model in its text generation abilities. But I believe I have found the quirks in my setup that held the model back, and they are easy to fix. I am very tempted to keep tinkering with the model to see if I can improve it…but I will be holding back for now.

I think that even if the model isn’t perfect, it might be a useful “shortcut” to a more powerful model. In other words, I think someone could save a good chunk of compute when training a more powerful 1.5B model by starting from my weights rather than from scratch. Even if my model isn’t very useful for generating text itself, it may still be useful for creating a model that is.
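The “shortcut” idea is just warm-starting: initialize a new training run from existing weights wherever the parameter names and shapes line up, and fall back to fresh initialization everywhere else. This toy sketch stands tensors in with (shape, data) pairs; the parameter names are illustrative, not from any real checkpoint format.

```python
def warm_start(fresh, checkpoint):
    """Copy checkpoint tensors into a freshly initialized parameter
    dict wherever the name exists and the shapes agree; all other
    parameters keep their fresh initialization. Tensors are stood in
    by (shape, data) pairs for the sake of the example."""
    restored = []
    for name, (shape, data) in checkpoint.items():
        if name in fresh and fresh[name][0] == shape:
            fresh[name] = (shape, data)
            restored.append(name)
    return restored

fresh = {"wte": ((50257, 1600), "random"), "head": ((1600, 50257), "random")}
ckpt = {"wte": ((50257, 1600), "trained"), "extra": ((10,), "trained")}
print(warm_start(fresh, ckpt))  # → ['wte']
```

Even a mediocre checkpoint can sit much closer to a good optimum than a random initialization does, which is why releasing weak weights can still shave compute off someone else’s stronger run.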

So while its release definitely wouldn’t have the same impact as the original 1.5B, I have reason to suspect it would still have a non-zero effect, and releasing it would undermine my overall message either way.

I think the power of my model doesn’t actually affect my main (updated) arguments, though. Please read the main update post.