Writing Code for Natural Language Processing Research #EMNLP2018 #NLProc
“Code is available upon request”, “the authors promise…”, a repo that only contains model.py… we all know the story, no need to explain further.
On the first day of #EMNLP2018, Joel Grus, Matt, and Mark from the Allen Institute for AI presented what was arguably the tutorial that attracted the most interest of all.
I like this type of tutorial, which discusses emerging issues in the research community, much more than a catwalk of SOTA models on a specific task. A similar tutorial was given at IJCNLP 2017 about how to make a good poster and how to give an understandable presentation.
In this tutorial, they summarize their best practices and lessons learned from writing code for NLP research and from developing the AllenNLP toolkit.
Disclaimer: I am trying to quote Joel, Matt, and Mark as much as possible. There are a few points where I may not be entirely accurate (sorry for that). Most importantly, there are a few other points I don’t entirely agree with, yet I see their point, especially the first part of the tutorial’s take on code duplication when prototyping. But for now I will leave my opinions aside.
The tutorial starts by describing the two main purposes of writing code in NLP research.
Tutorial Part 1: Prototyping
The primary goals of prototyping are:
1) Writing code quickly.
2) Running experiments with minimal errors.
3) Debugging models’ behavior.
1) Writing code quickly
- In this phase of the project your ideas change fast, so the goal is to get something running quickly and gain a better understanding of the problem.
- Get a baseline running: start from a reusable modular library (AllenNLP, Fairseq, Sockeye, Tensor2Tensor…) or from an implemented paper whose code is easy to read and run (good luck finding that…).
- Sometimes, however, you need to start from scratch if nothing available fits your needs.
- In this phase, don’t over-engineer your code and DON’T try to reduce code duplication: copy your code first and refactor later. This guarantees you will have something running quickly.
This still obliges you to write readable code that you can understand and maintain. Here are some useful tips for that:
- Put <<<SHAPE COMMENTS>>> on tensors for easy debugging later
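As an illustration, a minimal sketch of the shape-comment convention (using NumPy arrays as stand-ins for framework tensors; all variable names here are invented):

```python
# Sketch of "shape comments": annotate every tensor with its shape
# so mismatches are easy to spot when debugging later.
import numpy as np

batch_size, seq_len, hidden_dim = 4, 10, 8

# (batch_size, seq_len, hidden_dim)
encoder_outputs = np.random.randn(batch_size, seq_len, hidden_dim)

# (batch_size, hidden_dim) - mean-pool over the sequence dimension
pooled = encoder_outputs.mean(axis=1)

# (batch_size,) - one score per example
scores = pooled.sum(axis=1)
```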
- Write code for people, not machines: write long comments describing the non-obvious logic.
- Do minimal testing (but not no testing): “If you write tests that check experimental behavior, this is a waste of time, because that behavior is subject to change later.”
- However, write tests for data-preparation code. It will be used to preprocess data and generate batches, and it is independent of model adjustments, so it is worth checking that it works correctly (e.g., write tests to make sure that the code for reading and evaluating the SQuAD dataset works correctly).
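A hedged sketch of what such a data-prep test could look like (the `batch_instances` function and its checks are purely illustrative, not from the tutorial):

```python
# Illustrative data-prep code plus a unit test for it.
def batch_instances(instances, batch_size):
    """Group a list of instances into fixed-size batches."""
    return [instances[i:i + batch_size]
            for i in range(0, len(instances), batch_size)]

def test_batching():
    instances = list(range(10))
    batches = batch_instances(instances, batch_size=4)
    # 10 instances with batch size 4 -> 3 batches
    assert len(batches) == 3
    assert all(len(b) <= 4 for b in batches)
    # No instance is dropped or duplicated
    assert sum(batches, []) == instances

test_batching()
```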
Hard-code only the parts you are not focusing on.
- This makes controlled experiments much easier (for you, and for people reusing your code later (if any…)).
2) Running experiments:
Keep track of what you ran while doing experiments
- The easiest way to do that is to put your experiments’ results in a spreadsheet.
Note which version of the code was used for which experiment, using version control.
Controlled experiments: test only one thing at a time (do ablation tests)
- Don’t run experiments with many moving parts.
- Change one thing at a time while keeping everything else constant. This is important for controlling experiments and showing exactly what caused the performance improvements.
How to write controlled experiments:
- Make everything controllable a parameter of the model.
- Load those parameters from a configuration file or a run script.
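A minimal sketch of this idea, assuming a toy `Model` class (all names and default values here are invented for illustration):

```python
# Every choice is a constructor parameter, and an experiment is just
# a config blob, so changing one thing never requires editing code.
import json

class Model:
    def __init__(self, hidden_dim=100, dropout=0.5, encoder_type="lstm"):
        self.hidden_dim = hidden_dim
        self.dropout = dropout
        self.encoder_type = encoder_type

# In practice this string would live in a config file under version control.
config = json.loads('{"hidden_dim": 200, "dropout": 0.5, "encoder_type": "lstm"}')
model = Model(**config)
```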
3) Analyzing Model Performance
TensorBoard is very useful for this. It is also compatible with almost all DL libraries, not only TensorFlow.
Here is a list of useful metrics to visualize:
1) Loss and accuracy
2) Gradients: mean, std, actual update values
3) Parameters: mean, std
4) Activations: log problematic activations
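As a rough illustration of the gradient/parameter statistics worth logging (a NumPy stand-in, not TensorBoard's actual API; the `summarize` helper is made up):

```python
# Compute the per-step mean/std summaries you would send to TensorBoard.
import numpy as np

def summarize(name, values):
    # Log mean and std for gradients / parameters / updates.
    return {f"{name}/mean": float(values.mean()),
            f"{name}/std": float(values.std())}

grads = np.array([0.1, -0.2, 0.05])
stats = summarize("gradients", grads)
```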
Look at your data:
- Don’t just add print statements at the end of training.
- Instead, save model checkpoints and write a script that, given a checkpoint, runs queries against your model.
- It is even better to put that in a web demo: this makes it a lot easier to debug the model and interact with it visually. Moreover, you can show some of the model’s internals (e.g., the attention matrix) for each example in your web demo.
Build your data processing so that you read from a file, but your models are also able to run without labels, i.e. the model doesn’t crash if it cannot compute a loss. That way the same code runs for training and for the demo.
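A minimal sketch of a forward pass that only computes a loss when labels are present (a plain-Python stand-in; the `forward` signature and internals are illustrative only):

```python
# The same forward() serves training (labels given -> loss computed)
# and a demo (no labels -> no loss, no crash).
def forward(inputs, labels=None):
    logits = [len(x) for x in inputs]   # stand-in for real model logic
    output = {"logits": logits}
    if labels is not None:
        # Toy squared-error "loss" just to show the conditional branch.
        output["loss"] = sum((l - y) ** 2 for l, y in zip(logits, labels))
    return output
```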
Tutorial Part 2: Developing Good Processes:
The second part of the tutorial was about developing good processes so that your experiments are re-runnable everywhere.
- Use source control: “I hope you do that already.”
- Code reviews:
* They find bugs for you.
* They force you to make your code readable.
* Writing clear code allows the review to be a discussion of the model itself, not the code.
* They prevent publishing code and only later finding bugs in it that make your results incorrect. This happens, by the way, and can force you to retract one of your accepted papers; check this out.
- Continuous integration and build automation:
* Continuous integration: always be merging into a shared branch.
* Build automation: always be running tests with each merge.
Testing your code (revisiting testing):
- Unit testing is an automated check that part of your code works correctly.
- If you are prototyping what should you write tests for?
* Test the basics: test the forward pass and things that don’t rely directly on experimental results.
* e.g., assert that the batch has the size you expect.
* e.g., assert that all the words in the batch are in the vocabulary.
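These basic checks might look something like this (an invented example, not the tutorial's code):

```python
# "Test the basics": check shapes and vocabulary membership,
# not experimental scores that will change as the model evolves.
def test_batch_basics():
    vocab = {"the", "cat", "sat"}
    batch = [["the", "cat"], ["cat", "sat"]]
    expected_batch_size = 2
    assert len(batch) == expected_batch_size
    assert all(token in vocab for sequence in batch for token in sequence)

test_batch_basics()
```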
- If you are making a reusable library what should you write tests for?
* that models can train, save, and load
* that backprop computes gradients
- Test fixtures: running tests on large datasets every time you merge is big and slow. Instead, keep a tiny amount of data in the repo and run the tests on that.
- Use your knowledge to write clever tests. Attention is hard to test because it depends on learned parameters, but you can hack the test: if you set all the attention weights to be equal, the output should equal the average of the input vectors.
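A sketch of that trick with NumPy (the `attend` function is a simplified stand-in for a real attention module):

```python
# Clever attention test: with uniform weights, the attended output
# must equal the plain average of the input vectors.
import numpy as np

def attend(values, weights):
    # values: (seq_len, dim), weights: (seq_len,)
    return weights @ values

seq_len, dim = 5, 3
values = np.random.randn(seq_len, dim)
uniform = np.full(seq_len, 1.0 / seq_len)

assert np.allclose(attend(values, uniform), values.mean(axis=0))
```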
Tutorial Part 3: Writing code for Reusable Components
The third part of the talk is about writing code for reusable components, i.e. the code that you and your colleagues will be reusing a lot (e.g., your lab’s MT library). Joel lists his best practices from developing the AllenNLP library.
Some abstractions (an abstraction here simply means a generic class for a family of components) have proven useful, but some haven’t.
So you don’t have to abstract everything; rather, find the right amount of abstraction to balance reusability against the time spent coding abstractions. Abstraction pays off for components that:
* will be reused a lot: e.g., training a model, mapping words to ids, summarizing a sequence into a single tensor;
* have many variations: turning a character or a word into a tensor, turning a tensor into a sequence of tensors, summarizing a sequence of tensors into a single tensor (attention, averaging, concatenation, summing, etc.);
* reflect our higher-level thinking (text, tags, labels, …), which allows you to abstract your model as much as possible.
Best practices from AllenNLP
This part just lists some examples of abstractions from the AllenNLP library.
I’ll skip through them quickly, but check the slides for more details:
- Models are extensions of torch.nn.Module, reusing the same module abstraction as PyTorch.
* Vocabulary: all_tokens, tokens2id, id2token, etc.
* Instances: the instances in the dataset, used to create the vocabulary
* Instances contain Fields (source text, target text, etc.)
and some more abstractions…
* DataIterator: the basic iterator; shuffles batches
* BucketIterator: groups instances with similar lengths
* Two different abstractions for RNNs (Seq2Seq, Seq2Vec)
Most AllenNLP objects can be instantiated from Jsonnet blobs:
* This allows specifying experiments using JSON.
* It also allows changing the architecture without changing any code.
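A toy sketch of the idea (this is not AllenNLP's actual configuration machinery; the registry and config keys are invented):

```python
# Instantiate components from a JSON config: swapping the encoder
# means editing the config, not the code.
import json

ENCODERS = {
    "bow": lambda dim: ("bow", dim),    # stand-ins for real encoder classes
    "lstm": lambda dim: ("lstm", dim),
}

config = json.loads('{"encoder": {"type": "lstm", "dim": 64}}')
encoder_config = config["encoder"]
encoder = ENCODERS[encoder_config["type"]](encoder_config["dim"])
```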
Each training run creates a model archive, which is:
config.json + vocab + trained model weights
This can be used to evaluate on a test set or to build a demo.
To serve your model in a demo, it needs to accept JSON and output JSON.
`Predictors` are a simple JSON wrapper around your model, designed for exactly that.
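A toy JSON-in/JSON-out wrapper in the spirit of a Predictor (the names and structure here are invented, not AllenNLP's real Predictor API):

```python
# Wrap any model function so it consumes and produces JSON strings.
import json

def predict_json(model_fn, json_line):
    inputs = json.loads(json_line)                 # JSON in
    prediction = model_fn(inputs["sentence"])
    return json.dumps({"prediction": prediction})  # JSON out

# Toy "model": count the tokens in the sentence.
result = predict_json(lambda s: len(s.split()), '{"sentence": "the cat sat"}')
```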
AllenNLP still has lots of things that haven’t been figured out yet:
* regularization / initialization
* models with pretrained components
* complex training loops
* caching preprocessed data
* expanding the vocabulary
Joel ends his part of the tutorial by demonstrating a use case that shows the differences between building models from scratch in NumPy (don’t do that), in PyTorch, and in AllenNLP.
Tutorial Part 4: How to share your code
Use Docker containers:
Don’t be intimidated by Docker; it’s fairly simple:
1) Write a Dockerfile.
2) The Dockerfile is used to build a Docker image.
3) Run the Docker image in a container.
There’s a quick Docker tutorial in the slides with all the important commands you will probably need.
Releasing your code:
When releasing your code, make sure that people have the right data to re-run it; be specific (“there are 27 CoreNLP jar files on the website”). People can easily download the wrong thing.
Use a file cache, so that data files are downloaded once and reused across runs.
Use a Python environment: rely on virtualenv or Anaconda and create a new virtual environment for each project. Export it to a requirements.txt file, which can be installed by anybody who reuses your code.
The tutorial ends here..
Hopefully this tutorial will have an impact on the amount of reusable code we see at the next NLP conferences.
Thanks Joel, Mark and Matt for the cool tutorial and the large amounts of GIFs in the slides.
Oh, speaking of slides, here you are…