You need to be careful about what you print in Jupyter Notebooks
--
Today, I was working on a dataset in the Jupyter Notebook application.
I was trying to convert the dataset from one format to another, and the dataset was relatively large: about 2 gigabytes of text.
While parsing, I carelessly printed the parsed data values, which meant printing a significant part of those 2 gigabytes of text.
The result was a notebook file of about 200 megabytes. I guess that, since it held millions of lines of text, the browser was no longer able to open the notebook. That meant I could not even copy and paste my code to recover my work.
I didn’t want that. I wanted my work back, so I tried a few things.
Things I Tried
1- Cell -> All Output -> Clear
I tried to clear all cell outputs, and then save the file.
It did not work, even though I managed to press “Clear” and then hit Ctrl+S before the webpage died. I tried this multiple times, with no luck.
2- Deleting Lines Manually
I knew that Atom handled large text files better than the default text editor, so I tried editing the .ipynb just as we would edit a .txt file, to delete the unnecessary parts.
Atom crashed whenever I scrolled more than a few thousand lines into the .ipynb file. I did not know how many lines the file contained, but I checked the difference in file size.
Deleting a few thousand lines reduced the 200-megabyte file by only 0.5 megabytes. So Atom was not a good solution, since it crashed once it had loaded more than about 0.5 megabytes of lines.
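To find out how many lines such a file holds without opening it in an editor, the file can be streamed from Python. This is only a minimal sketch: the function name `file_stats` is made up for illustration, and it assumes the notebook is an ordinary file on disk.

```python
import os


def file_stats(path):
    """Return (size_in_megabytes, line_count) without loading the whole file."""
    size_mb = os.path.getsize(path) / 1_000_000
    line_count = 0
    # Binary mode plus lazy iteration keeps only one line in memory at a time.
    with open(path, "rb") as f:
        for _ in f:
            line_count += 1
    return size_mb, line_count
```

Calling `file_stats("my_notebook.ipynb")` (a placeholder file name) gives a rough idea of how far a few thousand deleted lines would go.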
3- Deleting Lines with Python
This could probably have been done much better, but I was confident that this approach would solve my problem, so I tried the following:
- First, I created a backup of the corrupted file, in case I did anything wrong.
- I read the .ipynb file from Python, then wrote only a few lines back.
- To be clear, I did not open the .ipynb file in Jupyter. I opened it in Python as a readable and writable file in order to modify it, just as you would open a .txt file in Python.
- I wrote the first and last lines back to the file. The millions of lines in between, I did not write back.
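The steps above could be sketched roughly as follows. This is a reconstruction under assumptions, not the exact script: the function name `keep_head_and_tail`, the backup path, and the default of 2000 lines kept at each end are all made up for illustration, and the file is assumed to be UTF-8 text that fits in memory.

```python
import shutil


def keep_head_and_tail(path, backup_path, head=2000, tail=2000):
    """Back up `path`, then rewrite it with only its first `head` and last `tail` lines.

    Returns (original_line_count, kept_line_count).
    """
    # Take the backup first, in case anything goes wrong.
    shutil.copy2(path, backup_path)

    # Treat the notebook as plain text, exactly like a .txt file.
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    if len(lines) <= head + tail:
        return len(lines), len(lines)  # Small enough already; leave it alone.

    # Keep both ends and drop everything in between.
    kept = lines[:head] + lines[len(lines) - tail:]
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(kept)
    return len(lines), len(kept)
```

The trimmed file is generally no longer a valid notebook, because a chunk of its JSON is missing. It is only meant to be small enough to open in a text editor so the code can be copied out.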
This let me open the .ipynb in Atom again, since the file size was reduced to 2 megabytes.
There, I found the relevant cells containing the code, copied the code, and my work was recovered.
This is a basic issue, but it seemed interesting to me. It turns out you need to be careful about what you print in Jupyter Notebooks.
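One way to stay careful is to print only a bounded sample of the data rather than everything. The helper below is a hypothetical sketch (the name `preview` and its default limits are assumptions, not something from my original notebook): it prints at most a handful of items and truncates long ones.

```python
from itertools import islice


def preview(records, limit=5, max_chars=200):
    """Print at most `limit` items, each cut to `max_chars` characters.

    Returns the number of items printed.
    """
    shown = 0
    # islice stops after `limit` items, so the rest of the data is never printed.
    for record in islice(records, limit):
        text = str(record)
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        print(text)
        shown += 1
    return shown
```

With something like `preview(parsed_values)` in place of a bare `print` inside the parsing loop, the cell output stays at a few lines no matter how large the dataset is.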
Note: Modifying an .ipynb as you would modify a .txt file corrupts the .ipynb if you don’t do it carefully. So back up your notebook before you edit it as plain text, if you decide to do so.
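A gentler alternative, which I did not try at the time, is to treat the notebook as what it is: a JSON document. The sketch below assumes the nbformat 4 layout (a top-level "cells" list in which code cells carry "outputs" and "execution_count") and assumes the whole file fits in memory. The function name `strip_outputs` is made up for illustration.

```python
import json
import shutil


def strip_outputs(path, backup_path):
    """Back up the notebook, then empty the outputs of every code cell.

    Assumes the nbformat 4 JSON layout. Returns how many cells had output.
    """
    shutil.copy2(path, backup_path)

    with open(path, "r", encoding="utf-8") as f:
        notebook = json.load(f)

    cleared = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            if cell.get("outputs"):
                cleared += 1
            cell["outputs"] = []  # Drop the printed text.
            cell["execution_count"] = None  # Mark the cell as not yet run.

    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1, ensure_ascii=False)
        f.write("\n")
    return cleared
```

Because only the output fields are emptied and the rest of the structure is written back as JSON, the code in the cells is left untouched and the file should remain openable in Jupyter.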