Convert Word to HTML using Python with python-mammoth

Today a client asked me to convert a Word document into HTML content that can be easily imported into any web application. The bigger picture: they have a lot of documents in various formats (Word, Excel, PDF) and plan to migrate them to a web-based application. The content needs to be copied and pasted into a rich text (WYSIWYG) editor.


I first turned to Google Docs, which offers a good editor as part of the Google Drive tools. I tried the following how-to:

How to use Google Drive to import a document (Word / Excel) and export it to HTML for web publishing
1. Go to https://drive.google.com
2. Upload your document: it can be a Word or an Excel file
3. Open the document in Google Drive; a new tab opens with the editor for this document (we are still on Google Drive)
4. Click File > Download as > Web page (.html, zipped) to download the current document as a zip file
5. Extract the zip file somewhere and open the HTML file inside to view the page in a browser
6. Press Ctrl + U to view its HTML source code, copy the whole content of the HTML file, and paste it into the Policy Rich Text Editor (Text tab)
=> These steps transform a Word / Excel document into an HTML document while keeping as much of the formatting and styles as possible
=> Grids (tables) in both Excel and Word files are supported
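
Steps 5 and 6 can also be scripted instead of done by hand. The following is only a minimal sketch using Python's standard library. It assumes the downloaded zip contains at least one .html file and that the file is UTF-8 encoded; the function name read_html_from_export and the file name export.zip are my own placeholders, not something Google Drive defines.

```python
import zipfile


def read_html_from_export(zip_path):
    """Return the text of the first .html file found in the exported zip."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".html"):
                # Assumption: the exported page is UTF-8 encoded.
                return archive.read(name).decode("utf-8")
    raise FileNotFoundError("no .html file found in " + str(zip_path))


if __name__ == "__main__":
    # "export.zip" is a placeholder for the file downloaded in step 4.
    print(read_html_from_export("export.zip"))
```

The printed text is what you would otherwise copy from the Ctrl + U view.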

The steps above are rather cumbersome, so I looked for another solution and tried python-mammoth (https://github.com/mwilliamson/python-mammoth).

It takes only 2 steps:

pip install mammoth

mammoth document.docx output.html

With that, I can convert documents to HTML in seconds. It also has a lot of customization options which could make your life easier.
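
If you have many documents, the same conversion can be called from a Python script instead of the command line. This is a rough sketch, not the only way to do it: mammoth.convert_to_html and the style_map option come from the mammoth library, while the helper name docx_to_html and the file names are placeholders I made up for illustration.

```python
from pathlib import Path


def docx_to_html(docx_path, html_path, style_map=None):
    """Convert a .docx file to an HTML fragment and write it to html_path.

    Returns the list of messages (such as warnings) reported by mammoth.
    """
    # Third-party dependency, installed with `pip install mammoth`.
    import mammoth

    options = {}
    if style_map:
        # Optional custom mapping from Word styles to HTML elements.
        options["style_map"] = style_map

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, **options)

    # result.value holds the generated HTML as a string.
    Path(html_path).write_text(result.value, encoding="utf-8")
    return result.messages


if __name__ == "__main__":
    # Placeholder file names, matching the command-line example.
    for message in docx_to_html("document.docx", "output.html"):
        print(message)
```

A style map is one of the customization options: for example, passing style_map="p[style-name='Section Title'] => h1:fresh" should turn paragraphs with the Word style "Section Title" into h1 headings. The output is an HTML fragment rather than a full page, which suits pasting into a rich text editor.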

I am still working on Excel and PDF … but that will be another story.