pycaption: PBS’s open-source caption converter
Originally Posted by Joe Norton on Aug 15, 2012 at 1:49 pm
Closed captioning is a hot topic for video producers today. PBS itself has found a new need for converting closed captions from older legacy formats, particularly Line 21 broadcast captions, to newer timed-text formats such as DFXP. Therefore, in an effort to better align with its mission statement and comply with new legal requirements, PBS is proud to present its newest offering for the open-source community: pycaption.
https://github.com/pbs/pycaption
Suggested download from command line:
pip install pycaption
Basic Usage
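from pycaption import SCCReader, DFXPWriter
from pycaption.transcript import TranscriptWriter

# Read the SCC file and parse it into pycaption's intermediary format
scc_caps = open("FRON3008.scc", "r").read()
caps = SCCReader().read(scc_caps)
# Write the same captions out as DFXP and as a plain-text transcript
print(DFXPWriter().write(caps))
print(TranscriptWriter().write(caps))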
Supported Formats
Pycaption was developed with the idea that captions come in many formats but often need to be converted to a single newer format that a specific video player supports. In PBS’s case, our database holds captions in the SCC, SAMI, and DFXP formats, and we would like all of them converted to DFXP so that PBS’s video portal can load them correctly. Every user will have slightly different conversion needs, however, so pycaption currently supports the following formats:
- SCC (read support), Scenarist Closed Caption format: the imported version of Line 21 broadcast captions. Mainly used by North American television stations.
- SAMI (read/write support), Synchronized Accessible Media Interchange: Microsoft’s format, used by Windows Media Player and other media players.
- SRT (read/write support), SubRip Text format: a simple text-based caption file made popular by the SubRip software. Officially supported by YouTube.
- DFXP (read/write support), Distribution Format Exchange Profile: W3C’s suggested caption standard, a profile of the XML-based TTML format. Commonly used by Flash players on the web.
- Transcript (write support): a text version of the captions stripped of styling and timing.
Styling support is still rudimentary for most of these formats. Due to the vastness of the specs (particularly DFXP) and the tendency to do one thing in 10 different ways (SAMI), only certain styling practices are supported. This can be easily expanded as the need grows, but it currently covers PBS’s styling needs for existing caption files in our database.
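To make the read/write matrix above easier to reason about, here is a small sketch that encodes it as data and answers "can I convert A to B?". The names SUPPORT, can_convert, and targets_for are made up for this illustration and are not part of pycaption; the table reflects only the support listed in this post.

```python
# Read/write support exactly as listed in this post; later releases may differ.
SUPPORT = {
    "scc": {"read"},
    "sami": {"read", "write"},
    "srt": {"read", "write"},
    "dfxp": {"read", "write"},
    "transcript": {"write"},
}


def can_convert(source, target):
    """Return True if `source` can be read and `target` can be written."""
    readable = "read" in SUPPORT.get(source, set())
    writable = "write" in SUPPORT.get(target, set())
    return readable and writable


def targets_for(source):
    """List every format that captions read from `source` can be written to."""
    if "read" not in SUPPORT.get(source, set()):
        return []
    return sorted(name for name, modes in SUPPORT.items() if "write" in modes)


if __name__ == "__main__":
    print(can_convert("scc", "dfxp"))   # SCC is readable, DFXP is writable
    print(can_convert("dfxp", "scc"))   # SCC has no writer in this list
    print(targets_for("scc"))
```

Because every reader feeds the same intermediary format, any readable format can be written to any writable one; the only conversions ruled out are those that start from Transcript or end in SCC.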
Workflow
The basic workflow for using pycaption is as follows:
- read your captions in through a provided reader, which translates them into an intermediary format
- write those captions into another format with any of the supported writers
- rinse and repeat
Command Line Testing
As a way to easily test out the functionality of this module, I created a command line interface for pycaption, which can be downloaded here:
https://github.com/jnorton001/pycaption-cli
Once it is set up (see the README for details), run it from your shell as follows:
pycaption filename [--sami --dfxp --srt --transcript]
E.g.
pycaption ../jnorton-caption.scc --dfxp --transcript
pycaption /Users/Joe/Desktop/jnorton-caption.sami --dfxp > jnorton-caption.xml
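For readers curious how a command with this shape could be wired up, here is a minimal argparse sketch. It is an assumption-laden illustration rather than the actual pycaption-cli source: build_parser and requested_formats are hypothetical names, and only the argument parsing is shown, not the conversion itself.

```python
import argparse

# Output flags taken from the usage line above.
FORMATS = ("sami", "dfxp", "srt", "transcript")


def build_parser():
    """Build a parser for: pycaption filename [--sami --dfxp --srt --transcript]."""
    parser = argparse.ArgumentParser(
        prog="pycaption",
        description="Convert a caption file to one or more output formats.",
    )
    parser.add_argument("filename", help="path to the input caption file")
    for fmt in FORMATS:
        parser.add_argument("--" + fmt, action="store_true",
                            help="write %s output" % fmt)
    return parser


def requested_formats(argv):
    """Return (filename, [formats]) for a list of command line arguments."""
    args = build_parser().parse_args(argv)
    formats = [fmt for fmt in FORMATS if getattr(args, fmt)]
    return args.filename, formats


if __name__ == "__main__":
    # Hypothetical invocation mirroring the first example above.
    print(requested_formats(["../jnorton-caption.scc", "--dfxp", "--transcript"]))
```

A real tool would then read the file, pick a reader for its format, and print each requested writer's output to standard output, which is what makes the `>` redirection in the second example work.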
How It Happened
This summer, as a Product Development Intern for the Interactive department, I had the privilege of working on pycaption to handle PBS’s closed caption conversion needs. Since no Python-based caption conversion library existed in the open-source world, I was tasked with fixing this problem. Once my initial script was written to handle the current business requirements, we decided to expand this project to a bigger and broader idea: an easily extendable Python module that can take as input a caption file in one of several formats, and then output those captions in various formats as needed.
Pycaption has been a very enjoyable project to work on, and I look forward to seeing the fruits of it as PBS’s mobile video streams start to incorporate closed captioning! Many thanks to my supervisors and peers who gave constant advice and help.
Some fun things I learned while working on pycaption:
- XML parsing and HTML parsing
- SCC character code conversion
- PyPI Distribution
- NLTK Sentence Boundary Detection
- Git branching and committing through GitHub
- Software licensing
- Custom shell commands
- lots more!
This is a lot more real-world knowledge than I expected to gain at PBS when I applied. I firmly believe I learned so much because PBS’s internship program expects interns not only to observe but also to contribute daily. This method of learning-by-performing has been more effective than any of my college classes thus far! This knowledge is also only a slice of the whole pie, considering that I was also able to work on several other big projects and many small tasks and assignments.
In conclusion: please download the module, follow us on GitHub, submit patches, report issues, and help this project become a premier offering in the closed caption field!
Also: see the attached SAMI file for an example caption file to be used for testing purposes.
Originally published at open.pbs.org.