Python2Vec: Word Embeddings for Source Code
Alex Gude

I wonder what would happen if we tried to generate n-grams from a walk over the AST, the control-flow graph, or the data-dependency graph. Maybe something as simple as a random walk would be sufficient?
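To make the random-walk idea concrete, here is a rough sketch using Python's standard `ast` module. It covers only the AST case (not the control-flow or data-dependency graphs), and the function name `random_walk_types`, the `steps` and `seed` parameters, and the restart-at-the-root rule are all my own assumptions rather than an established method.

```python
import ast
import random


def random_walk_types(source, steps=20, seed=0):
    """Emit node-type names along a random downward walk over the AST.

    Assumption: the walk only moves from a node to one of its children,
    and jumps back to the root whenever it reaches a leaf.
    """
    rng = random.Random(seed)
    tree = ast.parse(source)
    node = tree
    walk = []
    for _ in range(steps):
        walk.append(type(node).__name__)
        children = list(ast.iter_child_nodes(node))
        # Step to a random child, or restart from the root at a leaf.
        node = rng.choice(children) if children else tree
    return walk


print(random_walk_types("total = price * quantity + tax", steps=12))
```

The emitted list of type names could then be sliced into n-grams like any other token stream.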

Another idea: a sequence of symbols could be generated by doing a (depth-first or breadth-first) walk over the AST and emitting the type of each AST node encountered. We could then treat this sequence of node types as the underlying data for our n-gram analysis.
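A minimal sketch of that walk, again with the standard `ast` module. The helper names `node_types` and `ngrams` and the `breadth_first` flag are hypothetical, chosen only for illustration.

```python
import ast
from collections import Counter, deque


def node_types(source, breadth_first=False):
    """Return the type name of every AST node, in DFS (pre-order) or BFS order."""
    pending = deque([ast.parse(source)])
    types = []
    while pending:
        # BFS takes from the front of the queue, DFS from the back.
        node = pending.popleft() if breadth_first else pending.pop()
        types.append(type(node).__name__)
        children = list(ast.iter_child_nodes(node))
        # For DFS, push children reversed so the leftmost child is visited first.
        pending.extend(children if breadth_first else reversed(children))
    return types


def ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


sequence = node_types("x = y")
print(sequence)
print(ngrams(sequence, n=2).most_common(3))
```

Depth-first order keeps each subtree contiguous in the sequence, so its n-grams mostly capture parent-child nesting; breadth-first order groups siblings and nodes at the same depth instead.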

This would give us an appreciation of which functions are structurally similar, rather than similar in terms of variable names, function names, and so on.
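As a quick illustration of what "structurally similar" could mean here, the sketch below compares functions by the cosine similarity of their node-type bigram counts. Cosine similarity is just one measure I picked for the example, and `type_bigrams`, `cosine`, and the three sample functions are made up for this purpose.

```python
import ast
import math
from collections import Counter


def type_bigrams(source):
    """Count bigrams of node-type names from a depth-first (pre-order) AST walk."""
    def visit(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)

    types = list(visit(ast.parse(source)))
    return Counter(zip(types, types[1:]))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical samples: the first two differ only in their identifiers.
add = "def add(a, b):\n    return a + b\n"
concat = "def concat(left, right):\n    return left + right\n"
total = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"

print(cosine(type_bigrams(add), type_bigrams(concat)))
print(cosine(type_bigrams(add), type_bigrams(total)))
```

Because identifiers never enter the sequence, `add` and `concat` produce the same bigram counts, while `total` scores lower against either of them.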
