Using Machine Learning for Code Completion in Pharo: Research Questions
I am in the final year of my Bachelor’s in Computer Science. I am starting to work on my thesis, in which I want to explore how we can use machine learning to improve code completion in Pharo.
Last year I did an internship at INRIA, and this summer I participated in Google Summer of Code, where, under the supervision of Marcus Denker, I implemented a better completion engine for Pharo based on analysing the AST of the source code.
Now, in my thesis, I want to see how the sorting of completion results can be improved with ML.
These are the research questions that I want to answer in my work:
- Can we improve code completion in Pharo by sorting candidate completions with an n-gram language model?
- Can we build a tool based on a trained n-gram model that proposes completions fast enough to be used in an IDE? (The user cannot wait 30 seconds for a completion to appear.)
- How can we numerically evaluate the results of code completion produced by different completion strategies?
- How is source code different from natural human languages (English, French) in the context of building statistical language models? Can we effectively model source code with n-gram language models that were designed for natural language? How different is the process of training those models on source code (preprocessing steps, vocabulary size, repetitiveness, predictability, etc.)?
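
To make the first and third questions more concrete, here is a minimal sketch of what "sorting candidates with an n-gram model" and "numerically evaluating the results" could look like. It is written in Python purely for illustration, not in Pharo, and it is not the completion engine described above. Everything in it is an assumption: the `BigramModel` class, the add-one smoothing, the toy token sequences, and the choice of mean reciprocal rank (MRR) as the metric are all hypothetical stand-ins for whatever the thesis ends up using.

```python
from collections import Counter, defaultdict


class BigramModel:
    """Toy bigram (n = 2) language model with add-one smoothing (an assumed choice)."""

    def __init__(self):
        self.pair_counts = defaultdict(Counter)  # previous token -> counts of next tokens
        self.context_counts = Counter()          # how often each token appears as a context
        self.vocabulary = set()

    def train(self, token_sequences):
        for tokens in token_sequences:
            self.vocabulary.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.pair_counts[prev][cur] += 1
                self.context_counts[prev] += 1

    def probability(self, prev, cur):
        # Add-one smoothing keeps unseen pairs from getting probability zero.
        vocab_size = max(len(self.vocabulary), 1)
        seen = self.pair_counts.get(prev, Counter())[cur]
        return (seen + 1) / (self.context_counts[prev] + vocab_size)


def rank_candidates(model, prev_token, candidates):
    """Sort candidates by descending P(candidate | prev_token); ties break alphabetically."""
    return sorted(candidates, key=lambda c: (-model.probability(prev_token, c), c))


def reciprocal_rank(ranked, expected):
    """1 / position of the expected completion, or 0.0 if it was not proposed."""
    return 1.0 / (ranked.index(expected) + 1) if expected in ranked else 0.0


def mean_reciprocal_rank(model, cases):
    """cases: list of (previous token, candidate list, completion the user wanted)."""
    scores = [reciprocal_rank(rank_candidates(model, prev, cands), expected)
              for prev, cands, expected in cases]
    return sum(scores) / len(scores)


# Hypothetical, hand-tokenized training data (not taken from a real Pharo image).
corpus = [
    ["self", "assert:", "result", "equals:", "expected"],
    ["self", "assert:", "value", "equals:", "3"],
    ["self", "deny:", "result", "isNil"],
]
model = BigramModel()
model.train(corpus)

candidates = ["deny:", "assert:", "assertEmpty:"]
ranked = rank_candidates(model, "self", candidates)
mrr = mean_reciprocal_rank(model, [
    ("self", candidates, "assert:"),
    ("self", candidates, "deny:"),
])
print(ranked)  # ['assert:', 'deny:', 'assertEmpty:']
print(mrr)     # 0.75
```

In this toy setup the model has seen `assert:` after `self` twice and `deny:` once, so `assert:` is ranked first. A metric such as MRR would allow different sorting strategies (alphabetical, frequency-based, n-gram) to be compared on the same set of test cases, though other metrics, for example top-k accuracy, may turn out to be more appropriate.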
Any questions or feedback would be most welcome!