“Naturalness of Software Languages”

I recently read “On the Naturalness of Software,” a paper co-authored by Prem Devanbu, a CS professor at UC Davis. It takes the statistical techniques of Natural Language Processing, built for languages like English and used in products like Siri, and applies them to source code and the work of software developers.

Devanbu applies the n-gram model from natural language processing to programming languages. It is a neat idea, and his paper shows that statistical models work well on code, which has a precise syntax and structure far stricter than that of English. Given that these models already work for English and other spoken languages, code is potentially an easier target. The best example Devanbu uses is a loop header: given for(int i=0; i<10, the statistical model should readily predict that ; i++) comes next. One severe limitation is that the model falls short at predicting logic. The logic of a software developer is crucial, since the same task can be coded in different ways using different logic. I don’t think statistical models will be able to predict a developer’s logic anytime soon. If they ever can, we might be at the point where software can write itself.
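To make the n-gram idea concrete for myself, here is a minimal sketch in Python. It is not the paper's implementation: the crude tokenizer, the three-line toy corpus, and the names tokenize, train, and predict are all my own assumptions for illustration. It counts which token follows each pair of tokens (a trigram model) and suggests the most frequent one.

```python
from collections import Counter, defaultdict
import re


def tokenize(code):
    # Crude lexer (an assumption, not a real parser): words/numbers,
    # the "++" operator, or any other single non-space character.
    return re.findall(r"\w+|\+\+|\S", code)


def train(corpus, n=3):
    # Map each (n-1)-token context to a count of the tokens seen after it.
    counts = defaultdict(Counter)
    for snippet in corpus:
        tokens = tokenize(snippet)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts


def predict(counts, prefix, n=3):
    # Suggest the most frequent token after the last n-1 tokens of the prefix.
    context = tuple(tokenize(prefix)[-(n - 1):])
    if context not in counts:
        return None  # context never seen in training
    return counts[context].most_common(1)[0][0]


# Toy training data, invented for this example.
corpus = [
    "for (int i = 0; i < 10; i++) { total += i; }",
    "for (int j = 0; j < n; j++) { sum += j; }",
    "for (int k = 0; k < 10; k++) { print(k); }",
]
model = train(corpus)

print(predict(model, "for (int i = 0; i < 10"))  # suggests ";"
```

Even with three training lines, the model picks up that a semicolon follows "< 10" in a loop header. It also shows the limitation: it only knows which token tends to come next, not what the loop body is supposed to do.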

One idea in the “Future Directions” section that stood out to me was applying the statistical models to help developers who have disabilities or repetitive strain injury (RSI). I deal with RSI myself and agree that code is repetitive and predictable; I would find great value in his plug-in if only it were available for Xcode on the Mac. I posit that better auto-completion would improve individual productivity significantly. Who wouldn’t want that?

However, I wonder when we should not use this. Even if we can do it, should we? Even if statistical models of code were perfect, are there times when they aren’t needed, or should be avoided entirely? What are the security implications if malicious hackers use this as a tool? I would bet that this will affect Computer Science curricula at schools around the nation, for better or worse.

If statistical models can be applied to spoken languages and now programming languages, what about combining them with computer vision to model body language, which is often claimed to make up as much as 90% of communication? Law enforcement would love to use this idea to identify nervous behavior among travelers at airports and other transportation hubs around the world. Body language is arguably structured like spoken language: people tend to make similar gestures when nervous or uncomfortable, and the same goes for happiness and sadness. I see body language as the next frontier, but this could bring us one step closer to the world of 1984, with its surveillance and restricted behavior.