Productivity in Software Development

August 21, 2019 | By Bernardo Ryan


>>Thanks everyone for coming. So this session is on
software productivity and AI. So I have to apologize
for the confusion. There was a session before this that was Software for AI Developers, and this is AI for
Software Developers. I think that's where they got the wrong title and we got a different title. So hopefully, I'm in the right class. So I also want to introduce the three panelists that are here: Margaret-Anne Storey, from the University of Victoria, and Prem Kumar Devanbu, from UC Davis, and also Ahmed Hassan, from Queen's University in Canada. So welcome everyone. So I'm going to give
a brief introduction to the area and also some of the stuff that we’re doing in
Microsoft that’s related to this, and then open it up to
the panel to discuss further and take it from there. Given it's a small crowd, we'll kind of improvise as we go in terms of how we
allot the use of the time. So first thing we want
to talk about: What do software developers look like today? So we want to predict
the future of the AI workforce. So there’s been a ton of study on data scientists, and data science, and what data scientists should do and how do engineers
become data scientists, how do data scientists
become engineers, etc. LexisNexis did a study saying there's going to be 20 exabytes of data by 2020, which means we need a lot of engineers to process that data, and how are we going to hire those engineers? Then IBM came up with some numbers saying that we're going to require 2.72 million people, data scientists, by the year 2020. Then McKinsey released some data saying that the demand for data scientists is so high that we are going to be 50 percent short, really short of data scientists, by the year 2020. Then they released a later number
which basically said, AI technology and tools and skill sets of data
scientists will be rendered useless in 12-18 months,
and we see that. People come knowing R and
some standard programming models. I have turned over probably
three rounds of data scientists in the last four years because either they pick up
new skills or they don’t. So it is a challenge. Then finally, KDNuggets did
a survey of ML engineers, and that 51 percent of them feel that they will be out of
a job by the year 2025 because they’ll be replaced
by AI software that can write code itself and they’re
not going to have jobs anymore. So the whole data science
engineering AI space is in a flux. The definition is not very clear what kind of
people we should hire, nor what the future of
that looks like, etc. It’s a challenge for
people like us who hire engineers or scientists and the kind
of problems we throw at them. So hopefully through
this discussion then, we can get through to some of the interesting topics
related to this area. So I want to start with a story, or actually I want to say something about Software 2.0. Actually, the fourth panelist was
supposed to be Andrej Karpathy, who came up with this
notion of Software 2.0, and his premise is that for a large portion of code, many real-world problems have the property that you cannot write algorithms for them, you cannot write code for them. They will have input
and they will have output and they will have patterns. So you actually can build a model
and replace code with models. So his idea, at least from the
space that he’s working in, he works for Tesla, and
lot of this is, hey, how do I make the car
automatically park, or when do the windshield wipers
go off automatically? There’s no logic for it. If you try to write logic for
it, it’s very complicated. So instead of that, if
I gather enough data on the input side and then I
build a model and then produce output that actually works and
learns from the output and goes back to the input and do
a reinforcement learning stuff, then actually I don’t
have to write code. All I have to do is replace what I would have done here
with a giant neural network, which I need to train and deploy, and all it is doing is finding the right kind of code
in the program space. This is directly
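To make that idea concrete, here is a minimal sketch of replacing a hand-written rule (say, the wiper trigger) with a small learned model. The features and data are invented for illustration; this is not anyone's actual system.

```python
# Illustrative sketch only: invented features and data, not a real system.
# Instead of hand-coding "turn the wipers on when...", learn the rule from
# labelled examples of sensor input -> desired output.
from sklearn.linear_model import LogisticRegression

# Each row: [rain_sensor, ambient_light, speed_kmh]; label: wipers on (1) / off (0).
X = [[0.9, 0.2, 60], [0.1, 0.8, 50], [0.7, 0.3, 80], [0.0, 0.9, 30]]
y = [1, 0, 1, 0]

model = LogisticRegression().fit(X, y)   # the "code" is now a trained model
print(model.predict([[0.8, 0.25, 70]]))  # likely [1]: wipers on
```

The hand-written branching logic disappears; what remains is data collection, training, and deployment.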
This is directly borrowed from his blog: it will basically find the right kind of algorithm, but it is somewhere in this space, rather than me writing that piece of code that I have to write, which has very complex logic. The nice thing about this idea is that it is self-healing
and it’s self-correcting, but it solves certain kinds
of problems very, very well. So if you talk to Johannes Gehrke, who's a fellow at Microsoft, I think he has a keynote later today, he might talk about it. They have replaced big chunks of code in Skype and Teams with this kind of a model that will replace thousands of lines of code with just a few lines of model, where the system is much more compact and performs more efficiently. My own journey as a software engineer started when I graduated from college. I was trying to figure things out; I did a thesis on compiler compilers, and this was back in India, and we had to type our dissertation, and actually had to do it with a real typist. That was the requirement
of the university, and the typist came
back to me and said, “You have one-too-many compilers in here,” and they removed one compiler. So wherever it said
“compiler compiler”, they just overrode
the word compiler and completely destroyed my dissertation. It was based upon this paper, literally written 40
years ago by Cattell, basically on code
generator generators. So CMU was doing a ton of work on this: you've got m languages and n machines, and you'd write a code generator for a particular language to a particular machine, so you end up writing m times n compilers. Instead of writing m times n compilers, how do you reduce that to m plus n? So they have this idea of
code generator generators. This was the paper I read much
later than it was published, but that was before the Internet
and I was in India. So you know how long it takes
to get that paper out there. But if this paper was published
today, what would it say? AI automatically generates
code that run on any machine because that’s
what you would say today. But I mean, there is
some truth to it as well. So when I started as an intern, it was at a research center back in India, and they hired the top graduates from the top universities in India to do program translation. So a lot of the code that you wrote in computer science at that time was translating code from one language to another language. So you were writing program translators, and you did it in a human way. So companies like Hewlett-Packard or IBM would have software
written in one language. Like COBOL, you need to
translate it to PL/I, or you have something that was written in Modula, and you need to translate that to C or C++. What would they do? They gave these transformation rules to these engineers, who would have this card at their side, and they would actually go edit the code and say, "Oh, I see an If, left brace, right brace, I'll change it to If condition Begin End." I mean, they physically did that. So a lot of the problems we worked on were of this nature. You get a program in one language and you need to translate the program into another language, and you need to write
a code generator. You wrote this code
generator by hand. But you don’t really need to
write this code generator. My mentor at that time told me that programmers
should get paid well, but people who generate programs
should get paid 10 times as much. So you don’t write programs,
you generate programs. So that was the idea. So what you did was, you got a grammar in one language, a source language, and
a grammar in a target language, and then you write a code generator generator that translates code conforming to one grammar into code that satisfies the other grammar. So you wrote a lot of rules. If you didn't have the grammar, then you had examples. So you take examples of code
in one grammar and then you wrote rules to map them to examples of code in another grammar. Then you’ve got your code
generator generator, which will generate a code generator, which will translate programs
from one language to another. You could further improve
that by kind of saying, “Hey, I don’t want to do m by n
code generator generators.” So I have an intermediate
representation of standard kernel that goes
between different grammars, and so that way I have a re-targetable
code generator generator. So I can reduce everything to an intermediate form
and I can do it. So all of my early programming
went in writing tree transformers. So we would be writing
tree transformation rules. We would be writing tree algorithms. Efficient tree algorithms
that programs the tree, you navigate the tree efficiently, and you make this transformers
run really, really fast. So I spent a lot of time doing that, and the nice thing about it was
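As a rough illustration of what those tree transformation rules look like, here is a toy sketch, assuming a tiny tree type and a single rule for one construct; the real translators had many such rules plus the machinery around them.

```python
# Toy sketch of a tree transformer: one rewrite rule mapping a C-style
# "if (cond) { body }" node into an "IF cond THEN body END" style node.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                # e.g. "if_braces", "cond", "block"
    children: list = field(default_factory=list)

def rule_if(node: Node) -> Node:
    # Local rewrite rule: fires only on the construct it knows about.
    if node.kind == "if_braces":
        cond, block = node.children
        return Node("if_then_end", [cond, block])
    return node

def transform(node: Node) -> Node:
    # Apply the rule at this node, then walk the children.
    node = rule_if(node)
    node.children = [transform(c) for c in node.children]
    return node

tree = Node("if_braces", [Node("cond"), Node("block")])
print(transform(tree).kind)                  # if_then_end
```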
So I spent a lot of time doing that, and the nice thing about it is that I can quote one project at the company I was interning at, where I got paid $200 for six months, of course translated to Indian rupees. They had quoted 100 man-years, to tell you how they quote projects, 100 man-years to translate everything written in Modula to Unix and C, because they were going to hire 100 people to manually change the code, millions of lines of code. What I did was I actually wrote a code generator
generator system which will generate trees
in one language and translate it to trees in
another language. That was done in about two months' time, and the way they priced it was they charged $80,000 for writing the translator generator and $0.25 for each line that was translated, and about three million lines of code were translated automatically. So this was way before AI, but at the same time I
had the satisfaction of being a part of that project. Of course I got paid, as I said, $500 for doing that. But those are efficient systems. So if I were to do that
today, what would I do? I've covered all of this part. I have a lot more data than before. So for all of this stuff I have examples of source code in one language and examples of code in another language. I build this giant AI model that will learn from these examples, build a transformation model, and automatically give me my program translator. So that's the stuff that we do today. So in fact, if you look at
natural language processing, there's a ton of progress that has been made there. In the programming language community, we are trying to learn from that and we're trying to apply it. So the recent MIT Technology Review
has a great article on how researchers have
been able to translate between very rare languages, languages for which there
is very little data, and how they have mapped into other languages because maybe
the past tense in the language, or the verb in the language, or the combination of verb and
noun in the language are mapped a certain way across the languages and you can get
an automatic translator. This is really exciting for us because we can
learn from these things, which are done from
natural language and can apply to programming languages. So it’s really a good time to be an engineer writing
compiler systems. We don’t go find people
writing compilers anymore. But at the same time, you
can do much more than just writing standard compilers
doing this kind of stuff. With that, I want to
switch gears a little bit and say what do
we do at Microsoft. So I see some of my colleagues here, some PMs, and they’re very
familiar with these pictures. So when we think
about our developers, we think about what we call
inner loop and outer loop. So a typical developer
has an inner loop, edits the code, builds it, debugs it. That’s the inner loop
experience and I’m not sure if that terminology
goes. Yes, go ahead.>>Yes. Can I just
add something though? It seems like this is a tumble. I mean, compilers, yeah, because that’s how
I cut my teeth too, but it seems like that’s
the one area that we don’t have a replacing
compiler writers yet, but you’re worrying me.>>Now let’s talk about it.>>So it’s not actually happened. I mean, I haven’t seen
it or maybe I’m unaware. But I mean, I sort of agree that it’s within the scope.
It’s a language. It’s got a grammar. There are semantic rules. But somehow the correctness criteria, it’s much more precise and unambiguous
than say natural language. So I wonder, I’m a little less
scared by the talks so far, but also like we don’t
have evidence yet, do we?>>We don’t have evidence yet, but at the same time you’re
up for the challenge.>>No. I’m with you, but I’m just saying
that still remains, like we have all these
amazing examples you said for natural language, but it’s been actually
pretty tough going. We haven’t seen huge progress
in synthesis really. I mean, the synthesis has been mainly for small programs, and translation between languages is still hard because of this semantic precision and these small differences. There's lots of
corner cases and stuff. So I’m not saying it won’t happen, but I don’t think it’s happened yet.>>So we’re taking small steps. So today we’re saying that’s
why I put up this picture. We’re taking very,
very small steps here. So the smallest step is to say, “Hey, you’re writing code. Can I help you write code?”
That’s where we start. Now the big picture, I will talk about it as we go. We’re not there yet,
and you’re right. So I have friends who
are Chinese on Facebook; one of my friends says all his
friends are Chinese except for me, and they talk in Chinese,
I take their texts, post it to Google Translate, and then what I want
to write in English, I turn that to Chinese
and translated on Google, and put it back and
send it back on Facebook, and they understand what I say. Nine out of 10 words might be correct. They laugh at me once in a while, but at the same time
they get the answer. Now if I feed that to a computer, it’s going to just barf. It’s not going to work because there’s no program
that’s almost correct. It has to be correct or it’s got to be wrong. So you’re right there. But at the same time, we can do stuff that actually
improves the developer productivity, helps a developer to write 80 percent of
the code in efficient ways, and then have the human do the remaining hard 20 percent. So a lot of this automation is about doing the easy ones really well with the machine and giving the hard ones to the human beings so that they actually do them in the human way. So that's the idea. But we have to push the limits
of it to see how far we can go. Coming back to this point, so that’s the inner loop of development, and I’ll give you a
couple of examples. Mark and Shang Yu are here; you must have seen their demo of IntelliSense and how we help developers write and complete their code. Then in the outer loop you have code reviews, and testing, and continuous integration, CI/CD stuff, release management, issue reporting, documentation, production, and analytics. All of this comes with a ton of data, and we can use AI in
every step of the way here to improve the productivity
of the developer or the teams to do
things automatically. When I say automatically, I
say do the mundane things automatically and have the human
beings do the intelligent things. So there’s a ton of stuff that
can be done automatically. Today for example, one of my team members had done an
analysis of code reviews, PR reviews, and he found that 60 percent of PR
reviews are stylistic. It says, hey, you need a tab here, you need a space here, you need to capitalize
your variable name, you should have more
meaningful variable names. It’s almost like somebody doing a paper review the night
before the reviews are due. So the conference chairs expect you to write
a page full of review, and if you see that 80 percent
of that is all typos and spelling corrections and stylistics that really doesn’t
talk about the details. Engineers are forced to do PR
reviews and they’re saying, “Oh, these are the things I can say
about the code and I will say it, and that way I’m done with the job.” So that can be easily automated, and we are trying to
automate that stuff. Then there are the hard code reviews, where maybe a human expert is required. So, coming back to the point of helping the user develop code. So we have taken a
very, very systematic approach to helping
developers write code. So today, when we started
doing this, we said, well, about 60 percent of
what you write are API codes. Today especially people doing a lot of AI code, a lot of systems code. You have a library you import, and then you make an API code. So when you call a class, an object of a class or make
a class and a method call, can I predict the method
that you’re going to call? That’s the first step. If I can help you with that, I significantly improve
your productivity. To do that, now we have to
make that a part of an editor. So that’s Visual Studio, VSCode, whatever that is, and we’re going to make it efficient
in the sense that we’re not interfering with
the developers experience. So we have to make sure
that it delivers a model in a compact way so that it can sit in or not take up enough
memory and decline, and also the model comes
back the inferences so fast that it’s not interfering
with the user experience. So today we have a very,
very simple model. You must have seen
the demo in the mornings. I’m not going to show the demo. This basically understands
all of the program flow. We have the control flow, things inside if conditions, things inside while loops, it defined before use, etc. It takes advantage of that and builds a very nicely engineered
Machine learning model and works very well. It uses less than 30 megabytes
of space and comes back with a recommendation within
15 milliseconds. That’s perfect. But if you want to make
But if you want to make it do more things, so for example if you
wanted to predict what argument you are going to
use when you make a method call, especially when you’re
overloaded methods, now you start to push
the limits of what this can do. So you need something more than that. If you look at code, code is like texts, so which means it’s a sequence. So you go left to right. The things that you have in the right are dependent on
things that are in the left. So it’s just a sequence model
very much like human texts. So we have a model that is basically an LSTM based model
with attention. Now this performed
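A minimal PyTorch sketch of that kind of sequence model follows; the vocabulary size, dimensions, and the simple attention over the left context are illustrative assumptions, not the production model.

```python
# Hedged sketch: a next-token model over code tokens, LSTM plus a simple
# attention summary of everything to the left of the cursor.
import torch
import torch.nn as nn

class CodeCompletionLSTM(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)           # additive attention over time steps
        self.out = nn.Linear(hidden_dim, vocab_size)   # scores for the next token

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, hidden)
        weights = torch.softmax(self.attn(states), dim=1)
        context = (weights * states).sum(dim=1)        # weighted summary of the left context
        return self.out(context)                       # next-token logits

# Usage: feed the tokens to the left of the cursor; the argmax over the logits
# is the suggested next token, for example a method name or an argument.
```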
Now this performed significantly better than the regular classification model that we have, but at the same time it has other challenges. Number one, it is not transparent. I'm not able to explain when it makes a recommendation or what recommendation it is making. It is an opaque model, because it's building stuff inside that it is not able to explain. So explainability is not there, but at the same time it gives me a lot more accuracy. Now if I throw a lot more data
at it and lot more code at it, then I can build a very,
very complex model. I can use something like a GPT-2, or a BERT model, or XLNet and I can understand a whole lot
of things about your code. Not only do I understand code in one language like Python or Java, I can mix them up, because
almost all of the code, all of the languages have
a notion of a variable, notion of assignment, notion
of definition, use, etc. I can just mix them all up and
learn the structure of your code, learn the syntax of your code, learn the semantics of your code. So one of the things
that happened when we built this model on the left was that, when you build a model for C# and Java, you have a type system. So we could hang off of the type system when you're making a method call. We could say, "Oh,
this is of this type. So only these method
calls will apply.” But when you are working
with languages like Python, which are loosely typed or have no types at all, we have to figure out
what the type might be. We have to create
approximations for the type. So all of it had to be engineered. But when you have a model that’s as complex as this which
uses a lot of data, we don’t have to worry about
the texts because we’re learning from patterns of the code. Today we have a model that
works reasonably well, and it doesn’t care whether it’s a function call
or whether it’s an argument, it can finish lines of code
for you. It can do a lot more. The nice thing about throwing everything into it is that it knows that the left parenthesis has to be matched by a right parenthesis after four arguments, just because it has so many examples of use of that. So that's the beauty of it.
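To show just the mechanics of that kind of left-to-right completion, here is the general-purpose GPT-2 checkpoint from the Hugging Face transformers library completing a code-like prefix; the models described here are trained on large code corpora instead, so treat this only as the shape of the workflow.

```python
# Mechanics only: a general-purpose GPT-2 extending a code-like prefix.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "for i in range("
inputs = tokenizer(prefix, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tokenizer.decode(outputs[0]))   # greedy continuation of the prefix
```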
So going back to Tom's point, can we do that? We're not there yet, but I think we are getting there, and there's an opportunity to get there. These come back with about 90 to 92 percent accuracy. So given enough resources, given enough code, given enough complexity of your algorithms, I think you can do that. That's what I think we can do. So what do we need to achieve all of this? So the first thing is data. For us, in terms of code,
we have tons of data. If you look at GitHub, you’ve got more than
100 million reports. We’ve got 30 plus
million contributors. So you’ve got
200 million pull requests. So there’s tons of data or there
are 50 plus languages of code. You can have access to all of
this stuff and you can do. You’ve got English texts,
which French texts, German texts you have got a code
in 50 different languages, and you can do it a lot
of stuff with that. Similarly, if you look at Stack Overflow, you've got six-plus million users and 12-plus million
questions and answers. So when somebody asks a question, the answer comes with code related to it and we can combine human text, English texts along with code.
You’ve got corporate data. So if I’m in Microsoft, we got millions of
lines of code only in one piece of software like Windows. So we can do something
that’s very specific to Microsoft inside Microsoft
and to build us intelligence. So basically, we have tons of data and we can
take advantage of it. Now to go with this data, we have to have analytics. So one thing we were
very careful about was we didn’t throw
a model I did first, we see is analyze the code deeply
before we throw a model at it. So for example, there are some
classes that are used all the time. There are some classes
that are very rarely used. So if you want a high-accuracy model, you focus on things that
are used more commonly, patterns that occur more commonly and you can
have a simpler model. So, learning code for example: back in '71, D. E. Knuth looked at 800 Fortran programs and found that 95 percent of the loops have only one line inside them. If you have this kind of information, you can feed a machine this information and build a super-efficient model. And actually, more recently, some of my colleagues in MSR did an analysis of 25 million lines of code and came up with almost the same answer, which is amazing, I think. So some things have not changed over the years.
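A hedged sketch of that style of corpus analysis: count how many statements each loop body holds across a directory of Python files (the corpus path is made up).

```python
# Illustrative corpus analysis: distribution of loop body sizes.
import ast
import glob
from collections import Counter

body_sizes = Counter()
for path in glob.glob("corpus/**/*.py", recursive=True):   # hypothetical corpus location
    with open(path, encoding="utf-8", errors="ignore") as f:
        try:
            tree = ast.parse(f.read())
        except SyntaxError:
            continue
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            body_sizes[len(node.body)] += 1

total = sum(body_sizes.values())
if total:
    print(f"{body_sizes[1] / total:.0%} of loops have a single statement in the body")
```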
We can use that to do EDM mining and stuff like that. So anyway, AI has
made significant progress. We started with some very simple
models but today we have RNNs and LSTMs we can use for code to
do the stuff that we’re doing, and we are barring a ton of stuff
if NLP and computer vision. So and we have also tons
of tools like AutoML, and PyTorch, and TensorFlow. Then we also as we mine code, we’re being careful that we’re not violating people’s privacy, etc. So we wrote a ton of tools for us
to build good algorithms around. In terms of system, we have a lot of GPUs available, energy piece or by default in
most of the computers today and we have much faster
interconnects than we had before, and we can actually take
borrow ideas from HPC. How do we overlap computation
with communication? How do we do reduction
in an efficient way? How do we combine data parallelism that’s
model parallelism stuff like that. So we borrow techniques from
there and we can combine all of these things and put them together and actually build systems that work. So and finally, there
are humans in the loop, which you should not forget because one of the things
we found when we deployed our models was that the first model's offline accuracy was in the 70 percent range, and we deployed it out [inaudible] and did a survey of our customers, and the customers were quite happy. Usually, I mean, for anybody who has done machine learning, a model with an offline accuracy of 70 percent, when it's deployed online, just doesn't work. So we did something right. One of the things that worked for us was that the user experience was perfect. We iterated a lot on the
with the user’s flow, users thinking flow or the code writing flow and we
made sure that happen, because users don’t care that
the [inaudible] underneath. All the users care is
that you are helping them be more efficient
and more productive. So basically, my summary is there are four things that you require if you want
to build AI-infused software. One thing is programming languages and compiler systems. The second is the algorithms I talked about, so good use of data and AI, and methods from high-performance computing that we have to use to build large-scale models, and finally, user experience, which is really important. A model that's good enough and provides a great experience is much better than a model that's perfect but interferes with the user. So this is our learning from some of the automation that we have done
to improve user productivity. I call upon the panelists
to touch upon some of these topics and they don’t experience and then we
can have a discussion. So we’ll start with Peggy
from University of Victoria. She’s done a ton of work related to user interaction in
software productivity, and Peggy, take it away.>>Yes.>>When she sets up, if you have
any questions, comments, please.>>Questions or comments for Neil. Can you hear me okay? So that was
the PowerPoint AI at work there, you see the way it was
zoomed in on my nose.>>Okay.>>I think, right, and if the AI was really smart, it would've known that Peggy is a synonym for Margaret and switched that in, probably. So you're finding my slides here. So you remember you saw
the four circles that he had there and the human computer
interaction was the bottom one. So that’s the one that
I’m going to talk about just a little bit
here. Thank you very much. I want to talk about why humans in AI need to join forces in
software development. So I’m a professor at the University of Victoria
which is a short plane ride away, if the plane is not canceled, up in Victoria just across the street here. I also work quite a bit here; I did a sabbatical and I'm still collaborating with the [inaudible] team here at Microsoft, and also working with folks over at Microsoft Research. So a lot of the stuff
that I’ve been doing is looking at productivity at Microsoft so I’ll talk about
that during my talk here today. First of all, I want to
talk about conceptualizing productivity and what do
we mean by productivity. So how do you even define what it is? How do you, if you can’t really define it, then how
do you measure it? How do you come at the metric? How do you know that the AI system
that you’re building is actually going to
make a difference if you can’t really define
our productivity is? So a bunch of us from research and also some industry
partners we met in [inaudible] couple years ago now and we
talked about productivity for a week and then we wrote a book and this is a book
that came out of it, Rethinking Productivity
in Software Engineering. It’s written more
with practitioners in mind and I was co-author
on one of the chapters called Conceptualizing or
actually it was called a Software Development Productivity
Framework and the goal of this chapter was to
conceptualize productivity. We came up with this framework, these three different ways that
you can think about productivity. So in the one hand, you can think about
the velocity of the work and on another hand you can think about three hands now,
sorry about that. Another way, you can
think about the quality. So that would be, say, the number of bugs in the system
when you deploy it. Then finally, you can think about productivity in terms of
developer satisfaction. So developer satisfaction is
often used as a proxy for perceived productivity on
the part of the developer because developers do a lot
of things during the day. They don’t just write code or
test it. They also review it. They help other people. They write test cases. They go to meetings, they design, they look at
requirements and so on. So looking at their
perceived productivity is a good way to measure
productivity as well. While I was at Microsoft
two years ago, and during the past year
and a half as well, we’ve been looking at
trying to understand how to measure perceived productivity
on the part of developers. Out of that research, we did a lot of observations and we interviewed different people from different groups at Microsoft, and that led to a big survey through which we have this initial theory about how developers' satisfaction and their perceived productivity relate to each other. There is this bidirectional
relationship between their satisfaction with their job and also their
perceived productivity. We built on theories from management, actually, that looked at this and saw that there was this bidirectional relationship. So basically what that means is, if you feel more satisfied at work, you're going to feel more productive, and if you
feel more productive, then you’re going to
feel more satisfied. So the purpose of
this research though was to identify what are the other factors, both social and technical factors, that might influence how developers feel about their jobs and
about their productivity. This is still working. So good. So what we did was we did the survey
to identify these factors, to also understand how satisfied developers were with these factors, and then we built a regression model to try to understand which ones actually can explain their overall satisfaction and their productivity.
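As a rough sketch of what that kind of regression looks like in practice (illustrative only: the file and column names below are invented, not the study's actual factors):

```python
# Regress overall satisfaction on survey factors to see which ones
# carry explanatory weight. Names are placeholders, not the real survey schema.
import pandas as pd
import statsmodels.api as sm

survey = pd.read_csv("developer_survey.csv")   # hypothetical survey export
factors = ["engineering_tools", "team_culture", "manager_support", "autonomy"]
X = sm.add_constant(survey[factors])
model = sm.OLS(survey["job_satisfaction"], X).fit()
print(model.summary())                         # coefficients and their significance
```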
and the work that we did is that there’s a lot of
challenges that developers, say influence or impact negatively their satisfaction
and their productivity. We expected that we would see a one-to-one relationship
with challenges and then the factors in our model
that would impact their overall satisfaction and perceived productivity but we didn’t. So what you see is that some challenges just have
this knock-on effect. So for example, if you report that your manager has a big impact
on your ability to work, that is going to then lead to lower satisfaction across
16 other measures. The same with team culture. If you feel that the culture
of your team is not good, it’s going to have a
knock-on effect with nine other factors
which then in turn will impact the satisfaction and the perceived productivity
of the developers. So this diagram here shows that first theory that I showed you at
the beginning, which was the high-level theory relating the factors to overall job satisfaction and perceived productivity, and after we ran the survey, I think we heard from
about 470 developers. We then created this regression model which then allowed us
to tease out which are the actual factors that help explain their satisfaction and which factors help explain their
perceived productivity. Just a couple of things to note here, one is that in the terms of the key factors that influence
their perceived productivity, you see that there are
many different factors that play a role there. Sure, engineering tools is one
of them and that’s where AI fits in but there are also all of
these other factors as well. Then on the satisfaction side
we see that work culture, and under work culture we
clustered team culture, organization cultures, as
well as their manager, and how collaborative
their team was. We see that actually this one had the most
explanatory power in terms of how satisfied they
felt about their jobs. So when we look at this and
when we look at other research, we do know that software
development is a team sport. So a lot of the focus that
I’m seeing so far coming out about AI and how AI can
change software engineering, and change software development, is very much focused around the individual developer and how
to help the individual developer. So what I’m trying to do through
this short talk and maybe provoke some discussion
afterwards is thinking about can we shift and
think about how AI could help collaboration and help
the team be more productive. This is a picture to
just demonstrate this. So on the one hand if we focus too
much on the technical aspects, we may have this great AI that does a really good job of
producing say more warnings. I’ll pick on Brandon’s
favorite example here. Maybe more security warnings and so we have this AI
that shows there are all these security warnings
and developers should address them
before shipping the code. If you just do that
and don’t think about the developers and how they’re
going to deal with those, then you may hit some problems
because a lot of engineers will go, “I’m not going to fix those bugs. I’m just going to
suppress the warnings, or maybe I’m going to hide them somehow from my manager
and then carry on.” On the other hand then if
you spend too much time just looking at
the humans in the loop, and you forget or you
don’t look at what are the technical opportunities
that you might have at hand then you also don’t
do as well as you could. So it really is
this sweet spot here in the middle and that’s
socio-technical intersection, and if you pay attention to
both the technical side, and the human and social side, that you really will get those gains
in terms of the productivity. So I’m really calling for
this joint optimization of both. How can AI boost
developer productivity? If we look just on
the individual side, we see a lot of examples of how AI
can automate tasks particularly rote tasks and basically just remove them from the developer’s work
that they have to do, or we may see some AI that
provides cognitive support. So it doesn’t completely automate
the task but it amplifies their cognition and somehow maybe it gives them a list
of recommendations, maybe it removes the need
for the engineer to keep track of a list of
things and it does that for them or does some calculations, or maybe the AI will just
provide some information about system attributes that are quality of the system or again recommendations, or maybe it provides some feedback
on personal productivity. So the AI can be watching what
the engineer or the developer is doing and then give them feedback in terms of their own productivity. At the team level, in terms of
understanding how teams work, now we have to start
taking into consideration how developers and
other stakeholders on the project, how they communicate with each other, how they coordinate their work, how they collaborate on the projects that they
have to collaborate on, and also can the AI also give them awareness of how they’re
collaborating with others, and even help them reflect on
their collaborative aspects. So how does AI do that? Neil actually gave
a pretty good overview of some of the possibilities
particularly in the inner and the outer loop and the different
levels and the different ways that AI can be used and
we’re going to hear more actually from Pam
and Akhmed on that. But how is that AI, that
magical information, how is that communicated to the developers? What
does that look like? Well, it may look like a prompt. It may look like a pop-up box, it may be a list of
things that you might want to choose from
a set of search results, or it may be just
something that’s done automatically and the developer and engineer doesn’t
even have to touch it, but often AI has to be fed
through a user interface and often dashboards are used for doing that at
the individual and team level. So looking at telemetry data, and running AI on telemetry data, that is then brought to
individual developer and stakeholders’ hands
so that they can look at that information
in the dashboards. There’s not actually a lot
of research though on how these dashboards are used
or how they’re designed. So there’s a lack of
research on how that’s done, and I’m not going to talk about
that a lot today but we did write a short book chapter
in that other book that I told you from
the Dodge Tool workshop, and I’ll touch on that
a little bit later on. Another way that we’re seeing AI
making its way into the hands of developers and into the interplay between developers working
together is through bots. Through these
virtual team members and I do want to talk about
these for a few minutes. So what are bots? So this is a definition that my student
came up with for her thesis: software bots are interfaces that connect users with software services. We've seen a lot of bots over the last few days, depending on the breakout sessions that you went to. So basically, a bot is this interface that may provide access to either integrated services, or calculations or algorithms within the bot itself, or it may actually connect to external services and access them through an API. The bot brings those services to the user to use directly.
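As a toy illustration of that definition (command names and the service URL below are invented), a bot is little more than a message handler that either runs a calculation itself or calls out to a service through an API:

```python
# Toy bot: maps chat commands to a built-in calculation or an external API call.
import requests

def handle_message(text: str) -> str:
    if text.startswith("!loc "):                      # built-in calculation
        snippet = text[len("!loc "):]
        return f"{len(snippet.splitlines())} lines of code"
    if text.startswith("!build "):                    # external service behind an API
        job = text.split(maxsplit=1)[1]
        status = requests.get(f"https://ci.example.com/api/jobs/{job}").json()
        return f"build {job}: {status.get('state', 'unknown')}"
    return "commands: !loc <code>, !build <job>"

print(handle_message("!loc def f():\n    return 1"))  # -> "2 lines of code"
```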
Now of course they also bring additional value in terms of the AI, or the intelligence, that's in them, but they may also bring value by way of their personality, the way that the bots
work with the developers, and I’ll talk about
that next a little bit. In development, in
software development, we’re seeing a lot of
different bots popping up. So we see bots; well, Neil was talking about some of
the ways of synthesizing code. So we’re seeing software bots as well providing a way
to create that code, we see test bots that are
working together in the team of developers and detecting bugs and
detecting code quality issues, and then feeding information back to the developers perhaps in
their Teams channel or in Slack Channel and
telling them there are these issues or the bots
may even automatically open up bugs in GitHub and assign
them to particular developers depending on which part of
the code they find the problem in. There’s also DevOps bots that automate deployment and operation
and run the things that were manually done before
and then send messages or communicate with the developer’s
again through these platforms, these communication chat platforms, there are code review bots, so we were talking about those
yesterday that might recommend changes often they’re pretty
simple changes based on the code, things like naming issues. Code-review bots may also
recommend reviewers. There are also bots that support
interaction with product users. So a lot of companies are using
bots to actually directly talk to the users of
their software so that they don’t have to have engineered
time doing that. There are documentation bots
that produce docs from developers artifacts and translate them from
one language to another. There are also interestingly
a lot of entertainment bots. So we heard in one of
the keynotes the other day, if you want to be
productive it’s good to also have fun and it’s
good to take a break. So when we were studying
how developers use bots, we found a lot of
these and we chuckled, but they probably serve a good role, and whenever I teach and I use bots in the stuff that I'm teaching, my students always ask, "Can we have Giphy please?" It's important that they
have these fun things. This last one is one that
I don’t see a lot of this, but we’re starting to see it, and I think we could in the future. We can have researcher box that could study individual and
team productivity. So that’s something that I think
we’ll start to see more of. There’s also this notion of
ChatOps instead of DevOps where the ChatOps is kind of DevOps with this bot that also chats with the different people
in the collaboration. I love this quote. So ChatOps is a collaboration model
that connects people, tools, processes, and automation
into a transparent workflow. This is pretty important because the bot is not just doing
things in the background. So it’s more than just doing scripts, but it’s actually
communicating through the messaging platform with the other developers that they
have issued this command. Other developers will see that
those commands have been issued, so that increases
the level of transparency, but also increases the way that developers learn how
to do DevOps themselves. I just wanted to share with you, this was one study we did with
a local startup company in Victoria and a lot of
start-ups are relying on Bots. Why do you think that
is? Any guesses?>>Cheaper than humans>>Pardon?>>Cheaper than humans.>>It’s cheaper than humans. They want to automate
as much as they can. So they’re are very, very
clever at figuring out which different human roles
they can automate with Bots to make everything
much more efficient. So talking to them,
they even had a Bot, I didn’t put it on this list, that answer the doorbell. So when you rang the doorbell, they would get a message in
Slack that somebody was at the door and they’d see a little
picture of whoever was there. That was cheaper than hiring
somebody to open their door. So they had Bots that basically connected all of the different
things that they needed to do in their product including notifying team members when
errors and exceptions occur. They have Bots that interacted
with their end-users. They even have Bots for
linking their texts and phone. So everything that they were
doing was connected and came through the same platform. So normally, I’m sailing
at this time of year. The summit is always in the middle of when I’m
supposed to be sailing. I was complaining to a couple
of people last week in Slack that I’d rather be sailing and
then preparing a talk for today. So I thought, “I know,
I’ll prepare a talk about sailing instead of
those other things.” So I’m actually going to talk
about how Bots and how AI or actually how AI really and how
data analytics helps sailing, and how maybe we could
learn or take some lessons, take some analogies
from that and maybe how we could apply it to
software development teams. So how many people
here are familiar with the America’s Cup? A few people. Okay. So for those that are not, the America’s Cup is probably the most prestigious
sailing race in the world. It’s been around since
about the 1850s. America or the US actually held
on to the cup for 132 years. It’s now this very contentious race
where there’s only one winner. There’s a quote, I think
from the first race where Queen Victoria
was watching the race. Somebody came in first. America came in first, and she said, “Well, who’s going to be second?” Somebody said to her, “Your Highness, there is no second. There is only first.” So this race is really intense, and countries and syndicates now put a lot of money into
building these boats. They used to be 12 meters
for most of the race. Lately, they have become these
catamarans that are like up to, I think in 2017 they
were 72 feet long. The boats cost 10 million or more. The reason I'm talking about this is not because I wish I was
on a boat right now, is because we can learn a lot about how they use data analytics and how they use that in this team approach to
winning the America’s Cup. So when I was looking at this, I found some articles and I’m sharing just some of
those with you here. I’m hoping we can discuss some of these lessons and see
how we could apply them. So there are several ways that America’s Cup champion sail
like successful IT teams. I’m pulling out some
of these ways or some of these analogies from
this article here. So the first one is,
“Management is important, but building the complete team
as mission critical.” So we saw this a little bit
on my survey that the manager as a factor was very important in terms
of proceed productivity, but the team was as well. So we see this here in sailing. This article talks a lot about
rethinking who is on the team. So you might imagine that if you look at a team on an
America’s Cup boat, that the team refers to
the sailors that are on the team. But it’s not just the sailors, it’s everybody else around it. It’s the engineers,
it’s the training team. In fact, you might even say America’s Cup can be
sometimes won or lost before they even launch
the boat that’s going to race because so much
engineering goes into it. But I also want to put out
there that we can think about these bots that are really a user interface to the AI that we’re trying to build
for developers to use, that we can think of those
also as virtual team members. So this AI that we’re
building or could build can become like
a virtual team member in the team. So that’s just one way. The second way that
they talk about is that winning teams embrace disruption. So the America’s Cup teams
over the years, if they didn’t sort of spy
on what the other teams were doing and see the new technology that they were using,
that they would lose. They just would not
be able to keep up. The Engineering Systems team
needs to respond at the speed of opportunity and in
the preparation for these races. So the more recent
America Cups boats, and I don’t know if you can
see this in this picture, but they’re more like they’re flying than sailing because
they have this technology. The New Zealand team did this first. They had this insight that if they built this what
they call a foil between the two hulls of the catamaran that they would
get lift from the water. As they go over 18 knots, the boat lifts up and it
literally flies over the water. They can get going just crazy speeds, like over 60 kilometers an hour, or even maybe it’s miles an hour. So the boats, as soon as
somebody comes with something, they have to jump at the speed of opportunity to really be able
to work that effectively. So what disruptions have we
seen in Software Development? So Neil touched on a few. What are the big disruptions do you think we’ve seen in
Software Engineering? I look at the people who have
been around a bit longer. I can help you out if you want.>>The whole continuous deployment.>>Yeah, the whole
continuous deployment. What about what enabled that? Yeah, automation, the Cloud,
the Internet, e-mail.>>Can we say containerization?>>Containerization. So these disruptions
are changing the game. They’re changing the speed
at which we can deploy. Another one that I
wanted to just touch on a little bit is the use of social technologies and the use of the Cloud and how developers
communicate with each other. So I’m referring here to
some studies that I did that looked at how developers
particularly in open source, no longer sort of sit in their room
and write their own code, but they actually are part of a big community and they learn
and they help other people. So you have this what’s
called a participatory culture of software development. So a lot of my friends that are
not software engineers will say, “You know, the developers,
they’re so anti-social. They don’t talk to other developers. They just sit in the basement.” I’m like, “No, not today. Developers have to be very social. They have to know how
to use these tools and how to use something
like Stack Overflow.” So imagine not having
Stack Overflow today. So this is another example
of a disruption. I don’t think that we were really aware that it was happening
while it was happening. It was just something that
caught a lot of people off guard and we had to
just rush to keep up. So I think with AI that
we’re going to have, what kind of culture is AI
going to lead to in Teams? So I don’t know, but I think it’s something
we need to think about. Okay. So going back to
the sailing analogies. This one, it’s all about the data. So in this America’s Cup teams, so I’m referring back here, I think this was from the 2013 team. Their boat had 1,000 IoT
sensors on the boat. They were producing
something like 10 Gigabytes of data an hour when they were
sailing, pretty incredible. That foil that I talked about had
300 sensors on it, just on that. So they’re using this data
basically to fine tune, not just the engineering
of the system or the boat. It’s the system actually
that they’re sailing on, but also to fine tune
what the crew are doing. To learn what’s going to
make the boat go faster, learning from the
different sea conditions, learning from the
different wind conditions, and then putting that altogether in models and then using
that to help them win. One of the things they talk
about is that within the team, that every individual
within that team needs their own unique
dashboard so that they can sort of pull out
from that dashboard an understanding of what it is that they are contributing to the project. I think in Software Development as well that we need to think about what unique dashboards do we need to support the different
people on the team. Of course, the need for
explainable AI comes up, but it’s not just
explainable AI to one person, but it’s explainable AI from one stakeholder to
a different stakeholder. So having somebody in the middle
have to do that explanation. Another other thing about data
that they talk about is really important is supporting
post-mortems by the entire team. So one of the things that
they do is they sit down after the race or after
they’ve been practicing. They sit down and they play through everything they did and
they look at the data, and the whole team again is there. They look at what could
we have done differently, what could we have done better. In particular, they did
a post-mortem after this race. Actually, this wasn’t even a race. This was a training session. I don’t know if anybody saw
this in the paper in the Time, but this was Team Oracle for the USA. The sail actually on this boat is
not made of fabric, it’s stiff. You can see the foils
underneath there. So they decided to push, or the skipper, actually
decided to just push it, just a little bit further to
see what can this boat really do and pushed it just
a little bit more. Then, the conditions changed, which happens in life. The boat sank and it’s amazing. I’ve got some links here, it’s fascinating to read. This boat then got dragged out underneath the San
Francisco Golden Gate, and it was going to
be blown out to sea. Anyway, eventually they managed to get it back in, but they lost months. They lost not just millions, but they lost the time before the race to be able to train for
the race because of this. But they do say if something
doesn’t break, it’s too heavy. So they do try to push
what they’re doing. So I’ll come back to that again. So another analogy that I like from this article is that great ideas
come from the front lines. So they don’t design the technology
for these boats and for these sailors that
have to actually sail the boats, in laboratories just. I mean, obviously they do
a lot of the engineering work in the laboratory and they
do a lot to simulations. But they also go out
onto the boats with the sailors and they
watch what they do in race time and also
in training time. So they learn a lot from doing that. I want to push on this, and I know Neil is doing that with his teams, that it's so important to observe how developers are using the AI
on a day-to-day basis, not just individually but as part of the team to understand how
we can improve it more. In terms of that frontline thing, here’s an example of something
that you might learn. On my boat, we have an autohelm that we call Otto. So my autohelm is just basically a mechanical thing. I set a compass direction, or I can connect it to the GPS,
we personify Otto. Actually, this blog post talks
about the same kind of thing. When Otto screws up and we say, “Well, Otto’s cranky today.” Why did Otto screw us up? But we call our GPS, a GPS. I always wondered like, “Why? Why do we give a name
to the autohelm, but everything else is just the GPS or the depth sounder?" This blog post actually shed light on this for me. So the writer of this blog post explains that on their boat, the autohelm, the automation that takes over, is like a crew member. So it is doing something that
decides where the boat’s going to go and actually
affects their life. It’s taking over what
a crew member would do. So this is the kind of thing that you can learn when you go
on the front lines, that maybe some of the AI
should be under the covers. But maybe some of the AI
should be as part of a bot within the channel where
there’s awareness and transparency, and conversation and feedback
and so on happening. Then, the idea that it’s all about just the machine or the data or engineering
winning the race, it’s not. At the end of the day, it does come down to
people but it comes down to people and how they’re
supported by the technology. So it really is this effective integration. This is just the skipper saying, "Yeah, he made the mistake." That led to them losing
one of the races. I just want to also
mention this quote, because I really liked this one. I think some people have seen that cyclical graph
that shows AI versus HCI: every time there's an AI winter, HCI labs go up, and then AI jumps again. The point is that rather than doing this cycle, we need to really think about addressing both of these at the same time, so that HCI doesn't have
to come in and clean up the mess that the AI
researchers leave behind. Okay. I put some discussion
points for the panel, but we can maybe come
back to these later. But I do think that we need
to think a lot more about how AI or AI-enabled bots can support, or even potentially harm, software team collaboration and communication. I haven't talked about
the risks in this talk. I’ve done that in other talks, but there’s lots of risks
and we’ve heard about lots of them at the other breakouts. How can engineering system
and development teams together embrace and evaluate
disruptions from the AI? So on the America’s Cup team, the engineering team and
the sailors are one team. I think that in development, we could think about
that a lot more as well, have a lot more closer
collaboration between engineering team and
the developers themselves. I didn’t talk a lot about dashboards, but I do think that this is
an interesting future piece of work. How these dashboards, these AI-infused dashboards could support tactical decision making, support operations,
and also post-mortems, and how they could be personalized, and how we can study them. By the way I do think
we could use bots to study some of
these things as well. So I pass it to you. I don’t know how long
I rabbited on for.>>Thanks Peggy. Thanks for the talk. So I want to set up Prem's [inaudible], Prem's slides. If anybody has
questions for Peggy? Please, any questions, comment?>>[inaudible].>>Yeah. That’s very
much a work in progress.>>It’s a super
interesting [inaudible].>>It’s a really
interesting question. Yeah. Yeah. I mean,
I think right now, the way that bots are being
used and being designed, it’s very ad hoc.>>Yeah.>>So it’s like, here’s an idea for a bot. Let’s write one and let’s deploy it, and then see how it sticks
and see how it’s being used.>>Yeah.>>I think it could be
useful to take more of an engineering or architecture
or system perspective on the bots and understand
where they play. Really think about,
“Should this be a bot?” Is conversation enabled or
should it just be a command?>>I’m kind of thinking in API terms, when would this thing want to interact like
another human with this?>>[inaudible].>>Yeah. That’s
the really interesting things. Yeah. Really super interesting. Yeah.>>Okay.>>So thanks for the invitation. It’s my pleasure to be here
and talk about this area. So this has taken over my life
for the past eight years or so. Is there a remote? Here it is. So let me just explain what
this term naturalness means, there is some confusion
about this particular term. So the way I like to
explain it is that human speech and writing have evolved over thousands of years to serve certain
natural human purposes. The structure and use of
these human languages, I’ll talk about where they come from, but what we’ve discovered is that the same structures and patterns
of usage also exist in code. So what does that mean? So when you think about
natural language, a good example is this guy here who is about to have a very bad day. Let’s say his children are
on the side of the pool and so how is he going
to react to this. So is he going to sit up and think of some glorious poem and
recite it or is he just simply going to
say, “Get out of here.” So this is the imperative under
which natural languages evolved. They’ve evolved to communicate very efficiently and
quickly in noisy, dangerous, and
distracted environments. Now what does this
have to do with code? Now this fellow here, let’s assume that for
a moment that he’s actually a developer and that person
is actually his manager. So is this chap now going to go off and think of glorious continuations and monads and recursive this and that, or is he just going to write the loop that he wants to write in the simplest and most mundane way he can think of? So a lot of coding gets done under these circumstances, and it's not just simply the coding
as you'll see later; when people think about what kind of code they want to write, it's not just themselves they're thinking of. That also relates to natural language, because when I'm speaking, I'm not just simply thinking about myself and how I want to construct my utterances. I'm also thinking about the listener and how they're going to react to my utterances. So in that sense, speaking is a very cooperative act and a very conscious and mutually
natural means here is that human utterances and human speech are constructed in noisy, dangerous, and distracted environments
and as a result, the way we speak is
highly repetitive, very predictable, and can
be modeled statistically. This is wonderful
because this is what enables things like Google Translate, Speech Recognition, and
other forms of tools that have made all this wonderful advances in natural language processing. So code is a little bit different. So code is permanently
intended to run on machines. So when a programmer
here is writing code, in some sense the end intent is that it actually
executes on a machine. Machines don’t really
care how you write it, it doesn’t matter whether you write i less than 10 or 10 bigger than i or i plus one or one plus
i, it doesn’t really matter. She has a lot of flexibility in
how she chooses to write her code. So far not much call for naturalness. But in fact, the code is
actually maintained by another developer and when
the coder writes code, she is thinking about
who’s going to maintain the code and how the maintainer
is going to read the code. In fact, when the maintainer reads the code, it's really a noisy channel. Maintainers are not computers; maintainers don't do operational semantics or denotational semantics in their heads. They're genuinely reading in a noisy-channel environment, and by this I mean the Shannon noisy channel. When a developer reads code, she's thinking to herself: this is probably the computation that the developer intended to write. The maintainer hypothesizes, and then she asks, "If the developer were meaning to implement this computation, what's the most likely way she would have implemented it?" So I'm really recapturing the Bayesian formulation of the noisy channel model. She's going to guess how she would have implemented it and look for those bits in the code, and if those bits are not there, she's going to say the developer must have been doing something else. She's going to hypothesize a different computation that the developer may have intended, and she's going to think about how that would have been implemented. So it really is a noisy channel: you're not computing semantics directly from the program using an operational channel, you're guessing meanings. That's how people read code. Anyway, I can't prove this, but this is just my hypothesis.
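One way to write that Bayesian reading down (a paraphrase of the noisy-channel idea for illustration, not a formula quoted from any particular paper): the maintainer observes code $c$ and infers the most plausible intended computation $\hat{m}$ as

$$\hat{m} \;=\; \arg\max_{m} P(m \mid c) \;=\; \arg\max_{m} P(c \mid m)\,P(m),$$

where $P(c \mid m)$ captures how a developer would most likely have written computation $m$, and $P(m)$ is a prior over computations that are plausible in that context.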
This formulation we call the DEU model, for Davis, Edinburgh, and UCL, because it came out of discussions between people at UC Davis, Edinburgh, and UCL; Charles started it at Edinburgh, and Earl Barr at UCL. We don't quite have a name for this formulation: sometimes we call it bi-modal comprehension, sometimes we call it two-channel comprehension. We're not sure yet, but the cool thing is that the second channel here really is a noisy channel, and there are a lot of interesting questions that come out of this. So, because of this noisy channel that exists in code, and because reading it is not based on formal operational semantics, software as it is used is repetitive, predictable, and amenable to statistical modeling, because the same imperatives that apply to the construction of natural language utterances apply to the construction of code. So what? It's taken over
the last eight years of my life and there are
two aspects to this. The first aspect, which I'm not going to say a lot about because I got the sense that I should really talk more about the engineering aspects in this forum, is the scientific side; I spend more than half my life these days on the scientific aspects of this. We do a lot of human subjects studies, we're doing a lot of corpus studies, and we've done some eye-tracking studies. So how does naturalness correlate with human preference and performance? People never say "butter and bread," for example; they always say "bread and butter." Nobody ever writes i equals one plus i. Those things are related to the noisy channel: you expect a certain computation to be written a certain way, and if you write it a different way, it actually impedes comprehension and impedes the smooth, easy reading of code. So we're doing a lot of human subjects studies in this area. We just finished a Mechanical Turk study with 70 participants, trying to figure out whether we can predict which forms of writing code will be preferred by human beings, and we can. The next step for us is to see with what accuracy we can predict how people would prefer to see code being written, and then after that we're going to do some code comprehension
studies to see if we can predict
which code is going to be easier for human beings
to understand and which kind of code is going to be
harder for human beings to read. This work is being done in
collaboration with Emily Morgan, who is a psycholinguist at UC Davis. This is very exciting for us because she has been studying forms of expression in natural language using a theory called rational speech act theory, and we're now trying to apply that same theory to code. But that's all I'm going to say about it; I'm happy to discuss it more if anybody's interested in the panel discussion. So the other part of my life is trying to explore
the engineering question. How do you exploit code
repetitiveness to help programmers? So the first paper
on this was written about eight years ago in our group, and we exploited some interesting properties to build some applications. I'll talk more about the applications for
the rest of this talk. The basic general scheme, as Neil mentioned, is that this is sort of the way people used to build tools: you think of a tool, the developer decides she needs this tool, so she goes off and writes it, and then the tool can process source code and produce results. A lot of tools fit into this framework. The twist here is that, because we now have this property of naturalness, you can take a large code corpus, estimate various kinds of statistical models, and these models can then improve the performance of the tool. So this is the framework for a lot of the tools, and the details depend upon the model you want to build, the data you have, and how the model improves the performance of the tool. There are lots of applications. The first application that we did, back in 2011, was code suggestion: you estimate a model like that, and there are various ways to do the models. We've switched over to completely using neural network-based models these days, but there are lots of ways to estimate these models, and obviously there's lots of data. There are quite a few papers that have described this. We have something on GitHub you're welcome to use; it's called SLP-Core, and a lot of people are using it. Until some very recent work from Edinburgh, our model was the best-performing model.
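Here is a toy sketch of what such a code-suggestion model does at its simplest (this is not SLP-Core or any shipped system; real models add smoothing, caching or nesting, or neural components):

```python
from collections import defaultdict, Counter

class TrigramSuggester:
    """Toy token-level trigram model for code suggestion.

    Train it on a tokenized code corpus; given the two preceding tokens,
    it suggests the most likely next tokens.
    """
    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, tokens):
        # Count how often each token follows each pair of preceding tokens.
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.counts[(a, b)][c] += 1

    def suggest(self, prev2, prev1, k=3):
        return [tok for tok, _ in self.counts[(prev2, prev1)].most_common(k)]

# Toy corpus: the model learns that "i <" is usually followed by the bound "n".
tokens = "for ( int i = 0 ; i < n ; i ++ )".split() * 100
model = TrigramSuggester()
model.train(tokens)
print(model.suggest("i", "<"))   # likely ['n']
```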
One very interesting problem in code that doesn't occur in natural language is vocabulary explosion. In natural language, as you scan more and more text, the vocabulary starts to grow slower and slower, and eventually it saturates; it's only place names and people's names that keep growing. But every new module of code introduces new vocabulary, so there's a real vocabulary problem. Until very recently, deep-learning models couldn't handle it: you have to cut the vocabulary, or the number of parameters to train becomes unmanageable. The recent work from [inaudible] shows how to deal with vocabulary explosion in deep learning models using something called byte-pair encoding, which is a way of splitting up identifiers, and it does this quite efficiently.
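Here is a rough sketch of the byte-pair-encoding idea applied to identifiers: repeatedly merge the most frequent adjacent symbol pair, so common subwords become single units and the vocabulary stays bounded even as new identifiers appear (a toy illustration, not the implementation from that work):

```python
from collections import Counter

def learn_bpe_merges(tokens, num_merges):
    """Learn byte-pair-encoding merges over a toy corpus of identifiers.

    tokens: list of identifier strings (e.g. drawn from a code corpus).
    Returns a list of merge operations (pairs of symbols), most frequent first.
    """
    # Represent each identifier as a tuple of single characters plus an end marker.
    vocab = Counter(tuple(t) + ("</w>",) for t in tokens)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Re-segment every identifier using the new merge.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: identifiers share subwords like "Count" and "Index", so subword
# units keep the vocabulary bounded even as new names keep appearing.
corpus = ["getCount", "setCount", "rowCount", "rowIndex", "colIndex"] * 10
print(learn_bpe_merges(corpus, num_merges=20)[:5])
```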
So this was, I guess, the first application that was done. It was great to see it shipped and apparently being used by millions of users as part of IntelliCode. The other big thing that has come up in this area is JavaScript
“de-obfuscation”. I put “de-obfuscation” in quotes because it’s not
really de-obfuscation, it’s basically replacing
dumb names with better names. A lot of people minify JavaScript code and ship it, and you can recover the names from it. So this is basically estimating a model of this sort: estimating clear code from minified code. By the way, all these models, this one and that one, need data for estimation, and for all of these
you have huge amounts of data. It’s not really a problem. Once you have a tool
that does minification, you can produce as much data
as you want to do this task. Again, there are various models. The first one along
these lines was a model based on conditional random fields
from Raychev et al at ETH. We’ve done some work
along these lines using phrase-based translation, which works in a complementary way to the CRF work from Raychev et al. If you put the two together, you get much better performance. The same technology is now being used for recovering identifier names in decompiled binaries, so it's a pretty useful thing. The other exciting application
is gradual typing. Gradual typing is a framework that offers developers a compromise between Java-like languages, where everything has to be declared, and Python-like languages, where nothing has to be declared. It's a way to add declarations to suit yourself, to look for errors in places where you think you might make typing errors. So a model of this kind estimates a distribution: the type, given the name and the context. A lot of people have used the context; more recently, this year, there's been some work on using the name of the variable to estimate this distribution, and there's a bunch of work in this area. We've done work on this, there's been work from ETH, and Michael Pradel, now at Stuttgart, has also done work in this area. So these are some of the emerging early applications of this, the first applications. There's a lot of work to be done along these lines.
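As a rough illustration of that P(type | name, context) framing, here is a toy sketch; it is not how any of the systems just mentioned actually work, and the training triples are invented for the example:

```python
from collections import defaultdict, Counter

class NaiveTypeSuggester:
    """Toy sketch of suggesting a type from a variable's name and usage context.

    Counts (name-subword, type) and (context-token, type) co-occurrences and
    combines them naively; real work uses far richer statistical models.
    """
    def __init__(self):
        self.by_name = defaultdict(Counter)
        self.by_ctx = defaultdict(Counter)

    def train(self, examples):
        # examples: iterable of (variable_name, context_tokens, type) triples
        for name, ctx, ty in examples:
            for part in name.lower().split("_"):
                self.by_name[part][ty] += 1
            for tok in ctx:
                self.by_ctx[tok][ty] += 1

    def suggest(self, name, ctx):
        scores = Counter()
        for part in name.lower().split("_"):
            scores.update(self.by_name[part])
        for tok in ctx:
            scores.update(self.by_ctx[tok])
        return [t for t, _ in scores.most_common(3)]

train = [
    ("user_count", ["+", "1"], "int"),
    ("item_count", ["len"], "int"),
    ("first_name", ["strip"], "str"),
    ("last_name", ["lower"], "str"),
]
model = NaiveTypeSuggester()
model.train(train)
print(model.suggest("error_count", ["+", "1"]))  # likely suggests 'int' first
```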
So this is the simplest model you can think of, and it turns out it's actually quite useful, and that's for checking code. You can simply estimate a model over a large corpus, and if the model says the code is weird, it turns out this is actually quite accurate at finding defects. This was a paper at ICSE 2016. Oddly enough, and it's not a fair comparison, but oddly enough, this is about as good as many things like FindBugs. Of course, FindBugs tells you what it thinks is wrong; this doesn't tell you anything except that the code looks weird. So this is the simplest possible thing, but it turns out it is already arguably useful.
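Here is a minimal sketch of that "flag code the model finds surprising" idea, using a toy bigram model over whitespace-tokenized lines (the actual 2016 work used far more careful models and evaluation):

```python
import math
from collections import defaultdict, Counter

class BigramEntropyChecker:
    """Toy sketch: flag code lines that a simple language model finds surprising.

    Train a bigram model on 'normal' code, then score new lines by average
    cross-entropy; unusually high-entropy lines are candidates to inspect.
    """
    def __init__(self, alpha=0.1):
        self.bigrams = defaultdict(Counter)
        self.vocab = set()
        self.alpha = alpha  # add-alpha smoothing for unseen pairs

    def train(self, lines):
        for line in lines:
            toks = ["<s>"] + line.split()
            self.vocab.update(toks)
            for a, b in zip(toks, toks[1:]):
                self.bigrams[a][b] += 1

    def entropy(self, line):
        toks = ["<s>"] + line.split()
        total, n = 0.0, 0
        V = max(len(self.vocab), 1)
        for a, b in zip(toks, toks[1:]):
            c = self.bigrams[a]
            p = (c[b] + self.alpha) / (sum(c.values()) + self.alpha * V)
            total -= math.log2(p)
            n += 1
        return total / max(n, 1)

corpus = ["if ( x == null ) return ;"] * 50 + ["for ( int i = 0 ; i < n ; i ++ )"] * 50
checker = BigramEntropyChecker()
checker.train(corpus)
# The unusual variant should score as more 'surprising' than the common idiom.
print(checker.entropy("if ( x == null ) return ;"))
print(checker.entropy("if ( null == x ) continue ;"))
```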
This was the suggestion thing that I talked about earlier. Variable name recovery can use this too: again, you can estimate this with any large corpus, and you can use it either to recover names, as in the context of de-obfuscation, or to check whether you're using an improper variable name. This is the gradual typing problem again; I talked about this, and again it can be learned from large amounts of data. So far, these are what Neil was talking about with his inner-loop stuff. There's also more outer-loop stuff that relates not to immediate coding but to more process-oriented things. Interestingly, there's lots of data along these lines,
and there’s been a number of papers coming out recently
that allow you to repair code. There have been a few different experiments using some standard datasets to see how well machine learning models can patch code. Most of the existing work on automatic code patching has been using genetic approaches: in other words, you have a hill-climbing search, searching over a lot of possible patches, trying to find something that can actually patch your defect. The new approaches involve no search, or very little search, so they're much more efficient; basically, you train a translation model using large amounts of data from GitHub. Typically, the datasets use one-line patches, and they train a translator to translate from old code to new code using sequence-to-sequence models or transformer models. Most of our work now has been using transformers. They are not as powerful and successful as the genetic approaches, but they're much quicker. So this is interesting stuff to do. The models can be trained using commit data, or you can simply do things like de-noising auto-encoding, and so on. We've done some experiments with large datasets of student programs, and de-noising auto-encoders can correct about 50-60 percent of student errors; those are syntax errors. So this is another interesting area of research.
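To make the de-noising side of that concrete, here is a minimal sketch of generating (corrupted, original) training pairs by dropping a token from correct snippets; a sequence-to-sequence model could then be trained to invert the corruption. It is only an illustration; real pipelines also mine one-line patches from commit histories, as described above:

```python
import random

def make_denoising_pairs(snippets, seed=0):
    """Toy data generation for a de-noising repair model.

    Takes correct code snippets, injects a simple synthetic 'typo'
    (dropping one token), and returns (corrupted, original) pairs.
    """
    rng = random.Random(seed)
    pairs = []
    for snippet in snippets:
        toks = snippet.split()
        if len(toks) < 2:
            continue
        drop = rng.randrange(len(toks))
        corrupted = " ".join(toks[:drop] + toks[drop + 1:])
        pairs.append((corrupted, snippet))
    return pairs

for noisy, clean in make_denoising_pairs(["if ( x > 0 ) { y = x ; }"]):
    print("train input :", noisy)
    print("train target:", clean)
```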
So this is essentially producing English from code, and there are some interesting issues going on here. There are useful comments and there are useless comments. The comments that are needed by somebody who is unfamiliar with the code are one thing; the comments that are needed by people who are familiar with the code are something else altogether. So in other words, if the code is written the way
you expect it to be written, then this comment is most useful to people who
are unfamiliar with it. But sometimes, the code is not
written the way you expect, and then you really need comments to explain what the code is doing. So in other words, in summarization and commenting, there are two different things going on. Charles [inaudible] has some very interesting work where he tries to say that if the English, the comment, is predictable from the code, then it's probably a useless comment for most people. So there's some interesting stuff going on here, and we don't quite know how to classify the two kinds of comments. A comment that literally says "add one to i", who cares? Whereas something more complex, like "re-sync the lock" or something like that, is not what you expect i = i + 1 to be doing. So that's an interesting question. Then there's code retrieval. I mean, we're really far away
from the singularity of all this. This is mostly about, for example, using Stack Overflow data so that you give some English description
and you can find the code. There’s some very
interesting problems here. We’ve just finished
a code snippet parser and typer. So we can type and parse
code snippets with very high accuracy so that you can take fragments
in Stack Overflow, and you know what type the bits are and you know what syntax it is. So we can do it with, well,
over 95 percent accuracy. So you need that because if you’re trying to paste
some code from Stack Overflow, retrieve some code and paste it, you need to be able to
parse it and type it. Then finally, given some code, recommend a person and this is
for various kinds of task assignment. This is very interesting: we've now managed to train language models specific to individual developers in a project. You take a general language model, trained over a large corpus, and then specialize that model for each developer. So given a code fragment, we can score each developer to know how familiar that developer is with that code fragment.
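A minimal sketch of that per-developer familiarity scoring, using toy unigram models in place of the fully specialized language models described above (developer names and token corpora are invented for illustration):

```python
import math
from collections import Counter

def familiarity(fragment_tokens, developer_corpora):
    """Score how 'familiar' each developer is with a code fragment.

    developer_corpora maps developer name -> list of tokens from code they wrote.
    Each developer gets an add-one-smoothed unigram model; the fragment's average
    log-probability under that model is the familiarity score (higher = better).
    """
    scores = {}
    for dev, corpus in developer_corpora.items():
        counts = Counter(corpus)
        total = len(corpus)
        vocab = len(counts) + 1
        logprob = 0.0
        for tok in fragment_tokens:
            p = (counts[tok] + 1) / (total + vocab)
            logprob += math.log(p)
        scores[dev] = logprob / max(len(fragment_tokens), 1)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

devs = {
    "alice": "socket bind listen accept recv send close".split() * 20,
    "bob":   "widget layout render paint resize click".split() * 20,
}
print(familiarity("accept recv send".split(), devs))  # alice should rank first
```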
We welcome collaboration on this. We're trying to apply it to recommend code reviewers, but I think there are lots of other potential applications. Let's say, for example, you get a warning on a piece of code. You don't necessarily want to ask the person who implemented the code, or the person who last changed it, to comment on that, because that person may have committed to that module but may not have touched it in a long time. What you'd like to know is who has written code like this, and language models are pretty good at spotting that. So we welcome partners
on this because it’s hard to do
that particular experiment, whether you're recommending the right person to fix a warning, using open-source code. So if anybody's interested in that, we'd be happy to work with you on this. These are just some examples; there's a lot more stuff. We're doing some work to determine the validity of invariants. I think there's a lot
of opportunities. So anyway, these are all different
types of probabilistic models. How exactly do you approximate these probabilities from datasets? There are a lot of ways to do that, so there are many different models. There are discrete traditional models: n-gram variants, and tree-based models like PCFGs and TSGs, that is, probabilistic context-free grammars and tree substitution grammars; phrase-based translation models have also been used, and there are conditional random fields. Of course, a lot of attention is being paid to deep learning models: lexical models, sequence-to-sequence models, and sequence tagging models. The difference between those last two is that in a sequence tagging model, the input and output lengths are identical. In part-of-speech tagging, for instance, or when you want to assign types to variables and you guess the types of variables, the input and output sizes are equivalent. These models generally tend to perform better because the length is conserved. Transformers are
becoming very powerful. We’ve had a lot of success with them. We hope to have some papers
coming out soon with them. They’re very easy to train, they’re very efficient, they
have enormous capacity. So they're pretty promising. Then there are Gated Graph Neural Networks. These models are very promising and powerful, but they're very slow. There's some recent work at Google where they've managed to speed this up quite a bit using data structure layouts. We're also doing some experiments trying to find ways to speed up the training of Gated Graph Networks. So there are lots of
ways to approximate these functions that I showed
you in the previous slide. It’s all about finding
enough data to train it. So there are various issues. With this, I’m going to
stop with this slide. These are some of the issues that we face in code. One big problem in code is vocabulary proliferation. BPE seems very promising, and we're getting good results with that as well. We've been able to replicate the results from that recent work, and in some cases even improve on them. So maybe that is
the solution, I don’t know. Another big problem
is explainability. I actually learned a few things at this faculty seminar that I'm really eager to go back and try. When you suggest a patch to a programmer, suggest a type in a gradual typing environment, suggest a change, or just give a normal code suggestion, it would be nice to have some explanation of why you're doing that, especially with patching, and also when saying "this code looks bad, you should fix it." It would be nice to have some examples. I think there are ways to deal with this, but I think this is a really exciting and interesting open area. Finally, I think that in some sense, the most exciting thing about
code is the fact that it has both operational semantics
and noisy channel going on. So you’re writing code for
the computer and you’re writing code to actually be
read by a human being. So there's a probabilistic
side to code and then there’s a deterministic
formal side to code. So how do you exploit
these two things together? I think some of
the most exciting work in this area is going
to come out of that. So I'll stop there. Questions? Have I gone over, or? Sorry.>>So, the last speaker. Hopefully you can stay for a few minutes longer so that he can get through his talk. He has some really interesting things to say related to data and some of the work he did with the companies he worked with. Is that correct?>>Yeah, thanks so much. We will probably go at least five minutes into the break, so hopefully you can stay for that.>>I'm going to try to do
this talk in two minutes. So basically this is my lab. We’re the Software Analysis
and Intelligence Lab. What I want to do today is talk primarily about our experience of actually having machine learning models that software developers have been using. Some of these systems have been in use for the last 10 years, so it's really a decade of systems that use machine learning on software engineering data, and our experience is more about how developers actually found them. Now, a quick overview. Software developers
produce quite a bit of data. So things like code changes,
release notes, bug reports, e-mail
discussions, code reviews. We call these development repositories. But users produce a lot of data as well: things like crashes, logs, telemetry, reviews. We call these field repositories. Another type of data is what other people produce: online code, Stack Overflow, GitHub. All that information today unfortunately flows only one way. The data doesn't feed back into
your next decision-making process. The simplest analogy is Amazon: Amazon tells you that when people buy A, they also buy B. You could do the same thing for software developers: whenever people change A, they also change B.
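A toy sketch of that co-change idea, counting which files change together across a commit history and reporting a naive confidence for "change A, also change B" (file names are hypothetical; real miners use proper association-rule algorithms over full histories):

```python
from collections import Counter
from itertools import combinations

def cochange_rules(commits, min_support=2):
    """Toy co-change mining: 'whenever people change A, they also change B'.

    commits: list of sets of files changed together in one commit.
    Returns confidence(A -> B) for pairs co-changed at least min_support times.
    """
    pair_counts = Counter()
    file_counts = Counter()
    for files in commits:
        file_counts.update(files)
        for a, b in combinations(sorted(files), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    rules = {}
    for (a, b), n in pair_counts.items():
        if n >= min_support:
            rules[(a, b)] = n / file_counts[a]   # P(change B | change A)
    return rules

history = [
    {"parser.c", "parser.h"},
    {"parser.c", "parser.h", "lexer.c"},
    {"lexer.c"},
]
print(cochange_rules(history))  # e.g. parser.c -> parser.h with confidence 1.0
```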
Now, this is an area that has been going for many years; it's called the Mining Software Repositories community. The Mining Software Repositories community has now been around 20 years, and it's actually considered one of the top 10 venues for systems research by Google Scholar. So there are many, many people who have been doing work on this; I'm just giving you a flavor of what I'm doing. But if you're interested in this, do check out this community. So as I said, many companies
actually use this today in practice. These are some of them. Now, what we really identified is: there's data, there's a model, and what we care about is two things. We care about a decision, and we care about insight. The decision is where you want to say, okay, is this piece of code buggy or not? That's binary. But actually, for the developers and managers we talked to, they don't care as much about that one; they care about how to improve the process in the future. That's where the insight comes in. For them, the insight is more important than the decision for this particular bug, because the decision will only fix this specific one. So now I'm going to give
you some flavors of that. So ideally, what you want is
when a new code change comes in, you want to be able to flag it. So red, yellow, green. So red means there is something
really bad about the code change. Yellow, this looks
a bit worrisome, guys. Green is: go ahead, commit that. Now, we were lucky that we worked together with BlackBerry on this project, and what they had is that, for every code change, their developers would manually rank the change as high risk or low risk. This was done on the device software for the BlackBerrys. Now, I want to emphasize something we learned: they didn't care about buggy or not buggy. They actually cared about risk, which is a much broader notion than just buggy or not buggy. For example, a change might be super simple, you updated the UI, but for some reason that change touches a piece of code that requires the whole code base to be recertified by the carriers, so it pushed the whole release
but it’s a high-risk change. So now the idea is we have
all these changes here, we have risk classifications
so wouldn’t it be nice that actually we
will learn from this, build a model so now when
a new change comes in, we can actually predict that. That’s exactly what we did. So we took one year of
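A minimal sketch of learning such a risk model from a handful of change metrics; the features, data values, and use of scikit-learn here are illustrative assumptions, not what the BlackBerry study or its tooling actually used:

```python
# Each row describes one code change with simple, illustrative metrics.
from sklearn.linear_model import LogisticRegression

# [lines_added, lines_deleted, files_touched, author_recent_changes]
X = [
    [500, 120, 14, 2],   # large, sprawling change by an infrequent contributor
    [3,   1,   1,  40],  # tiny change by a frequent contributor
    [220, 40,  6,  5],
    [10,  2,   1,  25],
]
y = [1, 0, 1, 0]         # 1 = judged high risk, 0 = judged low risk

model = LogisticRegression().fit(X, y)
new_change = [[300, 80, 9, 3]]
print(model.predict_proba(new_change)[0][1])  # estimated probability of high risk
```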
So we took one year of changes, and we ran them past both the human developers and the model, and then compared. One axis is the risk the developers assigned, the other is the risk the system assigned, and the correlation is 0.84. Correlations go from 0 to 1: zero means it's random, one means it's perfect. So we actually got a good correlation here. Now, this was a study where the student was there for a year, and he looked at around 450 developers doing this across 60 teams. Now, an interesting thing
we learned out of this is what they did when they started using it: they didn't depend only on the automated system, they actually depended on both. They treated the two as different experts. So if the developer says a change is high risk and our system says low risk, it's still treated as high risk, and the other way around: if we say it's low risk and the developer says it's high risk, it's high risk. So this is an example where it's not about the system being right and wanting to replace the human; it's more about supporting the human. Now, this type of work has actually become an open-source project called commit.guru, which you can go to and upload your repository; it will analyze your code base and give you that type of analysis. A lot of companies have adopted this as well. Ubisoft does that, and many other companies now have developed this in-house. Ubisoft did it independently from my team, and many other companies have done that. Okay. So this is what
I call the inner loop. Now, for the outer loop, I'm going to look at testing, very large-scale testing. Think of Amazon: they don't only want to make sure that one user can buy a book, they want to make sure that a million people could buy a book and nothing would go wrong. Now, the problem today is that for a lot of these tests, the way they're verified as having passed is: did anything crash? Nothing crashed, we're good. Well, that's probably not
the best way to go about it. So we wanted to actually
leverage some of the data that’s produced
from these tests. The simple idea is that you have a sequence of events, say logs in this case, so we do a lot of log analytics: for example, acquiring a lock is followed by releasing a lock. So we look at what's happening in the test, and because it's a test, we expect it to be very repetitive. If we see any deviations from that repetitiveness, we can flag them. Now, the beauty of this is that we, and actually most developers, don't exactly know how the system runs at scale. But by recovering that information from the logs, we can flag: look, this event E6 happened and it shouldn't have.
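A minimal sketch of that idea: learn which event usually follows which from many repetitive test runs, and flag rare transitions (event names are invented; the real tooling works on raw log lines at very large scale):

```python
from collections import defaultdict, Counter

def transition_stats(runs):
    """Count, across test runs, which event follows which.

    runs: list of event sequences (one per test run). Returns, for each event,
    the distribution of the next event, so rare transitions can be flagged.
    """
    follows = defaultdict(Counter)
    for run in runs:
        for a, b in zip(run, run[1:]):
            follows[a][b] += 1
    return follows

def rare_transitions(follows, threshold=0.05):
    flagged = []
    for a, nexts in follows.items():
        total = sum(nexts.values())
        for b, count in nexts.items():
            if count / total < threshold:
                flagged.append((a, b, count, total))
    return flagged

# 99 normal runs release the lock after acquiring it; one run does not.
runs = [["acquire_lock", "release_lock", "commit"]] * 99
runs.append(["acquire_lock", "commit"])
stats = transition_stats(runs)
print(rare_transitions(stats))   # flags the acquire_lock -> commit transition
```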
Now, the output of this is something like this. It shows this log message here. The Dell DVD Store is an application used by Dell to benchmark their servers, so it's basically like a DVD store where you enter a purchase. Ninety-eight percent of the time, this log message is followed by this log message; one percent of the time, it's followed by this other message; and the sequence was seen 358 times. Then all these sequences are sorted, so the highest one here, the one I expanded, shows you the weirdest sequence that we have seen. Then, from the tests, you can copy and paste this HTML document, e-mail it to a developer, and ask them, "Okay, what's happening in here?" This achieved a 99.9 percent reduction in the log files to be viewed, and the precision is quite high, 56 to a hundred percent. Now, what was nice about this is
we also give them an example of exactly that log sequence that went wrong. Now, you can take that same log sequencing and start adding time to it. We know events A, B, C, D happened together; now we can say the sequence took, say, five seconds. We've seen that so many times that we can create a distribution. On the left side is one run, and on the right side is another run. As you can see from the red here, this run is actually a bit slower compared to the other one. Then here, we lay it out over time again. What this is saying is that, over time, the system might be getting worse, or in this case, it's always bad here. This was an example in MySQL, actually; there was a bug in MySQL. So this actually
eventually got fixed. These are some examples of actually
how we have used this data. Some of these systems have actually been in use for the last 10 years. So the question is: what's the secret for long-term industrial adoption? Is it highly scalable, top-performing models? The answer is actually, not really, and I'm going to spend the next slide, which is actually the last slide, explaining at least my thoughts on what really makes some of these things work. One of the things that's special about what we have is that we have a human in the loop. For a lot of the decisions we give them, the person has to go to somebody up in management and say, for example, "We're not going to do this release." It's very hard to go and say, "I cannot do this release because this deep learner is saying no." So they really need to have something that supports their decision, because otherwise the manager will say, get out of my room. So one of the big challenges here is how you make that work. Now, there are two things; what I would say are
domain challenges and some of them are actually Machine
learning challenges. I’m going to go through
them one at a time. One of the tricks I think we found is that it's really important that the decision, or whatever the machine learning system gives you, is assignable to somebody. An example of that is the system that would say, "Look, this is a buggy change." What was good about that is two things. First, we had a specific person, the developer who made the change, to whom we could say, "This is your problem, deal with it." That was very different from prior work in the area, which [inaudible], where they say, okay, before the release, analyze all your code changes and flag buggy files and non-buggy files; the question is, okay, you know this file is super buggy, so who do you give it to? Whereas with this one, it's your problem, deal [inaudible]. So that was one thing. The other thing is being timely. Again, I'm going to use that same example just for time, but it applies to the other ones too: we did it right there. Because it happened inside the IDE, you had the files open, you could deal with it now. But when you say, look, this file had 50 people who worked on it, and some of them worked on it a couple of months ago, that's troublesome; it was much harder to do that. Now, the machine learning challenges. Explainability is a key thing, and the reason we want
the models to be explainable is that a lot of the developers and managers care not only about this release; they care about the long term. If there's something we're doing badly, we want to be able to flag it and improve our processes, and this is why explainability was a key thing. Now, another one that I've recently started to realize matters is this idea of trust. Let's say you have a lot of raw data, and over the years, developers, who are smart, have developed their own scripts and their own, I would say, non-AI models, just some warning signals that they know about. So they have a script that runs over all the raw data, and if that script says something is bad, they trust that system. Now you have two ways to go. You can go with the deep learning idea, which is to just take all the raw data, or you can go the simple way and take the, what I would say, trustable dumb models. Which one would you go with? Intuitively, you would say, well, let's take all the raw data and get the human out of the loop. But actually, it is much easier to get the system adopted the other way. At least in our experience, we say, look, this is based on a combination of all your basic models; you trusted all of these, and we just built on top of your trust. So that was one essential thing. The other thing is
this maintainability aspect. In a lot of corporations, the machine learning team is one or two big teams, and these teams have to go help one group and then move on to the next, more of a consulting setup. So the idea is that you want to be able to set your models up and leave, not get a phone call every two days: "Hey, can you come and help us tweak the parameters?" So this is an essential part as well. So those are my big views on
have been used for many years. This system has been used for almost 10 years now
when I think about it, and we’re never really called
in to tweak the models on that. So with that in mind, this is basically what
I talked about today. So I talked, I introduced
to you about this idea of the mining software
repository community, and how does so much data today, and you can actually look at
the data, and produce it, and reuse it to actually
make your next decision. I give you an example of how
you could use the data about prior changes to actually detect what are changes, high-risk or low-risk. I give you an example
as well how to use log so that outer loop
aspect to actually detect performance
problems and assist them by actually mining these logs
which are rarely ever used. Then I walked through very
rapidly of some of the reasons, why I think it’s more
essential to focus on these over just blindly
the performance of the model itself. So that’s it. What is
the time? Ten minutes. You’re five minutes before
the [inaudible]. I apologize.