“Performance is a feature” — Interview with Marco Cecconi, Stack Overflow

JUG.ru Group
Dec 1, 2016

--

For many years the performance of .NET applications has been a hot topic. One of the earliest articles dates from 2001 and gave developers many useful patterns.

The topic was still relevant ten years later, and people on Stack Overflow still wonder about the best profiler tool ever. What does all this mean for modern .NET development? We decided to ask Marco Cecconi.

Marco Cecconi is a software engineer at Stack Overflow in London. Marco writes about software development, coding, architecture and team leadership. He also speaks at conferences worldwide.

Q You work for Stack Overflow. What are the main “pain” points in your solution in terms of performance?

There are two main pain points. On one hand, we have to be very careful about the number of object instances we create and the impact we have on garbage collection. On the other hand, we have to be very careful about how we use SQL Server: how we write SQL queries, the way we build tables and so on.

These are the two things we care most deeply about, and they make the most impact on performance at the moment.

Q Is your solution built on .NET technology, or do you have some parts in other languages like C++, Java, Python, etc.?

I would say we’re 99% C#, although we do have some bits in C++ or in C, but they’re really, really minor in terms of number of lines of code. Of course we have TypeScript and JavaScript. We use JavaScript on the server in order to compile JavaScript into bundles, minify it and so on. And we use SQL of course, that’s another language. That’s about it.

Q Why did you decide to start the project in C# rather than in some other language or technology?

OK, can I turn the question around a bit? I think the most interesting part of that question is why we are still using these technologies today. Stack Overflow has existed for 8 years and went from maybe a hundred thousand page views per day in the beginning to billions of page views per month right now, and we’re still using C#. So why haven’t we switched away from it? The answer is not some form of twisted loyalty to Microsoft; it’s a language with a very good runtime that satisfies our needs, and we really don’t see a reason to invest time in switching to something else. It works more than well enough for us right now.

Q So you have some extra capacity to serve any spikes in visitors coming to Stack Overflow some day?

At the moment we’re at 5% capacity. We can handle 20 times the load, at which point we would suffer; we don’t really want to be at 100% all the time. But right now the load hovers between 5 and 10%, so we can handle traffic spikes easily.

Q Do we really want to go on with this interview because you guys seem not to face any performance issues at all (laughing)?

Oh no, we have plenty of performance issues. But you know, optimizing performance is not something that you only do when you’re dying because your CPU is at 80% all day. There’s a reason why we’re at 5% [capacity] — because we keep on optimizing, it’s not something that just happens, it’s the result of continuous performance work.

Q Okay, have you ever experienced really severe performance problems? Do you remember any particular events, in the market or in your milestones, that caused delays or outages in your solution?

No, at that level it never happened. The reason is that we have a great amount of headroom. So even if we get DDoS’d now, we’re still up. Of course it’s not possible to always be so sure about things, but I have the impression that the first thing that would fill up in case of a super-large DDoS would be our internet connection. At that point, there’s really, really little we can do.

There are events for which we can actually see spikes in the logs. When Pokémon Go came out, there was a huge bump in all the graphs for a few days. The US election produced a valley: significantly fewer people came to the site the day after the election.

We see stuff like that, but we’re talking about bumps of less than 100%.

Q Let’s move to the tools section. What’s your favourite tool for identifying the bottlenecks in your code?

We have our own tool called MiniProfiler. It’s an open-source tool, you can find it on GitHub. It works on .NET and Ruby. In .NET it works with “using” statements. We use those “using” statements to identify a section of code that we want to time. So if you want to time a call, you can wrap it in a “using” statement, and by doing that we’re able to create a profile of the execution of a request. For example, we have a timer that starts when we start processing the request, and the same timer ends when we finish that request. There are timers wrapping each database call, each Redis call, each Elasticsearch call. There are also timers around the different sections of each page, so we can tell apart, for instance, rendering the list of questions on the homepage from rendering the footer or the header.
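To make the idea of timing a section by wrapping it concrete, here is a minimal sketch in Python, where a `with` block plays the role of the C# “using” statement. It is not MiniProfiler’s API: the `RequestProfile` class, the `step` method and the step names are all hypothetical, and the `time.sleep` calls merely stand in for real database and rendering work.

```python
import time
from contextlib import contextmanager


class RequestProfile:
    """Collects nested, named timings for one request.

    Hypothetical helper for illustration only; not MiniProfiler's API.
    """

    def __init__(self):
        self.timings = []  # one dict per step, in the order the steps started
        self._depth = 0

    @contextmanager
    def step(self, name):
        # Entering the block starts the timer; leaving it (even through an
        # exception) stops it, much like a C# "using" statement.
        entry = {"depth": self._depth, "name": name, "ms": None}
        self.timings.append(entry)
        self._depth += 1
        start = time.perf_counter()
        try:
            yield entry
        finally:
            entry["ms"] = (time.perf_counter() - start) * 1000.0
            self._depth -= 1


def handle_request(profile):
    # One timer around the whole request, nested timers around its parts.
    with profile.step("request"):
        with profile.step("sql: load questions"):
            time.sleep(0.002)  # stand-in for a database call
        with profile.step("render: question list"):
            time.sleep(0.001)  # stand-in for template rendering


profile = RequestProfile()
handle_request(profile)
for entry in profile.timings:
    print(f"{'  ' * entry['depth']}{entry['name']}: {entry['ms']:.1f} ms")
```

Because the inner steps run inside the outer one, the time recorded for “request” is always at least the sum of its children, which is what makes a per-request breakdown possible.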

Basically the tool works by injecting these results as a reference into response headers. So if you’re a developer, you can see a small box at the top right of your page that contains the timing of the page, and by clicking that box you can see the exact breakdown of what happened during the rendering of that particular page.

This is the tool we use for identifying specific problems, so if I’m navigating and there’s a page that takes a long time, I can see exactly what happened.

We also take a subset of that data and store it together with the logs, so we can run SQL queries against our logs and create statistics on which pages perform better or worse.
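As a rough illustration of what running SQL against timing logs can look like, here is a small Python sketch using an in-memory SQLite database. The `request_log` table, its columns and the sample rows are invented for this example; the interview does not describe Stack Overflow’s actual log schema.

```python
import sqlite3

# In-memory database with a hypothetical log table: one row per request.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE request_log (route TEXT, duration_ms REAL)")
conn.executemany(
    "INSERT INTO request_log (route, duration_ms) VALUES (?, ?)",
    [
        # Invented sample data, not real measurements.
        ("/questions", 18.0),
        ("/questions", 22.0),
        ("/users", 35.0),
        ("/users", 45.0),
        ("/tags", 9.0),
    ],
)

# Per-route statistics, slowest average first.
slowest_first = conn.execute(
    """
    SELECT route,
           COUNT(*)         AS hits,
           AVG(duration_ms) AS avg_ms,
           MAX(duration_ms) AS max_ms
    FROM request_log
    GROUP BY route
    ORDER BY avg_ms DESC
    """
).fetchall()

for route, hits, avg_ms, max_ms in slowest_first:
    print(f"{route}: {hits} hits, avg {avg_ms:.1f} ms, max {max_ms:.1f} ms")
```

With this sample data, `/users` comes out on top with an average of 40 ms, so it would be the first page to investigate.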

Q How do you monitor your solution? Does it help you to identify the bottlenecks?

Another tool that we have is called Bosun. It’s a time-series alerting system. Basically, it monitors some indicators that we choose, for instance memory or our number of allocations, and it raises an alarm if these change significantly.
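The idea of alerting when an indicator changes significantly can be sketched in a few lines of Python. This is not how Bosun is configured or implemented; the `changed_significantly` function, the 50% threshold and the sample values are assumptions made purely for illustration.

```python
from statistics import mean


def changed_significantly(history, latest, threshold=0.5):
    """Return True if `latest` deviates from the mean of `history`
    by more than `threshold` (0.5 means 50%), in either direction.

    Hypothetical rule for illustration; real alerting systems offer
    far richer expressions than a single relative-change check.
    """
    baseline = mean(history)
    if baseline == 0:
        # Any non-zero reading is a change from a flat-zero baseline.
        return latest != 0
    return abs(latest - baseline) / abs(baseline) > threshold


# Invented sample series: allocations per second over recent intervals.
allocations_per_sec = [1000, 1040, 980, 1010]

print(changed_significantly(allocations_per_sec, 1100))  # small drift
print(changed_significantly(allocations_per_sec, 2500))  # large jump
```

Here the baseline is 1007.5, so a reading of 1100 is within the threshold and does not alert, while 2500 is well outside it and does.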

We have another system that monitors every SQL call. Basically, it runs statistics on what the database is doing, so we know exactly which queries take the most time, and if there’s anything wrong we can go in and see exactly what’s happening.

We have other monitoring tools that always monitor the server’s basic parameters, like memory and CPU. We have a tool to monitor all the exceptions happening on all the servers, all the projects.

Q Is it a tool like Telegraf?

We use a tool called Opserver, a tool that we built for this purpose, and it’s open-source.

Q The next question is about your approach to performance optimization. What do you do with code that “smells”?

Allocations usually smell bad, but they are not always easy to find. Sometimes the only way to find out is to take memory dumps of IIS and see what’s in them and why it takes up so much memory, because allocations can happen anywhere. They can happen in your own code, in a library that we added and that you call, or in the .NET libraries. For instance, StringBuilders started adding one allocation in their constructor, and we certainly noticed that, because our memory got filled up by partial strings.

“StringBuilders started adding one allocation in their constructor, and we have certainly noticed that.”

Sometimes allocations happen when you use LINQ; LINQ is terrible for allocations. Sometimes it’s optimized not to allocate and sometimes it isn’t, so it’s really hard to say when objects get created and when they don’t. It’s a bit complicated.

But this is pretty much what we always look at: the number of instances we create.

Q You have mentioned the special tools you use within Stack Overflow. Can you please say a couple of words about benchmarking?

We don’t really benchmark stuff all that often in lab conditions. Ultimately what is important is how the stuff performs when you put it in production. So production is our benchmark. We just put it out there and see if it works or not.

“production is our benchmark”

To give you an idea of why that’s important, just think about SQL queries. Let’s say you put in a badly performing SQL query. You may not see that particular query as the source of the problem when you put it out, because it could be blocking other queries, and those are the ones that look slow. When you’re actually running in production, the consequences of what you’re doing are sometimes not very obvious, so in general it’s very, very hard for us to create lab situations that actually replicate the environment and would let us run performance analysis safely outside of production.

Q Sometimes optimized code is ugly and full of hacks. How do you work with optimized code?

Sometimes optimizations are hacks or ugly things like unrolling loops, but in most cases optimizations are actually not things that make the code uglier. In fact, I would maintain that in many cases performant code is much more beautiful.

“…in many cases performant code is much more beautiful.”

What I’m going to say is simplistic, but I can assure you it’s true: if you write less code, it’s going to be faster. We literally write less code, so our code is small and compact. There’s very, very little ceremony in our code. In the same class you can find everything from HTML, unfortunately, down to SQL. It’s very compact, and everything is right next to everything else. This actually makes it much more readable, because you have everything at your fingertips and you don’t have to refer to other parts of the code all the time.

Q But more compact code is usually less readable and clear, isn’t it?

Well, more compact code is not necessarily clearer or more maintainable, but if your code only relies on itself and doesn’t have dependencies, then it’s actually more readable and much more maintainable. I think that’s one of the major things: our code is very self-reliant. Think about this: say you’re writing a mock project, a very, very small thing, just a test case or something very compact, on the order of a couple of hundred lines of code. Maybe you fire up LINQPad and put it there, very compact, with everything in there. That code is going to be very clear to you. There are a hundred lines of code, you can read them all, you can keep them in your mind, and you have no doubts about how they work, because they are all there. And this is what we strive for, so each feature is very, very compact and self-contained.

“so each [Stack Overflow’s, ed.] feature is very very compact and self-contained”

Of course there are other features that are not, and the code becomes a bit more complicated, but that’s unavoidable. However, I feel that 90% of the time I’m not really working on complex code; I’m working on very, very simple stuff with few moving parts, and that’s actually easy. So in the talk that I’m going to give about performance, I’m going to show code examples of how we achieve this, and you will see that the code is extremely compact and extremely simple, even though it’s super-high performance.

Q I just realized that not all performance issues are in your own code. There are also the runtime, the hardware and third-party libraries underneath. So you have to either rely on them or know how to deal with them. What do you think about that?

We know about it very well. So our hardware is all designed by us. Of course we don’t design the motherboards, but all the specs are curated by our own SRE team — how much RAM, what models, even which brand of power strips we want to use. Everything there is specified by us, that’s why we don’t use any cloud service, we’re all self-hosted, and we actually want to control everything. Building machines that work very very well for a specific load that we have is a good part of our performance.

“yes, we know about the problem [existing performance bugs in third-party libraries], that’s why we write so many open source libraries”

And regarding third-party libraries: yes, we know about the problem, and that’s why we write so many open source libraries, because in many cases our needs are very different from the general public’s needs. We tend to rewrite libraries all the time. For example, we have our own caching libraries, our own Redis client, our own protobuf implementation, our own JSON serializer, and so on.

Q Do you have your own C# implementation, by the way?

We do have a version of a Roslyn-based compiler, mostly to compile Razor templates, but we only use it for localization. We use it to do a specific thing that we need, not to extend the language. We use the vanilla language itself. The reason is that if you modify the language, it breaks Visual Studio and everything else, so we don’t really want to do that.

Q Sounds really cool. Bearing in mind all the issues related to performance optimization, are there any signals that tell a developer to start optimizing the code, or not to start at all?

Always start straight away, optimize the code all the time. Performance is a feature, lack of performance is a bug.

“Always start straight away, optimize the code all the time. Performance is a feature, lack of performance is a bug.”

Q It’s a philosophical question: IT is all about money. Someone could ask you how much time you will spend on performance optimization, and you could give an estimate, say 40 hours. During those 40 hours you could either implement some end-user feature, or you could spend them on optimization that may turn out not to be really necessary. It’s a kind of trade-off.

I disagree with that. If you have a bug in production, wouldn’t you fix it?

Interviewer: Yeah, I would fix it.

Right, cause it’s the same thing. Well, lack of performance is a bug. So you need to fix it, and it’s very very simple.

Interviewer: And how do you detect that a production system has a bug?

You measure it; you measure performance all the time. When you see that performance is lacking, when you see a problem, you solve it. Of course you wouldn’t do performance optimization as a wild goose chase, going around saying, “Oh, I’m going to spend a week trying to improve performance.” That’s not how we do it. We measure stuff. We see something broken, something that is really not performing well, and we fix it. In our particular case it becomes trivial to see. It’s not trivial to fix, but it’s trivial to decide. When things break in terms of performance, when the page goes from 20 milliseconds to 2 seconds, you need to fix it; you can’t have users waiting 2 seconds for a page. Or maybe some other companies can afford that, but we’re not going to do that.

“…when the page goes from 20 milliseconds to 2 seconds, you need to fix it; you can’t have users waiting 2 seconds for a page. Or maybe some other companies can afford that, but we’re not going to do that.”

Q Do you use any kind of hardware optimization like multithreading, hyper-threading, GPU calculations?

We do. In one particular case we use CUDA to make some heavily parallelized calculations. We don’t use parallelism a lot; it is mostly for things like the build, where we need to process a bunch of files and try to go as multi-process or multithreaded as possible. In terms of the actual code, we use “async” to skip over all the external calls, like database calls, but I don’t know if we can really call that multithreaded. Let me think… In most cases the best strategy we have on the web servers is to be as fast as possible, not to spread out over different cores, because we’re always competing against a certain number of other requests coming from other clients. The utilization of the CPU is probably best left to IIS, which spreads the requests out over different CPU cores.

For libraries or application servers in the backend, on the other hand, parallelism makes much more sense: there are far fewer requests, so it becomes better if we can use more cores per request. In fact, CUDA came out of that. We saw that increasing the level of parallelism increased performance, so we said “let’s try this”. The other talk that I’m going to give is actually about this.

Q Can you tell us more about the “NOT A ROBOT” badge?

Stack Overflow is a community of developers, and we reward developers for helping us by giving them a score called reputation and by giving them badges. These are meant to encourage people to do things that we think are positive for the community. For example, if you ask a great question, you will get a badge; if you give a great answer, you will get a badge. Along these lines we have added a badge called “NOT A ROBOT” that we award to people who come and interact with our speakers or other representatives at conferences. The reason is that we’ve been speaking for a long time and have been very, very active on the conference circuit in the last few years, and we’ve noticed that in many cases people are shy. Maybe it’s because developers are introverted, or because they don’t think we’re available, or because they think Stack Overflow is some sort of alien, and so they don’t come and talk to us. So we built this badge to encourage people. If you come to talk to us, we give you a code, you can use that code to redeem the badge on the site, and you get a silver badge, which is relatively rare, and this is the only way to get it.

We want to give them away in Helsinki and Moscow at the DotNext conferences.

Q What would you recommend to all the .NET and C# developers?

“It’s important not only to do your job professionally, but also to do it with passion. It’s important to cultivate your passions; it’s not enough, it’s not okay, if someone just does their job and doesn’t care.”

To do what makes them passionate. It’s very important that development is not just a job. Development is also a creative endeavour because, unlike building a car, say, being a developer means solving a different problem every day. It means doing something different, exploring stuff you don’t know all the time. It’s important not only to do your job professionally, but also to do it with passion. It’s important to cultivate your passions; it’s not enough, it’s not okay, if someone just does their job and doesn’t care. They should be thinking about what would make them happy, and do that.

— — —

As you can see, application performance is an actively evolving topic and remains important for many .NET developers. For reference, there is an online course by Sasha Goldshtein on Pluralsight. Marco will also give a talk at DotNext Helsinki in just a few days.

More details about DotNext Helsinki, its program, speakers and registration are available on the official website.

We kindly thank Microsoft for the provided venue.
