Will Asana comment on what caused the extended outage Monday 4/9/18?


#1

Transparency about the outage was fantastic.

But…

What caused it?
What is being done to prevent it from happening again?


#2

Any word on this?


#3

That would be great indeed! cc @johnnygoodnow @Alexis


#4

We’re working on writing a blog post about this now. It should provide more context into why the outage happened, how we recovered, and what steps we’re taking to keep it from happening again.

Hopefully we’ll have it out the door soon, so stay tuned.


#5

Looking forward to it.


#6

Another outage today… and still no official comment on the April 9 extended outage… #concerned


#7

Hi @Michael_B! I’m sorry to hear you’re concerned. As Daniel mentioned, our team is working on putting together more information about this, and we hope to have it out publicly soon. Unfortunately I don’t have a timeline on that. What I can tell you is that, because we at Asana use the product all day every day, we feel your pain when something less than ideal happens with it. That also means that when something unexpected happens, our team jumps on it as quickly as they can, communicating via both Asana and Slack to collaborate and solve issues efficiently. I’m seeing those conversations on Slack now, and I can assure you the team is doing everything they can to take care of our awesome customers like you! Feel free to reach out if you have other questions.


#8

Thanks @Alexis for your reply. It’s nice to hear that there is internal conversation ongoing. We look forward to customer-facing transparency regarding the April 9 and today’s April 24 outages: both the problem and the plan going forward.


#9

I’m catching up on Asana Community posts, and wanted to make sure everyone was aware that we did indeed publish this blog post. The post is (to my eyes) a tad technical and doesn’t cover in much depth the background of how things are put together, so here are some context bullet points:

  • Asana was originally built on a JavaScript framework for both client and server (which enabled a “write the code once and run it on both client and server in order to maintain the same state for both” philosophy). Over the past couple of years we’ve migrated to separate client and server implementations (which is how Asana has gotten much faster!), but some parts of the app are still served by a JavaScript back end. The main component initially affected by this outage was the “page load server”, that is, the part of Asana that gives your browser the bundle of JavaScript it runs.
  • Since Asana was first built before Node.js became the clear winner for standalone JS applications, we’ve switched back and forth between several implementations of a JavaScript interpreter. Because these have varied in implementation completeness and have sometimes had memory-management issues, we’ve built our webservers to serve a certain number of incoming requests and then terminate themselves, letting the OS clean up the memory.
  • The *nix operating system call to create a new process is called “fork”. It essentially takes the currently running process and splits it into two: the continuing-to-run parent process and a new child that gets its own state. This is a fast operation because, under the hood, the operating system shares as much memory as possible and only creates new memory pages for state that differs between child and parent, using copy-on-write.
  • Spinning up a new JavaScript server is, in contrast, relatively slow - tens of seconds. If we started a brand-new process every time we needed a new copy of our server, some percentage of page loads (when you first go to asana.com) would take a long time instead of a fraction of a second (the time between when your request first hits asana.com and when you start to see the loading spinner).
  • To avoid that delay, we create one server, let it load everything and get ready right up to the point where it would start handling incoming requests - and then make it wait forever. We then fork this preloaded process to create the new servers that actually serve requests. We call this process the “zygote”.
  • The “main” or “master” database is one of the few single points of failure for Asana. It contains as little state as possible - only the things that absolutely must be completely shared or completely unique between users and across domains. Among these is the allocation of object IDs, the primary keys for all the pieces of information stored in Asana. Our data model is based on an object-key-value data store, so these IDs are used for many, many things in Asana - each task, project, comment, user, etc. has an ID. We are (and have been) working on isolating this database more and more over time to protect it from overloading, but we aren’t completely there yet.
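For the curious, the preload-then-fork-then-recycle pattern described in the bullets above can be sketched in a few lines of Python. This is a toy stand-in, not our actual JavaScript stack; names like `expensive_init` and the request counts are made up for illustration:

```python
import os

MAX_REQUESTS = 3  # recycle each worker after this many requests

def expensive_init():
    # Stand-in for the slow (tens of seconds) server start-up: load code,
    # warm caches, etc. The result is shared with children via copy-on-write.
    return {"bundle": "<javascript payload>"}

def handle_request(state, n):
    return "request %d served with %s" % (n, state["bundle"])

def zygote(num_workers=2):
    state = expensive_init()  # done ONCE, before any fork
    for _ in range(num_workers):
        pid = os.fork()  # fast: memory is shared copy-on-write with the parent
        if pid == 0:
            # Child: serve a fixed number of requests, then exit so the OS
            # reclaims any memory the interpreter leaked.
            for n in range(MAX_REQUESTS):
                handle_request(state, n)
            os._exit(0)
    # Parent: in production the zygote waits forever and forks replacements
    # on demand; here we just reap the children and return their exit statuses.
    return [os.wait()[1] for _ in range(num_workers)]
```

Real preforking servers (gunicorn, Apache’s prefork MPM, Android’s app zygote) follow the same general shape.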

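To make the “object-key-value” bullet concrete, here is a toy sketch in Python. All names here are hypothetical, and it oversimplifies: in Asana’s real architecture only the ID allocation lives in the master DB, with the values stored elsewhere.

```python
import itertools

class ToyOKVStore:
    # Toy object-key-value store. The single global ID counter plays the
    # role the "master" DB plays for Asana: every task, project, comment,
    # user, etc. draws its primary key from one shared sequence.
    def __init__(self):
        self._ids = itertools.count(1)
        self._data = {}  # maps (object_id, key) -> value

    def create(self, **attrs):
        oid = next(self._ids)  # this allocation is the single point of failure
        for key, value in attrs.items():
            self._data[(oid, key)] = value
        return oid

    def get(self, oid, key):
        return self._data[(oid, key)]

store = ToyOKVStore()
task_id = store.create(name="Write the outage post-mortem", completed=False)
```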
The tl;dr of what went wrong is that a small change had big consequences: it left us unable to handle load well (we were failing to create the “zygote” process on our webservers). The fix for this was pretty fast - roll back to an earlier version of Asana (we keep older versions on standby for quick fixes when something seems wrong). But once we fixed that, all Asana clients tried to reconnect at the same time, during peak load on Monday morning, overloading our master DB. Recovering required bringing up a new copy of that machine with more CPU and memory, which was very slow and accounts for the long duration of the downtime.
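As an aside, the standard client-side mitigation for that kind of reconnect stampede is randomized exponential backoff. Here is a minimal sketch of the “full jitter” variant - a general technique, not a statement about what Asana’s clients actually do:

```python
import random

def reconnect_delay(attempt, base=1.0, cap=60.0):
    # "Full jitter" exponential backoff: each client sleeps a random amount
    # in [0, min(cap, base * 2**attempt)], so reconnects after an outage
    # spread out over time instead of arriving in one synchronized wave.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

With parameters like these, a client’s third retry waits somewhere between 0 and 8 seconds, and later retries are capped at a minute, so a fleet of clients ramps load back up gradually instead of all at once.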