Is Go Good for Parallel Programming?

Go is a language designed for better concurrency. Although concurrency is not really parallelism, Go's approach also makes it suitable for parallel programming. But just how suitable is it in terms of performance? Out of curiosity, I've done 2 simple experiments to assess the performance of parallel processing in Go.

How parallel programming in Go works

Before getting to the experiments, let's take a look at Go's approach to parallelism. Go has Goroutines - concurrently executing functions multiplexed across system threads. By default, a Go program uses only 1 system thread for all of its Goroutines, but that number can easily be changed by setting GOMAXPROCS.

Goroutines synchronize with each other via channels, a concept adopted from Communicating Sequential Processes (CSP). This resembles message passing in frameworks like MPI and is the key behind parallel programming in Go. However, unlike MPI, which follows a distributed-memory model, all Goroutines of a program share the same memory space. This makes channel communication more lightweight but also makes the language less suitable for distributed environments.

1st Experiment: Partitioning

This experiment sets out to measure the speedup obtained by adding more processors to a parallel computation in Go. We'll work on a simple theoretical problem: counting from 0 to N, where N is a large number. We're not interested in the actual execution time but in the speedup with respect to the number of processors. The code is as follows:

func count(start uint64, end uint64) uint64 {...}

func main() {
  // np: number of processors, N: problem size (defined elsewhere)
  c := make(chan uint64, np)
  var result uint64 = 0
  ...// start timing
  // Partition the range [0, N) evenly among np processors
  for i := 0; i < np-1; i++ {
    var start uint64 = uint64(i) * N / uint64(np)
    var end uint64 = uint64(i+1) * N / uint64(np)
    go func(c chan uint64, start uint64, end uint64) {
      c <- count(start, end)
    }(c, start, end)
  }

  // Main Goroutine does the last partition and collects the results
  result += count(uint64(np-1)*N/uint64(np), N)
  for i := 0; i < np-1; i++ {
    result += <-c
  }
  ...// end timing
}

The code was run on a 2.8GHz Intel processor with 16 CPUs. I only measured the speedup up to 8 CPUs for various problem sizes. Below is the result:


This shows that given a large enough problem size to be partitioned, the Go runtime can make full use of the available processors to achieve nearly perfect speedup.

2nd Experiment: Communication

This experiment compares the communication overhead of Go and MPI. The program simulates a master-slave architecture in which one process sends out computation tasks to its workers. This model is prominent in systems where the computation is not known beforehand (e.g. a server serving real-time requests). The performance of the system depends on how consistently the master performs as the number of parallel workers increases. In the implementation, the total amount of work is kept constant as the number of processors increases, and performance is evaluated based on the speedup attained.


func main() {
  work_c, done_c, finish := make(chan int, np-1), make(chan int, np-1), make(chan int)
  n := 100 // amount of work to be sent out

  // Master Goroutine
  go func() {
    ...// start timing
    for i := 0; i < np-1; i++ {
      work_c <- 1
    }
    for {
      <-done_c
      n--
      if n > 0 {
        work_c <- 1
      } else {
        finish <- 1
        break
      }
    }
    ...// end timing
  }()

  // Slave Goroutines
  for i := 0; i < np-1; i++ {
    go func(id int) {
      for {
        <-work_c
        time.Sleep(100 * time.Millisecond) // simulate computation
        done_c <- 1
      }
    }(i)
  }

  <-finish
}


#include <mpi.h>
#include <time.h>

#define WORK_TAG 0
#define TERM_TAG 1

int main(int argc, char* argv[]) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  int n = 100, buf = 1;

  if (rank == 0) {
    // Master process
    ...// start timing
    for (int i = 1; i < size; i++) {
      MPI_Send(&buf, 1, MPI_INT, i, WORK_TAG, MPI_COMM_WORLD);
    }
    while (1) {
      MPI_Status status;
      MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, WORK_TAG, MPI_COMM_WORLD, &status);
      int slave_id = status.MPI_SOURCE;
      n--;
      if (n > 0) {
        MPI_Send(&buf, 1, MPI_INT, slave_id, WORK_TAG, MPI_COMM_WORLD);
      } else {
        break;
      }
    }
    ...// end timing
    // Send termination signal to slaves
    for (int i = 1; i < size; i++) {
      MPI_Send(&buf, 1, MPI_INT, i, TERM_TAG, MPI_COMM_WORLD);
    }
  } else {
    // Slave process
    while (1) {
      MPI_Status status;
      MPI_Recv(&buf, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
      if (status.MPI_TAG == WORK_TAG) {
        struct timespec tim, tim2;
        tim.tv_sec = 0;
        tim.tv_nsec = 100 * 1000000L; // simulate computation (100ms)
        nanosleep(&tim, &tim2);
        MPI_Send(&buf, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD);
      } else {
        break;
      }
    }
  }

  MPI_Finalize();
  return 0;
}

The result obtained by both implementations can be seen in the following chart:


According to the measurements, Go achieves much better speedup than MPI. This may be because Goroutines and channel communication are more lightweight than message passing in MPI (just a hypothesis).


The 2 experiments, although simple, show that Go is a great candidate for parallel computing. The Go runtime is able to allocate OS threads efficiently to achieve nearly perfect speedup in traditional partitioning problems. Not only that, Go also performs well in systems that require heavy interaction among processes. Therefore, for people about to learn parallel programming, Go can be a very good starting point. If you're interested in more complex parallel problems in Go, this is one of the papers on the topic I've found.

What Computer Science Can Teach You

I've just officially graduated. My 4 years in university passed by in a blink of an eye. I suffered a great deal of pain during the last 4 years, but also learned a lot. For anyone still pursuing or thinking about pursuing a degree in CS, here's what you can gain after going through the same journey I did.

CS teaches you how to build things

A few weeks ago, I wrote a simple application that pulled Facebook feed data and filtered out the useful pieces of information I really wanted. I also set up Adium - an open source messaging client - as the primary Facebook messenger via XMPP/Jabber. I never had to visit Facebook for more than 30 minutes per week ever since.

It is just a tiny example of the mentality that we CS students are taught: fix things that are broken and build things that are needed. Using Facebook had become very distracting for me, and yet I couldn't just delete my account because parts of it were still useful. So I built my own Facebook experience, making it work the way I wanted.

We CS people are given a super power: understanding technology and using it to our best advantage. If you find something broken, fix it. If you see something missing, build it. That is how we're taught and what makes us different from others. It doesn't need to be the next Facebook or Google. Just start small by building something for your own use and slowly make an impact on the people around you. The world is run by computers. Computers are made by computer scientists and software and hardware engineers. So in some sense, we're the people capable of running the world and making it better.

CS teaches you how to make compromises

Most CS graduates will end up making software for a living at some point in their lives. A piece of software cannot run on its own but in an ecosystem of hardware architectures, operating systems, compilers, browsers, third-party libraries…, each of which has room for error. It's almost impossible to get a software product completely right and working flawlessly.

Therefore, the art of making software is largely the art of making compromises. When writing code, programmers constantly have to make hard trade-off decisions, specifically between security, performance, maintainability, and business value. A system with additional security layers is often slower. A low-budget application developed in a short period of time is often harder to maintain in the long run. It's the same in life. We all have to make compromises and trade-offs at some point, and CS students are no strangers to the concept.

CS teaches you various areas of computing

Some of the best programmers I've worked with are self-taught, which makes me wonder what a CS degree is truly worth nowadays. It is not that hard to learn programming and be awesome at it, so why bother going to school? After completing my degree, I've somewhat found the answer: CS-trained programmers find it easier to branch out to different work areas.

CS teaches the fundamentals, from programming methodologies to distributed systems, algorithm design, and many many more. With that range of knowledge, CS-trained programmers can potentially fit into multiple fields in the industry. A front-end engineer can later become a kernel developer or make games for a living as long as she has the required knowledge to do so. That doesn't mean self-taught programmers cannot achieve the same thing, but with the CS fundamentals it's definitely much easier.

CS teaches you how to make yourself persuasive

Theoretical computer science is among my favorite topics. Even though I didn't do well in the respective courses, I surely learned a lot. The thing I will never forget is Prof Hifeng's saying: "If you think you're right, prove it. It's not my job to prove you're wrong". It signifies the importance of proofs in theoretical CS. A solution that works in some cases isn't correct until it's shown to work in all situations.

And this is not only true in CS. I've learned that in order to be persuasive, I need to *prove* that what I'm saying is correct. In many cases a rigorous proof is not possible, but it's good practice to think about the reasons behind every statement made. Statements like "Python is better than Ruby" need far more convincing reasons than just "I like it more". To be persuasive, first define the criteria for comparison, then show how Python is superior, and finally cite your references. It's hard for people to disagree with you when you present yourself this way.

CS teaches you how to not care about grades

Grades are an important part of university life. It's good to have good grades, but only if you don't have to kill yourself for them. Grades don't come naturally for some people. There are people like me who need a lot more effort to maintain a reasonable transcript, which reduces the amount of time they can spend on more exciting things. In CS, there are many more opportunities to explore, more things to learn, and more projects to work on. If good grades are expensive, it's better to spend your time doing other things: doing internships, getting involved in open source communities, or even building your own startup, to name just a few.

CS teaches you to make a difference

It's true that nowadays CS graduates can find high-paying jobs more easily than students in other fields. But it's not just about the paycheck. As mentioned above, you're given a super power of fixing and building things. Use that power for the betterment of others. Build tools to help yourself, friends and family live easier lives. Start small and then potentially grow into something big. Even if you don't make money out of what you build, at least someone will find it useful.

Javascript As a Compile Target: A Performance Breakthrough

Things are getting more and more interesting in the front-end community. Whenever we think Javascript has reached its limit, something comes out that pushes it to the next level. Mozilla has been working on a couple of interesting research projects that could redefine the performance of the web. Emscripten is one of them. It is a compiler that compiles native languages like C/C++ into highly performant Javascript code. The format of the compiled Javascript is ASM.JS - recently regarded as the Assembly of the Web.

Why compile to Javascript

The browser can only run Javascript - that's a hard truth that will probably never change. Even though Javascript is a fairly fast dynamically typed language, its performance is still not good enough for things like graphics-intensive games. The language's dynamic nature is the main reason for the performance drawbacks, specifically:

  • Type inference: Modern Javascript engines infer types at runtime in order to generate efficient machine instructions and memory layouts. For example, Javascript numbers are all 64-bit floating point, but the just-in-time compiler may attempt to infer a narrower type like 32-bit integer (or more precisely 31-bit signed integer) to speed up runtime memory access. This increases JIT compilation time, resulting in slower application startup.
  • Deoptimization/recompilations: Besides type inference, JS engines also do other optimizations involving type guessing and variable caching. But due to the dynamic nature of the language, variable types may change and caches can be invalidated at any point. When that happens, the engine needs to deoptimize and sometimes even recompile to generate better Assembly.
  • Garbage collection: Garbage collection blocks. The more garbage to collect, the slower it is.
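As a hypothetical illustration of the second point, a function that suddenly sees new operand types can invalidate the engine's specialized code (function and call pattern below are made up for illustration):

```javascript
// add() is first called only with numbers, so the JIT can
// specialize it for fast numeric addition...
function add(a, b) { return a + b; }

for (var i = 0; i < 10000; i++) { add(i, i + 1); }

// ...but a later call with strings changes the observed types,
// which can force the engine to deoptimize and recompile a
// more generic version of the function.
console.log(add("a", "b")); // "ab"
```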

Emscripten sets out to address those drawbacks. It compiles native code into highly optimized Javascript, so that Javascript engines don't need to do just-in-time compilation and optimization. With an ASM-aware Javascript engine like OdinMonkey, the optimized Javascript can be compiled ahead of time and executed directly. The compiled code can even be cached to minimize subsequent startup time.

How does it work

A compiler front-end like Clang compiles native C/C++ code into LLVM bitcode. Emscripten takes the bitcode and turns it into Javascript instead of machine instructions. The default format for the compiled Javascript is ASM.JS. The main idea behind ASM is that it uses typed arrays as virtual memory. Typed arrays are a set of classes designed for working with raw binary data. There are a few pre-defined array types like Int8Array, Int16Array, Float32Array, Float64Array… Together they make up the virtual heap for every compiled ASM application. Specifically, every generated ASM file contains this piece of code to initialize the virtual memory:

var buffer = new ArrayBuffer(TOTAL_MEMORY);
HEAP8 = new Int8Array(buffer);
HEAP16 = new Int16Array(buffer);
HEAP32 = new Int32Array(buffer);
HEAPU8 = new Uint8Array(buffer);
HEAPU16 = new Uint16Array(buffer);
HEAPU32 = new Uint32Array(buffer);
HEAPF32 = new Float32Array(buffer);
HEAPF64 = new Float64Array(buffer);
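All these views alias the same underlying buffer, which is what lets typed arrays behave like raw memory. A small standalone sketch (independent of any actual Emscripten output):

```javascript
var buffer = new ArrayBuffer(16);
var HEAP8 = new Int8Array(buffer);
var HEAP32 = new Int32Array(buffer);

// Writing a 32-bit word is visible byte-by-byte through HEAP8,
// exactly like a store to memory followed by byte loads:
HEAP32[0] = 0x01020304;
console.log(HEAP8[0]); // 4 on little-endian machines
```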

Now let's look at what happens when compiling the below piece of C++ code using Emscripten:

#include <cstdio>

int main() {
  printf("hello, world!\n");
  return 1;
}
The generated JS file is 2000 lines long. Most of it is internal ASM modules. Here are a couple of interesting parts directly related to the C++ code above:

First is initial memory initialization:

allocate([104,101,108,108,111,44,32,119,111,114,108,100,33,10,0,0], "i8", ALLOC_NONE, Runtime.GLOBAL_BASE);

The allocate() method puts an array of data (the character array in this case) of a certain type into the memory heap. The type of this sequence is 8-bit integer, which corresponds to the Int8Array view. ALLOC_NONE tells the method not to allocate on the memory stack just yet. Then in the main function:

function _main() {
  var $1 = 0, $vararg_buffer = 0, $vararg_lifetime_bitcast = 0, label = 0, sp = 0;
  sp = STACKTOP;
  $vararg_buffer = sp;
  $vararg_lifetime_bitcast = $vararg_buffer;
  $1 = 0;
  // ... call to _printf with the string's heap address and $vararg_buffer ...
  STACKTOP = sp;
  return 1;
}

It calls _printf() with a pointer to the beginning of the string and a pointer to the argument list residing on the stack. STACKTOP is the pointer to the current top of the stack in the virtual memory. The _printf function formats the output, writes the result onto the stack, and then to stdout. After the call finishes, stackRestore() is called to restore the stack's top pointer to its previous position. This makes sure stack memory only lasts for 1 execution context and will be overwritten in subsequent contexts.

function _fprintf(stream, format, varargs) {
  // int fprintf(FILE *restrict stream, const char *restrict format, ...);
  var result = __formatString(format, varargs);
  var stack = Runtime.stackSave();
  var ret = _fwrite(allocate(result, 'i8', ALLOC_STACK), 1, result.length, stream);
  Runtime.stackRestore(stack);
  return ret;
}

function _printf(format, varargs) {
  // int printf(const char *restrict format, ...);
  var stdout = HEAP32[((_stdout)>>2)];
  return _fprintf(stdout, format, varargs);
}
This is just a glimpse of what goes on behind the scenes of ASM. There are so many internal modules and libraries included in the generated JS file that it's impossible to go through them completely. You can find the entire specification for the language here. The spec is not fully implemented yet, but the performance results so far are very promising.

What does the performance look like

According to benchmarking by Mozilla, ASM code running on OdinMonkey is about 2x slower than native code, which is comparable to Java and C#. This is expected to improve to around 70% of native speed after optimizing for float32 operations instead of 64-bit doubles. The result is as follows (lower is better):


Other current Javascript engines can also run ASM code, but they still need to run it through the interpreter and JIT compiler. With such a large amount of generated code, performance in that case is not as good. It's unlikely that Chrome's V8 will optimize for ASM anytime soon. Therefore, Firefox and Mozilla are slightly ahead in the web performance race.

The future

Game programmers will probably benefit the most from Emscripten and ASM. Already, native games written in a subset of OpenGL can be ported to the browser with little additional effort. You can find some demos here.

As for web developers, I don't think these technologies will have a huge impact. Normal Javascript is already fast enough, and if correctly written, 99% of web applications can run as smoothly as their native counterparts. But who knows - with such a powerful tool at their disposal, creative developers may come up with all sorts of crazy things. Maybe we'll see a new generation of highly complex interactive websites that are impossible to build with the current web stack.

Other projects

One thing to note here is that Emscripten and ASM are two separate projects. ASM is set out to be the universal Javascript compile target, not just for Emscripten. Other compilers like Mandreel or JSIL can benefit from the format as well. So far only Emscripten uses ASM as its default compile target, but other projects' implementations are on the way. I'm particularly interested in compiling LLJS to ASM. If ASM is like Assembly, LLJS is like C++ for writing readable and performant low-level code. LLJS already has its own compile target, but with ASM its performance could get even better.

Some More JavaScript Weirdness

JavaScript is a pretty fun language with many "weird" behaviors that make developers want to kill themselves. Some, like variable hoisting or global scope pollution, are quite well known; some are almost unknown to the majority of frontend developers. Below is a list of the weird JavaScript features that I know of (and it's certainly not a complete list):

Primitive vs Object

JavaScript primitives are not instances of their associated wrapper object types, even though they look like they are. For example:

var test = "test";
String.prototype.testFunction = function() { return 0; };
console.log(test.testFunction());  // 0

// but...
console.log(test instanceof String);  // false
console.log(test === new String("test"));  // false

So be careful when using String or Number objects. Use primitives wherever possible unless you know what you're doing.
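The difference also shows up in typeof, which is one quick way to tell a primitive from its wrapper object:

```javascript
console.log(typeof "test");              // "string"
console.log(typeof new String("test"));  // "object"

// Wrapper objects are always truthy, even when wrapping a falsey value:
console.log(Boolean(new Boolean(false))); // true
```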


Arrays

Arrays are also objects, and their length is calculated as the highest array index plus 1. So don't do this:

var arr = [1, 2];
arr[4] = 3;
console.log(arr.length);  // 5

It also leaves "holes" inside the array, which causes some array methods to stop working:

var arr = [1, 2];
arr[4] = 3;  // [1, 2, <2 empty slots>, 3]

for (var i = 0; i < arr.length; i++) {
  console.log(arr[i].toPrecision(2));
  // TypeError: Cannot call method 'toPrecision' of undefined
}

Array.prototype.sort defaults to the lexicographical comparison function:

[11, 3, 2].sort();  // [11, 2, 3]

Therefore always pass in a comparison function when calling sort():

[11, 3, 2].sort(function(a, b) { return a - b; });


typeof

Don't rely on typeof for logic control other than checking for undefined. It outputs some strange stuff:

typeof null;  // "object"
typeof NaN;   // "number"
typeof [];    // "object"
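The one check typeof is reliably good for is detecting undefined, because it never throws - even on identifiers that were never declared:

```javascript
// Safe: typeof works on undeclared identifiers
console.log(typeof someUndeclaredVariable === "undefined"); // true

// Unsafe: a direct reference throws a ReferenceError
// console.log(someUndeclaredVariable === undefined);
```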


Numbers

All numbers in JavaScript are IEEE 754 64-bit floating point values, which use 53 bits for the mantissa. That means the largest integer you can represent exactly is 2^53, not 2^32 or 2^64 like in C or Java. Any operation that stretches beyond the largest or smallest integer will be ignored:

var x = Math.pow(2, 53);
x === x + 1;  // true

Unlike arithmetic operators, bitwise operators only work with 32-bit integers, so:

var x = Math.pow(2, 53);
x / 2;  // 4503599627370496
x >> 1;  // 0

The examples above are taken from this SO answer.
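The reason is that bitwise operators first run their operands through ToInt32, which wraps values modulo 2^32 and reinterprets them as signed 32-bit integers:

```javascript
console.log(Math.pow(2, 31) | 0);        // -2147483648 (wraps into the signed range)
console.log((Math.pow(2, 32) + 5) | 0);  // 5 (reduced modulo 2^32)
console.log(Math.pow(2, 53) | 0);        // 0 (2^53 is a multiple of 2^32)
```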

Truthy and falsey

false, 0, "", null, undefined and NaN all evaluate to false; everything else evaluates to true. However, the fun starts when we compare those values:

false == 0;         // true
false == "";        // true
0 == "";            // true
null == false;      // false
null == null;       // true
undefined == false; // false
null == undefined;  // true
NaN == false;       // false
NaN == NaN;         // false
1 == true;          // true
[0] == true;        // false

That's why the triple equal operator (===) exists. Always use strict comparison to avoid punching yourself when writing JS code.
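Strict comparison skips type coercion entirely, so the surprising rows above disappear:

```javascript
console.log(false === 0);        // false
console.log(0 === "");           // false
console.log(null === undefined); // false
console.log(1 === true);         // false

// NaN is still the odd one out: it isn't strictly equal even to itself,
// so test for it with isNaN() (or a self-inequality check) instead.
console.log(NaN === NaN);        // false
```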

If you think you know JavaScript well enough, take this quiz. I only got 11/37 :(

What makes Javascript slow?

This post consolidates some of the most notable frontend performance issues related to Javascript and desktop browsers. For mobile web performance, you can read this article from Sencha.

Is Javascript really slow?

No. To be precise, a programming language is neither fast nor slow - it's just a language. What's slow is the interpreter/compiler that the language runs on and the environment it interacts with. Modern Javascript engines are not slow; in fact they're blazing fast and highly optimized compared to those of other interpreted languages like Python and Ruby. To prove that, let's take the Javascript component out of the browser and see how it does. We can install a Javascript engine like V8 (from Chrome) or SpiderMonkey (from Firefox) directly and run some benchmarks. On a Mac, the two of them can be trivially installed via Homebrew.

brew install v8
brew install spidermonkey

Let's use V8 as it's the fastest out there at the moment. Here are the results of a test that multiplies two 100x100 matrices:

Python 2.7.3           225ms
Ruby 2.0.0             216ms
Javascript V8          23ms

As you can see, the algorithm, which runs in O(n^3), is much faster in Javascript V8 than in Python or Ruby. Now let's take this test and run it on Chrome, which has V8 embedded. The result is even more surprising:

Chrome 31.0 (V8 3.21)  11ms

So it looks like V8, which is already very fast, is even more optimized to run on the browser. There are some more comprehensive benchmarking tests that confirm the speed of Javascript. You can take a look here or here.
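For reference, the kernel of such a matrix benchmark is just the classic triple loop (a sketch - the actual test harness, matrix sizes, and timing code may differ):

```javascript
// Multiply two n x n matrices (arrays of rows) in O(n^3).
function multiply(a, b, n) {
  var result = [];
  for (var i = 0; i < n; i++) {
    result[i] = [];
    for (var j = 0; j < n; j++) {
      var sum = 0;
      for (var k = 0; k < n; k++) {
        sum += a[i][k] * b[k][j];
      }
      result[i][j] = sum;
    }
  }
  return result;
}
```

The same function translates almost line for line into Python or Ruby, which keeps the comparison fair.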

So Javascript is fast, but why are developers still complaining about its performance?

The number one culprit: The Browser(s)

Javascript is just one part of the browser. There are still two more components that make web applications work: markup and CSS. Let's take a look at each of them:

HTML and the DOM

This is the source of all evil. DOM operations are expensive. For example, let's take a look at this code, which creates 5000 DOM elements and adds them to a blank page:

for (var i = 0; i <= 5000; i++) {
  var add = document.createElement('div');
  add.innerHTML = 'Item ' + i;
  document.body.appendChild(add);
}
It performs roughly 200 times fewer operations than multiplying two 100x100 matrices but takes 53ms - almost 5 times slower - on Chrome 31.0 with V8 3.21. On older browsers it's much worse, especially IE 6-8. So Javascript isn't to blame here. It's the DOM.

A big problem with the code above is that every change to the DOM causes a repaint and reflow: the browser has to re-render the part of the page affected by the DOM changes. As you might expect, this is expensive and should be avoided as much as possible. A general rule of thumb is to minimize DOM transactions, i.e. don't touch the DOM unless you absolutely have to. A good technique is using a DocumentFragment to batch-append multiple DOM elements to the page:

var fragment = document.createDocumentFragment();

for (var i = 0; i <= 5000; i++) {
  var add = document.createElement('div');
  add.innerHTML = 'Item ' + i;
  fragment.appendChild(add);
}

document.body.appendChild(fragment);

This is treated as only one transaction by the DOM API and therefore results in only one reflow. Modern browsers already optimize for this pattern to make our lives easier, but that doesn't help if our users are still stuck with browser versions from a couple of years ago.


CSS

How exactly does CSS work? CSS is simply a style sheet that the browser consults before rendering the DOM on the web page. The key thing to note here is that CSS is consulted after the DOM has been generated. That means it also involves DOM traversal and can sometimes cause performance problems.

Contrary to popular belief, the browser reads CSS rules from right to left, not left to right. For example this rule:

treehead treerow treecell .odd {…}

is read as: look for all elements with class odd, then traverse up the DOM tree, filter out the ones that don't belong to a treecell, and then continue up to treerow and treehead. For the reason why browsers do this, see this SO answer.

Let's do a quick measurement to see how bad this CSS descendant selector actually is, using Chrome's Speed Tracer. The results below were obtained from Speed Tracer for a document with 100 div elements with the class odd, only a fraction of which match the desired selector.


And here is the one for the same document but all the desired elements share a custom desired class. The selector is applied to the desired class directly:

That's an almost 3x improvement in style recalculation time. To be fair, most of the time we don't need to care about CSS performance, as modern browsers optimize it quite well (the two versions above perform about the same in the latest versions of Chrome). But it's always good to follow good practices, especially avoiding descendant and child selectors. It's also worth running your application through Speed Tracer to identify performance issues early on.

Javascript: The slow parts

As we already know, Javascript is a relatively fast scripting language. Most of the frontend performance problems are caused by the DOM and browser interaction, not the language itself. However, there are features in Javascript that might be problematic if used incorrectly. Below are some of the notable ones:

Prototypal inheritance

Looking up variables in long prototype chains is slow, especially when it's repeated over and over again. So if you find yourself accessing inherited data frequently, it's better to cache the data in a local variable:

function Person() {} = "Ryan";
var person = new Person();

var doSomething = function() {
  // Caching
  var name =;

  for (var i = 0; i < 100; i++) {
    console.log(name);        // good
    // console.log(; // bad: walks the prototype chain every time
  }
};

Function scope

Similar to protyping inheritance, looking up data in long function scope chains can also be costly. Again, caching is the key here:

var func1 = function() {
  var name = "Ryan";

  var func2 = function() {
    var nameCache = name;  // Caching

    for (var i = 0; i < 100; i++) {
      console.log(nameCache);  // avoids walking up the scope chain
    }
  };

  func2();
};

Loops

for…in and forEach loops perform quite poorly compared to the normal for loop. I personally think for…in should be avoided most of the time, as it doesn't provide much benefit. forEach should only be used when you need the function callback it provides. Most of the time, the good old for loop is sufficient. Also, caching the array length can provide some more performance gain:

var length = arr.length;

for (var i = 0; i < length; i++) {
  // Do something
}
Single threaded

The biggest disadvantage of Javascript is the lack of multi-threading support. That means heavy computation cannot be split up into concurrent tasks to make it faster. There's nothing developers can do about it, and it's pretty much unnecessary anyway. Frontend development rarely has to deal with IO, which is the most expensive operation in the concurrent programming world, and I've never run into any algorithmic computation heavy enough to require splitting across multiple threads.

Note that optimizing the above Javascript features doesn't provide much performance gain compared to optimizing DOM manipulation. Unless you're working with very old browsers, this shouldn't be much of a concern.


Javascript is probably the most misunderstood language in the world. It's a fast scripting language, much faster than Ruby or Python, but the browser has given Javascript a bad image. The DOM is slow, and Javascript has done the best it can to offset the many issues with DOM manipulation. The most important takeaway for frontend developers is to know where and when to touch the DOM and to question every DOM manipulation. It's also crucial to know your users and the browsers they're using - there's no point in optimizing for something the users' browsers already do for you. For more in-depth views on frontend performance optimization, please check out the references below.


Nicholas C. Zakas: Speed Up Your Javascript

Ariya Hidayat and Jarred Nicholls: Hacking WebKit & Its JavaScript Engines

Steve Souders: High Performance Websites

Sencha: 5 Myths About Mobile Web Performance

Google Developers: SpeedTracer Examples

Google Developers: Web Performance Best Practices

Mozilla: Writing efficient CSS