On coarse- vs. fine-grained synchronization

Performance-related topics seem to constantly attract attention, and everyone seems to have strong opinions on the matter. Writing code that can be proved [1] to perform according to a certain pattern is probably not the worst place to start, although the web is full of counterintuitive [2] examples [3] where hardware details weigh in too strongly to ignore, and which sometimes call for healthy scepticism.
Recently the topic of concurrent batch-modification of a data structure came up. In the concrete example, multiple threads merge multiple lists of objects into a hash map, something along the lines of:
Map<String, Thing> map;

void addToMap(List<Thing> list) {
    for (Thing thing : list)
        map.put(thing.id, thing);
}

‘addToMap’ is called from multiple threads, so some synchronization is in order. The discussion revolved around whether the necessary synchronized statement is best put outside or inside the for loop:

void addToMap(List<Thing> list) {
    synchronized (map) {
        for (Thing thing : list)
            map.put(thing.id, thing);
    }
}

vs

void addToMap(List<Thing> list) {
    for (Thing thing : list) {
        synchronized (map) {
            map.put(thing.id, thing);
        }
    }
}
Two theoretical considerations collide here: the first example reduces context switches and synchronization effort but holds the lock on the map for longer. The second example holds the lock only very briefly, thus increasing potential concurrency, but does so at the expense of synchronization effort: with a batch size of n, the first variant acquires the lock once per list while the second acquires it n times. It turns out that, as always, it depends.

The case for fine-grained synchronization

I’m purposefully misusing the term here, as it usually refers to the number of locks involved in accessing a data structure. Here I’ll rebrand it to refer to the second example: synchronization for every single access to the map.
There is really not much to say about this variant: the access pattern theoretically maximizes concurrency because it minimizes the time any thread holds the lock on the map. However, since the synchronized statement itself has a cost (and it does, as we’ll see shortly), the question is how much that cost slows the threads down in terms of CPU time.

The case for coarse-grained synchronization

The first example obtains the lock on the map and inserts an entire list of objects while holding it. While this blocks other threads’ access to the map for longer, it might, if synchronization does have a cost, result in lower overall synchronization overhead, leaving more CPU time available to the other threads.

To the test

So I decided to write a short test and try out a simple scenario: multiple threads adding and updating entries in a shared map:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class SynchronizedPerformanceTest {

    final int numOfWorkers = 4;
    final int batchSize = 100;
    final long durationMs = 20 * 1000;

    class Weather {

        private Map<String, Integer> degrees = new HashMap<String, Integer>();

        public void updateDegrees(Map<String, Integer> postalCode2Degrees) {
            // Coarse-grained variant: one lock around the whole batch.
            // For the fine-grained runs, the synchronized block moves
            // inside the loop (see the sketch below).
            synchronized (degrees) {
                for (String postalCode : postalCode2Degrees.keySet()) {
                    updateDegree(postalCode, postalCode2Degrees.get(postalCode));
                }
            }
        }

        public void updateDegree(String postalCode, Integer degree) {
            degrees.put(postalCode, degree);
        }
    }

    class Worker extends Thread {

        private final Weather weather;

        public Worker(Weather weather) {
            this.weather = weather;
        }

        public void run() {
            Random r = new Random();
            Map<String, Integer> postalCodes2Degrees = new HashMap<String, Integer>();
            while (running.get()) {
                postalCodes2Degrees.clear();
                int postalCode = r.nextInt(1000);
                int degrees = r.nextInt(100);
                for (int i = 0; i < batchSize; i++) {
                    postalCode++;
                    degrees++;
                    postalCodes2Degrees.put("" + postalCode, degrees);
                }
                weather.updateDegrees(postalCodes2Degrees);
                iterations.addAndGet(postalCodes2Degrees.size());
            }
        }
    }

    AtomicBoolean running = new AtomicBoolean();
    List<Worker> workers;
    Weather weather;
    AtomicLong iterations = new AtomicLong();

    public void startWorkers() {
        running.set(true);
        for (Worker worker : workers)
            worker.start();
    }

    public void stopWorkers() throws Exception {
        running.set(false);
        for (Worker worker : workers)
            worker.join();
    }

    @Before
    public void setup() {
        iterations.set(0);
        weather = new Weather();
        workers = new ArrayList<Worker>();
        for (int i = 0; i < numOfWorkers; i++)
            workers.add(new Worker(weather));
    }

    @After
    public void teardown() {
        running.set(false);
    }

    @Test
    public void test() throws Exception {
        startWorkers();
        Thread.sleep(durationMs);
        stopWorkers();
        System.out.println((iterations.get() * 1000 / durationMs)
                + " iterations / sec");
    }
}
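
For the fine-grained measurements, the synchronized block in updateDegrees moves inside the loop, mirroring the second addToMap example from above:

// Fine-grained variant of updateDegrees: the lock is acquired and
// released once per entry instead of once per batch.
public void updateDegrees(Map<String, Integer> postalCode2Degrees) {
    for (String postalCode : postalCode2Degrees.keySet()) {
        synchronized (degrees) {
            updateDegree(postalCode, postalCode2Degrees.get(postalCode));
        }
    }
}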

Setup

The test was run with 1, 2, 3 and 4 threads (the numOfWorkers field) and a constant batch size of 100 (I didn’t see interesting variations when deviating from that value), once with coarse- and once with fine-grained synchronization, on a 2.5 GHz i5 Mac (4 cores) with Oracle Java 1.7.45 (-Xmx1024m). I obtained the following result matrix:
Threads | Batch size | Iterations/sec | Synchronization
--------+------------+----------------+----------------
1       | 100        | 10E6           | fine-grained
2       | 100        | 5.5E6          | fine-grained
3       | 100        | 7.6E6          | fine-grained
4       | 100        | 6.6E6          | fine-grained
1       | 100        | 11E6           | coarse-grained
2       | 100        | 20E6           | coarse-grained
3       | 100        | 17E6           | coarse-grained
4       | 100        | 17E6           | coarse-grained

Discussion

This particular experiment clearly favours coarse-grained synchronization, which outperforms fine-grained synchronization by roughly a factor of three. To some extent this is to be expected [4]: synchronization does incur a cost which the CPU has to pay. A naive look at the single-threaded numbers hints at this cost being around 10% (10 million insertions per second with fine-grained synchronization vs. 11 million with coarse-grained). The time spent on synchronization (the JVM produces less efficient code, performs more lock operations, some of which translate into OS calls, erects memory barriers and flushes caches) is subtracted from the time the threads have available to actually produce the test data.
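
To get a rough feel for that per-statement cost, one can time an uncontended synchronized block in isolation. The following is a minimal, deliberately naive sketch (the class name and iteration count are mine, and JIT warmup or dead-code elimination can easily distort such a loop; a serious measurement would use a harness such as JMH):

// Naive single-threaded sketch of the cost of an uncontended
// synchronized block. Results vary across JVMs and hardware.
public class SyncCostSketch {

    static final Object lock = new Object();
    static long counter; // static field, so the writes stay observable

    public static void main(String[] args) {
        final int rounds = 100_000_000;

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            synchronized (lock) {
                counter++;
            }
        }
        long withSync = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            counter++;
        }
        long withoutSync = System.nanoTime() - start;

        System.out.println("with synchronized:    " + withSync / 1_000_000 + " ms");
        System.out.println("without synchronized: " + withoutSync / 1_000_000 + " ms");
        System.out.println("counter = " + counter);
    }
}

Any gap between the two timings is per-acquisition overhead of the kind the coarse-grained variant amortizes over a whole batch.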
As the number of threads increases, the two implementations behave differently. The coarse-grained implementation has a sweet spot at 2 threads but consistently performs well with multiple threads. The fine-grained implementation, however, does not seem to benefit from concurrency at all; it is in fact fastest with a single thread.
While not listed here, increasing the thread count above the available number of cores did not yield significantly different results. An interesting observation was that at no time did CPU utilization surpass 300%: both implementations are ultimately constrained by contention on the single data structure they are modifying.

Conclusion

The particulars of hardware, JVM implementations and operating system architectures make it hard to predict whether fine- or coarse-grained locking of a data structure under concurrent access will perform better. As demonstrated with a simple example, the theoretical considerations favouring the former are often outweighed by the practical implications of the execution environment. The overall cost of completing a task must always be taken into consideration, rendering an old proverb more current than ever: measure twice, cut once.

Resources

[1] Computational Complexity Theory
http://en.wikipedia.org/wiki/Computational_complexity_theory

[2] Why is processing a sorted array faster than an unsorted array?
http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array

[3] Playing with the CPU pipeline
http://lolengine.net/blog/2011/9/17/playing-with-the-cpu-pipeline

[4] Java Language Best Practices
http://docs.oracle.com/cd/A97688_16/generic.903/bp/java.htm#1005570
