Skip to content

Sample ~ Efficient Collections

Vlad Ureche edited this page Jul 20, 2015 · 31 revisions

If you haven't read the [[introduction|Tutorial--Introduction]] and [[first example|Tutorial--Example-(Part-1)]] already, it's a good time to do so, as this benchmark description assumes familiarity with the ildl-plugin.

This section will show a benchmark we have randomly picked from the Rosetta code repository: The Hamming Numbers benchmark.

The benchmark files are available in the ildl-plugin repository, in the tests/benchmarks/src/ildl/benchmark/hamming directory. If you have imported the ildl-* projects in the Scala IDE, this benchmark is available under the ildl-benchnarks project, in the src directory, in package ildl.benchmark.hamming.

Problem

Hamming numbers are a monotonically increasing sequence of numbers formed using just the prime factors 2, 3 and 5:

H(i,j,k) = 2^i + 3^j + 5^k

The tricky part in generating the sequence of Hamming numbers is that the three factors i, j and k are not monotonically increasing, not even if we sum them up. For example, 4 (which corresponds to factors i=2, j=0, k=0) comes before 5 (which corresponds to factors i=0, j=0, k=1).

Solution

We have taken the 2nd solution on the Rosetta code website:

class Hamming extends Iterator[BigInt] {
  import scala.collection.mutable.Queue
  val q2 = new Queue[BigInt]
  val q3 = new Queue[BigInt]
  val q5 = new Queue[BigInt]
  def enqueue(n: BigInt) = {
    q2 enqueue n * 2
    q3 enqueue n * 3
    q5 enqueue n * 5
  }
  def next = {
    val n = q2.head min q3.head min q5.head
    if (q2.head == n) q2.dequeue
    if (q3.head == n) q3.dequeue
    if (q5.head == n) q5.dequeue
    enqueue(n)
    n
  }
  def hasNext = true
  List(q2, q3, q5) foreach (_ enqueue 1)
}

This code located in HammingNumbers.scala, in four versions, corresponding to the original code followed by 3 transformed versions (which have the same code, except for the adrt scopes).

There is one change we made from the original benchmark: In Scala, the mutable.Queue only allows queueing a batch of elements, not just one. Since we only enqueue one element at a time, we created an implicit wrapper which adds the enqueue1 method:

  // we want to be able to enqueue a single element at once -- please see the
  // comment above for the explanation of enqueue1 vs enqueue. Note that we
  // have made QueueWithEnqueue1 a value class, thus preventing the creation
  // of an object in order to perform the enqueue1 operation:
  implicit class QueueWithEnqueue1[T](val q: Queue[T]) extends AnyVal {
    def enqueue1(t: T) = q.enqueue(t)
  }

This method, when invoked, prepares a batch of one element and uses the enqueue method. However, since the code never introduces an improved queue, which can introduce one element at a time, our transformation intercepts the enqueue1 method and uses the more efficient 1-element enqueue of our data structure.

We will now describe the optimizations that we've done on this code using ildl transformations.

Transformations

Several factors influence the overall speedup. To see how each change influences the performance, we added intermediate steps to the transformation, allowing us to pinpoint exactly which transformation adds more benefit.

The follwing diagram shows the three transformation steps we implemented:

BigInt        ===> BigInt      ===> java.lang.Long ===> scala.Long
Queue[BigInt] ===> FunnyQueue* ===> FunnyQueue*    ===> FunnyQueue*
   \                  ^                  ^                  ^
    \________________/                  /                  /
     \ step1.QueueOfLongAsFunnyQueue   /                  /
      \                               /                  /
       \                             /                  /
        \___________________________/                  /
         \ step2.BigIntAsLong +                       /
          \ step2.QueueOfLongAsFunnyQueue            /
           \                                        /
            \______________________________________/
              step3.BigIntAsLong +
               step3.QueueOfLongAsFunnyQueue

* FunnyQueue-s are specialized by hand for the element type

We will now describe each step individually.

Step 1: The Circular Buffer, called FunnyQueue

As the title suggests, transforming the queue into a circular array-based buffer brought the most benefit, since it achieved two results at once: being backed by a fixed-size array, it both improved locality and reduced access time. However, this fixed-size circular buffer will not be able to handle any random number of elements -- we need to give an upper bound at construction time.

Furthermore, as shown before, Queue.enqueue takes a batch of elements, which adds further overhead for packing and unpacking them. By intercepting enqueue1 allowed us to specialize the case where we introduce only one element at a time, bringing an overall speedup of 2.4x to the benchmark.

The FunnyQueue code is:

/** An array-based ring buffer used as a queue.
 *  This is the version that stores BigInts. */
class FunnyQueue {

  private[this] final val MAX = 6000
  private[this] val array = new Array[BigInt](MAX)
  private[this] var index_start = 0
  private[this] var index_stop = 0

  def enqueue(l: BigInt): Unit = {
    array(index_stop) = l
    index_stop = (index_stop + 1) % MAX
  }

  def dequeue(): BigInt = {
    val res = array(index_start)
    index_start = (index_start + 1) % MAX
    res
  }

  def head(): BigInt =
    array(index_start)
}

The code is located in FunnyQueue.scala files attached to each step. Also note that the FunnyQueue implementation is specialized by hand to the type we're expecting to use.

Since in the first step we're only changing the Queue implementation, there is a single transformation description in the DescrObject.scala file corresponding to step1:

/**
 *  A transformation object that transforms the Queue[BigInt] to a [[FunnyQueue]].
 *  @see the comment in [[ildl.benchmark.hamming.HammingNumbers]] for more information
 */
object QueueOfLongAsFunnyQueue extends TransformationDescription {

  // coercions:
  def toRepr(in: Queue[BigInt]): FunnyQueue @high =
    throw new Exception("We shouldn't need this!")
  def toHigh(q: FunnyQueue @high): Queue[BigInt] =
    throw new Exception("We shouldn't need this!")

  // constructor:
  def ctor_Queue(): FunnyQueue @high =
    new FunnyQueue()


  // extension methods and implicits:
  def implicit_QueueWithEnqueue1_enqueue1(q: FunnyQueue @high)(bi: BigInt): Unit = {
    q.enqueue(bi)
  }

  def extension_enqueue(q: FunnyQueue @high, bis: BigInt*): Unit = {
    // we don't support more than one element :)
    assert(bis.size == 1)
    val bi = bis.apply(0)
    assert(bi.isValidLong)
    q.enqueue(bi.longValue())
  }

  def extension_dequeue(q: FunnyQueue @high): BigInt = q.dequeue()

  def extension_head(q: FunnyQueue @high): BigInt = q.head()
}

The first thing about the FunnyQueue transformation is that the toHigh and toRepr methods are not implemented, instead choosing to throw an exception. In this case, the two conversion methods are only useful to guide the compiler into matching the correct types to transform, but will throw an exception if a FunnyQueue leaks into the rest of the code.

Right now, we're using a runtime exception to signal the value escaping, but, since this information is available at compile-time, we plan to introduce an annotation which tells the compiler that the conversions should not be introduced, and instead producing a compile-time error explaining how the value leaked outside the transformed scope (or scopes, as we are allowed to pass encoded values among scopes).

Another thing to notice is that the transformation only supports the basic queue operations: head, enqueue, enqueue1 and dequeue. We can get away with this as we don't use the other operations in the example. In a real scenario, the programmer would have to write additional cases for all the desired operations.

Step 2: Scope Reduction

In the benchmark, we aim at finding the 10001-th element of the Hamming numbers. For this, there is no need for the full scala.BigInt range: the long integer suffices. Therefore, this transformation step switched from the scala.BigInt object (backed by the java.math.BigInteger) to the java.lang.Long object.

Reducing the range saves some memory and some cycles when operating on the data, although there's not a lot of saving since both scala.BigInt and java.lang.Long are heap-allocated objects. Compared to the last transformation, this only brings a 25% improvement, for a total 3.0x improvement over the original case.

The FunnyQueue is now adapted to store java.lang.Long objects and the there are two transformations:

  • one from Queue[BigIng] to FunnyQueue (similar to the one above)
  • one from BigInt to java.lang.Long, shown below.
/**
 *  A transformation object that transforms [[BigInt]] values into [[java.lang.Long]]s.
 *  @see the comment in [[ildl.benchmark.hamming.HammingNumbers]] for more information
 */
object BigIntAsLong extends TransformationDescription {

  // coercions:
  def toRepr(high: BigInt): Long @high = {
    assert(high.isValidLong)
    high.longValue()
  }
  def toHigh(repr: Long @high): BigInt = BigInt(repr)

  // extension methods:
  def extension_*(recv: Long @high, other: Long @high): Long @high =
    // note: Math.multiplyExact requires Java 8
    // java.lang.Math.multiplyExact(recv, other)
    Long.valueOf(recv.longValue() * other.longValue())

  def extension_+(recv: Long @high, other: Long @high): Long @high =
    // note: Math.multiplyExact requires Java 8
    // java.lang.Math.addExact(recv, other)
    Long.valueOf(recv.longValue() + other.longValue())

  def extension_==(recv: Long @high, other: Long @high): Boolean =
    // note: Math.multiplyExact requires Java 8
    // java.lang.Math.addExact(recv, other)
    recv == other

    def extension_min(recv: Long @high, other: Long @high): Long @high =
    if (recv.longValue() < other.longValue())
      recv
    else
      other
}

The two transformations are stored in DescrObject.scala in package step2.

This transformation is interesting for another reason: it is the first one to use nested scopes:

  adrt(step2.BigIntAsLong) {
    adrt(step2.QueueOfLongAsFunnyQueue) {
      class HammingADRT_2 extends Iterator[BigInt] {
        val q2 = new Queue[BigInt]
        val q3 = new Queue[BigInt]
        val q5 = new Queue[BigInt]
        def enqueue(n: BigInt) = {
          q2 enqueue1 n * 2
          q3 enqueue1 n * 3
          q5 enqueue1 n * 5
        }
        def next = {
          val n = q2.head min q3.head min q5.head
          if (q2.head == n) q2.dequeue
          if (q3.head == n) q3.dequeue
          if (q5.head == n) q5.dequeue
          enqueue(n)
          n
        }
        def hasNext = true
        q2 enqueue 1
        q3 enqueue 1
        q5 enqueue 1
      }
    }
  }

This snippet of code is taken from the HammingNumbers.scala file.

Step 3: Stack Allocation

The final step to allocate long integers on the stack. To do this, we switch from java.lang.Long to scala.Long, which the Scala compiler backend automatically unboxes in method signatures and the underlying array used by the FunnyQueue data structure.

This brings an additional 33% improvement, for a total of 4.0x faster execution compared to the original code. What should be noted is the fact that all 4 versions of the code look identical, except for the adrt scopes.

Benchmark

Running the benchmark can be done in sbt:

$ cd ildl-plugin
$ sbt 'ildl-benchmarks/runMain ildl.benchmark.hamming.BenchmarkRunner'

The results obtained on our test machine are:

| Benchmark               | Time | Speedup | Garbage* | GC time* | Garbage+ | GC time+ |
|                         | (ms) |         | (MB)     | (ms)     | (MB)     | (ms)     |
| ----------------------- | ---- | ------- | -------- | -------- | -------- | -------- |
| 10001-th number, orig.  | 6.56 |    none |        0 |        0 |       31 |       11 |
| 10001-th number, step 1 | 2.70 |    2.4x |        0 |        0 |       31 |       11 |
| 10001-th number, step 2 | 2.16 |    3.0x |        0 |        0 |       31 |       12 |
| 10001-th number, step 3 | 1.64 |    4.0x |        0 |        0 |       31 |       10 |
  • * GC pause time during the benchmark run
  • + GC pause time between the benchmark runs (mainly preparing the data)
  • only counting GC cycles occurred during the benchmark run

Conclusion: Using nested scopes allows the code to transform multiple aspects of the code, in this case producing speedups between 2.4x and 4x.

From Here:

Frog Work Ahead: Some of the pages of this wiki are in flux. If you can't find what you are looking for, please [file a bug](https://github.com/miniboxing/ildl-plugin/issues).
Clone this wiki locally