Stamping Out Overflow Checks in Ruby
Chris Seaton, 26 September 2021
Someone on Reddit asked if it was possible to turn off overflow checking on integer arithmetic in Ruby. They said that they could guarantee that their arithmetic would not overflow, and therefore they’d like to not pay the cost for a pointless check.
As several people said in the thread, this isn’t possible in the standard implementation of Ruby, MRI, but it is possible and safe in TruffleRuby. And the great thing is it already happens automatically if you can prove using your code that your arithmetic will not overflow! Further than that, if your code doesn’t prove it already then it’s simple to add that proof using standard Ruby. To see if we can go a bit further, I’ve also tried adding a new zero-cost way to add proof that’s less safe but could be a tiny bit faster.
We can show that this does actually have a measurable impact on performance, so can be worth doing, especially since it can be done with plain Ruby code.
It’s worth talking about this because it’s a good example of some of the subtle things that modern compilers do that people often aren’t aware of.
Overflow checks
Your processor has hardware support for integer arithmetic, but the integers it uses have a fixed-size, such as 32 or 64 bits. If the arithmetic operation gives you a result that’s too large, it gets wrapped around to a different value that isn’t what you’d want. Ruby integers don’t have a fixed-size - they allow arbitrarily-large integers. So how can you use fixed-size hardware arithmetic to implement unlimited-size Ruby arithmetic? It’s possible to detect if the result of a hardware arithmetic operation was too large. This is the overflow check that we’re talking about. After every arithmetic operation that can overflow, Ruby checks if it did overflow, and if it did, it re-does the arithmetic in a software library that supports arbitrarily-large integers.
Here’s some example Ruby code - an add operation.
def add_any(a, b, c)
a + b + c
end
TruffleRuby and Graal use a graph data structure to represent your program as it’s being compiled. I’ve talked about this elsewhere, but you can basically think about it as a flowchart of your code. The arrows represent control and data flowing through your program, and the boxes represent operations and calculations. We’re using a tool developed at Shopify called Seafoam to visualise the graph.
Here’s the graph of that example Ruby code. The T(7)
, T(8)
, and T(9)
nodes are a
, b
, and c
. You can see that their values flow through IntegerAddExact
operation nodes. Exact
means that it will check for overflow.
Here’s the resulting machine code. You can see the add
and jo
machine instructions. jo
stands for jump-on-overflow. It means jump to some other code, to handle the arithmetic in software, if the result was too large for the hardware integer.
0x11bb1f909: add esi, eax
0x11bb1f90b: jo 0x11bb1f9fc
...
0x11bb1f91f: add r10d, esi
0x11bb1f922: jo 0x11bb1f9de
You can imagine how that sounds bad - an extra check to make after every arithmetic operation - and why people would want to remove it. There’s immediate good news and bad news here. The good news is that most people will tell you that the jo
instruction is very cheap to execute. The processor is good at pipelining it and it doesn’t really have a large impact on performance on its own. The bad news is that the presence of the instruction prevents operations being packed as effectively as they can be, preventing further optimisations.
Stamps - micro-types within the compiler
We need to talk about a new concept before we can explain what we’re going to do next.
Ruby is a dynamically typed language (except for recent static typing ideas), but did you know that a good compiler will automatically discover static types as it’s compiling your dynamically typed program? There’s a simple version of this - such as knowing that a
, b
, and c
in the code above are all integers, but there’s also something more subtle going on that few people are aware of - stamps.
Stamps are a kind of micro-types that exist just within the compiler. Like all types they represent information about values, but unlike types you’ve usually worked with, stamps can give us a variety of properties, such as whether an object can be null or not, or whether a value is positive or negative, or even which individual bits are set in an integer. That’s the property we’re going to be interested in.
We can see an example of stamps in Ruby if we use the &
bit-wise and operation.
def stamp(a)
a & 0xff
end
In the compiler graph of this code, you can see the stamp annotated on the edge that represents the value flowing from node 834 to 880. It says value [0 - 255]
, which means there is a stamp on this edge that guarantees that the value will definitely be in that inclusive range.
What’s the point of stamps? Well if you’re compiling an operation, you can look at the stamps of the values that flow into the operation, and compile it differently based on what the stamp guarantees about the value.
Here’s the big idea - if you are adding two values where the stamp says that they are small enough that they cannot possibly overflow, then the compiler can automatically remove the overflow check. Your program needs to have some kind of proof in order to generate these stamps in order to do that. We’ll explain that next.
Masking to add stamps
One way we can add this proof is to simply use the &
operation. We can modify our add
to mask the three values to be in the range of a single byte, then we are proving that the operations cannot possibly overflow.
def add_masks(a, b, c)
a &= 0xff
b &= 0xff
c &= 0xff
a + b + c
end
Now the graph doesn’t contain those Exact
nodes anymore. It’s much simpler, with basic +
nodes instead. You can see how this has happened - the stamps flowing into them show that the values are small enough that they cannot possibly overflow. Note that the second +
has a stamp going in [0 - 510]
and coming out [0 - 765]
.
And the machine code.
0x1211f4095: and esi, 0xff
0x1211f409b: and r10d, 0xff
0x1211f40a2: and eax, 0xff
0x1211f40a8: add r10d, eax
0x1211f40ab: add r10d, esi
This is what we were originally after! So that’s one way of doing it - add masks to prove the range of your values to prove that your arithmetic will not overflow, and a good compiler will use that information to remove the overflow check.
Notice that the value with the stamp is stored in a local variable before we then retrieve it and use it - the stamp can live through being stored and retrieved, that’s not a problem.
An intrinsic to add stamps
But one problem we did introduce with masking is that the mask instruction is actually run. We get extra and
instructions. They’re pointless as they shouldn’t be actually doing anything, since we knew (through our own reasoning as the programmer) that the value is already within that range. We were just letting the compiler know that. If we’re really super-sure of that, and prepared to accept the risk of being wrong, could we do better?
We could write a special function to inject the stamp directly, without actually performing an operation. I’ve implemented this in Graal and TruffleRuby as Truffle::Graal.inject_stamp(value, 0xff)
. This is implemented as compiler intrinsic - that means a method call which is recognised by the compiler and treated differently. Here’s the implementation in Graal.
r.register2("injectStamp", int.class, int.class, new InvocationPlugin() {
@Override
public boolean inlineOnly() {
return true;
}
@Override
public boolean apply(GraphBuilderContext b, ResolvedJavaMethod targetMethod, Receiver receiver, ValueNode value, ValueNode stamp) {
b.addPush(JavaKind.Boolean, new InjectStampNode(value, stamp));
return true;
}
});
This says that when the compiler sees a call to injectStamp
, to run this code instead. It adds a special InjectStampNode
into the graph. You’ll never see this node though - it has an optimisation to simplify itself.
public void simplify(SimplifierTool tool) {
if (!hasUsages()) {
return;
}
if (stamp.isConstant()) {
int stampValue = stamp.asJavaConstant().asInt();
replaceAtUsagesAndDelete(graph().addOrUnique(
new PiNode(value, StampFactory.forInteger(32, 0, stampValue))));
}
}
The simplify operation gets the value of the stamp - the mask, and injects it using something called a π node, which means a node that is just there to add a stamp. Remember that it’s actually the overflow operation which looks for the stamp. Here we can see it checking if the stamp permits the operation to overflow or not. You can see that if it cannot, it replaces itself with the boolean constant false
, meaning it can never overflow.
public ValueNode canonical(CanonicalizerTool tool, ValueNode forX, ValueNode forY) {
if (forX.isConstant() && !forY.isConstant()) {
return new IntegerAddExactOverflowNode(forY, forX).canonical(tool);
}
if (forX.isConstant() && forY.isConstant()) {
return canonicalXYconstant(forX, forY);
} else if (forY.isConstant()) {
long c = forY.asJavaConstant().asLong();
if (c == 0) {
return LogicConstantNode.forBoolean(false);
}
}
if (!IntegerStamp.addCanOverflow((IntegerStamp) forX.stamp(NodeView.DEFAULT), (IntegerStamp) forY.stamp(NodeView.DEFAULT))) {
return LogicConstantNode.forBoolean(false);
}
return this;
}
Here’s our add
with the intrinsic instead.
def add_stamps(a, b, c)
a = Truffle::Graal.inject_stamp(a, 0xff)
b = Truffle::Graal.inject_stamp(b, 0xff)
c = Truffle::Graal.inject_stamp(c, 0xff)
a + b + c
end
Now in the graph you can see it’s simpler still, because the &
operations have gone away. This dynamic Ruby code now looks almost as simple as the static Java equivalent.
And in the machine code… wow look at that - just two simple machine add
instructions and no and
instructions. Shame about the intermediate mov
- that’s due to register allocation decisions. Due to hardware register renaming, it doesn’t really make any difference apart from being an unfortunate wart right in the middle of our example!
0x12457278d: add eax, dword ptr [r10*8 + 0xc]
0x124572795: mov r10d, eax
0x124572798: add r10d, esi
What happens if we were wrong about our assumptions of the code and we’re injecting a stamp which could be wrong?
If we write a program that injects a stamp which is purposefully wrong we can observe an overflow and wrap-around. This program injects a stamp that applies correctly for the first ten seconds, then it starts being passed values outside that range. But by then we’re in our optimized machine code without an overflow check, and the exception on the second-to-last line is triggered - the value will be negative because it was wrapped around and there was no overflow check.
def foo(a, b)
a = Truffle::Graal.inject_stamp(a, 0xff)
b = Truffle::Graal.inject_stamp(b, 0xff)
a + b
end
start = Time.now
until Time.now - start > 10
a = rand(0xff)
b = rand(0xff)
raise if foo(a, b) < 0
end
puts 'state change'
loop do
a = 0x7FFFFFFF
b = 0x7FFFFFFF
raise if foo(a, b) < 0
end
Measuring the impact
Does this actually do anything useful?
I took this code from ChunkyPNG
, a Ruby graphics library. It’s been slightly simplified. It composes two colours based on a transparency value. The colours are stored as packed RGB values, and composing them requires some integer arithmetic.
def r(value)
(value & 0x00ff0000) >> 16
end
def g(value)
(value & 0x0000ff00) >> 8
end
def b(value)
value & 0x000000ff
end
def rgb(r, g, b)
r << 16 | g << 8 | b
end
def int8_mult(a, b)
t = a * b + 0x80
((t >> 8) + t) >> 8
end
def interpolate_quick(fg, bg, alpha)
alpha_com = 255 - alpha
new_r = int8_mult(alpha, r(fg)) + int8_mult(alpha_com, r(bg))
new_g = int8_mult(alpha, g(fg)) + int8_mult(alpha_com, g(bg))
new_b = int8_mult(alpha, b(fg)) + int8_mult(alpha_com, b(bg))
rgb(new_r, new_g, new_b)
end
If we look at the graph and machine code of this version we can see a ton of overflow checks.
...
0x120f8329e: sub esi, eax
0x120f832a0: jo 0x120f835fc
0x120f832a6: mov r10d, dword ptr [r11 + 0x2c]
0x120f832aa: mov r10d, dword ptr [r10*8 + 0xc]
0x120f832b2: mov edx, r10d
0x120f832b5: and edx, 0xff0000
0x120f832bb: shr edx, 0x10
0x120f832be: imul edx, eax
0x120f832c1: jo 0x120f835b8
0x120f832c7: add edx, 0x80
0x120f832cd: jo 0x120f83612
0x120f832d3: mov r8d, edx
0x120f832d6: sar r8d, 8
0x120f832da: mov r9d, r8d
0x120f832dd: add r9d, edx
0x120f832e0: jo 0x120f835c5
0x120f832e6: mov r9d, r10d
0x120f832e9: and r9d, 0xff00
0x120f832f0: shr r9d, 8
0x120f832f4: mov ecx, r9d
0x120f832f7: imul ecx, eax
0x120f832fa: nop word ptr [rax + rax]
0x120f83300: jo 0x120f835ad
0x120f83306: imul r9d, eax
0x120f8330a: add r9d, 0x80
0x120f83311: jo 0x120f8351c
0x120f83317: mov ecx, r9d
0x120f8331a: sar ecx, 8
0x120f8331d: mov ebx, ecx
0x120f8331f: add ebx, r9d
...
But we know pixel values are always in the range of a byte, so we know all these checks are a pointless waste. We can fix this with adding just one line of code!
First, we can turn it into a benchmark by composing some whole random images.
a, b, c = [], [], []
1_000.times do
a.push rgb(rand(255), rand(255), rand(255))
b.push rgb(rand(255), rand(255), rand(255))
c.push rgb(rand(255), rand(255), rand(255))
end
Benchmark.ips do |ips|
ips.report('interpolate_quick') do
alpha = rand(255)
a.size.times do |n|
c[n] = interpolate_quick(a[n], b[n], alpha)
end
end
end
This produces 1062 bytes of machine code.
Now let’s try masking. We’ll just add one line - to mask alpha
to be in the range of a byte.
def interpolate_quick(fg, bg, alpha)
alpha &= 0xff
alpha_com = 255 - alpha
new_r = int8_mult(alpha, r(fg)) + int8_mult(alpha_com, r(bg))
new_g = int8_mult(alpha, g(fg)) + int8_mult(alpha_com, g(bg))
new_b = int8_mult(alpha, b(fg)) + int8_mult(alpha_com, b(bg))
rgb(new_r, new_g, new_b)
end
This produces a far simpler graph, and 640 bytes of machine code, and now no overflow nodes and no jo
instructions - it worked - we completely removed the cost of overflow from this code.
...
0x124f5a3b9: and edx, 0xff0000 # the mask
...
0x124f5a3df: sub edx, r10d
0x124f5a3e2: mov r9d, esi
0x124f5a3e5: and r9d, 0xff0000
0x124f5a3ec: shr r9d, 0x10
0x124f5a3f0: imul r9d, edx
0x124f5a3f4: lea r9d, [r9 + 0x80]
0x124f5a3fb: mov ecx, r9d
0x124f5a3fe: shr ecx, 8
0x124f5a401: add ecx, r9d
0x124f5a404: shr ecx, 8
0x124f5a407: add r8d, ecx
0x124f5a40a: shl r8d, 0x10
0x124f5a40e: mov r9d, eax
0x124f5a411: and r9d, 0xff00
0x124f5a418: shr r9d, 8
0x124f5a41c: imul r9d, r10d
0x124f5a420: lea r9d, [r9 + 0x80]
0x124f5a427: mov ecx, r9d
0x124f5a42a: shr ecx, 8
0x124f5a42d: add ecx, r9d
...
Now let’s try our intrinsic.
def interpolate_quick(fg, bg, alpha)
alpha = Truffle::Graal.inject_stamp(alpha, 0xff)
alpha_com = 255 - alpha
new_r = int8_mult(alpha, r(fg)) + int8_mult(alpha_com, r(bg))
new_g = int8_mult(alpha, g(fg)) + int8_mult(alpha_com, g(bg))
new_b = int8_mult(alpha, b(fg)) + int8_mult(alpha_com, b(bg))
rgb(new_r, new_g, new_b)
end
This produces 614 bytes of machine code - under 60% of the original code size. It’s more compact still, because we don’t have the actual masking and
instruction now.
We know the machine code looks a lot better - a lot more compact and straight-forward. Does it have an impact on actual running time?
Yes it’s nearly twice as fast to use a mask! All for adding one line of pure Ruby code. The intrinsic doesn’t really add much beyond that - it’s all within the margin of error.
Didn’t we say that jo
didn’t add much of an impact? Well that’s what I was always taught - maybe I should reconsider that and measure it a bit more. Also, removing the linear overflow checks generally relaxes the entire graph, allowing perhaps a better scheduling. I could measure that as well.
And just for fun, let’s compare TruffleRuby, with its sophisticated stamp system, to MRI and some optimising Ruby implementations.
The mask doesn’t do much for the other simpler implementations, but TruffleRuby is up to 200x faster than YARV, up to 140x faster than MJIT, and up to 50x faster than JRuby, for this benchmark.
So there you go - yes there is a way using pure Ruby to hint to a compiler that an arithmetic operation cannot overflow - you just have to write a proof that the compiler can use - and a real Ruby compiler is able to take advantage of it (maybe the other compilers should as well), and it really does have a useful impact on performance.
Related idea in asm.js
Do you think you’ve seen this idea of masking values with & 0xff
before? asm.js is a subset of JavaScript designed to run more efficiently by using more hardware primitives in the same way we are. It uses a pattern value | 0
to similarly force a value to be treated as a machine integer.
- More information about TruffleRuby
- Stamping Out Overflow Checks in Ruby
- The Future Shape of Ruby Objects
- Seeing Escape Analysis Working
- Understanding Basic Truffle Graphs
- Context on STM in Ruby
- Seeing Register Allocation Working in Java
- Understanding Basic Graal Graphs
- Understanding Programs Using Graphs
- Low Overhead Polling For Ruby
- Top 10 Things To Do With GraalVM
- Ruby Objects as C Structs and Vice Versa
- Understanding How Graal Works — a Java JIT Compiler Written in Java
- Flip-Flops — the 1-in-10-million operator
- Deoptimizing Ruby
- Very High Performance C Extensions For JRuby+Truffle
- Optimising Small Data Structures in JRuby+Truffle
- Pushing Pixels with JRuby+Truffle
- Tracing With Zero Overhead in JRuby+Truffle
- How Method Dispatch Works in JRuby+Truffle
- A Truffle/Graal High Performance Backend for JRuby