Raytracing – Floating Point Performance in 2020

So, I started playing with raytracers again – writing my own, following the guidance of Ray Tracing from the Ground Up. In doing so, I've had the opportunity to run the code across various modern multi-core machines, which gives an interesting look at FP performance in modern systems.

The raytracer in question is up on GitHub – it's a simple raytracer that is FP-intensive, but not particularly memory-intensive. That said, if it's compiled against debug libraries, it performs extremely poorly due to the Qt (Ed: and MSVC) debugging hooks. It doesn't try to be too clever, since most modern CPUs are fast enough not to need it.

Last time I did this, it was with smallpt, a decade ago and the story was very different.

This time, rough times have been gathered for an 800×800 raytrace at 100 samples per pixel, with DoF simulation, intersecting with about 30 objects. This produces a minimum of 64 million rays cast. A sample output is below.

CPU           | OS      | STT   | C/T  | MTT  | Compiler
i7-3770k      | Linux   | 23.1s | 4/8  | 5.3s | g++ 7.5.0
i7-4770       | Windows | 24.0s | 4/8  | 6.5s | MSVC 2019
Ryzen 7 3700X | Windows | 23.2s | 8/16 | 2.7s | MSVC 2019
i9-9900k      | Windows | 16.4s | 8/16 | 2.3s | MSVC 2019

Rough Execution Times – by CPU

In the above table, STT is single-thread time (in seconds), C/T is the core/thread count, MTT is the multi-threaded time using all available hardware threads, and Compiler is the compiler used to produce the binary.
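As a quick sanity check on these numbers (my own back-of-envelope arithmetic, not from the original runs): 800×800 pixels at 100 samples per pixel gives the 64-million-ray floor, and dividing by the measured MTT times gives a rough rays-per-second rate:

```python
# Back-of-envelope throughput from the table above. The rate ignores
# UI overhead and secondary/shadow rays, so it understates the real
# amount of intersection work done.
width, height, spp = 800, 800, 100
rays = width * height * spp
print(f"{rays:,} rays minimum")  # 64,000,000

mtt_seconds = {"i7-3770k": 5.3, "Ryzen 7 3700X": 2.7, "i9-9900k": 2.3}
for cpu, t in mtt_seconds.items():
    print(f"{cpu}: ~{rays / t / 1e6:.0f} M rays/s")
```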

This is a very unscientific test – no load shedding was done, and each figure is roughly the worst non-outlier of 3–4 runs, rounded to the nearest 100ms. As my raytracer has a Qt UI, there is measurable UI overhead on Windows – that's partially reflected in the difference between the 3770 and 4770 times.

[Edit 2020-06-30: I’ve written a basic CLI front-end to the raytracer and that’s also revealed there’s compiler optimisation differences too – I’ll post an update with new times when I’ve finished collecting them]

There is no explicit vectorisation in any build. Autovectorisation is enabled for AVX, but not AVX-512.

Windows builds were performed using MSVC 2019 without tuning biased towards Intel or AMD, and the same binary set was used on all systems tested.
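For reference, flags along these lines control the autovectorisation described above (illustrative only – the project's actual build settings may differ):

```shell
# g++: -O2 plus AVX code generation; no -mavx512f, so no AVX-512.
g++ -O2 -mavx -o raytracer main.cpp

# MSVC equivalent (from a developer prompt): /O2 with AVX codegen.
# cl /O2 /arch:AVX main.cpp
```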

The surprising result is just how poorly the Ryzen does compared to the older-generation i7s, given that the 3rd- and 4th-gen i7s are 7–8 years old now – its only saving grace is that it has double the cores packed into the chip.

It’s also important to note that this is purely a from-cache FP intensive task – it barely hits the memory interface, so the much faster memory interfaces in the later CPUs will not make a difference here.

Hardware Restoration Tech

Restoring an InfoTower 1000

Just a few notes from this one.

I rescued some old gear from ACMS with the intent of at least getting some of it operational whilst I kept a hold of it.

One of the items was a DEC InfoTower system with 7 CD-ROM drives.

The InfoTower was a DEC InfoServer 1000 in a steel cabinet (labelled BA56A), with a single power supply powering it and all the connected drives. InfoServers could also be used to drive tape and conventional disk units, serving them to VMS and other systems over an Ethernet LAN.

InfoTowers were available with 4 or 7 CD-ROMs, and whilst a lot of photos out there have these with the older caddied RRD42s and similar, the one I picked up has the RRD43 tray drives.

When I got around to inspecting it, it clearly had a blown power-supply – you could smell the electrolyte from the capacitors the moment the rear service cover was removed.

Interestingly enough, the InfoTowers use an old AT-style power supply which plugs into a backplane which connects to the power-connector made available to each drive bay.

You can replace the power supply fairly trivially with a modern ATX one if you get an ATX-to-AT power supply loom (I picked up one from eBay). The power supply you install needs to be no bigger than a standard-profile unit (so very large monsters like the old Corsair HX1000 are completely out of the question) and needs at least 3 Molex plugs. The original supply was rated at 230W. I used a cheap Aywun 500W supply, since I couldn't get anything smaller with sufficient Molex plugs.

Unfortunately, most of the eBay looms have the switch soldered to them rather than connected by spades (as the original AT supplies used), which is wasteful and needless, as AC-safe spade connectors are quite cheap. I cut the spade connectors off the old power supply's switch leads and spliced them onto the switch cable from the loom. That said, if I had had uncrimped spade shoes and insulators, those would have been a better choice than reusing the old connectors.

One trap to watch, however: the pin-out of the P8/P9 connector on the backplane is the reverse of the PC convention – whilst the two connectors are side-by-side, you must connect them with the ground wires on the outside, not in the middle. (You should double-check this before you swap the supply over, but I doubt there's much variance between units.) Fortunately the PCB is labelled, so a bit of investigative work should help you verify this before you smoke your new power supply.

AUI to 10Base-T (twisted-pair) adapters are also fairly easy to get hold of these days – I was able to pick up a few more, so I'll have enough to hook up my MicroVAXen as well as the InfoServer.


Wind Correction on the HP48/50

Out of frustration with how slow working on an E6B can be during flight planning, I’ve been looking into electronic options for replacing it that do not involve buying an overpriced specialised calculator.

Whilst solutions like the CX-2 can be used in exams in the US, they can't be used here, which defeats any benefit of using a type-specific calculator over a more generic programmable.

Also, the pricing is such that a CX-2 costs about two-thirds the price of a good programmable like the HP50G.

So I’ve opted to buy a 50G and implement as much of the E6B functionality as is practical for planning.

My initial efforts are as below!

Wind correction on the HP50G (easily adapted to the non-CAS 48G by removing the →NUM instructions and replacing the UNROT instructions with ROT ROT).

≪ 'curTAS' STO ≫ 'TAS' STO
≪ 'curWV' STO 'curWHT' STO ≫ 'WV' STO

These should all be put into their own folder. They work using globals to store values that carry over between specific operations.

It’s also written with RPN mode in mind. I’m not entirely sure how anybody gets anything done fast on a calculator using algebraic modes.

Oh, and set Degrees for angles unless you do your flight planning in Radians.

Using the above:

Change to the folder, and select the custom menu.

Punch in your TAS, press the TAS softkey.

Punch in your wind as two values – the heading and velocity – and press the WV softkey.

Punch in your desired track and press TRKT. The heading corrected for wind, and the ground speed, will be pushed onto the stack.

You only need to enter a new WV or TAS if they change – they persist between multiple TRKT operations.
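For reference, the TRKT step solves the standard wind triangle. A Python sketch of the same computation (my own illustration of the maths – not a transcription of the RPL program, and `wind_correction` is a name I've made up):

```python
import math

def wind_correction(track, tas, wind_dir, wind_speed):
    """Solve the wind triangle: return (heading, ground_speed).

    track      -- desired track over the ground, degrees
    tas        -- true airspeed, knots
    wind_dir   -- direction the wind blows FROM, degrees
    wind_speed -- wind speed, knots
    """
    rel = math.radians(wind_dir - track)                # wind angle off the track
    wca = math.asin(wind_speed * math.sin(rel) / tas)   # wind correction angle
    gs = tas * math.cos(wca) - wind_speed * math.cos(rel)
    return (track + math.degrees(wca)) % 360, gs

# Example: 100kt TAS, track 360, wind 090/20 -> crab ~11.5 degrees right
hdg, gs = wind_correction(0, 100, 90, 20)
print(round(hdg, 1), round(gs, 1))
```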

Music Photography

Japan Music Festival at the Roller Den, Sydney

I tagged along to the JMF on a whim, partially as a fan of JRock/JPop – and partially as a photographer looking for somewhere to practice my camera-craft during my off-season.

It was a great show.  Ended up buying Jill’s and kaimokujisho’s albums.  Should have picked up Sparky’s album too, but I was suffering pretty badly from the volume levels and was trying to make it out without spending all of my cash.

Unfortunately for me, I made the amateur’s mistake of not taking earplugs with me – my ears finally stopped ringing 3 days later.

It was a great first outing for my new second-hand 70-200 2.8 VR, which performed admirably in the conditions. I had also completely forgotten some of the D3's craziness with the space-remaining counter: I thought I was running out of space by the end of kaimokujisho's set – many poor (but potentially salvageable) bursts got deleted – and I thought I had significantly less space during 101A's set than I actually did, so I shot hyper-conservatively… then, about 3 minutes into Jill's set, I discovered I had a whole memory card free. I'll try to remember about the damned counter next time.

Software Development

Detecting Transaction Failures in Rails (with PostgreSQL)

So, Rails 4 added support for setting the transaction isolation level on transactions – something Rails has sorely needed for a long time.

Unfortunately, nowhere is it documented how to correctly detect whether a transaction has failed during your transaction block (versus any other kind of error, such as a constraint failure).

The right way seems to be:

RetryLimit = 5 # set appropriately...

txn_retry_count = 0
begin
  Model.transaction(isolation: :serializable) do
    # do txn stuff here.
  end
rescue ActiveRecord::StatementInvalid => err
  # Only retry serialisation/deadlock failures; re-raise anything else.
  raise unless err.original_exception.is_a?(PG::TransactionRollback)
  txn_retry_count += 1
  retry if txn_retry_count < RetryLimit
  raise
end

The transaction concurrency errors are all part of a specific family, which the current stable pg gem correctly reproduces in its exception hierarchy. However, ActiveRecord captures the exception and re-raises it as a statement error, forcing you to unwrap it one layer in your code.


On Python and Pickles

Currently foremost in my mind has been my annoyances with Python.

My current gripes have been with pickle.

Rather than taking a conventional approach and devising a fixed protocol/markup for describing objects and their state, they invented a small stack-based machine; the serialisation library writes bytecode that drives it to restore the object state.

If this sounds like overengineering, that’s because it is. It’s also overengineering that’s introduced potential security problems which are difficult to protect against.

Worse than this, rather than throwing out this mess and starting again when it was obvious that it wasn’t meeting their requirements, they just continued to extend it, introducing more opcodes.

Never mind that, when put up against simpler serialisation approaches such as state marshalling via JSON, it's inevitably slower and significantly more dangerous.
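To make the stack-machine point concrete, here's a small illustration (mine, not from the original post): `pickletools` will disassemble the opcode stream for even a trivial object, and `__reduce__` shows why loading untrusted pickles amounts to arbitrary code execution:

```python
import pickle
import pickletools

# Even a trivial dict serialises to bytecode for pickle's stack machine;
# genops() walks the opcode stream the unpickler executes to rebuild it.
data = pickle.dumps({"job": "send_email", "args": [1, 2]}, protocol=2)
ops = [op.name for op, arg, pos in pickletools.genops(data)]
print(ops)  # opcode names; the stream always terminates with STOP

# The security hazard: REDUCE invokes an arbitrary callable at load time.
class Evil:
    def __reduce__(self):
        return (print, ("this ran during pickle.loads",))

pickle.loads(pickle.dumps(Evil()))  # runs print() instead of restoring state
```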

And then people like the celery project guys go off and make pickle the default marshalling format for their tools rather than defaulting to JSON (which they also support).

Last week, I got asked to assist with interpreting pickle data so we could peek into job data that had been queued with Celery. From Ruby.  The result was about 4 hours of swearing and a bit of Ruby coding to produce unpickle. I’ve since tidied it up a bit, written some more documentation, and published it (with permission from my manager of course).

For anybody else who ever has to face off against this ordeal, there's enough documentation inside the Python source tree (see Lib/ and Lib/) that you can build the pickle stack machine without having to read too much of the original source. It also helps if you are familiar with PostScript, as the pickle machine's dictionary, tuple and list constructors work very similarly to PostScript's array and dictionary constructs (right down to the use of a stack mark during construction).


Animania Sydney 2011 Photos and Wrap-up

Well, Animania has come and gone. I've posted my photos up to Flickr as usual. Spent plenty of time chatting with Kris and a few others of the regular anime convention photography crowd with whom I've been interacting since I got back into things.

Very low shutter count again this time – I think only 100 for the entire weekend over both cameras, and even that may overstate reality. After SMASH! I'm finding I'm a lot more selective about how I take shots, and I'm shooting a lot more with Flash Exposure Bracketing turned off, so I'm losing a few more "would have been good" shots than I used to, but have a lot less to cull overall, and the average quality is a lot higher.

The bulk of Saturday’s photos were taken single strobe – the later ones done with strobe off camera.  Mostly used the 5D, but took a few notables with the 50D + 50mm/1.8 combo mostly so I would have some material to defend my “you can do this with cheap, mainstream gear” claims. 🙂

The majority of Sunday’s photos were taken dual strobe – main off camera, on-camera unit for fill.  Having two 580s is handy that way.  Only took the 5D, but switched between the 70-200 and the 17-40 a fair bit.  Really need to pick up a 24-70 at some point.


SMASH 2011 Photos and Wrap-up

SMASH! has come and gone. It was an excellent effort given it was its first time at the Sydney Convention Centre.

Unfortunately, I was quite sick this past week and have only just gotten over that – so I cut my kit down heavily (I only brought one camera, 2 lenses and my speedlights), kept my activity down to a pretty low level and literally only took a few (about 40 total) photos.

By contrast, at both this year's and last year's Supanovas, my shot counts were up around 2000 photos for the whole weekend, which I have to cull pretty heavily to get down to the 100–200 photos that get posted. Also, because I've taken so many photos of so many people, I feel compelled to publish at least some of the poorer photos if they're still 'viable' – when they feature a cosplayer who has gone to great effort to make their costume and I simply don't have a better photo of them.

This time, because it was so short and sweet, the culling and selection was extremely easy, and the quality of the results compared to some of the other events speaks for itself.

The gallery of photos can be found on Flickr.

Technical Details

(Because for once, my workflow stripped the EXIF – I’ll work out why later…)

All photos were taken with an EOS 5D Mk1 with EF 70-200mm f/2.8L IS. (I had my EF 17-40mm f/4L in my bag, but didn’t use it at all).

The first 6 (predominantly outdoor) photos had fill-flash from a single on-camera Speedlite 580EX II with Rogue Flexible Flash Bounce Card. After I shot the first few photos, I also connected the Speedlite to a CP-E3 battery pack to improve its cycle time.

The last 6 (indoor) photos were lit using an off-camera Speedlite 580EX II with Rogue Flexible Flash Bounce Card. Triggering was done via a standard Canon ST-E2 trigger. Flash held by captive flash bunny (thanks Retro!) – the camera weighs about 3kg in this configuration and I can’t balance the 70-200mm for conventional shooting one handed without introducing a LOT of shake to the camera.

All flash metering was Automatic. No Auto Exposure Bracketing or Flash Exposure Bracketing (FEB) was in use.

All post-production was done in Aperture from Camera RAWs. All edits are crop, exposure, dynamic range, and vignetting only. For once, I needed to do almost no cropping.


Spring War Photos

Mordenvale's Spring War has come and gone for this year. It was a good event with great fighting (although the first day of war saw me forced to retire due to a hand injury).

I took my full camera kit with me to the event and got some nice photos during the first few rounds of the “Attain Speed” tournament.

These are up on Flickr.



[Updated 31 Aug 2010]
[Updated again 6 Sep 2010]

Just ran smallpt against a few machines here:

CPU                              | OS                       | Compiler                | Cores / Processors | Time (s, 100spp)
AMD Athlon64 3800+               | Linux amd64              | G++ 4.4.1               | 1                  | 365.181
Intel Xeon 2.4GHz                | Linux i386               | G++ 4.4.3               | 2 x 2-way HT       | 358.000
Intel Itanium2 900MHz (McKinley) | Linux ia64               | G++ 4.3.2               | 1                  | 1366.38
Sun UltraSparc 3i @ 1GHz         | Solaris 10, 64-bit Sparc | G++ 3.4.3               | 1                  | 3384.46
Intel Core2Duo E6850 (3.0GHz)    | Linux amd64              | G++ 4.2.4               | 1 x Dual-core      | 177.46
Intel Core2Duo P8700 (2.53GHz)   | OS X 10.6.4              | G++ 4.2.1               | 1 x Dual-core      | 138.36
Intel Core2Duo E5200 (2.5GHz)    | Linux amd64              | G++ 4.4.3               | 1 x Dual-core      | 142.50
Intel Core2Duo E8400 (3.0GHz)    | Linux amd64              | G++ 4.4.3 (static link) | 1 x Dual-core      | 117.96

These figures are in no way scientific and should be considered ballpark figures only.  No efforts were made to reduce system load in order to run these tests, but systems used for these tests weren’t particularly loaded to begin with.

Linux builds were compiled with whatever the latest version of G++ installed was, using -O2 (except for the ia64 run which was built with -O3 by accident)

OS X refused to build a binary with OpenMP support that didn’t die very rapidly from a bus error. As a result, the test couldn’t utilise both CPU cores.  Please adjust expectations accordingly.  Build was with -O2 -ffast-math.

[Edits below]

The OSX figures have been updated to use OpenMP thanks to Brian’s advice.  Built using -O2.

The rather noticeable difference in speed between the E6850 and the P8700 is probably due to the different memory systems or the lower core/bus contention on the P8700 (although if it was the latter, I’d expect the margin to be smaller – the difference is only 9 vs 9.5) – it’s hard to say without doing more digging to see where this is slowing down.

The E6850 box is using an XFX branded nVidia nForce 680i motherboard which only provides a DDR2 memory interface – and the system in question is decked out with 4GBs of Corsair low-latency DDR2-800.

The P8700 is an Apple Macbook Pro 13″ 2.53Ghz (Mid-2009) which uses the stock 4GBs of DDR3-1066.

I’ve just added my work E5200 to the mix, and it too is getting scores comparable to the Penryn. I’ll have to re-run on the E6850 to verify the times.

[Updated again]

After a bit of research, I've managed to isolate the cause of the speed discrepancy: it's most likely the result of the design upgrades from Conroe to the Penryn/Wolfdale family. I am surprised that the result is so pronounced.

[Updated again again]

I found an E8400 (Wolfdale, 3.0GHz, 1333MHz FSB) system to run smallpt on, and sure enough, it scores proportionally to the Penryn and E5200.
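Using the figures from the table above, the per-clock comparison makes this explicit – the E6850 (Conroe) and E8400 (Wolfdale) run at the same 3.0GHz:

```python
# Same 3.0GHz clock, so the ratio is (mostly) microarchitectural -
# though note the E8400 build was static-linked, per the table.
e6850_conroe = 177.46    # seconds
e8400_wolfdale = 117.96  # seconds
print(f"Wolfdale is {e6850_conroe / e8400_wolfdale:.2f}x faster at the same clock")
```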