So, I starting playing with Raytracers again – writing my own, following the guidance of Raytracing from the Ground Up. In doing so, I’ve gotten an opportunity to throw the code over various modern Multi-core machines to get a bit of an interesting look into FP performance in modern systems.
The raytracer in question is up on Github – it’s a simple raytracer that is FP intensive, but not particularly memory intensive. That said, if it’s compiled against debug libraries, it performs extremely poorly due to the Qt (Ed: and MSVC) debugging hooks. It doesn’t try to be too clever since most modern CPUs are fast enough to not need to be.
Last time I did this, it was with smallpt, a decade ago and the story was very different.
This time, rough times have been gathered for a 800×800 raytrace, 100 samples per pixel, with DoF simulation, intersecting with about 30 objects. This produces a minimum of 64 million ray intersections. A sample output is below.
CPU | OS | STT | C/T | MTT | Compiler |
---|---|---|---|---|---|
i7-3770k | Linux | 23.1s | 4/8 | 5.3s | g++ 7.5.0 |
i7-4770 | Windows | 24.0s | 4/8 | 6.5s | MSVC 2019 |
Ryzen 7 3700X | Windows | 23.2s | 8/16 | 2.7s | MSVC 2019 |
i9-9900k | Windows | 16.4s | 8/16 | 2.3s | MSVC 2019 |
In the above table, STT is single-thread time (in seconds), C/T is the Cores vs Threads count. MTT is the multi-threaded time using all available hardware threads. Compiler is the compiler used to produce the binary.
This is a very unscientific test – testing has been done without load shedding, and is typically roughly taken from the worse non-outlier of 3-4 runs and rounded to the nearest 100ms. As my raytracer has a Qt UI, there is measurable UI overhead on Windows – that’s partially reflected by the difference in the 3770 and 4770 times.
[Edit 2020-06-30: I’ve written a basic CLI front-end to the raytracer and that’s also revealed there’s compiler optimisation differences too – I’ll post an update with new times when I’ve finished collecting them]
There is no explicit vectorisation in any build. Autovectorisation is enabled for AVX, but not AVX-512.
Windows builds were performed using MSVC 2019 without tuning biased towards intel or AMD and the same binary set was used on all systems tested.
The surprising result is just how poorly the Ryzen does compared to the older Generation i7s, given that the 3rd and 4th gen i7s are 7-8 years old now – it’s only saving grace is the fact it has double the cores packed into the chip.
It’s also important to note that this is purely a from-cache FP intensive task – it barely hits the memory interface, so the much faster memory interfaces in the later CPUs will not make a difference here.