Categories
Tech

smallpt

[Updated 31 Aug 2010]
[Updated again 6 Sep 2010]

Just ran smallpt against a few machines here:

| CPU | OS | Compiler | Cores / Processors | Execution time(s) at 100spp, in seconds |
|---|---|---|---|---|
| AMD Athlon64 3800+ | Linux amd64 | G++ 4.4.1 | 1 | 365.181, 387.036 |
| Intel Xeon 2.4GHz | Linux i386 | G++ 4.4.3 | 2 x 2-way HT | 358.000, 363.824 |
| Intel Itanium2 900MHz (McKinley) | Linux ia64 | G++ 4.3.2 | 1 | 1366.38, 1366.28 |
| Sun UltraSparc 3i @ 1GHz | Solaris 10, 64-bit Sparc | G++ 3.4.3 | 1 | 3384.46 |
| Intel Core2Duo E6850 (3.0GHz) | Linux amd64 | G++ 4.2.4 | 1 x Dual-core | 177.46, 180.05 |
| Intel Core2Duo P8700 (2.53GHz) | OS X 10.6.4 | G++ 4.2.1 | 1 x Dual-core | 138.36, 139.68 |
| Intel Core2Duo E5200 (2.5GHz) | Linux amd64 | G++ 4.4.3 | 1 x Dual-core | 142.50, 145.98 |
| Intel Core2Duo E8400 (3.0GHz) | Linux amd64 | G++ 4.4.3 (static link) | 1 x Dual-core | 117.96, 118.42 |

These figures are in no way scientific and should be considered ballpark figures only.  No efforts were made to reduce system load in order to run these tests, but systems used for these tests weren’t particularly loaded to begin with.

Linux builds were compiled with the latest version of G++ installed on each machine, using -O2 (except for the ia64 run, which was built with -O3 by accident).

OS X refused to build a binary with OpenMP support that didn’t die very rapidly from a bus error. As a result, the test couldn’t utilise both CPU cores.  Please adjust expectations accordingly.  Build was with -O2 -ffast-math.

[Edits below]

The OS X figures have been updated to use OpenMP thanks to Brian’s advice.  Built using -O2.
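As a rough illustration of what the OpenMP build changes: with GCC, a pragma like the one below is only honoured when compiling with -fopenmp; without it the loop simply runs on one core, which is the situation the first OS X run was in. This is a minimal sketch in C, not smallpt's code; `render_rows` and its per-pixel arithmetic are hypothetical stand-ins for the real per-row rendering work.

```c
/* Hypothetical stand-in for a renderer's outer loop: each row is
 * independent, so rows can be handed out to threads one at a time. */
void render_rows(double *out, int width, int height)
{
    /* Ignored (the loop runs serially) unless built with -fopenmp. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int y = 0; y < height; y++) {
        double acc = 0.0;             /* declared inside the loop, so per-thread */
        for (int x = 0; x < width; x++)
            acc += (double)(x + y);   /* placeholder for per-pixel work */
        out[y] = acc;
    }
}
```

Handing out one row at a time with a dynamic schedule tends to suit workloads where some rows cost far more than others, at the price of a little scheduling overhead.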

The rather noticeable difference in speed between the E6850 and the P8700 is probably due to the different memory systems or the lower core/bus contention on the P8700 (although if it were the latter, I’d expect the margin to be smaller, since the core-to-bus clock multipliers differ only slightly: 9 vs 9.5). It’s hard to say without doing more digging to see where this is slowing down.

The E6850 box is using an XFX-branded nVidia nForce 680i motherboard which only provides a DDR2 memory interface, and the system in question is decked out with 4GB of Corsair low-latency DDR2-800.

The P8700 machine is an Apple MacBook Pro 13″ 2.53GHz (Mid-2009), which uses the stock 4GB of DDR3-1066.

I’ve just added my work E5200 to the mix, and it too is getting scores comparable to the Penryn. I’ll have to re-run on the E6850 to verify the times.

[Updated again]

After a bit of research, I’ve managed to isolate the cause of the speed discrepancy: it is most likely the result of the design upgrades from Conroe to the Penryn/Wolfdale family. I am surprised that the result is so pronounced.

[Updated again again]

I found an E8400 (Wolfdale 3.0GHz, 1333MHz FSB) system to run smallpt on, and sure enough, it scores proportionally to the Penryn and E5200.

Categories
Software Development

Adventures in 64bit cleanup

I’ve been doing a bit of clean-up in Linux/FOSS code for 64-bit systems, and it’s starting to scare me just how much crap filters into Linux distributions every now and then without anybody noticing.

nss-mdns was today’s violator – the Multicast DNS NSSwitch module (Multicast DNS is sometimes better known as Bonjour or Avahi).

What’s particularly disturbing is that reading through the code reveals that the author suffered from the fatal “all the world is 32-bit” mindset when he wrote it.  I’m surprised nobody else picked up the unaligned access warnings flying up their console. Then again, very few people use Itaniums or other 64-bit systems with strict alignment as a desktop system these days.
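The post doesn't show the nss-mdns change itself, so the following is only a generic illustration of this class of bug, not the actual fix. Casting a byte pointer at an arbitrary offset to a wider integer type and dereferencing it is what trips strict-alignment machines; copying the bytes out with `memcpy` avoids the problem. The name `load_u32` is made up for this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an arbitrary byte offset.
 *
 * The tempting version, *(const uint32_t *)p, assumes p is 4-byte
 * aligned and is undefined behaviour in C when it is not. Copying
 * into a properly aligned local makes no alignment assumption about
 * p at all. */
uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Compilers generally turn a fixed-size `memcpy` like this into a plain load on architectures that tolerate unaligned access, so the safe form rarely costs anything there.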

A small amount of hackery and fidgeting later, the error has gone away (yay!), and the bugfix was submitted.

The other fun fix was suppressing the unaligned access fix-up handler in the Parrot configuration tests so they could actually work out the correct pointer alignment size.  This little piece of magic is done by using prctl(). The fix was submitted here.
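For reference, a minimal sketch of how the fix-up can be turned off with prctl() on Linux. The post doesn't say which flags the Parrot patch used; this assumes `PR_SET_UNALIGN` with `PR_UNALIGN_SIGBUS`, which asks the kernel to deliver SIGBUS on an unaligned access rather than quietly repairing it. The wrapper name `trap_unaligned_access` is hypothetical.

```c
#ifdef __linux__
#include <sys/prctl.h>
#endif

/* Ask the kernel to raise SIGBUS on unaligned accesses instead of
 * fixing them up. Returns 0 on success, or -1 if the request is
 * rejected (the option is only supported on some architectures,
 * ia64 among them) or this is not a Linux build. */
int trap_unaligned_access(void)
{
#if defined(__linux__) && defined(PR_SET_UNALIGN)
    return prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS, 0, 0, 0) == 0 ? 0 : -1;
#else
    return -1;
#endif
}
```

A configuration probe would call this once before attempting misaligned loads, so that a misaligned pointer size shows up as a crash rather than a kernel-assisted success.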

Categories
Software Development

ia64: Plan9, Compilers and ABIs

So, I have my second-hand HP zx2000 (single-CPU Itanium2 workstation) running in my room.  (OK, this itself is a mistake: it’ll be moved into the home office once I get sick of the added heat in my room.)

For some bizarre reason, I seem to have decided that trying to port Plan9 to it would be a good idea.

I’ve started studying the architecture and standard ABI documentation and I’m still trying to get my head around little details, but the whole thing seems pretty doable if I beat kencc into shape first.

The standard ABI register usage suggests a mixture of caller-save and callee-save conventions (some of the global registers are available as caller-save scratch). This should only require minimal changes to kencc: it’s a case of teaching kencc to work out how many extra registers it needs for any given proc for optimal results, allocating them dynamically via the appropriate mechanism, and then ignoring their save/restore on call/return.  That itself shouldn’t hurt kencc much (unlike on sparc32 and the like, where you need to work almost exclusively in the callee-save model to get the best results from register windows, which is fairly contrary to how kencc thinks about and allocates registers), but it will make context switching and debugging a bit more complicated.

Alternatively, we could just ignore register spill/fill and try to cram ourselves into the scratch registers only.  This would probably sit well with most Plan9 developers.

The last (and equally insane) option is to meet the minimum requirements for spill/fill (so EFI calls that allocate registers won’t kill us), but allocate all the registers and treat them as caller-save globals.

This will make context saves even more expensive (saving 128 64-bit registers WILL suck), but is simple.
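To put a number on that, here is a back-of-the-envelope sketch covering only the 128 general registers. The struct and function names are hypothetical, and a real ia64 context would also have to carry floating-point, predicate and branch registers plus NaT bits, so treat the result as a lower bound.

```c
#include <stddef.h>
#include <stdint.h>

#define IA64_NUM_GR 128   /* general registers r0..r127 */

/* Hypothetical save area holding just the 64-bit general registers. */
struct gr_save_area {
    uint64_t gr[IA64_NUM_GR];
};

/* Bytes written on every context save for the general registers
 * alone: 128 registers * 8 bytes each = 1024 bytes. */
size_t gr_save_area_bytes(void)
{
    return sizeof(struct gr_save_area);
}
```

That is a kilobyte stored and a kilobyte reloaded on each switch before any other architectural state is counted.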

Anyway, this isn’t the really hard bit – as far as I can tell, the hard bit is fixing the 9 assembler/loader to produce good ia64 machine code and pick sensible optimisations.

Categories
Tech

At least I know I’m not imagining it…

I’ve recently had the displeasure of having to update the copy of WANPIPE that we ship with our product at work from the old stable 2.2 family to the beta 3.3 family in order to support Sangoma’s new synchronous serial adapter (the A14x family, which is replacing the old S514x family).  We use these cards to support Frame Relay communications.  Frame Relay is still reasonably popular in Australia for private point-to-point communications and, to be frank, our product with a supported Sangoma card and annual support probably still costs substantially less than the cheapest Cisco with 100Mbit Ethernet plus sync serial for Frame Relay and equivalent support.

So far I’m not impressed with the A142 kit or drivers.

The kit itself is pretty shoddy.  Whilst the card is a nice small dual-layer card which will fit into a low-profile PCI slot (and even comes with the half-height edge-bracket), the cabling that comes with it is atrocious.  The card has a mini-Centronics connector with screw-terminals which the Y cable attaches to (the A142 is a two-port card, and is the smallest they offer now), and that itself is OK.  The dodgy part comes in with the V.35 cable kit, which attaches a V.35 to DB25 cable to the DB25 Y cable that you screw into the card.  The main problem is that both the DB25M on the V.35 cable and the DB25F on the Y cable have screw terminals, so you can’t secure the two to each other.  That makes it the weakest (and usually highest-tension, because the Y cable is short and usually dangles off the back of the system) join in your V.35 cable run.

And then there were the drivers…  After spending a while chasing my own tail because my old 2.4.30 build tree had been damaged (but was still churning out valid modules, just without valid ksyms), I finally was able to get a build of the 3.3 modules that worked.

Then I had to slave away accommodating the new user-space tools and the changed configuration file layout for wanconfig (the WANPIPE stack configuration tool), only to discover that they’ve managed to break DLCI state indication in two places.  First, you used to be able to check IFF_RUNNING to find out whether the DLCI was active; not anymore.  Second, the LIP layer (Sangoma’s own WAN stack) reports transitions in card and DLCI state via dmesg.  I did my usual FRAD power-down test to make sure it was even tracing DLCI failure correctly, and LIP didn’t even notice the DLCI had gone silent/failed until I turned the FRAD back on and it was in its resynchronising state.
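For context, the IFF_RUNNING check is the ordinary Linux interface-flags query; a minimal sketch follows. The helper name `if_is_running` is hypothetical, and the interface name to pass is whatever network interface the DLCI is bound to on a given system, which the post doesn't specify.

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

/* Returns 1 if the named interface has IFF_RUNNING set, 0 if it does
 * not, and -1 if the flags could not be read (for example, no such
 * interface). */
int if_is_running(const char *ifname)
{
    struct ifreq ifr;
    int fd, rc;

    /* Any socket will do; it is only a handle for the ioctl. */
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    rc = ioctl(fd, SIOCGIFFLAGS, &ifr);
    close(fd);
    if (rc < 0)
        return -1;

    return (ifr.ifr_flags & IFF_RUNNING) ? 1 : 0;
}
```

A monitoring loop would poll this per DLCI interface, which only works if the driver keeps the flag in step with the link state.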

At least I got an email back from Sangoma technical support confirming that they had indeed broken the support unintentionally, and it’ll be fixed next release.  Shame they didn’t mention when that would be.