This is a valid point. I did some more digging after I wrote the original article, but I haven't had time to compile the results. I found dramatic performance differences between debug and release builds, with the Visual C++ compiler in particular: the code runs dramatically slower in the debug build than in the release build. That is expected behaviour, of course, which is why it was a complete surprise to see the exact opposite results with the Embarcadero C++ compiler (an older, customized version of Clang).
I was putting it all together in a spreadsheet and a corresponding article, and I have a draft of the follow-up, but alas life took priority and it isn't finished. The moral of the story is that skipping std::atomic is probably not worth it: the performance improvements aren't significant enough, especially considering that other CPU architectures might not support naked (non-atomic) reads.
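For anyone who wants to check the cost of std::atomic on their own compiler and build settings, here is a minimal sketch of the kind of comparison I mean. It is not the benchmark from the article: the helper names (sum_plain, sum_atomic, time_ns) are made up for illustration, the loop is single-threaded, and I'm assuming a relaxed load is the closest fair stand-in for a naked read.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical helper: reads a plain (non-atomic) int `iterations` times.
// `volatile` only stops the compiler from collapsing the loop into a
// multiplication; it does NOT make the read safe across threads.
std::uint64_t sum_plain(const volatile int& value, std::uint64_t iterations) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        total += static_cast<std::uint64_t>(value);
    }
    return total;
}

// Same loop, but every read is a relaxed std::atomic load.
std::uint64_t sum_atomic(const std::atomic<int>& value, std::uint64_t iterations) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        total += static_cast<std::uint64_t>(value.load(std::memory_order_relaxed));
    }
    return total;
}

// Hypothetical helper: runs a callable once and returns the elapsed
// wall-clock time in nanoseconds, measured with the steady clock.
template <typename F>
std::int64_t time_ns(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

To use it, wrap each call in a lambda, for example time_ns([&] { result = sum_atomic(shared, n); }), and run it in both debug and release builds. Keep the returned sum alive (print it or store it) so the optimizer cannot discard the loop, and treat the numbers as rough: a single-threaded loop says nothing about contention between cores.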