Why do you think predictability is not possible on a multithreaded preemptively scheduled system? I agree that a nop or idle implementation would likely include a nanosecond interval, the implementation is then down to the target compiler and runtime.

A NOP for timing purposes on anything other than a single-threaded, single-core, non-instruction-cached CPU is pretty much useless.
On small CPUs/MCUs with predictable instruction cycle timing, a NOP takes a finite and predictable amount of time to execute (sans any interrupt processing that may occur before or after the instruction). But the execution time is dependent on clock speed, which is not the same for all applications, and may not be the same even within one single application.
Therefore, the construct NOP(n) is also non-portable.
The proper, portable way is to have a macro called something like NOP_ns(n), where the preprocessor would generate the proper number of NOPs based on a predefined clock speed.
This is similar to how the macros Delay_us() and Delay_ms() work.
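To make the idea concrete, here is a rough sketch of what such a macro could look like in C. Everything in it is an assumption for illustration: F_CPU as the name of the predefined clock speed (with a 16 MHz default), GCC/Clang inline-assembly syntax, a target where one NOP costs exactly one clock cycle, and the helper names NOP_1() through NOP_16() and NOP_NS_TO_CYCLES(). It also leans on the compiler folding constant conditions rather than on the preprocessor alone, so treat it as one possible approach, not a finished implementation.

```c
#include <assert.h>   /* static_assert (C11) */

/* Assumption: core clock in Hz, fixed and known at compile time. */
#ifndef F_CPU
#define F_CPU 16000000UL
#endif

/* Assumption: GCC/Clang inline assembly, and one NOP == one clock cycle. */
#define NOP_1()  __asm__ volatile ("nop")
#define NOP_2()  do { NOP_1(); NOP_1(); } while (0)
#define NOP_4()  do { NOP_2(); NOP_2(); } while (0)
#define NOP_8()  do { NOP_4(); NOP_4(); } while (0)
#define NOP_16() do { NOP_8(); NOP_8(); } while (0)

/* Nanoseconds to clock cycles, rounded up so the delay is never
   shorter than requested. */
#define NOP_NS_TO_CYCLES(ns) \
    ((unsigned long)(((unsigned long long)(ns) * F_CPU + 999999999ULL) \
                     / 1000000000ULL))

/* Largest cycle count the five building blocks above can express. */
#define NOP_NS_MAX_CYCLES 31UL

/* ns must be a compile-time constant. Each bit of the cycle count
   selects one block of NOPs; the conditions are constant, so an
   optimizing compiler keeps only the selected blocks. */
#define NOP_ns(ns) \
    do { \
        static_assert(NOP_NS_TO_CYCLES(ns) <= NOP_NS_MAX_CYCLES, \
                      "NOP_ns: delay too long for inline NOPs"); \
        if (NOP_NS_TO_CYCLES(ns) & 1UL)  NOP_1();  \
        if (NOP_NS_TO_CYCLES(ns) & 2UL)  NOP_2();  \
        if (NOP_NS_TO_CYCLES(ns) & 4UL)  NOP_4();  \
        if (NOP_NS_TO_CYCLES(ns) & 8UL)  NOP_8();  \
        if (NOP_NS_TO_CYCLES(ns) & 16UL) NOP_16(); \
    } while (0)
```

With the assumed 16 MHz clock, NOP_ns(250) works out to 4 NOPs, and the same source line gives 2 NOPs if F_CPU is redefined as 8 MHz. Anything an interrupt, cache or pipeline adds on top is still outside the macro's control, and longer delays are better left to a loop-based Delay_us() style routine.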