However, when I try to derive it step by step (replacing every gate with its NAND equivalent and then cancelling any two NAND inverters in cascade), I always end up with 5 NAND gates instead of 4. Is there a step-by-step procedure for implementing an XOR gate using just 4 NAND gates?

XOR using 4 NAND

The reduction I used and the resulting circuit
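
For reference, here is a minimal Python sketch that checks both circuits against the XOR truth table. It assumes the 5-gate result is the usual one obtained from A·B' + A'·B by gate-by-gate replacement, and that the 4-gate circuit is the common form in which NAND(A, B) is shared by both middle gates. The function names are only illustrative.

```python
from itertools import product

def nand(a: int, b: int) -> int:
    """2-input NAND on bits 0/1."""
    return 1 - (a & b)

def xor_5_nand(a: int, b: int) -> int:
    # Assumed 5-gate form: A.B' + A'.B with each inverter built as NAND(x, x).
    not_a = nand(a, a)
    not_b = nand(b, b)
    return nand(nand(a, not_b), nand(not_a, b))

def xor_4_nand(a: int, b: int) -> int:
    # Assumed 4-gate form: the shared gate m = NAND(A, B) replaces both inverters.
    m = nand(a, b)
    return nand(nand(a, m), nand(b, m))

# Print the truth table: A, B, 5-gate output, 4-gate output.
for a, b in product((0, 1), repeat=2):
    print(a, b, xor_5_nand(a, b), xor_4_nand(a, b))
```

The two versions agree because A·(AB)' = A·(A' + B') = A·B', and likewise B·(AB)' = A'·B, so a single NAND(A, B) can stand in for the two separate inverters. That substitution is not something pure gate replacement followed by cancelling cascaded NANDs would produce on its own.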