Replacing a bunch of FMAs with fewer selp/setp instructions is making the code slower. I wish #NVidia would also release cycle timings for the #Cuda PTX ISA.