CUDA Optimizations

Lately, I have been experimenting with CUDA to see how small changes can affect an application's performance.

GPUs are optimized for crunching huge amounts of data. But what if we need to transfer small pieces of data, such as a constant or a flag, back and forth between the host and the device? This experiment measures memory allocation time and memory transfer time while changing the order in which the large and small data are allocated.
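To make the setup concrete, here is a minimal sketch of how such an experiment could be written. It is not the exact code that was profiled: the buffer sizes (one `int` and 256 MiB), the `L` command-line switch, the `CHECK` macro, and the `timeCopyMs` helper are all assumptions made for illustration. It also calls `cudaFree(0)` up front so that CUDA context creation is not billed to whichever allocation happens first, which may differ from the original measurement.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                    cudaGetErrorString(err_), __LINE__);          \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical helper: time one host-to-device copy in milliseconds
// using CUDA events recorded on the default stream.
static float timeCopyMs(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main(int argc, char **argv) {
    // Pass "L" as the first argument for the large-data-first order.
    const bool largeFirst = (argc > 1 && argv[1][0] == 'L');
    const size_t smallBytes = sizeof(int);          // assumed "flag" size
    const size_t largeBytes = (size_t)256 << 20;    // assumed 256 MiB

    // Assumption: create the CUDA context before timing anything.
    CHECK(cudaFree(0));

    int flag = 1;
    char *hLarge = (char *)malloc(largeBytes);
    if (!hLarge) return EXIT_FAILURE;
    memset(hLarge, 1, largeBytes);

    int *dSmall = nullptr;
    char *dLarge = nullptr;

    // Time both allocations together; only their order changes.
    auto t0 = std::chrono::steady_clock::now();
    if (largeFirst) {
        CHECK(cudaMalloc(&dLarge, largeBytes));
        CHECK(cudaMalloc(&dSmall, smallBytes));
    } else {
        CHECK(cudaMalloc(&dSmall, smallBytes));
        CHECK(cudaMalloc(&dLarge, largeBytes));
    }
    auto t1 = std::chrono::steady_clock::now();
    double allocMs =
        std::chrono::duration<double, std::milli>(t1 - t0).count();

    // Copy in the same order as the allocations.
    float smallMs = 0.0f, largeMs = 0.0f;
    if (largeFirst) {
        largeMs = timeCopyMs(dLarge, hLarge, largeBytes);
        smallMs = timeCopyMs(dSmall, &flag, smallBytes);
    } else {
        smallMs = timeCopyMs(dSmall, &flag, smallBytes);
        largeMs = timeCopyMs(dLarge, hLarge, largeBytes);
    }

    printf("order: %s first\n", largeFirst ? "large" : "small");
    printf("alloc: %.3f ms, small copy: %.3f ms, large copy: %.3f ms\n",
           allocMs, smallMs, largeMs);

    CHECK(cudaFree(dSmall));
    CHECK(cudaFree(dLarge));
    free(hLarge);
    return 0;
}
```

Built with `nvcc`, the program can be run once with no argument and once with `L` to compare the two orders. A single run is noisy, so it is worth repeating each order several times before drawing conclusions.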

Profiling the code showed that allocation takes longer when the larger data is allocated first. The total memory copy time is also higher when the large data is transferred first. In that large-data-first scenario, transferring the small data actually takes longer than transferring the large data.

