When we represent uncompressed audio digitally, we often represent it as a one-dimensional array of values corresponding to the amplitude of the waveform over time.
However, psycho-acoustically, audio does not feel like it is described by a single real number for each moment of time. Rather, when we hear audio, we notice all sorts of qualities like pitch and timbre. These qualities are best described as “spectral properties”: they are most readily identified from the Short-Time Fourier Transform (STFT) of the signal. Here is an example of plotting the STFT of someone speaking the word “one”.
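As a minimal sketch of computing an STFT, here is how it might look with SciPy. The sample rate, window length, and the synthetic 440 Hz tone standing in for recorded speech are all illustrative choices, not values from the original example:

```python
import numpy as np
from scipy.signal import stft

fs = 16000                       # sample rate in Hz (assumed)
t = np.arange(fs) / fs           # one second of time samples
x = np.sin(2 * np.pi * 440 * t)  # a 440 Hz tone as a stand-in for speech

# STFT with a 512-sample window; Zxx is complex, so take the magnitude
f, times, Zxx = stft(x, fs=fs, nperseg=512)
spectrogram = np.abs(Zxx)

print(spectrogram.shape)  # (frequency bins, time frames)
```

Plotting `spectrogram` (often on a log scale) with frequency on one axis and time on the other produces the kind of two-dimensional image described above.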
Although not totally intuitive to read, such two-dimensional plots of audio tend to give a better (more interpretable) picture of what an audio clip sounds like than the raw waveform does.
This observation matters when we do further processing on audio, for example running machine learning algorithms on it. Applying an STFT to audio before feeding it into an algorithm can make it easier for that algorithm to pick up on the features our ears hear. This approach can be used, for example, when using a CNN (Convolutional Neural Net) to detect features in audio: first we perform the STFT, then we treat the resulting plot (known as a spectrogram) as an image and apply 2D convolutional layers to it to isolate features. This often yields better results than trying to use 1D convolution on the raw audio waveform.
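To make the “spectrogram as image” idea concrete, here is a sketch of a single 2D convolution pass over a magnitude spectrogram. In a real CNN the kernel weights would be learned; here the spectrogram is random data and the kernel is a hand-picked edge detector, both purely illustrative:

```python
import numpy as np
from scipy.signal import convolve2d

# Hypothetical magnitude spectrogram: 257 frequency bins x 64 time frames
spec = np.random.rand(257, 64)

# A hand-crafted 3x3 kernel that responds to energy changes over time
# (a learned CNN layer would contain many such kernels with trained weights)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

# Convolving the spectrogram like an image yields a 2D feature map
feature_map = convolve2d(spec, kernel, mode="valid")
print(feature_map.shape)  # (255, 62)
```

A deep-learning framework such as PyTorch or TensorFlow would stack many of these convolutions, with nonlinearities in between, but the core operation on the spectrogram is the same.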
Hope this gives you some ideas when working with audio!