Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add encoding and bits_per_sample option to save function #1226

Merged
merged 12 commits into from
Feb 12, 2021

Conversation

mthrok
Copy link
Collaborator

@mthrok mthrok commented Feb 2, 2021

The current release version's "sox_io" backend's save function changes the encoding of the audio file based on the dtype of the Tensor. For example, if the dtype is "float32", then it will be saved as 32bit floating point PCM. This behavior was taken from SciPy's scipy.io.wavefile.write function. However it was pointed out that this is inconvenient for torchaudio users. Because most torchaudio's functionality works on float32 Tensor yet, the common audio formats typically retains only 16 bit, such as 16 bit signed integer PCM.

To resolve the inconvenience while keeping the functionality to support different encodings, this PR

  1. Add encoding and bits_per_sample parameters to save function.
  2. For non-compressed format (such as "wav"), it defaults to 16-bit signed integer PCM. (This is BC-breaking behavior if users were dumping Tensor object without converting to the matching dtype) The default encoding/bit depth are now determined by dtype, which is not BC-breaking.

Note in #1204, we attempted to solve the issue by adding dtype argument, but after giving more thoughts I concluded that this approach is limited and the design is not intuitive, so we will replace the dtype argument with encoding and bits_per_sample. This will allow other encodings such as mu-law and a-law.
For the detailed behavior of the encoding and bits_per_sample, see the table at the bottom of this description.

Along the way, the implementation of type conversion is updated to use the implementations that libsox provides. So the new implementation of save function yields the same result as sox command.

Table: the provided values and the actual values to be used

  • For WAV format without encoding or bits_per_sample option, the encoding configuration is determined by dtype.
  • If an invalid value is provided, it raises RuntimeError
  • Other formats, like mp3 and vorbis, do no support changing encoding and bits_per_sample, so if users provide some values, they will be ignored with warning.

Provided arguments

Resulting value (if differ)

format

encoding

bits_per_sample

encoding

bits_per_sample

wav, amb

None

None

*

*

8

PCM_U

 

16

PCM_S

 

24

PCM_S

 

32

PCM_S

 

PCM_S

None


32

16


 

24


 

32


 

PCM_U

None


8

8


 

PCM_F

None


32

32


 

ULAW

None


8

8


 

ALAW

None


8

8


 

FLAC

N/A

None

 

24

8


 

16


 

24


 

Shpere

None

None

PCM_S

32

8

PCM_S

 

16

PCM_S

 

24

PCM_S

 

32

PCM_S

 

PCM_S

None


32

8


 

16


 

24


 

32


 

ULAW

None


8

8


 

ALAW

None


8

8


 

16


 

24


 

32

 

 

@mthrok mthrok added this to the v0.8 milestone Feb 3, 2021
@mthrok mthrok force-pushed the save-encoding branch 10 times, most recently from f0f83a0 to 5d6f45e Compare February 9, 2021 22:32
@mthrok mthrok force-pushed the save-encoding branch 2 times, most recently from 44ea317 to ae897e6 Compare February 10, 2021 23:41
@mthrok mthrok changed the title Save encoding Add encoding and bits_per_sample option to save function Feb 10, 2021
throw std::runtime_error("Unsupported dtype.");
namespace {

std::tuple<sox_encoding_t, unsigned> get_save_encoding_for_wav(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function implements the mapping from user-provided encoding configuration to the actual encoding configuration for "wav" and "amb" format.

throw std::runtime_error(message.str());
}

std::tuple<sox_encoding_t, unsigned> get_save_encoding(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function implements the mapping from user-provided encoding configuration to the actual encoding configuration.

@mthrok mthrok marked this pull request as ready for review February 11, 2021 05:43

Unverified

The committer email address is not verified.
@cpuhrsch
Copy link
Contributor

Two general comments:

  1. Mapping from -1 to 1 32 bit floating point numbers to 16 bits is lossy in the sense that 16 bits are not enough to store the significant bits of the float. This matters also because this PR changes the default from 32 to 16, which then introduces a loss.
  2. When the user went out of their way to request a specific bits_per_sample and a particular format plus encoding doesn't allow that, I think we should throw an error and not a warning, because the request behavior doesn't work. We can still change that if user feedback indicates that they prefer it to just work (and if so, why not just use None to begin with).

// Convert to sox_sample_t (int32_t) and write to buffer
SOX_SAMPLE_LOCALS;
const auto dtype = tensor_.dtype();
if (dtype == torch::kFloat32) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if you want you should be able to use a switch statement for this as well

case Encoding::NOT_PROVIDED:
switch (bits_per_sample) {
case BitDepth::NOT_PROVIDED:
if (dtype == torch::kFloat32)
Copy link
Contributor

@cpuhrsch cpuhrsch Feb 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you should be able to use the switch statement for these types too, if you want.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which type is it? I am getting the following errors; I tried with c10::kFloat and others but got the same result

../../torchaudio/csrc/sox/utils.cpp:219:23: error: switch quantity not an integer
           switch(dtype) {
                       ^
../../torchaudio/csrc/sox/utils.cpp:220:25: error: could not convert ‘torch::kFloat32’ from ‘const c10::ScalarType’ to ‘<type error>’
             case torch::kFloat32:
                         ^~~~~~~~

Copy link
Contributor

@cpuhrsch cpuhrsch Feb 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use the conversion to ScalarType, which will also guarantee that the TypeMeta is indeed a ScalarType (which all your cases seem to assume).

@mthrok mthrok merged commit c3cb201 into pytorch:master Feb 12, 2021
@mthrok mthrok deleted the save-encoding branch February 12, 2021 22:14
mthrok pushed a commit to mthrok/audio that referenced this pull request Feb 26, 2021
This commit sends a new event to GA that tracks tutorial clicks with the tutorial URL

Co-authored-by: Brandon Green <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants