At the time of this writing, 2011, Pure Data still works with
single precision floating point format as it's numeric datatype. With
double precision, some issues could be solved, like indexing of large
audiobuffers. But at what cost? Obviously, double
precision numbers consume double memory. But is double precision also
slower? Is it possible at all to have Pd work at double precision? |
In order to test performance and evaluate pros and cons, a double
precision Pd must first be built. In the past, work has been done to
enable double precision builds of Pd. However,
there is still code in Pd's core which is known to fail with type
double. This page documents alternatives to such
bits of code, and shows test results of double precision Pd,
performance-wise and precision-wise.
So far, I have mainly concentrated on the phase-wrapping classes
which were not yet ready for double precision because they use a
type-specific optimization trick. These classes are: phasor~, cos~,
osc~, vcf~ and tabosc4~. Another known incompatibility is macro
PD_BIGORSMALL in the API
header m_pd.h. Further, I have fixed issues which popped up while
testing mentioned rewritten classes. At the page bottom, you can
download .patch files to implement the code changes in vanilla Pd
0.43-0. A set of tests is included in the .zip. Mind that this is a
work in progress, and issues (hopefully only minor) are still to be
expected.
The precision of 32 bit floats is equivalent to 24 bit integers,
with the added advantage of a huge range and the associated ease of
programming. So what need is there to build a double
precision Pd? In exceptional cases, 32 bit floats can not properly
express a calculation result, due to rounding. Depending on the
calculation type, this may translate as aliasing in the audio or some
other unexpected effect. There is
always the option to write such calculations in a C external, using
integers for indexing and doubles for running sums and interpolation.
But for me, it is a delight to have double precision directly in a Pd
patch. Pd
is a fast dsp prototyping tool, exactly because you do not need to write things in tedious
low level syntax. Also, for educational purposes it is better to have
calculations in a Pd patch than hidden in a C file. I collected some
examples to illustrate precision bottlenecks in real world dsp
situations.
The first example is where I wanted to produce a 20 seconds
exponential chirp for IR measurement. When this was done as a Pd patch
with regular math objects, my colleague in the IR project perceived a
weird modulation in the high frequency end. A print of time indexes,
expressed in unit seconds, soon pointed to the cause. Numbers were
rounded to their bones. In some math routines, a different order of
operations may help out, but not in the exponential chirp case. The
chirp generator was done as a C external with double precision inside.
The time index sequences below, from regular Pd and my experimental
double precision Pd respectively, illustrate how double precision
numbers have enough 'numerical headroom' for this particular case,
where
the single precision numbers have not.
current Pd 0.43 |
double precision |
Another, possibly familiar, example is where unit vectors
corresponding to low frequencies are computed.
Cosine tends to be 1., while sine tends to be identical to the angle in
radians. This is caused by rounding. Together these incorrect sine
and cosine still produce a vector with radius ~1. If you use them both,
for example to make an oscillator or resonator, you can go down to
rather low but undefined frequencies. In contrast, doubles have just
enough
bits to represent sine and cosine for LFO frequencies with sufficient
accuracy. Looking at these numbers I guess that doubles would not
be of help in calculating planetary orbits, with their extremely low
frequencies!
current Pd 0.43 |
double precision |
You have probably noticed that single precision floats are
represented in Pd with a maximum of 6 significant digits. This is done
to hide inconsistencies like 33 * 0.3 = 9.900001, which happens when 7
decimal digits are shown. The limitation to 6 significant digits is
however very annoying when we want to type or see large integers, like
indexes of long audio buffers.
In calculations, Pd can use all integers from 0 to
2^24 = 16,777,216, over 16 million. Numbers with up to 8 decimal
digits. But in the GUI, 1 million is shown in scientific notation as
1e+06. This scientific notation also uses a maximum of 6 significant
digits. Therefore, the next representable number is 1.00001e+06, that
is 1 million and 10. See below how the result of 2^20 is rounded in the
GUI, as an example. The fact that numbers are also stored with patches
this way, makes
things worse. At 44100 Hz samplerate, one million samples is about 23
seconds - not at all that much. Longer audio buffers are allowed, but
handling them in the GUI becomes tricky.
current Pd 0.43 |
double precision |
Apart from the representation issue, 32 bit float
resolution leaves little room for interpolation in the case of long
audio buffers. The [soundfiler~] object loads a maximum of 4 million
samples into an array, longer soundfiles are truncated by default (can
be overruled). At 44100 Hz
samplerate, this is about 1.5 minute. The [tabplay4~] object can still
produce a decent sound with that length, provided you do not play at
too low speed.
The table below shows some fractional indexes in double precision
for speed 0.3. In single precision, these numbers can not all be
expressed, and aliasing is the audible result. Double precision
instead, has enough bits to interpolate in buffers of any length your
RAM will permit. Therefore, the maximum buffer length for [soundfiler~]
can be
extended when building Pd for doubles.
large indexes for fractional speed 0.3, in double precision |
The above mentioned examples suggest that double precision is needed
whenever you want to do something eccentric in Pd: very long buffers,
very low frequencies. In such cases it can make a difference for sound
quality indeed.
However, it is certainly not the case that double precision Pd
sounds more beautiful in general. If you listen to [osc~] at any
audible frequency, it is unlikely you'll perceive any difference
between the two precisions. With Fourier analysis, I could not detect
impurities above -100 dB in single precision [osc~]. The object reads
from a 512 point table, with linear interpolation, and this will only
suffer from imprecision at very low subsonic frequencies. The same
observation holds for other table-reading classes, and for the filters.
These classes are designed to function with sufficient precision under
'normal' conditions.
Therefore, it makes no sense to load an arbitrary patch in double
precision Pd and expect to hear better sound. Single precision Pd can
sound very beautiful as well. Problems with sound quality are more
often result of poorly controlled dynamics, than of precision issues or
other deficiencies in Pd objects.
While having limited effect on overall sound quality, double
precision can still be beneficial for the quality of Pd programming in
a different sense. Precision-robustness, combined with full numerical
control from within the GUI, will convert Pd from a pocket calculator
into a scientific calculator. Any user - artist or scientist - could
verify meticulous math stuff directly in Pd, rather than bc, Octave,
Grapher, C, or whatever inconvenient combination of these. As a
computer language, Pd could be used for math classes and assignments at
any level. Abstractions illustrating advanced math topics could be
made. A wider use of Pd as a serious math language could have an impact
on technical knowledge shared within Pd community. In my view, double
precision could affect sound quality in particular via this detour.
The rewritten phase-wrapping classes are optimized as much as possible.
They were profiled individually to make sure none of them suffers
performance loss. This work was done on an Intel Mac, with SSE
instructions automatically generated by the gcc compiler. Later, I
found that Pd compiled for Linux uses the FPU for all floating point
operations, while CPU type is not explicitely defined. I reckoned with
this to make Pd run smooth in both cases.
In order to check overall performance, I searched for pure vanilla
patches, employing a representative mix of objects. The following works
seem suitable for benchmarking: Chaosmonster1 from Martin
Brinkmann, and Cave of Creation
initiated by Hamster on the Pd forum. Both have an emphasis on
vcf~, osc~ and delaylines. As such, they test the modified
phase-wrapping method, the PD_BIGORSMALL definition which is used for
every sample written to a delay line, and speed of memory access for
floats and doubles in general. Since the new code must also perform
well
in a single precision build, comparisons includes three Pd builds:
original = official vanilla Pd 0.43-0
single = vanilla Pd 0.43-0, double-ready modified, but built in single
precision
double = vanilla Pd 0.43-0, double-ready modified, built in double
precision
The benchmarks in table 1 were done in the Shark
profiler application on a MacBook with Intel core2duo 2 GHz.
Numbers represent CPU time as a percentage of total CPU time. Note that
Pd dsp can only use one core at a time. Table 2 represents a Panasonic
Toughbook coreduo 1.83 GHz with Debian Squeeze installed. The results
were read from the top command and are less precise. Top shows
percentages per core by default, and in this case, by lack of Symmetric
Multi-Processing support, that happens to equal the percentage of total
CPU time. Cross-comparisons with figures in table 1 are therefore
irrelevant. Builds with and without SSE instructions are shown in table
2.
Cave of Creation (1 instance) Chaosmonster1 (10 instances) |
original 12.1 24.0 |
single 10.5 22.0 |
double 10.3 23.6 |
table 1: percentage of total cpu time (+/- 0.1) on MacBook core2duo
2 GHz with OSX 10.5
Cave of Creation (1 instance) Chaosmonster1 (10 instances) |
original 45 SSE 71 SSE |
single 35.5 SSE 68.5 SSE |
double 42 SSE 72 SSE |
table 2: percentage of total cpu time (+/- 1) on Panasonic Toughbook
coreduo 1.83 GHz with Debian Squeeze, no SMP support
From this (admittedly small) set of benchmark tests, I conclude that
single precision builds of double-ready Pd may be a tiny bit faster
than current Pd, with SSE more than with FPU instructions. Double
precision Pd is slower than singe precision compiled from the same C
code, but differences are so small that optimizations in the
double-ready code make the net result comparable to current Pd.
More benchmark testing is needed to derive final conclusions. The
rewritten code parts were kind of 'tuned' to the SSE instructions in
the developement process, and modified again when I noticed pathetic
performance on the FPU. Now it is still the question how it behaves on
PPC and ARM.
Expressed in number of bytes, or bits for that matter, double
precision requires two times more data processing than single
precision. How come the performance differences are still so small? It
seems that Pd does not profit from 'vectorization'. On Intel, simple
operations on floating point numbers can be
done in the 128 bit wide xmm registers. Instructions are available to
process two doubles or four floats simultaneously. An example is MOVAPD
- Move Aligned Packed Double-Precision Floating-Point Values. With
proper optimization options set, the compiler will try to implement
such vectorized instructions.
Conditions for compiler autovectorization are rather strict. Basically, data must be accessible as blocks with a size matching the corresponding instruction. For many of the SIMD (Single Instruction Multiple Data) instructions, 16 byte aligned data is required. Pd's signal block size is set in runtime, and in general the compiler knows little about data alignment in Pd. Other obstacles for autovectorization are: potential pointer aliasing, conditional statements, and loops with too complicated routines. All these obstacles are frequent in Pd.
When Pd is compiled with gcc flags -ftree-vectorize and
-ftree-vectorizer-verbose=5, the compiler comments indicate that not a
single performance-critical dsp loop in Pd can be autovectorized. Pure
Data operates largely unvectorized. Single precision Pd exploits ~1/4
of the processing capacity in the xmm registers, and double precision
Pd exploits ~1/2 of it. It would be a challenge to modify Pure Data in
a way to profit a bit more from autovectorization, provided this would
not affect functionality, flexibility and portability. But that could
be a major project in itself. The purpose of this digression on
vectorization was to understand the surprisingly small performance loss
of double precision Pd. In terms of CPU load, we pay too much for
single precision, and get double precision at the same price.
There seems to be no performance penalty for double precision Pd, but
in terms of memory space the price is of course high. How much audio
can be loaded in 1 gigabyte (1024^3 or 2^30 bytes) of memory? This
would be 2^28 samples for 32 bit float format and 2^27 samples for 64
bit. Assuming 44100 samplerate, this tranlates to about 100 minutes of
single precision audio and 50 minutes of double precision audio.
If Pd is built for double precision, there is no way to make an
exception for audio buffers. Considering the fact that most audio is
stored on disk as 16 bit integers, the sample format conversion to 64
bit floating point seems ridiculous precision overkill, no good for
nothing. Of course, recent computers do have a lot of memory. But in
order to use it all, you may need 64 bit address space, which is yet
another complicated topic.
All wouldn't be so problematic when Pd could load soundfiles in the
background with low priority. At the moment this is not the case. If
you want to play soundfiles in live performance, you need to load all
files into memory before the session, otherwise dropouts will ruin your
show. This annoyance, which is in itself unrelated to the precision
question, asks for a solution anyway.
It is possible to simultaneously run a single precision and a double
precision build of Pd, but they can not be mixed. Function symbols are
identical for both precisions, and fatal symbol collisions can happen
if externals of the wrong type happen to be in Pd's search path, and
loaded accidentally. The different type sizes will definitely lead to a
segmentation fault.
Here's the git repository where you can browse and get the double-ready
code:
https://github.com/pd-projects/pd-double
Test patches for pd-double are in this downloadable .zip file:
pd-double-testpatches.zip, 52 KB
Double precision builds of vanilla Pd are available for OSX and Linux
via this page:
http://puredata.info/downloads/pd-double
Some partly succeeded double precision builds of Pd-extended 0.43.1
are available from the autobuilds:
http://autobuild.puredata.info/auto-build
Windows builds are not yet available because of
double-precision-related bugs in some parts of Windows-specific code.
I've uploaded a video clip with single/double precision demo here.
The patch used in this video clip is included in the .zip file with
test patches, so you can try the demo if you've built or installed
pd-double.