mirror of
				https://github.com/cookiengineer/audacity
				synced 2025-10-31 22:23:54 +01:00 
			
		
		
		
	
		
			
				
	
	
		
			420 lines
		
	
	
		
			16 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
			
		
		
	
	
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 | |
| <html>
 | |
| <head>
 | |
| 
 | |
| <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/>
 | |
| <title>Ogg Vorbis Documentation</title>
 | |
| 
 | |
| <style type="text/css">
 | |
| body {
 | |
|   margin: 0 18px 0 18px;
 | |
|   padding-bottom: 30px;
 | |
|   font-family: Verdana, Arial, Helvetica, sans-serif;
 | |
|   color: #333333;
 | |
|   font-size: .8em;
 | |
| }
 | |
| 
 | |
| a {
 | |
|   color: #3366cc;
 | |
| }
 | |
| 
 | |
| img {
 | |
|   border: 0;
 | |
| }
 | |
| 
 | |
| #xiphlogo {
 | |
|   margin: 30px 0 16px 0;
 | |
| }
 | |
| 
 | |
| #content p {
 | |
|   line-height: 1.4;
 | |
| }
 | |
| 
 | |
| h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a {
 | |
|   font-weight: bold;
 | |
|   color: #ff9900;
 | |
|   margin: 1.3em 0 8px 0;
 | |
| }
 | |
| 
 | |
| h1 {
 | |
|   font-size: 1.3em;
 | |
| }
 | |
| 
 | |
| h2 {
 | |
|   font-size: 1.2em;
 | |
| }
 | |
| 
 | |
| h3 {
 | |
|   font-size: 1.1em;
 | |
| }
 | |
| 
 | |
| li {
 | |
|   line-height: 1.4;
 | |
| }
 | |
| 
 | |
| #copyright {
 | |
|   margin-top: 30px;
 | |
|   line-height: 1.5em;
 | |
|   text-align: center;
 | |
|   font-size: .8em;
 | |
|   color: #888888;
 | |
|   clear: both;
 | |
| }
 | |
| </style>
 | |
| 
 | |
| </head>
 | |
| 
 | |
| <body>
 | |
| 
 | |
| <div id="xiphlogo">
 | |
|   <a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.Org"/></a>
 | |
| </div>
 | |
| 
 | |
| <h1>Ogg Vorbis stereo-specific channel coupling discussion</h1>
 | |
| 
 | |
| <h2>Abstract</h2>
 | |
| 
 | |
| <p>The Vorbis audio CODEC provides channel coupling
 | |
| mechanisms designed to reduce effective bitrate by both eliminating
 | |
| interchannel redundancy and eliminating stereo image information
 | |
| labeled inaudible or undesirable according to spatial psychoacoustic
 | |
| models. This document describes the mechanical coupling
 | |
| mechanisms available within the Vorbis specification, as well as the
 | |
| specific stereo coupling models used by the reference
 | |
| <tt>libvorbis</tt> codec provided by xiph.org.</p>
 | |
| 
 | |
| <h2>Mechanisms</h2>
 | |
| 
 | |
| <p>In encoder release beta 4 and earlier, Vorbis supported multiple
 | |
| channel encoding, but the channels were encoded entirely separately
 | |
| with no cross-analysis or redundancy elimination between channels.
 | |
| This multichannel strategy is very similar to mp3's <em>dual
 | |
| stereo</em> mode and Vorbis uses the same name for its analogous
 | |
| uncoupled multichannel modes.</p>
 | |
| 
 | |
| <p>However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and
 | |
| later implement, a coupled channel strategy. Vorbis has two specific
 | |
| mechanisms that may be used alone or in conjunction to implement
 | |
| channel coupling. The first is <em>channel interleaving</em> via
 | |
| residue backend type 2, and the second is <em>square polar
 | |
| mapping</em>. These two general mechanisms are particularly well
 | |
| suited to coupling due to the structure of Vorbis encoding, as we'll
 | |
| explore below, and using both we can implement both totally
 | |
| <em>lossless stereo image coupling</em> [bit-for-bit decode-identical
 | |
| to uncoupled modes], as well as various lossy models that seek to
 | |
| eliminate inaudible or unimportant aspects of the stereo image in
 | |
| order to reduce bitrate. The exact coupling implementation is
 | |
| generalized to allow the encoder a great deal of flexibility in
 | |
| implementation of a stereo or surround model without requiring any
 | |
| significant complexity increase over the combinatorially simpler
 | |
| mid/side joint stereo of mp3 and other current audio codecs.</p>
 | |
| 
 | |
| <p>A particular Vorbis bitstream may apply channel coupling directly to
 | |
| more than a pair of channels; polar mapping is hierarchical such that
 | |
| polar coupling may be extrapolated to an arbitrary number of channels
 | |
| and is not restricted to only stereo, quadraphonics, ambisonics or 5.1
 | |
| surround. However, the scope of this document restricts itself to the
 | |
| stereo coupling case.</p>
 | |
| 
 | |
| <a name="sqpm"></a>
 | |
| <h3>Square Polar Mapping</h3>
 | |
| 
 | |
| <h4>maximal correlation</h4>
 | |
|  
 | |
| <p>Recall that the basic structure of a Vorbis I stream first generates
 | |
| from input audio a spectral 'floor' function that serves as an
 | |
| MDCT-domain whitening filter. This floor is meant to represent the
 | |
| rough envelope of the frequency spectrum, using whatever metric the
 | |
| encoder cares to define. This floor is subtracted from the log
 | |
| frequency spectrum, effectively normalizing the spectrum by frequency.
 | |
| Each input channel is associated with a unique floor function.</p>
 | |
| 
 | |
| <p>The basic idea behind any stereo coupling is that the left and right
 | |
| channels usually correlate. This correlation is even stronger if one
 | |
| first accounts for energy differences in any given frequency band
 | |
| across left and right; think for example of individual instruments
 | |
| mixed into different portions of the stereo image, or a stereo
 | |
| recording with a dominant feature not perfectly in the center. The
 | |
| floor functions, each specific to a channel, provide the perfect means
 | |
| of normalizing left and right energies across the spectrum to maximize
 | |
| correlation before coupling. This feature of the Vorbis format is not
 | |
| a convenient accident.</p>
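<p>A rough sketch of the normalization idea, not the libvorbis floor code: if both channels carry the same spectral shape at different levels, dividing each channel by its own floor (which corresponds to subtracting in the log domain) leaves matching residues. The function name <tt>whiten</tt> and the use of linear-amplitude floats are assumptions made for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: divide a linear-amplitude spectrum by its floor
   curve. Dividing in the linear domain corresponds to subtracting in the
   log domain. Every floor value must be strictly positive. */
void whiten(const float *spectrum, const float *floor_curve,
            float *residue, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(floor_curve[i] > 0.0f);
        residue[i] = spectrum[i] / floor_curve[i];
    }
}
```

<p>For example, a left channel of {4, -2, 8} with a flat floor of 4 and a right channel of {1, -0.5, 2} with a flat floor of 1 both whiten to the same residue, {1, -0.5, 2}.</p>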
 | |
| 
 | |
| <p>Because we strive to maximally correlate the left and right channels
 | |
| and generally succeed in doing so, left and right residue is typically
 | |
| nearly identical. We could use channel interleaving (discussed below)
 | |
| alone to efficiently remove the redundancy between the left and right
 | |
| channels as a side effect of entropy encoding, but a polar
 | |
| representation gives benefits when left/right correlation is
 | |
| strong.</p>
 | |
| 
 | |
| <h4>point and diffuse imaging</h4>
 | |
| 
 | |
| <p>The first advantage of a polar representation is that it effectively
 | |
| separates the spatial audio information into a 'point image'
 | |
| (magnitude) at a given frequency and located somewhere in the sound
 | |
| field, and a 'diffuse image' (angle) that fills a large amount of
 | |
| space simultaneously. Even if we preserve only the magnitude (point)
 | |
| data, a detailed and carefully chosen floor function in each channel
 | |
| provides us with a free, fine-grained, frequency relative intensity
 | |
| stereo*. Angle information represents diffuse sound fields, such as
 | |
| reverberation that fills the entire space simultaneously.</p>
 | |
| 
 | |
| <p>*<em>Because the Vorbis model supports a number of different possible
 | |
| stereo models and these models may be mixed, we do not use the term
 | |
| 'intensity stereo' when talking about Vorbis; instead we use the terms
 | |
| 'point stereo', 'phase stereo' and subcategories of each.</em></p>
 | |
| 
 | |
| <p>The majority of a stereo image is representable by polar magnitude
 | |
| alone, as strong sounds tend to be produced at near-point sources;
 | |
| even non-diffuse, fast, sharp echoes track very accurately using
 | |
| magnitude representation almost alone (for those experimenting with
 | |
| Vorbis tuning, this strategy works much better with the precise,
 | |
| piecewise control of floor 1; the continuous approximation of floor 0
 | |
| results in unstable imaging). Reverberation and diffuse sounds tend
 | |
| to contain less energy and be psychoacoustically dominated by the
 | |
| point sources embedded in them. Thus, we again tend to concentrate
 | |
| more represented energy into a predictably smaller number of values.
 | |
| Separating representation of point and diffuse imaging also allows us
 | |
| to model and manipulate point and diffuse qualities separately.</p>
 | |
| 
 | |
| <h4>controlling bit leakage and symbol crosstalk</h4>
 | |
| 
 | |
| <p>Because polar
 | |
| representation concentrates represented energy into fewer large
 | |
| values, we reduce bit 'leakage' during cascading (multistage VQ
 | |
| encoding) as a secondary benefit. A single large, monolithic VQ
 | |
| codebook is more efficient than a cascaded book due to entropy
 | |
| 'crosstalk' among symbols between different stages of a multistage cascade.
 | |
| Polar representation is a way of further concentrating entropy into
 | |
| predictable locations so that codebook design can take steps to
 | |
| improve multistage codebook efficiency. It also allows us to cascade
 | |
| various elements of the stereo image independently.</p>
 | |
| 
 | |
| <h4>eliminating trigonometry and rounding</h4>
 | |
| 
 | |
| <p>Rounding and computational complexity are potential problems with a
 | |
| polar representation. As our encoding process involves quantization,
 | |
| mixing a polar representation and quantization makes it potentially
 | |
| impossible, depending on implementation, to construct a coupled stereo
 | |
| mechanism that results in bit-identical decompressed output compared
 | |
| to an uncoupled encoding should the encoder desire it.</p>
 | |
| 
 | |
| <p>Vorbis uses a mapping that preserves the most useful qualities of
 | |
| polar representation, relies only on addition/subtraction (during
 | |
| decode; high quality encoding still requires some trig), and makes it
 | |
| trivial before or after quantization to represent an angle/magnitude
 | |
| through a one-to-one mapping from possible left/right value
 | |
| permutations. We do this by basing our polar representation on the
 | |
| unit square rather than the unit-circle.</p>
 | |
| 
 | |
| <p>Given a magnitude and angle, we recover left and right using the
 | |
| following function (note that A/B may be left/right or right/left
 | |
| depending on the coupling definition used by the encoder):</p>
 | |
| 
 | |
| <pre>
 | |
|       /* positive magnitude */
 | |
|       if(magnitude>0){
 | |
|         if(angle>0){
 | |
|           A=magnitude;
 | |
|           B=magnitude-angle;
 | |
|         }else{
 | |
|           B=magnitude;
 | |
|           A=magnitude+angle;
 | |
|         }
 | |
|       }else{  /* zero or negative magnitude: the sign applied to angle flips */
 | |
|         if(angle>0){
 | |
|           A=magnitude;
 | |
|           B=magnitude+angle;
 | |
|         }else{
 | |
|           B=magnitude;
 | |
|           A=magnitude-angle;
 | |
|         }
 | |
|       }
 | |
| </pre>
 | |
| 
 | |
| <p>The function is antisymmetric for positive and negative magnitudes in
 | |
| order to eliminate a redundant value when quantizing. For example, if
 | |
| we're quantizing to integer values, we can visualize a magnitude of 5
 | |
| and an angle of -2 as follows:</p>
 | |
| 
 | |
| <p><img src="squarepolar.png" alt="square polar"/></p>
 | |
| 
 | |
| <p>This representation loses or replicates no values; if A and B each
 | |
| range over the integers -5 through 5, the number of possible Cartesian
 | |
| permutations is 121. Represented in square polar notation, the
 | |
| possible values are:</p>
 | |
| 
 | |
| <pre>
 | |
|  0, 0
 | |
| 
 | |
| -1,-2  -1,-1  -1, 0  -1, 1
 | |
| 
 | |
|  1,-2   1,-1   1, 0   1, 1
 | |
| 
 | |
| -2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3  
 | |
| 
 | |
|  2,-4   2,-3   ... following the pattern ...
 | |
| 
 | |
|  ...   5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9
 | |
| 
 | |
| </pre>
 | |
| 
 | |
| <p>...for a grand total of 121 possible values, the same number as in
 | |
| Cartesian representation (note that, for example, <tt>5,-10</tt> is
 | |
| the same as <tt>-5,10</tt>, so there's no reason to represent
 | |
| both. 2,10 cannot happen, and there's no reason to account for it.)
 | |
| It's also obvious that this mapping is exactly reversible.</p>
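<p>The count and the reversibility can be checked mechanically. The following C sketch transcribes the decode function given above and pairs it with an inverse, <tt>sqpolar_encode</tt>, which is not given in this document; it is derived here from the decode rules and should be read as an assumption, not as the libvorbis implementation. All function names are hypothetical.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Decode, following the function given in the text: recover A and B
   from a square polar magnitude/angle pair. */
void sqpolar_decode(int magnitude, int angle, int *A, int *B)
{
    if (magnitude > 0) {
        if (angle > 0) { *A = magnitude; *B = magnitude - angle; }
        else           { *B = magnitude; *A = magnitude + angle; }
    } else {
        if (angle > 0) { *A = magnitude; *B = magnitude + angle; }
        else           { *B = magnitude; *A = magnitude - angle; }
    }
}

/* Assumed inverse of sqpolar_decode (derived, not taken from the text):
   the value with the larger absolute value becomes the magnitude, and
   ties go to B. */
void sqpolar_encode(int A, int B, int *magnitude, int *angle)
{
    if (abs(A) > abs(B)) {
        *magnitude = A;
        *angle = (A > 0) ? A - B : B - A;   /* always > 0 in this branch */
    } else {
        *magnitude = B;
        *angle = (B > 0) ? A - B : B - A;   /* always <= 0 in this branch */
    }
}

/* Count the A/B pairs in [-limit, limit] that survive an encode/decode
   round trip unchanged. */
int sqpolar_roundtrip_count(int limit)
{
    int count = 0;
    for (int a = -limit; a <= limit; a++) {
        for (int b = -limit; b <= limit; b++) {
            int m, ang, a2, b2;
            sqpolar_encode(a, b, &m, &ang);
            /* the angle never exceeds twice the magnitude in size */
            assert(ang >= -2 * abs(m) && ang <= 2 * abs(m));
            sqpolar_decode(m, ang, &a2, &b2);
            if (a2 == a && b2 == b)
                count++;
        }
    }
    return count;
}
```

<p>With these definitions, <tt>sqpolar_roundtrip_count(5)</tt> evaluates to 121, and decoding magnitude 5 with angle -2 yields A=3, B=5.</p>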
 | |
| 
 | |
| <h3>Channel interleaving</h3>
 | |
| 
 | |
| <p>We can remap an A/B vector using polar mapping into a magnitude/angle
 | |
| vector, and it's clear that, in general, this concentrates energy in
 | |
| the magnitude vector and reduces the amount of information to encode
 | |
| in the angle vector. Encoding these vectors independently with
 | |
| residue backend #0 or residue backend #1 will result in bitrate
 | |
| savings. However, there are still implicit correlations between the
 | |
| magnitude and angle vectors. The most obvious is that the amplitude
 | |
| of the angle is bounded by its corresponding magnitude value.</p>
 | |
| 
 | |
| <p>Entropy coding the results, then, further benefits from the entropy
 | |
| model being able to compress magnitude and angle simultaneously. For
 | |
| this reason, Vorbis implements residue backend #2 which pre-interleaves
 | |
| a number of input vectors (in the stereo case, two, A and B) into a
 | |
| single output vector (with the elements in the order of
 | |
| A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus
 | |
| each vector to be coded by the vector quantization backend consists of
 | |
| matching magnitude and angle values.</p>
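<p>The interleaving step itself is simple. A minimal C sketch follows, assuming two equal-length integer vectors; the names <tt>interleave2</tt> and <tt>deinterleave2</tt> are hypothetical, and the real residue backend also handles partitioning and codebook selection, which are omitted here.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Interleave two equal-length vectors as a_0, b_0, a_1, b_1, ...
   `out` must have room for 2*n elements. */
void interleave2(const int *a, const int *b, int *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Inverse operation: split an interleaved vector of 2*n elements back
   into its two component vectors of n elements each. */
void deinterleave2(const int *in, int *a, int *b, size_t n)
{
    assert(in != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = in[2 * i];
        b[i] = in[2 * i + 1];
    }
}
```

<p>When the inputs are the magnitude and angle vectors, each adjacent pair in the output holds one magnitude and its matching angle, so a vector codebook of even dimension sees them together.</p>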
 | |
| 
 | |
| <p>The astute reader, at this point, will notice that in the theoretical
 | |
| case in which we can use monolithic codebooks of arbitrarily large
 | |
| size, we can directly interleave and encode left and right without
 | |
| polar mapping; in fact, the polar mapping does not appear to lend any
 | |
| benefit whatsoever to the efficiency of the entropy coding. Indeed,
 | |
| it is perfectly possible and reasonable to build a Vorbis encoder that
 | |
| dispenses with polar mapping entirely and merely interleaves the
 | |
| channels. Libvorbis-based encoders may configure such an encoding and
 | |
| it will work as intended.</p>
 | |
| 
 | |
| <p>However, when we leave the ideal/theoretical domain, we notice that
 | |
| polar mapping does give additional practical benefits, as discussed in
 | |
| the above section on polar mapping and summarized again here:</p>
 | |
| 
 | |
| <ul>
 | |
| <li>Polar mapping aids in controlling entropy 'leakage' between stages
 | |
| of a cascaded codebook.</li>
 | |
| <li>Polar mapping separates the stereo image
 | |
| into point and diffuse components which may be analyzed and handled
 | |
| differently.</li>
 | |
| </ul>
 | |
| 
 | |
| <h2>Stereo Models</h2>
 | |
| 
 | |
| <h3>Dual Stereo</h3>
 | |
| 
 | |
| <p>Dual stereo refers to stereo encoding where the channels are entirely
 | |
| separate; they are analyzed and encoded as entirely distinct entities.
 | |
| This terminology is familiar from mp3.</p>
 | |
| 
 | |
| <h3>Lossless Stereo</h3>
 | |
| 
 | |
| <p>Using polar mapping and/or channel interleaving, it's possible to
 | |
| couple Vorbis channels losslessly, that is, construct a stereo
 | |
| coupling encoding that both saves space and decodes
 | |
| bit-identically to dual stereo. OggEnc 1.0 and later uses this
 | |
| mode in all high-bitrate encoding.</p>
 | |
| 
 | |
| <p>Overall, this stereo mode is overkill; however, it offers a safe
 | |
| alternative to users concerned about the slightest possible
 | |
| degradation to the stereo image or archival quality audio.</p>
 | |
| 
 | |
| <h3>Phase Stereo</h3>
 | |
| 
 | |
| <p>Phase stereo is the least aggressive means of gracefully dropping
 | |
| resolution from the stereo image; it affects only diffuse imaging.</p>
 | |
| 
 | |
| <p>It's often quoted that the human ear is deaf to signal phase above
 | |
| about 4kHz; this is nearly true and a passable rule of thumb, but it
 | |
| can be demonstrated that even an average user can tell the difference
 | |
| between high frequency in-phase and out-of-phase noise. Obviously
 | |
| then, the statement is not entirely true. However, it's also the case
 | |
| that one must resort to a demonstration nearly this extreme before
 | |
| finding a counterexample.</p>
 | |
| 
 | |
| <p>'Phase stereo' is simply a more aggressive quantization of the polar
 | |
| angle vector; above 4kHz it's generally quite safe to quantize noise
 | |
| and noisy elements to only a handful of allowed phases, or to thin the
 | |
| phase with respect to the magnitude. The phases of high amplitude
 | |
| pure tones may or may not be preserved more carefully (they are
 | |
| relatively rare and L/R tend to be in phase, so there is generally
 | |
| little reason not to spend a few more bits on them).</p>
 | |
| 
 | |
| <h4>example: eight phase stereo</h4>
 | |
| 
 | |
| <p>Vorbis may implement phase stereo coupling by preserving the entirety
 | |
| of the magnitude vector (essential to fine amplitude and energy
 | |
| resolution overall) and quantizing the angle vector to one of only
 | |
| four possible values. Given that the magnitude vector may be positive
 | |
| or negative, this results in left and right phase having eight
 | |
| possible permutations, thus 'eight phase stereo':</p>
 | |
| 
 | |
| <p><img src="eightphase.png" alt="eight phase"/></p>
 | |
| 
 | |
| <p>Left and right may be in phase (positive or negative), the most common
 | |
| case by far, or out of phase by 90 or 180 degrees.</p>
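<p>One way to picture this in square polar terms is sketched below in C. It assumes that the four permitted angle values for a magnitude m are -2|m|, -|m|, 0 and +|m|; this choice is an assumption made for illustration and is not stated in this document. The function names are hypothetical, and the decode routine is a transcription of the square polar function given earlier.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Square polar decode, transcribed from the function in the text. */
void square_polar_to_ab(int magnitude, int angle, int *A, int *B)
{
    if (magnitude > 0) {
        if (angle > 0) { *A = magnitude; *B = magnitude - angle; }
        else           { *B = magnitude; *A = magnitude + angle; }
    } else {
        if (angle > 0) { *A = magnitude; *B = magnitude + angle; }
        else           { *B = magnitude; *A = magnitude - angle; }
    }
}

/* Snap an angle to the nearest of four assumed levels for this
   magnitude: -2|m|, -|m|, 0, +|m|. On a tie the more negative level
   wins, because levels are scanned in ascending order and only a
   strictly closer level replaces the current best. */
int quantize_angle_four_levels(int magnitude, int angle)
{
    int m = abs(magnitude);
    int levels[4] = { -2 * m, -m, 0, m };
    int best = levels[0];
    for (int i = 1; i < 4; i++) {
        if (abs(angle - levels[i]) < abs(angle - best))
            best = levels[i];
    }
    return best;
}
```

<p>Under this assumption, a magnitude of 4 decodes to (4,4), (4,0), (0,4) or (-4,4), and a magnitude of -4 to (-4,-4), (-4,0), (0,-4) or (4,-4): eight distinct A/B combinations in all.</p>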
 | |
| 
 | |
| <h4>example: four phase stereo</h4>
 | |
| 
 | |
| <p>Similarly, four phase stereo takes the quantization one step further;
 | |
| it allows only in-phase and 180 degree out-of-phase signals:</p>
 | |
| 
 | |
| <p><img src="fourphase.png" alt="four phase"/></p>
 | |
| 
 | |
| <h3>Point Stereo</h3>
 | |
| 
 | |
| <p>Point stereo eliminates the possibility of out-of-phase signal
 | |
| entirely. Any diffuse quality to a sound source tends to collapse
 | |
| inward to a point somewhere within the stereo image. A practical
 | |
| example would be balanced reverberations within a large, live space;
 | |
| normally the sound is diffuse and soft, giving a sonic impression of
 | |
| volume. In point-stereo, the reverberations would still exist, but
 | |
| sound fairly firmly centered within the image (assuming the
 | |
| reverberation was centered overall; if the reverberation is stronger
 | |
| to the left, then the point of localization in point stereo would be
 | |
| to the left). This effect is most noticeable at low and mid
 | |
| frequencies and using headphones (which grant perfect stereo
 | |
| separation). Point stereo is a graceful but generally easy to
 | |
| detect degradation to the sound quality and is thus used in frequency
 | |
| ranges where it is least noticeable.</p>
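<p>On the decode side, point stereo can be sketched as follows. This is a simplified illustration, not the libvorbis code: it assumes the angle is zero in every bin, so that A and B both equal the magnitude, and it assumes floor curves expressed as linear-domain scale factors. The function name <tt>point_stereo_reconstruct</tt> is hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Point stereo reconstruction under the stated assumptions: with a zero
   angle the two channels share one residue (the magnitude), and each
   channel's output is that residue scaled by the channel's own floor. */
void point_stereo_reconstruct(const float *magnitude,
                              const float *floor_left,
                              const float *floor_right,
                              float *left, float *right, size_t n)
{
    assert(magnitude != NULL && floor_left != NULL && floor_right != NULL);
    assert(left != NULL && right != NULL);
    for (size_t i = 0; i < n; i++) {
        left[i]  = floor_left[i]  * magnitude[i];
        right[i] = floor_right[i] * magnitude[i];
    }
}
```

<p>With positive floors, left and right always carry the same sign in each bin, so they can differ in level but never in phase; the per-channel floors alone position the sound within the image.</p>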
 | |
| 
 | |
| <h3>Mixed Stereo</h3>
 | |
| 
 | |
| <p>Mixed stereo is the simultaneous use of more than one of the above
 | |
| stereo encoding models, generally using more aggressive modes in
 | |
| higher frequencies, lower amplitudes or 'nearly' in-phase sound.</p>
 | |
| 
 | |
| <p>It is also the case that near-DC frequencies should be encoded using
 | |
| lossless coupling to avoid frame blocking artifacts.</p>
 | |
| 
 | |
| <h3>Vorbis Stereo Modes</h3>
 | |
| 
 | |
| <p>Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes
 | |
| constructed out of lossless and point stereo. Phase stereo was used
 | |
| in the rc2 encoder, but is not currently used for simplicity's sake. It
 | |
| will likely be re-added to the stereo model in the future.</p>
 | |
| 
 | |
| <div id="copyright">
 | |
|   The Xiph Fish Logo is a
 | |
|   trademark (™) of Xiph.Org.<br/>
 | |
| 
 | |
|   These pages © 1994 - 2005 Xiph.Org. All rights reserved.
 | |
| </div>
 | |
| 
 | |
| </body>
 | |
| </html>
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 |