mirror of
				https://github.com/cookiengineer/audacity
				synced 2025-11-04 16:14:00 +01:00 
			
		
		
		
	
		
			
				
	
	
		
			666 lines
		
	
	
		
			26 KiB
		
	
	
	
		
			XML
		
	
	
	
	
	
			
		
		
	
	
			666 lines
		
	
	
		
			26 KiB
		
	
	
	
		
			XML
		
	
	
	
	
	
<?xml version="1.0" standalone="no"?>
 | 
						|
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
 | 
						|
                "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd" [
 | 
						|
 | 
						|
]>
 | 
						|
 | 
						|
<section id="vorbis-spec-intro">
 | 
						|
<sectioninfo>
 | 
						|
<releaseinfo>
 | 
						|
 $Id: 01-introduction.xml,v 1.1.1.1 2004-11-13 16:51:21 mbrubeck Exp $
 | 
						|
</releaseinfo>
 | 
						|
</sectioninfo>
 | 
						|
<title>Introduction and Description</title>
 | 
						|
 | 
						|
<section>
 | 
						|
<title>Overview</title>
 | 
						|
 | 
						|
<para>
 | 
						|
This document provides a high level description of the Vorbis codec's
 | 
						|
construction.  A bit-by-bit specification appears beginning in 
 | 
						|
<xref linkend="vorbis-spec-codec"/>.
 | 
						|
The later sections assume a high-level
 | 
						|
understanding of the Vorbis decode process, which is 
 | 
						|
provided here.</para>
 | 
						|
 | 
						|
<section>
 | 
						|
<title>Application</title>
 | 
						|
<para>
 | 
						|
Vorbis is a general purpose perceptual audio CODEC intended to allow
 | 
						|
maximum encoder flexibility, thus allowing it to scale competitively
 | 
						|
over an exceptionally wide range of bitrates.  At the high
 | 
						|
quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits)
 | 
						|
it is in the same league as MPEG-2 and MPC.  Similarly, the 1.0
 | 
						|
encoder can encode high-quality CD and DAT rate stereo at below 48kbps
 | 
						|
without resampling to a lower rate.  Vorbis is also intended for
 | 
						|
lower and higher sample rates (from 8kHz telephony to 192kHz digital
 | 
						|
masters) and a range of channel representations (monaural,
 | 
						|
polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255
 | 
						|
discrete channels).
 | 
						|
</para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section>
 | 
						|
<title>Classification</title>
 | 
						|
<para>
 | 
						|
Vorbis I is a forward-adaptive monolithic transform CODEC based on the
 | 
						|
Modified Discrete Cosine Transform.  The codec is structured to allow
 | 
						|
addition of a hybrid wavelet filterbank in Vorbis II to offer better
 | 
						|
transient response and reproduction using a transform better suited to
 | 
						|
localized time events.
 | 
						|
</para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section>
 | 
						|
<title>Assumptions</title>
 | 
						|
 | 
						|
<para>
 | 
						|
The Vorbis CODEC design assumes a complex, psychoacoustically-aware
 | 
						|
encoder and simple, low-complexity decoder. Vorbis decode is
 | 
						|
computationally simpler than mp3, although it does require more
 | 
						|
working memory as Vorbis has no static probability model; the vector
 | 
						|
codebooks used in the first stage of decoding from the bitstream are
 | 
						|
packed in their entirety into the Vorbis bitstream headers. In
 | 
						|
packed form, these codebooks occupy only a few kilobytes; the extent
 | 
						|
to which they are pre-decoded into a cache is the dominant factor in
 | 
						|
decoder memory usage.
 | 
						|
</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Vorbis provides none of its own framing, synchronization or protection
 | 
						|
against errors; it is solely a method of accepting input audio,
 | 
						|
dividing it into individual frames and compressing these frames into
 | 
						|
raw, unformatted 'packets'. The decoder then accepts these raw
 | 
						|
packets in sequence, decodes them, synthesizes audio frames from
 | 
						|
them, and reassembles the frames into a facsimile of the original
 | 
						|
audio stream. Vorbis is a free-form variable bit rate (VBR) codec and packets have no
 | 
						|
minimum size, maximum size, or fixed/expected size.  Packets
 | 
						|
are designed that they may be truncated (or padded) and remain
 | 
						|
decodable; this is not to be considered an error condition and is used
 | 
						|
extensively in bitrate management in peeling.  Both the transport
 | 
						|
mechanism and decoder must allow that a packet may be any size, or
 | 
						|
end before or after packet decode expects.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Vorbis packets are thus intended to be used with a transport mechanism
 | 
						|
that provides free-form framing, sync, positioning and error correction
 | 
						|
in accordance with these design assumptions, such as Ogg (for file
 | 
						|
transport) or RTP (for network multicast).  For purposes of a few
 | 
						|
examples in this document, we will assume that Vorbis is to be
 | 
						|
embedded in an Ogg stream specifically, although this is by no means a
 | 
						|
requirement or fundamental assumption in the Vorbis design.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
The specification for embedding Vorbis into
 | 
						|
an Ogg transport stream is in <xref linkend="vorbis-over-ogg"/>.
 | 
						|
</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section>
 | 
						|
<title>Codec Setup and Probability Model</title>
 | 
						|
 | 
						|
<para>
 | 
						|
Vorbis' heritage is as a research CODEC and its current design
 | 
						|
reflects a desire to allow multiple decades of continuous encoder
 | 
						|
improvement before running out of room within the codec specification.
 | 
						|
For these reasons, configurable aspects of codec setup intentionally
 | 
						|
lean toward the extreme of forward adaptive.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
The single most controversial design decision in Vorbis (and the most
 | 
						|
unusual for a Vorbis developer to keep in mind) is that the entire
 | 
						|
probability model of the codec, the Huffman and VQ codebooks, is
 | 
						|
packed into the bitstream header along with extensive CODEC setup
 | 
						|
parameters (often several hundred fields).  This makes it impossible,
 | 
						|
as it would be with MPEG audio layers, to embed a simple frame type
 | 
						|
flag in each audio packet, or begin decode at any frame in the stream
 | 
						|
without having previously fetched the codec setup header.
 | 
						|
</para>
 | 
						|
 | 
						|
<note><para>
 | 
						|
Vorbis <emphasis>can</emphasis> initiate decode at any arbitrary packet within a
 | 
						|
bitstream so long as the codec has been initialized/setup with the
 | 
						|
setup headers.</para></note>
 | 
						|
 | 
						|
<para>
 | 
						|
Thus, Vorbis headers are both required for decode to begin and
 | 
						|
relatively large as bitstream headers go.  The header size is
 | 
						|
unbounded, although for streaming a rule-of-thumb of 4kB or less is
 | 
						|
recommended (and Xiph.Org's Vorbis encoder follows this suggestion).</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Our own design work indicates the primary liability of the
 | 
						|
required header is in mindshare; it is an unusual design and thus
 | 
						|
causes some amount of complaint among engineers as this runs against
 | 
						|
current design trends (and also points out limitations in some
 | 
						|
existing software/interface designs, such as Windows' ACM codec
 | 
						|
framework).  However, we find that it does not fundamentally limit
 | 
						|
Vorbis' suitable application space.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Format Specification</title>
 | 
						|
<para>
 | 
						|
The Vorbis format is well-defined by its decode specification; any
 | 
						|
encoder that produces packets that are correctly decoded by the
 | 
						|
reference Vorbis decoder described below may be considered a proper
 | 
						|
Vorbis encoder.  A decoder must faithfully and completely implement
 | 
						|
the specification defined below (except where noted) to be considered
 | 
						|
a proper Vorbis decoder.</para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Hardware Profile</title>
 | 
						|
<para>
 | 
						|
Although Vorbis decode is computationally simple, it may still run
 | 
						|
into specific limitations of an embedded design.  For this reason,
 | 
						|
embedded designs are allowed to deviate in limited ways from the
 | 
						|
'full' decode specification yet still be certified compliant.  These
 | 
						|
optional omissions are labelled in the spec where relevant.</para>
 | 
						|
</section>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section>
 | 
						|
<title>Decoder Configuration</title>
 | 
						|
 | 
						|
<para>
 | 
						|
Decoder setup consists of configuration of multiple, self-contained
 | 
						|
component abstractions that perform specific functions in the decode
 | 
						|
pipeline.  Each different component instance of a specific type is
 | 
						|
semantically interchangeable; decoder configuration consists both of
 | 
						|
internal component configuration, as well as arrangement of specific
 | 
						|
instances into a decode pipeline.  Componentry arrangement is roughly
 | 
						|
as follows:</para>
 | 
						|
 | 
						|
<mediaobject>
 | 
						|
<imageobject>
 | 
						|
 <imagedata fileref="components.png" format="PNG"/>
 | 
						|
</imageobject>
 | 
						|
<textobject>
 | 
						|
  <phrase>decoder pipeline configuration</phrase>
 | 
						|
</textobject>
 | 
						|
</mediaobject>
 | 
						|
 | 
						|
<section><title>Global Config</title>
 | 
						|
<para>
 | 
						|
Global codec configuration consists of a few audio related fields
 | 
						|
(sample rate, channels), Vorbis version (always '0' in Vorbis I),
 | 
						|
bitrate hints, and the lists of component instances.  All other
 | 
						|
configuration is in the context of specific components.</para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Mode</title>
 | 
						|
 | 
						|
<para>
 | 
						|
Each Vorbis frame is coded according to a master 'mode'.  A bitstream
 | 
						|
may use one or many modes.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
The mode mechanism is used to encode a frame according to one of
 | 
						|
multiple possible methods with the intention of choosing a method best
 | 
						|
suited to that frame.  Different modes are, e.g. how frame size
 | 
						|
is changed from frame to frame. The mode number of a frame serves as a
 | 
						|
top level configuration switch for all other specific aspects of frame
 | 
						|
decode.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
A 'mode' configuration consists of a frame size setting, window type
 | 
						|
(always 0, the Vorbis window, in Vorbis I), transform type (always
 | 
						|
type 0, the MDCT, in Vorbis I) and a mapping number.  The mapping
 | 
						|
number specifies which mapping configuration instance to use for
 | 
						|
low-level packet decode and synthesis.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Mapping</title>
 | 
						|
 | 
						|
<para>
 | 
						|
A mapping contains a channel coupling description and a list of
 | 
						|
'submaps' that bundle sets of channel vectors together for grouped
 | 
						|
encoding and decoding. These submaps are not references to external
 | 
						|
components; the submap list is internal and specific to a mapping.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
A 'submap' is a configuration/grouping that applies to a subset of
 | 
						|
floor and residue vectors within a mapping.  The submap functions as a
 | 
						|
last layer of indirection such that specific special floor or residue
 | 
						|
settings can be applied not only to all the vectors in a given mode,
 | 
						|
but also specific vectors in a specific mode.  Each submap specifies
 | 
						|
the proper floor and residue instance number to use for decoding that
 | 
						|
submap's spectral floor and spectral residue vectors.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
As an example:</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Assume a Vorbis stream that contains six channels in the standard 5.1
 | 
						|
format.  The sixth channel, as is normal in 5.1, is bass only.
 | 
						|
Therefore it would be wasteful to encode a full-spectrum version of it
 | 
						|
as with the other channels.  The submapping mechanism can be used to
 | 
						|
apply a full range floor and residue encoding to channels 0 through 4,
 | 
						|
and a bass-only representation to the bass channel, thus saving space.
 | 
						|
In this example, channels 0-4 belong to submap 0 (which indicates use
 | 
						|
of a full-range floor) and channel 5 belongs to submap 1, which uses a
 | 
						|
bass-only representation.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Floor</title>
 | 
						|
 | 
						|
<para>
 | 
						|
Vorbis encodes a spectral 'floor' vector for each PCM channel.  This
 | 
						|
vector is a low-resolution representation of the audio spectrum for
 | 
						|
the given channel in the current frame, generally used akin to a
 | 
						|
whitening filter.  It is named a 'floor' because the Xiph.Org
 | 
						|
reference encoder has historically used it as a unit-baseline for
 | 
						|
spectral resolution.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
A floor encoding may be of two types.  Floor 0 uses a packed LSP
 | 
						|
representation on a dB amplitude scale and Bark frequency scale.
 | 
						|
Floor 1 represents the curve as a piecewise linear interpolated
 | 
						|
representation on a dB amplitude scale and linear frequency scale.
 | 
						|
The two floors are semantically interchangeable in
 | 
						|
encoding/decoding. However, floor type 1 provides more stable
 | 
						|
inter-frame behavior, and so is the preferred choice in all
 | 
						|
coupled-stereo and high bitrate modes.  Floor 1 is also considerably
 | 
						|
less expensive to decode than floor 0.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Floor 0 is not to be considered deprecated, but it is of limited
 | 
						|
modern use.  No known Vorbis encoder past Xiph.org's own beta 4 makes
 | 
						|
use of floor 0.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
The values coded/decoded by a floor are both compactly formatted and
 | 
						|
make use of entropy coding to save space.  For this reason, a floor
 | 
						|
configuration generally refers to multiple codebooks in the codebook
 | 
						|
component list.  Entropy coding is thus provided as an abstraction,
 | 
						|
and each floor instance may choose from any and all available
 | 
						|
codebooks when coding/decoding.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Residue</title>
 | 
						|
<para>
 | 
						|
The spectral residue is the fine structure of the audio spectrum
 | 
						|
once the floor curve has been subtracted out.  In simplest terms, it
 | 
						|
is coded in the bitstream using cascaded (multi-pass) vector
 | 
						|
quantization according to one of three specific packing/coding
 | 
						|
algorithms numbered 0 through 2.  The packing algorithm details are
 | 
						|
configured by residue instance.  As with the floor components, the
 | 
						|
final VQ/entropy encoding is provided by external codebook instances
 | 
						|
and each residue instance may choose from any and all available
 | 
						|
codebooks.</para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Codebooks</title>
 | 
						|
 | 
						|
<para>
 | 
						|
Codebooks are a self-contained abstraction that perform entropy
 | 
						|
decoding and, optionally, use the entropy-decoded integer value as an
 | 
						|
offset into an index of output value vectors, returning the indicated
 | 
						|
vector of values.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
The entropy coding in a Vorbis I codebook is provided by a standard
 | 
						|
Huffman binary tree representation.  This tree is tightly packed using
 | 
						|
one of several methods, depending on whether codeword lengths are
 | 
						|
ordered or unordered, or the tree is sparse.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
The codebook vector index is similarly packed according to index
 | 
						|
characteristic.  Most commonly, the vector index is encoded as a
 | 
						|
single list of values of possible values that are then permuted into
 | 
						|
a list of n-dimensional rows (lattice VQ).</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
 | 
						|
<section>
 | 
						|
<title>High-level Decode Process</title>
 | 
						|
 | 
						|
<section>
 | 
						|
<title>Decode Setup</title> 
 | 
						|
 | 
						|
<para>
 | 
						|
Before decoding can begin, a decoder must initialize using the
 | 
						|
bitstream headers matching the stream to be decoded.  Vorbis uses
 | 
						|
three header packets; all are required, in-order, by this
 | 
						|
specification. Once set up, decode may begin at any audio packet
 | 
						|
belonging to the Vorbis stream. In Vorbis I, all packets after the
 | 
						|
three initial headers are audio packets. </para>
 | 
						|
 | 
						|
<para>
 | 
						|
The header packets are, in order, the identification
 | 
						|
header, the comments header, and the setup header.</para>
 | 
						|
 | 
						|
<section><title>Identification Header</title>
 | 
						|
<para>
 | 
						|
The identification header identifies the bitstream as Vorbis, Vorbis
 | 
						|
version, and the simple audio characteristics of the stream such as
 | 
						|
sample rate and number of channels.</para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Comment Header</title>
 | 
						|
<para>
 | 
						|
The comment header includes user text comments ("tags") and a vendor
 | 
						|
string for the application/library that produced the bitstream.  The
 | 
						|
encoding and proper use of the comment header is described in 
 | 
						|
<xref linkend="vorbis-spec-comment"/>.</para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Setup Header</title>
 | 
						|
<para>
 | 
						|
The setup header includes extensive CODEC setup information as well as
 | 
						|
the complete VQ and Huffman codebooks needed for decode.</para>
 | 
						|
</section>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>Decode Procedure</title>
 | 
						|
 | 
						|
<highlights>
 | 
						|
<para>
 | 
						|
The decoding and synthesis procedure for all audio packets is
 | 
						|
fundamentally the same.
 | 
						|
<orderedlist>
 | 
						|
<listitem><simpara>decode packet type flag</simpara></listitem>
 | 
						|
<listitem><simpara>decode mode number</simpara></listitem>
 | 
						|
<listitem><simpara>decode window shape (long windows only)</simpara></listitem>
 | 
						|
<listitem><simpara>decode floor</simpara></listitem>
 | 
						|
<listitem><simpara>decode residue into residue vectors</simpara></listitem>
 | 
						|
<listitem><simpara>inverse channel coupling of residue vectors</simpara></listitem>
 | 
						|
<listitem><simpara>generate floor curve from decoded floor data</simpara></listitem>
 | 
						|
<listitem><simpara>compute dot product of floor and residue, producing audio spectrum vector</simpara></listitem>
 | 
						|
<listitem><simpara>inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I</simpara></listitem>
 | 
						|
<listitem><simpara>overlap/add left-hand output of transform with right-hand output of previous frame</simpara></listitem>
 | 
						|
<listitem><simpara>store right hand-data from transform of current frame for future lapping</simpara></listitem>
 | 
						|
<listitem><simpara>if not first frame, return results of overlap/add as audio result of current frame</simpara></listitem>
 | 
						|
</orderedlist>
 | 
						|
</para>
 | 
						|
</highlights>
 | 
						|
 | 
						|
<para>
 | 
						|
Note that clever rearrangement of the synthesis arithmetic is
 | 
						|
possible; as an example, one can take advantage of symmetries in the
 | 
						|
MDCT to store the right-hand transform data of a partial MDCT for a
 | 
						|
50% inter-frame buffer space savings, and then complete the transform
 | 
						|
later before overlap/add with the next frame.  This optimization
 | 
						|
produces entirely equivalent output and is naturally perfectly legal.
 | 
						|
The decoder must be <emphasis>entirely mathematically equivalent</emphasis> to the
 | 
						|
specification, it need not be a literal semantic implementation.</para>
 | 
						|
 | 
						|
<section><title>Packet type decode</title> 
 | 
						|
 | 
						|
<para>
 | 
						|
Vorbis I uses four packet types. The first three packet types mark each
 | 
						|
of the three Vorbis headers described above. The fourth packet type
 | 
						|
marks an audio packet. All other packet types are reserved; packets
 | 
						|
marked with a reserved type should be ignored.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Following the three header packets, all packets in a Vorbis I stream
 | 
						|
are audio.  The first step of audio packet decode is to read and
 | 
						|
verify the packet type; <emphasis>a non-audio packet when audio is expected
 | 
						|
indicates stream corruption or a non-compliant stream. The decoder
 | 
						|
must ignore the packet and not attempt decoding it to
 | 
						|
audio</emphasis>.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
 | 
						|
<section><title>Mode decode</title>
 | 
						|
<para>
 | 
						|
Vorbis allows an encoder to set up multiple, numbered packet 'modes',
 | 
						|
as described earlier, all of which may be used in a given Vorbis
 | 
						|
stream. The mode is encoded as an integer used as a direct offset into
 | 
						|
the mode instance index. </para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section id="vorbis-spec-window">
 | 
						|
<title>Window shape decode (long windows only)</title>
 | 
						|
 | 
						|
<para>
 | 
						|
Vorbis frames may be one of two PCM sample sizes specified during
 | 
						|
codec setup.  In Vorbis I, legal frame sizes are powers of two from 64
 | 
						|
to 8192 samples.  Aside from coupling, Vorbis handles channels as
 | 
						|
independent vectors and these frame sizes are in samples per channel.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Vorbis uses an overlapping transform, namely the MDCT, to blend one
 | 
						|
frame into the next, avoiding most inter-frame block boundary
 | 
						|
artifacts.  The MDCT output of one frame is windowed according to MDCT
 | 
						|
requirements, overlapped 50% with the output of the previous frame and
 | 
						|
added.  The window shape assures seamless reconstruction.  </para>
 | 
						|
 | 
						|
<para>
 | 
						|
This is easy to visualize in the case of equal sized-windows:</para>
 | 
						|
 | 
						|
<mediaobject>
 | 
						|
<imageobject>
 | 
						|
 <imagedata fileref="window1.png" format="PNG"/>
 | 
						|
</imageobject>
 | 
						|
<textobject>
 | 
						|
 <phrase>overlap of two equal-sized windows</phrase>
 | 
						|
</textobject>
 | 
						|
</mediaobject>
 | 
						|
 | 
						|
<para>
 | 
						|
And slightly more complex in the case of overlapping unequal sized
 | 
						|
windows:</para>
 | 
						|
 | 
						|
<mediaobject>
 | 
						|
<imageobject> 
 | 
						|
 <imagedata fileref="window2.png" format="PNG"/>
 | 
						|
</imageobject>
 | 
						|
<textobject>
 | 
						|
 <phrase>overlap of a long and a short window</phrase>
 | 
						|
</textobject>
 | 
						|
</mediaobject>
 | 
						|
 | 
						|
<para>
 | 
						|
In the unequal-sized window case, the window shape of the long window
 | 
						|
must be modified for seamless lapping as above.  It is possible to
 | 
						|
correctly infer window shape to be applied to the current window from
 | 
						|
knowing the sizes of the current, previous and next window.  It is
 | 
						|
legal for a decoder to use this method. However, in the case of a long
 | 
						|
window (short windows require no modification), Vorbis also codes two
 | 
						|
flag bits to specify pre- and post- window shape.  Although not
 | 
						|
strictly necessary for function, this minor redundancy allows a packet
 | 
						|
to be fully decoded to the point of lapping entirely independently of
 | 
						|
any other packet, allowing easier abstraction of decode layers as well
 | 
						|
as allowing a greater level of easy parallelism in encode and
 | 
						|
decode.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
A description of valid window functions for use with an inverse MDCT
 | 
						|
can be found in the paper 
 | 
						|
<citetitle pubwork="article">
 | 
						|
<ulink url="http://www.iocon.com/resource/docs/ps/eusipco_corrected.ps">
 | 
						|
The use of multirate filter banks for coding of high quality digital
 | 
						|
audio</ulink></citetitle>, by T. Sporer, K. Brandenburg and B. Edler.  Vorbis windows
 | 
						|
all use the slope function 
 | 
						|
  <inlineequation>
 | 
						|
 | 
						|
    <alt>y=sin(.5*PI*sin^2((x+.5)/n*pi))</alt>
 | 
						|
    <inlinemediaobject>
 | 
						|
     <textobject>
 | 
						|
      <phrase>$y = \sin(.5*\pi \, \sin^2((x+.5)/n*\pi))$</phrase>
 | 
						|
     </textobject>
 | 
						|
    </inlinemediaobject>
 | 
						|
  </inlineequation>.
 | 
						|
</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>floor decode</title>
 | 
						|
<para>
 | 
						|
Each floor is encoded/decoded in channel order, however each floor
 | 
						|
belongs to a 'submap' that specifies which floor configuration to
 | 
						|
use.  All floors are decoded before residue decode begins.</para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>residue decode</title> 
 | 
						|
 | 
						|
<para>
 | 
						|
Although the number of residue vectors equals the number of channels,
 | 
						|
channel coupling may mean that the raw residue vectors extracted
 | 
						|
during decode do not map directly to specific channels.  When channel
 | 
						|
coupling is in use, some vectors will correspond to coupled magnitude
 | 
						|
or angle.  The coupling relationships are described in the codec setup
 | 
						|
and may differ from frame to frame, due to different mode numbers.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Vorbis codes residue vectors in groups by submap; the coding is done
 | 
						|
in submap order from submap 0 through n-1.  This differs from floors
 | 
						|
which are coded using a configuration provided by submap number, but
 | 
						|
are coded individually in channel order.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>inverse channel coupling</title>
 | 
						|
 | 
						|
<para>
 | 
						|
A detailed discussion of stereo in the Vorbis codec can be found in
 | 
						|
the document <ulink url="stereo.html"><citetitle>Stereo Channel Coupling in the
 | 
						|
Vorbis CODEC</citetitle></ulink>.  Vorbis is not limited to only stereo coupling, but
 | 
						|
the stereo document also gives a good overview of the generic coupling
 | 
						|
mechanism.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Vorbis coupling applies to pairs of residue vectors at a time;
 | 
						|
decoupling is done in-place a pair at a time in the order and using
 | 
						|
the vectors specified in the current mapping configuration.  The
 | 
						|
decoupling operation is the same for all pairs, converting square
 | 
						|
polar representation (where one vector is magnitude and the second
 | 
						|
angle) back to Cartesian representation.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
After decoupling, in order, each pair of vectors on the coupling list, 
 | 
						|
the resulting residue vectors represent the fine spectral detail
 | 
						|
of each output channel.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>generate floor curve</title>
 | 
						|
 | 
						|
<para>
 | 
						|
The decoder may choose to generate the floor curve at any appropriate
 | 
						|
time.  It is reasonable to generate the output curve when the floor
 | 
						|
data is decoded from the raw packet, or it can be generated after
 | 
						|
inverse coupling and applied to the spectral residue directly,
 | 
						|
combining generation and the dot product into one step and eliminating
 | 
						|
some working space.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Both floor 0 and floor 1 generate a linear-range, linear-domain output
 | 
						|
vector to be multiplied (dot product) by the linear-range,
 | 
						|
linear-domain spectral residue.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>compute floor/residue dot product</title>
 | 
						|
 | 
						|
<para>
 | 
						|
This step is straightforward; for each output channel, the decoder
 | 
						|
multiplies the floor curve and residue vectors element by element,
 | 
						|
producing the finished audio spectrum of each channel.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
One point is worth mentioning about this dot product; a common mistake
 | 
						|
in a fixed point implementation might be to assume that a 32 bit
 | 
						|
fixed-point representation for floor and residue and direct
 | 
						|
multiplication of the vectors is sufficient for acceptable spectral
 | 
						|
depth in all cases because it happens to mostly work with the current
 | 
						|
Xiph.Org reference encoder.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
However, floor vector values can span ~140dB (~24 bits unsigned), and
 | 
						|
the audio spectrum vector should represent a minimum of 120dB (~21
 | 
						|
bits with sign), even when output is to a 16 bit PCM device.  For the
 | 
						|
residue vector to represent full scale if the floor is nailed to
 | 
						|
-140dB, it must be able to span 0 to +140dB.  For the residue vector
 | 
						|
to reach full scale if the floor is nailed at 0dB, it must be able to
 | 
						|
represent -140dB to +0dB.  Thus, in order to handle full range
 | 
						|
dynamics, a residue vector may span -140dB to +140dB entirely within
 | 
						|
spec.  A 280dB range is approximately 48 bits with sign; thus the
 | 
						|
residue vector must be able to represent a 48 bit range and the dot
 | 
						|
product must be able to handle an effective 48 bit times 24 bit
 | 
						|
multiplication.  This range may be achieved using large (64 bit or
 | 
						|
larger) integers, or implementing a movable binary point
 | 
						|
representation.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>inverse monolithic transform (MDCT)</title>
 | 
						|
 | 
						|
<para>
 | 
						|
The audio spectrum is converted back into time domain PCM audio via an
 | 
						|
inverse Modified Discrete Cosine Transform (MDCT).  A detailed
 | 
						|
description of the MDCT is available in the paper <ulink
 | 
						|
url="http://www.iocon.com/resource/docs/ps/eusipco_corrected.ps"><citetitle pubwork="article">The use of multirate filter banks for coding of high quality digital
 | 
						|
audio</citetitle></ulink>, by T. Sporer, K. Brandenburg and B. Edler.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Note that the PCM produced directly from the MDCT is not yet finished
 | 
						|
audio; it must be lapped with surrounding frames using an appropriate
 | 
						|
window (such as the Vorbis window) before the MDCT can be considered
 | 
						|
orthogonal.</para>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>overlap/add data</title>
 | 
						|
<para>
 | 
						|
Windowed MDCT output is overlapped and added with the right hand data
 | 
						|
of the previous window such that the 3/4 point of the previous window
 | 
						|
is aligned with the 1/4 point of the current window (as illustrated in
 | 
						|
the window overlap diagram). At this point, the audio data between the
 | 
						|
center of the previous frame and the center of the current frame is
 | 
						|
now finished and ready to be returned. </para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>cache right hand data</title>
 | 
						|
<para>
 | 
						|
The decoder must cache the right hand portion of the current frame to
 | 
						|
be lapped with the left hand portion of the next frame.
 | 
						|
</para>
 | 
						|
</section>
 | 
						|
 | 
						|
<section><title>return finished audio data</title>
 | 
						|
 | 
						|
<para>
 | 
						|
The overlapped portion produced from overlapping the previous and
 | 
						|
current frame data is finished data to be returned by the decoder.
 | 
						|
This data spans from the center of the previous window to the center
 | 
						|
of the current window.  In the case of same-sized windows, the amount
 | 
						|
of data to return is one-half block consisting of and only of the
 | 
						|
overlapped portions. When overlapping a short and long window, much of
 | 
						|
the returned range is not actually overlap.  This does not damage
 | 
						|
transform orthogonality.  Pay attention however to returning the
 | 
						|
correct data range; the amount of data to be returned is:
 | 
						|
 | 
						|
<programlisting>
 | 
						|
window_blocksize(previous_window)/4+window_blocksize(current_window)/4
 | 
						|
</programlisting>
 | 
						|
 | 
						|
from the center of the previous window to the center of the current
 | 
						|
window.</para>
 | 
						|
 | 
						|
<para>
 | 
						|
Data is not returned from the first frame; it must be used to 'prime'
 | 
						|
the decode engine.  The encoder accounts for this priming when
 | 
						|
calculating PCM offsets; after the first frame, the proper PCM output
 | 
						|
offset is '0' (as no data has been returned yet).</para>
 | 
						|
</section>
 | 
						|
</section>
 | 
						|
 | 
						|
</section>
 | 
						|
 | 
						|
</section>
 | 
						|
<!-- end Vorbis I specification introduction and description -->
 | 
						|
 |