Date: Wed, 22 Sep 2004 06:42:22 +0000
From: "Adam M. Costello"
To: png-list@ccrc.wustl.edu
Subject: [png-list] Animated Network Graphics (ANG), draft 3
Message-ID: <20040922064222.GA4009~@nicemice.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.6+20040722i
Sender: owner-png-list@ccrc.wustl.edu
Precedence: bulk
Reply-To: png-list@ccrc.wustl.edu

Animated Network Graphics (ANG), draft 3 (2004-Sep-21-Tue)
Adam M. Costello
http://www.nicemice.net/amc/

Changes from draft 2

    bit depth --> sample depth
    revised PLTE/tRNS/hIST inheritance
    staging offsets can change
    redesigned action_flags
    source & destination regions may overlap
    ticks_per_second = 0 means all frame durations are infinite
    impossible-to-achieve frame durations may be fudged
    optional ANG signature before PNG datastream
    PLTE forbidden in non-indexed-color PNGanim substreams
    revised start & end of PNGanim datastream grammar
    optimized the notes on compositing

Acknowledgements

    Many good ideas have been taken from png-list.

Contents

    Conceptual model
    Encoding
    Datastream tagging
    External control
    Comparison with APNG 0.4
    Note on compositing

Conceptual model

Recall that a Portable Network Graphics (PNG) datastream encodes a single reference image. An Animated Network Graphics (ANG) datastream encodes a main reference image (intended to be viewed alone as a still image) plus a sequence of reference images that are building blocks (not frames) of an animation. The main reference image can optionally be used as a building block in the animation.
The building blocks are represented losslessly, as are the instructions for assembling them into frames, but the resulting frames are not represented losslessly unless the encoder opts to use the same sample depth for all building blocks and avoids compositing partially transparent building blocks over other building blocks. Decoders are allowed to introduce inexactness when compositing partially transparent pixels over other pixels and when performing sample depth scaling.

A viewer (that is, a decoder that displays images) shows either the main image or the animation, depending on the viewer's capabilities, the user's preferences, external signals, and the limitations of the medium (for example, paper does not support animation very well).

Each frame of the animation is shown as a still image in front of a decoder-supplied background. Frames that are not fully opaque allow the background to show through. The frames are shown in succession, creating the illusion of motion if their durations are sufficiently brief. The sequence of frames can be shown more than once (looped). After all iterations of the animation have been shown, a still image is shown, which can be either the last frame, the main image, or a fully transparent image, as declared in the animation header at the start.

All meta-data belonging to the main image applies not only to the main image but to the entire animation as well, with exceptions only for IHDR, PLTE, hIST, and tRNS:

  * The main image's IHDR information applies only to the main image, except that the width and height apply also to every frame, but not to any building blocks other than the main image.

  * If the main image is indexed-color: The main image's PLTE information applies only to the main image and to every indexed-color building block that does not provide its own PLTE information.
    The main image's tRNS information (if present) applies only to the main image and to every indexed-color building block that provides neither its own PLTE nor tRNS information. The main image's hIST information (if present) applies only to the main image.

  * If the main image is not indexed-color: The main image's PLTE/hIST information (if present) is a suggested palette and applies to the entire animation. The main image's tRNS information (if present) applies only to the main image and to every building block that has the same color type as the main image and does not provide its own tRNS information.

While decoding a PNG animation, a decoder needs to maintain the following mutable state:

  * An image buffer at least as large as the main image. There is a rectangular region of this buffer, with dimensions matching the main image, that can be displayed as a frame; this region is the staging area for constructing the next frame. The rest of the buffer cannot be displayed, and serves as scratch space. The staging area can move, but only immediately after it is displayed; therefore the decoder always knows where the next frame will come from as it is being constructed. The minimal set of channels needed for the buffer is declared in the animation header, though a decoder is free to use more channels (for example, a decoder could use RGBA even if grayscale without alpha would suffice). If the decoder is extracting lossless frames then the buffer needs to have the same sample depth as the building blocks. If the decoder is merely displaying the animation then the buffer can have any sample depth, or indeed any representation. The buffer supports the following operations: read a rectangular region of the buffer, write a rectangular region of the buffer, composite a rectangular region over the buffer, display the staging area as a frame.

  * The frame duration, which determines how long a frame is displayed.
    This value can be changed just before a frame is displayed, in order to affect the duration of that frame, but changing this variable has no effect on the duration of the frame that is already on the display. A duration of zero means infinity, which can be useful in conjunction with external signals (see section "External control"). It is deliberately impossible to assign an actual zero duration, because that would indicate that the frame is not intended to be displayed at all, in which case it does not deserve to be called a frame; it is merely an intermediate state along the way to constructing a frame.

  * An iteration counter, which counts the number of times the animation has been played, so that the decoder knows when to stop looping.

The animation is represented as a sequence of actions of the following kinds: decode a compressed datastream into or onto the buffer, copy a rectangular region of the buffer into or onto a same-sized region of the buffer, alter the frame duration, display the staging area as a frame, redefine the staging area.

The frame durations are ideal targets, but sometimes decoders will be unable to achieve the targets because of forces beyond their control. For example, there might not be sufficient computing resources to keep up, or the data might be streaming in too slowly, or the decoder might be under constraints to keep the animation synchronized with something else. Decoders may shorten or lengthen the frame durations when necessary.

Encoding

The encoding of the model described above is largely independent of (and separable from) the model itself. Conjecture: It is possible to encode the model using a restricted profile of MNG. Another approach is to encode the model in a way that allows ANG-unaware PNG decoders to decode the main image. There are various ways to achieve that, one of which is described here.

Bit numbering convention: Bit 0 means the least significant bit. Bit i means the bit with value 2^i.
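The mutable decoder state listed in the conceptual model (buffer, staging area, frame duration, iteration counter) can be sketched in Python. The class and method names here are illustrative only, not part of the draft, and the buffer uses plain RGBA tuples even though the draft allows any representation with sufficient channels:

```python
class AnimState:
    """Sketch of the mutable state a PNG-animation decoder maintains."""

    def __init__(self, buf_w, buf_h, main_w, main_h, stage_x, stage_y):
        # Buffer of RGBA pixels; initialized to fully opaque black,
        # as the draft specifies at the start of each iteration.
        self.buffer = [[(0, 0, 0, 255)] * buf_w for _ in range(buf_h)]
        self.main_w, self.main_h = main_w, main_h
        self.stage_x, self.stage_y = stage_x, stage_y  # staging offsets
        self.frame_duration = 1    # in ticks; 0 means infinite
        self.iterations_shown = 0  # counts completed iterations

    def staging_area(self):
        """Read the displayable region (the next frame's pixels)."""
        return [row[self.stage_x:self.stage_x + self.main_w]
                for row in self.buffer[self.stage_y:self.stage_y + self.main_h]]

    def move_staging(self, x, y):
        # Allowed only immediately after a frame is displayed; the
        # staging area must stay inside the buffer.
        assert x + self.main_w <= len(self.buffer[0])
        assert y + self.main_h <= len(self.buffer)
        self.stage_x, self.stage_y = x, y
```

A real decoder would add the read/write/composite operations on rectangular regions; this sketch only shows the state and its invariants.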
An Animated Network Graphics (ANG) datastream is a simple container format with the following grammar:

    ANG_datastream ::= ANG_signature? PNG_datastream PNGanim_datastream

    ANG_signature = 138 65 78 71

    PNGanim_datastream ::= PNGanim_signature
                           ( AnIC PNGanim_building_block? )*
                           AnIE

    PNGanim_signature = 0 0 0 0 137 80 78 71 97 110 105 109

    PNGanim_building_block ::= IHDR PLTE? tRNS? IDAT+

The ANG signature is the byte 138 followed by "ANG" in ASCII. It does not replicate the line-ending corruption detection features of the PNG signature, because the PNG signature is still present and fulfills that role. Notice that the ANG signature can be omitted. See section "Datastream tagging".

The PNG datastream inside an ANG datastream is a regular PNG datastream, except that it must contain exactly one anIH (animation header) chunk, which must appear before IDAT. The PNGanim datastream cannot be interpreted in the absence of the preceding PNG datastream.

Within the PNGanim datastream, no other chunk types are allowed besides the ones listed in the grammar rules above, unless they are defined specifically for PNGanim datastreams. Regular PNG chunks other than those listed in the grammar rule belong in the PNG datastream, not the PNGanim datastream. Of course decoders must be prepared to encounter and ignore unknown ancillary chunks anywhere between the PNGanim signature and the AnIE chunk.

The PNGanim signature is the bytes 0 0 0 0 137 followed by "PNGanim" in ASCII. There should not be any decoders looking for chunks after IEND, but in case there are, they will find what appears to be a zero-length chunk with a syntactically invalid chunk name and an incorrect CRC. The PNGanim signature does not replicate the line-ending corruption detection features of the PNG signature, because a PNGanim datastream is always preceded by a PNG datastream with a PNG signature to fulfill that role.
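The container grammar above can be sketched as a small splitter. This is a simplification, not from the draft: it locates the PNGanim signature with a byte search, whereas a robust decoder would walk the PNG chunks to IEND (the signature bytes could in principle occur inside compressed IDAT data):

```python
# Signatures defined by the grammar above.
ANG_SIGNATURE = bytes([138]) + b"ANG"                    # 138 65 78 71
PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])
PNGANIM_SIGNATURE = bytes([0, 0, 0, 0, 137]) + b"PNGanim"

def split_ang(data: bytes):
    """Split an ANG datastream into (PNG_datastream, PNGanim_datastream).
    The ANG signature is optional; the PNG signature is mandatory."""
    if data.startswith(ANG_SIGNATURE):
        data = data[len(ANG_SIGNATURE):]
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("no PNG signature")
    # Simplified: a real decoder would parse chunks up to IEND instead
    # of searching for the PNGanim signature bytes.
    pos = data.find(PNGANIM_SIGNATURE)
    if pos < 0:
        raise ValueError("no PNGanim datastream")
    return data[:pos], data[pos:]
```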
A PNGanim datastream would be useless without an accompanying PNG datastream to serve as the main image and carry the anIH chunk.

A PNGanim building block closely resembles a PNG datastream, but is not one. It does not begin with the PNG signature, does not end with IEND, and might lack PLTE even when the color type is indexed-color, and even when tRNS is present. However, its similarity to a PNG datastream is designed to facilitate reuse of PNG decoding libraries. Within a PNGanim building block, PLTE may be present only if the color type in IHDR is indexed-color (there are no suggested palettes), and tRNS (if present) must have the form appropriate for the color type, as in PNG.

AnIE is an empty chunk to mark the end of the PNGanim datastream. A chunk type other than IEND is used so that PNG-animation-aware programs encountering a file beginning with the PNG signature can quickly seek to the end of the file and determine whether a PNGanim datastream is present, rather than having to scan forward to see whether anIH appears before IDAT.

anIH contains the following fields:

    buffer_width (4 bytes, unsigned)
    buffer_height (4 bytes, unsigned)

    staging_X_offset (4 bytes, unsigned)
    staging_Y_offset (4 bytes, unsigned)

        These must satisfy:

            staging_X_offset + main image width <= buffer_width
            staging_Y_offset + main image height <= buffer_height

    ticks_per_second (4 bytes, unsigned)

        Defines the time unit for frame durations. If this is zero, all frame durations (including zero) are treated as infinite.

    num_iterations (4 bytes, unsigned)

        Zero means infinity.

    promises (1 byte)

        Declares limitations that the encoder has imposed on the building blocks and actions in the animation. Each set bit constitutes a promise as follows:

        bit 0 ==> Color is not used. Explicit RGB channels never appear, and PLTE entries always satisfy R=G=B.
        bit 1 ==> Full alpha is not used. Explicit alpha channels never appear, and every tRNS entry is 0 or 255.
        bit 2 ==> tRNS never appears.
        bit 3 ==> No partially transparent pixels are ever composited over the buffer. If bit 1 is set this limitation is automatically satisfied.
        bit 4 ==> All building blocks have the same sample depth.
        bit 5 ==> The staging offsets never change.

        The remaining two bits are reserved; encoders must not set them (because they cannot know what they would be promising).

    after_image (1 byte)

        To be displayed in the "after" state:

            0 ==> fully transparent
            1 ==> last frame
            2 ==> main image

    iteration_start_action (variable-length, optional)

        At most one action can appear here, using the same syntax as the actions that appear in AnIC (see below). If an action is present, its write_buffer flag must be set, and its read_buffer flag must be unset (the main image serves as the source).

The main image serves as a building block of the animation iff an action is present in anIH. If so, it is bound by the promises, same as any other building block.

Encoders are encouraged to set the appropriate bits of the promises field, but an unset bit is always valid, because it is not a negative promise, but merely the lack of a promise. Decoders may ignore any of the bits (including the reserved ones) or they can use the bits to enable optimized decoding routines. If the conditions for bits 3 and 4 are both satisfied, the frames are represented losslessly and can be recovered exactly.

There is no palette or tRNS data associated with the buffer. Palette and tRNS data is used for decoding compressed datastreams, not for displaying or interpreting the buffers.

At the start of each iteration of the animation the following four actions are performed:

  * The entire buffer is initialized to fully opaque black. [[ Fully transparent was my first inclination, but if the encoder promises not to use transparency of any kind, we ought to let the decoder use a buffer that lacks transparency information. ]]

  * The staging area offsets are initialized to their values from anIH.
  * The frame duration is initialized to 1 tick.

  * The action in anIH, if present, is performed. This action must be performed after the other three.

Actions to be performed during an iteration of the animation are indicated by AnIC chunks. An AnIC chunk contains a sequence of one or more actions. An action contains:

    action_flags (1 byte)

        bit 0: reserved (nonzero is a fatal error)
        bit 1: write_buffer (boolean)
        bit 2: read_buffer (boolean)
        bit 3: composite_over (boolean)
        bit 4: change_frame_duration (boolean)
        bit 5: display_staging_area (boolean)
        bit 6: change_staging_offsets (boolean)
        bit 7: reserved (nonzero is a fatal error)

        If write_buffer is unset then read_buffer and composite_over must also be unset. If display_staging_area is unset then change_frame_duration and change_staging_offsets must also be unset. Violations of these rules are fatal errors.

        The flags write_buffer, read_buffer, change_frame_duration, and change_staging_offsets indicate that additional fields are present (see below). Those fields appear in bit-number order. The flags write_buffer, change_frame_duration, display_staging_area, and change_staging_offsets indicate that actions are to be performed (the other two flags, read_buffer and composite_over, are parameters of the write_buffer action). The actions are performed in bit-number order.

    destination_X_offset (1, 2, or 4 bytes, unsigned)
    destination_Y_offset (1, 2, or 4 bytes, unsigned)

        These are present iff write_buffer is set. They must satisfy:

            destination_X_offset + source_width <= buffer_width
            destination_Y_offset + source_height <= buffer_height

        If read_buffer is unset then source_width and source_height are the width and height from the IHDR in the PNGanim building block following the AnIC chunk.

    source_width (1, 2, or 4 bytes, unsigned)
    source_X_offset (1, 2, or 4 bytes, unsigned)
    source_height (1, 2, or 4 bytes, unsigned)
    source_Y_offset (1, 2, or 4 bytes, unsigned)

        These are present iff read_buffer is set.
        They must satisfy:

            source_X_offset + source_width <= buffer_width
            source_Y_offset + source_height <= buffer_height

    frame_duration (4 bytes, signed)

        Present iff change_frame_duration is set.

    staging_X_offset (1, 2, or 4 bytes, unsigned)
    staging_Y_offset (1, 2, or 4 bytes, unsigned)

        These are present iff change_staging_offsets is set. They must satisfy:

            staging_X_offset + main image width <= buffer_width
            staging_Y_offset + main image height <= buffer_height

Within AnIC chunks, all horizontal dimensions are 1 byte if buffer_width <= 255, 2 bytes if buffer_width <= 65535, 4 bytes otherwise. Vertical dimensions depend on buffer_height analogously. Recall that buffer_width and buffer_height are set once in anIH and never change.

The write_buffer flag indicates whether a buffer modification is to be performed. When it is set, the read_buffer flag indicates whether the source of the operation is the buffer or the stream, and the composite_over flag indicates whether the source replaces the destination or is composited over it.

When the source is a region of the buffer, it may overlap the destination region. Decoders must handle this case correctly and not overwrite source pixels before reading them. Two scan orders will suffice. For example, suppose ForwardScan scans pixels from left to right within rows and scans rows from top to bottom within regions, while ReverseScan scans pixels from right to left within rows and scans rows from bottom to top within regions. ForwardScan will work correctly whenever (destY,destX) <= (srcY,srcX) lexicographically, and ReverseScan will work correctly whenever (srcY,srcX) <= (destY,destX).

When write_buffer is set and read_buffer is unset, the action must be the last action in the AnIC chunk, and the AnIC chunk must be followed (before the next critical chunk) by a PNGanim building block to be used as the source of the buffer modification.
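The two-scan-order rule for overlapping copies can be sketched as follows. The buffer is assumed (for illustration only) to be a list of rows of pixel values:

```python
def copy_region(buf, src_x, src_y, dst_x, dst_y, w, h):
    """Copy a w*h region of buf onto another region of buf, choosing a
    scan order that never overwrites a source pixel before reading it."""
    if (dst_y, dst_x) <= (src_y, src_x):
        # ForwardScan: left to right within rows, top to bottom.
        rows, cols = range(h), range(w)
    else:
        # ReverseScan: right to left within rows, bottom to top.
        rows, cols = range(h - 1, -1, -1), range(w - 1, -1, -1)
    for dy in rows:
        for dx in cols:
            buf[dst_y + dy][dst_x + dx] = buf[src_y + dy][src_x + dx]
```

For example, shifting a row of pixels one position to the right overlaps itself; the lexicographic comparison selects ReverseScan, so the rightmost pixels are moved before their source positions are overwritten.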
If display_staging_area is set, the staging area is displayed as a frame (after any buffer modification has been performed). The number of frames in the animation equals the number of actions with set display_staging_area flags. If display_staging_area is set, change_frame_duration may be set in order to make the duration of this frame different from the duration of the previous frame, and change_staging_offsets may be set in order to make the staging area of the next frame different from the staging area of this frame.

A PNGanim building block is decoded like a PNG datastream, except that PLTE may be absent when color_type is 3 (indexed-color) if the main PNG datastream contains a PLTE chunk, and PLTE and tRNS are inherited from the main PNG datastream under certain conditions specified above in section "Conceptual model".

On encountering any fatal error while decoding the animation, decoders must cease decoding the animation, and must fall back to displaying the main image if possible.

Datastream tagging

When backward compatibility with ANG-unaware decoders is not needed, the ANG signature should be present, the preferred media type is "video/ang", and the preferred file extension is ".ang".

When necessary for interoperating with animation-unaware decoders, the ANG signature may be omitted, and the media type "image/png" may be used. The file extension ".png" can also be used, but should be avoided if at all possible, because it will tend to confuse users.

The new media type is necessary for applying greater control over fallback. For example, if you want ANG-unaware web browsers to fall back to animated GIF rather than still PNG, you need to do something at the markup level (such as an object element with the GIF as fallback content), or something analogous at the HTTP layer using content negotiation. Without a new media type, this control would not be possible; web browsers that support PNG but not ANG would always fall back to still PNG.
The ANG signature is useful when failure is preferable to fallback, such as when the animation is considered essential and showing a still image would be misleading.

External control

[[ This is a rough idea that should be reworked after a review of relevant standards, or removed. ]]

Animation viewers can optionally support five external control signals: "stop", "start", "suspend", "resume", and "advance". A viewer that supports these signals is free to map them to the user interface in any way (possibly using fewer buttons than signals) and to provide additional controls (for example, to override the frame durations, or to play in reverse if all the previous frames have been cached). This control model is intended as a common set of primitives that could be useful for scripting (especially in web pages). Bindings to scripting languages and/or the document object model would need to be standardized.

The "suspend" and "resume" signals simply set and clear a "suspended" flag. The other three signals cause state transitions between the three states "main", "during", and "after". It would be useful to be able to externally control the initial state of the animation: main, during, or during-suspended. The states and signals behave as follows:

    "main"

        The main image is displayed.

        "stop" ==> "main" (no effect)
        "start" ==> "during" (first frame, first iteration)
        "advance" ==> equivalent to "suspend" then "start"

    "during"

        The animation frames are displayed in sequence, looping from the last to the first as declared in the animation header. If the "suspended" flag is unset, each frame endures for its allotted duration or until a signal arrives. If the "suspended" flag is set, each frame endures until a signal arrives.

        "stop" ==> "main"
        "start" ==> "during" (restart from 1st frame & iteration)
        "advance" ==> "during" or "after" (see below)

        The "advance" signal terminates the currently displayed frame immediately (before its duration has fully elapsed).
        When the last frame of the last iteration is terminated (either by the passage of time or by an "advance" signal), there is a transition to "after".

    "after"

        A still image is displayed, either the last frame, the main image, or a fully transparent image, as declared in the animation header.

        "stop" ==> "main"
        "start" ==> "during" (first frame, first iteration)
        "advance" ==> "after" (no effect)

Frames are numbered starting with 1. Frame numbering is standardized because it might be useful for scripting. The image displayed in the "after" state does not count as a frame. The number of frames is independent of the number of iterations (looping merely displays the same frames again, it does not duplicate the frames). "Frame 0" is a convenient misnomer for the main image (which is not part of the animation, and might or might not be the same as frame 1, the first frame of the animation).

Comparison with APNG 0.4

APNG takes the view that an APNG is a PNG. This draft takes the view that an ANG is not a PNG, an ANG contains a PNG, and an ANG can be disguised as a PNG when necessary (by omitting the ANG signature).

This draft puts the animation data in a separate datastream after the PNG datastream, rather than embedding it in chunks inside the PNG datastream. This is simpler because sequence numbers are not necessary, and there is no chunk-in-chunk embedding. It also avoids the controversy about PNG being a single-image format. The one disadvantage is that animations will not immediately be usable inside container files that don't mark the end of the embedded PNG, but rather depend on IEND to mark the end. Encoders of such containers (if properly written) will either refuse to insert an ANG or will insert only the main PNG datastream and discard the rest. This is probably an uncommon case, however, not a deal breaker. The killer app is ANG files, for which the two-datastream approach should work fine.
The chunk structure of the PNGanim datastream is identical to the chunk structure of the datastream encapsulated in APNG's aDAT.

Both APNG and this draft use a single image buffer that supports both pixel replacement and composite-over operations (APNG calls it a canvas). This draft relaxes the requirement that the buffer have the same dimensions as the main image and the frames, and instead allows the buffer to be larger, with the extra area available for use as scratch space. This relaxation is not expected to add much complexity, since there's still just one buffer, and the location of the frame staging area is still known while a frame is being constructed.

In APNG, the only way to modify the buffer is by decoding an embedded image. This draft adds the ability to copy pixels from one part of the buffer into or onto another part. This is not difficult, and has great potential to reduce the size of animations, especially in combination with the scratch space.

Both drafts allow the same kinds of drawing operations: replace and composite-over. The APNG explanation of how to do composite-over is fatally incomplete. This draft adapts the explanation from the MNG spec.

APNG neglects to specify the initial state of the buffer (canvas), so we don't know how a decoder should handle APNG_RENDER_OP_DISPOSE_PREVIOUS in the first fCTL chunk.

APNG does not specify that the buffer is reinitialized at the start of each iteration. If it's not, then every iteration can produce a different set of frames (for example, if a one-frame animation composites a partially transparent image over the buffer, then the buffer will get more and more opaque on each iteration), and the decoder lacks the option of caching the frames to avoid reconstructing them on each iteration. In this draft, the buffer is initialized at the start of each iteration, so the decoder always has the option of caching the frames.
This draft adds mutable state variables that are absent in APNG: the frame duration and the staging offsets. In APNG the staging offsets are always implicitly zero, and the frame duration appears explicitly for every frame. In this draft, the frame duration and staging offsets are remembered by the decoder; they appear explicitly only when they change. In most animations, almost all frames will have the same duration.

This draft expresses offsets using integers of a size appropriate for the total buffer dimensions. This wouldn't be worth the trouble in APNG, where the offsets are dwarfed by the embedded image that invariably follows, but in this draft, where a single AnIC chunk can contain a long list of intra-buffer operations, the savings from shorter offsets and elided frame durations can add up.

APNG uses a few predefined "macros" for manipulating the image buffer, called disposal methods. This draft uses more primitive operations that can be used to simulate the macros or to do more custom things.

The meaning of APNG_RENDER_OP_SKIP_FRAME is not entirely clear from the APNG 0.4 spec, but discussions have revealed that it was *not* intended to be equivalent to display_staging_area (with reversed polarity). Whereas an unset display_staging_area means modify the buffer but don't display the result (yet), a set APNG_RENDER_OP_SKIP_FRAME means don't modify the buffer at all and don't even decode the embedded image.

PLTE/tRNS inheritance is slightly different. In APNG, tRNS can be inherited by an indexed-color image even if PLTE is not inherited, which is unlikely to be useful, given how tightly coupled the two chunks are. An empty tRNS chunk would need to be included to avoid inheriting tRNS. In this draft, tRNS is not inherited by an indexed-color image if PLTE is not inherited.
This draft uses 4 bytes for the numerator and denominator of the frame duration rather than 2 bytes as in APNG, and uses one global denominator rather than a per-frame denominator. A frame delay of zero means "as quickly as possible" in APNG, but means infinity in this draft.

In both drafts, zero iterations means loop infinitely.

Both drafts use their respective usual mechanisms for incorporating the main image into the animation, same as any other image, although this draft puts it at the end of the animation header chunk rather than in its own chunk, because AnIC is a critical chunk and cannot be used in the PNG datastream.

APNG always finishes an animation by holding the last frame. This draft adds two alternatives: become fully transparent, or display the main image. The last option is expected to be useful quite often.

This draft adds a field in the header where encoders can (but need not) inform decoders of potential optimizations (which decoders can ignore).

Both drafts specify that PNG meta-data generally applies to the whole animation, not just the main image or any one frame or embedded image.

In APNG there is a one-to-one correspondence between images and frames, and all images (even skipped images) get frame numbers, starting with 0 for the main image. In this draft, "frame 0" refers to the main image, and frames (not building blocks) are numbered starting at 1, so "frame 1" always refers to the first frame, even if it is identical to the main image.

This draft provides recommendations for the use of both old and new media types, signatures, and file extensions.

This draft provides an optional external control model to support scripting.

Note on compositing

(Adapted from the MNG 1.0 specification, section 11.3.)

The PNG specification gives a good explanation of how to composite a partially transparent image over an opaque image, but things get more complicated when both images are partially transparent.
Pixels in PNG and PNGanim are represented using gamma-encoded RGB (or gray) samples along with a linear alpha value. Alpha processing can be performed only on linear samples. This section assumes that R, G, B, and A values have all been converted to real numbers in the range [0..1], and that any gamma encoding has been undone.

For a top pixel (Rt,Gt,Bt,At) and a bottom pixel (Rb,Gb,Bb,Ab), the composite pixel (Rc,Gc,Bc,Ac) is given by:

    Ac = 1 - (1 - At)(1 - Ab)

    if (Ac != 0) then
        s = At / Ac
        t = (1 - At) Ab / Ac
    else
        s = 0.0
        t = 1.0
    endif

    Rc = s Rt + t Rb
    Gc = s Gt + t Gb
    Bc = s Bt + t Bb

When the bottom pixel is fully opaque (Ab = 1.0), the function reduces to:

    Ac = 1
    Rc = At Rt + (1 - At) Rb
    Gc = At Gt + (1 - At) Gb
    Bc = At Bt + (1 - At) Bb

When the bottom pixel is not fully opaque, the function is much simpler if pixels are represented as (R*,G*,B*,A*) rather than (R,G,B,A), where A* is the complement of A, and (R*,G*,B*) is (R,G,B) "premultiplied" by A:

    A* = 1 - A
    R* = R A
    G* = G A
    B* = B A

For a top pixel (Rt*,Gt*,Bt*,At*) and a bottom pixel (Rb*,Gb*,Bb*,Ab*), the composite pixel (Rc*,Gc*,Bc*,Ac*) is given by:

    Ac* = 0   + At* Ab*
    Rc* = Rt* + At* Rb*
    Gc* = Gt* + At* Gb*
    Bc* = Bt* + At* Bb*

As mentioned in the PNG specification, the equations become much simpler when no pixel has an alpha value other than 0.0 or 1.0, and the RGB samples need not be linear in that case.

End of draft.
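The equations above translate directly into code. This sketch implements both the straight form and the premultiplied/complemented form, on linear samples in [0..1]; the function names are illustrative only:

```python
def over(top, bottom):
    """Composite top over bottom; pixels are (R, G, B, A) tuples."""
    rt, gt, bt, at = top
    rb, gb, bb, ab = bottom
    ac = 1 - (1 - at) * (1 - ab)
    if ac != 0:
        s = at / ac
        t = (1 - at) * ab / ac
    else:
        s, t = 0.0, 1.0
    return (s * rt + t * rb, s * gt + t * gb, s * bt + t * bb, ac)

def over_premul(top, bottom):
    """Same operation on (R*, G*, B*, A*) pixels, where A* = 1 - A and
    the color samples are premultiplied by A."""
    rt, gt, bt, at = top
    rb, gb, bb, ab = bottom
    return (rt + at * rb, gt + at * gb, bt + at * bb, at * ab)

def to_premul(p):
    """Convert a straight (R, G, B, A) pixel to (R*, G*, B*, A*)."""
    r, g, b, a = p
    return (r * a, g * a, b * a, 1 - a)
```

Note how the premultiplied form needs no division and no special case for Ac = 0, which is why it is attractive for decoders that composite many partially transparent pixels.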