Copyright | (c) 2010-2011 Simon Meier (c) 2010 Jasper van der Jeugt |
---|---|
License | BSD3-style (see LICENSE) |
Maintainer | Simon Meier <[email protected]> |
Portability | GHC |
Safe Haskell | Trustworthy |
Language | Haskell98 |
This module provides Builder
primitives, which are lower level building
blocks for constructing Builder
s. You don't need to go down to this level but
it can be slightly faster.
Morally, builder primitives are like functions a -> Builder
, that is they
take a value and encode it as a sequence of bytes, represented as a Builder
.
Of course their implementation is a bit more specialised.
Builder primitives come in two forms: fixed-size and bounded-size.
- Fixed(-size) primitives are builder primitives that always result in a
sequence of bytes of a fixed length. That is, the length is independent of
the value that is encoded. An example of a fixed size primitive is the
big-endian encoding of a
Word64
, which always results in exactly 8 bytes. - Bounded(-size) primitives are builder primitives that always result in a
sequence of bytes that is no larger than a predetermined bound. That is, the
bound is independent of the value that is encoded but the actual length will
depend on the value. An example for a bounded primitive is the UTF-8 encoding
of a
Char
, which can be 1,2,3 or 4 bytes long, so the bound is 4 bytes.
Note that fixed primitives can be considered as a special case of bounded primitives, and we can lift from fixed to bounded.
Because bounded primitives are the more general case, in this documentation we only refer to fixed size primitives where it matters that the resulting sequence of bytes is of a fixed length. Otherwise, we just refer to bounded size primitives.
The purpose of using builder primitives is to improve the performance of
Builder
s. These improvements stem from making the two most common steps
performed by a Builder
more efficient. We explain these two steps in turn.
The first most common step is the concatenation of two Builder
s. Internally,
concatenation corresponds to function composition. (Note that Builder
s can
be seen as difference-lists of buffer-filling functions; cf.
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dlist. )
Function composition is a fast O(1) operation. However, we can use bounded
primitives to remove some of these function compositions altogether, which is
more efficient.
The second most common step performed by a Builder
is to fill a buffer using
a bounded primitives, which works as follows. The Builder
checks whether
there is enough space left to execute the bounded primitive. If there is, then
the Builder
executes the bounded primitive and calls the next Builder
with
the updated buffer. Otherwise, the Builder
signals its driver that it
requires a new buffer. This buffer must be at least as large as the bound of
the primitive. We can use bounded primitives to reduce the number of
buffer-free checks by fusing the buffer-free checks of consecutive Builder
s.
We can also use bounded primitives to simplify the control flow for signalling
that a buffer is full by ensuring that we check first that there is enough
space left and only then decide on how to encode a given value.
Let us illustrate these improvements on the CSV-table rendering example from
Data.ByteString.Builder. Its "hot code" is the rendering of a table's
cells, which we implement as follows using only the functions from the
Builder
API.
import Data.ByteString.Builder as B renderCell :: Cell -> Builder renderCell (StringC cs) = renderString cs renderCell (IntC i) = B.intDec i renderString :: String -> Builder renderString cs = B.charUtf8 '"' <> foldMap escape cs <> B.charUtf8 '"' where escape '\\' = B.charUtf8 '\\' <> B.charUtf8 '\\' escape '\"' = B.charUtf8 '\\' <> B.charUtf8 '\"' escape c = B.charUtf8 c
Efficient encoding of Int
s as decimal numbers is performed by intDec
.
Optimization potential exists for the escaping of String
s. The above
implementation has two optimization opportunities. First, the buffer-free
checks of the Builder
s for escaping double quotes and backslashes can be
fused. Second, the concatenations performed by foldMap
can be eliminated.
The following implementation exploits these optimizations.
import qualified Data.ByteString.Builder.Prim as P import Data.ByteString.Builder.Prim (condB
,liftFixedToBounded
, (>*<
), (>$<
) ) renderString :: String -> Builder renderString cs = B.charUtf8 '"' <> E.encodeListWithB
escape cs <> B.charUtf8 '"' where escape :: E.BoundedPrim
Char escape =condB
(== '\\') (fixed2 ('\\', '\\')) $condB
(== '\"') (fixed2 ('\\', '\"')) $ E.charUtf8
{-# INLINE fixed2 #-} fixed2 x =liftFixedToBounded
$ const x>$<
E.char7
>*<
E.char7
The code should be mostly self-explanatory. The slightly awkward syntax is
because the combinators are written such that the size-bound of the resulting
BoundedPrim
can be computed at compile time. We also explicitly inline the
fixed2
primitive, which encodes a fixed tuple of characters, to ensure that
the bound computation happens at compile time. When encoding the following list
of String
s, the optimized implementation of renderString
is two times
faster.
maxiStrings :: [String] maxiStrings = take 1000 $ cycle ["hello", "\"1\"", "λ-wörld"]
Most of the performance gain stems from using primMapListBounded
, which
encodes a list of values from left-to-right with a BoundedPrim
. It exploits
the Builder
internals to avoid unnecessary function compositions (i.e.,
concatenations). In the future, we might expect the compiler to perform the
optimizations implemented in primMapListBounded
. However, it seems that the
code is currently to complicated for the compiler to see through. Therefore, we
provide the BoundedPrim
escape hatch, which allows data structures to provide
very efficient encoding traversals, like primMapListBounded
for lists.
Note that BoundedPrim
s are a bit verbose, but quite versatile. Here is an
example of a BoundedPrim
for combined HTML escaping and UTF-8 encoding. It
exploits that the escaped character with the maximal Unicode codepoint is '>'.
{-# INLINE charUtf8HtmlEscaped #-} charUtf8HtmlEscaped :: E.BoundedPrim Char charUtf8HtmlEscaped =condB
(> '>' ) E.charUtf8
$condB
(== '<' ) (fixed4 ('&',('l',('t',';')))) $ -- <condB
(== '>' ) (fixed4 ('&',('g',('t',';')))) $ -- >condB
(== '&' ) (fixed5 ('&',('a',('m',('p',';'))))) $ -- &condB
(== '"' ) (fixed5 ('&',('#',('3',('4',';'))))) $ -- "condB
(== '\'') (fixed5 ('&',('#',('3',('9',';'))))) $ -- ' (liftFixedToBounded
E.char7
) -- fallback forChar
s smaller than '>' where {-# INLINE fixed4 #-} fixed4 x =liftFixedToBounded
$ const x>$<
E.char7>*<
E.char7>*<
E.char7>*<
E.char7 {-# INLINE fixed5 #-} fixed5 x =liftFixedToBounded
$ const x>$<
E.char7>*<
E.char7>*<
E.char7>*<
E.char7>*<
E.char7
This module currently does not expose functions that require the special
properties of fixed-size primitives. They are useful for prefixing Builder
s
with their size or for implementing chunked encodings. We will expose the
corresponding functions in future releases of this library.
- data BoundedPrim a
- emptyB :: BoundedPrim a
- (>*<) :: Monoidal f => f a -> f b -> f (a, b)
- (>$<) :: Contravariant f => (b -> a) -> f a -> f b
- eitherB :: BoundedPrim a -> BoundedPrim b -> BoundedPrim (Either a b)
- condB :: (a -> Bool) -> BoundedPrim a -> BoundedPrim a -> BoundedPrim a
- primBounded :: BoundedPrim a -> a -> Builder
- primMapListBounded :: BoundedPrim a -> [a] -> Builder
- primUnfoldrBounded :: BoundedPrim b -> (a -> Maybe (b, a)) -> a -> Builder
- primMapByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder
- primMapLazyByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder
- data FixedPrim a
- emptyF :: FixedPrim a
- liftFixedToBounded :: FixedPrim a -> BoundedPrim a
- primFixed :: FixedPrim a -> a -> Builder
- primMapListFixed :: FixedPrim a -> [a] -> Builder
- primUnfoldrFixed :: FixedPrim b -> (a -> Maybe (b, a)) -> a -> Builder
- primMapByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder
- primMapLazyByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder
- int8 :: FixedPrim Int8
- word8 :: FixedPrim Word8
- int16BE :: FixedPrim Int16
- int32BE :: FixedPrim Int32
- int64BE :: FixedPrim Int64
- word16BE :: FixedPrim Word16
- word32BE :: FixedPrim Word32
- word64BE :: FixedPrim Word64
- floatBE :: FixedPrim Float
- doubleBE :: FixedPrim Double
- int16LE :: FixedPrim Int16
- int32LE :: FixedPrim Int32
- int64LE :: FixedPrim Int64
- word16LE :: FixedPrim Word16
- word32LE :: FixedPrim Word32
- word64LE :: FixedPrim Word64
- floatLE :: FixedPrim Float
- doubleLE :: FixedPrim Double
- intHost :: FixedPrim Int
- int16Host :: FixedPrim Int16
- int32Host :: FixedPrim Int32
- int64Host :: FixedPrim Int64
- wordHost :: FixedPrim Word
- word16Host :: FixedPrim Word16
- word32Host :: FixedPrim Word32
- word64Host :: FixedPrim Word64
- floatHost :: FixedPrim Float
- doubleHost :: FixedPrim Double
- char7 :: FixedPrim Char
- int8Dec :: BoundedPrim Int8
- int16Dec :: BoundedPrim Int16
- int32Dec :: BoundedPrim Int32
- int64Dec :: BoundedPrim Int64
- intDec :: BoundedPrim Int
- word8Dec :: BoundedPrim Word8
- word16Dec :: BoundedPrim Word16
- word32Dec :: BoundedPrim Word32
- word64Dec :: BoundedPrim Word64
- wordDec :: BoundedPrim Word
- word8Hex :: BoundedPrim Word8
- word16Hex :: BoundedPrim Word16
- word32Hex :: BoundedPrim Word32
- word64Hex :: BoundedPrim Word64
- wordHex :: BoundedPrim Word
- int8HexFixed :: FixedPrim Int8
- int16HexFixed :: FixedPrim Int16
- int32HexFixed :: FixedPrim Int32
- int64HexFixed :: FixedPrim Int64
- word8HexFixed :: FixedPrim Word8
- word16HexFixed :: FixedPrim Word16
- word32HexFixed :: FixedPrim Word32
- word64HexFixed :: FixedPrim Word64
- floatHexFixed :: FixedPrim Float
- doubleHexFixed :: FixedPrim Double
- char8 :: FixedPrim Char
- charUtf8 :: BoundedPrim Char
Bounded-size primitives
data BoundedPrim a Source #
A builder primitive that always results in sequence of bytes that is no longer than a pre-determined bound.
Combinators
The combinators for BoundedPrim
s are implemented such that the
size of the resulting BoundedPrim
can be computed at compile time.
emptyB :: BoundedPrim a Source #
The BoundedPrim
that always results in the zero-length sequence.
(>*<) :: Monoidal f => f a -> f b -> f (a, b) infixr 5 Source #
A pairing/concatenation operator for builder primitives, both bounded and fixed size.
For example,
toLazyByteString (primFixed (char7 >*< char7) ('x','y')) = "xy"
We can combine multiple primitives using >*<
multiple times.
toLazyByteString (primFixed (char7 >*< char7 >*< char7) ('x',('y','z'))) = "xyz"
(>$<) :: Contravariant f => (b -> a) -> f a -> f b infixl 4 Source #
A fmap-like operator for builder primitives, both bounded and fixed size.
Builder primitives are contravariant so it's like the normal fmap, but backwards (look at the type). (If it helps to remember, the operator symbol is like ($) but backwards.)
We can use it for example to prepend and/or append fixed values to an primitive.
showEncoding ((\x -> ('\'', (x, '\''))) >$< fixed3) 'x' = "'x'" where fixed3 = char7 >*< char7 >*< char7
Note that the rather verbose syntax for composition stems from the requirement to be able to compute the size / size bound at compile time.
eitherB :: BoundedPrim a -> BoundedPrim b -> BoundedPrim (Either a b) Source #
Encode an Either
value using the first BoundedPrim
for Left
values and the second BoundedPrim
for Right
values.
Note that the functions eitherB
, pairB
, and contramapB
(written below
using >$<
) suffice to construct BoundedPrim
s for all non-recursive
algebraic datatypes. For example,
maybeB :: BoundedPrim () -> BoundedPrim a -> BoundedPrim (Maybe a) maybeB nothing just =maybe
(Left ()) Right>$<
eitherB nothing just
condB :: (a -> Bool) -> BoundedPrim a -> BoundedPrim a -> BoundedPrim a Source #
Conditionally select a BoundedPrim
.
For example, we can implement the ASCII primitive that drops characters with
Unicode codepoints above 127 as follows.
charASCIIDrop =condB
(< '\128') (fromF
char7
)emptyB
Builder construction
primBounded :: BoundedPrim a -> a -> Builder Source #
Create a Builder
that encodes values with the given BoundedPrim
.
We rewrite consecutive uses of primBounded
such that the bound-checks are
fused. For example,
primBounded (word32 c1) `mappend` primBounded (word32 c2)
is rewritten such that the resulting Builder
checks only once, if ther are
at 8 free bytes, instead of checking twice, if there are 4 free bytes. This
optimization is not observationally equivalent in a strict sense, as it
influences the boundaries of the generated chunks. However, for a user of
this library it is observationally equivalent, as chunk boundaries of a lazy
ByteString
can only be observed through the internal interface.
Morevoer, we expect that all primitives write much fewer than 4kb (the
default short buffer size). Hence, it is safe to ignore the additional
memory spilled due to the more agressive buffer wrapping introduced by this
optimization.
primMapListBounded :: BoundedPrim a -> [a] -> Builder Source #
Create a Builder
that encodes a list of values consecutively using a
BoundedPrim
for each element. This function is more efficient than the
canonical
filter p = B.toLazyByteString . E.encodeLazyByteStringWithF (E.ifF p E.word8) E.emptyF)
mconcat . map (primBounded w)
or
foldMap (primBounded w)
because it moves several variables out of the inner loop.
primUnfoldrBounded :: BoundedPrim b -> (a -> Maybe (b, a)) -> a -> Builder Source #
Create a Builder
that encodes a sequence generated from a seed value
using a BoundedPrim
for each sequence element.
primMapByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder Source #
Create a Builder
that encodes each Word8
of a strict ByteString
using a BoundedPrim
. For example, we can write a Builder
that filters
a strict ByteString
as follows.
import Data.ByteString.Builder.Primas P (word8, condB, emptyB)
filterBS p = P.condB p P.word8 P.emptyB
primMapLazyByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder Source #
Chunk-wise application of primMapByteStringBounded
.
Fixed-size primitives
A builder primitive that always results in a sequence of bytes of a pre-determined, fixed size.
Combinators
The combinators for FixedPrim
s are implemented such that the size
of the resulting FixedPrim
is computed at compile time.
The '(>*and '($<)' pairing and mapping operators can be used
with FixedPrim
.
liftFixedToBounded :: FixedPrim a -> BoundedPrim a Source #
Lift a FixedPrim
to a BoundedPrim
.
Builder construction
In terms of expressivity, the function fixedPrim
would be sufficient
for constructing Builder
s from FixedPrim
s. The fused variants of
this function are provided because they allow for more efficient
implementations. Our compilers are just not smart enough yet; and for some
of the employed optimizations (see the code of encodeByteStringWithF
)
they will very likely never be.
Note that functions marked with "Heavy inlining." are forced to be
inlined because they must be specialized for concrete encodings,
but are rather heavy in terms of code size. We recommend to define a
top-level function for every concrete instantiation of such a function in
order to share its code. A typical example is the function
byteStringHex
from Data.ByteString.Builder.ASCII, which is
implemented as follows.
byteStringHex :: S.ByteString -> Builder byteStringHex =encodeByteStringWithF
word8HexFixed
primMapListFixed :: FixedPrim a -> [a] -> Builder Source #
Encode a list of values from left-to-right with a FixedPrim
.
primMapByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder Source #
Heavy inlining. Encode all bytes of a strict ByteString
from
left-to-right with a FixedPrim
. This function is quite versatile. For
example, we can use it to construct a Builder
that maps every byte before
copying it to the buffer to be filled.
mapToBuilder :: (Word8 -> Word8) -> S.ByteString -> Builder mapToBuilder f = encodeByteStringWithF (contramapF f word8)
We can also use it to hex-encode a strict ByteString
as shown by the
byteStringHex
example above.
primMapLazyByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder Source #
Heavy inlining. Encode all bytes of a lazy ByteString
from
left-to-right with a FixedPrim
.
Standard encodings of Haskell values
Binary encodings
Big-endian
Little-endian
Non-portable, host-dependent
intHost :: FixedPrim Int Source #
Encode a single native machine Int
. The Int
s is encoded in host order,
host endian form, for the machine you are on. On a 64 bit machine the Int
is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way
are not portable to different endian or integer sized machines, without
conversion.
wordHost :: FixedPrim Word Source #
Encode a single native machine Word
. The Word
s is encoded in host order,
host endian form, for the machine you are on. On a 64 bit machine the Word
is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way
are not portable to different endian or word sized machines, without
conversion.
floatHost :: FixedPrim Float Source #
Encode a Float
in native host order and host endianness. Values written
this way are not portable to different endian machines, without conversion.
Character encodings
ASCII
Decimal numbers
Decimal encoding of numbers using ASCII encoded characters.
Hexadecimal numbers
Encoding positive integers as hexadecimal numbers using lower-case ASCII characters. The shortest possible representation is used. For example,
toLazyByteString (primBounded word16Hex 0x0a10) = "a10"
Note that there is no support for using upper-case characters. Please contact the maintainer if your application cannot work without hexadecimal encodings that use upper-case characters.
Fixed-width hexadecimal numbers
Encoding the bytes of fixed-width types as hexadecimal numbers using lower-case ASCII characters. For example,
toLazyByteString (primFixed word16HexFixed 0x0a10) = "0a10"
ISO/IEC 8859-1 (Char8)
The ISO/IEC 8859-1 encoding is an 8-bit encoding often known as Latin-1.
The Char8 encoding implemented here works by truncating the Unicode
codepoint to 8-bits and encoding them as a single byte. For the codepoints
0-255 this corresponds to the ISO/IEC 8859-1 encoding. Note that the
Char8 encoding is equivalent to the ASCII encoding on the Unicode
codepoints 0-127. Hence, functions such as intDec
can also be used for
encoding Int
s as a decimal number with Char8 encoded characters.