
Codecs library#

Importing#

|#| 'stdcodecs.nest' = cc

Nest encodings#

Note

To avoid mistakes, use the constants defined in this library instead of writing encoding names by hand.

Nest supports the following encodings:

| Encoding    | Aliases                            | Link     |
|-------------|------------------------------------|----------|
| ascii       | us-ascii                           | ASCII    |
| cp1250      | cp-1250, windows[-]1250            | CP1250   |
| cp1251      | cp-1251, windows[-]1251            | CP1251   |
| cp1252      | cp-1252, windows[-]1252            | CP1252   |
| cp1253      | cp-1253, windows[-]1253            | CP1253   |
| cp1254      | cp-1254, windows[-]1254            | CP1254   |
| cp1255      | cp-1255, windows[-]1255            | CP1255   |
| cp1256      | cp-1256, windows[-]1256            | CP1256   |
| cp1257      | cp-1257, windows[-]1257            | CP1257   |
| cp1258      | cp-1258, windows[-]1258            | CP1258   |
| latin-1     | latin1, l1, latin, iso[-]8859-1    | latin1   |
| utf8        | utf-8                              | UTF-8    |
| ext-utf8    | ext[-]utf[-]8                      | -        |
| utf16le     | utf-16le, utf[-]16                 | UTF-16LE |
| utf16be     | utf-16be                           | UTF-16BE |
| ext-utf16le | ext[-]utf[-]16le, ext[-]utf[-]16   | -        |
| ext-utf16be | ext[-]utf[-]16be                   | -        |
| utf32le     | utf-32le, utf[-]32                 | UTF-32LE |
| utf32be     | utf-32be                           | UTF-32BE |

Note

[-] means that the hyphen is optional; for example, windows1252 and windows-1252 are both accepted.

Encoding names are case-insensitive, and underscores (_), hyphens (-) and spaces ( ) are interchangeable. This means that utf8, UTF-8, uTf_8 and UtF 8 are all valid ways of specifying the UTF-8 encoding.
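
The matching rule above can be modeled outside Nest. The following is a minimal Python sketch, not Nest's implementation: `resolve_encoding` and `ALIASES` are hypothetical names, and the alias table contains only the utf8 row from the table above.

```python
# Hypothetical model of the documented name-matching rule; not Nest's implementation.
# Only the utf8 row of the encodings table is included here.
ALIASES = {
    "utf8": "utf8",
    "utf-8": "utf8",
}


def resolve_encoding(name: str) -> str:
    """Return the canonical encoding name, or raise ValueError if it is unknown."""
    # Case is ignored; underscores and spaces are treated like hyphens.
    key = name.lower().replace("_", "-").replace(" ", "-")
    try:
        return ALIASES[key]
    except KeyError:
        raise ValueError(f"unknown encoding: {name!r}") from None


names = ["utf8", "UTF-8", "uTf_8", "UtF 8"]
resolved = [resolve_encoding(n) for n in names]
print(resolved)  # ['utf8', 'utf8', 'utf8', 'utf8']
```

All four spellings from the note resolve to the same entry, which is why the library constants are the safer choice when a name has to be typed in more than one place.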

Functions#

@cp_is_valid#

Synopsis:

[cp: Int|Byte] @cp_is_valid -> Bool

Returns:

true if cp is a valid Unicode code point and false otherwise.


@encoding_info#

Synopsis:

[encoding: Str] @encoding_info -> Map

Returns:

A new map describing the given encoding. The map contains the following keys:

| Key     | Type        | Value                                                                             |
|---------|-------------|-----------------------------------------------------------------------------------|
| name    | Str         | The name of the encoding.                                                         |
| min_len | Int         | The minimum length of a code point (character) in bytes.                          |
| max_len | Int         | The maximum length of a code point (character) in bytes.                          |
| bom     | Array?.Byte | The Byte Order Mark: an array of bytes if the encoding has one, null otherwise.   |

Example:

|#| 'stdcodecs.nest' = cc

'utf16'  @cc.encoding_info --> {'name': 'UTF-16LE', 'min_len': 2, 'max_len': 4, 'bom': {255b, 254b}}
'latin1' @cc.encoding_info --> {'name': 'ISO-8859-1', 'min_len': 1, 'max_len': 1, 'bom': null}
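
The values in the first result can be cross-checked against the UTF-16LE encoding itself. The sketch below uses Python's standard `codecs` module rather than Nest, purely as an independent check of the BOM bytes and the per-character byte lengths; it does not call the Nest library.

```python
import codecs

# Python's codec name for the encoding that the example reports as 'UTF-16LE'.
ENC = "utf-16-le"

# Byte Order Mark for UTF-16LE: 0xFF 0xFE, i.e. the bytes 255 and 254.
bom = list(codecs.BOM_UTF16_LE)

# A character in the Basic Multilingual Plane takes one 16-bit unit (2 bytes);
# a character outside it takes a surrogate pair (4 bytes).
min_len = len("A".encode(ENC))
max_len = len("\U0001F600".encode(ENC))

print(bom, min_len, max_len)  # [255, 254] 2 4
```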

@from_cp#

Synopsis:

[cp: Int|Byte] @from_cp -> Str

Returns:

A new string containing the character associated with the given code point. If cp is not a valid code point (which can be checked with cp_is_valid), an error is thrown.


@to_cp#

Synopsis:

[char: Str] @to_cp -> Int

Returns:

The code point associated with the character in char. If char does not contain exactly one character, an error is thrown.
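
The relationship between cp_is_valid, from_cp and to_cp can be illustrated with a small Python model. This is a sketch under assumptions, not the Nest implementation: the function names are hypothetical Python counterparts, and "valid" is assumed to mean "inside the Unicode code space U+0000 to U+10FFFF" (the source does not say whether Nest also rejects surrogates).

```python
def is_valid_code_point(cp: int) -> bool:
    # Assumption: valid means inside the Unicode code space U+0000..U+10FFFF.
    return 0 <= cp <= 0x10FFFF


def from_cp(cp: int) -> str:
    """Model of from_cp: code point to one-character string."""
    if not is_valid_code_point(cp):
        raise ValueError(f"invalid code point: {cp}")
    return chr(cp)


def to_cp(char: str) -> int:
    """Model of to_cp: one-character string to code point."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return ord(char)


print(from_cp(65), to_cp("A"))             # A 65
print(to_cp(from_cp(0x1F600)) == 0x1F600)  # True
```

In this model the two conversions are inverses of each other for every valid code point, and both failure cases raise an error instead of returning a placeholder value.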


Constants#

ASCII#

ASCII (a.k.a. US-ASCII) encoding name.


UTF_8#

UTF-8 encoding name.


EXT_UTF_8#

extUTF-8 encoding name. This Nest-specific encoding is UTF-8 extended to accept unpaired surrogates.
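
To show what accepting unpaired surrogates means in practice, the sketch below uses Python's `surrogatepass` error handler, which is an analogous mechanism and not the Nest codec. Whether extUTF-8 produces exactly these bytes is an assumption that the source does not confirm.

```python
lone = "\ud800"  # an unpaired high surrogate

# Strict UTF-8 refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With "surrogatepass" the surrogate is encoded like any other 3-byte code point.
relaxed = lone.encode("utf-8", "surrogatepass")

print(strict_ok, relaxed)  # False b'\xed\xa0\x80'
```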


UTF_16#

UTF-16 encoding name.


UTF_16LE#

UTF-16LE encoding name.


UTF_16BE#

UTF-16BE encoding name.


EXT_UTF_16#

extUTF-16 encoding name. This Nest-specific encoding is UTF-16 extended to accept unpaired surrogates, with one exception: the last character must not be a high surrogate.


EXT_UTF_16LE#

extUTF-16LE encoding name. Little-endian version of extUTF-16.


EXT_UTF_16BE#

extUTF-16BE encoding name. Big-endian version of extUTF-16.


UTF_32#

UTF-32 encoding name.


UTF_32LE#

UTF-32LE encoding name.


UTF_32BE#

UTF-32BE encoding name.


CP1250#

CP1250 (a.k.a. Windows-1250) encoding name.


CP1251#

CP1251 (a.k.a. Windows-1251) encoding name.


CP1252#

CP1252 (a.k.a. Windows-1252) encoding name.


CP1253#

CP1253 (a.k.a. Windows-1253) encoding name.


CP1254#

CP1254 (a.k.a. Windows-1254) encoding name.


CP1255#

CP1255 (a.k.a. Windows-1255) encoding name.


CP1256#

CP1256 (a.k.a. Windows-1256) encoding name.


CP1257#

CP1257 (a.k.a. Windows-1257) encoding name.


CP1258#

CP1258 (a.k.a. Windows-1258) encoding name.


LATIN_1#

Latin-1 (a.k.a. ISO/IEC 8859-1) encoding name.


ISO_8859_1#

ISO/IEC 8859-1 (a.k.a. Latin-1) encoding name. This is the same as LATIN_1.