"Character" is an overloaded term that can mean many things.
This gist will briefly touch on the differences between Code Point, Code Unit, Grapheme and Glyph, with a particular focus on emojis.
Code Point
A code point is the atomic unit of information that consists of a numerical value that maps to a specific character which is given meaning by the Unicode®︎ Standard.
A code point represents a digit, letter, whitespace, punctuation mark, emoji, symbol, control character, or formatting character.
Text is a sequence of code points.
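As a minimal sketch of the mapping described above, the snippet below reads the numerical value of a code point and converts a number back into a character. It relies on the standard `String.prototype.codePointAt()` and `String.fromCodePoint()`; the helper name `toCodePointLabel` is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: format the first code point of a string as U+XXXX.
function toCodePointLabel(str) {
  const codePoint = str.codePointAt(0);
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Escapes are used so that the exact code points are unambiguous.
const codePointSamples = ['A', '\u00E4', '\u{1F449}']; // A, ä, 👉
const labels = codePointSamples.map(toCodePointLabel);
console.log(labels); // ['U+0041', 'U+00E4', 'U+1F449']

// The mapping also works in the other direction.
const pointing = String.fromCodePoint(0x1F449);
console.log(pointing === '\u{1F449}'); // true
```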
Code Unit
A code unit is the unit of storage of a part of an encoded code point.
Some characters are encoded using multiple code units, resulting in a variable-length encoding.
UTF-32 is a fixed width encoding, hence all code points can be encoded in a single 32-bit code unit.
The following table shows the three most common character encoding schemes and their corresponding word size.
There are also 5 characters with their respective number of code units for the corresponding encoding.
The character 👉🏿 is a combination of two code points, which requires two 32-bit code units.
As a result, even UTF-32 does not always have a 1:1 mapping between code points and what a user perceives as characters.
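The relationship between encodings and code unit counts can be checked with a short sketch. The helper name `countCodeUnits` is hypothetical, and the sample characters are picked for illustration only. The UTF-32 count is derived from the number of code points, since JavaScript has no native UTF-32 string type.

```javascript
// Hypothetical helper: count the code units a string needs in each encoding.
function countCodeUnits(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // 8-bit code units
    utf16: str.length,                          // 16-bit code units
    utf32: [...str].length,                     // 32-bit code units (one per code point)
  };
}

// A, ä, €, 👉, 👉🏿 (written as escapes to keep the code points explicit).
const encodingSamples = ['A', '\u00E4', '\u20AC', '\u{1F449}', '\u{1F449}\u{1F3FF}'];
for (const sample of encodingSamples) {
  console.log(sample, countCodeUnits(sample));
}
// The last sample prints { utf8: 8, utf16: 4, utf32: 2 }.
```

Note that `TextEncoder` replaces lone surrogates with U+FFFD, so the UTF-8 count is only meaningful for well-formed strings.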
In JavaScript, strings are represented fundamentally as sequences of UTF-16 code units:
Every UTF-16 code unit is exactly 16 bits long.
Code units can be written in a string with \u followed by exactly four hex digits.
Code points can be written in a string with \u{xxxxxx} where xxxxxx represents 1–6 hex digits.
The entire Unicode character set is much bigger than what a single UTF-16 code unit can represent, so supplementary characters are encoded as two 16-bit code units called a surrogate pair.
The length data property of a String value counts UTF-16 code units.
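The points above can be sketched as follows. The helper `toSurrogatePair` is a hypothetical name; it applies the standard UTF-16 arithmetic and assumes its input is a supplementary code point (10000 to 10FFFF).

```javascript
// Hypothetical helper: split a supplementary code point into a surrogate pair.
// Assumes 0x10000 <= codePoint <= 0x10FFFF.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F449);
console.log(high.toString(16), low.toString(16)); // d83d dc49

// Both escape styles produce the same string.
const fromCodeUnits = '\uD83D\uDC49';
const fromCodePoint = '\u{1F449}';
console.log(fromCodeUnits === fromCodePoint); // true

// length counts UTF-16 code units, not code points.
console.log(fromCodePoint.length); // 2
console.log(fromCodePoint.charCodeAt(0) === high); // true
```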
Grapheme
A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element in the context of a particular writing system.
User-perceived characters are referred to as graphemes.
The following table shows an example with 3 characters:
| Character | Code Point | Description |
| --- | --- | --- |
| ä | 000E4 | Single code point. |
| ä | 00061 00308 | Base character (a), and combining diaeresis (◌̈). |
| 👉🏿 | 1F449 1F3FF | Backhand index pointing right (👉), and dark skin tone (🏿). |
Note
The characters ä and ä are the same grapheme, but have different code points.
The character 👉🏿 is two code points, but is displayed as a single character.
The diaeresis character (¨) and the combining diaeresis (◌̈) are not the same code point.
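The notes above can be verified with a small sketch. `String.prototype.normalize()` is not discussed in this gist; it is used here only as one way to show that the two forms of ä are canonically equivalent.

```javascript
const precomposed = '\u00E4'; // ä as a single code point
const decomposed = 'a\u0308'; // a followed by the combining diaeresis

// Same grapheme, different code points.
console.log(precomposed === decomposed); // false
console.log(precomposed.length, decomposed.length); // 1 2

// Normalizing both to the same form makes them comparable.
console.log(decomposed.normalize('NFC') === precomposed); // true
console.log(precomposed.normalize('NFD') === decomposed); // true

// The spacing diaeresis (¨) is a different code point from the combining one.
const spacingDiaeresis = '\u00A8';
const combiningDiaeresis = '\u0308';
console.log(spacingDiaeresis === combiningDiaeresis); // false
```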
In JavaScript, Intl.Segmenter is used to split a string into segments at grapheme cluster boundaries, as determined by a specified locale.
The following JavaScript example demonstrates the differences between code units, code points, and graphemes for the same string.
```javascript
// Backhand Index Pointing Right: Dark Skin Tone.
const str = '👉🏿'; // 1F449 1F3FF
console.log('String:', str);

// Split using UTF-8 code units.
const encoder = new TextEncoder();
const encoded = encoder.encode(str);
const u8_cu = [...encoded].map(byte => byte.toString(16));
console.log('UTF-8:', u8_cu);
// ['f0', '9f', '91', '89', 'f0', '9f', '8f', 'bf']

// Split using UTF-16 code units.
// This is the default encoding in JavaScript.
const u16_cu = str.split('');
console.log('UTF-16:', u16_cu);
// ['\uD83D', '\uDC49', '\uD83C', '\uDFFF']

// Split using UTF-32 code units.
// A UTF-32 code unit is always a single code point.
const u32_cu = Array.from(str); // or [...str]
console.log('UTF-32:', u32_cu);
// ['👉', '🏿']

// Split using graphemes.
const segmenter = new Intl.Segmenter();
const segment = segmenter.segment(str);
const graphemes = [...segment].map(x => x.segment);
console.log('Grapheme:', graphemes);
// ['👉🏿']
```
In JavaScript, String.prototype.split() and String.prototype[@@iterator]() split the characters of a string in different ways.
String indexes and split('') operate on UTF-16 code units, whereas [@@iterator]() iterates by code points.
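The practical consequence of these differences shows up in everyday operations such as reversing a string. The following sketch compares three approaches; the function names are hypothetical, and the grapheme version assumes a runtime that provides `Intl.Segmenter`.

```javascript
const graphemeSegmenter = new Intl.Segmenter();

// Reverses UTF-16 code units: breaks surrogate pairs.
function reverseByCodeUnits(str) {
  return str.split('').reverse().join('');
}

// Reverses code points: keeps surrogate pairs, but detaches modifiers.
function reverseByCodePoints(str) {
  return [...str].reverse().join('');
}

// Reverses grapheme clusters: keeps user-perceived characters intact.
function reverseByGraphemes(str) {
  const clusters = [...graphemeSegmenter.segment(str)].map(s => s.segment);
  return clusters.reverse().join('');
}

const text = 'a\u{1F449}\u{1F3FF}b'; // a👉🏿b
console.log(reverseByCodeUnits(text));  // contains lone surrogates
console.log(reverseByCodePoints(text)); // skin tone now precedes the hand
console.log(reverseByGraphemes(text));  // b👉🏿a
```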
Glyph
A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof.
Fonts may compose multiple glyphs into a single representation. For example, even when the above ä is encoded as a single code point, a font may choose to render it as two separate, spatially overlaid glyphs.
In OpenType fonts (OTF), the font's GSUB and GPOS tables contain the substitution and positioning information that makes this work. A font may also contain multiple alternative glyphs for the same grapheme.
Modifiers & Variation Selectors
Variation Selectors (VS) is a block (FE00-FE0F) within the Unicode character set containing 16 selectors, each used to specify a glyph variant for the preceding character.
Certain characters can have variant forms or styles, and variation selectors are used to indicate these variations.
These selectors are generally applied to base characters to modify their appearance, providing a means to represent different writing styles, scripts, or regional preferences.
Emojis can have diverse presentations, including variations in skin tones, genders, or other contextual modifications.
The following items are the emoji-specific variation selectors.
The VS-15 (FE0E) is used to request a text presentation (monochrome) for an emoji character (♀︎).
The VS-16 (FE0F) is used to request an emoji presentation (polychrome) for an emoji character (♀️).
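As a minimal sketch of the two selectors listed above, the snippet below appends each one to the female sign (2640). Whether the result actually renders as text or as a colored emoji depends on the font and platform, so only the code points are checked here. The constant and helper names are hypothetical.

```javascript
const VS15 = '\uFE0E'; // requests text presentation
const VS16 = '\uFE0F'; // requests emoji presentation
const femaleSign = '\u2640'; // ♀

const asText = femaleSign + VS15;
const asEmoji = femaleSign + VS16;

// Hypothetical helper: list the code points of a string as hex strings.
const toHex = str => [...str].map(ch => ch.codePointAt(0).toString(16).toUpperCase());
console.log(toHex(asText));  // ['2640', 'FE0E']
console.log(toHex(asEmoji)); // ['2640', 'FE0F']

// The base character and its selector still form a single grapheme.
const countGraphemes = str => [...new Intl.Segmenter().segment(str)].length;
console.log(countGraphemes(asEmoji)); // 1
```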
Skin tone variations are achieved using modifiers from 1F3FB to 1F3FF (🏻🏼🏽🏾🏿). They are based on the Fitzpatrick scale, a formal classification of human skin tones. When one of these code points is appended to an emoji that supports skin tone modifiers, it changes the skin tone of the emoji.
The following table lists all skin tone modifiers.

| Character | Code Point | Name |
| --- | --- | --- |
| 🏻 | 1F3FB | Light skin tone. |
| 🏼 | 1F3FC | Medium-light skin tone. |
| 🏽 | 1F3FD | Medium skin tone. |
| 🏾 | 1F3FE | Medium-dark skin tone. |
| 🏿 | 1F3FF | Dark skin tone. |
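Appending a modifier can be sketched as below. The `SKIN_TONES` map and the `applySkinTone` helper are hypothetical names. The sketch does not check whether the base emoji actually supports skin tone modifiers; it simply appends the code point.

```javascript
// Skin tone modifiers, keyed by name.
const SKIN_TONES = {
  'light': 0x1F3FB,
  'medium-light': 0x1F3FC,
  'medium': 0x1F3FD,
  'medium-dark': 0x1F3FE,
  'dark': 0x1F3FF,
};

// Hypothetical helper: append a skin tone modifier to a single emoji.
function applySkinTone(emoji, tone) {
  const modifier = SKIN_TONES[tone];
  if (modifier === undefined) {
    throw new RangeError(`Unknown skin tone: ${tone}`);
  }
  return emoji + String.fromCodePoint(modifier);
}

const darkPointing = applySkinTone('\u{1F449}', 'dark'); // 👉 + 🏿
console.log(darkPointing === '\u{1F449}\u{1F3FF}'); // true
console.log([...darkPointing].length); // 2 code points
```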
Zero Width Joiner
The zero-width joiner (ZWJ) is a non-printing character (200D) used in the computerized typesetting of writing systems in which the shape or positioning of a grapheme depends on its relation to other graphemes (complex scripts). When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.
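When a ZWJ is placed between two emoji characters (or between each of several), it can result in a single glyph or new emoji, such as a family emoji made up of two adult emoji (👨 and 👩) and one or two child emoji (👦) joined by ZWJs (for example, 1F468 200D 1F469 200D 1F466). The standalone family emoji (👪, 1F46A) is a separate single code point, not a ZWJ sequence.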
Note
When a ZWJ sequence is not supported, the ZWJ characters are ignored and a fallback sequence of separate emoji is displayed.
Thus an emoji ZWJ sequence should only be supported where the fallback sequence would also make sense to a viewer.
See the Recommended Emoji ZWJ Sequences.
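A short sketch of a ZWJ sequence and its fallback follows. The variable names are hypothetical, and the grapheme count assumes a runtime whose `Intl.Segmenter` follows the Unicode rules for emoji ZWJ sequences. Whether the sequence renders as one glyph still depends on font support.

```javascript
const ZWJ = '\u200D';
const man = '\u{1F468}';   // 👨
const woman = '\u{1F469}'; // 👩
const boy = '\u{1F466}';   // 👦

// Join the individual emoji with ZWJ to form a family sequence.
const family = [man, woman, boy].join(ZWJ);

console.log(family.length);      // 8 UTF-16 code units
console.log([...family].length); // 5 code points

// The whole sequence is segmented as one grapheme cluster.
const clusters = [...new Intl.Segmenter().segment(family)].map(s => s.segment);
console.log(clusters.length); // 1

// Removing the joiners yields the fallback sequence of separate emoji.
const fallback = family.split(ZWJ);
console.log(fallback.length); // 3
```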
When working with skin tones, you should be careful with those emojis that consist of two or more emojis joined by the ZWJ.
Skin tone modifiers must be included after the emoji but before the ZWJ.
Some ZWJ sequences include multiple emoji that each have different skin tone modifiers.
The following example combines characters and skin tones.
```javascript
String.fromCodePoint(
  0x1FAF1, // Rightwards Hand (🫱)
  0x1F3FB, // Light skin tone (🏻)
  0x0200D, // Zero Width Joiner
  0x1FAF2, // Leftwards Hand (🫲)
  0x1F3FF, // Dark skin tone (🏿)
); // Handshake: Light Skin Tone, Dark Skin Tone (🫱🏻‍🫲🏿)
```
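The ordering rule (modifier after each emoji, before the ZWJ) can be generalized with a small builder. This is only a sketch: `buildZwjSequence` and the `{ base, tone }` shape are hypothetical, and the function does not validate that the resulting sequence is a recommended one.

```javascript
const ZWJ_CODE_POINT = 0x200D;

// Hypothetical helper: build a ZWJ sequence from parts, placing each
// optional skin tone right after its base emoji and before the next ZWJ.
function buildZwjSequence(parts) {
  const codePoints = [];
  parts.forEach((part, index) => {
    if (index > 0) codePoints.push(ZWJ_CODE_POINT);
    codePoints.push(part.base);
    if (part.tone !== undefined) codePoints.push(part.tone);
  });
  return String.fromCodePoint(...codePoints);
}

const handshake = buildZwjSequence([
  { base: 0x1FAF1, tone: 0x1F3FB }, // Rightwards Hand, light skin tone
  { base: 0x1FAF2, tone: 0x1F3FF }, // Leftwards Hand, dark skin tone
]);

const expected = String.fromCodePoint(0x1FAF1, 0x1F3FB, 0x200D, 0x1FAF2, 0x1F3FF);
console.log(handshake === expected); // true
```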