-
the C programming language has a set of functions implementing operations on strings (character/byte strings) in its standard library
- e.g. length, copying, concatenation, tokenization, searching
-
for character strings, the standard library uses the convention that strings are null-terminated:
- a string of
$n$ characters is represented as an array of$n + 1$ elements - the last of which is a "
NULL
character" with numeric value 0
- a string of
- the smallest addressable memory unit in most hardware nowadays is a byte
$=8$ bits-
each bit can store
$2$ states:$1$ /$0$ -
$n$ bits can store$2^n$ possible combinations
-
- 32-bit, 64-bit, ... refer to CPU architectures (and corresponding operating systems)
- Registers INSIDE CPUs are used to hold immediate data:
- instructions
- memory addresses
- data for calculations
- A
$n$ -bit CPU has$n$ -bit wide registers (e.g.$64$ -bit CPU has$8$ Byte wide registers)- This length is often called the natural unit of data a CPU processes at a time: a
word
/size_t
- Linux:
unsigned long
- Windows:
SIZE_T
- Linux:
- This length is often called the natural unit of data a CPU processes at a time: a
- Since memory access instructions (e.g. Assembly
mov
/load
/store
) of the CPU ISA (Instruction Set Architecture) are hard-coded to use a register as memory address- a
$32$ bit CPU can address a theoretic maximum of$2^{32}$ byte$\approx 4$ gigabytes - a
$64$ bit CPU can address a theoretic maximum of$2^{64}$ byte$\approx 18$ exabytes ($= 18000$ terrabytes)
- a
- Registers INSIDE CPUs are used to hold immediate data:
-
on a hardware level all characters are stored as consecutive fixed width unsigned binary integers that each represent a code unit
- terminated by an additional zero code unit
-
when speaking of strings normally code units of type
char
/wchar_t
are meant- a
char
is$8$ bits$=1$ byte (on most modern machines) - a
wchar_t
is$16/32$ bits$=$ $2/4$ byte
- a
#include <stdio.h> // printf
void print_array(const char* name, const char* chars) {
printf("%s [code units]: ", name);
const char *separator = "";
size_t counter = 0;
while (*chars) {
printf("%s%d", separator, *chars);
separator = ",";
chars++;
}
printf("%s%d", separator, *chars);
printf("\n");
}
int main() {
// 1. Stack Allocation (automatic memory management)
// > Pointer to string literal (read-only! | they are always immutable!)
char* literalPtr = "Literal String"; // Automatically inserts a NULL at the end
//literalPtr[0] = 'l'; // NOT POSSIBLE! IMMUTABLE!
// To indicate that you always write 'const'
const char* CONST_LITERAL_PTR = "Immutable Literal String";
// > Static string (can be updated | mutable)
char staticStr[] = "Static String";
staticStr[0] = 's';
printf("Literal Pointer: %s\n", literalPtr);
printf("Literal Pointer [CONST]: %s\n", CONST_LITERAL_PTR);
printf("Static String: %s\n", staticStr);
print_array("literalPtr", literalPtr);
print_array("staticStr ", staticStr);
}
// Literal Pointer: Literal String
// Literal Pointer [CONST]: Immutable Literal String
// Static String: static String
// literalPtr [code units]: 76,105,116,101,114,97,108,32,83,116,114,105,110,103,0
// staticStr [code units]: 115,116,97,116,105,99,32,83,116,114,105,110,103,0
To consistently map (encode) binary numbers represent by bits also known as code points to characters/symbols there are multiple standards:
In 1963 the ASCII character encoding defined
- Inspired by type/telewriter layout
They are represented by numbers from
0 NUL 16 DLE 32 48 0 64 @ 80 P 96 ` 112 p
1 SOH 17 DC1 33 ! 49 1 65 A 81 Q 97 a 113 q
2 STX 18 DC2 34 " 50 2 66 B 82 R 98 b 114 r
3 ETX 19 DC3 35 # 51 3 67 C 83 S 99 c 115 s
4 EOT 20 DC4 36 $ 52 4 68 D 84 T 100 d 116 t
5 ENQ 21 NAK 37 % 53 5 69 E 85 U 101 e 117 u
6 ACK 22 SYN 38 & 54 6 70 F 86 V 102 f 118 v
7 BEL 23 ETB 39 ' 55 7 71 G 87 W 103 g 119 w
8 BS 24 CAN 40 ( 56 8 72 H 88 X 104 h 120 x
9 HT 25 EM 41 ) 57 9 73 I 89 Y 105 i 121 y
10 LF 26 SUB 42 * 58 : 74 J 90 Z 106 j 122 z
11 VT 27 ESC 43 + 59 ; 75 K 91 [ 107 k 123 {
12 FF 28 FS 44 , 60 < 76 L 92 \ 108 l 124 |
13 CR 29 GS 45 - 61 = 77 M 93 ] 109 m 125 }
14 SO 30 RS 46 . 62 > 78 N 94 ^ 110 n 126 ~
15 SI 31 US 47 / 63 ? 79 O 95 _ 111 o 127 DEL
The reason the symbols are encoded as
This was enough for basic English use but was extended by various countries internationally to
- e.g. ISO/IEC 8859 which contains
ä
,ü
,ö
,ß
,€
,Ä
, ...
To solve the problem of multiple character encoding standards (that are different depending on the current locale) Unicode is one table that contains all of them instead of having to switch the encoding depending on the current use case.
Designed in 1988 by Joe Becker at Xerox this standard (where the first
This was later extended to
If you would store each character as
To support all current Unicode code points but still keep the storage size small a transfer function is used that widens the amount of bytes depending on how big the code point of a single symbol is (variable length characters):
-
UTF-8 uses
$1$ to$4$ byte per code point (indicated by a bit mask)-
Maximal compatibility with ASCII since the first
$128$ symbol have the same code points -
First code point Last code point Byte 1 Byte 2 Byte 3 Byte 4 U+0000 (0) U+007F (127) 0yyyzzzz
U+0080 (128) U+07FF (2047) 110xxxyy
10yyzzzz
U+0800 (2048) U+FFFF (65535) 1110wwww
10xxxxyy
10yyzzzz
U+010000 (65536) U+10FFFF (1114111) 11110uvv
10vvwwww
10xxxxyy
10yyzzzz
-
If the first bit in a byte is
$0$ it's a$1$ byte wide wide symbol -
If the first bits in a byte are
$110$ it's a$2$ byte wide wide symbol -
...
-
e.g.
ä
has the Unicode code pointU+00E4
(228,1110 0100
):// UTF-8 encoding pattern for 2-byte characters: 110x xxxx 10xx xxxx // Filling in the bits of 00E4: // -> 0000 0000 1110 0100 // // 0 0011 10 0100 // 110 10 1100 0011 1010 0100
-
-
-
UTF-16 uses
$2$ or$4$ byte per code point (indicated by a high surrogate)- Used by the Windows API and by many programming environments such as Java and Qt
- If the first 2 bytes are between U+D800 (
$55296$ ) and U+DFFF ($57343$ ) this indicates a surrogate pair meaning this code point is combined by this and the next$2$ bytes
-
UTF-32 uses a fixed width
$4$ byte code point- No transfer necessary since every Unicode character fits into
$32$ bit
- No transfer necessary since every Unicode character fits into
Important
Using a variable length encoding like UTF-8 breaks existing code for string length and other operations when counting the actual code points!
The bigger size or e.g. resolving the actual encoded symbols with multiple bytes needs to be explicitly counted!
#include <stdio.h> // printf
#include <string.h> // strlen
const char* ASCII_CONSTANT = "Hello!"; // Only ASCII symbols
const char* UTF8_CONSTANT = "ÄÖÜ€"; // UTF-8 symbols (file is UTF-8 encoded)
size_t utf8_strlen(const char *s) {
size_t count = 0;
while (*s) { // As long as the byte/char is not NULL
if ((*s & 0xC0) != 0x80) { // Start of code point if the byte bitmask
count++; // isn't 10xxxxxx ['abcd & 1000'='a000']
} // -> '0xC0'(hex)='11000000'(binary)
// -> '0x80'(hex)='10000000'(binary)
// -> the first 2 bits of *s must be !='10'
s++; // Point to the next consecutive byte/char
} // memory location
return count;
}
int main() {
printf("ASCII_CONSTANT: %s (length = %d, utf-8 length = %d)\n",
ASCII_CONSTANT, strlen(ASCII_CONSTANT), utf8_strlen(ASCII_CONSTANT));
printf("UTF8_CONSTANT: %s (length = %d, utf-8 length = %d)\n",
UTF8_CONSTANT, strlen(UTF8_CONSTANT), utf8_strlen(UTF8_CONSTANT));
}
// ASCII_CONSTANT: Hello! (length = 6, utf-8 length = 6)
// UTF8_CONSTANT: ÄÖÜ€ (length = 9, utf-8 length = 4)
#include <stdio.h> // printf
void print_array(const char* name, const int integers[], int elements) {
printf("%s (%d): ", name, elements);
const char *separator = "";
for(int j=0; j<elements; j++) {
printf("%s%d", separator, integers[j]);
separator = ",";
}
printf("\n");
}
int main() {
// Static array (fixed size)
int staticArr[5] = {10, 20, 30, 40, 50}; // specify size + values
staticArr[0] = 1; // change values
int staticArrAuto[] = {10, 20, 30, 40}; // values determine size
int staticArrZeros[5] = {1, 2}; // remaining values are 0
print_array("staticArr", staticArr, 5);
printf("0: %d, 2: %d\n", staticArr[0], staticArr[2]);
print_array("staticArrAuto", staticArrAuto, 4);
print_array("staticArrZeros", staticArrZeros, 5);
// Pointer Arithmetic
int *staticArrPtr = staticArr; // use pointer to access/change values
printf("[staticArrPtr] 0: %d, 2: %d\n", *staticArrPtr, *(staticArrPtr + 2));
// This works because the compiler is automatically changing the memory
// address by 2 * sizeOf(int)
printf("staticArrPtr = %p [pointer]\n",
(void *)staticArrPtr);
printf("staticArrPtr + 1 = %p [pointer]\n",
(void *)(staticArrPtr + 1));
printf("staticArrPtr + 1 = %p [size_t]\n",
(size_t)staticArrPtr + 1);
printf("staticArrPtr + 1*sizeof(int) = %p [size_t]\n",
(size_t)staticArrPtr + 1 * sizeof(int));
// THATS WHY IT'S VERY IMPORTANT TO HAVE THE CORRECT POINTER DATA TYPE!
long *staticArrPtrL = (long *)staticArr;
printf("(long *) staticArrPtr + 1 = %p [long pointer]\n",
(void *)(staticArrPtrL + 1));
}
// staticArr (5): 1,20,30,40,50
// 0: 1, 2: 30
// staticArrAuto (4): 10,20,30,40
// staticArrZeros (5): 1,2,0,0,0
// [staticArrPtr] 0: 1, 2: 30
// staticArrPtr = 0x7ffe7ecd3fe0 [pointer]
// staticArrPtr + 1 = 0x7ffe7ecd3fe4 [pointer]
// staticArrPtr + 1 = 0x7ffe7ecd3fe1 [size_t]
// staticArrPtr + 1*sizeof(int) = 0x7ffe7ecd3fe4 [size_t]
// (long *) staticArrPtr + 1 = 0x7ffe7ecd3fe8 [long pointer]
#include <stdio.h> // printf
#include <stdlib.h> // malloc
#include <sys/resource.h> // stack size limit
int main() {
// Stack allocation have limits
struct rlimit rl;
if (getrlimit(RLIMIT_STACK, &rl) == 0) {
printf("Stack size limit: %ld bytes (%.2f MB)\n",
rl.rlim_cur, rl.rlim_cur / (1024.0 * 1024));
} else {
fprintf(stderr, "Error: Unable to get stack size limit.\n");
return 1;
}
// E.g. with a 8MB limit for the stack allocating more triggers a stack overflow
int bigArray[10 * 1024 * 1024]; // 40 MB -> Triggers a Segmentation fault on Linux
// Dynamic array (Heap allocated)
// Either big or dynamic sized arrays can be allocated in a region of RAM
size_t elements = 500;
size_t size = elements * sizeof(int);
printf("Allocate memory for %d elements of %d byte size => %zu bytes\n",
elements, sizeof(int), size);
int *arr = malloc(size);
if (arr == NULL) {
fprintf(stderr, "Error: Memory allocation of %zu bytes failed.\n", size);
return 1;
}
// ... access/change is the same
free(arr); // THE ALLOCATED RAM REGION MUST BE MANUALLY FREED
// THIS CAN ONLY BE DONE ONCE AND arr IS NOT ALLOWED TO BE ACCESSED AGAIN!
}
// Stack size limit: 8388608 bytes (8.00 MB)
// Allocate memory for 500 elements of 4 byte size => 2000 bytes