Skip to content

Instantly share code, notes, and snippets.

@michele-tn
Created September 18, 2025 18:38
Show Gist options
  • Save michele-tn/f2c5bfbc4e6c6470ea3a16673dcb5038 to your computer and use it in GitHub Desktop.
Save michele-tn/f2c5bfbc4e6c6470ea3a16673dcb5038 to your computer and use it in GitHub Desktop.

1. Introduction and scope

Hash functions are ubiquitous in information security: they are used for integrity checks, content-addressed storage, digital signatures, password verification, and more. This document treats three families of constructions that have received the most attention in practice and in the literature: MD5 (an historical design of the MD family), SHA-2 (notably SHA-256, standardized in FIPS 180), and SHA-3 (the Keccak-based family standardized in FIPS 202). Where necessary we contrast the design principles (MerkleDamgrd vs. sponge) and explain the consequences for security. Key primary sources are cited: the MD5 specification (RFC 1321), the SHA standards (FIPS 180-4), the SHA-3 standard (FIPS 202), and major cryptanalytic work on MD5 and related constructions. citeturn0search0turn0search7turn3search2


2. Formal definitions and basic security notions

Let (H:{0,1}^* o{0,1}^n) be a deterministic function mapping binary strings of arbitrary length to fixed-length (n)-bit outputs. The following definitions are standard (see comprehensive references in applied cryptography). citeturn0search3

Definition (Preimage resistance). (H) is preimage resistant if, given a randomly chosen (y{0,1}^n), it is computationally infeasible to find (x) such that (H(x)=y).

Definition (Second-preimage resistance). (H) is second-preimage resistant if, given a fixed (x), it is computationally infeasible to find (x' eq x) with (H(x')=H(x)).

Definition (Collision resistance). (H) is collision resistant if it is computationally infeasible to find any pair ((x,x')), (x eq x'), with (H(x)=H(x')).

These properties are not independent in a quantitative sense: generic bounds for attack complexity follow from combinatorial arguments and are summarized below. The ``birthday bound'' gives a generic (O(2)) work factor for collision search in an ideal (n)-bit hash, while preimage attacks are generic (O(2^n)). A rigorous derivation and discussion can be found in standard texts. citeturn0search3


3. MerkleDamgrd construction and its formal consequence

Many classical hash functions (MD4/MD5, SHA-1, SHA-2) follow the MerkleDamgrd paradigm: compress a chaining value iteratively with fixed-size message blocks and a compression function (f). The high-level reduction theorem is:

Informal theorem (MerkleDamgrd collision lifting). If the underlying fixed-length compression function (f) is collision-prone (i.e., there exist two distinct inputs to (f) that yield the same output), then one can construct collisions for the iterated MerkleDamgrd hash by appending appropriate blocks. The conversesecurity of (f) implies security of the iterated constructionholds under standard MD-compliant padding assumptions. citeturn2search10turn2search4

Consequence: structural weaknesses in a compression function can be amplified to the full hash. This is the mechanism that allowed cryptanalysts to move from local differential paths in the MD5 compression to full-message collisions. For historical exposition and formal proofs see the original Damgrd paper and subsequent analyses. citeturn2search10


4. MD5: specification, internal mathematics, and implementation notes

4.1 Specification pointers and invariants

MD5 is defined in RFC 1321 (1992). The algorithm maps arbitrary-length messages to a 128-bit digest and is specified by the padding rule, an initial 128-bit chaining value expressed as four 32-bit words, and a 64-step compression built from logical boolean functions, modular addition and left-rotation operations. Implementers should consult the RFC for exact constants, endianness conventions and a reference implementation. citeturn0search0

Key invariants:

  • Output length (n=128) bits.
  • Internal state width (v=128) bits (A,B,C,D each 32 bits).
  • The additive constants (T_j) (one per step) are derived from the fractional parts of (|(j+1)|) scaled by (2) (integer part taken). citeturn0search0

4.2 Padding, parsing and little-endian convention

Given message (M), MD5 forms [ M' = M ,|, 1 ,|, 0^k ,|, (len), ] where (len=|M|) in bits, (()) denotes the 64-bit little-endian encoding of (len), and (k) is chosen so (|M'| 0). The padded message is parsed into (t) blocks (M_1,,M_t) each of 512 bits, and each block is viewed as 16 little-endian 32-bit words (M_i[015]). citeturn0search0

4.3 The compression step: equations and functions

Define four boolean functions (operating on 32-bit words):

[

F(X,Y,Z) = (X AND Y) OR ( NOT X AND Z)
G(X,Y,Z) = (X AND Z) OR (Y AND NOT Z)
H(X,Y,Z) = X XOR Y XOR Z
I(X,Y,Z) = Y XOR (X OR NOT Z)

]

The per-step update (conceptual) for step index (j) is:

[ ext{tmp} = A + f_j(B,C,D) + M[g_j] + T_j } ]

[ A,B,C,D D,; B + ROL( ext{tmp}, s_j),; B,; C ]

Here (f_j) is one of (F,G,H,I) depending on the round, (g_j) selects which message word is used, (s_j) is the rotation amount, and (T_j) is the additive constant. After all 64 steps a block-specific result is added back to the chaining variables modulo (2). The RFC lists tables for (g_j), (s_j) and (T_j); implementers must use those exact tables to interoperate. citeturn0search0

4.4 ASCII diagram: MD5 block processing

+-------------------------+
| 512-bit message block |
+-------------------------+
| Parse -> 16 x 32-bit W |
+-------------------------+
| Initialize A,B,C,D |
| for j in 0..63: |
| tmp = A + f_j(B,C,D) |
| + W[g_j] + T_j |
| A = D |
| D = C |
| C = B |
| B = B + ROL(tmp,s_j) |
+-------------------------+
| Add to chaining state |
+-------------------------+

4.5 Avalanche, determinism and one-wayness (mathematical commentary)

  • Determinism: MD5 is a deterministic function; identical messages produce identical digests by construction.
  • Avalanche: The composition of non-linear boolean functions with rotations and modular addition is intended to produce diffusion where single-bit changes propagate across many output bits after a few rounds. This property is heuristic and measured empirically; it does not constitute formal resistance to collision-finding, which depends on the algebraic and differential structure of the compression. citeturn0search0
  • One-wayness: Formally one-wayness is a complexity-theoretic property; given the size of the output space (128 bits) generic preimage search has complexity (O(2)), but targeted cryptanalysis (e.g., exploiting internal structure) may reduce this work factor. In MD5, practical collision attacks have significantly weakened the assumed security margin. citeturn0search1

5. Collision attacks on MD5: differential methods and practical chosen-prefix collisions

5.1 Differential cryptanalysis and Wang et al.

Beginning with theoretical differential techniques and culminating in practical demonstrations, researchers led by Xiaoyun Wang showed that MD5's compression function admits differential paths that lead to collisions much faster than the generic birthday bound would suggest for an ideal 128-bit function. Wang et al.'s work demonstrated concrete collision-finding methods and performance estimates that were orders of magnitude faster than brute-force. citeturn0search1

5.2 Chosen-prefix collisions, Stevens and practical exploits

Marc Stevens and coauthors developed chosen-prefix collision techniques that dramatically increased the practical threat model: attackers can construct collisions where two chosen prefixes (arbitrary data controlled by the attacker) are extended with computed suffixes to produce colliding digests. These chosen-prefix collisions enabled concrete forgeries (notably rogue CA certificate constructions) and are a central reason why MD5 is considered unsafe for any collision-sensitive application. Stevens and collaborators provided complexity estimates and demonstrations; modern chosen-prefix methods reduce the computational cost substantially relative to naive estimates. citeturn0search2turn0search10

5.3 Quantitative statement (practical numbers)

  • Generic collision complexity for an ideal 128-bit digest is (O(2)) work (birthday bound). citeturn0search3
  • Wang's differential methods and subsequent refinements produced collisions in timescales that are feasible for attackers with modest resources (original demonstrations reported wall-clock times from minutes to hours on workstation clusters of the time). citeturn0search1
  • Chosen-prefix collisions constructed by Stevens and colleagues reported practical costs in the range of (2)(2) compression-function evaluations for specific constructions with optimizations; these orders of magnitude are well within reach of modern cluster or cloud resources when amortized. Practically, this implies MD5-based signatures and certificate usage are insecure. citeturn0search2turn0search10

Implication. Collision resistancerequired by a range of cryptographic uses such as digest-based digital signaturesis demonstrably broken for MD5. Any system depending on MD5's collision resistance (e.g., signing an MD5 digest) must be considered compromised. citeturn0search1turn0search10


6. SHA-2 (SHA-256): specification, internal mathematics, and comparative security

6.1 Specification and references

SHA-256 and the broader SHA-2 family are specified in the NIST Secure Hash Standard (FIPS 180-4). SHA-256 produces a 256-bit digest and uses an iterated compression function over 512-bit blocks with an internal state of eight 32-bit words (total 256-bit chaining state). The specification defines the message schedule, constants (K_t), the boolean functions and rotation/shift operators used in each round. citeturn3search1

6.2 Internal functions and round equations

Let the message schedule be (W_0,,W) derived from the 16 message words and extended by:

[ W_t = sigma1(W) + W + sigma0(W) + W}, t=1663 ]

where (sigma0,sigma1) are small-word rotations/xors defined in the standard. The per-round update uses functions:

[

ext{Ch}(x,y,z) = (x AND y) XOR ( NOT x AND z)\
ext{Maj}(x,y,z) = (x AND y) XOR (x AND z) XOR (y AND z)\

Sigma0(x) = ext{ROTR}^2(x) XOR ext{ROTR}(x) XOR ext{ROTR}(x)
Sigma1(x) = ext{ROTR}^6(x) XOR ext{ROTR}(x) XOR ext{ROTR}(x)

]

The compression updates eight working variables (a..h) per round and then adds them back to the chain variables. The complete spec and constants are tabulated in FIPS 180-4. citeturn3search1

6.3 ASCII diagram: SHA-256 round (conceptual)

[W_t] [K_t]
 \ /
 +---------+
 | temp1 | = h + 1(e) + Ch(e,f,g) + K_t + W_t
 | temp2 | = 0(a) + Maj(a,b,c)
 +---------+
 h = g
 g = f
 f = e
 e = d + temp1
 d = c
 c = b
 b = a
 a = temp1 + temp2

(Repeat for t=0..63, then add a..h to the chaining state.) citeturn3search1

6.4 Security posture and attacks

As of the present literature and standards, SHA-256 remains free of published practical collision attacks comparable to MD5's differential attacks; the best known attacks against SHA-256 are significantly more expensive than the generic birthday bound and still far beyond practical reach. The SHA-2 family benefits from a larger internal state and a more conservative design against differential characteristics than MD5. NIST continues to recommend SHA-2 for cryptographic hashing and signatures where mandated. citeturn3search1


7. SHA-3 (Keccak) and the sponge construction

SHA-3 (FIPS 202) departs from the MerkleDamgrd paradigm and adopts the sponge construction with a permutation-based internal primitive (Keccak-f). The sponge construction provides natural resistance to length-extension attacks inherent to classic MerkleDamgrd designs and is parameterized by a rate (r) and capacity (c) such that the claimed security level is at most (c/2). SHA-3 complements SHA-2 as an algorithmically distinct family and is standardized for contexts requiring diversification of primitives. citeturn3search2

ASCII schematic of the sponge (absorb/squeeze):

Input blocks -> XOR into first r bits of state -> Permutation f -> (repeat)
After absorption: Squeeze outputs from first r bits, applying f between output blocks.
Capacity c = state width - r (controls security level)

8. Collision probability, birthday paradox and quantitative bounds (derivation sketch)

Let (n) be the output length in bits and (N=2^n). Sampling (k) independent random digests, the probability that no pair collides equals ((1 - i/N)). Using the approximation ((1-x)�pprox -x) for small (x) and replacing the product by the exponential of a sum yields

[ [ ext{no collision}] �pprox (- rac{k(k-1)}{2N} ight). ]

Setting ([ ext{collision}]=1/2) and solving for (k) gives (k�pprox �pprox 1.177 2). Thus generic collision resistance for an (n)-bit hash degrades to (O(2)) work. For MD5 (n=128) this is (O(2)); for SHA-256 (n=256) this is (O(2)). The distinction highlights why modern hash families move to larger outputs: to maintain infeasibility in the face of the birthday bound. This derivation is standard; see primary cryptography texts for rigorous development. citeturn0search3


9. Password storage, salting, key derivation functions and modern recommendations

9.1 Why a simple hash is insufficient

Storing H(password) (even if H=SHA-256 or H=MD5) in a database exposes the system to offline dictionary and brute-force attacks: an adversary obtaining the database may systematically guess candidate passwords and compute their hashes to find matches. If H is fast (as MD5 and SHA-256 are by design), then attackers can test millions or billions of guesses per second on commodity hardware or GPUs. Therefore naive use of general-purpose hash functions for passwords is discouraged by modern guidance. citeturn1search3turn1search1

9.2 Salting and its mathematical effect

A salt (s) is a per-entry unique, non-secret random string. The stored verifier is typically (H(p|s)) (concatenation or other canonical combination). Salting defends against:

  • Precomputed rainbow tables (tables must be recomputed per salt), and
  • Mass compromise inference (same password yields different stored values for distinct salts).

Mathematically, salting increases the effective entropy of the input to the hash as seen by the attacker, and prevents amortization of computation across many accounts. OWASP and NIST both recommend unique salts and modern key-derivation functions. citeturn1search3turn1search1

9.3 Slow, memory- and CPU-hard functions (Argon2, bcrypt, scrypt, PBKDF2)

Modern password storage requires a deliberately expensive KDF. Key recommendations include:

  • Argon2 (winner of the Password Hashing Competition), with an IETF RFC and implementation guidance (Argon2id is recommended for balanced resistance to side-channel and GPU attacks). Argon2 lets you tune memory, parallelism, and time cost parameters. citeturn1search5turn1search11
  • bcrypt and scrypt remain widely used; scrypt adds memory-hardness but has different trade-offs. PBKDF2 is standardized (e.g., RFC2898) but offers only CPU-cost tuning and is considered less resistant than memory-hard functions for modern adversaries with massive parallel hardware. OWASP and NIST guidance recommend Argon2 or a similar memory-hard function. citeturn1search3turn1search11

9.4 Example scheme (recommended pattern)

For each user record:

  • Generate salt (s) uniformly at random (e.g., 16+ bytes).
  • Compute verifier (v = (password, s; ext{params})).
  • Store ( (s, v, ext{params}) ) in the authentication database.

This design prevents simple dictionary attacks using precomputation and increases the cost per guess due to Argon2's memory and time parameters. citeturn1search5turn1search3


10. Migration and deployment guidance (operational)

  1. Deprecate MD5 for security-critical tasks (signatures, certificate hashing, TLS, etc.). MD5 collisions are practical. Replace with SHA-256 or SHA-3 family where non-password integrity is needed. citeturn0search1turn0search10
  2. Do not use general-purpose fast hashes for passwords. Replace stored MD5/SHA-1/SHA-256 password hashes with a memory-hard KDF (Argon2), using salt and a migration strategy that supports on-login re-hashing. citeturn1search5turn1search3
  3. Use digital-signature schemes over the canonical digest functions mandated by standards (RSA/ECDSA over SHA-2/SHA-3 as appropriate). Avoid signing simple MD5 digests. citeturn3search1turn3search2
  4. Document and monitor: document cryptographic choices, parameter choices (Argon2 cost), and maintain an upgrade plan. Security recommendations evolvemonitor NIST and relevant standards bodies. citeturn1search1

11. Appendix: Selected mathematical proofs and derivations (sketches)

11.1 Birthday bound (derivation sketch)

Let the outputs be uniformly random in a space of size (N=2^n). For (k) samples, the probability all are distinct is [ (1- rac{i}{N} ight). ] Taking logarithms and approximating ((1-x)�pprox -x) yields (([ ext{distinct}]) �pprox - rac{k(k-1)}{2N}). Solving for (k) when ([ ext{distinct}]=1/2) gives (k�pprox). This standard analytic approximation is the origin of the (2) heuristic for collision complexity. citeturn0search3

11.2 MerkleDamgrd lifting (informal proof idea)

Assume compression function (f) admits two inputs (x eq x') such that (f(x)=f(x')). Then select a common prefix and use (x) and (x') as block inputs at the same compression position; because their outputs coincide, the remaining stream of compression steps produces identical chaining values and therefore identical final digests. Formalizing this argument requires careful attention to padding (MD-compliance) and the handling of message lengths; details appear in Damgrd's original treatment. citeturn2search10


12. Selected primary references and standards

  • R. Rivest, The MD5 Message-Digest Algorithm, RFC 1321 (1992). citeturn0search0
  • X. Wang, Y. Yin, H. Yu, Finding Collisions in the MD5 Compression Function (Eurocrypt/Advances in Cryptology) and related reports on breaking MD5. citeturn0search1turn0search21
  • M. Stevens et al., Chosen-prefix collisions for MD5 and applications (papers and technical reports on chosen-prefix collisions and certificate forgeries). citeturn0search2turn0search10
  • NIST, FIPS 180-4: Secure Hash Standard (SHS) (SHA-1, SHA-2 family). citeturn3search1
  • NIST, FIPS 202: SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions. citeturn3search2
  • NIST, SP 800-63B: Digital Identity Guidelines (Authentication and Lifecycle Management) (guidance on password handling). citeturn1search1
  • OWASP, Password Storage Cheat Sheet (practical guidance). citeturn1search3
  • Argon2 specification and RFC (Argon2id, memory-hard KDF recommendations). citeturn1search5turn1search11

13. Concluding remarks

Contemporary cryptography relies on a careful matching of algorithms to use-cases. MD5 remains a historically and pedagogically important design, but it should be avoided for any collision-sensitive security tasks because of practical collision attacks. For integrity and signature applications, SHA-2 and SHA-3 are the modern, standards-backed choices. For password storage, use a memory-hard KDF (Argon2) with unique salts and a well-considered parameter choice. The mathematical theory (birthday bound, MerkleDamgrd reductions, differential cryptanalysis) explains both the limits of generic bounds and why structural attacks can invalidate nave assumptions. Practitioners should follow NIST and OWASP guidance and maintain a migration roadmap away from obsolete constructions. citeturn0search1turn0search10turn3search1



Example Use Cases

Use Case 1: File integrity (not secure with MD5)

echo "Hello World" > file.txt
md5sum file.txt
# Output (128-bit digest): b10a8db164e0754105b7a99be72e3fe5

MD5 is vulnerable to collisions use SHA-256 instead.

Use Case 2: Secure integrity with SHA-256

sha256sum file.txt
# Output (256-bit digest): a591a6d40bf420404a011733cfb7b190

SHA-256 has no known practical collisions as of 2025.

Use Case 3: Password hashing (best practice)

from argon2 import PasswordHasher

ph = PasswordHasher()
hash = ph.hash("mypassword123")
print(hash) # Argon2id hash with unique salt and memory-hard parameters

Do not use raw MD5 or SHA-256 for password storage.


Additional ASCII Diagrams

SHA-3 (Keccak, Sponge construction)

+---------------------------+
| Input split in r-bit rate |
+---------------------------+
 |
 v
 +-------------------+
 | XOR into state |
 | Apply permutation |
 +-------------------+
 |
 v
 [Absorb -> Squeeze]
 |
+----------------------------+
| Output hash (digest) |
+----------------------------+

Conclusion

  • MD5: insecure, useful only for pedagogy or non-critical checksums.
  • SHA-256: robust, de-facto standard for digital signatures and integrity.
  • SHA-3: modern alternative, resistant to length-extension, based on Sponge construction.
  • Password storage: use Argon2id, bcrypt, or scrypt with proper salting and parameters.

MD5, SHA-2, and SHA-3: Constants, Pseudocode, Mathematics, and Security Analysis

Abstract

This document provides a unified reference on three major families of hash functions: MD5, SHA-2 (with emphasis on SHA-256), and SHA-3 (Keccak). It integrates two complementary perspectives:

  • Implementation-oriented: full constant tables, extended pseudocode, and step-by-step algorithmic details.
  • Mathematical and security-oriented: formal definitions, theorems, cryptanalysis, collision and preimage resistance, and modern best practices.

The combined document includes ASCII diagrams, mermaid charts, and practical use cases. It serves as both a technical manual for implementation and a theoretical guide for cryptographic study.


Contents

  1. MD5 detailed pseudocode with constant tables
  2. SHA-256 detailed pseudocode with constant table K[t]
  3. diagrams (as blocks) and ASCII diagrams
  4. Explanatory sections: collisions, salting, KDFs and best practices
  5. References

1. MD5 Detailed Pseudocode and Constants

1.1 Overview

  • Output: 128 bits (16 bytes).
  • Block size: 512 bits.
  • Internal state: A,B,C,D (32-bit words).
  • Padding: append 1 bit, k zero bits until length 448 (mod 512), append 64-bit little-endian length.

1.2 Constant tables

The rotation amounts s[j] and the per-step additive constants T[j] are fixed by the specification (RFC 1321). The message index function g(j) depends on the round.

s[0..63] (left-rotation amounts)

s = [
 7,12,17,22, 7,12,17,22, 7,12,17,22, 7,12,17,22,
 5,9,14,20, 5,9,14,20, 5,9,14,20, 5,9,14,20,
 4,11,16,23, 4,11,16,23, 4,11,16,23, 4,11,16,23,
 6,10,15,21, 6,10,15,21, 6,10,15,21, 6,10,15,21
]

T[0..63] (integer part of 2^32 * abs(sin(i+1)), RFC 1321)

T = [
0xd76aa478,0xe8c7b756,0x242070db,0xc1bdceee,
0xf57c0faf,0x4787c62a,0xa8304613,0xfd469501,
0x698098d8,0x8b44f7af,0xffff5bb1,0x895cd7be,
0x6b901122,0xfd987193,0xa679438e,0x49b40821,
0xf61e2562,0xc040b340,0x265e5a51,0xe9b6c7aa,
0xd62f105d,0x02441453,0xd8a1e681,0xe7d3fbc8,
0x21e1cde6,0xc33707d6,0xf4d50d87,0x455a14ed,
0xa9e3e905,0xfcefa3f8,0x676f02d9,0x8d2a4c8a,
0xfffa3942,0x8771f681,0x6d9d6122,0xfde5380c,
0xa4beea44,0x4bdecfa9,0xf6bb4b60,0xbebfbc70,
0x289b7ec6,0xeaa127fa,0xd4ef3085,0x04881d05,
0xd9d4d039,0xe6db99e5,0x1fa27cf8,0xc4ac5665,
0xf4292244,0x432aff97,0xab9423a7,0xfc93a039,
0x655b59c3,0x8f0ccc92,0xffeff47d,0x85845dd1,
0x6fa87e4f,0xfe2ce6e0,0xa3014314,0x4e0811a1,
0xf7537e82,0xbd3af235,0x2ad7d2bb,0xeb86d391
]

Message index function g(j)

  • For j in 0..15: g = j
  • For j in 16..31: g = (5*j + 1) mod 16
  • For j in 32..47: g = (3*j + 5) mod 16
  • For j in 48..63: g = (7*j) mod 16

1.3 Full MD5 pseudocode (with constants)

// Helper: LEFTROTATE(x, n) rotates x left by n bits on 32-bit word
function MD5(M):
 // 1. Preprocessing
 original_bit_len = len(M) * 8
 append bit '1' to M
 append k bits '0' where k is smallest >=0 such that (len(M)+1+k) 448 (mod 512)
 append 64-bit little-endian integer original_bit_len
 parse padded M' into N 512-bit blocks M[0..N-1]

 // 2. Initialize registers
 A0 = 0x67452301
 B0 = 0xEFCDAB89
 C0 = 0x98BADCFE
 D0 = 0x10325476

 // 3. Tables: s[0..63], T[0..63] as defined above

 for i = 0 to N-1:
 // break block into 16 32-bit little-endian words M_i[0..15]
 A = A0; B = B0; C = C0; D = D0
 for j = 0 to 63:
 if 0 j 15:
 f = (B C) | ((~B) D)
 g = j
 else if 16 j 31:
 f = (D B) | ((~D) C) // equivalent form of G
 g = (5*j + 1) mod 16
 else if 32 j 47:
 f = B ^ C ^ D
 g = (3*j + 5) mod 16
 else:
 f = C ^ (B | (~D))
 g = (7*j) mod 16

 tmp = (A + f + M_i[g] + T[j]) mod 2^32
 tmp = LEFTROTATE(tmp, s[j])
 // rotate registers
 A = D
 D = C
 C = B
 B = (B + tmp) mod 2^32

 // add this block's result to the current hash value
 A0 = (A0 + A) mod 2^32
 B0 = (B0 + B) mod 2^32
 C0 = (C0 + C) mod 2^32
 D0 = (D0 + D) mod 2^32

 // output (little-endian)
 digest = LE32(A0) || LE32(B0) || LE32(C0) || LE32(D0)
 return digest

2. SHA-256 Detailed Pseudocode and K Constants

2.1 Overview

  • Output: 256 bits.
  • Block size: 512 bits.
  • Internal state: H0..H7 (eight 32-bit words).
  • Padding: append 1 bit, k 0 bits until length 448 (mod 512), append 64-bit big-endian length.

2.2 K[0..63] constants (first 32 bits of the fractional parts of the cube roots of the first 64 primes)

K = [
0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
]

2.3 SHA-256 pseudocode (detailed)

function SHA256(M):
 // Preprocessing
 original_bit_len = len(M) * 8
 append bit '1' to M
 append k bits '0' until len(M)+1+k 448 (mod 512)
 append 64-bit big-endian integer original_bit_len
 parse padded M' into N 512-bit blocks M[0..N-1]

 // Initialize hash values (big-endian)
 H0 = 0x6a09e667
 H1 = 0xbb67ae85
 H2 = 0x3c6ef372
 H3 = 0xa54ff53a
 H4 = 0x510e527f
 H5 = 0x9b05688c
 H6 = 0x1f83d9ab
 H7 = 0x5be0cd19

 for i = 0 to N-1:
 // prepare the message schedule W[0..63]
 for t = 0 to 15:
 W[t] = BE32(M_i[t]) // parse 32-bit big-endian words
 for t = 16 to 63:
 s0 = (ROTR^7(W[t-15])) xor (ROTR^18(W[t-15])) xor (W[t-15] >> 3)
 s1 = (ROTR^17(W[t-2])) xor (ROTR^19(W[t-2])) xor (W[t-2] >> 10)
 W[t] = (W[t-16] + s0 + W[t-7] + s1) mod 2^32

 // Initialize working variables
 a = H0; b = H1; c = H2; d = H3; e = H4; f = H5; g = H6; h = H7

 // Main loop
 for t = 0 to 63:
 S1 = (ROTR^6(e)) xor (ROTR^11(e)) xor (ROTR^25(e))
 ch = (e f) xor ((~e) g)
 temp1 = (h + S1 + ch + K[t] + W[t]) mod 2^32
 S0 = (ROTR^2(a)) xor (ROTR^13(a)) xor (ROTR^22(a))
 maj = (a b) xor (a c) xor (b c)
 temp2 = (S0 + maj) mod 2^32

 h = g
 g = f
 f = e
 e = (d + temp1) mod 2^32
 d = c
 c = b
 b = a
 a = (temp1 + temp2) mod 2^32

 // Add the compressed chunk to the current hash value
 H0 = (H0 + a) mod 2^32
 H1 = (H1 + b) mod 2^32
 H2 = (H2 + c) mod 2^32
 H3 = (H3 + d) mod 2^32
 H4 = (H4 + e) mod 2^32
 H5 = (H5 + f) mod 2^32
 H6 = (H6 + g) mod 2^32
 H7 = (H7 + h) mod 2^32

 // Produce the final digest (big-endian)
 digest = BE32(H0) || BE32(H1) || BE32(H2) || BE32(H3) || BE32(H4) || BE32(H5) || BE32(H6) || BE32(H7)
 return digest

3. Diagrams and ASCII Diagrams

: MD5

flowchart TD
 A[Input message M] --> B[Padding + 64-bit LE length]
 B --> C[Split into 512-bit blocks (M_i)]
 C --> D[Parse block -> W[0..15] (32-bit LE words)]
 D --> E[Initialize A,B,C,D (IVs)]
 E --> F[Compression: 4 rounds 16 steps]
 F --> G[Per step: tmp = A + f(B,C,D) + W[g] + T; B = B + ROL(tmp,s)]
 G --> H[Update chaining values]
 H --> I[After all blocks: Digest = LE(A)||LE(B)||LE(C)||LE(D)]
Loading

: SHA-256

flowchart TD
 A[Input message M] --> B[Padding + 64-bit BE length]
 B --> C[Split into 512-bit blocks]
 C --> D[Message schedule W[0..63]]
 D --> E[Initialize a..h (H0..H7)]
 E --> F[64 rounds: temp1,temp2,0,1,Ch,Maj]
 F --> G[Update working vars a..h]
 G --> H[Add to Hash State H0..H7]
 H --> I[Digest = H0||...||H7 (256 bits)]
Loading

ASCII compact diagram: MD5 single-block

+---------------------------+
| 512-bit block M_i |
| W0 W1 ... W15 |
+-----------+---------------+
 |
 +------v-------+
 | Init A,B,C,D |
 +------v-------+
 | 64 step loop |
 | (4 rounds) |
 +------v-------+
 | Add to state |
 +--------------+

4. Collisions, Salting, KDFs and Best Practices (Concise Summary)

  • MD5: Practical collision and chosen-prefix collision methods exist. Avoid for security-critical contexts.
  • SHA-1: Broken for collision resistance (e.g., SHAttered demonstration). Avoid.
  • SHA-256: No practical collisions as of 2025; recommended for signatures and integrity.
  • SHA-3: Sponge construction; conservative alternative to SHA-2 family.
  • Password storage: Use Argon2id or bcrypt/scrypt with unique salts. Do not store raw MD5/SHA-256 hashes for passwords.

5. References

  • Rivest, R. "The MD5 Message-Digest Algorithm", RFC 1321, 1992.
  • NIST FIPS 180-4, Secure Hash Standard (SHA-2), 2015.
  • NIST FIPS 202, SHA-3 Standard (Keccak), 2015.
  • Wang, X., Yin, Y., Yu, H.: "Finding Collisions in the MD5 Compression Function."
  • Stevens, M. et al.: "Chosen-prefix collisions for MD5 and applications."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment