Last active
October 27, 2016 14:46
Let’s create a JavaScript-compatible regular expression that matches any URL code point, as per the URL Standard.
// “The URL code points are ASCII alphanumeric, "!", "$", "&", "'", "(", ")",
// "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code
// points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFEF,
// U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to
// U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000
// to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD,
// U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to
// U+FFFFD, U+100000 to U+10FFFD.”
// Source: http://url.spec.whatwg.org/#url-code-points

// Let’s create a JavaScript-compatible regular expression that matches any URL
// code point, as per the above definition.

var regenerate = require('regenerate'); // http://mths.be/regenerate

var set = regenerate()
  .addRange(0x0030, 0x0039) // ASCII digits
  .addRange(0x0041, 0x005A).addRange(0x0061, 0x007A) // ASCII alpha
  .add(
    '!', '$', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';',
    '=', '?', '@', '_', '~'
  )
  .addRange(0x00A0, 0xD7FF)
  .addRange(0xE000, 0xFDCF)
  .addRange(0xFDF0, 0xFFEF)
  .addRange(0x10000, 0x1FFFD)
  .addRange(0x20000, 0x2FFFD)
  .addRange(0x30000, 0x3FFFD)
  .addRange(0x40000, 0x4FFFD)
  .addRange(0x50000, 0x5FFFD)
  .addRange(0x60000, 0x6FFFD)
  .addRange(0x70000, 0x7FFFD)
  .addRange(0x80000, 0x8FFFD)
  .addRange(0x90000, 0x9FFFD)
  .addRange(0xA0000, 0xAFFFD)
  .addRange(0xB0000, 0xBFFFD)
  .addRange(0xC0000, 0xCFFFD)
  .addRange(0xD0000, 0xDFFFD)
  .addRange(0xE1000, 0xEFFFD)
  .addRange(0xF0000, 0xFFFFD)
  .addRange(0x100000, 0x10FFFD);

console.log(set.toString());
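The definition quoted in the comment can also be written as a plain predicate on numeric code points, which may be handy for spot-checking the generated pattern. This is only a minimal sketch that assumes the ranges exactly as quoted in this file; `isUrlCodePoint` is a hypothetical name and is not part of regenerate or the URL Standard.

```javascript
// Hypothetical helper (not part of the gist or of regenerate): a direct
// translation of the quoted definition into a predicate on code points.
function isUrlCodePoint(cp) {
  // ASCII alphanumeric
  if ((cp >= 0x30 && cp <= 0x39) ||
      (cp >= 0x41 && cp <= 0x5A) ||
      (cp >= 0x61 && cp <= 0x7A)) {
    return true;
  }
  // The individually listed ASCII punctuation characters
  if (cp < 0x80) {
    return "!$&'()*+,-./:;=?@_~".indexOf(String.fromCharCode(cp)) !== -1;
  }
  // Ranges inside the Basic Multilingual Plane
  if ((cp >= 0x00A0 && cp <= 0xD7FF) ||
      (cp >= 0xE000 && cp <= 0xFDCF) ||
      (cp >= 0xFDF0 && cp <= 0xFFEF)) {
    return true;
  }
  // Supplementary planes 1 to 16: U+N0000 to U+NFFFD, except that the
  // quoted text starts plane 14 at U+E1000 rather than U+E0000.
  if (cp >= 0x10000 && cp <= 0x10FFFD) {
    var plane = cp >> 16;
    var offset = cp & 0xFFFF;
    if (offset > 0xFFFD) return false;
    if (plane === 0xE && offset < 0x1000) return false;
    return true;
  }
  return false;
}

console.log(isUrlCodePoint(0x2F));    // "/" is listed
console.log(isUrlCodePoint(0x25));    // "%" is not listed
console.log(isUrlCodePoint(0x1F4A9)); // inside U+10000 to U+1FFFD
console.log(isUrlCodePoint(0xE0001)); // below U+E1000, so excluded
```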
Thanks, this script helps me a lot.
There's something in the generated string I don't understand. The pattern like [.....]|[.....], is that necessary? Can I simply replace that with a single [..........]?
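A hedged note on this question: as far as I can tell, the alternation is needed because a JavaScript regular expression without the `u` flag matches UTF-16 code units, not code points. A code point above U+FFFF is stored as two code units (a surrogate pair), and a single character class can only match one code unit, so the astral ranges have to be written as a high-surrogate class followed by a low-surrogate class. The sketch below illustrates this; `toSurrogates` and `hex` are hypothetical helpers, not part of regenerate.

```javascript
// Hypothetical helper: split an astral code point (above U+FFFF) into its
// UTF-16 high and low surrogate code units.
function toSurrogates(cp) {
  var offset = cp - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Hypothetical helper: uppercase hexadecimal for display.
function hex(n) {
  return n.toString(16).toUpperCase();
}

// U+10000 and U+1FFFD are the two ends of the first astral range in the set.
console.log(toSurrogates(0x10000).map(hex)); // [ 'D800', 'DC00' ]
console.log(toSurrogates(0x1FFFD).map(hex)); // [ 'D83F', 'DFFD' ]

// One astral symbol occupies two code units.
var pile = String.fromCharCode(0xD83D, 0xDCA9); // U+1F4A9
console.log(pile.length); // 2

// A single class matches exactly one code unit, so it cannot match the pair.
console.log(/^[\uD83D\uDCA9]$/.test(pile)); // false

// Two classes in sequence, as in the generated pattern, can.
console.log(/^[\uD800-\uDBFF][\uDC00-\uDFFF]$/.test(pile)); // true

// In an engine that supports the ES2015 `u` flag, one class is enough.
console.log(new RegExp('^[\\u{10000}-\\u{1FFFD}]$', 'u').test(pile)); // true
```

This also suggests why the last alternative in the generated string ends in `[\uDC00-\uDFFD]`: the high surrogates listed there are the ones whose plane stops at U+NFFFD, so their low surrogate may not go past `\uDFFD`.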
The set.toString() at the end returns the following JavaScript string:
'[\\x21\\x24\\x26-\\x3B\\x3D\\x3F-Z\\x5Fa-z\\x7E\\xA0-\\uD7FF\\uE000-\\uFDCF\\uFDF0-\\uFFEF]|[\\uD800-\\uD83E\\uD840-\\uD87E\\uD880-\\uD8BE\\uD8C0-\\uD8FE\\uD900-\\uD93E\\uD940-\\uD97E\\uD980-\\uD9BE\\uD9C0-\\uD9FE\\uDA00-\\uDA3E\\uDA40-\\uDA7E\\uDA80-\\uDABE\\uDAC0-\\uDAFE\\uDB00-\\uDB3E\\uDB44-\\uDB7E\\uDB80-\\uDBBE\\uDBC0-\\uDBFE][\\uDC00-\\uDFFF]|[\\uD83F\\uD87F\\uD8BF\\uD8FF\\uD93F\\uD97F\\uD9BF\\uD9FF\\uDA3F\\uDA7F\\uDABF\\uDAFF\\uDB3F\\uDB7F\\uDBBF\\uDBFF][\\uDC00-\\uDFFD]'
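Logging it shows the same pattern with each doubled backslash printed as a single one:
[\x21\x24\x26-\x3B\x3D\x3F-Z\x5Fa-z\x7E\xA0-\uD7FF\uE000-\uFDCF\uFDF0-\uFFEF]|[\uD800-\uD83E\uD840-\uD87E\uD880-\uD8BE\uD8C0-\uD8FE\uD900-\uD93E\uD940-\uD97E\uD980-\uD9BE\uD9C0-\uD9FE\uDA00-\uDA3E\uDA40-\uDA7E\uDA80-\uDABE\uDAC0-\uDAFE\uDB00-\uDB3E\uDB44-\uDB7E\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uD83F\uD87F\uD8BF\uD8FF\uD93F\uD97F\uD9BF\uD9FF\uDA3F\uDA7F\uDABF\uDAFF\uDB3F\uDB7F\uDBBF\uDBFF][\uDC00-\uDFFD]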
This can easily be used as part of a regular expression literal in JavaScript.
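As an illustration of that last point, here is a minimal sketch that wraps the generated pattern in an anchored, repeated group to test whole strings. The pattern is copied from the string returned above and split at its two `|` alternations for readability; the variable names `source` and `urlCodePoints`, and the choice to anchor with `^(?:...)+$`, are assumptions made for this example.

```javascript
// The pattern returned by set.toString(), copied here in three pieces
// (BMP class, general surrogate pairs, pairs whose plane ends at U+NFFFD).
var source = [
  '[\\x21\\x24\\x26-\\x3B\\x3D\\x3F-Z\\x5Fa-z\\x7E\\xA0-\\uD7FF\\uE000-\\uFDCF\\uFDF0-\\uFFEF]',
  '[\\uD800-\\uD83E\\uD840-\\uD87E\\uD880-\\uD8BE\\uD8C0-\\uD8FE\\uD900-\\uD93E' +
    '\\uD940-\\uD97E\\uD980-\\uD9BE\\uD9C0-\\uD9FE\\uDA00-\\uDA3E\\uDA40-\\uDA7E' +
    '\\uDA80-\\uDABE\\uDAC0-\\uDAFE\\uDB00-\\uDB3E\\uDB44-\\uDB7E\\uDB80-\\uDBBE' +
    '\\uDBC0-\\uDBFE][\\uDC00-\\uDFFF]',
  '[\\uD83F\\uD87F\\uD8BF\\uD8FF\\uD93F\\uD97F\\uD9BF\\uD9FF\\uDA3F\\uDA7F\\uDABF' +
    '\\uDAFF\\uDB3F\\uDB7F\\uDBBF\\uDBFF][\\uDC00-\\uDFFD]'
].join('|');

// Anchor the pattern and allow one or more URL code points.
var urlCodePoints = new RegExp('^(?:' + source + ')+$');

console.log(urlCodePoints.test('foo/bar?baz=qux')); // true
console.log(urlCodePoints.test('foo bar'));         // false (space)
console.log(urlCodePoints.test('100%'));            // false ("%")
console.log(urlCodePoints.test('I\uD83D\uDCA9'));   // true (U+1F4A9 as a pair)
console.log(urlCodePoints.test('\uD83D'));          // false (lone surrogate)
```

The non-capturing group matters here: without it, the `^` and `$` anchors would bind only to the first and last alternatives instead of the whole pattern.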