Regenerate
Generate JavaScript-compatible regular expressions based on a given set of Unicode symbols or code points.

Regenerate is a Unicode-aware regex generator for JavaScript. It allows you to easily generate ES5-compatible regular expressions based on a given set of Unicode symbols or code points. (This is trickier than you might think, because of how JavaScript deals with astral symbols.)
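To illustrate why astral symbols are awkward in ES5, here is a minimal plain-JavaScript sketch that does not use Regenerate itself. The helper name `toSurrogatePair` is hypothetical and exists only for this illustration; it applies the standard UTF-16 surrogate formula.

```javascript
// U+1D306 lies outside the Basic Multilingual Plane (an "astral" symbol).
var symbol = '\uD834\uDF06'; // the same string as '𝌆'

// JavaScript strings count UTF-16 code units, so this symbol has length 2.
var length = symbol.length;

// Without the `u` flag, `.` consumes a single code unit,
// so it cannot match the whole symbol.
var dotMatches = /^.$/.test(symbol);

// Hypothetical helper (not part of Regenerate): split an astral
// code point into its high and low surrogate halves.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);
  var low = 0xDC00 + (offset & 0x3FF);
  return [high, low];
}

var pair = toSurrogatePair(0x1D306);
// → [0xD834, 0xDF06]
```

This is why an ES5-compatible pattern for U+1D306 has to be spelled out as the two escapes `\uD834\uDF06`.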
Installation
Via npm:
npm install regenerate
Via Bower:
bower install regenerate
In a browser:
<script src="regenerate.js"></script>
In Node.js, io.js, and RingoJS ≥ v0.8.0:
var regenerate = require('regenerate');
In Narwhal and RingoJS ≤ v0.7.0:
var regenerate = require('regenerate').regenerate;
In Rhino:
load('regenerate.js');
Using an AMD loader like RequireJS:
require(
  {
    'paths': {
      'regenerate': 'path/to/regenerate'
    }
  },
  ['regenerate'],
  function(regenerate) {
    console.log(regenerate);
  }
);
API
regenerate(value1, value2, value3, ...)
The main Regenerate function. Calling this function creates a new set with a chainable API.
var set = regenerate()
.addRange(0x60, 0x69) // add U+0060 to U+0069
.remove(0x62, 0x64) // remove U+0062 and U+0064
.add(0x1D306); // add U+1D306
set.valueOf();
// → [0x60, 0x61, 0x63, 0x65, 0x66, 0x67, 0x68, 0x69, 0x1D306]
set.toString();
// → '[`ace-i]|\\uD834\\uDF06'
set.toRegExp();
// → /[`ace-i]|\uD834\uDF06/
Any arguments passed to regenerate() will be added to the set right away. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.
regenerate(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
var items = [0x1D306, 'A', '©', 0x2603];
regenerate(items).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
regenerate.prototype.add(value1, value2, value3, ...)
Any arguments passed to add() are added to the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.
regenerate().add(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
var items = [0x1D306, 'A', '©', 0x2603];
regenerate().add(items).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
It’s also possible to pass in a Regenerate instance. Doing so adds all code points in that instance to the current set.
var set = regenerate(0x1D306, 'A');
regenerate().add('©', 0x2603).add(set).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
Note that the initial call to regenerate() acts like add(). This allows you to create a new Regenerate instance and add some code points to it in one go:
regenerate(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
regenerate.prototype.remove(value1, value2, value3, ...)
Any arguments passed to remove() are removed from the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.
regenerate(0x1D306, 'A', '©', 0x2603).remove('☃').toString();
// → '[A\\xA9]|\\uD834\\uDF06'
It’s also possible to pass in a Regenerate instance. Doing so removes all code points in that instance from the current set.
var set = regenerate('☃');
regenerate(0x1D306, 'A', '©', 0x2603).remove(set).toString();
// → '[A\\xA9]|\\uD834\\uDF06'
regenerate.prototype.addRange(start, end)
Adds a range of code points from start to end (inclusive) to the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.
regenerate(0x1D306).addRange(0x00, 0xFF).toString();
// → '[\\0-\\xFF]|\\uD834\\uDF06'
regenerate().addRange('A', 'z').toString();
// → '[A-z]'
regenerate.prototype.removeRange(start, end)
Removes a range of code points from start to end (inclusive) from the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.
regenerate()
.addRange(0x000000, 0x10FFFF) // add all Unicode code points
.removeRange('A', 'z') // remove all symbols from `A` to `z`
.toString();
// → '[\\0-@\\{-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'
regenerate()
.addRange(0x000000, 0x10FFFF) // add all Unicode code points
.removeRange(0x0041, 0x007A) // remove all code points from U+0041 to U+007A
.toString();
// → '[\\0-@\\{-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'
regenerate.prototype.intersection(codePoints)
Removes from the set any code points that are not also present in the given codePoints array, leaving only the intersection of the two. codePoints must be an array of numeric code point values, i.e. numbers.
regenerate()
.addRange(0x00, 0xFF) // add extended ASCII code points
.intersection([0x61, 0x69]) // remove all code points from the set except for these
.toString();
// → '[ai]'
Instead of the codePoints array, it’s also possible to pass in a Regenerate instance.
var whitelist = regenerate(0x61, 0x69);
regenerate()
.addRange(0x00, 0xFF) // add extended ASCII code points
.intersection(whitelist) // remove all code points from the set except for those in the `whitelist` set
.toString();
// → '[ai]'
regenerate.prototype.contains(value)
Returns true if the given value is part of the set, and false otherwise. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.
var set = regenerate().addRange(0x00, 0xFF);
set.contains('A');
// → true
set.contains(0x1D306);
// → false
regenerate.prototype.clone()
Returns a clone of the current code point set. Any actions performed on the clone won’t mutate the original set.
var setA = regenerate(0x1D306);
var setB = setA.clone().add(0x1F4A9);
setA.toArray();
// → [0x1D306]
setB.toArray();
// → [0x1D306, 0x1F4A9]
regenerate.prototype.toString(options)
Returns a string representing (part of) a regular expression that matches all the symbols mapped to the code points within the set.
regenerate(0x1D306, 0x1F4A9).toString();
// → '\\uD834\\uDF06|\\uD83D\\uDCA9'
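Because the return value is only a pattern source, it can be embedded in a larger expression. The sketch below hard-codes the string shown above rather than calling Regenerate, so it runs without the library installed; the variable names are illustrative only.

```javascript
// The string documented above as the output for U+1D306 and U+1F4A9.
var source = '\\uD834\\uDF06|\\uD83D\\uDCA9';

// Wrap the alternation in a non-capturing group before adding anchors
// or quantifiers, so they apply to the whole set and not to one branch.
var oneOrMore = new RegExp('^(?:' + source + ')+$');

var matchesBoth = oneOrMore.test('\uD834\uDF06\uD83D\uDCA9');
// → true
var matchesLetter = oneOrMore.test('A');
// → false
```

The grouping matters because the output may contain a top-level `|`.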
If the bmpOnly property of the optional options object is set to true, the output matches surrogates individually, regardless of whether they’re lone surrogates or just part of a surrogate pair. This simplifies the output, but it should only be used when you’re certain that the strings it will be applied to don’t contain any astral symbols.
var highSurrogates = regenerate().addRange(0xD800, 0xDBFF);
highSurrogates.toString();
// → '[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])'
highSurrogates.toString({ 'bmpOnly': true });
// → '[\\uD800-\\uDBFF]'
var lowSurrogates = regenerate().addRange(0xDC00, 0xDFFF);
lowSurrogates.toString();
// → '(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'
lowSurrogates.toString({ 'bmpOnly': true });
// → '[\\uDC00-\\uDFFF]'
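The practical difference between the two high-surrogate outputs above can be sketched in plain JavaScript. The patterns are copied from the documented output; the variable names are illustrative only.

```javascript
// Default output: a high surrogate NOT followed by a low surrogate.
var defaultPattern = new RegExp('[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])');
// bmpOnly output: any high surrogate, paired or not.
var bmpOnlyPattern = new RegExp('[\\uD800-\\uDBFF]');

var loneHigh = '\uD834';      // an unpaired high surrogate
var astral = '\uD834\uDF06';  // U+1D306 as a surrogate pair

var results = [
  defaultPattern.test(loneHigh), // → true
  bmpOnlyPattern.test(loneHigh), // → true
  defaultPattern.test(astral),   // → false
  bmpOnlyPattern.test(astral)    // → true (matches half of the pair)
];
```

The last result shows why bmpOnly is unsafe on input that may contain astral symbols: the pattern matches inside a valid pair.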
Note that lone low surrogates cannot be matched accurately using regular expressions in JavaScript without the use of lookbehind assertions, which aren’t yet widely supported. Regenerate’s output takes a best-effort approach, but there can be false negatives in this regard.
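One way to see the limitation is to compare the documented ES5 pattern with a lookbehind-based pattern on two consecutive lone low surrogates. The lookbehind pattern is an assumption made for comparison, not something Regenerate emits, and it throws a SyntaxError in engines that lack lookbehind support.

```javascript
// ES5-compatible pattern, copied from the documented output above.
var es5Pattern = new RegExp('(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]', 'g');
// Comparison pattern using a lookbehind assertion (assumed, not Regenerate output).
var lookbehindPattern = new RegExp('(?<![\\uD800-\\uDBFF])[\\uDC00-\\uDFFF]', 'g');

var input = '\uDC00\uDC00'; // two lone low surrogates in a row

// The ES5 pattern has to consume the preceding code unit, so the first
// surrogate is swallowed as "context" for the second one.
var es5Matches = input.match(es5Pattern);
// → one match, two code units long

// The lookbehind only inspects the preceding code unit without consuming it.
var lookbehindMatches = input.match(lookbehindPattern);
// → two matches, one code unit each
```

Note also that whenever the ES5 pattern matches after a non-surrogate character, the matched text includes that preceding character.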
If the hasUnicodeFlag property of the optional options object is set to true, the output makes use of Unicode code point escapes (\u{…}) where applicable. This simplifies the output at the cost of compatibility and portability, since it means the output can only be used as a pattern in a regular expression with the ES6 u flag enabled.
var set = regenerate().addRange(0x0, 0x10FFFF);
set.toString();
// → '[\\0-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'
set.toString({ 'hasUnicodeFlag': true });
// → '[\\0-\\u{10FFFF}]'
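The compatibility cost can be sketched by compiling the hasUnicodeFlag output shown above with and without the `u` flag. The pattern string is hard-coded from the documented output, and the variable names are illustrative only.

```javascript
// Output documented above for the full Unicode range with hasUnicodeFlag.
var source = '[\\0-\\u{10FFFF}]';

// With the `u` flag, \u{10FFFF} is a code point escape and the class
// matches whole code points, including astral ones.
var withFlag = new RegExp('^' + source + '$', 'u');
// Without the flag, the class matches a single code unit at most.
var withoutFlag = new RegExp('^' + source + '$');

var astral = '\uD834\uDF06'; // U+1D306

var matchedWithFlag = withFlag.test(astral);
// → true
var matchedWithoutFlag = withoutFlag.test(astral);
// → false
```

In other words, output generated with hasUnicodeFlag should always be compiled with the `u` flag.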
regenerate.prototype.toRegExp(flags = '')
Returns a regular expression that matches all the symbols mapped to the code points within the set. Optionally, you can pass flags to be added to the regular expression.
var regex = regenerate(0x1D306, 0x1F4A9).toRegExp();
// → /\uD834\uDF06|\uD83D\uDCA9/
regex.test('𝌆');
// → true
regex.test('A');
// → false
// With flags:
var regex = regenerate(0x1D306, 0x1F4A9).toRegExp('g');
// → /\uD834\uDF06|\uD83D\uDCA9/g
Note: This probably shouldn’t be used. Regenerate is intended as a tool that is used as part of a build process, not at runtime.
regenerate.prototype.valueOf() or regenerate.prototype.toArray()
Returns a sorted array of unique code points in the set.
regenerate(0x1D306)
.addRange(0x60, 0x65)
.add(0x59, 0x60) // note: 0x59 is added after 0x65, and 0x60 is a duplicate
.valueOf();
// → [0x59, 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x1D306]
regenerate.version
A string representing the semantic version number.
