Regular Expressions

1. Introduction
2. Character Matching
3. Meta Characters
4. Special Quantifiers
5. Shorthand Character Set
6. Good Examples
7. Command Line Usage
8. Look Around
9. Flags
10. Substitutions
11. When not to use Regex
12. Cheat Sheets

1. Introduction

REGEX mean REGular EXpression which in turn is nothing but sequence of characters. These expressions say for example [0–9] means that the expression should contain numbers. Regular expressions are used in many situation in computer programming. Majorly in search, pattern matching, parsing, filtering of results and so on.

In lay man words, its a kind of rule which programmer tells to the computer to understand. You may see this in website forms, where you are only forced to enter ONLY numbers or ONLY characters or MINIMUM 8 characters and so on, these are controlled by REGEX behind the screen.Imagine you are writing an application and you want to set the rules for when a user chooses their username. We want to allow the username to contain letters, numbers, underscores and hyphens. We also want to limit the number of characters in username so it does not look ugly. We use the following regular expression to validate a username:

Above regular expression would accept bansalnitish but not bansalnitish020297 (too long).

User_name

User name

A good thing about regex is that it is almost language independent. Implementation might vary but the expression itself is more or less the same. See below each script will read the test.txt file, search it using our regular expression:^[0-9]+$, and print all the numbers in file to the console. As of now consider just assume [0-9] means any digit from 0-9, ^ means start of string, + means anything greater than 0 and $ means end of the string.

If we use the following file as test.txt:

1234
abcde
12db2
5362

1

The output would be: ('1234', '5362', '1') in all the following cases -

Javascript / Node.js / Typescript

const fs = require('fs')
const testFile = fs.readFileSync('test.txt', 'utf8')
const regex = /^([0-9]+)$/gm
let results = testFile.match(regex)
console.log(results)

Python

import re
with open('test.txt', 'r') as f:
  test_string = f.read()
  regex = re.compile(r'^([0-9]+)$', re.MULTILINE)
  result = regex.findall(test_string)
  print(result)

PHP

<?php
$myfile = fopen("test.txt", "r") or die("Unable to open file.");
$test_str = fread($myfile,filesize("test.txt"));
fclose($myfile);
$re = '/^[0-9]+$/m';
preg_match_all($re, $test_str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
?>

Java

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;

class FileRegexExample {
  public static void main(String[] args) {
    try {
      String content = new String(Files.readAllBytes(Paths.get("test.txt")));
      Pattern pattern = Pattern.compile("^[0-9]+$", Pattern.MULTILINE);
      Matcher matcher = pattern.matcher(content);
      ArrayList<String> matchList = new ArrayList<String>();

      while (matcher.find()) {
        matchList.add(matcher.group());
      }

      System.out.println(matchList);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

C++

#include <string>
#include <fstream>
#include <iostream>
#include <sstream>
#include <regex>
using namespace std;

int main () {
  ifstream t("test.txt");
  stringstream buffer;
  buffer << t.rdbuf();
  string testString = buffer.str();

  regex numberLineRegex("(^|\n)([0-9]+)($|\n)");
  sregex_iterator it(testString.begin(), testString.end(), numberLineRegex);
  sregex_iterator it_end;

  while(it != it_end) {
    cout << it -> str();
    ++it;
  }
}

Why Regex?

Regular expressions (regex or regexp) are extremely useful in extracting information from any text by searching for one or more matches of a specific search pattern (i.e. a specific sequence of ASCII or unicode characters). Fields of application range from validation to parsing/replacing strings, passing through converting data to other formats and web scraping. One of the most interesting features is that once you’ve learned the syntax, you can actually use this tool in (almost) all programming languages (JavaScript, Java, VB, C #, C / C++, Python, Perl, Ruby, Delphi, R, Tcl, an many others) with the slightest distinctions about the support of the most advanced features and syntax versions supported by the engines).

2. Character matching

A regex can be used for simple searching and matching of a sequence of characters.

For example, the regular expression hat means: the letter h, followed by the letter a, followed by the letter t. hat => She wore a pretty blue hat. Her friend wore a red hat. All hats are cool.

In this way, you can find a match for any sequence of characters you want to. The characters will be searched throughtout the input provided by you. Note that regex is case sensitive.

3. Meta Characters

Meta characters are the building blocks of the regular expressions. Meta characters do not stand for themselves but instead are interpreted in some special way. Some meta characters have a special meaning and are written inside square brackets. The meta characters are as follows:

|Meta character|Description| |:----:|----| |.|Period matches any single character except a line break.| |[ ]|Character class. Matches any character contained between the square brackets.| |[^ ]|Negated character class. Matches any character that is not contained between the square brackets| |*|Matches 0 or more repetitions of the preceding symbol.| |+|Matches 1 or more repetitions of the preceding symbol.| |?|Makes the preceding symbol optional.| |{n,m}|Braces. Matches at least "n" but not more than "m" repetitions of the preceding symbol.| |(xyz)|Character group. Matches the characters xyz in that exact order.| |||Alternation. Matches either the characters before or the characters after the symbol.| |\|Escapes the next character. This allows you to match reserved characters <code>[ ] ( ) { } . * + ? ^ $ \ |</code>| |^|Matches the beginning of the input.| |$|Matches the end of the input.|

3.1 Full stop

Full stop . is the simplest example of meta character. The meta character . matches any single character. It will not match return or newline characters. For example, the regular expression .ar means: any character, followed by the letter a, followed by the letter r.

<pre> ".ar" => The car parked in the garage. </pre>

3.2 Character set

Character sets are also called character class. Square brackets are used to specify character sets. Use a hyphen inside a character set to specify the characters' range. The order of the character range inside square brackets doesn't matter. For example, the regular expression [Tt]he means: an uppercase T or lowercase t, followed by the letter h, followed by the letter e.

<pre> "[Tt]he" => The car parked in the garage. </pre>

A period inside a character set, however, means a literal period. The regular expression ar[.] means: a lowercase character a, followed by letter r, followed by a period . character.

<pre> "ar[.]" => A garage is a good place to park a car. </pre>

3.2.1 Negated character set

In general, the caret symbol represents the start of the string, but when it is typed after the opening square bracket it negates the character set. For example, the regular expression [^c]ar means: any character except c, followed by the character a, followed by the letter r.

<pre> "[^c]ar" => The car parked in the garage. </pre>

3.3 Repetitions

Following meta characters +, * or ? are used to specify how many times a subpattern can occur. These meta characters act differently in different situations.

3.3.1 The Star

The symbol * matches zero or more repetitions of the preceding matcher. The regular expression a* means: zero or more repetitions of preceding lowercase character a. But if it appears after a character set or class then it finds the repetitions of the whole character set. For example, the regular expression [a-z]* means: any number of lowercase letters in a row.

<pre> "[a-z]*" => The car parked in the garage #21. </pre>

The * symbol can be used with the meta character . to match any string of characters .*. The * symbol can be used with the whitespace character \s to match a string of whitespace characters. For example, the expression \s*cat\s* means: zero or more spaces, followed by lowercase character c, followed by lowercase character a, followed by lowercase character t, followed by zero or more spaces.

<pre> "\s*cat\s*" => The fat cat sat on the concatenation. </pre>

3.3.2 The Plus

The symbol + matches one or more repetitions of the preceding character. For example, the regular expression c.+t means: lowercase letter c, followed by at least one character, followed by the lowercase character t. It needs to be clarified that t is the last t in the sentence.

REGularEXpressions

Install / Use

README

Contents