Regular Expressions (Regex) Complete Guide: Patterns, Syntax & Real Examples (2026)
Master regex from scratch. Learn character classes, quantifiers, groups, lookaheads, and real-world patterns for email validation, URL parsing, log analysis, and more.
What Is a Regular Expression?
A regular expression (regex) is a sequence of characters that defines a search pattern. It is used to match, extract, replace, or validate text. Regex is supported in virtually every programming language — Python, JavaScript, PHP, Java, Go, Ruby — and in command-line tools like grep, sed, and awk.
A regex pattern like \b[A-Z][a-z]+\b means: match a word that starts with an uppercase letter followed by one or more lowercase letters. Once you understand the building blocks, you can construct patterns for almost any text processing task.
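To see that pattern in action, here is a minimal Python sketch. The sample sentence is made up for illustration; only the pattern itself comes from the text above.

```python
import re

# Hypothetical sample sentence, used only for illustration.
text = "Alice met Bob in Paris, then emailed NASA and iPhone support."

# \b[A-Z][a-z]+\b : one uppercase letter, then one or more lowercase letters,
# with a word boundary on each side.
capitalised = re.findall(r"\b[A-Z][a-z]+\b", text)
print(capitalised)  # ['Alice', 'Bob', 'Paris']
```

"NASA" and "iPhone" are skipped because neither is a single uppercase letter followed only by lowercase letters between two word boundaries.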
Core Syntax Reference
Literal Characters
Most characters match themselves. The pattern hello matches the exact string "hello" wherever it appears in the input text.
Character Classes [...]
Match any one character from a set:
- [abc]: matches "a", "b", or "c"
- [a-z]: matches any lowercase letter
- [A-Za-z0-9]: matches any alphanumeric character
- [^abc]: matches any character EXCEPT "a", "b", or "c" (negation)
- [a-zA-Z_][a-zA-Z0-9_]*: matches a valid identifier (letter or underscore, followed by letters/digits/underscores)
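The identifier and negation classes can be tried out with a short Python sketch. The candidate names are hypothetical, and fullmatch is used here (an assumption about intent) so that the whole string has to fit the pattern.

```python
import re

# The identifier pattern from the list above; fullmatch requires the
# entire candidate string to match.
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Hypothetical candidate names.
candidates = ["total", "_tmp1", "2fast", "snake_case", "kebab-case", ""]
valid = [name for name in candidates if IDENTIFIER.fullmatch(name)]
print(valid)  # ['total', '_tmp1', 'snake_case']

# A negated class matches any single character that is NOT listed.
not_abc = re.findall(r"[^abc]", "aXbYc")
print(not_abc)  # ['X', 'Y']
```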
Shorthand Character Classes
- \d: digit (same as [0-9])
- \D: non-digit (same as [^0-9])
- \w: word character (letter, digit, or underscore: [a-zA-Z0-9_])
- \W: non-word character
- \s: whitespace (space, tab, newline, etc.)
- \S: non-whitespace
- . (dot): any character except newline (use [\s\S] to include newlines)
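A small Python sketch of the shorthand classes, using a made-up sample string:

```python
import re

# Hypothetical sample; "\t" is a real tab character.
sample = "Order 66 shipped\ton 2024-05-01"

digits = re.findall(r"\d+", sample)   # runs of digits
words = re.findall(r"\w+", sample)    # runs of word characters
fields = re.split(r"\s+", sample)     # split on any run of whitespace

print(digits)  # ['66', '2024', '05', '01']
print(words)   # ['Order', '66', 'shipped', 'on', '2024', '05', '01']
print(fields)  # ['Order', '66', 'shipped', 'on', '2024-05-01']

# The dot stops at a newline; [\s\S] does not.
dot_match = re.findall(r"a.b", "a\nb")
any_match = re.findall(r"a[\s\S]b", "a\nb")
print(dot_match)  # []
print(any_match)  # ['a\nb']
```

In Python 3, \d and \w are Unicode-aware for str patterns, so they also match non-ASCII digits and letters. Passing re.ASCII restricts them to the [0-9] and [a-zA-Z0-9_] sets listed above.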
Quantifiers
Control how many times the preceding element must match:
- *: 0 or more times
- +: 1 or more times
- ?: 0 or 1 time (makes the element optional)
- {3}: exactly 3 times
- {2,5}: between 2 and 5 times
- {3,}: 3 or more times
By default, quantifiers are greedy: they match as much as possible. Add ? after a quantifier to make it lazy, so that it matches as little as possible. For example, <.+> applied to <b>bold</b> matches the entire string, while <.+?> matches only <b>.
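The greedy and lazy behaviour can be checked directly. This is a minimal Python sketch using the same <b>bold</b> input:

```python
import re

html = "<b>bold</b>"

# Greedy: .+ runs to the end, then backs off to the last ">".
greedy = re.search(r"<.+>", html).group()
print(greedy)  # <b>bold</b>

# Lazy: .+? stops at the first ">" it can.
lazy_first = re.search(r"<.+?>", html).group()
print(lazy_first)  # <b>

# With findall, the lazy pattern finds each tag separately.
lazy_all = re.findall(r"<.+?>", html)
print(lazy_all)  # ['<b>', '</b>']
```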
Anchors
Anchors do not match characters — they match positions in the string:
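- ^: start of string (or start of line in multiline mode)
- $: end of string (or end of line in multiline mode)
- \b: word boundary (between a \w and a \W character, or between a \w character and the start or end of the string)
- \B: non-word boundary
- \A: absolute start of string (Python/Java/Ruby; ignores multiline)
- \Z: absolute end of string in Python; in Java, Ruby, and PCRE, \Z also matches just before a final newline, and \z is the strict end-of-string anchor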
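The following Python sketch shows how the anchors change which positions match. The three-line input is invented for the example.

```python
import re

# Hypothetical three-line input.
text = "cat\nconcatenate\ncat food"

# Without MULTILINE, ^ matches only at the very start of the string.
start_only = re.findall(r"^cat", text)
per_line = re.findall(r"^cat", text, re.MULTILINE)
print(start_only)  # ['cat']
print(per_line)    # ['cat', 'cat']

# \A ignores MULTILINE and still matches only at the absolute start.
absolute = re.findall(r"\Acat", text, re.MULTILINE)
print(absolute)    # ['cat']

# \b requires a boundary on each side; \B requires the opposite.
whole_word = re.findall(r"\bcat\b", text)
inside_word = re.findall(r"\Bcat\B", text)
print(whole_word)   # ['cat', 'cat']
print(inside_word)  # ['cat']  (the one inside "concatenate")
```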
Groups and Alternation
- (abc): capturing group; matches "abc" and captures the match for later use
- (?:abc): non-capturing group; groups without capturing (slightly more efficient)
- cat|dog: alternation; matches "cat" or "dog"
- (cat|dog)s?: matches "cat", "cats", "dog", or "dogs"
- \1, \2: backreferences; refer to the content of group 1, group 2
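Named groups:
(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
Names the groups "year", "month", and "day" for readable extraction. The (?P<name>...) form is Python and PCRE syntax; JavaScript, Java, and .NET write (?<name>...) instead.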
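Here is a short Python sketch of named groups and backreferences. The input strings are illustrative, not taken from any real data set.

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Hypothetical input sentence.
m = DATE.search("Released on 2024-05-17.")
parts = m.groupdict() if m else {}
print(parts)  # {'year': '2024', 'month': '05', 'day': '17'}

# Named groups can be reused in a replacement via \g<name>.
reordered = DATE.sub(r"\g<day>/\g<month>/\g<year>", "2024-05-17")
print(reordered)  # 17/05/2024

# A numbered backreference (\1) finds an accidentally repeated word.
repeated = re.search(r"\b(\w+) \1\b", "it is is fine").group()
print(repeated)  # is is
```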
Lookaheads and Lookbehinds
Lookaround assertions match a position based on what comes before or after it, without consuming the characters:
- (?=abc): positive lookahead; position is followed by "abc"
- (?!abc): negative lookahead; position is NOT followed by "abc"
- (?<=abc): positive lookbehind; position is preceded by "abc"
- (?<!abc): negative lookbehind; position is NOT preceded by "abc"
Pattern: (?<=\$)\d+(\.\d{2})?
Input: "Pay $29.99 or $9.00 monthly"
Matches: "29.99", "9.00" (the $ symbol is checked but not included in the match)
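The price example can be reproduced with the Python sketch below. It uses finditer rather than findall, because findall would return only the contents of the optional capturing group. The second pattern is an extra illustration of a negative lookahead and is not from the text above.

```python
import re

text = "Pay $29.99 or $9.00 monthly"

# group(0) is the whole match; the "$" is asserted but not consumed.
prices = [m.group(0) for m in re.finditer(r"(?<=\$)\d+(\.\d{2})?", text)]
print(prices)  # ['29.99', '9.00']

# Negative lookahead: numbers NOT followed by another digit or a "%".
plain_numbers = re.findall(r"\d+(?!\d|%)", "50% of 200")
print(plain_numbers)  # ['200']
```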
Regex Flags (Modifiers)
Flags change how the engine interprets the pattern:
- i: case-insensitive; /hello/i matches "Hello", "HELLO", "hElLo"
- g: global; find all matches, not just the first (JavaScript)
- m: multiline; ^ and $ match the start/end of each line
- s: dotall; makes . match newlines too
- x (verbose): ignore whitespace in the pattern and allow comments (great for long patterns)
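In Python the same flags are passed as constants from the re module. There is no g flag; findall and finditer play that role. This is a minimal sketch over an invented three-line log snippet:

```python
import re

# Hypothetical log snippet.
log = "INFO start\nerror: disk full\nERROR: retrying"

# i + m: case-insensitive, and ^ matches at the start of every line.
errors = re.findall(r"^error", log, re.IGNORECASE | re.MULTILINE)
print(errors)  # ['error', 'ERROR']

# s (dotall): without it, "." cannot cross the newline.
no_dotall = re.search(r"start.*full", log)
with_dotall = re.search(r"start.*full", log, re.DOTALL)
print(no_dotall is None, with_dotall is not None)  # True True

# x (verbose): whitespace is ignored and "#" starts a comment.
DATE = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)
date_parts = DATE.fullmatch("2024-05-17").groups()
print(date_parts)  # ('2024', '05', '17')
```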
20 Essential Regex Patterns
Validation Patterns
# Email address (simplified but practical)
^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$
# URL
https?://[\w\-]+(\.[\w\-]+)+(/[\w\-._~:/?#[\]@!$&'()*+,;=%]*)?
# IPv4 address
^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$
# Phone number (flexible international)
^\+?[\d\s\-().]{7,20}$
# Postal code (US ZIP)
^\d{5}(-\d{4})?$
# Credit card (generic; 13-19 characters made up of digits and spaces)
^[\d ]{13,19}$
# Strong password (min 8 chars, uppercase, lowercase, digit, special)
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$
# Date YYYY-MM-DD
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
# Hex color code
^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$
# Semantic version (1.2.3)
^\d+\.\d+\.\d+$
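A few of these validation patterns can be wrapped in a small helper, sketched here in Python. The VALIDATORS table and the is_valid and is_real_date names are hypothetical. The sketch assumes fullmatch is preferable to search, because in Python $ also matches just before a trailing newline.

```python
import re
from datetime import date

# Hypothetical lookup table; patterns copied from the list above.
VALIDATORS = {
    "ipv4": re.compile(r"^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$"),
    "date": re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$"),
    "hex_color": re.compile(r"^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
}

def is_valid(kind: str, value: str) -> bool:
    """Return True if the whole value matches the named pattern."""
    return VALIDATORS[kind].fullmatch(value) is not None

def is_real_date(value: str) -> bool:
    """Shape check via regex, then a calendar check via datetime."""
    if not is_valid("date", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True

print(is_valid("ipv4", "192.168.1.1"))   # True
print(is_valid("ipv4", "256.1.1.1"))     # False
print(is_valid("us_zip", "12345\n"))     # False (fullmatch rejects the newline)
print(is_valid("date", "2024-02-30"))    # True  (the regex checks shape only)
print(is_real_date("2024-02-30"))        # False (no such calendar day)
```

The last two lines show a limit of regex validation: a pattern can confirm the format, but a semantic check such as a calendar lookup is needed to confirm the value.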
Extraction Patterns
# Extract all URLs from text
https?://[^\s<>"']+
# Extract hashtags
#[a-zA-Z]\w*
# Extract @mentions
@[a-zA-Z0-9_]+
# Extract quoted strings (allowing backslash-escaped characters)
"([^"\\]|\\.)*"
# Extract HTML tag content
<(\w+)[^>]*>(.*?)<\/\1>
# Extract numbers (integers and decimals)
-?\d+(\.\d+)?
# Extract words (no punctuation)
\b[a-zA-Z]+\b
# Extract lines containing a keyword (case-insensitive)
(?im)^.*error.*$
# Extract everything between parentheses
\(([^)]*)\)
# Extract JSON key-value pairs
"(\w+)"\s*:\s*"([^"]*)"
Real-World Use Cases
Log File Analysis
Parse Apache/Nginx access logs to extract IP, timestamp, method, path, and status code:
^(\S+)\s+\S+\s+\S+\s+\[([^\]]+)\]\s+"(\S+)\s+(\S+)\s+\S+"\s+(\d{3})\s+(\d+|-)
Groups: 1=IP, 2=datetime, 3=method, 4=path, 5=status, 6=bytes
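As a sketch of how the six groups might be used, the Python snippet below parses one access-log line into a dictionary. The log line is a made-up example in Common Log Format, and parse_line and the dictionary keys are hypothetical names.

```python
import re

LOG_LINE = re.compile(
    r'^(\S+)\s+\S+\s+\S+\s+\[([^\]]+)\]\s+"(\S+)\s+(\S+)\s+\S+"\s+(\d{3})\s+(\d+|-)'
)

def parse_line(line: str):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    ip, timestamp, method, path, status, size = m.groups()
    return {
        "ip": ip,
        "timestamp": timestamp,
        "method": method,
        "path": path,
        "status": int(status),
        # A "-" in the size field means no body was sent.
        "bytes": 0 if size == "-" else int(size),
    }

# Hypothetical log line.
line = '203.0.113.9 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
entry = parse_line(line)
print(entry["ip"], entry["method"], entry["path"], entry["status"], entry["bytes"])
# 203.0.113.9 GET /index.html 200 2326
```

Returning None for non-matching lines lets the caller decide whether to skip or report malformed entries.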
Data Cleaning
Remove all HTML tags from a string:
<[^>]+> → replace with empty string
Normalise whitespace (replace multiple spaces/tabs/newlines with a single space):
\s+ → replace with " "
Remove leading and trailing whitespace:
^\s+|\s+$ → replace with empty string
Code Analysis
Find all TODO comments in source code:
//\s*TODO:?\s*(.*)
Find function definitions (Python):
^def\s+(\w+)\s*\(([^)]*)\):
Find import statements (JavaScript/TypeScript):
import\s+(?:\{[^}]+\}|[\w*]+)\s+from\s+['"]([^'"]+)['"]
Performance Tips
- Avoid catastrophic backtracking: Nested quantifiers like (a+)+ on a long non-matching string can take exponential time. Use atomic groups or possessive quantifiers when available.
- Prefer non-capturing groups: Use (?:...) instead of (...) when you do not need to capture; it is slightly faster.
- Anchor your patterns: Starting a pattern with ^ or \b tells the engine where a match can begin, which can cut out a lot of unnecessary work.
- Use character classes instead of alternation: [aeiou] is faster than a|e|i|o|u.
- Compile patterns once: In Python (re.compile()) and Java (Pattern.compile()), compile your pattern once and reuse it instead of recompiling it in a loop.
- Be specific: \d{4} is faster than \d+ when you know the exact length.
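Two of these tips are sketched in Python below: compiling a pattern once and flattening a nested quantifier. The names ALERT, count_alerts, NESTED, and FLAT are hypothetical, and the inputs are deliberately short so the nested pattern stays fast.

```python
import re

# Compiled once at import time, then reused for every line.
# Uses a non-capturing group and word-boundary anchors, as suggested above.
ALERT = re.compile(r"\b(?:error|warning)\b", re.IGNORECASE)

def count_alerts(lines):
    """Count lines that contain the whole word 'error' or 'warning'."""
    return sum(1 for line in lines if ALERT.search(line))

# Hypothetical log lines.
sample = ["Error: x", "all good", "WARNING low disk", "errors everywhere"]
print(count_alerts(sample))  # 2

# Nested quantifier (backtracking risk) and a flat pattern for the same strings.
NESTED = re.compile(r"^(a+)+$")
FLAT = re.compile(r"^a+$")

# On short inputs both give the same answers; only the flat form scales.
agree = all(
    bool(NESTED.match(s)) == bool(FLAT.match(s))
    for s in ["a", "aaa", "", "aab"]
)
print(agree)  # True
```

Where rewriting is not possible, Python 3.11 and later also accept atomic groups such as (?>...) and possessive quantifiers such as a++.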
Common Regex Mistakes
- Forgetting to escape metacharacters: ., *, +, ?, (, ), [, {, ^, $, |, \ are special. To match a literal dot, write \. rather than . on its own.
- Using . when you mean "any visible character": . also matches spaces and tabs. If you want to exclude whitespace, use \S.
- Greedy matching eating too much: <.+> on <b>text</b> matches the whole string. Use <.+?> or <[^>]+>.
- Forgetting word boundaries: The pattern cat also matches "concatenate". Use \bcat\b to match the word "cat" only.
- Assuming ^ and $ work per line by default: In most engines, they match the start/end of the entire string. Add the m flag for line-by-line matching.
- Case-sensitive by default: Regex is case-sensitive unless you add the i flag. /hello/ will not match "Hello".
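A few of these mistakes are easy to reproduce. The Python sketch below uses invented inputs, and also shows re.escape, which builds a literal pattern from arbitrary text so that metacharacters do not need escaping by hand.

```python
import re

# Unescaped dot: matches any character, not just ".".
loose = re.findall(r"3.14", "3.14 3x14")
strict = re.findall(r"3\.14", "3.14 3x14")
print(loose)   # ['3.14', '3x14']
print(strict)  # ['3.14']

# re.escape quotes every metacharacter in a literal search term.
needle = "a+b (v1.0)"  # hypothetical search term
literal = re.compile(re.escape(needle))
found = literal.search("formula a+b (v1.0) applies") is not None
print(found)  # True

# Case sensitivity: off only when the flag is given.
case_sensitive = re.search(r"hello", "Hello")
case_insensitive = re.search(r"hello", "Hello", re.IGNORECASE)
print(case_sensitive is None, case_insensitive is not None)  # True True
```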
Regex in Different Languages
# Python
import re
pattern = re.compile(r"\b\w+@\w+\.\w+\b")
matches = pattern.findall("Send to alice@example.com or bob@example.org")  # placeholder addresses
# JavaScript
const emails = text.match(/\b\w+@\w+\.\w+\b/g);
# PHP
preg_match_all('/\b\w+@\w+\.\w+\b/', $text, $matches);
# Go
re := regexp.MustCompile(`\b\w+@\w+\.\w+\b`)
matches := re.FindAllString(text, -1)
# Bash (grep)
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file.txt