Email Validation w/ C# Regular Expressions

A simple C# Email Validator that uses Regular Expressions

C#, Code, RegExp, Regular Expressions

There are thousands of email validation Regular Expressions available, some more robust (or more brittle, depending on your views) than others. This one is very simple, which makes it a good example of how to use Regular Expressions with C#.

What's a Regular Expression?

Quite simply, a Regular Expression is the term attributed to a method of pattern matching within text.

In theoretical computer science and formal language theory, a regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations.

Wikipedia

In its simplest form an expression could just be looking for the existence of a character within a string. Of course, you could just as easily use myString.Contains(value).
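
As a quick sketch of that simplest case, here is the same check done both ways. The myString value is made up for illustration, and the snippet assumes C# 9 or later so that it can use top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

var myString = "first.last@example.com";

// Plain substring test: no regular expression needed.
bool viaContains = myString.Contains("@");

// The same test written as a regular expression.
bool viaRegex = Regex.IsMatch(myString, "@");

Console.WriteLine($"{viaContains} {viaRegex}"); // prints "True True"
```

For a fixed piece of text like this, Contains is the simpler choice; the expression only earns its keep once the thing being looked for is a pattern rather than a literal.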

At the other extreme, an expression might be used to pick out entries in a log file that were added within a certain time frame and contain some complex pattern.

Is it just a C# thing?

No. I'm not going to say all, but you'd expect to find Regular Expressions in most modern programming languages, especially higher-level ones like C#, JavaScript and Go, to name a few.

There are also commands, such as grep on POSIX/*nix operating systems (and on Windows via Cygwin). In fact, grep is a great tool for searching logs.

Editors such as Vim, and even Visual Studio, also support Regular Expressions, although some implementations are much better than others; notably, Visual Studio's Search/Replace support isn't as powerful as Vim's.

When to use Regular Expressions

For me, I tend to use them where the task lies somewhere between the two extremes described above.

At the heavier end, it might be more efficient to use lexical analysis for larger chunks of text, or where you have to execute multiple regular expressions against a piece of text, such as when parsing HTML or templating.

They are a very useful means of validating & matching values; notably, for web developers, form data submitted by users, such as:

  • Format of a date
  • Email Addresses (Below)
  • Postal / Zip Codes
  • Phone numbers
  • Credit/Debit Card numbers

The Core Switches & Operators

There are a number of 'switches'/'operators' that are used to describe individual characters or sets of characters. Here are some of what I call the core set.

  • [a-z] - Range of lower case*1 letters from a to z

  • [A-Z] - Range of upper case*1 letters from A to Z

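  • [efg] - One of the letters e, f or g (a comma inside [ ] would itself be matched literally)

  • (ea|fb|gc|aaa) - One of the letter combinations ea, fb, gc or aaa; [ ] only ever matches a single character, so combinations need ( ) and |

  • [0-9] - Range of digits from 0 to 9

  • (12|14|16|1001) - One of the numbers 12, 14, 16 or 1001

  • \w - Match any word character *2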

  • \d - Match any Digit

  • \D - Match any non-digit

  • * - Zero or more

  • + - One or more

  • {n,m} - Number of occurrences, from n to m

  • \ - Escape sequence

  • ^ - Starts with

  • $ - Ends with

  • ( ~ exp ~ ) - Defines identifiable parts/matches within the expression *3

*1 Case can be ignored globally on the regular expression

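*2 Matches A to Z, a to z, 0 to 9 and _ (underscore); in .NET it also matches other Unicode letters and digits unless RegexOptions.ECMAScript is set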

*3 Where exp is a matching expression
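
To see a few of these in action, here is a minimal sketch using the static Regex.IsMatch and Regex.Match methods from System.Text.RegularExpressions. The sample inputs are invented for illustration, and the snippet assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// ^ and $ tie the pattern to the start and end of the input; + means one or more.
bool lowerOnly = Regex.IsMatch("hello", "^[a-z]+$");   // true
bool mixedCase = Regex.IsMatch("Hello", "^[a-z]+$");   // false: 'H' is upper case

// *1: case can be ignored for the whole expression.
bool ignoreCase = Regex.IsMatch("Hello", "^[a-z]+$", RegexOptions.IgnoreCase); // true

// {n,m} limits the number of occurrences; \d matches a digit.
bool threeDigits = Regex.IsMatch("123", @"^\d{2,4}$");    // true
bool fiveDigits  = Regex.IsMatch("12345", @"^\d{2,4}$");  // false: more than 4

// \w matches letters, digits and the underscore.
bool wordChars = Regex.IsMatch("snake_case1", @"^\w+$");  // true

// ( ) marks a part of the match that can be read back afterwards.
Match m = Regex.Match("order-42", @"^([a-z]+)-(\d+)$");
string word = m.Groups[1].Value;    // "order"
string number = m.Groups[2].Value;  // "42"

Console.WriteLine($"{lowerOnly} {mixedCase} {ignoreCase} {threeDigits} {fiveDigits} {wordChars} {word} {number}");
```

The patterns containing a backslash are written as verbatim strings (@"...") so that C# passes \d and \w through to the regex engine untouched.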

Validating an Email Address

Before we can validate some textual content we must define the rules it has to follow.

Let's look at how to validate an email address.

The Structure of an Email Address

An email address is made up of 2 main parts: the mailbox/username (the value before the @) and the domain name.

       '@' delimiter  
              V
    first.last@example.com
    ^________^ ^_________^    
     Mailbox     Domain

The Mailbox / Username

There are some assumptions we can make about a mailbox. These are:

  • The first and last chars must be alphanumeric

  • Can only contain :

    • Letters A to Z

    • Numeric values

    • The Chars : +, -, _ & .

      • Cannot start or end with one of these

The Domain Name

Similar rules can be applied to the domain name.

We could also test whether the Top Level Domain (TLD), .com or .uk for example, is one of the currently available TLDs. There are, however, a couple of reasons to avoid testing the TLD. Primarily, the list of TLDs changes over time, so the Regular Expression would have to be updated as new TLDs are made available. Additionally, adding the definition for all TLDs would bloat the Regular Expression: IANA listed at least 440 TLDs at the time of writing.

  • The first and last chars must be alphanumeric

  • Can only contain :

    • Letters A to Z

    • Numeric values

    • The hyphen char -

      • Only the hostname can contain -

      • Hostname cannot end with -

      • Each hostname level ends with .

    • TLD cannot end with .

  • Multiple-level hostnames, e.g. mailbox@example.tld & mailbox@alt.example.tld

Building the Regular Expression

With some rules in place we can use them to write expressions for the different parts before combining them into a single expression.

Regular Expression for mailbox

We'll handle case sensitivity later.

1 - Starts with alphanumeric - Requires the use of [0-9a-z] to match a letter or a digit.

"[0-9a-z]"

We'll need to use ^ to denote starts with, later.

2 - Can contain certain non-alphanumerics - Requires that we define the acceptable chars.

"[\+\._-]"

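Note the use of \ on + & . here. Outside [ ] these chars have a special meaning and must be escaped in order to be literally matched; inside [ ] the escape is optional but harmless. The - is matched literally because it is the last char inside the [ ], and _ never requires the escape char. In C# source, write the pattern as a verbatim string, @"[\+\._-]", otherwise the compiler rejects \+ and \. as unrecognised escape sequences.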

3 - Cannot end with the allowable non-alphanumerics - With the same alphanumeric range [0-9a-z] and the + operator to denote one or more, we can say match at least one.

"[0-9a-z]+"

Combining Part 2 with this, and making use of (, ) & * to define these parts and how they are structured, we get:

"([\+\._-][0-9a-z]+)*"  

Part 2 matches one of the allowable chars and, through the use of +, the alphanumeric range ensures that at least one alphanumeric is matched. Ordered in this way, if any of the allowable non-alphanumeric chars are matched they must precede an alphanumeric char.

Thus : "+a", "-x" & ".c" are allowed

Whereas : "a+", "x-" & "c." are not allowed

Wrapping the expression with ( & )* denotes that any number of matches can be found, zero or more. If the mailbox doesn't contain any of the allowable chars it is not affected by this expression/condition.
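
A small sketch of this step on its own may help. The ^ and $ anchors are added here only so that the whole sample string has to match; they are not part of the expression above, and the sample strings are made up.

```csharp
using System;
using System.Text.RegularExpressions;

// The special-char group on its own, anchored so the whole input must match.
var separatorGroups = new Regex(@"^([\+\._-][0-9a-z]+)*$");

bool plusA   = separatorGroups.IsMatch("+a");    // true: + precedes an alphanumeric
bool dashDot = separatorGroups.IsMatch("-x.c");  // true: two groups, "-x" and ".c"
bool empty   = separatorGroups.IsMatch("");      // true: * allows zero groups
bool aPlus   = separatorGroups.IsMatch("a+");    // false: + has no alphanumeric after it
bool twoDots = separatorGroups.IsMatch("..c");   // false: the first . is followed by another .

Console.WriteLine($"{plusA} {dashDot} {empty} {aPlus} {twoDots}");
```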

4 - Putting it together - Taking Part 1 followed by Part 3, wrapping the pair in ( )+ and prefixing the starts-with operator ^, we have the following Regular Expression for matching the mailbox/username portion of the email address.

"^([0-9a-z]([\+\._-][0-9a-z]+)*)+"

Here the expression is wrapped and the + operator says that at least one match is required; any additional matches are also allowed.
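
To try the mailbox expression by itself, the sketch below appends a $ so that the whole input has to be a mailbox. That $ is an assumption made for this test only (in the full expression the @ follows instead), and the IsMailbox helper name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// The mailbox expression from above, plus a trailing $ for this isolated test.
var mailboxOnly = new Regex(@"^([0-9a-z]([\+\._-][0-9a-z]+)*)+$");

// Hypothetical helper used only in this example.
bool IsMailbox(string value) => mailboxOnly.IsMatch(value);

Console.WriteLine(IsMailbox("first.last"));   // True
Console.WriteLine(IsMailbox("a+b_c-d"));      // True
Console.WriteLine(IsMailbox(".first"));       // False: starts with a special char
Console.WriteLine(IsMailbox("first."));       // False: ends with a special char
Console.WriteLine(IsMailbox("first..last"));  // False: two special chars in a row
```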

Regular Expression for Domain

Using the same approach we can ensure that the domain starts with an alphanumeric char and that any hyphen precedes an alphanumeric:

1 Any hyphen must precede an alphanumeric - Using the same technique used for the mailbox, but with different chars, we have :

"([-][0-9a-z]+)*" 

Adding in the single alphanumeric match for our starts-with rule, and the one-or-more operator +, we get :

"([0-9a-z]([-][0-9a-z]+)*)+"

The same rules apply: if the hostname doesn't contain a - (hyphen) then only the first part of the expression is used.

This will match example from example.com.

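2 Multiple-level hostnames - Wrapping Part 1 in a group and appending the literal value \., to test for example. from example.com and alt.example. from alt.example.com, we have :

"(([0-9a-z]([-][0-9a-z]+)*)+\.)+"

The \. must sit outside the Part 1 group. Placed inside it, every alphanumeric matched by Part 1 would need its own trailing dot, and example. would no longer match.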

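A sketch for checking the hostname levels in isolation follows. It keeps the one-or-more + of Part 1 in its own group before the \., so that levels longer than one char (such as example.) match. The ^ and $ anchors and the IsHostLevels helper name are assumptions made for this test only.

```csharp
using System;
using System.Text.RegularExpressions;

// Part 1 wrapped in its own group, then the literal dot, repeated per level.
var hostLevels = new Regex(@"^(([0-9a-z]([-][0-9a-z]+)*)+\.)+$");

// Hypothetical helper used only in this example.
bool IsHostLevels(string value) => hostLevels.IsMatch(value);

Console.WriteLine(IsHostLevels("example."));      // True
Console.WriteLine(IsHostLevels("alt.example."));  // True
Console.WriteLine(IsHostLevels("my-host."));      // True
Console.WriteLine(IsHostLevels("host-."));        // False: - must precede an alphanumeric
Console.WriteLine(IsHostLevels("-host."));        // False: must start with an alphanumeric
Console.WriteLine(IsHostLevels("example"));       // False: no trailing .
```
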
3 The TLD - The TLD here is assumed to contain only letters and to be at least 2 chars in length. I'm not sure what the exact upper limit is, but as the word TECHNOLOGY is available as a TLD, let's say up to 10 chars.

The TLD in this example is only the very top of the domain: com from example.com and uk from example.co.uk. Part 2 will already match co. from co.uk as a level of the hostname.

With this in mind, all that's required is to take our alpha matcher, [a-z], and combine it with a restriction on the min & max occurrences via {n,m}, with n and m set to 2 and 10 respectively. We also need to append the ends-with operator $ so it only matches at the right-hand end of the string.

"[a-z]{2,10}$"

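As a rough illustration of what the TLD expression picks out when used on its own, the sketch below reads back the matched text with Regex.Match. The domain names are invented sample inputs.

```csharp
using System;
using System.Text.RegularExpressions;

var tld = new Regex("[a-z]{2,10}$");

string com  = tld.Match("example.com").Value;         // "com"
string uk   = tld.Match("example.co.uk").Value;       // "uk"
string tech = tld.Match("example.technology").Value;  // "technology" (10 chars)
bool oneChar = tld.IsMatch("example.c");              // false: fewer than 2 chars

// On its own the pattern is not tied to the dot, so an 11-char ending still
// matches on its last 10 chars. In the full expression the \. that closes the
// hostname comes directly before the TLD and rules this out.
string tooLong = tld.Match("example.photography").Value; // "hotography"

Console.WriteLine($"{com} {uk} {tech} {oneChar} {tooLong}");
```
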
Add in the @ Delimiter & the Final Expression

Taking the mailbox & domain name expressions, combining them with the @ delimiter we can build the full expression.

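var exp = @"^([0-9a-z]([\+\._-][0-9a-z]+)*)+"    +  // Mailbox (verbatim string keeps \+ and \.)
          "@"                                    +  // @ Delimiter
          "("                                    +  // Open the domain group
          @"(([0-9a-z]([-][0-9a-z]+)*)+\.)+"     +  // Hostname levels, each ending in .
          "[a-z]{2,10}"                          +  // TLD
          ")$"                                      // Close the domain group; ends with
          ;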
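
One way the finished expression might be used is sketched below. The IsValidEmail helper, the part names and the 250 ms timeout are assumptions for illustration, not part of the original expression. Case is handled with RegexOptions.IgnoreCase, and because groups with + wrapped around groups with * can backtrack heavily on some non-matching inputs, the sketch also passes a match timeout to the Regex constructor.

```csharp
using System;
using System.Text.RegularExpressions;

// The three parts built up above (Hostname wraps Part 1 before the dot).
const string Mailbox  = @"([0-9a-z]([\+\._-][0-9a-z]+)*)+";
const string Hostname = @"(([0-9a-z]([-][0-9a-z]+)*)+\.)+";
const string Tld      = "[a-z]{2,10}";

var emailRegex = new Regex(
    "^" + Mailbox + "@(" + Hostname + Tld + ")$",
    RegexOptions.IgnoreCase | RegexOptions.CultureInvariant, // a-z also matches A-Z
    TimeSpan.FromMilliseconds(250));                          // assumed budget per match

// Hypothetical helper: true when the input follows the rules in this article.
bool IsValidEmail(string input)
{
    if (string.IsNullOrEmpty(input)) return false;
    try
    {
        return emailRegex.IsMatch(input);
    }
    catch (RegexMatchTimeoutException)
    {
        return false; // treat inputs that exhaust the time budget as invalid
    }
}

Console.WriteLine(IsValidEmail("first.last@example.com"));    // True
Console.WriteLine(IsValidEmail("First.Last@Example.CO.UK"));  // True
Console.WriteLine(IsValidEmail("first.@example.com"));        // False
Console.WriteLine(IsValidEmail("first.last@example"));        // False
Console.WriteLine(IsValidEmail("first.last@-example.com"));   // False
```

Note that in .NET $ also matches just before a final newline, so if a trailing line break must be rejected it may be worth trimming the input first or using \z instead.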

Main Regex Image (above) via Mike Dixson