InfoQ

InfoQ

News

My Bookmarks

Login or Register to enable bookmarks for unlimited time.

The content has been bookmarked!

There was an error bookmarking this content! Please retry.

Jeff Moser's How .NET Regular Expressions Really Work

Posted by Jonathan Allen on Apr 01, 2009

Sections
Development,
Architecture & Design
Topics
.NET ,
Compilers ,
.NET Framework ,
Domain Specific Languages

Jeff Moser's has done an in-depth study of how regular expressions work in .NET. His article covers the core operating principals of Microsoft’s implementation such as the machine code used by compiled regular expressions.

The first thing he reveals is that the last 15 regular expressions are cached. For those little utility applications that only use one or two expressions, this means explicitly creating a RegEx object is probably not necessary.

When compiling a regular expression, the first step consists of a scanner than emits a RegexTree. Looking at just the leaf node, this resembles the source code to a fair extend. Next this is translated into the machine code of the regular expression engine.

The bulk of the work is done by the 250 line switch statement that makes up the EmitFragment function. This function breaks up RegexTree "fragments" and converts them to a simpler RegexCode.

[…]

The reward for all this work is an integer array that describes the RegexCode "op codes" and their arguments. You can see that some instructions like "Setrep" take a string argument. These arguments point to offsets in a string table. This is why it was critical to pack everything about a set into the obscure string we saw earlier. It was the only way to pass that information to the instruction.

Decoding the code array, we see:

Index

Instruction

Op Code/Argument

String Table Reference

Description

0

Lazybranch

23

 

Lazily branch to the Stop instruction at offset 21.

1

 

21

 

2

Setmark

31

 

Push our current state onto a stack in case we need to backtrack later.

3

Multi

12

 

Perform a multi-character match of string table item 0 which is "http://".

4

 

0

"http://"

5

Setmark

31

 

Push our current state onto a stack in case we need to backtrack later.

6

Setrep

2

 

Perform a set repetition match of length 1 on the set stored at string table position 1, which represents [^\s/].

7

 

1

"\x1\x2\x1\x2F\x30\x64"

8

 

1

 

9

Setloop

5

 

Match the set [^\s/] in a loop at most Int32.MaxValue times.

10

 

1

"\x1\x2\x1\x2F\x30\x64"

11

 

2147483647

 

12

Capturemark

32

 

Capture into group #1, the string between the mark set by the last Setmark and the current position.

13

 

1

 

14

 

-1

 

15

Oneloop

3

 

Match Unicode character 47 (a '/') in a loop for a maximum of 1 time.

16

 

47

 

17

 

1

 

18

Capturemark

32

 

Capture into group #0, the contents between the first Setmark instruction and the current position.

19

 

0

 

20

 

-1

 

21

Stop

40

 

Stop the regex.

We can now see that our regex has turned into a simple "program" that will be executed later.

You can read more about this process on Jeff Moser's blog. His article also covers

  • Prefix Optimizations
  • The Interpreter
  • Backtracking
  • Know Bugs
Does anyone proofread these things? by Michael Alonso Posted
  1. Back to top

    Does anyone proofread these things?

    by Michael Alonso

    A quick read quickly found 3 errors in spelling or grammar. I'm not a grammar nazi, but it really interrupts the rapid pace of reading that most people use to read blogs.

Educational Content

New-age Transactional Systems - Not Your Grandpa's OLTP

John Hugg discusses high volume transaction processing applications with high and low frequency profiles, and how VoltDB can be used for that purpose.

Cool Code

Kevlin Henney examines code samples to see what can be learned from them starting from the premise that one won’t write great code unless he knows how to read it.

Collaboration: At the Extremities of Extreme

Jason Ayers share the observations he made watching a team of developers collaborating in real time on the same code base, pushing XP, pair programming and continuous integration to their extremes.

Yesod Web Framework

Michael Snoyman presents Yesod, a web framework written in Haskell and containing a web server, templating, ORM, libraries (templating, gravatar, etc.).

Transactions without Transactions

Richard Kreuter and Kyle Banker on how to avoid classical RDBMS transactional systems by using compensation mechanisms, transactional messaging or transactional procedures.

Attila Szegedi on JVM and GC Performance Tuning at Twitter

Attila Szegedi talks about performance tuning Java and Scala programs at Twitter: how to approach GC problems, the importance of asynchronous I/O, when to use MySQL/Cassandra/Redis, and much more.

10 tips on how to prevent business value risk

One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.

Interview: Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives

InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.