Perhaps the biggest obstacle you face as a developer being introduced to Regex is usually not what it is used for, but rather how to use it. Learning Regex isn't always a straightforward process, and you might struggle with:
- The large vocabulary of symbols you have to learn
- Poor examples which are too complex, created by other people who haven't really mastered Regex themselves
- Not running into enough practical use cases to hone your skills
A good way to overcome these hurdles is to incorporate regular expressions in your workday when doing "fire and forget" tasks like finding and replacing common occurrences of words in text files, or transforming file content to fit another format. In this post, I will show you how you can do just that.
A practical example
Imagine that you have been tasked with setting up 1-to-1 redirects for an ASP.NET website located at https://example.se. You have received an Excel sheet with two columns: Old URLs and New URLs. It contains 100 rows, and each redirect destination may be a page on the same website or on a completely different one.
You know that you need a rewrite rule along with a rewrite map, so you set up the following rule and map:
Rule
The rule takes a relative URL, matches it to an entry in a rewrite map, and if a mapped absolute destination URL is found, the request will be redirected.
<rewrite>
<rules>
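<!-- IIS requires a unique name on every rule; "RedirectFromMyMap" is just a placeholder -->
<rule name="RedirectFromMyMap">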
<match url=".*" />
<conditions logicalGrouping="MatchAll" trackAllCaptures="true">
<add input="{MyMap:{REQUEST_URI}}" pattern="(.+)" ignoreCase="true" />
</conditions>
<action type="Redirect" redirectType="Permanent" url="{C:1}" appendQueryString="true" />
</rule>
</rules>
...
</rewrite>
Map
<rewrite>
...
<rewriteMaps>
<rewriteMap name="MyMap">
<!-- Entries go here -->
</rewriteMap>
</rewriteMaps>
</rewrite>
You quickly realise that setting these up manually is both time-consuming and error-prone, so you look for a faster solution.
You could build a console app which takes the path to a CSV file, reads it, transforms it to a correctly formatted rewrite map and saves it in a new XML file.
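To get a feel for what that first option involves, here is a minimal Python sketch of such a tool. It assumes a comma-separated file with the old URL in the first column and the new URL in the second. The function name build_rewrite_map and the sample rows are made up for illustration, so treat it as a starting point rather than a finished converter.

```python
import csv
import io
from xml.sax.saxutils import quoteattr

OLD_HOST = "https://example.se"  # host from the example; stripped from each key


def build_rewrite_map(csv_text: str, map_name: str = "MyMap") -> str:
    """Turn 'old URL,new URL' rows into a <rewriteMap> XML fragment."""
    lines = [f"<rewriteMap name={quoteattr(map_name)}>"]
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        old_url, new_url = row[0].strip(), row[1].strip()
        if not old_url.startswith(OLD_HOST):
            continue  # also skips a header row such as "Old URLs,New URLs"
        key = old_url[len(OLD_HOST):]
        # quoteattr adds the surrounding quotes and escapes &, < and >
        lines.append(f"  <add key={quoteattr(key)} value={quoteattr(new_url)} />")
    lines.append("</rewriteMap>")
    return "\n".join(lines)


# Hypothetical sample data, standing in for the exported sheet.
sample = (
    "Old URLs,New URLs\n"
    "https://example.se/old-page,https://example.se/new-page\n"
    "https://example.se/shop?id=1&lang=sv,https://shop.example.com/\n"
)
rewrite_map = build_rewrite_map(sample)
print(rewrite_map)
```

Even this small version has to deal with header rows, malformed lines and XML escaping, which is quite a lot of ceremony for a one-off task.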
Alternatively, you could treat this as a perfect opportunity to sharpen your regular expression skills! But where do we start?
Many text editors and IDEs will have an advanced search and replace function which supports regular expressions. So boot up your editor of choice and open a new empty tab.
In this example, I will use my favorite editor to date - Visual Studio Code.
Note that your regular expression may depend on the exact format you received the redirects in, for example a CSV or an XLSX file. To keep this post simple and straightforward, we assume it's an XLSX file and simply skip the step of converting it to a CSV.
So, open up the Excel sheet and copy everything from it and paste into your empty editor tab.
What you have now is a plain-text document with one row per redirect, where the old and new URLs are separated by whitespace (typically a tab character when copying from Excel).
Now the steps to convert all this into a list of XML-formatted rewrite map entries become a simple task.
Search for this (remember to check "Regex" in your search field):
^https:\/\/example\.se(.+?)\s+(.+)$
and replace with this:
<add key="$1" value="$2" />
Works like a charm! Easy, right?
But hey, let's break it down a bit and see how it works.
- ^https:\/\/example\.se(.+?) indicates that we want to match each row that starts with the protocol and hostname of the old URLs. We then only want to preserve the path, hence we only include what comes after the hostname in a capture group.
- \s+ indicates that we expect at least one whitespace character. Since we don't want to count exactly how many (and who knows, the number may differ between rows), we simply match _any_ whitespace that may exist.
- (.+)$ indicates that we want to preserve the destination URL, found at the end of each row, exactly as is.
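If you want to try the same search and replace outside the editor, it can be sketched in a few lines of Python. The rows below are hypothetical stand-ins for the pasted sheet. Note that Python writes the backreferences as \1 and \2 where Visual Studio Code uses $1 and $2, and needs the MULTILINE flag so that ^ and $ match at every row.

```python
import re

# The search pattern from above; "/" needs no escaping in Python.
PATTERN = re.compile(r"^https://example\.se(.+?)\s+(.+)$", re.MULTILINE)

# Python writes backreferences as \1 and \2 where VS Code uses $1 and $2.
REPLACEMENT = r'<add key="\1" value="\2" />'

# Hypothetical rows as pasted from a spreadsheet (tab between the columns).
pasted = (
    "https://example.se/old-page\thttps://example.se/new-page\n"
    "https://example.se/about-us\thttps://other.example.com/about\n"
)

entries = PATTERN.sub(REPLACEMENT, pasted)
print(entries)
```

One thing to keep in mind: neither the editor nor this sketch escapes characters that are special in XML, so a URL containing & would still need to be written as &amp; in the final map.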
But wait, something looks off with the first capturing group? Why does it contain a question mark, and the other doesn't?
Enter, the lazy quantifier.
The lazy quantifier
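In our example, the first capturing group matches one or more occurrences of any character, and it must be followed by whitespace. Without the question mark, part of that whitespace can end up inside the capture group. Why?
Since . also matches whitespace characters, a greedy .+ does not stop at the first whitespace it meets. It first consumes the rest of the row and then gives back only as many characters as the rest of the expression needs in order to match, which is a single whitespace character for \s+ followed by the destination URL. If the columns are separated by more than one whitespace character, all but the last of them stay in the capture group, which we do not want.
Common quantifiers like + and * are greedy by default. In other words, they match as much as possible, which can be more than you intended if the rest of the expression does not provide an unmistakable stopping point.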
This is solved by combining the quantifier with ? - the lazy quantifier.
The question mark has two meanings. Alone, it is a "regular" quantifier like + and *, and means zero or one occurrence of the preceding character. If, on the other hand, it is combined with another quantifier like +, it changes that quantifier from greedy to lazy. "Lazy" simply means that the expression will match as few characters as possible.
- .+ - Greedy
- .+? - Lazy
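The two meanings of the question mark are easy to see side by side. Here is a small Python sketch with made-up input strings, first using ? on its own and then combined with +:

```python
import re

# Alone, ? makes the preceding character optional (zero or one occurrence).
optional = re.findall(r"colou?r", "color colour")

text = "<a><b>"
# Greedy: .+ runs as far as it can, so the match ends at the last ">".
greedy = re.search(r"<.+>", text).group()
# Lazy: .+? stops as soon as the rest of the pattern can match.
lazy = re.search(r"<.+?>", text).group()

print(optional)  # ['color', 'colour']
print(greedy)    # <a><b>
print(lazy)      # <a>
```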
By simply changing (.+) to (.+?), the first capture group is forced to stop at the first whitespace character it encounters.
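To see the difference on one of our rows, here is a short Python sketch. The row is hypothetical and uses three spaces between the columns, so that there is surplus whitespace for the greedy version to swallow.

```python
import re

# Hypothetical row with three spaces between the two columns.
row = "https://example.se/old-page   https://example.se/new-page"

greedy = re.match(r"^https://example\.se(.+)\s+(.+)$", row)
lazy = re.match(r"^https://example\.se(.+?)\s+(.+)$", row)

greedy_key = greedy.group(1)
lazy_key = lazy.group(1)

# The greedy group keeps two of the three spaces; the lazy group keeps none.
print(repr(greedy_key))  # '/old-page  '
print(repr(lazy_key))    # '/old-page'
```

The second capture group is the same in both cases. With a single tab between the columns the two patterns would happen to give identical results, but the lazy version does not depend on that.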
Final words
Regular expressions can seem really difficult at first and I'm sure you will bump into more complex use cases in the future. But look at it this way:
We have now managed to convert an entire set of 1-to-1 redirects into a valid XML-formatted rewrite map, all with a simple regular expression. Once you get the hang of it, it can become a powerful tool in your development arsenal.
A tip for the road
A great way to learn more about regular expressions is to use the awesome Regex playground found at https://regex101.com.
It lets you test your regular expressions on a chunk of text that you specify yourself, and it provides an extensive list of all available symbols and modifiers, along with a short description of what they do and an example of how to use them.