Regular Expressions with non-capturing groups and Terraform

Regular Expressions (RegEx) are the most geeky stuff that is still borderline useful in IT.

They describe patterns of strings, for example what a valid e-mail address or a valid URL looks like.

We all know what a mail address looks like:

someone@domain.org

Here is its pattern as RegEx:

.*@[^.]+.*

Here is how you read that:


A mail address is an arbitrary amount of arbitrary characters (.*) followed by the @ sign (@), followed by one or more characters that are not a dot ([^.]+), followed by an arbitrary amount of arbitrary characters (.*).

The dot stands for any character, and the asterisk for an arbitrary amount. [^.] stands for any character that is not a dot, + stands for one or more occurrences. 
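To see these pieces at work, here is a minimal sketch that uses Python's re module as a stand-in (grep, sed and Terraform each use their own regex flavour, which can differ in detail). The helper name looks_like_mail is made up for this illustration.

```python
import re

# The pattern from above: anything, an @, one or more non-dot characters, anything.
MAIL_PATTERN = re.compile(r".*@[^.]+.*")

def looks_like_mail(text: str) -> bool:
    """Return True if the whole text fits the pattern (hypothetical helper)."""
    return MAIL_PATTERN.fullmatch(text) is not None

print(looks_like_mail("someone@domain.org"))          # True
print(looks_like_mail("this is not a mail address"))  # False: there is no @
print(looks_like_mail("a@."))                         # False: [^.]+ needs a non-dot after the @
```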

The expression is not quite right yet, but I will leave it like this so that we run into the issues and correct them below.

RegEx in Linux

First, let's fire up a Linux Terminal and see if we can use this:



grep -P searches for Perl-compatible regular expressions. I start it and type "this is not a mail address". Nothing happens. But when I type a legal mail address, "me@domain.org", grep grabs it and prints it again. This way, you can extract every line from a text file that is a mail address... did you notice? There is already the first problem: grep matches line by line. So a line in a text file that contains a mail address will be listed completely:


So if we want to extract all mail addresses from a text file, this is useless, because grep prints the rest of the line as well, not only the address.
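The line-by-line behaviour can be imitated in a few lines. This is only a sketch of what grep does, written in Python; the function name grep_lines and the sample text are made up for the illustration.

```python
import re

MAIL_PATTERN = re.compile(r".*@[^.]+.*")

def grep_lines(text: str) -> list:
    """Return every line that contains a match, the way grep prints whole lines."""
    return [line for line in text.splitlines() if MAIL_PATTERN.search(line)]

sample = "My mail is me@domain.org\nthis is not a mail address\nme@domain.org"
for line in grep_lines(sample):
    print(line)
```

The first line comes back complete, including "My mail is", which is exactly the problem described above.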

We can do better with the command sed:


sed replaces every mail address with "censored" here. Did you notice? The first line,

My mail is me@domain.org

is interpreted entirely as a mail address. sed thinks that even the word "mail" is part of the address. This is because we said "an arbitrary amount of arbitrary characters", and arbitrary characters include blanks. We should have written "an arbitrary amount of characters that are not blanks". Here we go:
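The difference is easy to reproduce in a small sketch. Python's re.sub stands in for sed here, and the second pattern is only one possible way to spell "no blanks"; it is an assumption, not necessarily the exact expression from the original sed call.

```python
import re

GREEDY = r".*@[^.]+.*"            # the original pattern
NO_BLANKS = r"[^ ]*@[^ .]+[^ ]*"  # assumed fix: blanks are excluded everywhere

line = "My mail is me@domain.org"

print(re.sub(GREEDY, "censored", line))     # censored
print(re.sub(NO_BLANKS, "censored", line))  # My mail is censored
```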


Now you understand how pattern matching works with RegEx under Linux - you can use the command grep to filter any text file for lines containing string patterns, and you can use sed to replace string patterns inside files.

RegEx in Programming

Now, RegEx are also useful in programming. Most languages support them, among them PHP. The mediasyntax plugin, for example, uses them to recognize certain markup that you write into your DokuWiki.

Here is one example:



If the mediasyntax plugin detects a newline (\n) followed by a blank, followed by an arbitrary amount of arbitrary characters (.*) in a wiki page, a code block is triggered and your text appears monospaced, as if it was written in a console. But what do the question marks mean in the expression above? We will come to this later.

RegEx in Terraform

Today I discovered that you can use RegEx in Terraform as well. My job was to extract a VPC name from a service account name. This looks like this:

serviceaccount@network1.whatever.org

The part I needed from the above is network1. Needless to say, this must work for thousands of service accounts to come.

OK, let's get rolling. First of all, Terraform supports RegEx and also has a console command (terraform console) that you can use to test it. Here is how:


Now, with this, it is easy to extract the middle part that we need. Give me the string that starts with an @, followed by an arbitrary amount of arbitrary characters, followed by a dot:

> regex("@.*[.]","bla@foo.bar")

"@foo."

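For comparison, the same pattern gives the same result outside of Terraform. This is a sketch with Python's re module, not Terraform code; group(0) is the whole matched text, which is what a pattern without groups returns.

```python
import re

match = re.search(r"@.*[.]", "bla@foo.bar")
whole = match.group(0) if match else None
print(whole)  # @foo.
```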
Hm. Not bad, but not perfect either. The @ must be there, but it must not be printed. The same goes for the dot at the end. The solution is to use groups. Groups help you to cut the string into pieces. Whatever you surround with parentheses becomes a group, and you can throw away groups that you don't need. In this case, we cut it into three pieces:


The output is no longer a single string, but an array of the three groups. Now we set the first and the last to be non-capturing by starting them with ?: right after the opening parenthesis:
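Both steps can be tried out in a short sketch. It uses Python's re module rather than Terraform, and the two patterns are my reconstruction of the idea described here, so treat them as assumptions. Python reports the groups as a tuple where Terraform shows a list.

```python
import re

text = "bla@foo.bar"

# Three capturing groups: every parenthesised piece is reported.
three = re.search(r"(@)(.*)([.])", text).groups()
print(three)  # ('@', 'foo', '.')

# (?:...) still has to match, but is left out of the result.
one = re.search(r"(?:@)(.*)(?:[.])", text).groups()
print(one)  # ('foo',)
```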


This is still not what we want: we wanted just one output, "foo" in this case. But thanks to Terraform this is not an issue. We just take the only element of the array, which is [0]:
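Put together, the extraction could look like the sketch below. It is written in Python, the helper name vpc_name is hypothetical, and the pattern is an assumption on my side. One deliberate change: the middle group is [^.]+ instead of .*, because a greedy .* runs up to the last dot and would return "network1.whatever" for the service account name from the beginning of this section.

```python
import re

def vpc_name(service_account: str) -> str:
    """Return the part between the @ and the first dot (hypothetical helper)."""
    # [^.]+ instead of .* so the match stops at the first dot, not the last one.
    match = re.search(r"(?:@)([^.]+)(?:[.])", service_account)
    if match is None:
        raise ValueError(f"no VPC name found in {service_account!r}")
    # Only one group is captured, so it sits at index 0.
    return match.groups()[0]

print(vpc_name("bla@foo.bar"))                           # foo
print(vpc_name("serviceaccount@network1.whatever.org"))  # network1
```

The same pattern should carry over to Terraform's regex function, but I have not tested that here.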


References

https://www.terraform.io/docs/language/functions/regex.html

https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions

https://github.com/tstaerk/mediasyntax
