How to extract links from an XML file using Linux grep

Recently I offered to give a colleague of mine the links to all the blogs I follow so she could read them, too. It was easier said than done. Needless to say, wordpress.com is very friendly in this case and gives you an option to export all the blogs you follow. You simply go to “Reader” > “Blogs I follow”, click “Edit” and then “Export”, which appears under the text input field.

So far, so good. I was given an .opml file (an XML-like markup), and this is where my misery began. Keep in mind that I currently follow about 150 blogs, so extracting all the links manually would have been a punishment I didn’t want 🙂 and handing an ugly .xml file to my co-worker would have been even uglier.

Solution.

I’ve done such exercises with regular expressions while learning C#, and I believe I could even do it in C, but dealing with stream readers and writers wasn’t as efficient as I wanted. Then one thing popped into my mind.

Regular expressions + Linux.

Regular expressions are widespread in almost every programming language, so it’s no surprise they can be used in the Linux shell, too. With a single command we can do the same job that, in a programming language, means opening a file (with some stream), reading it, writing to another file, and then closing both. The sweetest part is that we don’t have to care about any of it; it’s all done by grep and the shell. But let’s give some context to our task.

The markup.

This was tricky. I won’t paste the actual .xml because I don’t want to have these blogs spammed, but it looked just like this:

So as we see, there is a lot of information in there and every link appears twice, so we’ll have to keep this in mind, too.
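Just to have something to play with, here is a made-up stand-in for such a file. The blog names and the example.* addresses are invented, and the layout (one outline element per blog, with the same link in both htmlUrl and xmlUrl) is only an assumption about what such an export looks like, not the real file:

```shell
# Hypothetical sample: blog names and example.* URLs are made up.
opml_file="$(mktemp)"
cat > "$opml_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.1">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="First Blog" type="rss" htmlUrl="https://first.example.com" xmlUrl="https://first.example.com"/>
    <outline text="Second Blog" type="rss" htmlUrl="http://second.example.net" xmlUrl="http://second.example.net"/>
  </body>
</opml>
EOF

# Count every occurrence of "http": two blogs, two links each.
link_count="$(grep -o 'http' "$opml_file" | wc -l)"
echo "$link_count"

rm -f "$opml_file"
```

Two blogs give four matches, which is exactly the duplication we will have to deal with.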

Using grep.

Grep is a command-line utility that searches text files for a specific phrase or pattern and prints the matching lines to the standard output; if no file is given, it reads from the standard input. To keep the results, we redirect that output to another file.
Normally the command looks like this:
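As a rough sketch, the general shape is grep [options] pattern [file...]. The little input file below and its contents are made up purely for illustration:

```shell
# General shape: grep [OPTIONS] PATTERN [FILE...]
notes_file="$(mktemp)"    # hypothetical input file
result_file="$(mktemp)"   # hypothetical output file
printf '%s\n' 'first line' 'a line with a link: http://example.com' 'last line' > "$notes_file"

# Print every line that contains the pattern "http" to standard output.
matching_lines="$(grep 'http' "$notes_file")"
echo "$matching_lines"

# Same search, but the shell redirects the output into a file.
grep 'http' "$notes_file" > "$result_file"
saved_lines="$(cat "$result_file")"
echo "$saved_lines"

rm -f "$notes_file" "$result_file"
```

Note that the redirection (>) is handled by the shell, not by grep itself; grep only ever writes to its standard output.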

To be honest, there’s no easy way to learn regex: you either struggle with it long enough until you learn it, or you just apply the try-and-fail routine until you get it. I don’t pretend I am a pro; I mostly use the second approach. This is why I won’t explain every detail of what you see. There are enough tutorials on regex, so I will just focus on our specific task.

Our command would look like this.
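The sketch below is pieced together from the explanation that follows, so the exact pattern is an assumption and may well differ from the command actually used. The two-line input file is a made-up stand-in, the list of top-level domains is only an example, and the extra tr step (which strips the trailing quote and space from each match) is an addition for tidier output:

```shell
work_dir="$(mktemp -d)"
opml_file="$work_dir/wpcom-subscriptions.opml"
sites_file="$work_dir/sites.txt"

# Hypothetical stand-in for the exported file (made-up example.* URLs).
cat > "$opml_file" <<'EOF'
<outline text="First Blog" type="rss" htmlUrl="https://first.example.com" xmlUrl="https://first.example.com"/>
<outline text="Second Blog" type="rss" htmlUrl="http://second.example.net" xmlUrl="http://second.example.net"/>
EOF

# -o prints only the matched text, -E enables extended regular expressions.
# The pattern ends with a quote followed by a space, which in this sample
# only follows the htmlUrl value, so each link is picked up once.
grep -oE 'https?://[^" ]+\.(com|net|org)" ' "$opml_file" | tr -d '" ' > "$sites_file"

sites="$(cat "$sites_file")"
echo "$sites"

rm -rf "$work_dir"
```

With this sample, sites.txt ends up with one clean link per blog.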

So before we are able to proudly say:

neo - I know regex

we must explain what we did with the command.

Ok, so the easy part: we are using grep as a command-line utility, we give it the source file (wpcom-subscriptions.opml), and at the end we redirect the standard output to a file that will be created for us (> sites.txt). What’s left are the additional arguments (-oE). -E stands for “extended regular expressions”, which lets us write features such as grouping and alternation with plain ( ) and |, and -o stands for “only matching”, which prints only the matched part of each line (otherwise grep prints the whole matching line, which is useless in our case).
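To see what -o changes, here is a small sketch on a single made-up line (the pattern is simplified for the demonstration and the URL is invented):

```shell
# Hypothetical input line with a made-up URL.
sample_line='<outline text="First Blog" htmlUrl="https://first.example.com" />'

# Without -o, grep prints the whole line that contains a match.
whole_line="$(printf '%s\n' "$sample_line" | grep -E 'https?://[^"]+')"

# With -o, grep prints only the part of the line that matched.
only_match="$(printf '%s\n' "$sample_line" | grep -oE 'https?://[^"]+')"

echo "$whole_line"
echo "$only_match"
```

The first echo shows the full outline element, the second one just the link.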

So the only part that’s left is the regular expression itself. We start by matching the protocol, “http”. Then comes \w, which matches a single word character (a letter, a digit or an underscore), not just any character. But there’s another “hack” here: since we can’t be sure whether the next character is a letter (as in https) or a non-letter symbol (as in http:), we use the pipe symbol to say “one option or the other”. Since we really don’t care what comes next, we add the ‘+’ symbol, which repeats the preceding group one or more times. It is greedy: it takes as many characters as it can and stops only where the group can no longer match (for \w alone that would be a whitespace or punctuation character), at the end of the line, or where the next part of the pattern has to take over.
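Here is a small experiment with these building blocks, assuming GNU grep (the \w and \W shorthands are GNU extensions) and a made-up line of text. The three patterns are illustrations, not the ones from the original command:

```shell
# Hypothetical input with two made-up links.
text='see http://example.com and https://example.net today'

# \w+ needs at least one word character right after "http",
# so "http:" is skipped and only "https" is matched.
word_only="$(printf '%s\n' "$text" | grep -oE 'http\w+')"

# (\w|\W)+ accepts any character, so the greedy + runs to the end of the line.
any_char="$(printf '%s\n' "$text" | grep -oE 'http(\w|\W)+')"

# A narrower class (anything but a space) stops at the end of each link.
no_spaces="$(printf '%s\n' "$text" | grep -oE 'https?://[^ ]+')"

echo "$word_only"
echo "$any_char"
echo "$no_spaces"
```

The second result is the reason a “match anything” group needs something specific after it: on its own it swallows the rest of the line.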

The last part of deciphering our regular expression deals with the top-level domains of the URLs (.com, .net, etc.). First we use the backslash (\) to escape the dot. Why is this important? The dot has its own meaning in regex, “match any single character”, so we have to use it carefully, otherwise we might select stuff we don’t want. Next we use the same paradigm again: we open a group with parentheses, list inside it the top-level domains we believe are present in our list, and separate them with pipes. The last part is just cosmetic: if we take another look at the markup, we’ll see that some of the URLs end with ‘"’ and some with ‘\"’. We use the same pattern and offer these two options separated by a pipe. The last symbol, ‘\s’, represents whitespace, because our target is the link in the “htmlUrl” attribute, and it is followed by whitespace. If we don’t require that trailing whitespace, we’ll end up extracting every link twice, which we don’t want, although it is admittedly a “corner case”.
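The effect of escaping the dot and of the top-level domain group can be checked on a few made-up host names. The last step shows a different way around the duplicates, sort -u, which is an alternative to the trailing-whitespace trick described above rather than part of it:

```shell
# Hypothetical host names, one per line.
hosts="$(printf '%s\n' 'example.com' 'exampleXcom' 'example.net' 'example.dev')"

# Unescaped dot: matches any character, so exampleXcom slips through.
loose="$(printf '%s\n' "$hosts" | grep -E 'example.com')"

# Escaped dot: only a literal "." is accepted.
strict="$(printf '%s\n' "$hosts" | grep -E 'example\.com')"

# Group of alternatives: the line must end in .com or .net.
tlds="$(printf '%s\n' "$hosts" | grep -E '\.(com|net)$')"

# Alternative de-duplication: extract every link, then keep unique lines.
line='<outline htmlUrl="https://first.example.com" xmlUrl="https://first.example.com"/>'
unique_links="$(printf '%s\n' "$line" | grep -oE 'https?://[^"]+' | sort -u)"

echo "$loose"
echo "$strict"
echo "$tlds"
echo "$unique_links"
```

The sort -u variant does not depend on which attribute comes first or what follows it, at the cost of reordering the links alphabetically.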

Conclusion.

It’s always exciting when you are able to use your coding/scripting skills to tackle a boring task, and I hope this will be helpful. If you ever need to use regular expressions, whether in Linux or in any programming language, and you feel unsure about what to put in your expression, I would recommend this tool. It is helpful because it explains your regular expression precisely, step by step, so you can see whether some part of it is ambiguous or not precise enough.
I would love to read your opinion on this, and of course, if you liked the topic, feel free to comment and share with your friends.