An elegant regular expression for finding URLs

so you can turn them into hyperlinks automatically…

A while ago I needed to write some code which would automatically recognise a URL in plain text, and turn it into a hyperlink.  Being a lazy sort, I turned to Google, and found this article on DevX.  The regular expression it gave there was not perfect, but worked reasonably well:

\w*[\://]*\w+\.\w+\.\w+[/\w+]*[.\w+]*

At the time, I was sufficiently rushed off my feet that I forgave its flaws and implemented it.  Over time, however, it’s been bugging me, and as the service it runs on gets more traffic, the need to improve it has grown.  And so it came to pass that this evening I bit the bullet and tried to write a better one.  After a couple of hours of testing various permutations, here it is:

\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#\?]+[\w-\?=\+%\&]*)?

I know what you’re thinking, and you’re right, it is a thing of beauty.  But before you copy and paste it into your auto-hyperlinking code, it seems only fair that I break it down into its constituent parts, so you know what you’re getting yourself into.
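For anyone who wants to see it wired up first, here is a minimal Python sketch of the auto-hyperlinking step. The function name autolink and the choice to HTML-escape the surrounding text are assumptions made for illustration, not the code running on the original service. One adaptation to note: Python's re module rejects \w-\? inside a character class as a bad range, so the final class is written [\w\-?=+%&], which is intended to match the same set of characters.

```python
import html
import re

# The article's pattern. The last character class escapes the hyphen
# because Python's re module rejects "\w-\?" as a character range.
URL_PATTERN = re.compile(
    r"\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#?]+[\w\-?=+%&]*)?"
)


def autolink(text: str) -> str:
    """Escape plain text for HTML and wrap each matched URL in an anchor tag.

    Hypothetical helper: the name and the escaping policy are assumptions.
    """
    parts = []
    last = 0
    for match in URL_PATTERN.finditer(text):
        # Escape the plain text that sits between the previous URL and this one.
        parts.append(html.escape(text[last:match.start()]))
        url = html.escape(match.group(0), quote=True)
        parts.append(f'<a href="{url}">{url}</a>')
        last = match.end()
    # Escape whatever follows the final URL.
    parts.append(html.escape(text[last:]))
    return "".join(parts)


if __name__ == "__main__":
    print(autolink("See http://example.com/page.html."))
```

Run as a script, this should print the sentence with http://example.com/page.html wrapped in an anchor tag and the closing full stop left outside the link.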

\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#\?]+[\w-\?=\+%\&]*)?

\w+:// means ‘one or more word characters, followed by ://’.  I took the deliberate decision that if people wanted a hyperlink, they were going to have to prefix it with the protocol and ‘://’ to give us a heads-up.  I suppose I could have looked out for ‘www’ and two more word groups, separated by full stops (periods for our readers in the US), but since the easiest way to create a link is to copy and paste from the address bar, I figured that this was OK.  It’s not like everyone uses the www prefix, anyway. And it keeps the regex from being truly horrendous.
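To make the protocol requirement concrete, here is a small Python sketch. The helper name find_urls is hypothetical, and the final character class is written [\w\-?=+%&] because Python's re module rejects \w-\? as a range. It suggests that a bare www address is ignored, while anything with a word followed by :// is picked up, whatever the protocol.

```python
import re

# The article's pattern, with the hyphen in the last class escaped for Python.
URL_PATTERN = re.compile(
    r"\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#?]+[\w\-?=+%&]*)?"
)


def find_urls(text: str) -> list:
    """Return every substring the pattern matches (hypothetical helper)."""
    return [match.group(0) for match in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    # Only the address with an explicit protocol is found.
    print(find_urls("Try www.example.com or https://example.com today"))
    # The protocol itself is not restricted to http or https.
    print(find_urls("Mirror: ftp://files.example.org/pub"))
    # No "://", so nothing is matched here.
    print(find_urls("mailto:bob@example.com"))
```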

\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#\?]+[\w-\?=\+%\&]*)?

[\w-]+(\.[\w-]+)* means ‘one or more word characters, optionally followed by any number of full-stops-and-words’ (from now on, my use of the term ‘word characters’ includes hyphens).  Basically, this takes care of the domain.  Crucially, if you have a full stop you have to follow it with some word characters.  This avoids the biggest flaw with the original regular expression; namely that full stops at the end of the URL were being counted as part of it, when they should be ignored.
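The trailing full stop problem is easy to see side by side. The sketch below is an illustration in Python, not the original implementation: OLD_PATTERN is the DevX expression quoted at the top, NEW_PATTERN is the new one (with the last character class written as [\w\-?=+%&], since Python's re module rejects \w-\? as a range), and the sample sentence is made up.

```python
import re

# The DevX expression quoted at the top of the article.
OLD_PATTERN = re.compile(r"\w*[\://]*\w+\.\w+\.\w+[/\w+]*[.\w+]*")

# The new expression, with the hyphen in the last class escaped for Python.
NEW_PATTERN = re.compile(
    r"\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#?]+[\w\-?=+%&]*)?"
)

SENTENCE = "Go to http://www.example.com."  # made-up sample text

old_match = OLD_PATTERN.search(SENTENCE).group(0)
new_match = NEW_PATTERN.search(SENTENCE).group(0)

if __name__ == "__main__":
    print(old_match)  # includes the sentence's closing full stop
    print(new_match)  # stops after "com"
```

The old expression ends with [.\w+]*, a character class that happily accepts a lone full stop, whereas the new one only accepts a full stop when word characters follow it.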

\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#\?]+[\w-\?=\+%\&]*)?

(:[0-9]+)? means ‘a colon followed by one or more digits, occurring zero or one times’. Essentially, this allows a port number to be specified if required.
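Since the port sits in its own capturing group (the second one, counting opening brackets from the left), it can be pulled out of a match directly. A minimal Python sketch, assuming a hypothetical helper called extract_port; the last character class is written [\w\-?=+%&] because Python's re module rejects \w-\? as a range.

```python
import re
from typing import Optional

# The article's pattern, with the hyphen in the last class escaped for Python.
URL_PATTERN = re.compile(
    r"\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#?]+[\w\-?=+%&]*)?"
)


def extract_port(text: str) -> Optional[int]:
    """Return the port of the first URL in text, or None if there is none.

    Hypothetical helper: group 2 of the pattern is the optional ':digits' part.
    """
    match = URL_PATTERN.search(text)
    if match is None or match.group(2) is None:
        return None
    # Drop the leading colon before converting to an integer.
    return int(match.group(2)[1:])


if __name__ == "__main__":
    print(extract_port("Dev server: http://localhost:8080/status"))
    print(extract_port("http://example.com/"))
```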

\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#\?]+[\w-\?=\+%\&]*)?

[/\w-]* means ‘any combination of forward slashes and word characters’. This takes care of any directories or filenames after the domain.

\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#\?]+[\w-\?=\+%\&]*)?

(\.[\w-]+)* means ‘any number of groups, each made up of a full stop followed by one or more word characters or hyphens’.  This looks after the extensions of any file names matched in the previous step, once again ensuring that it doesn’t match any trailing full stops.
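Here is a rough Python sketch of this step and the previous one working together. The helper name first_url and the sample URLs are made up for illustration, and the last character class is written [\w\-?=+%&] because Python's re module rejects \w-\? as a range. The second example shows a consequence of the ordering that is worth knowing about: because extensions are only looked for once, after the path, a full stop inside a directory name appears to end the match early.

```python
import re
from typing import Optional

# The article's pattern, with the hyphen in the last class escaped for Python.
URL_PATTERN = re.compile(
    r"\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#?]+[\w\-?=+%&]*)?"
)


def first_url(text: str) -> Optional[str]:
    """Return the first substring the pattern matches, or None (hypothetical helper)."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Directories, a hyphenated filename and a double extension are all kept,
    # while the full stop that ends the sentence is left out.
    print(first_url("Download http://example.com/files/report-2010.tar.gz."))
    # A dotted directory name: the match stops after "v1.2".
    print(first_url("http://example.com/v1.2/file.txt"))
```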

\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#\?]+[\w-\?=\+%\&]*)?

([#\?]+[\w-\?=\+%\&]*)? is good fun, meaning ‘one or more hashes or question marks, optionally followed by any combination of word characters, hyphens, question marks, equals, plus and percentage signs, and ampersands’, with the whole group being optional. You are only allowed these additional characters after a preceding # or ?, meaning that you can use them in querystrings or named anchors, but nowhere else.
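A short Python sketch of the ‘nowhere else’ rule, using made-up URLs and a hypothetical helper called first_url. As before, the last character class is written [\w\-?=+%&] because Python's re module rejects \w-\? as a range. A percent sign after a question mark is accepted, but the same percent sign in the path ends the match.

```python
import re
from typing import Optional

# The article's pattern, with the hyphen in the last class escaped for Python.
URL_PATTERN = re.compile(
    r"\w+://[\w-]+(\.[\w-]+)*(:[0-9]+)?[/\w-]*(\.[\w-]+)*([#?]+[\w\-?=+%&]*)?"
)


def first_url(text: str) -> Optional[str]:
    """Return the first substring the pattern matches, or None (hypothetical helper)."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # A querystring with plus signs, equals signs and an ampersand.
    print(first_url("http://example.com/search?q=regex+urls&page=2"))
    # A named anchor after a file extension.
    print(first_url("http://example.com/page.html#section-2"))
    # Percent-encoding is fine once a question mark has been seen...
    print(first_url("http://example.com/?name=a%20b"))
    # ...but not in the path, where the match stops before the percent sign.
    print(first_url("http://example.com/a%20b"))
```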

So, one regex to rule them all.  I’ve probably missed something, so let me know how you get on.

Thursday, January 21st, 2010 Web Development
