The amazing adventures of Doug Hughes

Irregular Expressions

I’m not a big fan of the Regular Expression methods that come with ColdFusion. I find them cumbersome, at best. They’re great if you just want to check a string for a pattern, but if you’re replacing data using any sort of algorithm you’re hosed.

As an example, I’m working on a very simple content editor for a client’s website. One feature of the system is that they can type link names as plain text. Thus, if they want to link to a page that they call “About Us”, they simply click the link icon in TinyMCE and type in the English words, “About Us”. When the content form is submitted, the system translates this into a more meaningful URL, “index.cfm?event=About-Us”.

The problem is that the site only receives the HTML markup which might look like this:

<p>Click here to read <a href="About Us" />about us</a>. </p>
<p>And, as an example, here's <a href="Another Link" />another link to see</a>. </p>
<p>Lastly, here's a real URL to <a href="http://www.anothersite.com" />another site</a>. </p>

So, the idea is that I loop over all the links I can find in the HTML and check each href value to see if it starts with “http://” or “https://”, or contains “index.cfm”. If so, I ignore it. Otherwise, I rewrite the href value so that it begins with index.cfm?event= and all non-alphanumeric characters are replaced with hyphens.

So, the first thing I always do when I run into problems like this is try to use ReReplace. Frankly, I’m no regular expression wizard. I can usually slink by, but if there’s any solution to this problem with just regular expressions I sure don’t know.

The problem is that I need to make decisions conditionally and ReReplace statically replaces instances of a pattern in a string with another string or a limited pattern.
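To make the limitation concrete, here is a minimal sketch of what a plain ReReplace call can do with the sample markup. The variable names html and fixed are only for illustration. A backreference can copy the captured href into the replacement, but it cannot hyphenate it, and the same replacement is applied to every link, including the ones that should be left alone.

```cfml
<!--- illustrative input; html and fixed are hypothetical variable names --->
<cfset html = '<a href="About Us">about us</a>'/>
<!--- \1 copies the captured href as-is; there is no way to transform it or skip a match --->
<cfset fixed = ReReplace(html, 'href="([^"]+)"', 'href="index.cfm?event=\1"', "all")/>
<!--- fixed is now: <a href="index.cfm?event=About Us">about us</a> --->
```

The space in “About Us” survives, and an http:// link would get the same prefix, which is exactly the conditional logic a static replacement cannot express.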

So, to do what I want I typically write a loop tag that loops until I break out of it. On each iteration of the loop I look for the pattern with ReFind. I parse the heck out of the results and then rebuild the string I’m searching over. Lastly, I figure out where the next search should begin.

Gawd do I despise this technique. So, I decided to work out a better solution.

Wouldn’t it be nice if there was a way to call a handler function on each match that simply had to parse that one match and return the result? Well, here’s a very simple CFC that does just that:

<cfcomponent>
    <cffunction access="public" hint="I am a function that can be used to more easily parse a set of matches in a particular way using regular expressions" name="parse" output="false" returntype="string">
        <cfargument hint="I am the string to parse" name="string" required="yes" type="string"/>
        <cfargument hint="I am the regex to use" name="regex" required="yes" type="string"/>
        <!--- typed as "any" because a function reference is not a simple string --->
        <cfargument hint="I am the callback function to use. This must accept one argument, an array of elements in one match." name="callbackFunction" required="yes" type="any"/>
        <cfset var location = 1/>
        <cfset var match = 0/>
        <cfset var detail = 0/>
        <cfset var x = 0/>
        <cfset var prefix = ""/>
        <cfset var suffix = ""/>
        <cfset var change = ""/>
        <cfloop condition="true">
            <cfset detail = ArrayNew(1)/>
            <!--- find the next match using the caller's regex, starting just after the previous replacement --->
            <cfset match = ReFind(arguments.regex, arguments.string, location, true)/>
            <!--- a length of 0 means there are no more matches --->
            <cfif match.len[1] IS 0>
                <cfbreak/>
            </cfif>
            <!--- parse the match into chunks --->
            <cfloop from="1" index="x" to="#ArrayLen(match.len)#">
                <cfset detail[x] = StructNew()/>
                <cfif match.len[x]>
                    <cfset detail[x].string = Mid(arguments.string, match.pos[x], match.len[x])/>
                <cfelse>
                    <cfset detail[x].string = ""/>
                </cfif>
                <cfset detail[x].pos = match.pos[x]/>
                <cfset detail[x].len = match.len[x]/>
            </cfloop>
            <!--- split the string around the match; Mid is used because Left and Right reject a count of 0 --->
            <cfset prefix = Mid(arguments.string, 1, match.pos[1] - 1)/>
            <cfset suffix = Mid(arguments.string, match.pos[1] + match.len[1], Len(arguments.string))/>
            <!--- let the handler decide what replaces this match --->
            <cfset change = arguments.callbackFunction(detail)/>
            <cfset arguments.string = prefix & change & suffix/>
            <!--- resume searching just after the replacement --->
            <cfset location = Len(prefix) + Len(change) + 1/>
        </cfloop>
        <cfreturn arguments.string/>
    </cffunction>
</cfcomponent>

This CFC has one method, parse, which accepts a string, a regular expression and a reference to a function that will handle matches. The handler function receives an array of structs. Each struct in the array has the keys “string”, “len” and “pos”. The “string” key holds the text matched by the corresponding portion of the regular expression: the first element is the entire match, and each following element is one parenthesized group.

Let’s say I used this regular expression

(<a href="(.+?)".*?>)(.+?)(</a>)

I would end up with five elements in my array: the entire link tag, the opening tag, the value of the href attribute, the text in the link, and the closing tag. Each of these elements would hold the matched string. For example, element 4 would be “about us” for the first match in the HTML example above.

So, I can write a handler method like this:

<cffunction name="fixLinks" output="false" returntype="string">
    <cfargument name="match"/>
    <cfset var result = ""/>
    <cfset var link = ""/>
    <!--- rewrite any href that doesn't start with http:// or https:// and doesn't contain index.cfm --->
    <cfif NOT (Left(arguments.match[3].string, 7) IS "http://") AND NOT (Left(arguments.match[3].string, 8) IS "https://") AND NOT (arguments.match[3].string CONTAINS "index.cfm")>
        <!--- open the link tag --->
        <cfset result = arguments.match[2].string/>
        <!--- fix the href in the link tag --->
        <cfset link = "index.cfm?event=" & ReReplace(Trim(arguments.match[3].string), "\W", "-", "all")/>
        <cfset result = Replace(result, arguments.match[3].string, link)/>
        <!--- add the link text --->
        <cfset result = result & arguments.match[4].string/>
        <!--- close the tag --->
        <cfset result = result & arguments.match[5].string/>
    <cfelse>
        <!--- leave real URLs and existing index.cfm links untouched --->
        <cfset result = arguments.match[1].string/>
    </cfif>
    <cfreturn result/>
</cffunction>

This accepts the match and parses it. It simply checks whether the href is an http, https or index.cfm link and, if it is not, rewrites it to the standard format.

Making this work is as simple as two remaining lines of code:

<cfset var RegEx = CreateObject("Component", "model.regex.Regex") />
<cfset content = RegEx.parse(content, '(<a href="(.+?)".*?>)(.+?)(</a>)', fixLinks) />

Assuming we use the HTML above, the output from this method is:

<p>Click here to read <a href="index.cfm?event=About-Us" />about us</a>. </p>
<p>And, as an example, here's <a href="index.cfm?event=Another-Link" />another link to see</a>. </p>
<p>Lastly, here's a real URL to <a href="http://www.anothersite.com" />another site</a>. </p>

How handy is that? Isn’t it cool what can happen when you start coding to interfaces?
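Because the handler is just a function passed in, the same parse method can be reused with different logic. As a rough sketch, here is a second handler, markExternalLinks (a made-up name for illustration), that appends a marker to the text of any link whose href starts with http:// or https:// and returns every other link unchanged. It assumes the same regular expression and the same RegEx and content variables used above.

```cfml
<!--- hypothetical handler: flags external links, leaves everything else untouched --->
<cffunction name="markExternalLinks" output="false" returntype="string">
    <cfargument name="match"/>
    <!--- match[3] is the href value captured by the regular expression --->
    <cfif REFind("^https?://", arguments.match[3].string)>
        <!--- rebuild the link: opening tag, link text plus marker, closing tag --->
        <cfreturn arguments.match[2].string & arguments.match[4].string & " (external)" & arguments.match[5].string/>
    </cfif>
    <!--- not external: return the whole match as it was --->
    <cfreturn arguments.match[1].string/>
</cffunction>

<cfset content = RegEx.parse(content, '(<a href="(.+?)".*?>)(.+?)(</a>)', markExternalLinks) />
```

Only the decision logic changes between handlers; the looping and string rebuilding stay inside the CFC.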

Comments on: "Irregular Expressions" (5)

  1. Sammy Larbi said:

    What about something like this?

    <cfset content = rereplacenocase(content,"()”,”1index.cfm?event=2-45″,”all”)>

    Is there something I’ve missed in it?


  2. Sammy Larbi said:

    Oops. Here’s a better version that doesn’t hide in the HTML:

    <cfset content = rereplacenocase(content,"(<a href="")(\w+)(\s+)(\w+)("">)","\1index.cfm?event=\2-\4\5","all")>


  3. Peter Boughton said:

    Check for a link when the user enters the text, and if it’s not a known protocol wrap it in brackets.
    Then, for the CF you can just do this:

    <cfset content = REReplace(content,'href="\[([^\]]*)\]"','href="index.cfm?event=\1"','all')/>


  4. Doug Hughes said:

    Sammy / Peter – I figured someone would submit one line of code to invalidate my examples!

    Anyhow, the thing I like about my Regex cfc is that I can handle every match on a case by case basis. Thus, I could do something like searching for any html tag and I could handle each match on a case by case basis.

    Doug


  5. Sammy Larbi said:

    Well, I do like that about it too! =)

