CFMX UDF: Parsing a URI into a struct...

Posted by Dan on Sep 27, 2006 @ 3:49 PM

Ever needed to parse a fully qualified URI to pull specific information out of it? I'm working on some code that needs to examine the links in a document and extract information about them.

To make sure I was doing things by the spec, I checked RFC 2396. Fortunately, the RFC includes a nice little regular expression for breaking a URI into its core pieces: scheme, authority, path, query and fragment.
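As a rough sketch of that first step, here is the expression from Appendix B of RFC 2396 run through reFind() with sub-expressions turned on. The helper name uriPart() and the sample URL are made up for illustration; they are not part of the UDF.

```cfml
<cfscript>
// hypothetical helper: return capture group iIndex from a reFind() result,
// or an empty string when that group did not take part in the match
function uriPart(sText, stMatch, iIndex){
    if( arrayLen(stMatch.len) gte iIndex and stMatch.len[iIndex] gt 0 ) return mid(sText, stMatch.pos[iIndex], stMatch.len[iIndex]);
    return "";
}

// the expression from RFC 2396, Appendix B -- every pound sign is doubled,
// because a single pound sign delimits an expression inside a CFML string
sUriRegEx = "^(([^:/?##]+):)?(//([^/?##]*))?([^?##]*)(\?([^##]*))?(##(.*))?";
// sample URL (an assumption for this sketch)
sUrl = "http://www.foo.com/bar/file.gif?q1=item1##nameAnchor";

stUriInfo = reFind(sUriRegEx, sUrl, 1, true);

// pos[1]/len[1] describe the whole match, so RFC group N lives at index N+1
writeOutput("scheme: " & uriPart(sUrl, stUriInfo, 3) & "<br />");
writeOutput("authority: " & uriPart(sUrl, stUriInfo, 5) & "<br />");
writeOutput("path: " & uriPart(sUrl, stUriInfo, 6) & "<br />");
writeOutput("query: " & uriPart(sUrl, stUriInfo, 8) & "<br />");
writeOutput("fragment: " & uriPart(sUrl, stUriInfo, 10) & "<br />");
</cfscript>
```

The RFC numbers the groups from 1 ($2 is the scheme, $4 the authority, $5 the path, $7 the query and $9 the fragment), while the pos and len arrays returned by reFind() keep the whole match in slot 1, which is why every index is shifted up by one.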

However, those core portions are still pretty broad. The authority can include user info, domain and port information. The path can include embedded parameters inside each segment. So I use the core regular expression to break up the URI, then do further parsing on the authority and path portions.
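To give an idea of what that second pass can look like, here is a minimal sketch. Both patterns below are my own assumptions, written so their group positions line up with the indexes the UDF reads from the authority and path matches; they are not necessarily the expressions the UDF uses. The helper name subMatch() and the sample strings are also made up.

```cfml
<cfscript>
// hypothetical helper: return capture group iIndex, or "" when it did not match
function subMatch(sText, stMatch, iIndex){
    if( arrayLen(stMatch.len) gte iIndex and stMatch.len[iIndex] gt 0 ) return mid(sText, stMatch.pos[iIndex], stMatch.len[iIndex]);
    return "";
}

// ASSUMED authority pattern: [username[:password]@]domain[:port]
//   index 3 = username, 5 = password, 6 = domain, 8 = port
sAuthRegEx = "^(([^:@]*)(:([^@]*))?@)?([^:]*)(:([0-9]*))?$";
// ASSUMED segment pattern: some path text, optionally followed by ;parameters
//   index 2 = path text without parameters, 4 = the parameters (without the leading ;)
sSegRegEx = "([^;]+)(;([^/]*))?";

sAuthority = "usr:pwd@www.foo.com:80";
stAuth = reFind(sAuthRegEx, sAuthority, 1, true);
writeOutput("username: " & subMatch(sAuthority, stAuth, 3) & "<br />");
writeOutput("password: " & subMatch(sAuthority, stAuth, 5) & "<br />");
writeOutput("domain: " & subMatch(sAuthority, stAuth, 6) & "<br />");
writeOutput("port: " & subMatch(sAuthority, stAuth, 8) & "<br />");

sPath = "/bar/sub/file.gif;p=5";
stSeg = reFind(sSegRegEx, sPath, 1, true);
writeOutput("path without parameters: " & subMatch(sPath, stSeg, 2) & "<br />");
writeOutput("segment parameters: " & subMatch(sPath, stSeg, 4) & "<br />");
</cfscript>
```

The segment pattern requires at least one character per match, so repeatedly searching from the end of the previous match always moves forward. Neither pattern tries to handle every case the RFC allows (IPv6 literals in the authority, for example).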

What I ended up with is a ColdFusion function that returns a struct with the following keys:

	authority:      the authority section of the URI
	directory:      the directory path in the URI (w/out parameters)
	domain:         the domain of the URI
	file:           the file name in the URI
	fragment:       the fragment of the URI (e.g. namedAnchor, without the leading #)
	params.url:     the query string params as a struct
	params.segment: the segment parameters as a struct (e.g. ;JSESSIONID=1234)
	password:       the password supplied in the URI
	path:           the full URI path--includes embedded parameters (e.g. ;JSESSIONID=1234)
	port:           the port defined in the URI
	query:          the query string
	scheme:         the scheme/protocol (e.g. http, https, ftp, etc.)
	username:       the username in the URI

There have definitely been times when I've needed to parse information from the current URL (e.g. to figure out whether the user is on HTTP or HTTPS). Hopefully some of you will find this function useful.

<cfscript>
    // [...] the function declaration, the definitions of sUriRegEx, sAuthRegEx,
    // sSegRegEx and sCharSet, and the var declarations for stUrlInfo, sPath,
    // aQSPairs, sKeyPair, sKey, sValue and i are not included in this listing

    // break the URI into its core pieces
    var stUriInfo = reFindNoCase(sUriRegEx, sUrl, 1, true);
    // this is for the authority section
    var stAuthInfo = "";
    // this is for the segments in the path
    var stSegInfo = "";

    // if the charset argument exists, use it -- this defaults to the url encoding to mimic the default
    // behavior
    if( arrayLen(arguments) gte 2 ) sCharSet = arguments[2];

    // create empty keys
    stUrlInfo["scheme"] = "";
    stUrlInfo["authority"] = "";
    stUrlInfo["path"] = "";
    stUrlInfo["directory"] = "";
    stUrlInfo["file"] = "";
    stUrlInfo["query"] = "";
    stUrlInfo["fragment"] = "";
    stUrlInfo["domain"] = "";
    stUrlInfo["port"] = "";
    stUrlInfo["username"] = "";
    stUrlInfo["password"] = "";
    stUrlInfo["params"] = structNew();

    // get the scheme
    if( stUriInfo.len[3] gt 0 ) stUrlInfo["scheme"] = mid(sUrl, stUriInfo.pos[3], stUriInfo.len[3]);
    // get the authority
    if( stUriInfo.len[5] gt 0 ) stUrlInfo["authority"] = mid(sUrl, stUriInfo.pos[5], stUriInfo.len[5]);
    // get the path
    if( stUriInfo.len[6] gt 0 ) stUrlInfo["path"] = mid(sUrl, stUriInfo.pos[6], stUriInfo.len[6]);
    // get the query
    if( stUriInfo.len[8] gt 0 ) stUrlInfo["query"] = mid(sUrl, stUriInfo.pos[8], stUriInfo.len[8]);
    // get the fragment
    if( stUriInfo.len[10] gt 0 ) stUrlInfo["fragment"] = mid(sUrl, stUriInfo.pos[10], stUriInfo.len[10]);

    // break authority into user info, domain and ports
    if( len(stUrlInfo["authority"]) gt 0 ){
        // parse the authority looking for info
        stAuthInfo = reFindNoCase(sAuthRegEx, stUrlInfo["authority"], 1, true);

        // get the domain
        if( stAuthInfo.len[6] gt 0 ) stUrlInfo["domain"] = mid(stUrlInfo["authority"], stAuthInfo.pos[6], stAuthInfo.len[6]);
        // get the port
        if( stAuthInfo.len[8] gt 0 ) stUrlInfo["port"] = mid(stUrlInfo["authority"], stAuthInfo.pos[8], stAuthInfo.len[8]);
        // get the username
        if( stAuthInfo.len[3] gt 0 ) stUrlInfo["username"] = mid(stUrlInfo["authority"], stAuthInfo.pos[3], stAuthInfo.len[3]);
        // get the password
        if( stAuthInfo.len[5] gt 0 ) stUrlInfo["password"] = mid(stUrlInfo["authority"], stAuthInfo.pos[5], stAuthInfo.len[5]);
    }

    // the segment parameters in struct form
    stUrlInfo["params"]["segment"] = structNew();

    // if the path contains any parameters, we need to parse them out
    if( find(";", stUrlInfo["path"]) gt 0 ){
        // this is for the segments in the path
        stSegInfo = reFindNoCase(sSegRegEx, stUrlInfo["path"], 1, true);

        // loop through all the segments and build the strings
        while( stSegInfo.pos[1] gt 0 ){
            // build the path, excluding parameters
            sPath = sPath & mid(stUrlInfo["path"], stSegInfo.pos[2], stSegInfo.len[2]);

            // if there are some parameters in this segment, add them to the struct
            if( stSegInfo.len[4] gt 0 ){

                // put the parameters into an array for easier looping
                aQSPairs = listToArray(mid(stUrlInfo["path"], stSegInfo.pos[4], stSegInfo.len[4]), ";");

                // now, loop over the array and build the struct
                for( i=1; i lte arrayLen(aQSPairs); i=i+1 ){
                    sKeyPair = aQSPairs[i]; // current pair
                    sKey = listFirst(sKeyPair, "="); // current key
                    // make sure the pair has both a key and a value
                    if( listLen(sKeyPair, "=") gt 1){
                        sValue = listLast(sKeyPair, "="); // current value
                        // fix bug in urlDecode w/single % signs: escape any %
                        // that is not followed by two hex digits
                        sValue = reReplaceNoCase(sValue, "%(?![0-9A-F]{2})", "%25", "all");
                        // url decode the value
                        sValue = urlDecode(sValue, sCharSet);
                    } else {
                        sValue = ""; // set blank value
                    }
                    // check if key already added to struct
                    if( structKeyExists(stUrlInfo["params"]["segment"], sKey) ) stUrlInfo["params"]["segment"][sKey] = listAppend(stUrlInfo["params"]["segment"][sKey], sValue); // add value to list
                    else structInsert(stUrlInfo["params"]["segment"], sKey, sValue); // add new key/value pair
                }
            }

            // get the ending position
            i = stSegInfo.pos[1] + stSegInfo.len[1];

            // get the next segment
            stSegInfo = reFindNoCase(sSegRegEx, stUrlInfo["path"], i, true);
        }

    } else {
        // set the current path
        sPath = stUrlInfo["path"];
    }

    // get the file name
    stUrlInfo["file"] = getFileFromPath(sPath);
    // get the directory path by removing the file name
    if( len(stUrlInfo["file"]) gt 0 ){
        stUrlInfo["directory"] = replace(sPath, stUrlInfo["file"], "", "one");
    } else {
        stUrlInfo["directory"] = sPath;
    }

    // the query string in struct form
    stUrlInfo["params"]["url"] = structNew();

    // if query info was supplied, break it into a struct
    if( len(stUrlInfo["query"]) gt 0 ){
        // put the query string into an array for easier looping
        aQSPairs = listToArray(stUrlInfo["query"], "&");

        // now, loop over the array and build the struct
        for( i=1; i lte arrayLen(aQSPairs); i=i+1 ){
            sKeyPair = aQSPairs[i]; // current pair
            sKey = listFirst(sKeyPair, "="); // current key
            // make sure the pair has both a key and a value
            if( listLen(sKeyPair, "=") gt 1){
                sValue = urlDecode(listLast(sKeyPair, "=")); // current value
            } else {
                sValue = ""; // set blank value
            }
            // check if key already added to struct
            if( structKeyExists(stUrlInfo["params"]["url"], sKey) ) stUrlInfo["params"]["url"][sKey] = listAppend(stUrlInfo["params"]["url"][sKey], sValue); // add value to list
            else structInsert(stUrlInfo["params"]["url"], sKey, sValue); // add new key/value pair
        }
    }

    // return the struct
    return stUrlInfo;
}
</cfscript>
NOTE:
UPDATE 2006-09-27: Updated the code to allow a custom charset for the urlDecode() function. I found that data coming from a database or an XML document may use a different encoding than the URL scope. By default, CFMX uses the URL scope's encoding for urlDecode().
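To illustrate why the charset matters, here is a small sketch using the optional second argument of urlDecode(). The sample strings are made up: they hold the same word percent-encoded under two different charsets.

```cfml
<cfscript>
// the same text ("café"), percent-encoded under two different charsets
sLatin1 = "caf%E9";     // é as a single ISO-8859-1 byte
sUtf8   = "caf%C3%A9";  // é as two UTF-8 bytes

// each value only decodes cleanly when the matching charset is supplied
writeOutput(urlDecode(sLatin1, "iso-8859-1") & "<br />");
writeOutput(urlDecode(sUtf8, "utf-8") & "<br />");
</cfscript>
```

Decoding either string with the wrong charset would give garbage in place of the accented character, which is the situation the optional charset argument is meant to avoid.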

The example below shows how the URI would get broken down:

Example

http://usr:pwd@www.foo.com:80/bar/sub/file.gif;p=5?q1=item1&q1=item2&q2=item3#nameAnchor

authority:   usr:pwd@www.foo.com:80
directory:   /bar/sub/
domain:      www.foo.com
file:        file.gif
fragment:    nameAnchor
params:      [struct]
  segment:   [struct]
    p:       5
  url:       [struct]
    q1:      item1,item2
    q2:      item3
password:    pwd
path:        /bar/sub/file.gif;p=5
port:        80
query:       q1=item1&q1=item2&q2=item3
scheme:      http
username:    usr
Categories: HTML/ColdFusion, Java, Source Code

5 Comments

  • This could definitely help me. For instance, I've written a product browser in JavaScript that lets you browse products faster by using DHTML techniques. The only problem is, hitting the back button in your browser sends you to the last page you were at (not the last "virtual page"). To get around this, I'm going to add anchor tags to the URI each time you select to browse to a new "virtual page". I already had some simple/sloppy JavaScript to grab the anchor value, but I'll probably just make use of this code....
  • I have to congratulate you on this. I've been needing (and thinking about creating) this function for about 6 months and, while the regex isn't difficult, I found it completely impossible to read the entire RFC without stabbing myself in the eye (rendering further reading even less enjoyable).

    For that alone you should receive some sort of medal.
  • Awesome function, very glad I didn't have to write it :)
  • FYI --- if you're looking for a VERY similar function in JavaScript, check this page out:
    http://badassery.blogspot.com/2007/02/parseuri-spl...
  • over 2 years late but... thanks so much for sharing this!
