CFMX UDF: Parsing a URI into a struct...

Categories: HTML/ColdFusion, Source Code, Java

Ever needed to parse a qualified URI to examine a URL for specific information? I'm working on some code that needs to examine links in a document and extract information about the links.

To make sure I was doing things by the spec, I made sure to check out RFC2396. Fortunately, the RFC has a nice little regular expression for breaking a URI into it's core pieces: scheme, authority, path, query and fragment.

However, those core portions are still pretty broad. The authority can include user info, domain and port information. The path can include embedded parameters inside each segment. So, I took the core regular expression to break up a URI and then I do further parsing on the authority and path portions of the URI.

What I ended up with was a ColdFusion function that would return the following:

	authority:      the authority section of the Uri
	directory:      the directory path in the Uri (w/out parameters)
	domain:         the domain of the Uri
	file:           the file name in the Uri
	fragment:       the fragment of the uri (i.e. #namedAnchor)
	params.url:     the query string params as a struct
	params.segment: the segment parameters as a struct (i.e. ;JSESSIONID=1234)
	password:       the password supplied in the Uri
	path:           the full Uri path--includes embedded parameters (i.e. ;JSESSIONID=1234)
	port:           the port defined in the Uri
	query:          the query string
	scheme:         the scheme/protocol (i.e. http, https, ftp, etc.)
	username:       the username in the Uri

I definitely know there are times when I've needed to parse information from the current URL (i.e. to figure out if the user is using HTTP or HTTPS.) Hopefully some of you will find this function useful.

<cfscript>
/**
* Parses a uri and returns a structure with the information in the uri
*
* @param sUrl The url to parse
* @return Returns a struct.
* @author Dan G. Switzer, II
* @version 1, September 26, 2006
*/
function parseUrl(sUrl){
   // var to hold the final structure
   var stUrlInfo = structNew();
   // var to hold the final structure
   var sCharSet = getEncoding("url");
   // vars for use in the loop, so we don't have to evaluate lists and arrays more than once
   var i = 1;
   var sKeyPair = "";
   var sKey = "";
   var sValue = "";
   var aQSPairs = "";
   var sPath = "";
   /*
      from: http://www.ietf.org/rfc/rfc2396.txt
      ^((([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*)))?
       123 4 5 6 7 8 9 A
         scheme = $3
         authority = $5
         path = $6
         query = $8
         fragment = $10 (A)
   */
   var sUriRegEx = "^(([^:/?##]+):)?(//([^/?##]*))?([^?##]*)(\?([^##]*))?(##(.*))?";
   /*
      separates the authority into user info, domain and port
      ^((([^@:]+)(:([^@]+))?@)?([^:]*)?(:(.*)))?
       123 4 5 6 7 8
         username = $3
         password = $5
         domain = $6
         port = $8
   */
   var sAuthRegEx = "^(([^@:]+)(:([^@]+))?@)?([^:]*)?(:(.*))?";
   /*
      separates the path into segments & parameters
      ((/?[^;/]+)(;([^/]+))?)
      12 3 4
         segment = $1
         path = $2
         parameters = $4
   */

   var sSegRegEx = "(/?[^;/]+)(;([^/]+))?";
   // parse the url looking for info
   var stUriInfo = reFindNoCase(sUriRegEx, sUrl, 1, true);
   // this is for the authority section
   var stAuthInfo = "";
   // this is for the segments in the path
   var stSegInfo = "";
   // if the charset argument exists, use it -- this defaults to the url encoding to mimic the default
   // behavior
   if( arrayLen(arguments) gte 2 ) sCharSet = arguments[2];
   // create empty keys
   stUrlInfo["scheme"] = "";
   stUrlInfo["authority"] = "";
   stUrlInfo["path"] = "";
   stUrlInfo["directory"] = "";
   stUrlInfo["file"] = "";
   stUrlInfo["query"] = "";
   stUrlInfo["fragment"] = "";
   stUrlInfo["domain"] = "";
   stUrlInfo["port"] = "";
   stUrlInfo["username"] = "";
   stUrlInfo["password"] = "";
   stUrlInfo["params"] = structNew();
   // get the scheme
   if( stUriInfo.len[3] gt 0 ) stUrlInfo["scheme"] = mid(sUrl, stUriInfo.pos[3], stUriInfo.len[3]);
   // get the authority
   if( stUriInfo.len[5] gt 0 ) stUrlInfo["authority"] = mid(sUrl, stUriInfo.pos[5], stUriInfo.len[5]);
   // get the path
   if( stUriInfo.len[6] gt 0 ) stUrlInfo["path"] = mid(sUrl, stUriInfo.pos[6], stUriInfo.len[6]);
   // get the path
   if( stUriInfo.len[8] gt 0 ) stUrlInfo["query"] = mid(sUrl, stUriInfo.pos[8], stUriInfo.len[8]);
   // get the fragment
   if( stUriInfo.len[10] gt 0 ) stUrlInfo["fragment"] = mid(sUrl, stUriInfo.pos[10], stUriInfo.len[10]);
   // break authority into user info, domain and ports
   if( len(stUrlInfo["authority"]) gt 0 ){
      // parse the authority looking for info
      stAuthInfo = reFindNoCase(sAuthRegEx, stUrlInfo["authority"], 1, true);
      // get the domain
      if( stAuthInfo.len[6] gt 0 ) stUrlInfo["domain"] = mid(stUrlInfo["authority"], stAuthInfo.pos[6], stAuthInfo.len[6]);
      // get the port
      if( stAuthInfo.len[8] gt 0 ) stUrlInfo["port"] = mid(stUrlInfo["authority"], stAuthInfo.pos[8], stAuthInfo.len[8]);
      // get the username
      if( stAuthInfo.len[3] gt 0 ) stUrlInfo["username"] = mid(stUrlInfo["authority"], stAuthInfo.pos[3], stAuthInfo.len[3]);
      // get the password
      if( stAuthInfo.len[5] gt 0 ) stUrlInfo["password"] = mid(stUrlInfo["authority"], stAuthInfo.pos[5], stAuthInfo.len[5]);
   }
   // the query string in struct form
   stUrlInfo["params"]["segment"] = structNew();
   // if the path contains any parameters, we need to parse them out
   if( find(";", stUrlInfo["path"]) gt 0 ){
      // this is for the segments in the path
      stSegInfo = reFindNoCase(sSegRegEx, stUrlInfo["path"], 1, true);
      // loop through all the segments and build the strings
      while( stSegInfo.pos[1] gt 0 ){
         // build the path, excluding parameters
         sPath = sPath & mid(stUrlInfo["path"], stSegInfo.pos[2], stSegInfo.len[2]);
         // if there are some parameters in this segment, add them to the struct
         if( stSegInfo.len[4] gt 0 ){
            // put the parameters into an array for easier looping
            aQSPairs = listToArray(mid(stUrlInfo["path"], stSegInfo.pos[4], stSegInfo.len[4]), ";");
            // now, loop over the array and build the struct
            for( i=1; i lte arrayLen(aQSPairs); i=i+1 ){
               sKeyPair = aQSPairs[i]; // current pair
               sKey = listFirst(sKeyPair, "="); // current key
               // make sure there are 2 keys
               if( listLen(sKeyPair, "=") gt 1){
                  sValue = listLast(sKeyPair, "="); // current value
                  // fix bug in urlDecode w/single % signs
                  sValue = reReplaceNoCase(sValue, "%([^A-F0-9{2}])", "%25\1", "all");
                  // url decode the value
                  sValue = urlDecode(sValue, sCharSet);
               } else {
                  sValue = ""; // set blank value
               }
               // check if key already added to struct
               if( structKeyExists(stUrlInfo["params"]["segment"], sKey) ) stUrlInfo["params"]["segment"][sKey] = listAppend(stUrlInfo["params"]["segment"][sKey], sValue); // add value to list
               else structInsert(stUrlInfo["params"]["segment"], sKey, sValue); // add new key/value pair
            }
         }
         // get the ending position
         i = stSegInfo.pos[1] + stSegInfo.len[1];
         // get the next segment
         stSegInfo = reFindNoCase(sSegRegEx, stUrlInfo["path"], i, true);
      }
   } else {
      // set the current path
      sPath = stUrlInfo["path"];
   }
   // get the file name
   stUrlInfo["file"] = getFileFromPath(sPath);
   // get the directory path by removing the file name
   if( len(stUrlInfo["file"]) gt 0 ){
      stUrlInfo["directory"] = replace(sPath, stUrlInfo["file"], "", "one");
   } else {
      stUrlInfo["directory"] = sPath;
   }
   // the query string in struct form
   stUrlInfo["params"]["url"] = structNew();
   // if query info was supplied, break it into a struct
   if( len(stUrlInfo["query"]) gt 0 ){
      // put the query string into an array for easier looping
      aQSPairs = listToArray(stUrlInfo["query"], "&");
      // now, loop over the array and build the struct
      for( i=1; i lte arrayLen(aQSPairs); i=i+1 ){
         sKeyPair = aQSPairs[i]; // current pair
         sKey = listFirst(sKeyPair, "="); // current key
         // make sure there are 2 keys
         if( listLen(sKeyPair, "=") gt 1){
            sValue = urlDecode(listLast(sKeyPair, "=")); // current value
         } else {
            sValue = ""; // set blank value
         }
         // check if key already added to struct
         if( structKeyExists(stUrlInfo["params"]["url"], sKey) ) stUrlInfo["params"]["url"][sKey] = listAppend(stUrlInfo["params"]["url"][sKey], sValue); // add value to list
         else structInsert(stUrlInfo["params"]["url"], sKey, sValue); // add new key/value pair
      }
   }
   // return the struct
   return stUrlInfo;
}
</cfscript>
NOTE:
UPDATE 2006-09-27: Updated the code to allow for custom charsets for the urlDecode() function. I found that when using the urlDecode() method on data coming from the database or xml document, it may have encoding different than the URL scope. By default CFMX uses the URL encoding for the urlDecode() format.

The example below shows how the URI would get broken down:

Example

http://usr:pwd@www.foo.com:80/bar/sub/file.gif;p=5?q1=item1&q1=item2&q2=item3#nameAnchor

authority usr:pwd@www.foo.com:80 directory /bar/sub/ domain www.foo.com file file.gif fragment nameAnchor params [struct] segment [struct] p: 5 url [struct] q1: item1,item2 q2: item3 password: pwd path: /bar/sub/file.gif;p=5 port: 80 query: q1=item1&q1=item2&q2=item3 scheme: http username: usr

Comments

Jeff S.'s Gravatar This could definately help me. For instance, I've written a product browser in javascript that lets you browse products faster by using DHTML techniques. The only problem is, hitting the back button in your browser sends you to the last page you were at (not the last "virtual page"). To get around this, I'm going to add anchor tags to the URI each time you select to browse to a new "virtual page". I already had some simple/sloppy javascript to grab the anchor value, but I'll probably just make use of this code....
Rob Wilkerson's Gravatar I have to congratulate you on this. I've been needing (and thinking about creating) this function for about 6 months and, while the regex isn't difficult, I found it completely impossible to read the entire RFC without stabbing myself in the eye (rendering further reading even less enjoyable).

For that alone you should receive some sort of medal.
Dan B.'s Gravatar Awesome function, very glad I didn't have to write it :)
Dan's Gravatar FYI --- if your looking for a VERY similar function in javascript, check this page out:
http://badassery.blogspot.com/2007/02/parseuri-spl...

Add Comment



If you subscribe, any new posts to this thread will be sent to your email address.