09 July 2010

File Scrubbing – whether you want to, or not

As an EDI guy, I’m supposed to get to work in industry-standard formats.  Since I have mostly done healthcare work in the past, this primarily means the ANSI X12 4010A1 standard.  However, in the real world, the standard doesn’t mean much.

There are clients with proprietary formats, vendors with old versions of the standard, and just plain screw-ups that we deal with on a day-to-day basis.

So, with that in mind, I bring you the first in a new occasional series of posts: Public Code Review.  The following code is a scrubber I created to handle known, recurring issues with client and vendor files.  It uses Regular Expressions to find said known issues, and can either just remove them, or replace them.

First, I created two interfaces: IScrubber and IConfigurable.  IScrubber is the interface which will do most of the work; IConfigurable just allows anyone else who wants to use this code to use their own configuration to set it up.

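    interface IScrubber
    {
        // Removes every match of the regex from the original text.
        string Scrub(string original, string match);

        // Replaces every match of the regex with the replacement text.
        string Replace(string original, string match, string replacement);
    }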

    interface IConfigurable
    {
        // Loads the scrubber's rules from whatever source the implementer chooses.
        void Configure();
    }

Next comes the ScrubRule class.  This simply holds two strings: the Regex to match the errors, and the replacement string.

    class ScrubRule
    {
        // Regex pattern that identifies the known issue.
        public string Match { get; private set; }

        // Text substituted for each match (empty to remove it).
        public string Replacement { get; private set; }

        public ScrubRule(string m, string r)
        {
            Match = m;
            Replacement = r;
        }
    }

With our base items created, I can now create the actual scrubber.  In this case, I’ve called it “BasicScrubber.”  It implements both IScrubber and IConfigurable.

    using System.Collections.Generic;
    using System.Configuration;
    using System.Text.RegularExpressions;

    public class BasicScrubber : IConfigurable, IScrubber
    {
        private readonly List<ScrubRule> rules;

        // Note: configPath is accepted but not used yet; Configure() reads
        // the application's default app.config.
        public BasicScrubber(string configPath)
        {
            rules = new List<ScrubRule>();
            Configure();
        }

        // Builds one rule per appSettings entry: key = regex, value = replacement.
        public void Configure()
        {
            foreach (string s in ConfigurationManager.AppSettings.AllKeys)
            {
                ScrubRule sr = new ScrubRule(s, ConfigurationManager.AppSettings[s]);
                rules.Add(sr);
            }
        }

        // Removes every match of the pattern.
        public string Scrub(string original, string match)
        {
            Regex rx = new Regex(match);
            return rx.Replace(original, string.Empty);
        }

        // Replaces every match of the pattern with the replacement text.
        public string Replace(string original, string match, string replacement)
        {
            Regex rx = new Regex(match);
            return rx.Replace(original, replacement);
        }
    }

As you can see, the scrubber gets its configuration (in this case) from the System.Configuration.ConfigurationManager class, which pulls from app.config:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="RegexHere" value="ReplacementValueHere" />
  </appSettings>
</configuration>

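So if we replace “RegexHere” with the Regex: (?<=\.\d*)0+(?=\D|$)

and replace “ReplacementValueHere” with “” (an empty string),

we get a scrubber rule which will trim trailing zeros after a decimal point.  One catch: inside app.config the < in the pattern has to be written as &lt;, because a literal < is not allowed in an XML attribute value.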
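
To make that rule concrete, here is a minimal sketch that runs the same pattern outside the scrubber.  The class and method names (TrailingZeroDemo, Trim) are made up for illustration and are not part of the code above.  It relies on .NET allowing a variable-length lookbehind, which is what lets \d* appear inside (?<=...).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class TrailingZeroDemo
{
    // The pattern from the post: a run of zeros that sits after a decimal
    // point (with any digits in between) and is followed by a non-digit
    // or by the end of the input.
    public const string Pattern = @"(?<=\.\d*)0+(?=\D|$)";

    // Removes every match of the pattern, like IScrubber.Scrub would.
    public static string Trim(string input)
    {
        return Regex.Replace(input, Pattern, string.Empty);
    }
}
```

With this sketch, Trim("75.50") gives "75.5" and Trim("1.050") gives "1.05", while Trim("100") is left alone because there is no decimal point.  Note that Trim("150.00") gives "150." with the decimal point still attached, so a second rule may be wanted if that matters for your files.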

Wire this class up to a Windows or console app, point it at the file in error, and let it go.
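
The wiring might look something like the sketch below.  It is an assumption about how the pieces fit together, not the original app: RuleRunner, ApplyAll and ScrubFile are hypothetical names, and the rules are passed in as plain pattern/replacement pairs so the example stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical driver, for illustration only.
public static class RuleRunner
{
    // Applies each (pattern, replacement) pair to the text, in order.
    public static string ApplyAll(string text, IEnumerable<KeyValuePair<string, string>> rules)
    {
        string result = text;
        foreach (KeyValuePair<string, string> rule in rules)
        {
            // A null replacement is treated as "remove the match".
            result = Regex.Replace(result, rule.Key, rule.Value ?? string.Empty);
        }
        return result;
    }

    // Reads the whole input file, scrubs it, and writes the result to a new file.
    public static void ScrubFile(string inputPath, string outputPath,
        IEnumerable<KeyValuePair<string, string>> rules)
    {
        string original = File.ReadAllText(inputPath);
        File.WriteAllText(outputPath, ApplyAll(original, rules));
    }
}
```

Rules run in the order they are supplied, so a later rule sees the output of the earlier ones.  Writing to a separate output file keeps the original around in case a rule turns out to be too greedy.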

One of the great things about Regex is its speed at just this kind of process.  Before I started using Regex, I tried using basic string manipulation with string.Replace().  The problem is that when you start playing with special characters, or if something is off just a little bit, string.Replace() is a little unreliable for my tastes.  Additionally, it’s slow.  Running string comparisons and manipulations against a normal X12 835 file used to take a couple of minutes.  With Regex, it’s seconds.  As in, two or three, not thirty or forty.

So, let me know what you think.  This code should be highly portable.  Without much effort, it can be database driven instead of app.config driven, or you can even configure it in some custom way.
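
As one example of a custom source, here is a rough sketch that reads rules from tab-delimited text instead of app.config.  Everything in it (IRuleSource, TabDelimitedRuleSource, the one-rule-per-line format) is invented for illustration; a database-backed version could presumably expose the same kind of method.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical seam: anything that can hand back (pattern, replacement) pairs.
public interface IRuleSource
{
    IEnumerable<KeyValuePair<string, string>> LoadRules();
}

// Reads one rule per line in the assumed form "pattern<TAB>replacement".
public class TabDelimitedRuleSource : IRuleSource
{
    private readonly TextReader reader;

    public TabDelimitedRuleSource(TextReader reader)
    {
        this.reader = reader;
    }

    public IEnumerable<KeyValuePair<string, string>> LoadRules()
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0) continue; // skip blank lines

            int tab = line.IndexOf('\t');
            // A line without a tab means "remove the match": empty replacement.
            string pattern = tab < 0 ? line : line.Substring(0, tab);
            string replacement = tab < 0 ? string.Empty : line.Substring(tab + 1);
            yield return new KeyValuePair<string, string>(pattern, replacement);
        }
    }
}
```

Taking a TextReader rather than a file path means the same class works with a StreamReader over a file or a StringReader in a test.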
