Data munging and data migration come with big challenges — and big rewards

Throughout the past decade, I’ve dealt with innumerable data munging (a bona fide technical term, folks) tasks. Whether it’s migrating an ancient database into a modern database or dragging massive data sets through new processing tools or a myriad other tasks, taking data in one form and turning it into data in a new form happens with massive frequency, every single day. It’s still magic to the vast majority of folks, IT included.

Let’s take a few minutes and try to explain how this whole thing works, at least on a high level. You may find that referencing this helps communicate delicate back-end processes to those who are lightly technical.

It all started with Excel
Let’s pick an unfortunately common situation: the dreaded Excel spreadsheet. Some time ago in a company far, far away, it was determined that data needed to be collected pertaining to a business process — inventory, sales, customers, what have you. Lacking the proper tools, someone cooked up an Excel spreadsheet and went nuts with it. As time passed and thousands of records were amassed, the spreadsheet became less and less functional, and the decision was finally made to turn this intos a real database. A consulting group was hired or an internal resource defined, and someone was handed the task.

The first order of business is to inspect the cleanliness of the data itself. In a perfect world, the spreadsheet is laid out sort of like a database, with each common field in its own column — say, First Name, Last Name, Street, City, and so forth. However, that’s not always the way it works. There might be a top-down layout with the information contained in individual rows under the same column, such as a column Contact with separate rows for full name, company, address, phone number, and so forth. The next column over might be Last Order or Sales 2012 or other data. This presents a much more challenging problem.

Let’s look at the first situation because it’s definitely the easiest. The data is relatively clean and well organized, and it can be exported to CSV and run through a custom parser to translate it into a database. A good CSV parser will pull all of those records into an array that can be peeled apart record by record and inserted into the new database. Within that process, checks of the data can be made, and the data can be modified to better fit the format of the new database.

For instance, we might run regexes over a Phone Number field to turn various phone number formats into one standard. This requires tossing all the special characters and reformatting the resulting string before it gets inserted into the new database. The operation will turn numbers such as (212) 555-1212, 212-555-1212, 2125551212, 212 555 1212, or 212.555.1212 into the likes of (212) 555-1212 for each record, which will help with readability and searching later on.

We might strip these with a regex such as /[^0-9]+/ and reassemble them after splitting them up with a regex like /([0-9]{3})([0-9]{3})([0-9]{4})/, with the resulting matches holding 212, 555, and 1212. We can now reformat the phone number however we like. We should also bail out if we encounter a number that cannot be a phone number because it has too many or too few digits.

The free-form free-for-all
It gets dicier as we move into more free-form fields. Addresses are particularly picky since they can be formatted in a wide variety of ways. We also need to deal with the vagaries of street and city names. We need to make sure that we correctly handle “Washington,DC,” “Washington, DC,” and “Washington DC” properly, along with oddities such as “Winston-Salem, NC,” “King of Prussia, PA,” “Scranton, Penn.,” “N. Providence RI,” “Houston, tx,” and “O’Fallon, IL.”

Those sorts of variations can trip up parsers if they’re not accounted for because we can’t strip special characters. Additionally, we can’t count on a set number of spaces defining the city versus the state or the state abbreviation being present or capitalized. Thus, we need to construct a conditional expression to push that string through to make a best-effort determination of the actual city and state, perhaps even checked against a database consisting of every city and state in the United States. Depending on the result, we may still need to bail out on that record if no definitive determination can be made or at least throw a flag that the record in question needs to be manually checked.

We’ve only begun to scratch the surface. We have a significant amount of work invested in figuring out just the city, state, and phone number of each record. We need to rinse and repeat for every other field in the spreadsheet, depending on the content.

This mess is a direct result of unconstrained free-form data entry, and it plagues every company everywhere. It doesn’t have to be Excel, either. It could be Access, a homegrown database, or any other application. Unless there are checks against the validity and formatting of data upon input, the data will likely be a mess. Of course, that’s the point of building a proper database front end to handle data input: We can clean and groom the data on the way in, which greatly enhances the accuracy and usability of that data on an ongoing basis. That’s one of the major benefits of using databases in the first place.

However, we cannot discount the effort involved in postprocessing these types of data sets. All kinds of tools have been developed to ease this process, but they will not apply to every case. While they may work for a portion of the input data, the parts that are missed may make their use more problematic than not.

Work of this nature is tedious and exceptionally detail oriented. It requires lots of manual data inspection, test runs, debugging, and forward thinking on the part of the developer working on the project. When all is said and done, the results are almost guaranteed to be worth the effort.

Working with clean data makes everything easier. Just don’t underestimate how challenging the journey to clean data may be.

Leave a Reply

Your email address will not be published. Required fields are marked *