December 27, 2011
Importing “Ugly” Data
Perhaps this blog's title is a misnomer. When I refer to "ugly" data I don't mean the actual data, I am referring to the file format that contains the data, but "Importing Data With a Less-Than-Cosmetic File Structure" just doesn't have much of a ring to it.
I'm a big fan of automated data collection. What's not to like? Everything that we can do to keep operators, technicians, quality personnel, etc. from pulling, measuring and recording data helps process consistency and the bottom line. Automated is almost like "free" data, right?
Automated data can be a beautiful thing once everything is configured, but before that it can pose headaches. For customers who have used InfinityQS products to collect data from text files, you will no doubt be familiar with what a "good" file looks like. Here's an example of a "good" file that has data in neatly organized columns and rows. Notice how each column is delimited with a tab? Fantastic!
Opening this same file in your favorite spreadsheet application reveals neatly organized data. There's a column for Part, Operator, Process, Color, OD - Loc A, OD - Loc B and OD - Loc C. Using InfinityQS tools like the Data Management System (DMS), Buffer File (ProFicient) or RTI will be a snap with this "pretty" data, but what about a file with "ugly" data?
The files presented above were reformatted from a very different file, which is an actual output received from a customer's vision system. Don't worry, the names and data have been changed to protect the innocent! Here's the original file:
Notice that the Part, Process, Descriptors and Test Values are all over the place! Further, this file doesn't have the familiar delimiters like a tab or comma. Can this data be collected with InfinityQS tools? Both DMS and the buffer file method can be used with this file type.
It's all in the formatting
When I look at this file, I see the data spread across several rows because of the way my word processing software interprets the records on the screen. If you use an advanced text editor to turn on hidden characters, you will see that the end of each line has a '<13><10>' at the end of it. This '<13><10>' is the computer's way of knowing to perform a carriage return (<13>), which means go to the far left, and a line feed (<10>), which means go to the next row. Every time we press the 'Enter' key on our keyboards, these 2 characters are issued. This is done to make the text more easily deciphered by humans, but we aren't limited to this when configuring DMS or a buffer file.
What would happen if we told our configuration to use the pipe character (|) to find each column and used the exclamation mark at the very end of the file to indicate the record is complete? The record would look like this:
NAME|PN 1475-215 BLUE PART (USA).mxi<13><10>DATE|3/24/2011<13><10>TIME|16:07:45
<13><10>DATA|FACTOR|1|T|++OPERATOR|OP| +0.00000| +0.00000| +0.00000| +0.00000|
EJW<13><10>DATA|FACTOR|2|T|++MACHINE NUMBER|PHX| +0.00000| +0.00000| +0.00000| +0.00000|77<13><10>DATA|FACTOR|3|T|++COLOR|PHX| +0.00000| +0.00000| +0.00000| +0.00000|BLU<13><10>DATA|FACTOR|4|T|++NOTES|PHX| +0.00000| +0.00000| +0.00000| +0.00000|2 OFF<13><10>DATA|"A"|DIA| +2.7500| +0.00500| -0.00500| +2.7515|
<13><10>DATA|"B"|DIA| +1.5000| +0.00500| -0.00500| +1.5015|<13><10>DATA|"C"|DIA| +1.5000| +0.00500| -0.00500| +1.5040|<13><10><13><10><13><10><13><10>
Notice all of the '<13><10>' characters and how some records blend together. With the word processor's "word wrap" turned on, this still looks like a paragraph of information, but if it is put in a single line, it looks like this:
You can see that this is a single record with many characters. My computer has 2 monitors and it's still not enough to see it all on one screen!
Considering the record in this format the '<13><10>' are treated as any other characters and we now have a record with many columns. The appropriate columns need to be mapped. Here is the mapping using the DMS Grid Provider:
Mapping the columns is the first step, but other considerations need to be made. For example, the Part column now contains an entry of:
PN 1475-215 BLUE PART (USA).mxi<13><10>DATE
The record needs to be trimmed to ignore the undesirable characters. In this example, the Part is defined as 22 characters and the values can be trimmed as needed:
These same steps are taken for each field and the results are anything but ugly!
The same concepts can be applied to a buffer file used in a ProFicient project, and the columns are mapped in an identical fashion.
The same options exist to trim characters.
Once the columns are mapped and each field is trimmed, the buffer file will yield the same results as DMS.
Now that we're armed with this knowledge does it mean that any text file can be conquered? Not exactly, but it helps. What are some important considerations that need to be made with this file, or any file being imported for that matter?
- Descriptor Lengths - For the method displayed here, each descriptor must have a defined length that applies to all instances of that descriptor being imported. For example, this Part has 22 characters and all other Parts will need to have the same character count. If they don't, characters will either be omitted or added.
- Descriptor Counts - In this example there was a part, operator, process, color and 3 measured features. Considerations need to be made on how to handle variable numbers of characteristics and descriptors. There are some options that can be exercised, but the ideal case is to have a fixed number of each.
Even with all of these considerations, which can sometimes take days/weeks/months to resolve, I am still a big fan of automated data collection. This one-time pain will pay dividends for years. Happy importing!