|
|
Data - The Link Between Current and Past Problems
by Roy Youngman on Jan 11, 2008 - 08:47 AM read 972 times Source: http://www.ryoungman.net/?p=11 |
|
My good buddy Vaughan Merlyn wrote a great post yesterday in which he draws parallels between the concept of Information Centers (generally, an 80s thing) and the need to empower business people today to be more effectively self-serving with technology. His reference to FOCUS reminded me how for several years in my career I was considered the FOCUS guru of my company, and therefore an Information Center superstar. I humbly admit, I was pretty good. But the secret of my success was not my programming skills, it was my data skills.
In fact, my programming skills actually slowed me down initially. Like anyone my age, I learned procedural programming first (COBOL, PL/I, FORTRAN). In other words, one record (a row to you young kids) at a time: read a record, decide what to do with it, read the next record. But FOCUS was one of the initial nonprocedural languages. To be good at it, you had to think in terms of sets of data at a time: create a set of rows that share common characteristics and JOIN that set to another based on a common fields (columns) between the two sets. I remember that it took a couple of months of trial and error before I really overcame my procedural ways. Most of the people I helped never did whether they were programmers or end users. And FOCUS would punish you for it. You see, FOCUS would allow you to do record-by-record processing, but in a convoluted, complex way (the DEFINE section as opposed to the TABLE section for you nostalgic FOCUS users). Amazingly, most people preferred this to non-procedural methods. They would come to me with some of the most difficult code to try to decipher imaginable. But I developed a great reputation for solving their problems fast in spite of my unusual methods. They expected me to somehow take their mess and fix it, but I almost always could get the job done more quickly by starting over.
The reason why most people wrote bad FOCUS queries was bad data. Few people understood this at the time. They were getting errors or incorrect results, but they thought it was a programming problem or a bug in FOCUS. But I had a strong data modeling and data base design background, so I was well-situated to understand the root of the problem. For example, a common data problem of the day was having one field (column) that changed meaning (a 2nd normal form violation for those who are academically inclined).
Illustration: For exempt employees, pay rate means annual gross pay but for non-exempt employees, it means hourly rate.
The typical reaction to an illustration like this was to process each record at a time, determine what kind of employee it was, and calculate the appropriate pay rate per record. This would be simple enough if that was the only data problem or the only thing they wanted to do. But alas, there were always many data problems and always a need to do more combining and matching of data and so the procedural logic becamemore and more complex. The unnatural solution that always worked for me was to first create sets of good clean data, and second combine them to answer the business question the user had. In the example above, Id run two queries: one to create a table of exempt employees and one to create a table of non-exempt employees. Note: no procedural code required at all (in fact, I remember you could tell how bad the data was by how much code someone had in the DEFINE section of a FOCUS query I empathized with E. F. Codd when he protested against the inclusion of cursors in SQL which broke the 12th of his original 12 Rules of a Relational Database).
I really got good at it when I started assuming data was bad and had to prove to myself it was actually good. I remember I had several standard queries in my bag of tricks that would tell me a ton of information about the data someone wanted to use mostly problems that all had to be fixed before we could address the actual business question the user was trying to answer. Always never an exception the user was surprised. For example, Id always run a domain of values query on every field in use.
“Hey, Bob. Ive discovered that in your file of 1,000 records, Gender_Code has 442 Ms, 417 Fs, 112 Xs, and the rest are blank. I dug into the Xs a bit and noticed none of them have anything in the Employee_Name field which really makes me wonder if they are even Employees or something else altogether (turns out it was a placeholder for contractors someone needed to track in the system that created the data). The blank ones also seem to have a lot of fields that are blank for some reason (turns out someone created an additional record for employees that had two addresses and didnt see any need to duplicate the previously entered data since the two records always showed up with each other on their CRT screen).“
“Damn, Roy you are one great FOCUS programmer!”
Later, I started teaching classes on my techniques. I created an example from live data. First, I broke a 1st normal form rule and asked everyone to solve a basic problem using FOCUS. Then I did the same thing breaking a 2nd normal form rule and again breaking a 3rd normal form rule. Afterwards, I provided the data clean and in third normal form and gave the same assignments. The first three assignments always took long, produced a wide variety of results, and required many, many lines of code. The last assignment with clean data was consistently finished by all students in a couple minutes, always yielded the same, correct results, and had very few lines of code that were simple to understand. Strangely, business users caught on faster than programmers who just couldnt get over their record-at-a-time frame-of-reference.
Okay, enough about how great I was. How much have things changed in the past 20 or 30 years relative to understanding the contextual meaning of enterprise data? Relative to how much faster CPUs have gotten or how much cheaper disk space has gotten, Id say, not that much. I still get the chance every now and then to help a client or a friend stroll through a data analysis problem. The techniques I applied 25 years ago are still needed because the data is unfortunately, still usually bad.
At the heart of the matter then as is now is understanding the context and meaning of data resources, organizing those services that provide access to data, and designing some sort of architecture to minimize bad data in the first place. This has been, is currently, and will always be the big issue in self-service of information no matter what other technology changes take place.
-
By: ryoungman
a reply to Data - The Link Between Current and Past Problems
by Roy Youngman on May 24, 2008 - 08:09 AM read 60 times
Source: http://www.ryoungman.net/?p=11#comment-33
Not surprisingly, while I was discovering and solving bad data problems, many others were too. The process now even has a name and a Wikipedia article: Data Profiling.
There is a short list of commercial products offering data profiling capabilities. Not surprisingly, some are associated with ETL tools and others with DBMS products. Oracle provides some capability with its Warehouse Builder product. IBM has a product called WebSphere Information Analyzer. Informatica has a product called Data Explorer. There doesn’t seem to be any open source alternatives I can find. If anyone knows of any, please comment. There is a company called DataFlux that seems pretty focused on data profiling. They have an active blog going that offers up a lot of good advise on the subject matter as well.


