What are "Inverted" or "Transposed" Data Files

Standard data files look like this:

                variable1   variable2   variable3   . . .   variable1000
respondent 1        x           x           x       . . .        x
respondent 2        x           x           x       . . .        x
respondent 3        x           x           x       . . .        x
respondent 4        x           x           x       . . .        x
respondent 5        x           x           x       . . .        x
respondent 6        x           x           x       . . .        x
respondent 7        x           x           x       . . .        x
     .              x           x           x       . . .        x
     .              x           x           x       . . .        x
     .              x           x           x       . . .        x
respondent 1000     x           x           x       . . .        x


To analyze variable 2 against variable 693, most software packages must read each respondent record in full, note the values of the two variables under study, throw away the information about the other 998 variables, and then go read the next respondent record. Only when all the respondent records have been read can the analysis be completed.
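
Here is a minimal sketch of that row-wise approach in Python (the file name survey.csv and its comma-delimited layout are illustrative assumptions, not any particular package's format):

    import csv

    # Read every respondent record in full, but keep only two values
    # per record.  Columns are 0-based, so variable 2 is index 1 and
    # variable 693 is index 692.
    pairs = []
    with open("survey.csv", newline="") as f:
        for record in csv.reader(f):
            pairs.append((record[1], record[692]))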

The problem is that this involves a lot of what the computer world calls I-O, or Input-Output. The software has done a great deal of extra reading, probably from a magnetic disk or a floppy disk, just to throw MOST of the information away. In the last 10 years the computing speed of all kinds of computers has increased by leaps and bounds: compare an IBM PC/AT from 1985 to a PC with an Intel Pentium CPU chip today, and the difference in speed on real computing jobs is probably a factor of 10-20 times or more. The speed of Input-Output, however, has not increased nearly as dramatically. While techniques like "buffering" and "caching" can save time on repeated I-O, the initial read is probably only two or three times as fast as it was in 1985.

If one knows that only a small fraction of the data file will be needed at any one time, then the file can be prepared so that this kind of usage is quite fast. Of course, nothing is free in this world! The investment of effort that makes later usage go rapidly must be made in advance, ideally when time is not a factor, so that the file is ready to access the moment it is needed.

This is called "Inverting" or, more sensibly, "Transposing" the data file. It simply means rotating the file structure 90 degrees so that the rows of the data matrix are the variables and the columns are the respondents. Once this is done, the rows can be "indexed" and accessed separately.
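
As a sketch of that advance preparation in Python (continuing with the hypothetical survey.csv, and inventing survey.T and survey.idx as output names), the one-time transposing job might look like this:

    import csv
    import json

    # One-time job, done in advance: read the standard file once and
    # turn each column (one variable) into a row.
    with open("survey.csv", newline="") as f:
        columns = list(zip(*csv.reader(f)))

    # Write one line per variable, remembering where each line starts.
    offsets = {}
    with open("survey.T", "wb") as out:
        for var_num, values in enumerate(columns, start=1):
            offsets[var_num] = out.tell()   # byte offset of this variable's row
            out.write((",".join(values) + "\n").encode("ascii"))

    # The saved offsets are the "index" that makes each row addressable.
    with open("survey.idx", "w") as idx:
        json.dump(offsets, idx)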

Here is a picture of a transposed data file:

                respdnt1    respdnt2    respdnt3    . . .   respdnt1000
variable 1          x           x           x       . . .        x
variable 2          x           x           x       . . .        x
variable 3          x           x           x       . . .        x
variable 4          x           x           x       . . .        x
variable 5          x           x           x       . . .        x
variable 6          x           x           x       . . .        x
variable 7          x           x           x       . . .        x
     .              x           x           x       . . .        x
     .              x           x           x       . . .        x
     .              x           x           x       . . .        x
variable 1000       x           x           x       . . .        x

This means that, in our example above, only two out of 1000 variables need be read from disk. That is a savings of 500X over reading the entire file! (Actually, modern computers read data in blocks, so some of the adjacent variables are probably read in automatically along with the ones requested, but the savings is still remarkable.)
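
A sketch of that two-variable query against the transposed file, using the hypothetical survey.T and survey.idx files built earlier:

    import json

    with open("survey.idx") as idx:
        offsets = json.load(idx)   # JSON keys come back as strings

    def read_variable(var_num):
        # Seek straight to one variable's row; the other 998 rows on
        # disk are never read at all.
        with open("survey.T", "rb") as f:
            f.seek(offsets[str(var_num)])
            return f.readline().decode("ascii").rstrip("\n").split(",")

    values_2 = read_variable(2)
    values_693 = read_variable(693)

Two seeks and two line reads replace a pass over the whole file.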

Suppose that the data represented 50,000 households on 5,000 variables. That is 250,000,000 data values; stored as even four characters each, such a data file runs to 1,000,000,000 bytes or characters of data. Let us say it another way: a billion numbers or letters. How about the idea of a full GIGABYTE of disk storage? How long would it take to read the data file?
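
To make the arithmetic concrete, here is the back-of-the-envelope calculation in Python (the four characters per value and the 5 megabyte-per-second transfer rate are assumptions for illustration, not measurements):

    # Size of the example file, and the time for one full pass over it.
    households = 50_000
    variables = 5_000
    bytes_per_value = 4                  # assumed field width

    total_bytes = households * variables * bytes_per_value
    print(total_bytes)                   # 1,000,000,000 bytes: a gigabyte

    disk_rate = 5_000_000                # bytes/second, an assumed disk speed
    print(total_bytes / disk_rate / 60)  # about 3.3 minutes per full pass

At that rate, scanning the whole file for every query means minutes of waiting per question, while fetching two indexed variable rows is a matter of a fraction of a second.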

The best way to evaluate a software program's speed is to actually try it with a big data set and a variety of queries against a stopwatch. A real-world test is much more convincing than a couple of diagrams like these. Contact DATAN and arrange a timed test with standard and inverted data structures and see for yourself!