Tuesday, 22 November 2011

What data does CIAWF actually offer?

The CIA World Factbook contains 180 raw data files in the rankorder subfolder. However, not all of these files contain data (e.g. rawdata_2005.txt is actually empty). Ordering the rawdata files by size shows that some 60+ files are empty. For the remaining ~120 files, it would be good to know what the data is, since the rawdata files simply contain rows of tab-separated numbers with no metadata.

An index for selecting files from the web-site is available in ..\rankorder\rankorderguide.html. Hopefully this file only links to files containing useful data - in which case we can use scripted datasets to query this file and extract the relevant title for each 4-digit code.


Examining the file, we find a series of references to the individual data files:

The general format for the links to the detail files is along the lines of:
<a href="../rankorder/NNNNrank.html">TTTTTTTTTTTTT:</a>

where NNNN is the code used for the file and TTTTTTTTTTTTT a descriptive title. If we load the file into memory, we can then search the content for this pattern using regular expressions.

The groups that build up our regular expression are:
  • a component before the code : <a href="../rankorder/
  • the 4-digit code
  • a component between the code and the title : rank.html">
  • the title
  • a terminating component : :</a>

Representing this as a regular expression inside a JavaScript string is messy, as JavaScript strings use \ to introduce escape sequences, as do regular expressions. So, each \ in the regular expression needs to be written as \\ in the JavaScript string. Similarly, quotation marks need to be escaped in the JavaScript string. So, our final regular expression to search for details of a file is:
(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)
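
As a quick sanity check, here is a minimal sketch that builds the expression from the string above and runs it against a single link. The sample line and its title are made up for illustration (only the 2002 code comes from the index), and the variable names are hypothetical.

```javascript
// Pattern text exactly as it is written inside a JavaScript string literal.
var regexptxt = "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)";
var linkRegexp = new RegExp(regexptxt);

// Hypothetical sample line in the format used by rankorderguide.html.
var sampleLine = "<a href=\"../rankorder/2002rank.html\">Population growth rate:</a>";

// exec returns the overall match followed by one entry per parenthesised group.
var parts = linkRegexp.exec(sampleLine);

var sampleCode = parts[2];   // second group: the 4-digit code
var sampleTitle = parts[4];  // fourth group: the title, without the trailing colon
```

The two groups we care about are at index 2 and 4; index 0 holds the whole matched link.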

This will return an array of 6 items - an overall matching string and the five groups of characters defined by the parentheses. There are two main ways to return matching data for a global regular expression: string.match(regexp) and regexp.exec(string). The former returns an array of all the matching strings (without the individual groups), whereas the latter returns the next match, including its groups (or null if there are no more matches). Of the two approaches, regexp.exec(string) is appropriate for use in the fetch method of a scripted dataset.
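
To make the difference concrete, here is a small sketch comparing the two calls on a two-line sample. The sample text, the 2001 code and both titles are invented for illustration.

```javascript
// Hypothetical two-line extract in the style of rankorderguide.html.
var guideSample =
    "<a href=\"../rankorder/2001rank.html\">Hypothetical first title:</a>\n" +
    "<a href=\"../rankorder/2002rank.html\">Hypothetical second title:</a>\n";

var globalRegexp = new RegExp(
    "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)", "g");

// string.match with a global expression: every matching string, but no groups.
var allMatches = guideSample.match(globalRegexp);

// regexp.exec: one match per call, with groups, resuming from lastIndex.
globalRegexp.lastIndex = 0;  // make sure we start from the top of the text
var codes = [];
var titles = [];
var nextMatch;
while ((nextMatch = globalRegexp.exec(guideSample)) !== null) {
    codes.push(nextMatch[2]);
    titles.push(nextMatch[4]);
}
```

Because exec hands back one match at a time and remembers its position, it maps naturally onto a method that is called once per row.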

Now that we have a plan, we can write the Scripted Dataset to access the data.




As we are just interested in two string columns, the describe method is fairly straightforward:
    this.addDataSetColumn("code","STRING");
    this.addDataSetColumn("heading","STRING");
    return true;

In the open method, we read in the file using a subfunction and set up the global regular expression:

    importPackage(java.io);

    // Log which data source and dataset are being opened
    java.lang.System.out.println("open:" + this.getDataSource().name + ":" + this.name);
    fileguide = "C:\\Users\\username\\Desktop\\Data\\CIA World Factbook\\factbooka2011\\rankorder\\rankorderguide.html";
    // rankindex and ciawf_rank_regexp are left global (no var) because the fetch script uses them
    rankindex = loadFile(fileguide);
    var regexptxt = "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)";
    ciawf_rank_regexp = new RegExp(regexptxt, "g");

    // Read the whole file into a string, one character at a time
    function loadFile(filename) {
        var file = new java.io.File(filename);
        var reader = new FileReader(file); // Create a FileReader object
        var filedata = "";
        var newdatabyte = reader.read();
        while (newdatabyte != -1) {
            filedata = filedata + String.fromCharCode(newdatabyte);
            newdatabyte = reader.read();
        }
        reader.close(); // Release the file handle once everything is read
        return filedata;
    }

Finally, the fetch method is very simple. Having set up the global regular expression and loaded the file, each fetch simply executes the regular expression to get the next match and populates the current row accordingly. If no match is found, the regular expression returns null, in which case we simply return false from the fetch method, indicating that all rows have been returned.

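    // exec on a global regular expression resumes where the previous match ended
    strpos = ciawf_rank_regexp.exec(rankindex);
    if (strpos != null) {
        row.code = strpos[2];     // second group: the 4-digit code
        row.heading = strpos[4];  // fourth group: the title
        return true;
    } else {
        return false;             // no more matches, so no more rows
    }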
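
The fetch logic can be tried outside the report engine with a small stand-in harness. This is only a sketch: fetchRow, sampleIndex and the loop that plays the part of the engine are hypothetical, as are the 2001 code and both titles. It assumes the engine keeps calling fetch until it returns false.

```javascript
// Hypothetical index text standing in for the loaded rankorderguide.html.
var sampleIndex =
    "<a href=\"../rankorder/2001rank.html\">Hypothetical first title:</a>\n" +
    "<a href=\"../rankorder/2002rank.html\">Hypothetical second title:</a>\n";

var sampleRegexp = new RegExp(
    "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)", "g");

// Same logic as the fetch method, written as a plain function taking the row.
function fetchRow(row) {
    var found = sampleRegexp.exec(sampleIndex);
    if (found === null) {
        return false;  // no more matches, so no more rows
    }
    row.code = found[2];
    row.heading = found[4];
    return true;
}

// Stand-in for the engine: keep fetching until fetchRow reports it is done.
var rows = [];
var currentRow = {};
while (fetchRow(currentRow)) {
    rows.push({ code: currentRow.code, heading: currentRow.heading });
    currentRow = {};
}
```

One detail worth knowing: once exec on a global expression returns null, its lastIndex goes back to 0, so a further call would start again from the top of the text.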

This allows us to build a list of available CIAWF data directly from the index file. Although 180 files exist in the rankorder folder, and we know that ~60 of them are empty, we only get 62 records returned from the dataset.


So, there is something awry. We were expecting another 60 files to be accessible. Noting that code 2002 (population growth) isn't listed, we can search the index source code for that code and see if that helps.
Yes! It helps! The problem is that the "." wild-card in the regular expression doesn't match a new-line character, so we need to tweak the expression. Rather than "." we can use [^<] to match all characters except a left angle-bracket / less-than sign. This update returns dataset 2002... but that's the only one that's added, taking us from 62 records to 63. We still aren't finding details of almost 60 files. Loading the source of the rankorderguide.html file into Eclipse, we can run a regular-expression search for \d\d\d\drank.html. This produces 63 matches, so our scripted dataset did find all the data! Maybe there's a reason those files aren't in the index... maybe we need to look more at what they actually contain.
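
The wild-card problem is easy to reproduce in isolation. The sketch below assumes the tweak amounts to swapping (.+) for ([^<]+) in the title group. The sample link, with a line break inside the title, is invented for illustration. The last few lines repeat the Eclipse-style count in script form.

```javascript
// Original expression: "." cannot cross a line break.
var dotRegexp = new RegExp(
    "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)", "g");

// Tweaked expression (assumed form): [^<] matches anything except "<", line breaks included.
var notLtRegexp = new RegExp(
    "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)([^<]+)(\\:<\\/a>)", "g");

// Hypothetical link whose title is split across two lines.
var wrappedLink = "<a href=\"../rankorder/2002rank.html\">Population\ngrowth rate:</a>";

var dotResult = dotRegexp.exec(wrappedLink);      // null: the title spans a line break
var notLtResult = notLtRegexp.exec(wrappedLink);  // matches, title includes the line break

// Collapse the embedded line break so the heading reads as a single line.
var wrappedTitle = notLtResult[4].replace(/\s+/g, " ");

// Independent check: count the link targets without caring about the titles.
var hits = wrappedLink.match(/\d\d\d\drank\.html/g);
var hitCount = hits === null ? 0 : hits.length;
```

Comparing hitCount with the number of rows the dataset returns is a cheap way to confirm that the title pattern isn't silently dropping links.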
