An index for selecting files from the web-site is available in ..\rankorder\rankorderguide.html. Hopefully this file links only to files containing useful data, in which case we can use a scripted dataset to query it and extract the relevant title for each 4-digit code.
Examining the file, we find a series of references to the individual data files:
The general format for the links to the detail files is along the lines of:
<a href="../rankorder/NNNNrank.html">TTTTTTTTTTTTT:</a>
The groups that build up our regular expression are:
- a component before the code : <a href=\"../rankorder/
- the 4-digit code
- a component between the code and the title : rank\\.html\">
- the title
- a terminating component : :</a>
Representing this in a regular expression is messy, as JavaScript uses \ to represent escape sequences, as do regular expressions. So, each \ in the regular expression needs to be represented as \\ in the JavaScript string. Similarly, quotation marks need to be escaped in the JavaScript string. So, our final regular expression to search for details of a file is:
(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)
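As a quick sanity check, the pattern can be tried outside the report against a single link. The sketch below is plain JavaScript; the sample line and its title are made up for illustration and are not copied from the index file.

```javascript
// Hypothetical sample line in the format described above (illustrative only).
var sample = '<a href="../rankorder/2002rank.html">Population growth rate:</a>';

// Same pattern as in the text, written as a JavaScript string, so every
// backslash is doubled and every quotation mark is escaped.
var regexptxt = "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)";
var re = new RegExp(regexptxt);

var match = re.exec(sample);
var code = match[2];     // second group: the 4-digit code
var heading = match[4];  // fourth group: the title

console.log(code + " -> " + heading);
```

Groups 2 and 4 are the only ones we need; the other three just anchor the match to the surrounding markup.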
Now that we have a plan, we can write the Scripted Dataset to access the data.
As we are just interested in two string columns, the describe method is fairly straight-forward:
this.addDataSetColumn("code","STRING");
this.addDataSetColumn("heading","STRING");
return true;
In the open method, we read in the file using a subfunction and set up the global regular expression:
importPackage(java.io);
java.lang.System.out.println("open:"+this.getDataSource().name + ":" + this.name);
fileguide = "C:\\Users\\username\\Desktop\\Data\\CIA World Factbook\\factbooka2011\\rankorder\\rankorderguide.html";
rankindex = loadFile(fileguide);
var regexptxt = "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)";
ciawf_rank_regexp = new RegExp(regexptxt, "g");
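function loadFile(filename) {
var file = new java.io.File(filename);
var reader = new FileReader(file); // FileReader comes from the imported java.io package
var filedata = ""; // local, so it does not leak into the global scope
var newdatabyte = reader.read(); // read() returns one character code, or -1 at end of file
while (newdatabyte != -1) {
filedata = filedata + String.fromCharCode(newdatabyte);
newdatabyte = reader.read();
}
reader.close(); // release the file handle once the whole file is read
return filedata;
}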
Finally, the fetch method is very simple. Having set up the global regular expression and loaded the file, each fetch simply executes the regular expression to get the next match and populates the current row accordingly. If no match is found, exec returns null, in which case we return false from the fetch method, indicating that all rows have been returned.
strpos = ciawf_rank_regexp.exec(rankindex);
if (strpos != null) {
row.code = strpos[2];
row.heading = strpos[4];
return true;
}
else {
return false;
};
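The fetch method relies on the "g" flag: a global regular expression remembers where its last match ended (in its lastIndex property), so each call to exec carries on from that point. Here is a minimal sketch of that behaviour outside BIRT. The two-link string, the fetchRow function and the loop that calls it are stand-ins assumed for illustration, not the real index file or the report engine.

```javascript
// Hypothetical stand-in for the loaded index file (titles are illustrative).
var rankindex =
  '<li><a href="../rankorder/2001rank.html">First title:</a></li>\n' +
  '<li><a href="../rankorder/2002rank.html">Second title:</a></li>\n';

var ciawf_rank_regexp = new RegExp(
  "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)(.+)(\\:<\\/a>)", "g");

// Stand-in for the fetch method: fills `row` and reports whether a row was found.
function fetchRow(row) {
  var strpos = ciawf_rank_regexp.exec(rankindex);
  if (strpos != null) {
    row.code = strpos[2];
    row.heading = strpos[4];
    return true;
  }
  return false;
}

// Stand-in for the engine, which keeps calling fetch until it returns false.
var rows = [];
var row = {};
while (fetchRow(row)) {
  rows.push(row.code + "=" + row.heading);
  row = {};
}
console.log(rows.join(", "));
```

Without the "g" flag, exec would return the first match every time and the loop would never end.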
This allows us to build a list of available CIAWF data directly from the index file. However, although 180 files exist in the rankorder folder, and we know that ~60 of them are empty, we only get 62 records returned from the dataset.
So, there is something awry. We were expecting another 60 or so files to be accessible. Noting that code 2002 (population growth) isn't listed, we can search the index source code for that code and see if that helps.
Yes! It helps! The problem is that the "." wild-card in the regular expression doesn't match a new-line character, and the title for 2002 is split across two lines. So we need to tweak the expression. Rather than "." we can use [^<] to match all characters except a left angle-bracket / less-than sign. This update returns dataset 2002... but that's the only one that's added. We still aren't finding details of almost 60 files.
Loading the source of the rankorderguide.html file into Eclipse, we can run a regular-expression search for \d\d\d\drank.html. This produces 63 matches, so our scripted dataset did find all the data! Maybe there's a reason those files aren't in the index... maybe we need to look more at what they actually contain.
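The new-line behaviour is easy to reproduce in isolation. The sketch below compares the original "." group with the [^<] replacement on a made-up link whose title wraps onto a second line; the sample text is an assumption, not copied from the index file.

```javascript
// Hypothetical link whose title is split across two lines.
var sample = '<a href="../rankorder/2002rank.html">Population\ngrowth rate:</a>';

var prefix = "(<a href=\"\\.\\.\\/rankorder\\/)(\\d\\d\\d\\d)(rank\\.html\">)";
var suffix = "(\\:<\\/a>)";

var dotVersion = new RegExp(prefix + "(.+)" + suffix);       // original title group
var classVersion = new RegExp(prefix + "([^<]+)" + suffix);  // tweaked title group

var dotMatch = dotVersion.exec(sample);      // "." stops at the new-line, so no match
var classMatch = classVersion.exec(sample);  // [^<] happily crosses the new-line

console.log(dotMatch === null);
console.log(classMatch[2]);

// The captured title still contains the new-line; collapse whitespace for display.
var heading = classMatch[4].replace(/\s+/g, " ");
console.log(heading);
```

Collapsing the whitespace is an optional extra: the title captured by [^<] keeps its line break unless you remove it.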