Tuesday, 22 November 2011

What data does CIAWF actually offer?

The CIA World Factbook contains 180 raw data files in the rankorder subfolder. However, not all of these files contain data (e.g. rawdata_2005.txt is actually empty). Ordering the rawdata files by size shows that 60+ of them are empty. For the remaining ~120 files, it would be good to know what the data is: the rawdata files simply contain rows of tab-separated numbers with no metadata.
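The empty-file count can also be scripted rather than read off a folder listing. This is a minimal Python sketch, run outside BIRT; the function name and the rawdata_*.txt pattern are assumptions based on the file names mentioned above.

```python
from pathlib import Path


def split_by_size(folder, pattern="rawdata_*.txt"):
    """Return (empty, non_empty) lists of file names, each sorted by name.

    The glob pattern is an assumption based on names such as rawdata_2005.txt.
    """
    empty, non_empty = [], []
    for path in sorted(Path(folder).glob(pattern)):
        # A size of zero bytes means the file holds no records at all.
        if path.stat().st_size == 0:
            empty.append(path.name)
        else:
            non_empty.append(path.name)
    return empty, non_empty
```

Calling split_by_size on the rankorder folder and taking len() of each list gives the two counts.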

An index for selecting files from the website is available in ..\rankorder\rankorderguide.html. Hopefully this file only links to files containing useful data, in which case we can use a scripted dataset to query this file and extract the relevant title for each 4-digit code.
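As a rough idea of what that extraction involves, here is a Python sketch (not a BIRT scripted dataset). It assumes the guide links to pages named like 2001rank.html and that the link text is the title; neither has been checked against the real file, so the pattern will probably need adjusting.

```python
import re
from html.parser import HTMLParser

# ASSUMPTION: links in the guide look like href="2001rank.html".
CODE_IN_HREF = re.compile(r"(\d{4})rank\.html")


class GuideParser(HTMLParser):
    """Collect {4-digit code: link text} from anchors matching CODE_IN_HREF."""

    def __init__(self):
        super().__init__()
        self.titles = {}
        self._code = None   # code of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            match = CODE_IN_HREF.search(href)
            if match:
                self._code = match.group(1)
                self._text = []

    def handle_data(self, data):
        if self._code is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._code is not None:
            # Collapse line breaks and repeated spaces inside the link text.
            title = " ".join("".join(self._text).split())
            if title:
                # Keep the first non-empty title seen for each code.
                self.titles.setdefault(self._code, title)
            self._code = None


def titles_by_code(html_text):
    parser = GuideParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles
```

The resulting dictionary can then be matched against the rawdata file names by their 4-digit code.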

Monday, 21 November 2011

Picking at outliers

In our last chart, we examined the variation in GDP per person (USD) vs. the population growth %. We will look further at a couple of outliers in this data:
  • Which country has > $10k GDP per capita and a growth rate > 3%?
  • Which country has approximately 0% growth rate with $1k GDP/person?
The answers to these two questions can't easily be found directly from the CIAWF data as we're looking at the calculated column (GDP per person) and need to cross-reference with the population growth. However, it is easy to find the answers in BIRT.
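Outside BIRT, the same cross-reference amounts to a filter on two columns at once. The Python sketch below assumes each row is a dictionary with the hypothetical keys country, gdp_per_person and growth_pct; the bounds are exclusive, and None means "no limit".

```python
def select_points(rows, gdp_range=(None, None), growth_range=(None, None)):
    """Return the countries whose values fall strictly inside both ranges."""

    def inside(value, bounds):
        low, high = bounds
        return (low is None or value > low) and (high is None or value < high)

    return [
        row["country"]
        for row in rows
        if inside(row["gdp_per_person"], gdp_range)
        and inside(row["growth_pct"], growth_range)
    ]
```

The first question becomes select_points(rows, gdp_range=(10000, None), growth_range=(3.0, None)). For the second, "approximately" needs a tolerance, for instance gdp_range=(800, 1200) and growth_range=(-0.1, 0.1); those widths are arbitrary choices.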

Calculating columns

A comparison of population growth rate (i.e. % increase in population) with GDP for the entire country only makes sense if we expect some link between the size of a country and the population growth rate, because total GDP grows with the size of the country. To adjust for this, we can examine the GDP per person rather than the overall GDP.
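The calculated column is just a division per country. As a Python sketch (the report itself does this as a computed column in BIRT), assuming two dictionaries keyed by country name, one holding total GDP and one holding population:

```python
def gdp_per_person(gdp_by_country, population_by_country):
    """Return {country: GDP / population} for countries present in both inputs."""
    result = {}
    for country, gdp in gdp_by_country.items():
        population = population_by_country.get(country)
        # Skip countries with no population figure, or a population of zero.
        if population:
            result[country] = gdp / population
    return result
```

Countries that appear in only one of the two inputs are left out of the result, so the output can be shorter than either input.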

Updated scatter chart of pop. growth versus GDP per person
So, it appears that population growth is generally lower the higher the GDP per person, but this is by no means a concrete rule. We note that the Cook Islands and the Northern Mariana Islands are now both in evidence!

What are those outliers?

Looking at the previous scatter chart, there is an outlier with a decrease in population of more than 3%... We can easily find which country this is from the CIAWF data:
Rank  Country                   Growth rate (%)  Date
229   Cook Islands              -3.20            2011 est.
230   Northern Mariana Islands  -4.00            2011 est.

There's a problem here. The CIAWF says that we should have *two* points with population decreases over 3% - for some reason we're not seeing the Northern Mariana Islands in the chart.

Joining datasets

Having created a scripted dataset, it is a simple matter to point it to new data: copy and paste the dataset in BIRT; rename it; rename the columns as appropriate.
 
We can now use scripted datasets to access 2011 CIAWF files for GDP (rawdata_2001.txt) and Population Growth Rate (rawdata_2002.txt). We can look at this data in a scatter chart to see if there is any correlation:
GDP versus Population growth rate.
NB: Log scale used for GDP!

The answer looks to be that there is very little relationship between the two values. However, we can see that large population growth (> 2%) is not observed for either very low or very high GDPs. Whatever that might mean.
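For reference, the join behind this chart can be sketched in a few lines of Python. This is not how BIRT does it internally; it assumes both files have already been read into dictionaries keyed by country name, and the function names are made up for illustration.

```python
import math


def join_by_country(left, right):
    """Inner-join two {country: value} dictionaries on the country name.

    Returns (joined, unmatched): joined is a list of (country, left, right)
    tuples, unmatched lists the countries found in only one of the inputs.
    """
    joined = [(c, left[c], right[c]) for c in sorted(left) if c in right]
    unmatched = sorted(set(left) ^ set(right))
    return joined, unmatched


def log_scale_points(joined):
    """Turn joined rows into (log10(GDP), growth) pairs for a log-scale axis."""
    # log10 is undefined for zero or negative values, so those rows are skipped.
    return [(math.log10(gdp), growth) for _, gdp, growth in joined if gdp > 0]
```

The unmatched list is worth a look whenever a country that is present in the raw data does not show up as a point in the chart.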

Friday, 18 November 2011

Getting CIAWF data into the reports using a Scripted Dataset

To create a scripted dataset to use with BIRT we need to:
  1. Create a scripted data source
  2. Create a scripted dataset
  3. Add code to the dataset to:
    1. Describe the columns in the dataset
    2. Open the file
    3. Fetch a record
    4. Close the file
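
The describe / open / fetch / close steps above can be mirrored in plain Python to show the shape of the code. This is a stand-in, not BIRT script: the class name is hypothetical, and the rank, country, value column layout and the value clean-up are assumptions about the raw data.

```python
import io


class RankOrderDataset:
    """Plain-Python stand-in for a scripted dataset's life cycle."""

    def __init__(self, path):
        self.path = path
        self._file = None

    def describe(self):
        # ASSUMPTION: each record is rank <TAB> country <TAB> value.
        return [("rank", int), ("country", str), ("value", float)]

    def open(self):
        # Text mode with newline=None accepts LF, CRLF and bare CR line endings.
        self._file = io.open(self.path, "r", newline=None)

    def fetch(self):
        """Return the next record as a dict, or None when the file is exhausted."""
        for line in self._file:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            # Defensive clean-up in case values carry "$" or thousands separators.
            value = fields[2].replace("$", "").replace(",", "").strip()
            return {
                "rank": int(fields[0]),
                "country": fields[1].strip(),
                "value": float(value),
            }
        return None

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

A caller runs open() once, calls fetch() until it returns None, then calls close(), which is the same order in which the report engine drives the dataset.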

Wednesday, 16 November 2011

Data (ii)

Yesterday, when we left off, I had a working BIRT and a lot of CIA World Factbook (CIAWF) data and had failed to produce a report. Today things moved on...


Sample data from CIAWF raw data displayed in Notepad2.
Opening an example raw data file in Notepad2 and setting all the "show hidden characters" options on revealed that the CIAWF records were terminated with not LF, not CRLF but... CR. So, that looks like a likely culprit for why the (very basic) flat file option in BIRT failed to open it. Apparently in bygone days CR-terminated lines were an Apple standard, but I can't believe that's why the CIAWF does it.
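
The same check can be done without an editor by counting terminators in the raw bytes. A small Python sketch; the function names are made up, and the utf-8 default is an assumption since the encoding of the CIAWF files has not been checked.

```python
def count_terminators(raw):
    """Count CRLF, bare CR and bare LF line terminators in a bytes object."""
    crlf = raw.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": raw.count(b"\r") - crlf,   # CRs that are not part of a CRLF pair
        "LF": raw.count(b"\n") - crlf,   # LFs that are not part of a CRLF pair
    }


def split_records(raw, encoding="utf-8"):
    """Split raw bytes into non-blank records, whichever terminator is used."""
    text = raw.decode(encoding)
    # Normalise CRLF first, then bare CR, so every record ends in LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return [line for line in text.split("\n") if line.strip()]
```

Reading the file with open(path, "rb").read() and passing the result to count_terminators shows at a glance which convention a file uses.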