The CIA World Factbook contains 180 raw data files in the rankorder subfolder. However, not all of these files contain data (e.g. rawdata_2005.txt is actually empty). Ordering the rawdata files by size shows that some 60+ files are empty. For the remaining ~120 files, it would be good to know what the data is, since the rawdata files simply contain rows of tab-separated numbers with no metadata.
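As a rough sketch of the "order by size" check outside BIRT, the following Python lists the non-empty raw data files, largest first. The folder argument and the `rawdata_*.txt` glob pattern are assumptions based on the file names mentioned above.

```python
from pathlib import Path


def non_empty_rawdata(folder):
    """Return (file name, size in bytes) for non-empty raw data files, largest first."""
    sizes = [(p.name, p.stat().st_size) for p in Path(folder).glob("rawdata_*.txt")]
    # Drop zero-byte files, then order by size (descending) and name.
    return sorted((s for s in sizes if s[1] > 0), key=lambda s: (-s[1], s[0]))
```

Calling `non_empty_rawdata("rankorder")` (a hypothetical relative path) would give the list of files worth labelling.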
An index for selecting files from the website is available in ..\rankorder\rankorderguide.html. Hopefully this file only links to files that contain useful data, in which case we can use scripted datasets to query this file and extract the relevant title for each 4-digit code.
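To illustrate the idea of pulling a title for each 4-digit code out of the guide page, here is a minimal Python sketch. The layout of rankorderguide.html is not shown here, so the link pattern (`NNNNrank.html`) is purely an assumption and would need adjusting to whatever the real file contains.

```python
import re
from html.parser import HTMLParser


class RankLinkParser(HTMLParser):
    """Collect {code: link text} for anchors whose href holds a 4-digit code."""

    CODE = re.compile(r"(\d{4})rank\.html")  # assumed link pattern, not verified

    def __init__(self):
        super().__init__()
        self.titles = {}
        self._code = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            match = self.CODE.search(dict(attrs).get("href") or "")
            if match:
                self._code = match.group(1)
                self._text = []

    def handle_data(self, data):
        if self._code:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._code:
            title = " ".join("".join(self._text).split())
            if title:
                self.titles.setdefault(self._code, title)  # keep the first title seen
            self._code = None


def titles_by_code(html):
    """Parse the guide page text and return a dict of code -> title."""
    parser = RankLinkParser()
    parser.feed(html)
    return parser.titles
```

The resulting dictionary could then be matched against the non-empty `rawdata_NNNN.txt` files by code.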
Exploring BIRT
An exploration of the Eclipse Business Intelligence and Reporting Tools (BIRT). Mainly using Windows and the (public domain) CIA World Factbook.
Tuesday, 22 November 2011
Monday, 21 November 2011
Picking at outliers
In our last chart, we examined the variation in GDP per person (USD) vs. the population growth %. We will look further at a couple of outliers in this data:
- Which country has a GDP per person above $10k and a growth rate above 3%?
- Which country has a growth rate of approximately 0% with a GDP per person of about $1k?
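The two questions above amount to filtering the charted rows. A minimal Python sketch follows; the row layout, the thresholds used for "approximately", and the sample rows are all assumptions for illustration and are not Factbook figures.

```python
def select_countries(rows, predicate):
    """Return the names of countries whose row satisfies the predicate."""
    return [row["country"] for row in rows if predicate(row)]


# Placeholder rows for illustration only; these are NOT Factbook values.
rows = [
    {"country": "Country A", "gdp_per_person": 12000.0, "growth_pct": 3.4},
    {"country": "Country B", "gdp_per_person": 1000.0, "growth_pct": 0.0},
    {"country": "Country C", "gdp_per_person": 30000.0, "growth_pct": 0.5},
]

# Outlier 1: more than $10k per person and growth above 3%.
rich_and_growing = select_countries(
    rows, lambda r: r["gdp_per_person"] > 10_000 and r["growth_pct"] > 3.0
)

# Outlier 2: roughly zero growth at roughly $1k per person (assumed tolerances).
poor_and_static = select_countries(
    rows, lambda r: abs(r["growth_pct"]) < 0.1 and 800 <= r["gdp_per_person"] <= 1200
)
```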
Calculating columns
Comparing the population growth rate (i.e. the % increase in population) with GDP for the entire country only makes sense if we expect some link between the size of a country and its population growth rate. To adjust for this, we can examine the GDP per person rather than the overall GDP.
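The calculated column itself is a simple division. As a sketch in Python, assuming each row already carries a total GDP and a population figure (the field names here are hypothetical, and where the population figure comes from is left open):

```python
def add_gdp_per_person(rows):
    """Return copies of rows with a computed 'gdp_per_person' column added.

    Rows with a missing or zero population are skipped rather than guessed.
    """
    result = []
    for row in rows:
        population = row.get("population")
        if not population:
            continue
        result.append({**row, "gdp_per_person": row["gdp_usd"] / population})
    return result
```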
So it appears that, in general, the higher the GDP per person, the lower the population growth, but this is by no means a hard rule. Note that the Cook Islands and the Northern Mariana Islands are now both in evidence!
| Updated scatter chart of pop. growth versus GDP per person |
Labels: CIA World Factbook, Eclipse BIRT
Location: London, UK
What are those outliers?
Looking at the previous scatter chart, there is an outlier with a decrease in population of more than 3%... We can easily find which country this is from the CIAWF data:
| 229 | Cook Islands | -3.20 | 2011 est. |
| 230 | Northern Mariana Islands | -4.00 | 2011 est. |
There's a problem here. The CIAWF says that we should have *two* points with population decreases over 3% - for some reason we're not seeing the Northern Mariana Islands in the chart.
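One way to start diagnosing a missing point is to check whether the country appears in both datasets under the same name, since a point can only be plotted if the join finds a partner row. This is only one possible cause, and the sketch below is a generic check in Python rather than anything BIRT-specific; the GDP-side name list in the usage lines is hypothetical.

```python
def normalise(name):
    """Collapse whitespace and ignore case so near-identical names compare equal."""
    return " ".join(name.split()).casefold()


def missing_from(reference_names, other_names):
    """Return names present in reference_names but absent from other_names."""
    others = {normalise(n) for n in other_names}
    return [n for n in reference_names if normalise(n) not in others]


# Names from the growth-rate table above.
growth_names = ["Cook Islands", "Northern Mariana Islands"]
# Hypothetical GDP-side names, for illustration only.
gdp_names = ["cook  islands"]

unmatched = missing_from(growth_names, gdp_names)
```

Any name left in `unmatched` would be dropped by an inner join on country.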
Joining datasets
Having created a scripted dataset, it is a simple matter to point it to new data: copy and paste the dataset in BIRT; rename it; rename the columns as appropriate.
We can now use scripted datasets to access 2011 CIAWF files for GDP (rawdata_2001.txt) and Population Growth Rate (rawdata_2002.txt). We can look at this data in a scatter chart to see if there is any correlation:
| GDP versus Population growth rate. NB: Log scale used for GDP! |
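For readers who want to reproduce the join and the log-scaled axis outside BIRT, here is a minimal Python sketch. It assumes each dataset has already been reduced to (country, value) pairs, which is an assumption about the parsed form rather than the raw file layout.

```python
import math


def join_on_country(gdp_rows, growth_rows):
    """Inner join of two (country, value) lists on the country name."""
    growth = {name: rate for name, rate in growth_rows}
    return [(name, gdp, growth[name]) for name, gdp in gdp_rows if name in growth]


def to_log_points(joined):
    """Convert joined rows to (log10(GDP), growth rate) points for a scatter chart."""
    return [(math.log10(gdp), rate) for _, gdp, rate in joined if gdp > 0]
```

Countries that appear in only one of the two lists are silently dropped by the inner join, which is worth keeping in mind when a point seems to be missing from the chart.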
Friday, 18 November 2011
Getting CIAWF data into the reports using a Scripted Dataset
To create a scripted dataset to use with BIRT we need to:
- Create a scripted data source
- Create a scripted dataset
- Add code to the dataset to:
  - Describe the columns in the dataset
  - Open the file
  - Fetch a record
  - Close the file
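The describe/open/fetch/close steps above can be sketched as a small Python class. This is an analogue of the lifecycle only, not BIRT's scripting API: in BIRT the equivalent handlers are written in the report's own script editor. The three-column layout (rank, country, value) and the file encoding are assumptions.

```python
from pathlib import Path


class RawDataReader:
    """Python analogue of a scripted dataset: describe, open, fetch, close."""

    COLUMNS = ("rank", "country", "value")  # assumed column layout

    def __init__(self, path):
        self.path = Path(path)
        self._lines = None

    def describe(self):
        """Return the column names the dataset exposes."""
        return list(self.COLUMNS)

    def open(self):
        """Read the file; splitlines() copes with CR, LF and CRLF terminators."""
        text = self.path.read_text(encoding="utf-8", errors="replace")
        self._lines = iter(text.splitlines())

    def fetch(self):
        """Return the next record as a dict, or None when the data runs out."""
        for line in self._lines:
            fields = [field.strip() for field in line.split("\t")]
            if len(fields) >= len(self.COLUMNS):
                return dict(zip(self.COLUMNS, fields))
        return None

    def close(self):
        """Release the buffered lines."""
        self._lines = None
```

A caller would run `open()`, then call `fetch()` repeatedly until it returns `None`, then `close()`, mirroring the order of the steps in the list.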
Wednesday, 16 November 2011
Data (ii)
Yesterday, when we left off, I had a working BIRT and a lot of CIA World Factbook (CIAWF) data and had failed to produce a report. Today things moved on...
Opening an example raw data file in Notepad2 with all the "show hidden characters" options switched on revealed that the CIAWF records are terminated with not LF, not CRLF but... CR. That looks like a likely culprit for why the (very basic) flat file option in BIRT failed to open it. Apparently, in bygone days, CR-terminated lines were an Apple standard, but I can't believe that's why the CIAWF does it.
| Sample data from CIAWF raw data displayed in Notepad2. |
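The same check can be done without an editor. The Python sketch below counts each kind of line terminator in a file's raw bytes and, if wanted, rewrites lone CRs as LFs so that simpler flat-file readers have a better chance of coping. Whether normalising the terminators is enough for BIRT's flat file option is not something tested here.

```python
def count_line_endings(data):
    """Count CRLF, lone CR and lone LF terminators in a bytes object."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,
        "LF": data.count(b"\n") - crlf,
    }


def normalise_to_lf(data):
    """Convert CRLF and lone CR terminators to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

For a CR-terminated file, `count_line_endings(open(path, "rb").read())` would report a non-zero `CR` count and zero for the other two.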