Wednesday, 16 November 2011

Data (ii)

Yesterday, when we left off, I had a working BIRT and a lot of CIA World Factbook (CIAWF) data and had failed to produce a report. Today things moved on...


Sample data from CIAWF raw data displayed in Notepad2.
Opening an example raw data file in Notepad2 with all the "show hidden characters" options switched on revealed that the CIAWF records are terminated not with LF, not with CRLF, but with... CR. That looks like a likely culprit for why the (very basic) flat file option in BIRT failed to open it. CR-terminated lines were the convention on classic Mac OS (before OS X), but I can't believe that's why the CIAWF does it.
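If you'd rather not rely on eyeballing hidden characters in an editor, the same check can be scripted. This is only a rough sketch: the function name `count_terminators` and the sample bytes are made up for illustration and are not real CIAWF content.

```python
def count_terminators(data: bytes) -> dict:
    """Count CRLF pairs, lone CRs and lone LFs in a chunk of raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Every CRLF pair contains one CR and one LF, so subtract the pairs
        # to leave only the terminators that stand alone.
        "CR": data.count(b"\r") - crlf,
        "LF": data.count(b"\n") - crlf,
    }


# Made-up, tab-separated sample with CR-only record terminators.
sample = b"name\tvalue\rA\t1\rB\t2\r"
print(count_terminators(sample))  # {'CRLF': 0, 'CR': 3, 'LF': 0}
```

For a real file you would read it in binary mode (`open(path, "rb")`) so that Python doesn't translate the line endings before you get to count them.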


The BIRT data providers are part of the Eclipse Data Tools Platform (DTP) project. Examining the code (FlatFileQuery.java) reveals that the flat file data source does indeed rely on the presence of an LF (\n) to mark the end of a record.

Useful comment saying that, yes, LF is necessary
 

The code that means CIAWF won't load.
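To see the effect in isolation, here is a small Python illustration. It is not the DTP code and makes no attempt to mirror it. It only shows what happens to any reader that treats LF as the sole record terminator when it is handed CR-only data. The function name and sample rows are invented for the example.

```python
def split_records_on_lf(text: str) -> list:
    """Split text into records, treating LF as the only terminator."""
    return [record for record in text.split("\n") if record]


# Made-up sample rows, first with CR terminators, then with LF.
cr_only = "A\t1\rB\t2\rC\t3\r"
lf_only = cr_only.replace("\r", "\n")

print(len(split_records_on_lf(cr_only)))  # 1
print(len(split_records_on_lf(lf_only)))  # 3
```

With CR-only input, the whole file comes back as one giant "record", which is consistent with the flat file data source choking on it.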

The comments in the code do, however, at least indicate that the approach to tab-separated files is based on a spec for comma-separated (CSV) files... I guess it started life reading CSV and then expanded to handle tabs and semicolons and suchlike!

Another useful comment indicating where the base definition for the flat-file formats comes from!

Quite why the BIRT UI doesn't offer all the usual types of file import options for field and record terminators is beyond me. Especially given the humungous list of character sets that you can choose from for your data.

Anyway, it doesn't suit the data so there are 3 options:
  1. Massage the data so that all the CRs are LFs
  2. Hack / copy-and-edit the flat file dataset to handle CR-terminated records
  3. Use the Scripted Dataset instead.
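For what it's worth, the first option needs very little code. A minimal sketch in Python, assuming the files are small enough to read into memory in one go; the function names `normalise_newlines` and `convert_file` are made up for illustration.

```python
def normalise_newlines(data: bytes) -> bytes:
    """Turn CRLF and lone CR terminators into LF."""
    # Handle CRLF first so that a CRLF pair doesn't end up as two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Write an LF-terminated copy of src_path to dst_path."""
    # Binary mode stops Python translating line endings behind our back.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(normalise_newlines(data))
```

Working on bytes rather than text sidesteps any question about the character set of the source files, since CR and LF are single bytes in ASCII-compatible encodings.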

Given that there are 10 years' worth of files in the CIAWF data, and I'd like to be able to point BIRT at any CIAWF file, my initial experiments are going to concentrate on option 3: seeing what needs to be done to load the data using JavaScript (or Java... but most likely JavaScript). However, updating the FlatFile dataset to be more flexible is a tempting future project!
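Before writing any BIRT script, the shape of the loading logic can be prototyped outside the tool. The sketch below is plain Python, not BIRT's scripted dataset API: the class `CrRecordReader`, its `fetch` and `close` methods, the tab separator, the UTF-8 encoding and the sample rows are all assumptions made for illustration. It simply pulls one record at a time, in the spirit of an open / fetch / close cycle.

```python
import io


class CrRecordReader:
    """Hypothetical one-record-at-a-time reader for tab-separated text."""

    def __init__(self, stream):
        # Expects a text stream opened with newline=None, so that CR, LF
        # and CRLF are all translated to "\n" as the stream is read.
        self._stream = stream
        self.row = None

    def fetch(self) -> bool:
        """Load the next record into self.row; return False at end of data."""
        line = self._stream.readline()
        if not line:
            return False
        self.row = line.rstrip("\n").split("\t")
        return True

    def close(self) -> None:
        self._stream.close()


# Made-up CR-terminated sample standing in for a data file.
raw = io.BytesIO(b"A\t1\rB\t2\r")
reader = CrRecordReader(io.TextIOWrapper(raw, encoding="utf-8", newline=None))

rows = []
while reader.fetch():
    rows.append(reader.row)
reader.close()

print(rows)  # [['A', '1'], ['B', '2']]
```

The point of the prototype is that once the terminator is handled when reading, the rest is ordinary field splitting, so the same reader works whichever of the three terminators a given file uses.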
