Climategate's Harry_Read_Me.txt: We All Really Should Read It

One of the most damning pieces of evidence in Climategate (so far) is a text file called HARRY_READ_ME.txt. This file is supposedly written by Ian “Harry” Harris, a researcher at the University of East Anglia’s CRU (Climatic Research Unit). In it he details the trials and tribulations of being tasked with creating a new climate information database from previous publications and databases. According to Harry’s documented struggle, he is confronted with missing, manipulated, and undocumented data that he has to use to try to piece together the newer TS 3.0 database.


Here are the brow-raising excerpts:


So, uhhhh.. what in tarnation is going on? Just how off-beam are these datasets?!!

Unbelievable – even here the conventions have not been followed. It’s botch after botch after botch.

22. Right, time to stop pussyfooting around the niceties of Tim’s labyrinthine software suites – let’s have a go at producing CRU TS 3.0! since failing to do that will be the definitive failure of the entire project..

Nearly 11,000 files! And about a dozen assorted ‘read me’ files addressing individual issues…

(yes, they all have different name formats, and yes, one does begin ‘_’!)

How handy – naming two different files with exactly the same name and relying on their location to differentiate! Aaarrgghh!!

If the latest precipitation database file contained a fatal data error… then surely it has been altered since Tim last used it to produce the precipitation grids? But if that’s the case, why is it dated so early?

So what’s going on? I don’t see how the ‘final’ precip file can have been produced from the ‘final’ precipitation database, even though the dates imply that. The obvious conclusion is that the precip file must have been produced before 23 Dec 2003, and then redated (to match others?) in Jan 04.

There is no way of knowing which Tim used to produce the current public files. The scripts differ internally but – you guessed it! – the descriptions at the start are identical. WHAT IS GOING ON?

So what is this mysterious variable ‘nf’ that isn’t being set? Well strangely, it’s in Mark N’s ‘’. I say strangely because this is a generic prog that’s used all over the place! Nonetheless it does have what certainly looks like a bug…

Where is the documentation to explain all this?!

Bear in mind that there is no working synthetic method for cloud, because Mark New lost the coefficients file and never found it again (despite searching on tape archives at UEA) and never recreated it.

DON’T KNOW, UNDOCUMENTED. Wherever I look, there are data files, no info about what they are other than their names. And that’s useless..

So what the hell did Tim do?!! As I keep asking.

This is irritating as it means precip has only 9 fields and I can’t do a generic mapping from any cru format to cru ts.

Then.. like an idiot.. I had to test the data!

It’s halfway through April and I’m still working on it. This surely is the worst project I’ve ever attempted. Eeeek.

Oh bugger. What the HELL is going on?!

In fact, on examination the US database record is a poor copy of the main database one, it has more missing data and so forth. By 1870 they have diverged, so in this case it’s probably OK.. but what about the others?

Oh GOD if I could start this project again and actually argue the case for junking the inherited program suite!!

Oh Tim what have you done, man?

Just another thing I cannot understand, and another reason why this should all have been rewritten from scratch a year ago!

am I the first person to attempt to get the CRU databases in working order?!!

Oh bum. But, but.. how? I know we do muck around with the header and start/end years, but still..

In the upside-down world of Mark and Tim, the numbers of stations contributing to each cell during the gridding operation are calculated not in the IDL gridding program – oh, no! – but in anomdtb! ..well that was, erhhh.. ‘interesting’… So there is no guarantee that the station number files, which are produced *independently* by anomdtb, will reflect what actually happened!!


I am seriously worried that our flagship gridded data product is produced by Delaunay triangulation – apparently linear as well. As far as I can see, this renders the station counts totally meaningless. It also means that we cannot say exactly how the gridded data is arrived at from a statistical perspective – since we’re using an off-the-shelf product that isn’t documented sufficiently to say that. Why this wasn’t coded up in Fortran I don’t know – time pressures perhaps? Was too much effort expended on homogenisation, that there wasn’t enough time to write a gridding procedure? Of course, it’s too late for me to fix it too. Meh.


Not only do both databases have unnecessary duplicates, introduced for external mapping purposes by the look of it, but the ‘main’ stations (2 and 4) have different station name & country. In fact one of the country names is illegal! Dealing with things like this cannot be automated as they’re the results of non-automatic decisions.

What a bloody mess.

Now looking at the dates.. something bad has happened, hasn’t it. COBAR AIRPORT AWS cannot start in 1962, it didn’t open until 1993! Looking at the data – the COBAR station 1962-2004 seems to be an exact copy of the COBAR AIRPORT AWS station 1962-2004. And wouldn’t you know it, the data for this station has missing data between 12/92 and 12/99 inclusive. So I reckon it’s the old FORREST AERO station (WMO 9464600, .au ID 11004), with the new Australian bulletin updates tacked on (hence starting in 2000) So.. do I split off the 2000-present data to a new station with the new number, or accept that whoever joined them (Dave?) looked into it and decided it would be OK? The BOM website says they’re 800m apart.

Hope that’s right..

All 115 refs now matched in the TMin database. Confidence in the fidelity of the Australian station in the database drastically reduced. Likelihood of invalid merging of Australian stations high. Let’s go..

getting seriously fed up with the state of the Australian data. so many new stations have been introduced, so many false references.. so many changes that aren’t documented. Every time a cloud forms I’m presented with a bewildering selection of similar-sounding sites, some with references, some with WMO codes, and some with both. And if I look up the station metadata with one of the local references, chances are the WMO code will be wrong (another station will have it) and the lat/lon will be wrong too.

I am very sorry to report that the rest of the databases seem to be in nearly as poor a state as Australia was. There are hundreds if not thousands of pairs of dummy stations, one with no WMO and one with, usually overlapping and with the same station name and very similar coordinates. I know it could be old and new stations, but why such large overlaps if that’s the case? Aarrggghhh! There truly is no end in sight.

I honestly have no idea what to do here. and there are countless others of equal bafflingness.

I suspected a couple of stations were being counted twice, so using ‘comm’ I looked for identical headers. Unfortunately there weren’t any!! So I have invented two stations, hmm.

I have to admit, I still don’t understand secondary parameter generation. I’ve read the papers, and the miniscule amount of ‘Read Me’ documentation, and it just doesn’t make sense.


As I was examining the vap database, I noticed there was a ‘wet’ database. Could I not use that to assist with rd0 generation? well.. it’s not documented, but then, none of the process is so I might as well bluff my way into it!


Quite honestly I don’t have time – but it just shows the state our data holdings have drifted into. Who added those two series together? When? Why? Untraceable, except anecdotally.

But I am beginning to wish I could just blindly merge based on WMO code.. the trouble is that then I’m continuing the approach that created these broken databases.

Here, the expected 1990-2003 period is MISSING – so the correlations aren’t so hot! Yet the WMO codes and station names /locations are identical (or close). What the hell is supposed to happen here? Oh yeah – there is no ‘supposed’, I can make it up. So I have :-)

You can’t imagine what this has cost me – to actually allow the operator to assign false WMO codes!! But what else is there in such situations? Especially when dealing with a ‘Master’ database of dubious provenance (which, er, they all are and always will be).

False codes will be obtained by multiplying the legitimate code (5 digits) by 100, then adding 1 at a time until a number is found with no matches in the database. THIS IS NOT PERFECT but as there is no central repository for WMO codes – especially made-up ones – we’ll have to chance duplicating one that’s present in one of the other databases. In any case, anyone comparing WMO codes between databases – something I’ve studiously avoided doing except for tmin/tmax where I had to – will be treating the false codes with suspicion anyway. Hopefully.
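Harry's false-code scheme is simple enough to sketch. Here is an illustrative Python version of the procedure he describes above (the function name and the set-based lookup are mine, not CRU's actual code):

```python
def make_false_wmo(real_code: int, used_codes: set[int]) -> int:
    """Generate a synthetic WMO-like code from a legitimate 5-digit code:
    multiply by 100, then increment until the number is unused."""
    candidate = real_code * 100          # e.g. 94646 -> 9464600
    while candidate in used_codes:       # walk forward past any collisions
        candidate += 1
    used_codes.add(candidate)
    return candidate

used = {9464600, 9464601}                # two synthetic codes already taken
print(make_false_wmo(94646, used))       # -> 9464602
```

As Harry notes, nothing stops the same synthetic number being generated independently against a different database, since there is no central registry to check against.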

This still meant an awful lot of encounters with naughty Master stations, when really I suspect nobody else gives a hoot about. So with a somewhat cynical shrug, I added the nuclear option – to match every WMO possible, and turn the rest into new stations (er, CLIMAT excepted). In other words, what CRU usually do. It will allow bad databases to pass unnoticed, and good databases to become bad, but I really don’t think people care enough to fix ’em, and it’s the main reason the project is nearly a year late.

this was a guess! We’ll see how the results look. Right, erm.. off I jolly well go!

The trouble is, we won’t be able to produce reliable station count files this way. Or can we use the same strategy, producing station counts from the wet database route, and filling in ‘gaps’ with the precip station counts? Err.

…It looks as though the calculation I’m using for percentage anomalies is, not to put too fine a point on it, cobblers.

So, good news – but only in the sense that I’ve found the error. Bad news in that it’s a further confirmation that my abilities are short of what’s required here.

…unusual behaviour of CRU TS 2.10 Vapour Pressure data was observed, I discovered that some of the Wet Days and Vapour Pressure datasets had been swapped!!

Ah – and I was really hoping this time that it would just WORK. But of course not – nothing works first time in this project.


Oh, GOD. What is going on? Are we data sparse and just looking at the climatology? How can a synthetic dataset derived from tmp and dtr produce the same statistics as a ‘real’ dataset derived from observations?


Oh, sod it. It’ll do. I don’t think I can justify spending any longer on a dataset, the previous version of which was completely wrong (misnamed) and nobody noticed for five years.

“Bear in mind that there is no working synthetic method for cloud, because Mark New lost the coefficients file and never found it again (despite searching on tape archives at UEA) and never recreated it. This hasn’t mattered too much, because the synthetic cloud grids had not been discarded for 1901-95, and after 1995 sunshine data is used instead of cloud data anyway.” As for converting sun hours to cloud cover.. we only appear to have interactive, file-by-file programs. Aaaand – another head-banging shocker! The program sh2cld_tdm.for, which describes itself thusly:

program sunh2cld
c converts sun hours monthly time series to cloud percent (n/N)

Does NO SUCH THING!!! Instead it creates SUN percentages! This is clear from the variable names and user interactions.
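The distinction Harry is banging his head on – sun percent versus cloud percent – is easy to illustrate. A rough sketch (the naive complement below is my own illustration; the actual CRU conversion used fitted coefficients, not a simple subtraction):

```python
# n = observed monthly sun hours, n_max = maximum possible sun hours
def sun_percent(n: float, n_max: float) -> float:
    # what sh2cld_tdm.for actually produced, despite its header comment
    return 100.0 * n / n_max

def cloud_percent_naive(n: float, n_max: float) -> float:
    # crude complement for illustration only; not CRU's real regression
    return 100.0 - sun_percent(n, n_max)

print(sun_percent(120.0, 300.0))          # 40.0 - sun percent
print(cloud_percent_naive(120.0, 300.0))  # 60.0 - a different quantity entirely
```

The two numbers move in opposite directions, which is why mislabelling one as the other matters.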

So.. if I add the sunh -> sun% process from sh2cld_tdm.for into Hsp2cldp_m.for, I should end up with a sun hours to cloud percent converter. Possibly.

It also assisted greatly in understanding what was wrong – Tim was in fact calculating Cloud Percent, despite calling it Sun Percent!! Just awful.

… So to CLOUD. For over a year, rumours have been circulating that money had been found to pay somebody for a month to recreate Mark New’s coefficients. But it never quite gelled. Now, at last, someone’s producing them! Unfortunately.. it’s me.

The idea is to derive the coefficients (for the regressing of cloud against DTR) using the published 2.10 data. We’ll use 5-degree blocks and years 1951-2002, then produce coefficients for each 5-degree latitude band and month. Finally, we’ll interpolate to get half-degree coefficients. Apparently.
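The plan he describes – regress cloud against DTR in 5-degree latitude bands, then interpolate down to half-degree – can be sketched roughly like this (synthetic data and all names are my own invention, not CRU's code):

```python
import numpy as np

rng = np.random.default_rng(0)
bands = np.arange(-90, 90, 5)                       # 5-degree latitude bands

coeffs = {}
for band in bands:
    # fake 1951-2002 band-mean series with a built-in cloud/DTR relationship
    dtr = rng.uniform(5, 15, size=52)
    cld = 90 - 3.0 * dtr + rng.normal(0, 2, 52)
    slope, intercept = np.polyfit(dtr, cld, 1)      # linear fit per band
    coeffs[band] = (slope, intercept)

# interpolate the band slopes onto a half-degree latitude grid
band_centres = bands + 2.5
slopes = np.array([coeffs[b][0] for b in bands])
half_deg = np.arange(-89.75, 90, 0.5)
slopes_half = np.interp(half_deg, band_centres, slopes)
print(slopes_half.shape)                            # (360,)
```

Note this is per latitude band only; the real scheme also produced a separate set of coefficients for each calendar month.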

So, erm.. now we need to create our synthetic cloud from DTR. Except that’s the thing we CAN’T do because pro needs those bloody coefficients (a.25.7190, etc) that went AWOL.

Hunting for CDDs I found a potential problem with binary DTR (used in the construction of Frost Days, Vapour Pressure, and (eventually) Cloud). It looks as though there was a mistyping when the 2.5-degree binaries were constructed:

Another problem. Apparently I should have derived TMN and TMX from DTR and TMP, as that’s what v2.10 did and that’s what people expect. I disagree with publishing datasets that are simple arithmetic derivations of other datasets published at the same time, when the real data could be published instead.. but no.

I then look in the 1995 anomaly files…This whole process is too convoluted and created myriad problems of this kind. I really think we should change it.

I was going to do further backtracing, but it’s been revealed that the same issues were in 2.1 – meaning that I didn’t add the duff data. The suggested way forward is to not use any observations after 1989, but to allow synthetics to take over. I’m not keen on this approach as it’s likely (imo) to introduce visible jumps at 1990, since we’re effectively introducing a change of data source just after calculating the normals. My compromise is to try it – but to also try a straight derivation from half-degree synthetics.


So actually, this was saving with a gridsize of 5 degrees! Disquietingly, this isn’t borne out by the file sizes, but we’ll gloss over that.

Station counts should be straightforward to derive from the anomaly files (.txt), as output by anomdtb.f90. This, however, will only work for Primary parameters, since Secondaries are driven from synthetic data as well. Further, the synthetic element in this is usually at 2.5 degrees, so a direct relationship with half-degree coverage will be hard to establish.

So, we can have a proper result, but only by including a load of garbage!

OK, got cloud working, have to generate it now.. but distracted by starting on the mythical ‘Update’ program.

Of course, one of the problems is that you need a latitude value to perform the conversion – so the CLIMAT bulletins lose the value if they can’t be matched in the WMO list! Not much I can do about that, and let’s face it those stations are going to end up as ‘new’ stations with no possibility of a 61-90 normal.

So the new cloud databases I’ve just produced should be, if not identical, very similar? Oh, dear. There is a passing similarity, though this seems to break down in Winter. I don’t have time to do detailed comparisons, of course, so we’ll just run with the new one.

The procedure last time – that is, when I was trying to re-produce TS 2.10 – we have no idea what the procedure was for its initial production!

So after gridding we could add these.. except that after gridding we’ll have incorporated the DTR_derived synthetic cloud, which is of course based on the 1961-1990 normals as it’s derived from DTR!! Arrrrggghh.

So.. {sigh}.. another problem. Well we can’t change the updates side, that has to use 1995-2002 normals. But maybe we’ll have to adjust the station anomalies, prior to gridding? I don’t see an alternative.

The question is, IS THIS ANY GOOD? Well, we currently have published cloud data to 2002. So we can make comparisons between 1996 and 2002. Oh, my. I am sure I’ve written plenty of comparison routines, but as to their location or name..ah…The results were less than ideal, though they could have been much worse. Essentially, North America is totally different…

The deduction so far is that the DTR-derived CLD is waaay off. The DTR looks OK, well OK in the sense that it doesn’t have prominent bands! So it’s either the factors and offsets from the regression, or the way they’ve been applied in dtr2cld.

Well, dtr2cld is not the world’s most complicated program. Whereas cloudreg is, and I immediately found a mistake! Scanning forward to 1951 was done with a loop that, for completely unfathomable reasons, didn’t include months! So we read 50 grids instead of 600!!! That may have had something to do with it. I also noticed, as I was correcting THAT, that I reopened the DTR and CLD data files when I should have been opening the bloody station files!! I can only assume that I was being interrupted continually when I was writing this thing. Running with those bits fixed improved matters somewhat, though now there’s a problem in that one 5-degree band (10S to 5S) has no stations! This will be due to low station counts in that region, plus removal of duplicate values.
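The loop bug he found is worth spelling out: skipping from 1901 to 1951 through a file of monthly grids requires years times months reads, not years alone. A minimal illustration:

```python
# Skipping forward from 1901 to 1951 in a file of one grid per month:
years_to_skip = 1951 - 1901       # 50 years

grids_buggy = years_to_skip       # loop over years only: 50 grids read
grids_fixed = years_to_skip * 12  # loop over years AND months: 600 grids read

print(grids_buggy, grids_fixed)   # 50 600
```

Starting 550 monthly grids short of the intended position would comfortably explain regression output that was "waaay off".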

Had a think. Phil advised averaging the bands either side to fill the gap, but yuk! And also the band to the North (ie, 5S to equator) is noticeably lower (extreme, even). So after some investigation I found that, well, here’s the email:


<MAIL QUOTE>Phil,  I’ve looked at why we’re getting low counts for valid cloud cells in certain 5-degree latitude bands.

The filtering algorithm omits any cell values where the station count is zero, for either CLD or DTR. In general, it’s the CLD counts that are zero and losing us the data.  However, in many cases, the cloud value in that cell on that month is not equal to the climatology. And there is plenty of DTR data. So I’m wondering how accurate the station counts are for secondary variables, given that they have to reflect observed and synthetic inputs. Here’s a brief example:  (all values are x10)

        CLD                       DTR
   val     stn    anom       val     stn    anom
 553.00    0.00  -10.00    134.00   20.00   -1.00
 558.00    0.00  -17.00    139.00   20.00    2.00
 565.00    0.00  -23.00    137.00   20.00    5.00
 581.00    0.00  -32.00    139.00   16.00    8.00
 587.00    0.00  -38.00    137.00   16.00    9.00
 567.00    0.00  -46.00    127.00   15.00    6.00
 564.00    0.00  -49.00    120.00   14.00    3.00
 552.00    0.00  -48.00    111.00   12.00    0.00
 543.00    0.00  -45.00    105.00   12.00   -1.00
 535.00    0.00  -40.00     99.00   10.00   -1.00

So, I’m proposing to filter on only the DTR counts, on the assumption that PRE was probably available if DTR was, so synthesis of CLD was likely to have happened, just not shown in the station counts which are probably ‘conservative’?<END MAIL QUOTE>  I didn’t get an email back but he did verbally consent. So away we go!
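The screening change he proposes in that email amounts to a one-line filter change. A sketch, with illustrative field names of my own:

```python
def keep_cell_old(cld_stn: int, dtr_stn: int) -> bool:
    # original screening: require station counts for BOTH variables;
    # drops almost everything, since the CLD counts are zero
    return cld_stn > 0 and dtr_stn > 0

def keep_cell_new(dtr_stn: int) -> bool:
    # proposed screening: DTR count alone, on the assumption that CLD
    # was synthesised wherever DTR data existed
    return dtr_stn > 0

# (cld_stn, dtr_stn) pairs from the example table above
cells = [(0, 20), (0, 16), (0, 0)]
print(sum(keep_cell_old(c, d) for c, d in cells))  # 0 cells kept
print(sum(keep_cell_new(d) for _, d in cells))     # 2 cells kept
```

The trade-off Harry flags is that the kept cells now carry cloud values whose synthetic provenance is invisible in the station counts.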

Running with a DTR-station-only screening gives us lots of station values, even with duplicate filtering turned back on. Niiice. It’s still not exactly smooth, but it might be enough to ‘fix’ the synthetic cloud.

Differences with the climatology, or with the 2.10 release, are patchy and generally below 30%. Of course it would be nice if the differences with the 2.10 release were negligible, since our regression coefficients were based on 2.10 DTR and CLD.. though of course the sun hours component is an unknown there, as is the fact that 2.10 used PRE as well as DTR for the synthetics. Anyway it gets the thumbs-up. The strategy will be to just produce it for 2003-2006.06, to tie in with the rest of the 3.00 release. So I just need to.. argh. I don’t have any way to create NetCDF files 1901-2006 without the .glo.abs files to work from! I’d have to specially code a version that swallowed the existing 1901-2002 then added ours. Meh.

I really thought I was cracking this project. But every time, it ends up worse than before.

I really do hate this whole project.

No time to finish and test the fortran gridder, which will doubtless sink to some depth and never be seen again, we’ll carry on with this mediocre approach.

It’s not going to be easy to find 14 missing stations, is it? Since the anomalies aren’t exactly the same.

Should I be worried about 14 lost series? Less than 2%. Actually, I noticed something interesting.. look at the anomalies. The anomdtb ones aren’t *rounded* to 1dp, they’re *truncated*! So, er – wrong!
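Truncation versus rounding at one decimal place is a real difference, as a quick sketch shows (the helper below is mine, not anomdtb's code):

```python
import math

def truncate_1dp(x: float) -> float:
    """Chop to one decimal place without rounding - the anomdtb behaviour."""
    return math.trunc(x * 10) / 10

anom = 0.97
print(truncate_1dp(anom))  # 0.9 - truncated, what anomdtb produced
print(round(anom, 1))      # 1.0 - rounded, what you would expect
```

A systematic 0.05-scale bias toward zero on every anomaly is small per station, but it is exactly the kind of silent discrepancy that makes two "identical" datasets fail to match.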

So let’s say, anomalies are done. Hurrah. Onwards, plenty more to do!

NO IDEA why, so saying they affect particular 0.5-degree cells is harder than it should be. So we’ll just gloss over that entirely ;0)

Just went back to check on synthetic production. Apparently – I have no memory of this at all – we’re not doing observed rain days! It’s all synthetic from 1990 onwards.  Probably the worst story is temperature, particularly for MCDW. Over 1000 new stations! Highly unlikely. I am tempted to blame the different lat/lon scale, but for now it will have to rest.


Oh, my giddy aunt. What a crap crap system.

Also went through the parameters one by one and fixed (hopefully) their scaling factors at each stage. What a minefield!

– I was able to look at the first problem (Guatemala in Autumn 1995 has a massive spike) and find that a station in Mexico has a temperature of 78 degrees in November 1995! This gave a local anomaly of 53.23 (which would have been ‘lost’ amongst the rest of Mexico as Tim just did country averages) and an anomaly in Guatemala of 24.08 (which gave us the spike)…
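The arithmetic behind that spike is straightforward: an anomaly is the observed value minus the climatological normal, so the quoted numbers imply a normal of roughly 24.77 degrees for that Mexican station – a plausible November value, which is what marks the 78 as the outlier:

```python
observed = 78.0   # reported November 1995 temperature (clearly bad data)
anomaly = 53.23   # local anomaly quoted in the file

climatology = observed - anomaly   # anomaly = observed - climatology
print(round(climatology, 2))       # 24.77 degrees - a believable normal
```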

Oh, ****. It’s the bloody WMO codes again. **** these bloody non-standard, ambiguous, illogical systems. Amateur hour again.

This whole project is SUCH A MESS.

I am seriously close to giving up, again. The history of this is so complex that I can’t get far enough into it before my head hurts and I have to stop. Each parameter has a tortuous history of manual and semi-automated interventions that I simply cannot just go back to early versions and run the update prog. I could be throwing away all kinds of corrections – to lat/lons, to WMOs (yes!), and more.

You see how messy it gets when you actually examine the problem? What we really need, and I don’t think it’ll happen of course, is a set of metrics (by latitude band perhaps) so that we have a broad measure of the acceptable minimum value count for a given month and location. Even better, a confidence figure that allowed the actual standard deviation comparison to be made with a looseness proportional to the sample size.

All that’s beyond me – statistically and in terms of time. I’m going to have to say ’30’.. it’s pretty good apart from DJF. For the one station I’ve looked at.

OH F*** THIS. … I’m hitting yet another problem that’s based on the hopeless state of our databases. There is no uniform data integrity, it’s just a catalogue of issues that continues to grow as they’re found.

