A spreadsheet isn't good enough. Wine data is messy. Part 1

A workman is only as good as his tools. The 21st century wine critic needs a nose, a palate, a laptop and a database - and the knowledge of how best to use it for wine

Take the following picture. You see a sunset, a glass of wine, and a laptop. I see a spreadsheet and the nagging realisation that 4,000 wines tasted every year means one thing. Data. Big data.

Wine tasting with a sunset in the background and a laptop showing a spreadsheet

I've heard it say that in 2022, data is so cheap to collect that it's effectively limitless. We're not quite there yet, but I'll tell you what's expensive. Using that data.

Whether you're Facebook or a small-town wine critic, collecting the data must always be done with the idea of somehow in the future retrieving it and using it. Data that sits in a forgotten or inaccessible spreadsheet on an old hard drive is useless.

Take these old tasting notes from 1990. You see great wines, great prices, a glimpse of the past. I see data and the nagging realisation that 40 years of experience and knowledge (this is a knowledge industry, after all) are gathering dust.

Old tasting notes

Solutions exist, but every decision here has a trade off. There is no universally best option. The classic is a long spreadsheet, with each line representing a wine that has been tasted. The same wine, tasted twice (perhaps across two vintages) will have two lines. If the data is a little messy - not to worry, add on a column and include the information.

Up to a few hundred, maybe even a few thousand wines or tasting notes, this works fine, but what works for me will not work for you, and what works for you today may not work tomorrow.

One key assumption I made for our system is that every wine has one or several grape varieties. Every time I key in a new tasting note for a given wine, those varieties pop up, the same as the first time I added them to that wine. In Burgundy, that's perfect. A Pinot yesterday will be a Pinot tomorrow. In Portugal, who knows if yesterday's Touriga Nacional won't be a Tinta Cão tomorrow, or if that Côtes du Rhône had a bad Grenache vintage so are messing around with Syrah instead?

Even if vintage variations are dealt with, grape variety data is messy.

Things get even funkier in Bordeaux for example, where pretty much every wine has the same varieties, year in and year out - only in varying proportions. The names of the grapes don't matter, as everything will have some degree of Cabernet Franc or Merlot - what matters is the little percentage numbers for each. The system I have built for pink.wine does not have this capability. I based our system off Airtable, which does not allow for arrays as a field type (so {Cabernet: 12%, Merlot: 88%} would not mean the computer would actually understand this any more than if you wrote "there is some Cabernet and some Merlot" - the data is still not computer-readable.

Decanter Magazine uses a system of three twinned columns, for Grape 1, Grape 1%, Grape 2, Grape 2%, Grape 3, Grape 3%. This works fine until you get to Portugal with its 20+ grape varieties in some wines, or even Bordeaux blends that dabble in Cot and Petit Verdot. How does one account for field blends? Both Decanter and WineSearcher take an interesting approach of classifying certain blends as a single grape variety - much as I suppose a customer might ask for a GSM at a wine bar, even if I find the implementation of 'Southern French Red Blend' a little clumsy. I think this approach has a lot to commend it, if only for the sake of simplicity. I accidentally implemented

And while we're at it… what is a grape variety? We can all agree that Merlot is not Riesling, and we can probably just about agree that Pinot Noir is not Pinot Blanc, but is Vermentino distinct from Rolle? How about Tibouren from Rossese? Vermentino from Pigato? (no, no, yes - maybe) What about clones? Should they figure in a structured data field, or is simply mentioning them in a tasting note or 'comments' field good enough?

There's more to this post than questions without answers. Here are the simplest rules I try and think about whenever I have a data-related question.

Remember the one-to-manys and the many-to-ones. Every estate has multiple wines, but every wine only has one producer. One wine can be tasted several times, but each tasting note can only ever be of one wine. Simplify the 'ones', and the 'manys' will look after themselves. Having good, reliable data for an estate or an appellation is more important than having flawless individual tasting notes or wines.

(of course, my examples don't work for tank samples that will subsequently blended or for estates that collaborate on individual wines with other producers)

Sacrifice perfection. You can't do everything perfectly. In France, for example, a company can either produce and sell wine, or buy and sell wine. It cannot do both without a complicated legal hocus-pocus with the douanes. Most estates, in practice, do both. Fortunately, we are not the douanes, and have no strict need to differentiate between the two companies. Most people don't, anyway. Whispering Angel is invariably sold as 'Chateau d'Esclans' because, at the end of the day, nobody cares if the producing company is actually 'Cave d'Esclans'. Halve the number of records and just have one 'producer' record to include négociants. Life is too short for complications.

Keep it simple. Complexity is multiplicative and exponential. Every extra box to tick or field to fill makes the work 10x harder. Sure, you could include GPS coordinates to every parcel that goes into every wine, harvest dates for every single component of the blend, and even a field for 'max fermentation temperature' or whatever floats your boat - but these place such a burden on data-entry, and odds are they will never be used anyway.

And this brings me to my last point: remember why you need the data. It is only there, after all, as a tool for other work. It is not, in itself, valuable. Sure, it's great to have an integer field of the number of months any given wine has spent in oak; you can do all sorts of fun statistical analysis on it, or maybe use it to compare similar wines, or even work out if there is a time-in-oak-to-score relationship - but what actual use is that? It's time consuming and expensive, so who actually is paying for that sort of data?