Auction Forecast

Empowering Collectors to be Investors

The Challenge of Wine Name Matching

I stepped back from blogging for several years, but only because I was digging even deeper into wine analytics. Those years of effort are paying off, and AuctionForecast is about to relaunch our database. We now have about 4.4 million auctions covering mostly vintage wines, but also expanded to non-vintage champagne, liquors, and other items.

The hardest part of this expansion has been in aligning wine names. You might think that Large Language Models would make this problem go away, but unfortunately not. When a period can matter, LLMs are just not precise enough, and they don't know when they're guessing. Partly the problem is a lack of a true reference list of wines. LWIN seemed promising at first, but it lists about half of the wines one can find in other sources. And it is not perfect, even when listing the color of the wine. However, wine name matching is quite difficult, so I won't insult their efforts other than to point out that they are not a complete solution either.

The problems arise from how a human chooses to type what is on a label. Although it might seem obvious, wine experts have a rich context with which to interpret a wine label, where a designer may have chosen to be creative with what information is presented and how. Further, wine experts will often omit information that they consider obvious or redundant when describing a wine, yet future production may introduce distinctions that were not important before. When a wine was listed as "red" but in a later year is listed as a Cab - Merlot, is that the same wine or a new wine? There can be different opinions, but the only certain answer is if both are sold the same year.

Another similar problem arises in listing the region in which the grapes are grown. Perhaps originally the wine was listed as Sonoma, but in a later year the same producer lists Sonoma Coast. Is that a later vintage of the same wine? They might also produce a wine from Chalk Hill, which is within Sonoma County. Are these three separate wines? When an auction site does not list the region, which should be assumed?

These questions are challenging with perfect data and often unanswerable with the amount of arbitrariness in many wine listings. Our solution has been to assemble the best wine lists that we can find and reconcile them via a label matching algorithm that combines some core elements of LLMs with a lot of domain-specific knowledge. In the process we find oh-so-many contradictions and errors. But with this expanded list, we then use the same algorithm to match auction listings to our wine list. At the end of this process, we can identify about 90% of all listed items, striking a balance between false positives and false negatives.

If you're curious about the details of the algorithm, it goes like this...

1. Identify the type and color (if possible) from the listing. Type is a high-level category of Wine, Fortified Wine (e.g. Port), Sparkling Wine, Liquor (e.g. Whiskey), and Other. Vintage or Non-vintage is an important distinction. Color is determined from keyword searches through the listing, either the color words themselves or varietals that are used primarily to produce certain colors. Unspecified is a large category.

2. Within each of these segments, all information from the label is expressed in a string. Matching can be either on phrases or individual words. Our database of unique items is mined to identify all unique phrases and words with measures of their frequency and uniqueness.

3. Each phrase list or word list is represented in a vector space of all known phrases or words. For words or phrases that are not exact matches, we compute a Levenshtein distance to allow for small spelling errors. Novel words from the label that differ by more than a spelling error are added on-the-fly as new dimensions so that we do not ignore any information.

4. Similarities are computed between the vector for the auction label and the vectors for all database entries. 

5. Comparing all of these vectors, the best matching phrase or word vector is pulled from the database, but only if the similarity is above our threshold for balancing false positives and false negatives.
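
For readers who think in code, here is a minimal sketch of steps 3 through 5 at the word level. It is not our production code. The inverse-document-frequency weighting, the cosine similarity, the one-edit spelling tolerance, the 0.8 threshold, and every function name below are assumptions chosen for illustration; the steps above only say that words carry measures of frequency and uniqueness and that a similarity is compared against a threshold.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def build_idf(entries):
    """Weight each known word by how rare it is across database entries."""
    n = len(entries)
    df = Counter(w for e in entries for w in set(e.lower().split()))
    return {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}


def normalize(word, vocab, max_edits=1):
    """Snap a near-miss spelling onto a known word; otherwise keep the
    word unchanged so it becomes a new dimension (step 3)."""
    if word in vocab:
        return word
    best = min(vocab, key=lambda v: levenshtein(word, v), default=None)
    if best is not None and levenshtein(word, best) <= max_edits:
        return best
    return word


def vectorize(label, vocab, idf, novel_weight):
    """Sparse vector: word -> count * weight. Novel words get novel_weight."""
    words = [normalize(w, vocab) for w in label.lower().split()]
    return {w: c * idf.get(w, novel_weight) for w, c in Counter(words).items()}


def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def best_match(label, entries, threshold=0.8):
    """Return (entry, score), or (None, score) if nothing clears the threshold."""
    idf = build_idf(entries)
    vocab = set(idf)
    novel_weight = math.log(1 + len(entries)) + 1.0  # weight of an unseen word
    query = vectorize(label, vocab, idf, novel_weight)
    scored = [(cosine(query, vectorize(e, vocab, idf, novel_weight)), e)
              for e in entries]
    score, entry = max(scored)
    return (entry, score) if score >= threshold else (None, score)


database = [
    "ridge monte bello cabernet sauvignon",
    "ridge geyserville zinfandel",
    "kistler sonoma coast chardonnay",
]
print(best_match("Ridge Monte Bello Cabernet Sauvignin", database))
print(best_match("Penfolds Grange Shiraz", database))
```

In this toy database the misspelled "Sauvignin" is one edit away from a known word, so the first listing matches the Monte Bello entry, while the second listing shares no words with any entry and is returned unmatched. A brute-force loop like this would never fit the memory and run-time constraints of millions of listings, and a fixed one-edit tolerance is too loose for very short words, so treat it purely as a picture of the idea.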

What can be written simply in a few bullet points represents years of work in data gathering, data cleaning, algorithm development, and code optimization. Fitting this process within existing server memory and running in reasonable time is no small feat.

What is missing? There are always more wines, so we need to keep detecting new wines and vintages and adding them to the database if we're certain that they are new, unique entries and not just bad data. Repeat appearances are the best proof of that. Also, our threshold for uniqueness is more stringent than our threshold for matching.
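
One way to picture the two thresholds and the repeat-appearance rule is the sketch below. It rests on assumptions: the numeric thresholds, the two-sighting rule, the class and function names, and the use of Python's `difflib.SequenceMatcher` as a stand-in similarity are all illustrative, not our actual settings or scoring. It also assumes that "more stringent" means a listing must look clearly unlike everything in the database before it is even considered as a new entry, which leaves a gray zone between the two thresholds.

```python
from collections import defaultdict
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.80      # at or above: treat as an existing database entry
NEW_ENTRY_THRESHOLD = 0.60  # below: different enough to be a candidate new entry
MIN_SIGHTINGS = 2           # distinct auctions required before promotion


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the real scoring is domain-specific."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(label, database):
    best = max((similarity(label, e) for e in database), default=0.0)
    if best >= MATCH_THRESHOLD:
        return "match"
    if best < NEW_ENTRY_THRESHOLD:
        return "candidate"
    return "ambiguous"  # gray zone: neither matched nor trusted as new


class CandidatePool:
    """Holds unmatched labels until they reappear in enough separate auctions."""

    def __init__(self):
        self.sightings = defaultdict(set)

    def observe(self, label, auction_id, database):
        """Record a sighting; return True if the label was promoted."""
        if classify(label, database) != "candidate":
            return False
        key = " ".join(label.lower().split())  # collapse case and whitespace
        self.sightings[key].add(auction_id)
        if len(self.sightings[key]) >= MIN_SIGHTINGS:
            database.append(key)
            del self.sightings[key]
            return True
        return False


database = ["ridge monte bello cabernet sauvignon"]
pool = CandidatePool()
pool.observe("Penfolds Grange Shiraz", "auction-1", database)  # held back
pool.observe("Penfolds Grange Shiraz", "auction-2", database)  # promoted
print(database)
```

Counting distinct auction IDs rather than raw sightings is a deliberate choice in the sketch: a single bad listing repeated within one sale should not be enough to create a new wine.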

With all that, nothing is perfect. If you find wines listed that are redundant or a label that does not actually exist, feel free to let us know. We're not afraid to go in and make manual edits. Everyone on our team has done that multiple times.

Enjoy!