iCapper.Com Articles by Charles Carroll


Big Data, Little Data, and The Fairs

by Charles Carroll

Race results data can now be had in virtually unlimited quantities, in pre-digitized format. Presumably, somebody, somewhere typed the stuff into a computer, but it ain’t me anymore, thank goodness. The raw data is now available, for a price, from several sources on the Web, and if you are going to put it to your own use, the first thing you need to decide is how large your study population should be.

With horse racing statistics, the standard take has always been that bigger is better. Everybody knows that tiny samples give invalid results and small samples can give skewed results: big is good, humongous is even better. Right? So, naturally, what horseracing statisticians have tended to do is shoot for “humongous,” and at the same time, there has been a shift from the goals of identifying handicapping factors to identifying ROI (return on investment) factors over, say, 15,000 races.

To get right to the point, this “homogenizes” the data. With all due respect to those horseracing statisticians who have labored over massive data sets of race results, they are usually singing under the wrong window. This is a game where short-term variability is crucial, and big samples blur opportunities. What is often lost in these broad-based computer studies is the importance of the variability that makes up the day-to-day reality of the two main sets of information: racing data (times, etc., which Eric Langjahr termed “The Cold Dope”) and tote board data (odds, etc.), which are part-and-parcel of the “variance” that makes betting “scores” possible.

In the 1960s and ’70s, data sets had to be “punched in,” literally, on keypunch machines, while squinting at microscopic print in the Form. I did not wear glasses until I did a lot of this in the late ’70s. (You also had to stand at the machine, and there was often only one machine per 20,000 or so students and faculty, so it was not unusual to get your turn at 3 o’clock in the morning.)
As a result, early data sets were small, but hopes were high. After all, this was a computer the size of a tractor-trailer, so some miracle was bound to happen. The goals were simple: looking for patterning of handicapping factors in a fairly traditional sense. There was no miracle, but a lot of the goals were met. We know much more about handicapping factors today than we did then, thanks to the published works of William Quirin and those who have followed.

I vividly remember the frustration of simply getting the data then. The charts were published in the paper Form, often hit-or-miss. The day you thought the charts for a certain race day should be published, they weren’t. It was difficult to even find a Form in my area, and past Forms had to be ordered at higher-than-face cost; if you were lucky, they arrived in a tattered bundle, maybe six weeks later. I am still waiting for several bundles I ordered in the early ’80s.

I also distinctly remember wanting more! Bigger data sets! I wanted humongous. I was wrong. Like virtually everyone else then, I was constrained to looking at smaller data sets and smaller questions, and that turned out to be a lucky stroke. There are many questions in horse racing where it would be nice to have a population of 15,000 races, but there are many more where smaller, more compact and focused populations identify patterning that large populations completely obscure.

What researchers have generally looked for in big-data runs are factors that show a certain percentage profit or loss over, say, 15,000 instances. An example of one of the simplest types of factors tested would be the profit percentage of theoretical flat bets on favorites. A more complex one might be the profit percentage of hypothetical bets on three-year-olds after a layoff of a certain length after July 31 of the year. It is a certainty that if you run enough of these little simulations, you will find some that show various profits, always small.
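
To make that arithmetic concrete, here is a minimal Python sketch of a flat-bet profit percentage, plus a toy run of many "angles" that have no real edge. Everything in it is an assumption for illustration, not anyone's published method: the function names, the $2 stake, and the made-up angle that wins 9.8% of the time and pays $20.00. That combination returns 0.098 × $20.00 = $1.96 per $2 bet, so its true return is a 2% loss, and any angle that shows a profit here does so by chance alone.

```python
import random


def flat_bet_roi(payoffs, stake=2.0):
    """ROI of equal-sized bets; payoffs holds the amount returned per bet (0.0 for a loser)."""
    if not payoffs:
        raise ValueError("need at least one bet")
    wagered = stake * len(payoffs)
    return (sum(payoffs) - wagered) / wagered


def simulate_angle(n_bets, win_prob, payoff, rng, stake=2.0):
    """Simulate one hypothetical angle and return its flat-bet ROI."""
    payoffs = [payoff if rng.random() < win_prob else 0.0 for _ in range(n_bets)]
    return flat_bet_roi(payoffs, stake)


def count_profitable_angles(n_angles, n_bets, win_prob, payoff, seed=0):
    """Run many identical angles; return (number showing a profit, list of all ROIs)."""
    rng = random.Random(seed)
    rois = [simulate_angle(n_bets, win_prob, payoff, rng) for _ in range(n_angles)]
    return sum(1 for roi in rois if roi > 0), rois


if __name__ == "__main__":
    # Assumed numbers: 9.8% winners paying $20.00 on a $2 bet, a true ROI of -2%.
    winners, rois = count_profitable_angles(n_angles=200, n_bets=15_000,
                                            win_prob=0.098, payoff=20.0)
    print(f"{winners} of {len(rois)} angles show a profit; best ROI {max(rois):+.1%}")
```

Changing `n_bets` or `payoff` shows how much the apparent results of a losing angle can wander, even over thousands of bets.
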
It’s only a little tongue-in-cheek to ask: “Okay, now, have you got 15,000 bets, and the several years it would take to make them when the angle arises?” In our little world of horse racing, variance happens. If you’re going to take any of these statistical angles seriously, you’d better have those bets and the time, because you might not score until bet number 14,998, then lose it all on 14,999 and be back to zero again at 15,000.

Extremely large samples in horse racing are not totally useless, and I’m not suggesting that you shouldn’t invest in some of the statistics-based studies that are available. It is good to know that class-droppers tend to win more than their fair share of races (duh), along with many of the other positive and negative “impact” factors either identified or verified through large sample studies. These are things everyone should be able to grab off a synapse at the appropriate moment during the handicapping thought process. But 15,000-race samples completely blur the hour-to-hour, day-to-day, and week-to-week variability that creates the opportunities for bettors to score.

In the days when handicappers generally focused on one track, Andy Beyer recommended taking a day before the season started in a closed room with a year’s supply of last year’s Forms and a bottle of Jack Daniels. The purpose was to develop “class-par times,” which I’ve never been too crazy about, but the result (aside from a hangover, if you followed his instructions literally) was a good overview of a year’s racing at your home track. You couldn’t help but pick up on both patterning and quirks in the results charts, which would help you deal with the beginning of a new season. If you follow one or two home tracks, this is still fine advice, although that’s about the only scenario in which I’d worry much about the pars (or the variants for which they form the baseline, but that’s another story).
However, many of us today do not follow a single track or even a regional circuit, and are more likely to be placing bets at ten tracks or more across the country, although not necessarily on the same day. (Admittedly, there are accounts of system players who go much further than that.) With almost unlimited availability of tracks for simulcast betting, most bettors I know have broadened their field of play well beyond a local circuit, though they still tend to focus primarily on tracks that they know to some extent or have played before.

Large populations of races for statistical studies are valuable for large, fundamental questions, but usually yield small profit. The variability that we move on as value bettors is more often short-term, sometimes instantaneous, and a lot more profitable. With comma-delimited past performance and results data available fairly cheaply on the internet, and with spreadsheets now virtually standard equipment on every computer, you may dream up your own approaches to identifying short-term patterns at your tracks. My suggestion is not to worry about humongous samples and fundamental questions of racing per se, but to think small and think local: local, at least, to the tracks you play, which may be scattered across three time zones.

If computer analyses are not your idea of recreation, you can still look for patterning and opportunities by simply eyeballing past performances and results charts. It is extremely handy now to use the computer to get to race results charts provided by a number of Web sites. If for some reason I am going to try working a track I’m not familiar with, or just haven’t worked for a while, I’ll usually pull up some recent results charts on the Web to see what’s going on. For my style of play, I like to see some “normal” variability displayed in the odds payoffs. By that I mean a few races will show $4.20, $3.60, and $2.40, but there will also be a healthy mix of patterns like $26.80, $5.60, and $4.80.
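
For those who would rather let the computer do the eyeballing, the check described above might be sketched in Python as follows. This is only an illustration under stated assumptions: the column names, the `TRACK_A` label, and the $6.00 cutoff between "short" and "longer" prices are all invented, and the three figures in each pattern are assumed to be the winner's win, place, and show payoffs on a $2 bet. Real results files from a data vendor will be laid out differently.

```python
import csv
import io

# Made-up file layout; the payoff figures are the two patterns quoted above.
SAMPLE_CSV = """track,race,win,place,show
TRACK_A,1,4.20,3.60,2.40
TRACK_A,2,26.80,5.60,4.80
"""


def load_win_payoffs(text):
    """Read comma-delimited results and return the win payoff of each race."""
    return [float(row["win"]) for row in csv.DictReader(io.StringIO(text))]


def payoff_mix(win_payoffs, short_price_cutoff=6.00):
    """Count races won at a short price versus a longer one (cutoff is an assumption)."""
    short = sum(1 for p in win_payoffs if p < short_price_cutoff)
    return {"races": len(win_payoffs), "short_priced": short,
            "longer_priced": len(win_payoffs) - short}


if __name__ == "__main__":
    print(payoff_mix(load_win_payoffs(SAMPLE_CSV)))
```

Run over a week or two of charts from one track, a tally like this gives a rough read on whether the payoffs show a healthy mix or whether the crowd is landing on nearly every winner.
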
I especially like to see patterns like $4.80, $11.80, $3.60, because they often indicate handicappable place overlays and, although none of these patterns predict future events, they suggest that the field is open and the opposition from the rest of the crowd in shaping the odds is normal.

Once in a while, you’ll find a pattern where the public is “On.” Early in the year, I pulled up results from (I believe it was) Penn National, where all the races were 1 mi 70 yds and the crowd was nailing every race for a period of at least several days. One way to find value, which has become more difficult with simulcasting, is to find a really dumb crowd; that one obviously wasn’t. For some approaches to betting this scenario might be a goldmine, but not for mine, so when it happens, I’m somewhere else.

However, a great opportunity happens each fall and requires no searching through data. It happens every year, and it is the Fall Fairs. The fair circuit is big in California and Maryland, and my home state of New Mexico. Some handicappers specialize in fairs, and for those handicappers the fall season is like Christmas for Macy’s: it makes the bulk of their annual profit. Except for a small percentage of serious handicappers, fair crowds are rank amateurs; they can no more handicap a horse race than a tractor pull. The horses come in from the surrounding circuit tracks, which are generally in hiatus before changing to fall venues. This is one of the few times when you can worry a lot less about fine-tuning “value,” and good old-fashioned handicapping comes to the forefront. If you have a way of dealing with complete fields of shippers, this is the time for good handicappers to reap the fall harvest.

Copyright © 2009 iCapper.Com
