There has been a lot of hype about the new MLBAM StatCast system, a player-tracking/raw data machine. With all of this new data will come a need for more data analysis, and most likely, a better way to store and track data. I have manually compiled every piece of StatCast data currently available to the public through the various videos published on MLB.com, demonstrating some of the impressive capabilities of the new system.
The data was compiled from a few 2013-2014 regular season games, the 2014 All-Star Game, and the 2014 playoffs. Below I have added links to downloadable spreadsheets demonstrating a few of the key fields that might be collected for each play in a major league baseball game using StatCast. The database that I created for this new StatCast data includes seven tables connected to the Lahman database, which I use to query players’ past statistics. Of those seven tables, four hold information that I predict will become the future talking points of not only front offices and statistical baseball writers, but the casual fan as well. Those four tables, which hold all of the fancy new statistics, are the Pitching, Batting, Fielding, and Running tables.
This StatCast database is meant to store every play within each game of a season, using a play ID to connect plays from table to table. Using the player IDs from the Lahman database seemed to me to be the easiest way to implement the new statistics, since it will be helpful in the future to query stats from both the Lahman files and the new StatCast files. This setup will also allow me to use counting and rate formulas in SQL to easily summarize a player’s season and career StatCast statistics.
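To make the play ID and player ID idea concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Access. Everything in it is an assumption for illustration: the table names (`Master` as a stand-in for the Lahman player table, `Plays`, `StatcastBatting`), the column names, the player IDs, and the numbers are all made up and are not the actual schema or real StatCast data.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with made-up rows (not real StatCast data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Stand-in for the Lahman player table, keyed by playerID.
        CREATE TABLE Master (playerID TEXT PRIMARY KEY, nameLast TEXT);
        -- One row per play; playID ties the StatCast tables together.
        CREATE TABLE Plays (playID INTEGER PRIMARY KEY, gameID TEXT);
        -- Hypothetical StatCast batting table; exitVelocity may be NULL.
        CREATE TABLE StatcastBatting (
            playID INTEGER REFERENCES Plays(playID),
            playerID TEXT REFERENCES Master(playerID),
            exitVelocity REAL
        );
    """)
    conn.executemany("INSERT INTO Master VALUES (?, ?)",
                     [("playera01", "Alpha"), ("playerb01", "Bravo")])
    conn.executemany("INSERT INTO Plays VALUES (?, ?)",
                     [(1, "G1"), (2, "G1"), (3, "G2"), (4, "G2")])
    conn.executemany("INSERT INTO StatcastBatting VALUES (?, ?, ?)",
                     [(1, "playera01", 100.0), (2, "playera01", 90.0),
                      (3, "playera01", None), (4, "playerb01", 95.0)])
    return conn

def exit_velocity_summary(conn):
    """Per-player counting stats (plays, tracked plays) and a rate stat (average)."""
    query = """
        SELECT m.nameLast,
               COUNT(*)              AS plays,            -- every recorded play
               COUNT(b.exitVelocity) AS tracked,          -- plays with a non-NULL reading
               AVG(b.exitVelocity)   AS avgExitVelocity   -- AVG skips NULLs
        FROM StatcastBatting AS b
        JOIN Master AS m ON m.playerID = b.playerID
        GROUP BY m.playerID
        ORDER BY m.nameLast
    """
    return conn.execute(query).fetchall()

if __name__ == "__main__":
    for row in exit_velocity_summary(build_demo_db()):
        print(row)
```

Keying the StatCast table on the Lahman player ID is what lets a single join pull names (or career stats) alongside the new measurements, and because SQL aggregates such as AVG ignore NULLs, plays with missing readings do not distort the rate.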
As you look over the numbers, you will see some stars like Mike Trout, Andrew McCutchen, and Troy Tulowitzki. As I stated before, I was limited to the stats that have been released by MLB from 2013 through 2014, so the data on some of these players is incomplete or non-existent. This was more of a project about using the data we know can be tracked to create workable tables that can be merged with other databases; in my case, I am merging the new data with the Lahman baseball files. While we have little data to work with now, in the future I will be ready to incorporate lots of play-by-play StatCast stats into my database.
As you can see there are lots of null values. This is due to the incomplete information available for each play. In theory all of these fields would be filled if and when StatCast data becomes available to the public.
I suggest that you browse each spreadsheet to get a feel for the data.
Batting – Download the full Batting table
Fielding – Download the full Fielding table
Pitching – Download the full Pitching table
Running – Download the full Running table
OK, now that you have played around with the spreadsheets, you might be thinking of unique ways to use these numbers to help evaluate players. Personally, I keep an ongoing brainstorming journal that lists ways in which teams and management can use StatCast to test the overall performance of players. It might be a good idea for a future crowdsourcing post.
Just for fun, let’s see who ranks highest in some of these new statistical categories based on the micro amount of data we have:
Batters
Greatest Exit Velocity (off bat): Eric Hosmer, KC, 106.1 mph
Longest Fly Time: Juan Perez, SFN, 5.01 sec
Shortest Fly Time: Kolten Wong, STL, 0.95 sec
Fielding
Quickest Acceleration: Anthony Recker, NYN, 4.27 ft/sec²
Greatest Max Speed: Billy Hamilton, CIN and Ruben Tejada, NYN, 23.3 mph
Highest Route Efficiency: Omar Quintanilla, NYN, 100%
Quickest Release: Tony Cruz, STL, 0.37 sec
Fastest Velocity: Andrew McCutchen, PIT, 78.8 mph
Quickest First Step: Travis d’Arnaud, NYN, -1.7 sec
Base Running
Quickest First Step: Jhonny Peralta, STL, -1.18 sec
Quickest Acceleration: Omar Infante, KC, 9.99 ft/sec²
Greatest Max Speed: Jarrod Dyson, KC, 22.3 mph
Largest Lead Length: Pablo Sandoval, SFN, 17 ft
Largest Secondary Lead Length: Brandon Crawford, SFN, 21 ft
Pitching
Longest Extension: Yusmeiro Petit, SFN, 92 in
Highest Actual Velocity: Kevin Gausman, BAL, 99.6 mph
Highest Perceived Velocity: Kevin Gausman, BAL, 100.7 mph
Largest Difference between Perceived and Actual Velocity: Francisco Rodriguez, MIL, 2.9 mph
Greatest Spin Rate: Sergio Romo, SFN, 3002 rpm
These stats really don’t mean much since they’re only taken from a few plays, but imagine what we could come up with if we had every game’s stats. Also, think about how we could correlate some of this data with other metrics. How does a pitcher’s Spin Rate affect his Fly Ball or Ground Ball rate? How does a player’s Lead Length or First Step affect his Stolen Base percentage? Does a batter’s average Exit Velocity or Launch Angle have any correlation with his BABIP or OPS? No more just eyeballing whether a player is quick out of the box, or if he consistently takes a good route to the ball. This could also help quantify areas that players need to work on. A batter will now know if he needs to work on his acceleration out of the box, and a pitcher will know if his extension is causing him to throw more balls.
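As a rough sketch of how one of those correlation questions could be checked once enough data exists, the snippet below computes a plain Pearson correlation coefficient between two per-player columns, skipping any pair with a missing value. The function name and the sample numbers are hypothetical and made up for illustration; they are not real StatCast or Lahman figures, and a correlation on its own says nothing about cause.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences, ignoring pairs with None."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)

if __name__ == "__main__":
    # Made-up values: average lead length (ft) vs. stolen base percentage.
    lead_length = [10.0, 12.0, None, 14.0]
    sb_pct = [0.60, 0.70, 0.90, 0.80]
    print(pearson_r(lead_length, sb_pct))
```

Negative inputs, such as a negative First Step time, need no special handling here, since the calculation only depends on each value's distance from its column mean.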
All of these things will be dealt with as soon as we get more data. I am trying to increase my “First Step” rate by creating an Access database to house the new data before it is available. By no means do I think I have hit the nail on the head with this first attempt to store the new stats, but I at least wanted to get the ball rolling.
Matt Jackson
Great stuff, Stephen. This is a tremendous resource.
Any thoughts on how to handle negative values in the first step field? Perhaps they’re fine to leave as is as they reflect anticipation? I’d be interested to hear your thoughts on that and any possible limitations of the data you’ve noticed.
Stephen Shaw
The negative values shouldn’t be too much of a problem for calculating rate statistics. You can run a basic regression with negative values.
From my understanding the first step starts when the body starts moving. For example, if a player was tagging up on a fly ball and had a negative first step it doesn’t necessarily mean he left early but that his body started moving before the ball was caught.
I might put together a piece on how I would use the data and the limitations that we might have to deal with.
AD
How much, if any, of the StatCast data is it reasonable to expect MLB will release to the public?
Stephen Shaw
We do not know for sure how much data will be released to the public. If I had to guess, for the upcoming season we will probably only see more of the same type of videos being periodically released on the MLB.com website.
With that being said, since all of the teams will have these tracking systems installed in their ballparks, I do not see why they wouldn’t release the data to the public. If they do, it will probably come in some type of XML format, much like the PITCHf/x data.
Matt
I am excited to get into this. My thinking is the defensive metrics are about to take on an entirely different meaning. Brush up on your Ordinary Differential Equations and head back into the books for Vector Analysis and Vector Calculus. Fortunately I work with this every day, so I am chomping at the bit to get some raw data, even if I have to glean it from video for my own personal use.
AD
I’m not so sure about your second paragraph.
http://www.baseballprospectus.com/article.php?articleid=25816
AD
(Meant to direct that to Stephen above.)