Thursday, July 8, 2021

stoopidstats: wins/losses graphs update for the negro leagues

Before I start, let me acknowledge that the graphs which are an important visual part of this post are not good examples of data visualization. Some are completely worthless for gleaning information. The others are mostly worthless. But I had fun making them and have fun looking at them.

At some point last year Major League Baseball announced that the Negro Leagues, or at least some of them, would be recognized as Major League. In keeping with that pronouncement, Baseball Reference, one of my main sources of information for my baseball stoopidstats, has added Negro League baseball to its Major League records. You can read about that decision here. At this point, they are including 64 franchises in seven leagues, covering the years 1920-1948. Based on what they've written, I am expecting that this is only the first step, and more Negro League baseball will be added to the Major League record in the future. What's holding them back is that the data are hard to come by and research is taking time.

At any rate, this all affects my stoopidstats. Specifically, I had to update my big wins/losses charts to recognize this new official information. There are twenty major graphs, all of which can be seen below. But first, some commentary.

I have five graphs that show data by franchise. There have been 180 franchises, so each of these five graphs has 180 data series. The five are as follows:

  • cumulative wins;
  • cumulative wins, but with each series truncated after its franchise's last year of existence;
  • cumulative games over .500;
  • cumulative games over .500, but with each series truncated after its franchise's last year of existence;
  • rank (by cumulative wins).
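For anyone who wants to reproduce these series outside Excel, here is a minimal Python sketch. The function name and the data layout (a dict of year to wins and losses) are made up for illustration. It also assumes that "games over .500" means wins minus losses, and that an untruncated series sits at zero before the franchise's first season and stays flat after its last one.

```python
def cumulative_series(season_records, first_year, last_year, truncate=False):
    """Build cumulative wins and games-over-.500 series for one franchise.

    season_records maps year -> (wins, losses); years in which the franchise
    did not play are simply absent. Returns two dicts keyed by year.
    """
    final_season = max(season_records)
    wins_total = 0
    over_500_total = 0
    cum_wins, cum_over_500 = {}, {}
    for year in range(first_year, last_year + 1):
        if truncate and year > final_season:
            break  # truncated variant: stop after the franchise's last year
        wins, losses = season_records.get(year, (0, 0))
        wins_total += wins
        over_500_total += wins - losses  # assumption: games over .500 = W - L
        cum_wins[year] = wins_total
        cum_over_500[year] = over_500_total
    return cum_wins, cum_over_500


# Made-up numbers, not real records.
records = {1901: (10, 5), 1902: (7, 9)}
full_wins, full_over = cumulative_series(records, 1900, 1904)
cut_wins, cut_over = cumulative_series(records, 1900, 1904, truncate=True)
```

The untruncated series keeps running flat through the last year on the chart, while the truncated one ends with the franchise's final season.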
After that there are five analogous graphs, each grouping franchises by location (as indicated in their team name). Note that, even though Brooklyn is in New York, Brooklyn and New York have separate series -- just as Los Angeles, Anaheim and California are each separate series. Of course, as teams moved (or changed the location indicated in their names), their data go into different series. For example, consider the existing franchise currently known as the Los Angeles Angels. Their data from 1961-1964 and from 2005-2020 are part of the "Los Angeles" series. Their data from 1965-1996 are part of the "California" series. Their data from 1997-2004 are part of the "Anaheim" series.
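The Angels example can be sketched in Python roughly as follows. The year spans come from the paragraph above; the function names and the win totals are placeholders, not real data. The idea is the same as a SUMIFS keyed on location: each season's wins are added to whichever location series the team name pointed to that year.

```python
# (first_year, last_year, location) spans for one franchise,
# taken from the Angels example in the text.
ANGELS_LOCATIONS = [
    (1961, 1964, "Los Angeles"),
    (1965, 1996, "California"),
    (1997, 2004, "Anaheim"),
    (2005, 2020, "Los Angeles"),
]


def location_for(year, spans):
    """Return the location series a given season belongs to, or None."""
    for first, last, location in spans:
        if first <= year <= last:
            return location
    return None


def wins_by_location(season_wins, spans):
    """Sum season win totals into per-location totals."""
    totals = {}
    for year, wins in season_wins.items():
        location = location_for(year, spans)
        if location is not None:
            totals[location] = totals.get(location, 0) + wins
    return totals


# Placeholder win totals, purely illustrative.
sample_wins = {1961: 1, 1970: 2, 2002: 3, 2010: 4}
totals = wins_by_location(sample_wins, ANGELS_LOCATIONS)
```

Here the 1961 and 2010 seasons both land in the "Los Angeles" series even though they are decades apart.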

After that there are five analogous graphs, each grouping franchises by home state. For these purposes, I am considering the District of Columbia, Quebec and Ontario as states. For this reason I sometimes say "state or state-like entity."

Finally there are five analogous graphs, each grouping franchises by nickname.

The Complications
Adding the Negro leagues to my project presented me with some new issues.

The first issue is technological. My file uses a lot of "sumifs" functions, which (I understand) are taxing on Excel. I also use a bunch (though not as many) of "rank" functions. I would imagine that they are also taxing on Excel, but I may just be talking out my ass. In addition to the number crunching, there are five graphs with 180 series, five graphs with 155 series, five graphs with 75 series, and five graphs with 38 series. And each series in each graph covers 150 years. Before I added the Negro League information, the file seemed to be working fine. I guess the extra data pushed it over the edge, because the file started responding very slowly. At Meep's suggestion, I broke the file in two. One file does the number-crunching and the other does the graphing. It's not ideal, but what is?

The second issue has to do with team names and locations. Without the Negro Leagues, each team had one home location in each year*. Teams all had names of the form The <location> <nickname>, or names that could be shoehorned into that format. But the Negro Leagues had some franchises that had no home park and didn't identify a home location (e.g., the "Cuban Stars West"). For those I introduced "N/A" as a location and as a state. There were also quite a few teams that split time between two cities. I don't have it in me to try to break down all of their games between multiple locations; given that they also played away games, I think such attributions would be near-impossible if not completely impossible. For these, I am recognizing locations and states such as "Cincinnati/Indianapolis" and "Ohio/Indiana." I don't like to do it, but I see no better alternative.
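The labeling rule described above is simple enough to sketch in a few lines of Python. The function name is hypothetical, and the sketch assumes the home cities (or states) are listed in the order they should appear in the label.

```python
def location_label(home_places):
    """Combine a team's home cities (or states) into one series label.

    No home at all becomes "N/A"; two or more homes are joined with "/".
    """
    if not home_places:
        return "N/A"
    return "/".join(home_places)


split_city = location_label(["Cincinnati", "Indianapolis"])
split_state = location_label(["Ohio", "Indiana"])
no_home = location_label([])
```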

Third, I noticed some data inconsistencies. Every time a team wins a game, another team loses. And every time there's a tie game**, two teams register a tie. So each year (and, until interleague play began in 1997, each league in each year) should show the same number of wins and losses and an even number of ties. But that is not reflected in the statistics in Baseball Reference. According to BR:
  • The 1890 American Association had a combined record of 525 wins and 526 losses.
  • The 1942 Negro National League had a combined record of 156 wins and 157 losses.
  • The 1890 American Association had 29 ties.
  • The 1928 Eastern Colored League had 9 ties.
  • The 1933 Negro National League had 7 ties.
The statistical mistakes in the Negro Leagues are probably a result of incomplete records. Hopefully they will be corrected as more research is done. The 1890 American Association numbers are curious. In a prior iteration I did not see this issue. This means that the record was updated to something that clearly has an error. All this said, I don't have any basis to reflect numbers different than what is in BR.
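The sanity check behind that list can be sketched in Python as follows. The function name and the input layout (one wins/losses/ties tuple per team in a single league-season) are made up for illustration, and the sample numbers are not real records.

```python
def league_inconsistencies(team_records):
    """Check one league-season for impossible totals.

    team_records is a list of (wins, losses, ties), one tuple per team.
    Returns a list of human-readable problems (empty if none found).
    """
    total_wins = sum(wins for wins, _, _ in team_records)
    total_losses = sum(losses for _, losses, _ in team_records)
    total_ties = sum(ties for _, _, ties in team_records)

    problems = []
    if total_wins != total_losses:
        # Every win must be matched by a loss within the league.
        problems.append(f"wins ({total_wins}) != losses ({total_losses})")
    if total_ties % 2 != 0:
        # Every tie game is recorded by two teams, so the total must be even.
        problems.append(f"odd number of ties ({total_ties})")
    return problems


# Made-up two-team leagues for illustration.
consistent = league_inconsistencies([(2, 1, 1), (1, 2, 1)])
broken = league_inconsistencies([(2, 1, 1), (1, 1, 0)])
```

Note that this only works where teams play exclusively within their own league; once teams from different leagues play each other, the check has to run on the whole year instead.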

Finally, I note that, in ranking franchises, locations, states or nicknames by win total, there are inevitable ties. My first tie breaker is ties (on the theory that a tie is sort of half a win). My second tie breaker is games over .500 (on the theory that getting to n wins with x losses is better than getting to n wins with y losses if x<y). My final tie breaker is which team/location/state/nickname got to that number of wins first (on the theory that I had to do something).
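As a rough Python sketch, that tie-breaking order maps onto a single sort key. The function name and the tuple layout are hypothetical, and the sample entries are invented; in particular, "year_reached" stands in for whatever record is kept of when each entry first hit its current win total.

```python
def rank_entries(entries):
    """Rank entries by wins, then ties, then games over .500, then who got there first.

    entries maps name -> (wins, ties, games_over_500, year_reached).
    Returns a dict of name -> rank, with 1 as the best.
    """
    def sort_key(name):
        wins, ties, over_500, year_reached = entries[name]
        # Negate the values where bigger is better; an earlier year is better as-is.
        return (-wins, -ties, -over_500, year_reached)

    ordered = sorted(entries, key=sort_key)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Invented entries, chosen so every tie breaker gets used.
sample = {
    "A": (100, 2, 10, 1950),
    "B": (100, 3, 5, 1960),
    "C": (100, 2, 10, 1940),
    "D": (101, 0, -5, 1970),
}
ranks = rank_entries(sample)
```

D leads on wins alone, B beats A and C on ties, and C edges A only because it reached 100 wins earlier.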

And now, without further ado, the graphs:

________________________
*With one exception. The 1884 Union Association had a franchise that moved from Chicago to Pittsburgh during the season. For an earlier version of the file, I researched how many wins and losses are attributable to the team in each of its locations.
** Honestly, in the context of baseball, I am not exactly sure how ties occur. Ties were more frequent in the early days of the game, but they do occur in modern times. The last tie was in 2016. It involved the Cubs and the Pirates. If you want to research to find out what happened, please let me know.
