Swimming records cleanup and analysis

This is a tutorial in data cleanup using pandas. In one of my trainings, I asked the following questions.

Download the swimming data from https://en.wikipedia.org/wiki/List_of_world_records_in_swimming
Combine the frames with the records and clean up the data
Give a ranking of the swimming speed over different disciplines
Save the clean data in a file called swimming_records.feather
Give an estimate of how much time a turning point adds to swimming a distance
Give an estimate of how much time a start adds to adds to swimming

In this post, I try and answer these.

import pandas as pd

%pylab inline

Populating the interactive namespace from numpy and matplotlib

1. Getting the data

These swimming records are in HTML tables, and luckily the pd.read_html function does work.

url = "https://en.wikipedia.org/wiki/List_of_world_records_in_swimming"

tables = pd.read_html(url, header = 0, encoding='utf-8')
tables[0].head(5)

	Event	Time	Unnamed: 2	Name	Nationality	Date	Meet	Location	Ref
0	01 ! 50 m freestyle	20.91	NaN	Cielo, CésarCésar Cielo	Brazil	18 December 2009	Brazilian Championships	Brazil, São Paulo ! São Paulo, Brazil	[9][10]
1	02 ! 100 m freestyle	46.91	NaN	Cielo, CésarCésar Cielo	Brazil	30 July 2009	World Championships	Italy, Rome ! Rome, Italy	[11][12]
2	03 ! 200 m freestyle	1:42.00	NaN	Biedermann, PaulPaul Biedermann	Germany	28 July 2009	World Championships	Italy, Rome ! Rome, Italy	[13][14]
3	04 ! 400 m freestyle	3:40.07	NaN	Biedermann, PaulPaul Biedermann	Germany	26 July 2009	World Championships	Italy, Rome ! Rome, Italy	[15][16]
4	05 ! 800 m freestyle	7:32.12	NaN	Zhang Lin	China	29 July 2009	World Championships	Italy, Rome ! Rome, Italy	[17][18]

There are a few extra tables that we don’t need. In addition, the tables refer to long course (50m bath) or short course. I use a zip to append that information and then combine all the records into a separate table.

category = "men", "women", "mixed", "men", "women", "mixed"
course = "long", "long", "long", "short", "short", "short"

record_tables = [t for t in tables if "Event" in t.columns]
len(record_tables)

ts = [t.assign(category = c, course = i) for t, c, i in zip(record_tables, category, course)]
raw = pd.concat(ts).reset_index(drop=True)
raw.sample(5)

	Event	Time	Unnamed: 2	Name	Nationality	Date	Meet	Location	Ref	category	course
15	16 ! 200 m individual medley	1:54.00	NaN	Lochte, RyanRyan Lochte	United States	28 July 2011	World Championships	China, Shanghai ! Shanghai, China	[34][35]	men	long
2	03 ! 200 m freestyle	1:42.00	NaN	Biedermann, PaulPaul Biedermann	Germany	28 July 2009	World Championships	Italy, Rome ! Rome, Italy	[13][14]	men	long
49	08 ! 100 m backstroke	48.90	NaN	Kolesnikov, KlimentKliment Kolesnikov	Russia	22 December 2017	Vladimir Salnikov Cup	Russia, Saint Petersburg ! Saint Petersburg, R...	[79]	men	short
73	08 ! 100 m backstroke	55.03	NaN	Hosszú, KatinkaKatinka Hosszú	Hungary	4 December 2014	World Championships	Qatar, Doha ! Doha, Qatar	[100]	women	short
56	15 ! 200 m butterfly	1:48.56	NaN	le Clos, ChadChad le Clos	South Africa	5 November 2013	World Cup	Singapore, Singapore ! Singapore, Singapore	[83]	men	short

First things first, I like to work with column names without spaces. There are two columns that are not useful to us, so I drop them immediately.

raw.shape

(95, 11)

raw.rename(columns = lambda x: x.lower().replace(' ', ''), inplace = True)
raw.drop(['ref', 'unnamed:2'], axis = 1, inplace = True, errors='ignore')
raw.head(2)

	event	time	name	nationality	date	meet	location	category	course
0	01 ! 50 m freestyle	20.91	Cielo, CésarCésar Cielo	Brazil	18 December 2009	Brazilian Championships	Brazil, São Paulo ! São Paulo, Brazil	men	long
1	02 ! 100 m freestyle	46.91	Cielo, CésarCésar Cielo	Brazil	30 July 2009	World Championships	Italy, Rome ! Rome, Italy	men	long

2. Parsing the data in the event and time columns

The data has a lot of challenging parsing problems. In this post, I use regexes to solve most of those. There are non-breaking spaces, as you cannot see in the table, but it does show if you print separate values as below. There is also a weird x character, that you might mistake for an ‘x’. I just copied it into the regex formula.

raw.event.values[0], raw.event.values[25], raw.event.values[17]

('01 ! 50\xa0m freestyle',
 '06 ! 1500\xa0m freestyle',
 '18 ! 4×100\xa0m freestyle relay')

parsed_events = raw.event.str.replace(u'\xa0', ' ').str.extract('[\d\.]+ ! (?:(?P<team_size>4)×)?(?P<distance>\d{2,4}) m (?P<swimstyle>[a-z ]+)', expand = True)
parsed_events[parsed_events.swimstyle.isnull()]

	team_size	distance	swimstyle

parsed_events["team_size"] = parsed_events.team_size.fillna(1).astype(int)
parsed_events.distance = parsed_events.distance.astype(int) * parsed_events.team_size

parsed_events.assign(original = raw.event).sample(5)

	team_size	distance	swimstyle	original
32	1	50	butterfly	13 ! 50 m butterfly
60	4	200	freestyle relay	19.1 ! 4×50 m freestyle relay
73	1	100	backstroke	08 ! 100 m backstroke
23	1	400	freestyle	04 ! 400 m freestyle
14	1	200	butterfly	15 ! 200 m butterfly

Next up are the times. It is possible to tackle these with strftime, but since I’m doing regexes already, let’s use those again.

parsed_times = raw.time.str.extract("(?P<m>\d{1,2})?:?(?P<s>\d{2})\.(?P<ms>\d{2})", expand = True)
time_seconds = parsed_times.m.astype(float).fillna(0) * 60 + parsed_times.s.astype(float) + parsed_times.ms.astype(float) / 100
raw.assign(time_seconds = time_seconds)[["event", "time_seconds"]].sample(10)

	event	time_seconds
64	22 ! 4×50 m medley relay	90.44
92	23 ! 4×100 m medley relay	225.20
34	15 ! 200 m butterfly	121.81
43	02 ! 100 m freestyle	44.94
89	21 ! 4×200 m freestyle relay	452.85
25	06 ! 1500 m freestyle	925.48
20	01 ! 50 m freestyle	23.67
18	19 ! 4×200 m freestyle relay	418.55
50	09 ! 200 m backstroke	105.63
72	07 ! 50 m backstroke	25.67

3. Splitting doubled name and location

The record holder names come out of the html in an interesting way. The pd.read_html function includes the title of the cell, which is the same name, in different order. Instead of splitting the strings on half their length plus one, I use a very explicit backref operator to make sure my assumptions on these data fields are correct.

raw.name.values[0], raw.name.values[10]

('Cielo, CésarCésar Cielo', 'Peaty, AdamAdam Peaty')

## Two patterns. One works for most
## But names like Sun Yang are not repeated
## Maybe because it's a chinese character in the html that is ommitted by the parser?
parsed_names = raw.name.str.extract(r"(?P<last_name>[\w ]+), (?P<first_name>[\S ]+)\2 \1", expand = True)
parsed_names_simple = raw.name.str.extract(r"^(?P<last_name>[\w]+) (?P<first_name>[\w ]+)$", expand = True)

q = parsed_names.first_name.isnull() & parsed_names_simple.first_name.notnull()
parsed_names.loc[q] = parsed_names_simple.loc[q]

parsed_names.assign(original = raw.name).sample(10)

	last_name	first_name	original
26	Zhao	Jing	Zhao Jing
45	Agnel	Yannick	Agnel, YannickYannick Agnel
36	Hosszú	Katinka	Hosszú, KatinkaKatinka Hosszú
0	Cielo	César	Cielo, CésarCésar Cielo
69	Belmonte García	Mireia	Belmonte García, MireiaMireia Belmonte García
5	Sun	Yang	Sun Yang
80	Alshammar	Therese	Alshammar, ThereseTherese Alshammar
50	Larkin	Mitch	Larkin, MitchMitch Larkin
3	Biedermann	Paul	Biedermann, PaulPaul Biedermann
74	Hosszú	Katinka	Hosszú, KatinkaKatinka Hosszú

These names look correct. For the teams, the medley records, I leave the names blank. Next up are the location fields.

raw.location.values[0], raw.location.values[17]

('Brazil, São Paulo ! São Paulo, Brazil', 'China, Beijing ! Beijing, China')

parsed_locations = raw.location.str.extract(r'(?P<country>[\w ]+), (?P<city>[\w ]+) ! \2, \1', expand = True)
parsed_locations.assign(original = raw.location).sample(5)

	country	city	original
66	Germany	Berlin	Germany, Berlin ! Berlin, Germany
42	Qatar	Doha	Qatar, Doha ! Doha, Qatar
74	Qatar	Doha	Qatar, Doha ! Doha, Qatar
48	Qatar	Doha	Qatar, Doha ! Doha, Qatar
37	Brazil	Rio de Janeiro	Brazil, Rio de Janeiro ! Rio de Janeiro, Brazil

## check whether the parsing succeeded for all the locations
parsed_locations[parsed_locations.country.isnull()]

	country	city

4. Combining the data into a clean set

We have all the separate parts of the dataset and assemble them into a pd.DataFrame called `events.

events = parsed_events.join(
    time_seconds.rename("time")
).join(
    pd.to_datetime(raw.date)
).join(parsed_locations).join(parsed_names).join(
    raw[["course", "category", "nationality", "meet"]]
)

events.sample(5)

	team_size	distance	swimstyle	time	date	country	city	last_name	first_name	course	category	nationality	meet
86	4	200	freestyle relay	93.91	2017-12-15	Denmark	Copenhagen	NaN	NaN	short	women	Netherlands	European Championships
5	1	1500	freestyle	871.02	2012-08-04	United Kingdom	London	Sun	Yang	long	men	China	Olympic Games
57	1	100	individual medley	50.30	2016-08-30	Germany	Berlin	Morozov	Vladimir	short	men	Russia	World Cup
29	1	50	breaststroke	29.40	2017-07-30	Hungary	Budapest	King	Lilly	long	women	United States	World Championships
43	1	100	freestyle	44.94	2008-12-13	Croatia	Rijeka	Leveaux	Amaury	short	men	France	European Championships

events.dtypes

team_size               int64
distance                int64
swimstyle              object
time                  float64
date           datetime64[ns]
country                object
city                   object
last_name              object
first_name             object
course                 object
category               object
nationality            object
meet                   object
dtype: object

This looks fine. I’d like to use the categorical type in pandas, but it actually gives me some problems here, see https://github.com/pandas-dev/pandas/issues/19136 for example. Let’s make some plots and the ranking of swimming speeds to check if this data looks allright.

import seaborn as sns

fig = plt.figure(figsize = (8, 6))
ax = fig.subplots(1)
pal = sns.color_palette()

for i, (st, frame) in enumerate(events.groupby(["swimstyle"])):
    frame.plot.scatter(x = 'time', y = 'distance', label = st, ax = ax, color = pal[i], s = 50, alpha = .4);
    
plt.title("Speed versus distance of swimming records")
plt.grid(True);

png

import calendar
month_names = [calendar.month_name[i] for i in range(1, 13)]

events.date.dt.strftime("%B").value_counts().reindex(month_names).plot.bar(color = "steelblue")
plt.title("Number of records in swimming set in each month");

png

Interesting that most records are broken in December. My first hypothesis would be that this has to do with a recent tournament.

5. Ranking the disciplines on speed

events["speed"] = events.distance / events.time
events.sort_values('speed', ascending = False).head()

	team_size	distance	swimstyle	time	date	country	city	last_name	first_name	course	category	nationality	meet	speed
61	4	200	freestyle relay	80.77	2008-12-14	Croatia	Rijeka	NaN	NaN	short	men	France	European Championships	2.476167
42	1	50	freestyle	20.26	2014-12-05	Qatar	Doha	Manaudou	Florent	short	men	France	World Championships	2.467917
60	4	200	freestyle relay	82.60	2014-12-06	Qatar	Doha	NaN	NaN	short	men	Russia	World Championships	2.421308
0	1	50	freestyle	20.91	2009-12-18	Brazil	São Paulo	Cielo	César	long	men	Brazil	Brazilian Championships	2.391200
54	1	50	butterfly	21.80	2009-11-14	Germany	Berlin	Deibler	Steffen	short	men	Germany	World Cup	2.293578

events.sort_values('speed', ascending = False).tail()

	team_size	distance	swimstyle	time	date	country	city	last_name	first_name	course	category	nationality	meet	speed
30	1	100	breaststroke	64.13	2017-07-25	Hungary	Budapest	King	Lilly	long	women	United States	World Championships	1.559333
85	1	400	individual medley	258.94	2017-08-12	Netherlands	Eindhoven	Belmonte Garcia	Mireia	short	women	Spain	World Cup	1.544759
36	1	400	individual medley	266.36	2016-08-06	Brazil	Rio de Janeiro	Hosszú	Katinka	long	women	Hungary	Olympic Games	1.501727
79	1	200	breaststroke	134.57	2009-12-18	United Kingdom	Manchester	Soni	Rebecca	short	women	United States	Duel in the Pool	1.486215
31	1	200	breaststroke	139.11	2013-08-01	Spain	Barcelona	Pedersen	Rikke Møller	long	women	Denmark	World Championships	1.437711

The fastest discipline is the 4 times 50 freestyle relay. That’s interesting because that would mean that the split times on this have to be faster than the world record, at least for one. Another aspect to keep in mind is that some records are listed twice, I assume because of the records that were broken with the floating suits that came up some time ago, and were banned afterwards.

The slowest discipline is the breaststroke, that makes sense. Apparently, there is no breaststroke 400m record. Also, from the table below, we can see quite clearly that short course is faster than long course.

events.loc[events.swimstyle == "breaststroke", ["distance", "swimstyle", "speed", "time", "category", "course"]].sort_values('speed')

	distance	swimstyle	speed	time	category	course
31	200	breaststroke	1.437711	139.11	women	long
79	200	breaststroke	1.486215	134.57	women	short
30	100	breaststroke	1.559333	64.13	women	long
11	200	breaststroke	1.578906	126.67	men	long
76	100	breaststroke	1.603592	62.36	women	short
77	100	breaststroke	1.603592	62.36	women	short
78	100	breaststroke	1.603592	62.36	women	short
53	200	breaststroke	1.660578	120.44	men	short
29	50	breaststroke	1.700680	29.40	women	long
75	50	breaststroke	1.745810	28.64	women	short
10	100	breaststroke	1.750394	57.13	men	long
52	100	breaststroke	1.798238	55.61	men	short
9	50	breaststroke	1.926782	25.95	men	long
51	50	breaststroke	1.980198	25.25	men	short

events.to_feather("./data/swimming-records.feather")

Now that we have the data in what looks like a clean format, it is time to turn to the two other questions.

6. Give an estimate of how much time a turning point adds to swimming a distance

For calculating the extra time a turn adds, I use a pivot table. The idea is that I can only really compare the same distances and swimming styles.

view = events.pivot_table(index=["distance", "swimstyle", "category"], columns='course', values='time').reset_index()
view.sample(5)

course	distance	swimstyle	category	long	short
46	800	freestyle relay	men	418.55	409.04
37	400	freestyle relay	mixed	199.60	NaN
13	100	butterfly	women	55.48	54.61
32	200	medley relay	mixed	NaN	97.17
48	1500	freestyle	men	871.02	848.06

view["additional_turnpoints"] = ((view.distance / 25) - 1) - ((view.distance / 50) - 1)
view["additional_time_per_turn"] = (view.short - view.long) / (view.additional_turnpoints)

view = view[~view.swimstyle.str.contains('relay')].copy()
view.sample(5)

course	distance	swimstyle	category	long	short	additional_turnpoints	additional_time_per_turn
23	200	butterfly	women	121.81	119.61	4.0	-0.5500
10	100	breaststroke	men	57.13	55.61	2.0	-0.7600
40	400	individual medley	women	266.36	258.94	8.0	-0.9275
18	200	backstroke	men	111.92	105.63	4.0	-1.5725
34	400	freestyle	men	220.07	212.25	8.0	-0.9775

Instead of averaging this, I will create a visualization that shows the differences in each discipline.

pal = sns.color_palette()

def plot_cat(category, ax):
    for i, (style, frame) in enumerate(view.groupby("swimstyle")):
        frame.plot.scatter(x = "additional_time_per_turn", y = "distance", ax = ax, label = style, c = pal[i]);
    
fig, axes = plt.subplots(1, 2, figsize = (10, 4))

for cat, ax in zip(["women", "men"], axes):
    plot_cat(cat, ax)

png

It’s just a bit painful to fix this in pandas / matplotlib. I like to use ggplot for this, switch to R, but for this post I’ll give plotnine a try.

import plotnine as pn

fig = pn.ggplot(view[view.category != "mixed"].dropna()) + \
    pn.geom_jitter(
        pn.aes(x = "factor(distance)", y = "additional_time_per_turn", color = "swimstyle"), 
        width = .1, height = 0) + \
    pn.facet_wrap(facets = "category") + \
    pn.labs(
        x = "Distance", 
        y = "Average additional time for each turn (s)",
        title = "How long does it take to turn around in swimming?")
    
fig.draw();

/home/gijsx/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

png

plotnine works fine. I don’t like the distance on the x axis. Below I’ve created one with speed, which is also not ideal. But we can see how long a turn takes.

fig = pn.ggplot(view[view.category != "mixed"].dropna().assign(long_speed = view.distance / view.long)) + \
    pn.geom_point(pn.aes(x = "long_speed", y = "additional_time_per_turn", color = "swimstyle")) + \
    pn.facet_wrap(facets = "category") + \
    pn.labs(
        x = "Record speed on long course (m/s)", 
        y = "Average additional time for each turn (s)",
        title = "How long does it take to turn around in swimming?")
    
fig.draw();

png

We can see that a turn doesn’t really take time, it’s actually quicker than swimming. The difference between long and short course seems larger for backstroke. The turn in the butterfly style isn’t too quick, perhaps because it’s a tricky technique, or because the butterfly is already quite quick? On longer distances, the advantage of the turns disappears a bit.

I’ll consider this question answered, allthough there is much more to be said. The most important critique that I can come up with here is that it’s perhaps not the best idea to answer this question using records. These are weird cases, and with more data, old records or just data from many swimming tournaments, we could produce a much more reliable estimate.

7. Give an estimate of how much time a start adds to adds to swimming

Now onto the next question; how long does a start take? This is tricky because we cannot really isolate the starts from the data, since each record has exactly one start. You could say 100m equals 2 x 50m without the extra start, but the swimming speeds are also different. I think I’d need some physical model to estimate speed from this data, and then use that in the equation.

Interesting. Hopefully I can do that some other time. For now, I’ll just use the simplified idea.

view = pd.concat([events.assign(doubled = "real"), events.assign(time = events.time * 2, distance = events.distance * 2, doubled = "hypothetical")])
view = view[["distance", "swimstyle", "doubled", "category", "course", "time"]].copy()
view = view.pivot_table(index=["distance", "swimstyle", "category", "course"], values="time", columns="doubled").dropna().reset_index()
view["difference"] = view.real - view.hypothetical
view.sample(5)

doubled	distance	swimstyle	category	course	hypothetical	real	difference
42	400	individual medley	women	long	252.24	266.36	14.12
51	800	freestyle relay	men	short	366.60	409.04	42.44
27	200	butterfly	women	short	109.22	119.61	10.39
24	200	butterfly	men	long	99.64	111.51	11.87
26	200	butterfly	women	long	110.96	121.81	10.85

q = (view.category != "mixed") & ~view.swimstyle.str.contains("relay")

fig = pn.ggplot(view[q].dropna()) + \
    pn.geom_jitter(
        pn.aes(x = "factor(distance)", y = "difference", color = "swimstyle", shape = "course"),
        width = .1, height = 0) + \
    pn.facet_wrap(facets = "category") + \
    pn.labs(
        x = "Distance", 
        y = "Difference (s)",
        title = "Time difference between record and twice that record on half the distance")
    
fig.draw();

png

Well, the start doesn’t save more than a couple of seconds, but it’s hard to say more. These difference are mainly fatigue of the swimmer.

I do see some other interesting patterns. It seems the record for freestyle 400m, for men, on a long course, is not as sharp as it could be. Also the male records on 100m butterfly are quite fast compared to the 200m records.

8. Conclusion

The data is cleaned. We have an idea of how long a turn takes, or rather, how much time it saves. I haven’t come up with an easy method to isolate the starting times. The clean data can be found on GitHub if you want to give this a try, or perhaps you want to have a look at one of these other open questions:

Why are the records posted mainly in June, July, November and December?
How does the time saved with a turn differ across swimmers, for example world records vs. professionals vs. amateurs?
How can we make a model that predicts the next record to be set?