This is a tutorial in data cleanup using pandas. In one of my trainings, I asked the following questions.

  • Download the swimming data from https://en.wikipedia.org/wiki/List_of_world_records_in_swimming
  • Combine the frames with the records and clean up the data
  • Give a ranking of the swimming speed over different disciplines
  • Save the clean data in a file called swimming_records.feather
  • Give an estimate of how much time a turning point adds to swimming a distance
  • Give an estimate of how much time a start adds to adds to swimming

In this post, I try and answer these.


import pandas as pd

%pylab inline
Populating the interactive namespace from numpy and matplotlib

1. Getting the data

These swimming records are in HTML tables, and luckily the pd.read_html function does work.

url = "https://en.wikipedia.org/wiki/List_of_world_records_in_swimming"

tables = pd.read_html(url, header = 0, encoding='utf-8')
tables[0].head(5)
Event Time Unnamed: 2 Name Nationality Date Meet Location Ref
0 01 ! 50 m freestyle 20.91 NaN Cielo, CésarCésar Cielo Brazil 18 December 2009 Brazilian Championships Brazil, São Paulo ! São Paulo, Brazil [9][10]
1 02 ! 100 m freestyle 46.91 NaN Cielo, CésarCésar Cielo Brazil 30 July 2009 World Championships Italy, Rome ! Rome, Italy [11][12]
2 03 ! 200 m freestyle 1:42.00 NaN Biedermann, PaulPaul Biedermann Germany 28 July 2009 World Championships Italy, Rome ! Rome, Italy [13][14]
3 04 ! 400 m freestyle 3:40.07 NaN Biedermann, PaulPaul Biedermann Germany 26 July 2009 World Championships Italy, Rome ! Rome, Italy [15][16]
4 05 ! 800 m freestyle 7:32.12 NaN Zhang Lin China 29 July 2009 World Championships Italy, Rome ! Rome, Italy [17][18]

There are a few extra tables that we don’t need. In addition, the tables refer to long course (50m bath) or short course. I use a zip to append that information and then combine all the records into a separate table.

category = "men", "women", "mixed", "men", "women", "mixed"
course = "long", "long", "long", "short", "short", "short"

record_tables = [t for t in tables if "Event" in t.columns]
len(record_tables)
6
ts = [t.assign(category = c, course = i) for t, c, i in zip(record_tables, category, course)]
raw = pd.concat(ts).reset_index(drop=True)
raw.sample(5)
Event Time Unnamed: 2 Name Nationality Date Meet Location Ref category course
15 16 ! 200 m individual medley 1:54.00 NaN Lochte, RyanRyan Lochte United States 28 July 2011 World Championships China, Shanghai ! Shanghai, China [34][35] men long
2 03 ! 200 m freestyle 1:42.00 NaN Biedermann, PaulPaul Biedermann Germany 28 July 2009 World Championships Italy, Rome ! Rome, Italy [13][14] men long
49 08 ! 100 m backstroke 48.90 NaN Kolesnikov, KlimentKliment Kolesnikov Russia 22 December 2017 Vladimir Salnikov Cup Russia, Saint Petersburg ! Saint Petersburg, R... [79] men short
73 08 ! 100 m backstroke 55.03 NaN Hosszú, KatinkaKatinka Hosszú Hungary 4 December 2014 World Championships Qatar, Doha ! Doha, Qatar [100] women short
56 15 ! 200 m butterfly 1:48.56 NaN le Clos, ChadChad le Clos South Africa 5 November 2013 World Cup Singapore, Singapore ! Singapore, Singapore [83] men short

First things first, I like to work with column names without spaces. There are two columns that are not useful to us, so I drop them immediately.

raw.shape
(95, 11)
raw.rename(columns = lambda x: x.lower().replace(' ', ''), inplace = True)
raw.drop(['ref', 'unnamed:2'], axis = 1, inplace = True, errors='ignore')
raw.head(2)
event time name nationality date meet location category course
0 01 ! 50 m freestyle 20.91 Cielo, CésarCésar Cielo Brazil 18 December 2009 Brazilian Championships Brazil, São Paulo ! São Paulo, Brazil men long
1 02 ! 100 m freestyle 46.91 Cielo, CésarCésar Cielo Brazil 30 July 2009 World Championships Italy, Rome ! Rome, Italy men long

2. Parsing the data in the event and time columns

The data has a lot of challenging parsing problems. In this post, I use regexes to solve most of those. There are non-breaking spaces, as you cannot see in the table, but it does show if you print separate values as below. There is also a weird x character, that you might mistake for an ‘x’. I just copied it into the regex formula.

raw.event.values[0], raw.event.values[25], raw.event.values[17]
('01 ! 50\xa0m freestyle',
 '06 ! 1500\xa0m freestyle',
 '18 ! 4×100\xa0m freestyle relay')
parsed_events = raw.event.str.replace(u'\xa0', ' ').str.extract('[\d\.]+ ! (?:(?P<team_size>4)×)?(?P<distance>\d{2,4}) m (?P<swimstyle>[a-z ]+)', expand = True)
parsed_events[parsed_events.swimstyle.isnull()]
team_size distance swimstyle
parsed_events["team_size"] = parsed_events.team_size.fillna(1).astype(int)
parsed_events.distance = parsed_events.distance.astype(int) * parsed_events.team_size

parsed_events.assign(original = raw.event).sample(5)
team_size distance swimstyle original
32 1 50 butterfly 13 ! 50 m butterfly
60 4 200 freestyle relay 19.1 ! 4×50 m freestyle relay
73 1 100 backstroke 08 ! 100 m backstroke
23 1 400 freestyle 04 ! 400 m freestyle
14 1 200 butterfly 15 ! 200 m butterfly

Next up are the times. It is possible to tackle these with strftime, but since I’m doing regexes already, let’s use those again.

parsed_times = raw.time.str.extract("(?P<m>\d{1,2})?:?(?P<s>\d{2})\.(?P<ms>\d{2})", expand = True)
time_seconds = parsed_times.m.astype(float).fillna(0) * 60 + parsed_times.s.astype(float) + parsed_times.ms.astype(float) / 100
raw.assign(time_seconds = time_seconds)[["event", "time_seconds"]].sample(10)
event time_seconds
64 22 ! 4×50 m medley relay 90.44
92 23 ! 4×100 m medley relay 225.20
34 15 ! 200 m butterfly 121.81
43 02 ! 100 m freestyle 44.94
89 21 ! 4×200 m freestyle relay 452.85
25 06 ! 1500 m freestyle 925.48
20 01 ! 50 m freestyle 23.67
18 19 ! 4×200 m freestyle relay 418.55
50 09 ! 200 m backstroke 105.63
72 07 ! 50 m backstroke 25.67

3. Splitting doubled name and location

The record holder names come out of the html in an interesting way. The pd.read_html function includes the title of the cell, which is the same name, in different order. Instead of splitting the strings on half their length plus one, I use a very explicit backref operator to make sure my assumptions on these data fields are correct.

raw.name.values[0], raw.name.values[10]
('Cielo, CésarCésar Cielo', 'Peaty, AdamAdam Peaty')
## Two patterns. One works for most
## But names like Sun Yang are not repeated
## Maybe because it's a chinese character in the html that is ommitted by the parser?
parsed_names = raw.name.str.extract(r"(?P<last_name>[\w ]+), (?P<first_name>[\S ]+)\2 \1", expand = True)
parsed_names_simple = raw.name.str.extract(r"^(?P<last_name>[\w]+) (?P<first_name>[\w ]+)$", expand = True)

q = parsed_names.first_name.isnull() & parsed_names_simple.first_name.notnull()
parsed_names.loc[q] = parsed_names_simple.loc[q]

parsed_names.assign(original = raw.name).sample(10)
last_name first_name original
26 Zhao Jing Zhao Jing
45 Agnel Yannick Agnel, YannickYannick Agnel
36 Hosszú Katinka Hosszú, KatinkaKatinka Hosszú
0 Cielo César Cielo, CésarCésar Cielo
69 Belmonte García Mireia Belmonte García, MireiaMireia Belmonte García
5 Sun Yang Sun Yang
80 Alshammar Therese Alshammar, ThereseTherese Alshammar
50 Larkin Mitch Larkin, MitchMitch Larkin
3 Biedermann Paul Biedermann, PaulPaul Biedermann
74 Hosszú Katinka Hosszú, KatinkaKatinka Hosszú

These names look correct. For the teams, the medley records, I leave the names blank. Next up are the location fields.

raw.location.values[0], raw.location.values[17]
('Brazil, São Paulo ! São Paulo, Brazil', 'China, Beijing ! Beijing, China')
parsed_locations = raw.location.str.extract(r'(?P<country>[\w ]+), (?P<city>[\w ]+) ! \2, \1', expand = True)
parsed_locations.assign(original = raw.location).sample(5)
country city original
66 Germany Berlin Germany, Berlin ! Berlin, Germany
42 Qatar Doha Qatar, Doha ! Doha, Qatar
74 Qatar Doha Qatar, Doha ! Doha, Qatar
48 Qatar Doha Qatar, Doha ! Doha, Qatar
37 Brazil Rio de Janeiro Brazil, Rio de Janeiro ! Rio de Janeiro, Brazil
## check whether the parsing succeeded for all the locations
parsed_locations[parsed_locations.country.isnull()]
country city

4. Combining the data into a clean set

We have all the separate parts of the dataset and assemble them into a pd.DataFrame called `events.

events = parsed_events.join(
    time_seconds.rename("time")
).join(
    pd.to_datetime(raw.date)
).join(parsed_locations).join(parsed_names).join(
    raw[["course", "category", "nationality", "meet"]]
)

events.sample(5)
team_size distance swimstyle time date country city last_name first_name course category nationality meet
86 4 200 freestyle relay 93.91 2017-12-15 Denmark Copenhagen NaN NaN short women Netherlands European Championships
5 1 1500 freestyle 871.02 2012-08-04 United Kingdom London Sun Yang long men China Olympic Games
57 1 100 individual medley 50.30 2016-08-30 Germany Berlin Morozov Vladimir short men Russia World Cup
29 1 50 breaststroke 29.40 2017-07-30 Hungary Budapest King Lilly long women United States World Championships
43 1 100 freestyle 44.94 2008-12-13 Croatia Rijeka Leveaux Amaury short men France European Championships
events.dtypes
team_size               int64
distance                int64
swimstyle              object
time                  float64
date           datetime64[ns]
country                object
city                   object
last_name              object
first_name             object
course                 object
category               object
nationality            object
meet                   object
dtype: object

This looks fine. I’d like to use the categorical type in pandas, but it actually gives me some problems here, see https://github.com/pandas-dev/pandas/issues/19136 for example. Let’s make some plots and the ranking of swimming speeds to check if this data looks allright.

import seaborn as sns

fig = plt.figure(figsize = (8, 6))
ax = fig.subplots(1)
pal = sns.color_palette()

for i, (st, frame) in enumerate(events.groupby(["swimstyle"])):
    frame.plot.scatter(x = 'time', y = 'distance', label = st, ax = ax, color = pal[i], s = 50, alpha = .4);
    
plt.title("Speed versus distance of swimming records")
plt.grid(True);

png

import calendar
month_names = [calendar.month_name[i] for i in range(1, 13)]

events.date.dt.strftime("%B").value_counts().reindex(month_names).plot.bar(color = "steelblue")
plt.title("Number of records in swimming set in each month");

png

Interesting that most records are broken in December. My first hypothesis would be that this has to do with a recent tournament.


5. Ranking the disciplines on speed

events["speed"] = events.distance / events.time
events.sort_values('speed', ascending = False).head()
team_size distance swimstyle time date country city last_name first_name course category nationality meet speed
61 4 200 freestyle relay 80.77 2008-12-14 Croatia Rijeka NaN NaN short men France European Championships 2.476167
42 1 50 freestyle 20.26 2014-12-05 Qatar Doha Manaudou Florent short men France World Championships 2.467917
60 4 200 freestyle relay 82.60 2014-12-06 Qatar Doha NaN NaN short men Russia World Championships 2.421308
0 1 50 freestyle 20.91 2009-12-18 Brazil São Paulo Cielo César long men Brazil Brazilian Championships 2.391200
54 1 50 butterfly 21.80 2009-11-14 Germany Berlin Deibler Steffen short men Germany World Cup 2.293578
events.sort_values('speed', ascending = False).tail()
team_size distance swimstyle time date country city last_name first_name course category nationality meet speed
30 1 100 breaststroke 64.13 2017-07-25 Hungary Budapest King Lilly long women United States World Championships 1.559333
85 1 400 individual medley 258.94 2017-08-12 Netherlands Eindhoven Belmonte Garcia Mireia short women Spain World Cup 1.544759
36 1 400 individual medley 266.36 2016-08-06 Brazil Rio de Janeiro Hosszú Katinka long women Hungary Olympic Games 1.501727
79 1 200 breaststroke 134.57 2009-12-18 United Kingdom Manchester Soni Rebecca short women United States Duel in the Pool 1.486215
31 1 200 breaststroke 139.11 2013-08-01 Spain Barcelona Pedersen Rikke Møller long women Denmark World Championships 1.437711

The fastest discipline is the 4 times 50 freestyle relay. That’s interesting because that would mean that the split times on this have to be faster than the world record, at least for one. Another aspect to keep in mind is that some records are listed twice, I assume because of the records that were broken with the floating suits that came up some time ago, and were banned afterwards.

The slowest discipline is the breaststroke, that makes sense. Apparently, there is no breaststroke 400m record. Also, from the table below, we can see quite clearly that short course is faster than long course.

events.loc[events.swimstyle == "breaststroke", ["distance", "swimstyle", "speed", "time", "category", "course"]].sort_values('speed')
distance swimstyle speed time category course
31 200 breaststroke 1.437711 139.11 women long
79 200 breaststroke 1.486215 134.57 women short
30 100 breaststroke 1.559333 64.13 women long
11 200 breaststroke 1.578906 126.67 men long
76 100 breaststroke 1.603592 62.36 women short
77 100 breaststroke 1.603592 62.36 women short
78 100 breaststroke 1.603592 62.36 women short
53 200 breaststroke 1.660578 120.44 men short
29 50 breaststroke 1.700680 29.40 women long
75 50 breaststroke 1.745810 28.64 women short
10 100 breaststroke 1.750394 57.13 men long
52 100 breaststroke 1.798238 55.61 men short
9 50 breaststroke 1.926782 25.95 men long
51 50 breaststroke 1.980198 25.25 men short
events.to_feather("./data/swimming-records.feather")

Now that we have the data in what looks like a clean format, it is time to turn to the two other questions.


6. Give an estimate of how much time a turning point adds to swimming a distance

For calculating the extra time a turn adds, I use a pivot table. The idea is that I can only really compare the same distances and swimming styles.

view = events.pivot_table(index=["distance", "swimstyle", "category"], columns='course', values='time').reset_index()
view.sample(5)
course distance swimstyle category long short
46 800 freestyle relay men 418.55 409.04
37 400 freestyle relay mixed 199.60 NaN
13 100 butterfly women 55.48 54.61
32 200 medley relay mixed NaN 97.17
48 1500 freestyle men 871.02 848.06
view["additional_turnpoints"] = ((view.distance / 25) - 1) - ((view.distance / 50) - 1)
view["additional_time_per_turn"] = (view.short - view.long) / (view.additional_turnpoints)

view = view[~view.swimstyle.str.contains('relay')].copy()
view.sample(5)
course distance swimstyle category long short additional_turnpoints additional_time_per_turn
23 200 butterfly women 121.81 119.61 4.0 -0.5500
10 100 breaststroke men 57.13 55.61 2.0 -0.7600
40 400 individual medley women 266.36 258.94 8.0 -0.9275
18 200 backstroke men 111.92 105.63 4.0 -1.5725
34 400 freestyle men 220.07 212.25 8.0 -0.9775

Instead of averaging this, I will create a visualization that shows the differences in each discipline.

pal = sns.color_palette()

def plot_cat(category, ax):
    for i, (style, frame) in enumerate(view.groupby("swimstyle")):
        frame.plot.scatter(x = "additional_time_per_turn", y = "distance", ax = ax, label = style, c = pal[i]);
    
fig, axes = plt.subplots(1, 2, figsize = (10, 4))

for cat, ax in zip(["women", "men"], axes):
    plot_cat(cat, ax)

png

It’s just a bit painful to fix this in pandas / matplotlib. I like to use ggplot for this, switch to R, but for this post I’ll give plotnine a try.

import plotnine as pn

fig = pn.ggplot(view[view.category != "mixed"].dropna()) + \
    pn.geom_jitter(
        pn.aes(x = "factor(distance)", y = "additional_time_per_turn", color = "swimstyle"), 
        width = .1, height = 0) + \
    pn.facet_wrap(facets = "category") + \
    pn.labs(
        x = "Distance", 
        y = "Average additional time for each turn (s)",
        title = "How long does it take to turn around in swimming?")
    
fig.draw();
/home/gijsx/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

png

plotnine works fine. I don’t like the distance on the x axis. Below I’ve created one with speed, which is also not ideal. But we can see how long a turn takes.

fig = pn.ggplot(view[view.category != "mixed"].dropna().assign(long_speed = view.distance / view.long)) + \
    pn.geom_point(pn.aes(x = "long_speed", y = "additional_time_per_turn", color = "swimstyle")) + \
    pn.facet_wrap(facets = "category") + \
    pn.labs(
        x = "Record speed on long course (m/s)", 
        y = "Average additional time for each turn (s)",
        title = "How long does it take to turn around in swimming?")
    
fig.draw();

png

We can see that a turn doesn’t really take time, it’s actually quicker than swimming. The difference between long and short course seems larger for backstroke. The turn in the butterfly style isn’t too quick, perhaps because it’s a tricky technique, or because the butterfly is already quite quick? On longer distances, the advantage of the turns disappears a bit.

I’ll consider this question answered, allthough there is much more to be said. The most important critique that I can come up with here is that it’s perhaps not the best idea to answer this question using records. These are weird cases, and with more data, old records or just data from many swimming tournaments, we could produce a much more reliable estimate.

7. Give an estimate of how much time a start adds to adds to swimming

Now onto the next question; how long does a start take? This is tricky because we cannot really isolate the starts from the data, since each record has exactly one start. You could say 100m equals 2 x 50m without the extra start, but the swimming speeds are also different. I think I’d need some physical model to estimate speed from this data, and then use that in the equation.

Interesting. Hopefully I can do that some other time. For now, I’ll just use the simplified idea.

view = pd.concat([events.assign(doubled = "real"), events.assign(time = events.time * 2, distance = events.distance * 2, doubled = "hypothetical")])
view = view[["distance", "swimstyle", "doubled", "category", "course", "time"]].copy()
view = view.pivot_table(index=["distance", "swimstyle", "category", "course"], values="time", columns="doubled").dropna().reset_index()
view["difference"] = view.real - view.hypothetical
view.sample(5)
doubled distance swimstyle category course hypothetical real difference
42 400 individual medley women long 252.24 266.36 14.12
51 800 freestyle relay men short 366.60 409.04 42.44
27 200 butterfly women short 109.22 119.61 10.39
24 200 butterfly men long 99.64 111.51 11.87
26 200 butterfly women long 110.96 121.81 10.85
q = (view.category != "mixed") & ~view.swimstyle.str.contains("relay")

fig = pn.ggplot(view[q].dropna()) + \
    pn.geom_jitter(
        pn.aes(x = "factor(distance)", y = "difference", color = "swimstyle", shape = "course"),
        width = .1, height = 0) + \
    pn.facet_wrap(facets = "category") + \
    pn.labs(
        x = "Distance", 
        y = "Difference (s)",
        title = "Time difference between record and twice that record on half the distance")
    
fig.draw();

png

Well, the start doesn’t save more than a couple of seconds, but it’s hard to say more. These difference are mainly fatigue of the swimmer.

I do see some other interesting patterns. It seems the record for freestyle 400m, for men, on a long course, is not as sharp as it could be. Also the male records on 100m butterfly are quite fast compared to the 200m records.

8. Conclusion

The data is cleaned. We have an idea of how long a turn takes, or rather, how much time it saves. I haven’t come up with an easy method to isolate the starting times. The clean data can be found on GitHub if you want to give this a try, or perhaps you want to have a look at one of these other open questions:

  • Why are the records posted mainly in June, July, November and December?
  • How does the time saved with a turn differ across swimmers, for example world records vs. professionals vs. amateurs?
  • How can we make a model that predicts the next record to be set?