Analysis of scores in a football match#

In this example we will look at a football match from 2009:

The Champions League quarter-final between Chelsea and Liverpool in 2009 is recognised as among the best games of all time. Liverpool scored twice in the first half in the 19th and 28th minute. Chelsea then opened their account in the second half with three unanswered goals in the 51st, 57th and 76th minute. Liverpool responded with two goals in the 81st and 83rd minute to put themselves ahead, however Chelsea drew with a goal in the 89th minute and advanced to the next stage on aggregate.

We start by importing pandas and piso

In [1]: import pandas as pd

In [2]: import piso

For the analysis we will create a pandas.Series, indexed by a pandas.IntervalIndex for each team. The values of each series will be the team’s score and the interval index, defined by pandas.Timedelta, will describe the durations corresponding to each score. We define the following function which creates such a Series, given the minute marks for each score.

In [3]: def make_series(goal_time_mins):
   ...:     breaks = pd.to_timedelta([0] + goal_time_mins + [90], unit="min")
   ...:     ii = pd.IntervalIndex.from_breaks(breaks)
   ...:     return pd.Series(range(len(ii)), index = ii, name="score")
   ...: 

We can now create each Series.

In [4]: chelsea = make_series([51,57,76,89])

In [5]: liverpool = make_series([19,28,81,83])

For reference, the Series corresponding to chelsea is

In [6]: chelsea
Out[6]: 
(0 days 00:00:00, 0 days 00:51:00]    0
(0 days 00:51:00, 0 days 00:57:00]    1
(0 days 00:57:00, 0 days 01:16:00]    2
(0 days 01:16:00, 0 days 01:29:00]    3
(0 days 01:29:00, 0 days 01:30:00]    4
Name: score, dtype: int64

To enable analysis for separate halves of the game we’ll define a similar Series which defines the time intervals for each half

In [7]: halves = pd.Series(
   ...:     ["1st", "2nd"],
   ...:     pd.IntervalIndex.from_breaks(pd.to_timedelta([0, 45, 90], unit="min")),
   ...:     name="half",
   ...: )
   ...: 

In [8]: halves
Out[8]: 
(0 days 00:00:00, 0 days 00:45:00]    1st
(0 days 00:45:00, 0 days 01:30:00]    2nd
Name: half, dtype: object

We can now perform a join on these three Series. Since chelsea and liverpool Series have the same name it will be necessary to provide suffixes to differentiate the columns in the result. The halves Series does not have the same name, but a suffix must be defined for each of the join operands if there are any overlaps.

In [9]: CvsL = piso.join(chelsea, liverpool, halves, suffixes=["_chelsea", "_liverpool", ""])

In [10]: CvsL
Out[10]: 
                                    score_chelsea  score_liverpool half
(0 days 00:00:00, 0 days 00:19:00]              0                0  1st
(0 days 00:19:00, 0 days 00:28:00]              0                1  1st
(0 days 00:28:00, 0 days 00:45:00]              0                2  1st
(0 days 00:45:00, 0 days 00:51:00]              0                2  2nd
(0 days 00:51:00, 0 days 00:57:00]              1                2  2nd
(0 days 00:57:00, 0 days 01:16:00]              2                2  2nd
(0 days 01:16:00, 0 days 01:21:00]              3                2  2nd
(0 days 01:21:00, 0 days 01:23:00]              3                3  2nd
(0 days 01:23:00, 0 days 01:29:00]              3                4  2nd
(0 days 01:29:00, 0 days 01:30:00]              4                4  2nd

By default, the piso.join() function performs a left-join. Since every interval index represents the same domain, that is (0’, 90’], all join types - left, right, inner, outer - will give the same result.

Using this dataframe we will now provide answers for miscellaneous questions. In particular we will filter the dataframe based on values in the columns, then sum the lengths of the intervals in the filtered index.

How much game time did Chelsea lead for?

In [11]: CvsL.query("score_chelsea > score_liverpool").index.length.sum()
Out[11]: Timedelta('0 days 00:05:00')

How much game time did Liverpool lead for?

In [12]: CvsL.query("score_liverpool > score_chelsea").index.length.sum()
Out[12]: Timedelta('0 days 00:44:00')

How much game time were the teams tied for?

In [13]: CvsL.query("score_liverpool == score_chelsea").index.length.sum()
Out[13]: Timedelta('0 days 00:41:00')

How much game time in the first half were the teams tied for?

In [14]: CvsL.query("score_chelsea == score_liverpool and half == '1st'").index.length.sum()
Out[14]: Timedelta('0 days 00:19:00')

For how long did Liverpool lead Chelsea by exactly one goal (split by half)?

In [15]: CvsL.groupby("half").apply(
   ....:     lambda df: df.query("score_liverpool - score_chelsea == 1").index.length.sum()
   ....: )
   ....: 
Out[15]: 
half
1st   0 days 00:09:00
2nd   0 days 00:12:00
dtype: timedelta64[ns]

What was the score at the 80 minute mark?

In [16]: piso.lookup(CvsL, pd.Timedelta(80, unit="min"))
Out[16]: 
                 score_chelsea  score_liverpool half
0 days 01:20:00              3                2  2nd

This analysis is also straightforward using staircase. For more information on this please see the corresponding example with staircase