piso logo

Date: May 04, 2022 Version: 0.9.0

Useful links: Source Repository | Issues & Ideas

Pandas Interval Set Operations: methods for set operations, analytics, lookups and joins on pandas’ Interval, IntervalArray and IntervalIndex

powered_by_staircase

Contents#

Getting Started#

Installation#

piso can be installed from PyPI or Anaconda.

To install the latest version from PyPI:

python -m pip install piso

To install the latest version through conda-forge:

conda install -c conda-forge piso

Package overview#

piso exists to bring set operations (union, intersection, difference + more), analytical methods, and lookup and join functionality to pandas interval classes, specifically

Currently, there is a lack of such functionality in pandas, although it has been earmarked for development. Until this eventuates, piso aims to fill the void. Many of the methods can be used via accessors, which can be registered to pandas.arrays.IntervalArray and pandas.IntervalIndex classes.

An array of intervals can be interpreted in two different ways. It can be seen as a container for intervals, which are sets, or if the intervals are disjoint it may be seen as a set itself. Both interpretations are supported by the methods introduced by piso.

The domain of the intervals can be either numerical, pandas.Timestamp or pandas.Timedelta. Currently, most of the set operaitons in piso are limited to intervals which:

  • have a non-zero length

  • have a finite, length

  • are left-closed right-open, or right-closed left-open

To check if these restrictions apply to a particular method, please consult the API reference.

Several case studies using piso can be found in the user guide. Further examples, and a detailed explanation of functionality, are provided in the API reference.

Versioning#

SemVer is used by piso for versioning releases. For versions available, see the tags on this repository.

License#

This project is licensed under the MIT License:

Copyright © 2021 <Riley Clement>

Permission is hereby granted, free of charge, to any person obtaining a copy of this
software and associated documentation files (the “Software”), to deal in the Software
without restriction, including without limitation the rights to use, copy, modify,
merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be included in all copies
or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE
FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.

Acknowledgments#

Currently, piso is a pure-python implentation which relies heavily on staircase and pandas. It is clearly designed to operate as part of the pandas ecosystem. The colours for the piso logo have been assimilated from pandas as a homage, and is not to intended to imply and affiliation with, or endorsement by, pandas.

Additionally, two classes have been borrowed, almost verbatim, from the pandas source code:

  • pandas.util._decorators.Appender

  • pandas.core.accessor.CachedAccessor

User Guide#

Intervals and sets#

Below is a brief discussion on the mathematical definition of intervals and sets, and how they relate to piso - aside from making up half of the acronym!

Sets#

A set is a collection of elements where each element, or member, of the set is unique, i.e. a set does not contain duplicated elements. There may be an infinite number of members, such as the set of positive integers, or it could be empty. There is both standard notation and standard operations for sets.

For example:

  • \(A = \{1, 2, 3\}\) is a set containing three numbers

  • \(B = \{3, 4\}\) is a set containing two numbers

The union of these sets, \(A \cup B\), is the set containing all elements in \(A\) and \(B\). That is, \(A \cup B = \{1, 2, 3, 4\}\).

The intersection of these sets, \(A \cap B\), is the set containing elements both in \(A\) and \(B\). That is, \(A \cap B = \{3\}\).

The difference of \(A\) and \(B\) = \(A \setminus B\), is the set containing all elements in \(A\) but not in \(B\). That is, \(A \setminus B = \{1, 2\}\).

The symmetric difference of these sets, \(A \Delta B\), is the set containing all elements in \(A\) and in \(B\), but not in both. That is, \(A \Delta B = \{1, 2, 4\}\). Symmetric difference of sets is equivalent to the difference between the union and the intersection.

Python is no stranger to set operations, with the set class being a built-in data structure (for sets with a finite number of elements) with the following methods:

Set operations, for sets with a finite number of elements, also exist for pandas.Index:

To continue a gentle introduction to sets, please refer to math is fun.

Intervals#

An interval is a set of (real) numbers that contains all real numbers lying between any two numbers (endpoints). The notion of an interval can be applied to other domains in which a “total ordering” exists, such as time-related domains modelled by pandas.Timestamp and pandas.Timedelta.

Intervals are classified as being open, closed, or half-open. An open interval does not include its endpoints. For example, the set of numbers between 0 and 1 (but not including 0 and 1) is an open interval. In set notation it is written as \(\{x | 0 < x < 1\}\). In interval notation it is written as \((0,1)\). A closed interval contains both of its endpoints, while a half-closed interval contains only one of its endpoints. The notation is as follows:

  • \([0,1] = \{x | 0 \leq x \leq 1\}\) (closed)

  • \([0,1) = \{x | 0 \leq x < 1\}\) (left-closed)

  • \((0,1] = \{x | 0 < x \leq 1\}\) (right-closed)

The length of an interval is defined by subtracting the smaller end point from the larger. Intervals could have an infinite length, such as the set of numbers greater than zero, or they could have zero length such as the set containing a single number (known as a degenerate interval).

The definitions of set operations, outlined above, follow through to intervals, however the result of set operations with intervals may not be an interval - but it will be a set! For example,

  • \([0,2] \cup [1,3] = \{x | 1 \leq x \leq 3\} = [0,3]\)

  • \([0,2] \cap [1,3] = \{x | 1 \leq x \leq 2\} = [1,2]\)

  • \([0,2] \setminus [1,3] = \{x | 0 \leq x < 1\} = [0,1)\)

  • \([0,2] \Delta [1,3] = \{x | 0 \leq x < 1 \text{ or } 2 < x \leq 3\} = [0,1) \cup (2,3]\)

The result in the last of these examples above cannot be expressed as an interval. It can however be expressed as the union of two disjoint (non-overlapping) intervals. Modelling intervals in pandas is facilitated through pandas.Interval, while representing the union of disjoint intervals can be achieved through an interval array such as pandas.arrays.IntervalArray or pandas.IntervalIndex. The intervals contained in one of these interval arrays do not have to be disjoint, so with respect to set operations an interval array can be interpreted in one of two ways:

  1. a collection of intervals, which become operands in a set operation, or

  2. a set itself, formed by the union of disjoint intervals, and used as an operand in a set operation.

An example of a) is applying an intersection operation to a interval array containing the intervals

\([0, 5), [4, 6), [7, 9), [8, 12)\)

which results in an in interval array containing the intervals

\([4, 5), [8, 9)\)

An example of b) is applying an intersection operation to two interval arrays (of disjoint intervals)

\([0, 5), [7, 9) \hspace{6 mm} \text{and} \hspace{6 mm} [2, 3), [6, 8)\)

which results in an in interval array containing

\([0, 2), [3, 5), [8, 9)\)

Both of these interpretations are supported by methods in piso. The methods will switch interpretations depending on the number of interval array arguments supplied to the methods. Note that if a interval array is used as an operand (as shown in the example for b) above) then any overlapping intervals will be merged to create a set of disjoint intervals before the operation is applied.

It is important to note that piso does not support all types of intervals. Specifically, intervals must have a length which is non-zero and finite. It must be either left-closed, or right-closed. Any pandas.Interval, pandas.IntervalIndex and pandas.array.IntervalArray arguments supplied to piso methods must have the same value for their closed attribute (either “left” or “right”).

For code examples involving piso set operations please see the small case study or examples provided in the API reference.

The piso accessors#

Applying set operations to interval array objects: pandas.arrays.Interval and pandas.IntervalIndex can be done with top-level functions:

The exact same functionality is available through accessors on these classes. Using accessors allows us to extend the functionality associated with these classes, without adding the methods directly. Before the accessors can be used they must be registered:

In [1]: import piso

In [2]: piso.register_accessors()

Registering the accessors adds a piso property to the classes which can be used like so:

In [3]: arr = pd.arrays.IntervalArray.from_tuples(
   ...:     [(1,5), (3,6), (2,4)]
   ...: )
   ...: 

In [4]: arr.piso.intersection()
Out[4]: 
<IntervalArray>
[(3, 4]]
Length: 1, dtype: interval[int64, right]

Further examples using the piso accessors can be found in the API reference.

Case studies#

Finding common gaps in daily calendars

This case study introduces the use of piso for set operations such as piso.intersection() and piso.union() and applies it to an example where personal calendars are represented by interval arrays.

Verifying a maintenance schedule

This case study introduces the use of piso for analysis with functions that return scalars, such as piso.issuperset() and piso.coverage(). In this example maintenance schedules and windows of opportunity are represented by interval arrays.

Estimating tax payable

This case study demonstrates the use of piso.lookup() where tax brackets are represented by a pandas.DataFrame, indexed by a pandas.IntervalIndex. The tax payable for an array of income values is calculated by efficiently finding the corresponding tax brackets.

Analysis of scores in a football match

This case study introduces the idea of joins using pandas.IntervalIndex. Using piso.join() a dataframe is constructed, indexed by intervals for unique score combinations in the 2009 Chelsea vs Liverpool Champions League quarter-final.

Finding common gaps in daily calendars#

In this example we are given the following scenario:

Alice, Bob, and Carol are busy people, after all they appear in their fair share of computing examples. They wish to meet on the 5th of October 2021 to discuss their next appearance. Given calendar data which details their scheduled meetings, find the gaps during business hours (9am to 5pm) that they can meet.

We start by importing pandas and piso

In [1]: import pandas as pd

In [2]: import piso

Running the piso.register_accessors() function will add “piso” accessors to pandas.IntervalIndex and pandas.arrays.IntervalArray. Using accessors allows us to extend the functionality associated with these classes, without adding the methods directly.

In [3]: piso.register_accessors()

Next we load the data from a csv file and store it into a pandas.DataFrame. Each row of the dataframe corresponds to a meeting that has been booked for the 5th of October, and is characterised by the attendee, a start time and an end time.

In [4]: data = pd.read_csv("./data/calendar.csv", parse_dates=["start", "end"])

In [5]: data
Out[5]: 
     name               start                 end
0   Alice 2021-10-05 09:00:00 2021-10-05 10:00:00
1   Alice 2021-10-05 10:00:00 2021-10-05 11:00:00
2   Alice 2021-10-05 11:30:00 2021-10-05 12:30:00
3   Alice 2021-10-05 14:30:00 2021-10-05 15:30:00
4     Bob 2021-10-05 10:00:00 2021-10-05 10:30:00
5     Bob 2021-10-05 10:30:00 2021-10-05 11:30:00
6     Bob 2021-10-05 11:00:00 2021-10-05 11:30:00
7     Bob 2021-10-05 16:00:00 2021-10-05 17:00:00
8   Carol 2021-10-05 09:30:00 2021-10-05 10:00:00
9   Carol 2021-10-05 09:30:00 2021-10-05 10:30:00
10  Carol 2021-10-05 12:30:00 2021-10-05 13:30:00
11  Carol 2021-10-05 16:30:00 2021-10-05 17:30:00

This data is reasonably readable however to work with piso we need it in interval arrays. The following code creates a pandas.Series, indexed by the names Alice, Bob and Carol, and where the values are instances of pandas.arrays.IntervalArray and hold the data corresponding to each person.

In [6]: meetings = (
   ...:     data
   ...:     .groupby("name")
   ...:     .apply(
   ...:         lambda df: pd.arrays.IntervalArray.from_arrays(
   ...:             df["start"],
   ...:             df["end"],
   ...:             closed="left",
   ...:         ),
   ...:     )
   ...: )
   ...: 

In [7]: meetings
Out[7]: 
name
Alice    [[2021-10-05 09:00:00, 2021-10-05 10:00:00), [...
Bob      [[2021-10-05 10:00:00, 2021-10-05 10:30:00), [...
Carol    [[2021-10-05 09:30:00, 2021-10-05 10:00:00), [...
dtype: object

We define a method print_intervals which is designed to make it easy for us to display interval array data. It prints a heading, then each interval in an array.

In [8]: def print_intervals(header, interval_array):
   ...:     print(header)
   ...:     print("-------------")
   ...:     for interval in interval_array:
   ...:         print(interval)
   ...:     print()
   ...: 

Let’s see this method in action by printing the meeting times for each person.

In [9]: for person in ("Alice", "Bob", "Carol"):
   ...:     print_intervals(person, meetings[person])
   ...: 
Alice
-------------
[2021-10-05 09:00:00, 2021-10-05 10:00:00)
[2021-10-05 10:00:00, 2021-10-05 11:00:00)
[2021-10-05 11:30:00, 2021-10-05 12:30:00)
[2021-10-05 14:30:00, 2021-10-05 15:30:00)

Bob
-------------
[2021-10-05 10:00:00, 2021-10-05 10:30:00)
[2021-10-05 10:30:00, 2021-10-05 11:30:00)
[2021-10-05 11:00:00, 2021-10-05 11:30:00)
[2021-10-05 16:00:00, 2021-10-05 17:00:00)

Carol
-------------
[2021-10-05 09:30:00, 2021-10-05 10:00:00)
[2021-10-05 09:30:00, 2021-10-05 10:30:00)
[2021-10-05 12:30:00, 2021-10-05 13:30:00)
[2021-10-05 16:30:00, 2021-10-05 17:30:00)

Where are the overlaps in meetings for each person?

You may notice that there seems to be some overlaps in the individual calendars for each person. Who amongst us can say they’ve never double booked?

We will examine these overlaps using piso.intersection() - but we will use it via the piso accessor. We will not supply any additional array arguments, so the sets are those intervals belonging to the IntervalArray. The min_overlaps parameter value of 2 indicates that we are looking for overlaps between two or more intervals. If we do not specify this parameter then the default behaviour is to find regions where every interval in the interval array overlaps (there are no such cases in this data).

In [10]: print("******** Individual Meeting Clashes ********\n")
******** Individual Meeting Clashes ********


In [11]: for person in ("Alice", "Bob", "Carol"):
   ....:     print_intervals(
   ....:         person,
   ....:         meetings[person].piso.intersection(min_overlaps=2),
   ....:     )
   ....: 
Alice
-------------

Bob
-------------
[2021-10-05 11:00:00, 2021-10-05 11:30:00)

Carol
-------------
[2021-10-05 09:30:00, 2021-10-05 10:00:00)

As you can see Bob and Carol each have an interval of time where they have meeting clashes.

What are the busy times for each person?

Let’s not worry about the meeting clashes, they are irrelevant for finding the schedule gaps shared by Alice, Bob and Carol. We can simplify the “busy” times in each calendar with the piso.union() method (via the piso accessor).

In [12]: print("*************** Busy periods ***************\n")
*************** Busy periods ***************


In [13]: for person in ("Alice", "Bob", "Carol"):
   ....:     print_intervals(
   ....:         person,
   ....:         meetings[person].piso.union(),
   ....:     )
   ....: 
Alice
-------------
[2021-10-05 09:00:00, 2021-10-05 11:00:00)
[2021-10-05 11:30:00, 2021-10-05 12:30:00)
[2021-10-05 14:30:00, 2021-10-05 15:30:00)

Bob
-------------
[2021-10-05 10:00:00, 2021-10-05 11:30:00)
[2021-10-05 16:00:00, 2021-10-05 17:00:00)

Carol
-------------
[2021-10-05 09:30:00, 2021-10-05 10:30:00)
[2021-10-05 12:30:00, 2021-10-05 13:30:00)
[2021-10-05 16:30:00, 2021-10-05 17:30:00)

So these are the disjoint intervals, in each person’s calendar, in which they are busy. But we are interested in the complement of these intervals. That is, the times (between 9am and 5pm) that each person is free.

Where are the gaps in the schedule for each person?

We’ll create an interval array holding a single pandas.Interval which represents the business day. For each person we can use the piso.difference() method (via the piso accessor), to remove the busy intervals from the business day interval. We do this using pandas.Series.map() and a lambda function but there are more verbose ways to perform this calculation. The result will be a pandas.Series called gaps which is indexed by the names, and whose values are interval arrays containing the “free” intervals in each person’s calendar.

In [14]: business_day = pd.arrays.IntervalArray.from_breaks(
   ....:     [pd.Timestamp("2021-10-5 9:00"), pd.Timestamp("2021-10-5 17:00")],
   ....:     closed="left",
   ....: )
   ....: 

In [15]: gaps = meetings.map(lambda ia: business_day.piso.difference(ia))

In [16]: print("************* Gaps in schedule *************\n")
************* Gaps in schedule *************


In [17]: for person in ("Alice", "Bob", "Carol"):
   ....:     print_intervals(person, gaps[person])
   ....: 
Alice
-------------
[2021-10-05 11:00:00, 2021-10-05 11:30:00)
[2021-10-05 12:30:00, 2021-10-05 14:30:00)
[2021-10-05 15:30:00, 2021-10-05 17:00:00)

Bob
-------------
[2021-10-05 09:00:00, 2021-10-05 10:00:00)
[2021-10-05 11:30:00, 2021-10-05 16:00:00)

Carol
-------------
[2021-10-05 09:00:00, 2021-10-05 09:30:00)
[2021-10-05 10:30:00, 2021-10-05 12:30:00)
[2021-10-05 13:30:00, 2021-10-05 16:30:00)

Where can we schedule a meeting between Alice, Bob and Carol?

All that remains to do is find the intersection between the interval array of gaps calculated above. We do this with piso.intersection(), but we will provide it with multiple pandas.arrays.IntervalArray operands, which indicates that each IntervalArray is interpreted as a set (as opposed to the intervals contained within.). We use python’s “*” unpack operator to transform the values of the gaps series - which is a numpy array of pandas.arrays.IntervalArray - into the arguments in the method call.

In [18]: print_intervals(
   ....:     "Potential meetings times",
   ....:     piso.intersection(*gaps.values)
   ....: )
   ....: 
Potential meetings times
-------------
[2021-10-05 13:30:00, 2021-10-05 14:30:00)
[2021-10-05 15:30:00, 2021-10-05 16:00:00)

So there we have it. There is a one-hour opportunity at 1:30pm and a half-hour opportunity at 3:30pm.

This has not been an exhaustive illustration of the functions in piso. There are many methods and parameters which have not been demonstrated above, but hopefully it has whet your appetite. For details of all the full functionality offered by piso refer to the API reference.

Verifying a maintenance schedule#

In this example we are given the following scenario:

There are three identical assets X, Y and Z which require periodic maintenance. No more than one asset should be under maintenance at any time, in order to handle the workload. Futhermore any maintenance should occur within windows of opportunity which represent when maintenance will be least disruptive. Given a proposed schedule for 2021, verify these rules are respected, and analyse time usage.

We start by importing pandas and piso

In [1]: import pandas as pd

In [2]: import piso

Running the piso.register_accessors() function will add “piso” accessors to pandas.IntervalIndex and pandas.arrays.IntervalArray. Using accessors allows us to extend the functionality associated with these classes, without adding the methods directly.

In [3]: piso.register_accessors()

Next we load the data from a csv file and store it into a pandas.DataFrame. Each row of the dataframe corresponds to an interval of maintenance for a particular asset.

In [4]: data = pd.read_csv("./data/asset_maintenance.csv", parse_dates=["start", "end"], dayfirst=True)

In [5]: data
Out[5]: 
   asset               start                 end
0      X 2021-12-14 00:00:00 2021-12-17 07:00:00
1      X 2021-01-31 23:00:00 2021-02-01 00:00:00
2      X 2021-03-15 05:00:00 2021-03-21 17:00:00
3      X 2021-09-07 13:00:00 2021-09-13 22:00:00
4      X 2021-05-02 00:00:00 2021-05-07 10:00:00
5      X 2021-08-03 14:00:00 2021-08-05 00:00:00
6      Y 2021-11-14 00:00:00 2021-11-16 22:00:00
7      Y 2021-09-01 00:00:00 2021-09-02 13:00:00
8      Y 2021-06-23 05:00:00 2021-06-30 11:00:00
9      Y 2021-01-18 19:00:00 2021-01-27 08:00:00
10     Y 2021-05-28 00:00:00 2021-05-31 16:00:00
11     Y 2021-03-23 05:00:00 2021-03-25 00:00:00
12     Z 2021-07-06 21:00:00 2021-07-10 00:00:00
13     Z 2021-04-11 07:00:00 2021-04-18 05:00:00
14     Z 2021-10-03 00:00:00 2021-10-06 14:00:00
15     Z 2021-02-25 00:00:00 2021-02-26 22:00:00
16     Z 2021-11-21 09:00:00 2021-11-26 00:00:00
17     Z 2021-06-03 06:00:00 2021-06-04 00:00:00

To work with piso we need the data in interval arrays. The following code creates a pandas.Series, indexed by the assets X, Y and Z, where the values are instances of pandas.arrays.IntervalArray.

In [6]: maintenance = (
   ...:     data
   ...:     .groupby("asset")
   ...:     .apply(
   ...:         lambda df: pd.arrays.IntervalArray.from_arrays(
   ...:             df["start"],
   ...:             df["end"],
   ...:             closed="left",
   ...:         ),
   ...:     )
   ...: )
   ...: 

In [7]: maintenance
Out[7]: 
asset
X    [[2021-12-14, 2021-12-17 07:00:00), [2021-01-3...
Y    [[2021-11-14, 2021-11-16 22:00:00), [2021-09-0...
Z    [[2021-07-06 21:00:00, 2021-07-10), [2021-04-1...
dtype: object

Checking that no more than one asset is under maintenance at any time is equivalent to checking that the sets corresponding to each interval array are disjoint. This is as simple as the following code, where we unpack the values of the maintenance Series as arguments to piso.isdisjoint().

In [8]: piso.isdisjoint(*maintenance.values)
Out[8]: True

The windows in which maintenance is preferred is described by the following data

In [9]: window_df = pd.read_csv(
   ...:     "./data/maintenance_windows.csv",
   ...:     parse_dates=["start", "end"],
   ...:     dayfirst=True,
   ...: )
   ...: 

In [10]: window_df
Out[10]: 
        start        end
0  2021-01-18 2021-02-01
1  2021-02-25 2021-03-05
2  2021-03-15 2021-03-25
3  2021-04-10 2021-04-20
4  2021-05-02 2021-05-09
5  2021-05-28 2021-06-04
6  2021-06-20 2021-07-10
7  2021-08-01 2021-08-05
8  2021-09-01 2021-09-14
9  2021-10-03 2021-10-08
10 2021-11-14 2021-11-26
11 2021-12-14 2021-12-22

As before, we transform this to an interval array

In [11]: windows = pd.arrays.IntervalArray.from_arrays(
   ....:     window_df["start"],
   ....:     window_df["end"],
   ....:     closed="left",
   ....: )
   ....: 

In [12]: windows
Out[12]: 
<IntervalArray>
[[2021-01-18, 2021-02-01), [2021-02-25, 2021-03-05), [2021-03-15, 2021-03-25), [2021-04-10, 2021-04-20), [2021-05-02, 2021-05-09) ... [2021-08-01, 2021-08-05), [2021-09-01, 2021-09-14), [2021-10-03, 2021-10-08), [2021-11-14, 2021-11-26), [2021-12-14, 2021-12-22)]
Length: 12, dtype: interval[datetime64[ns], left]

Checking that the maintenance occurs within the preferred windows can be done by checking that the set corresponding to the windows interval array is a superset of each of the sets corresponding to the asset interval arrays. Instead of doing this for each asset we can check against the union of these sets.

In [13]: combined_maintenance = piso.union(*maintenance.values)

In [14]: windows.piso.issuperset(combined_maintenance, squeeze=True)
Out[14]: True

Now let’s answer some questions using piso, specifically piso.coverage() and its accessor counterpart.

What fraction of the year 2021 constitutes maintenance window opportunities?

In [15]: windows.piso.coverage(pd.Interval(pd.Timestamp("2021"), pd.Timestamp("2022")))
Out[15]: 0.3232876712328767

How many days in each month in 2021 constitute maintenance window opportunities?

For this we’ll create a pandas.IntervalIndex for the months, then construct a pandas.Series with a monthly pandas.PeriodIndex.

In [16]: months = pd.IntervalIndex.from_breaks(pd.date_range("2021", "2022", freq="MS"))

In [17]: pd.Series(
   ....:     [windows.piso.coverage(month)*month.length for month in months],
   ....:     index = months.left.to_period()
   ....: )
   ....: 
Out[17]: 
2021-01   14 days
2021-02    4 days
2021-03   14 days
2021-04   10 days
2021-05   11 days
2021-06   14 days
2021-07    9 days
2021-08    4 days
2021-09   13 days
2021-10    5 days
2021-11   12 days
2021-12    8 days
Freq: M, dtype: timedelta64[ns]

What fraction of the time in window opportunities is utilised by the combined maintenance?

In [18]: combined_maintenance.piso.coverage(windows)
Out[18]: 0.5903954802259888

What fraction of the combined maintenance is occupied by each asset

In [19]: maintenance.apply(piso.coverage, domain=combined_maintenance)
Out[19]: 
asset
X    0.330742
Y    0.369019
Z    0.300239
dtype: float64

Estimating tax payable#

In this example we are given the following scenario:

Personal tax (before deductions) in Australia is based on the table below. The tax payable at the end of the financial year depends on the individual’s income. The higher the income, the higher the tax rate, as defined by tax brackets (or tiers). Given a list of incomes, calculate the corresponding tax payable for each income.


Income Thresholds

Rate

Tax payable

$0 - $18,200

0%

Nil

$18,200 - $45,000

19%

19c for each $1 over $18,200

$45,000 - $120,000

32.5%

$5,092 plus 32.5c for each $1 over $45,000

$120,000 - $180,000

37%

$29,467 plus 37c for each $1 over $120,000

$180,000 and over

45%

$51,667 plus 45c for each $1 over $180,000


We start by importing pandas, numpy and piso, and creating an interval index for the tax brackets.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import piso

In [4]: tax_brackets = pd.IntervalIndex.from_breaks(
   ...:     [0,18200,45000,120000,180000,np.inf],
   ...:     closed="left",
   ...: )
   ...: 

In [5]: tax_brackets
Out[5]: IntervalIndex([[0.0, 18200.0), [18200.0, 45000.0), [45000.0, 120000.0), [120000.0, 180000.0), [180000.0, inf)], dtype='interval[float64, left]')

With each interval in the tax bracket, we’ll associate three values:

  1. the lower threshold for the tax bracket

  2. the fixed amount payable

  3. the tax rate for each dollar above the threshold (as a fraction)

We describe this data as a pandas.DataFrame indexed by tax_brackets.

In [6]: tax_rates = pd.DataFrame(
   ...:     {
   ...:         "threshold":tax_brackets.left,
   ...:         "fixed":[0, 0, 5092, 29467, 51667],
   ...:         "rate":[0, 0.19, 0.325, 0.37, 0.45],
   ...:     },
   ...:     index = tax_brackets,
   ...: )
   ...: 

In [7]: tax_rates
Out[7]: 
                      threshold  fixed   rate
[0.0, 18200.0)              0.0      0  0.000
[18200.0, 45000.0)      18200.0      0  0.190
[45000.0, 120000.0)     45000.0   5092  0.325
[120000.0, 180000.0)   120000.0  29467  0.370
[180000.0, inf)        180000.0  51667  0.450

For the income, we’ll generate some random integers (and plot the distribution) corresponding to 100,000 individuals.

In [8]: income = pd.Series(np.random.beta(5,50, size=100000)*1e6).astype(int)

In [9]: income.plot.hist(bins=20);
_images/case_study_tax_income_dist.png

We are now in a position to use piso.lookup(), which take two parameters:

  1. a pandas.DataFrame or pandas.Series which is indexed by a pandas.IntervalIndex

  2. the values which are will be compared to the interval index

In [10]: tax_params = piso.lookup(tax_rates, income)

In [11]: tax_params
Out[11]: 
        threshold  fixed   rate
116326    45000.0   5092  0.325
55283     45000.0   5092  0.325
94683     45000.0   5092  0.325
82674     45000.0   5092  0.325
117607    45000.0   5092  0.325
...           ...    ...    ...
93226     45000.0   5092  0.325
43870     18200.0      0  0.190
62870     45000.0   5092  0.325
123482   120000.0  29467  0.370
202414   180000.0  51667  0.450

[100000 rows x 3 columns]

The result is a dataframe, indexed by the values of income, sharing the same columns as tax_rates.

We can then use a vectorised calculation for the tax payable:

In [12]: tax_params["fixed"] + (tax_params.index-tax_params["threshold"])*tax_params["rate"]
Out[12]: 
116326    28272.950
55283      8433.975
94683     21238.975
82674     17336.050
117607    28689.275
            ...    
93226     20765.450
43870      4877.300
62870     10899.750
123482    30755.340
202414    61753.300
Length: 100000, dtype: float64
Alternative approaches#

There are a couple of alternative, straightforward solutions which do not require piso which we detail below.

Alternative 1: pandas.cut

The tax_params dataframe that was produced above by piso.lookup() can be reproduced using pandas.cut() which can be used to assign bins to data with an interval index.

In [13]: tax_params = tax_rates.loc[pd.cut(income, tax_brackets)].set_index(income)

In [14]: tax_params
Out[14]: 
        threshold  fixed   rate
116326    45000.0   5092  0.325
55283     45000.0   5092  0.325
94683     45000.0   5092  0.325
82674     45000.0   5092  0.325
117607    45000.0   5092  0.325
...           ...    ...    ...
93226     45000.0   5092  0.325
43870     18200.0      0  0.190
62870     45000.0   5092  0.325
123482   120000.0  29467  0.370
202414   180000.0  51667  0.450

[100000 rows x 3 columns]

This approach however runs approximately 20 times slower than piso.lookup().

Alternative 2: applying function

The second approach involves writing a function which takes a single value (an income for an individual) and returns the tax payable. The function can then used with pandas.Series.apply

In [15]: def calc_tax(value):
   ....:     if value <= 18200:
   ....:         tax = 0
   ....:     elif value <= 45000:
   ....:         tax = (value-18200)*0.19
   ....:     elif value <= 120000:
   ....:         tax = 5092 + (value-45000)*0.325
   ....:     elif value <= 180000:
   ....:         tax = 29467 + (value-120000)*0.37
   ....:     else:
   ....:         tax = 51667 + (value-180000)*0.45
   ....:     return tax
   ....: 

In [16]: income.apply(calc_tax)
Out[16]: 
0        28272.950
1         8433.975
2        21238.975
3        17336.050
4        28689.275
           ...    
99995    20765.450
99996     4877.300
99997    10899.750
99998    30755.340
99999    61753.300
Length: 100000, dtype: float64

This approach runs approximately 3 times slower than piso.lookup(). It also requires a function to be defined which is relatively cumbersome to implement. This approach becomes increasingly unattractive, and error prone, as the number of tax brackets increases.

Analysis of scores in a football match#

In this example we will look at a football match from 2009:

The Champions League quarter-final between Chelsea and Liverpool in 2009 is recognised as among the best games of all time. Liverpool scored twice in the first half in the 19th and 28th minute. Chelsea then opened their account in the second half with three unanswered goals in the 51st, 57th and 76th minute. Liverpool responded with two goals in the 81st and 83rd minute to put themselves ahead, however Chelsea drew with a goal in the 89th minute and advanced to the next stage on aggregate.

We start by importing pandas and piso

In [1]: import pandas as pd

In [2]: import piso

For the analysis we will create a pandas.Series, indexed by a pandas.IntervalIndex for each team. The values of each series will be the team’s score and the interval index, defined by pandas.Timedelta, will describe the durations corresponding to each score. We define the following function which creates such a Series, given the minute marks for each score.

In [3]: def make_series(goal_time_mins):
   ...:     breaks = pd.to_timedelta([0] + goal_time_mins + [90], unit="min")
   ...:     ii = pd.IntervalIndex.from_breaks(breaks)
   ...:     return pd.Series(range(len(ii)), index = ii, name="score")
   ...: 

We can now create each Series.

In [4]: chelsea = make_series([51,57,76,89])

In [5]: liverpool = make_series([19,28,81,83])

For reference, the Series corresponding to chelsea is

In [6]: chelsea
Out[6]: 
(0 days 00:00:00, 0 days 00:51:00]    0
(0 days 00:51:00, 0 days 00:57:00]    1
(0 days 00:57:00, 0 days 01:16:00]    2
(0 days 01:16:00, 0 days 01:29:00]    3
(0 days 01:29:00, 0 days 01:30:00]    4
Name: score, dtype: int64

To enable analysis for separate halves of the game we’ll define a similar Series which defines the time intervals for each half

In [7]: halves = pd.Series(
   ...:     ["1st", "2nd"],
   ...:     pd.IntervalIndex.from_breaks(pd.to_timedelta([0, 45, 90], unit="min")),
   ...:     name="half",
   ...: )
   ...: 

In [8]: halves
Out[8]: 
(0 days 00:00:00, 0 days 00:45:00]    1st
(0 days 00:45:00, 0 days 01:30:00]    2nd
Name: half, dtype: object

We can now perform a join on these three Series. Since chelsea and liverpool Series have the same name it will be necessary to provide suffixes to differentiate the columns in the result. The halves Series does not have the same name, but a suffix must be defined for each of the join operands if there are any overlaps.

In [9]: CvsL = piso.join(chelsea, liverpool, halves, suffixes=["_chelsea", "_liverpool", ""])

In [10]: CvsL
Out[10]: 
                                    score_chelsea  score_liverpool half
(0 days 00:00:00, 0 days 00:19:00]              0                0  1st
(0 days 00:19:00, 0 days 00:28:00]              0                1  1st
(0 days 00:28:00, 0 days 00:45:00]              0                2  1st
(0 days 00:45:00, 0 days 00:51:00]              0                2  2nd
(0 days 00:51:00, 0 days 00:57:00]              1                2  2nd
(0 days 00:57:00, 0 days 01:16:00]              2                2  2nd
(0 days 01:16:00, 0 days 01:21:00]              3                2  2nd
(0 days 01:21:00, 0 days 01:23:00]              3                3  2nd
(0 days 01:23:00, 0 days 01:29:00]              3                4  2nd
(0 days 01:29:00, 0 days 01:30:00]              4                4  2nd

By default, the piso.join() function performs a left-join. Since every interval index represents the same domain, that is (0’, 90’], all join types - left, right, inner, outer - will give the same result.

Using this dataframe we will now provide answers for miscellaneous questions. In particular we will filter the dataframe based on values in the columns, then sum the lengths of the intervals in the filtered index.

How much game time did Chelsea lead for?

In [11]: CvsL.query("score_chelsea > score_liverpool").index.length.sum()
Out[11]: Timedelta('0 days 00:05:00')

How much game time did Liverpool lead for?

In [12]: CvsL.query("score_liverpool > score_chelsea").index.length.sum()
Out[12]: Timedelta('0 days 00:44:00')

How much game time were the teams tied for?

In [13]: CvsL.query("score_liverpool == score_chelsea").index.length.sum()
Out[13]: Timedelta('0 days 00:41:00')

How much game time in the first half were the teams tied for?

In [14]: CvsL.query("score_chelsea == score_liverpool and half == '1st'").index.length.sum()
Out[14]: Timedelta('0 days 00:19:00')

For how long did Liverpool lead Chelsea by exactly one goal (split by half)?

In [15]: CvsL.groupby("half").apply(
   ....:     lambda df: df.query("score_liverpool - score_chelsea == 1").index.length.sum()
   ....: )
   ....: 
Out[15]: 
half
1st   0 days 00:09:00
2nd   0 days 00:12:00
dtype: timedelta64[ns]

What was the score at the 80 minute mark?

In [16]: piso.lookup(CvsL, pd.Timedelta(80, unit="min"))
Out[16]: 
                 score_chelsea  score_liverpool half
0 days 01:20:00              3                2  2nd

This analysis is also straightforward using staircase. For more information on this please see the corresponding example with staircase

Frequently asked questions#

API reference#

This page gives an overview of all public piso functionality. Classes and functions exposed in the piso.* and piso.interval.* namespaces are public. Other top-level modules should be considered private until specified otherwise.

Top level functions#

register_accessors()

When called this function will register the "piso" ArrayAccessor on pandas.IntervalIndex and pandas.arrays.IntervalArray.

union(interval_array, *interval_arrays[, ...])

Performs a set union operation.

intersection(interval_array, *interval_arrays)

Performs a set intersection operation.

difference(interval_array, *interval_arrays)

Performs a set difference operation.

symmetric_difference(interval_array, ...[, ...])

Performs a set symmetric difference operation.

isdisjoint(interval_array, *interval_arrays)

Indicates whether one, or more, sets are disjoint or not.

issuperset(interval_array, *interval_arrays)

Indicates whether a set is a superset of one, or more, other sets.

issubset(interval_array, *interval_arrays[, ...])

Indicates whether a set is a subset of one, or more, other sets.

coverage(interval_array[, domain, bins, how])

Calculates the fraction of a domain (or possibly multiple domains) covered by a collection of intervals.

complement(interval_array[, domain])

Calculates the complement of a collection of intervals (in an array) over some domain.

contains(interval_array, x[, include_index, ...])

Evaluates the intersection of a set of intervals with a set of points.

split(interval_array, x)

Given a set of intervals, and break points, splits the intervals into pieces wherever the overlap a break point.

bridge(interval_array, threshold)

Given a set of intervals, and a threshold, merges intervals which are separated by a gap less than or equal to the threshold.

lookup(frame_or_series, x)

Given a pandas.DataFrame, or pandas.Series, indexed by a pandas.IntervalIndex, finds the intervals which contain each point in an array and returns the associated rows/elements.

join(*frames_or_series[, how, suffixes, sort])

Joins multiple dataframes or series by their pandas.IntervalIndex.

adjacency_matrix(interval_array, ...[, ...])

Returns a 2D array (or dataframe) of boolean values indicating edges between nodes in a graph.

Accessors#

ArrayAccessor.union(*interval_arrays[, ...])

Performs a set union operation.

ArrayAccessor.intersection(*interval_arrays)

Performs a set intersection operation.

ArrayAccessor.difference(*interval_arrays[, ...])

Performs a set difference operation.

ArrayAccessor.symmetric_difference(...[, ...])

Performs a set symmetric difference operation.

ArrayAccessor.isdisjoint(*interval_arrays)

Indicates whether one, or more, sets are disjoint or not.

ArrayAccessor.issuperset(*interval_arrays[, ...])

Indicates whether a set is a superset of one, or more, other sets.

ArrayAccessor.issubset(*interval_arrays[, ...])

Indicates whether a set is a subset of one, or more, other sets.

ArrayAccessor.coverage([domain, bins, how])

Calculates the size of a domain (or possibly multiple domains) covered by a collection of intervals.

ArrayAccessor.complement([domain])

Calculates the complement of a collection of intervals (in an array) over some domain.

ArrayAccessor.contains(x[, include_index, ...])

Evaluates the intersection of a set of intervals with a set of points.

ArrayAccessor.split(x)

Given a set of intervals, and break points, splits the intervals into pieces wherever the overlap a break point.

ArrayAccessor.bridge(threshold)

Given a set of intervals, and a threshold, merges intervals which are separated by a gap less than or equal to the threshold.

ArrayAccessor.adjacency_matrix(*interval_arrays)

Returns a 2D array (or dataframe) of boolean values indicating edges between nodes in a graph.

Interval#

union(interval1, interval2[, squeeze])

Performs the union of two pandas.Interval

intersection(interval1, interval2[, squeeze])

Performs the intersection of two pandas.Interval

difference(interval1, interval2[, squeeze])

Performs the set difference of two pandas.Interval

symmetric_difference(interval1, interval2[, ...])

Performs the symmetric difference of two pandas.Interval

issuperset(interval, *intervals[, squeeze])

Indicates whether one pandas.Interval is a superset of one, or more, others.

issubset(interval, *intervals[, squeeze])

Indicates whether one pandas.Interval is a subset of one, or more, others.

Release notes#

ADD UNRELEASED CHANGES ABOVE THIS LINE

v0.9.0 2022-05-04

Added the following methods

v0.8.0 2022-01-29

v0.7.0 2021-11-20

Added the following methods

Removed the following methods

v0.6.0 2021-11-05

The following methods were extended to accommodate intervals with closed = “both” or “neither”

v0.5.0 2021-11-02

Added the following methods

Performance improvements for

v0.4.0 2021-10-30

Added the following methods

v0.3.0 2021-10-23

Added the following methods

v0.2.0 2021-10-15

Added the following methods

v0.1.0 2021-10-10

The following methods are included in the initial release of piso

Indices and tables#