This is a fast way to yield a subset of rows from multiple Pandas dataframes or Series, when one needs to work on a sliding window basis over a predefined minimum and maximum number of rows. This approach is among the fastest available and is based on the .iloc accessor of both series and dataframes.

def rolling_expanding_window(seq, n_min, n_max):
    """
    Emits the elements over a rolling or expanding window of an iterable sequence
    Parameters
    ----------
    seq
    n_min
    n_max

    Returns
    -------

    """
    it = iter(range(len(seq)))  # makes it iterable
    # roll it forward at least warmup steps
    win = deque((next(it, None) for _ in range(n_min)), maxlen=n_max)
    # yield win
    for e in it:
        win.append(e)
        yield win


def sliding_dataframes(*arrays, n_min=0, n_max=None):
    """
    Like scikit-learn train_test_split but with rolling-expanding window
    Parameters
    ----------
    *arrays: one or more pandas dataframes or series that we want to slice over in parallel
    n_min: minimum window size
    n_max: maximum window size
    Returns
    -------
    """

    n_dataframes = len(arrays)
    if n_dataframes == 0:
        raise ValueError("At least one array required as input")

    n_samples = arrays[0].shape[0]
    for df in arrays:
        if not isinstance(df, (pd.DataFrame, pd.Series)) and df is not None:
            raise TypeError("This method only supports pandas dataframes and series")
        if df is not None and df.shape[0] != n_samples:
            raise ValueError("Specify equal length dataframes or series")

    # the first dataframe is the one dictating
    indices = rolling_expanding_window(arrays[0], n_min=n_min, n_max=n_max)
    for index in indices:
        yield list(
            chain.from_iterable(
                (a.iloc[index] if a is not None else None,) for a in arrays
            )
        )

Further reading

Read more in the tech topic.

Let's talk!

I'm Carlo Nicolini — I am interested on the reliability of AI reasoning systems (interpretability, inference-time methods, probabilistic language programming) and on quantitative portfolio optimization (I am a maintainer of skfolio). If you're working on something in these areas and think we might collaborate, chat, discuss, I'm happy to talk about it!

The best way to reach me is on via DM on LinkedIn.