Theory and notation
===================

This page defines the notation used throughout ``xyz`` and summarizes the
mathematical quantities estimated by the library.

Why this matters
----------------

The same high-level quantity, such as transfer entropy, can be estimated under
very different assumptions:

- a linear-Gaussian model,
- a nonparametric nearest-neighbor model,
- a fixed-radius kernel approximation,
- or a discrete/binned empirical distribution.

The estimators in ``xyz`` are therefore best understood as different numerical
approximations to the same information-theoretic functionals.

Notation
--------

Let :math:`Y_t` be the target process at time :math:`t`. Its embedded past is
written as

.. math::

   Y_t^- = \left[Y_{t-\tau}, Y_{t-2\tau}, \ldots, Y_{t-d\tau}\right],

where :math:`d` is the embedding dimension and :math:`\tau` is the embedding
spacing. Likewise:

- :math:`X_t^-` denotes the embedded past of a driver process,
- :math:`Z_t^-` denotes the embedded past of one or more conditioning processes,
- :math:`u` denotes an interaction delay between source and target when a
  delay-specific TE estimator is used.

In ``xyz``, these state vectors are assembled by the delay-embedding helpers in
``xyz.preprocessing``.
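
As a minimal sketch of this notation, the embedded past can be assembled with
NumPy as shown below. The function name ``embed_past`` is hypothetical and is
not the actual ``xyz.preprocessing`` API; it only illustrates the index
convention of :math:`Y_t^-`.

.. code-block:: python

   import numpy as np

   def embed_past(x, d, tau):
       """Return one row [x[t-tau], x[t-2*tau], ..., x[t-d*tau]] per valid t.

       Hypothetical helper for illustration; not part of xyz.
       """
       x = np.asarray(x)
       t = np.arange(d * tau, len(x))      # first t with a complete past
       lags = tau * np.arange(1, d + 1)    # tau, 2*tau, ..., d*tau
       return x[t[:, None] - lags[None, :]]

   series = np.arange(10)
   past = embed_past(series, d=3, tau=2)
   # The first row corresponds to t = 6 and holds x[4], x[2], x[0].

Each row is ordered from the most recent lag to the oldest, and the first
:math:`d\tau` samples are dropped because their past is incomplete.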

Units
-----

Unless otherwise stated, logarithms are natural and values are reported in **nats**:

.. math::

   1\ \text{nat} = \log_2(e)\ \text{bits} \approx 1.4427\ \text{bits}.
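
The conversion can be written as two small helpers. The names
``nats_to_bits`` and ``bits_to_nats`` are illustrative and are not assumed to
exist in ``xyz``.

.. code-block:: python

   import math

   def nats_to_bits(value_nats):
       """Convert an information quantity from nats to bits."""
       return value_nats * math.log2(math.e)

   def bits_to_nats(value_bits):
       """Convert an information quantity from bits to nats."""
       return value_bits * math.log(2.0)

   one_nat_in_bits = nats_to_bits(1.0)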

Core information quantities
---------------------------

Entropy
^^^^^^^

For a continuous random vector :math:`Y \in \mathbb{R}^d`,

.. math::

   H(Y) = - \int p(y)\log p(y)\,dy.

If :math:`Y` is Gaussian with covariance matrix :math:`\Sigma`,

.. math::

   H(Y)
   = \frac{1}{2}\log\!\left((2\pi e)^d \det(\Sigma)\right).

This is the quantity estimated by :class:`xyz.MVNEntropy`.
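
The closed form above can be evaluated directly from a sample covariance
matrix. The following is a sketch under the Gaussian assumption, with a
hypothetical function name; it is not the implementation of
:class:`xyz.MVNEntropy`.

.. code-block:: python

   import numpy as np

   def gaussian_entropy(samples):
       """Plug the sample covariance into the Gaussian entropy formula (nats).

       Hypothetical helper; `samples` has shape (n_samples, d) or (n_samples,).
       """
       samples = np.asarray(samples, dtype=float)
       if samples.ndim == 1:
           samples = samples[:, None]
       d = samples.shape[1]
       cov = np.atleast_2d(np.cov(samples, rowvar=False))
       # slogdet is numerically safer than log(det(cov)) for larger d.
       sign, logdet = np.linalg.slogdet(cov)
       if sign <= 0:
           raise ValueError("sample covariance is not positive definite")
       return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

   rng = np.random.default_rng(0)
   y = rng.normal(scale=2.0, size=(5000, 1))
   h_est = gaussian_entropy(y)
   h_true = 0.5 * np.log(2.0 * np.pi * np.e * 4.0)   # variance is 2**2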

Conditional entropy
^^^^^^^^^^^^^^^^^^^

For two random variables :math:`X` and :math:`Y`,

.. math::

   H(Y \mid X) = H(X, Y) - H(X).

In a regression-based Gaussian setting, this can be expressed via the
covariance of the residual process:

.. math::

   H(Y \mid X)
   = \frac{1}{2}\log\!\left((2\pi e)^d \det(\Sigma_{\varepsilon})\right),

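where :math:`\varepsilon` denotes the residuals from regressing :math:`Y` on
:math:`X`, :math:`\Sigma_{\varepsilon}` is their covariance matrix, and
:math:`d` is the dimension of :math:`Y`.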
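
A minimal sketch of this regression-based form, assuming ordinary least
squares with an intercept. The function name is hypothetical and the code is
not taken from ``xyz``.

.. code-block:: python

   import numpy as np

   def gaussian_conditional_entropy(y, x):
       """Estimate H(Y | X) in nats from OLS residuals (Gaussian assumption)."""
       y = np.asarray(y, dtype=float)
       x = np.asarray(x, dtype=float)
       if y.ndim == 1:
           y = y[:, None]
       if x.ndim == 1:
           x = x[:, None]
       design = np.column_stack([np.ones(len(x)), x])   # intercept + regressors
       coef, *_ = np.linalg.lstsq(design, y, rcond=None)
       resid = y - design @ coef
       d = y.shape[1]
       cov = np.atleast_2d(np.cov(resid, rowvar=False))
       _, logdet = np.linalg.slogdet(cov)
       return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

   rng = np.random.default_rng(1)
   x = rng.normal(size=4000)
   y = 0.8 * x + rng.normal(scale=0.5, size=4000)
   h_cond = gaussian_conditional_entropy(y, x)
   h_expected = 0.5 * np.log(2.0 * np.pi * np.e * 0.25)   # residual variance 0.5**2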

Mutual information
^^^^^^^^^^^^^^^^^^

Mutual information measures statistical dependence:

.. math::

   I(X;Y) = H(X) + H(Y) - H(X,Y)
          = H(Y) - H(Y \mid X).

It is symmetric in :math:`X` and :math:`Y` and nonnegative for the true
distributions. In finite samples, nonparametric estimators can produce small
negative values because of estimation bias and variance.
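
As a worked example, take jointly Gaussian scalars :math:`X` and :math:`Y`
with variances :math:`\sigma_X^2`, :math:`\sigma_Y^2` and correlation
:math:`\rho` (symbols introduced here for illustration). Substituting the
Gaussian entropy formula into the definition gives

.. math::

   I(X;Y)
   = \frac{1}{2}\log\!\left(2\pi e\,\sigma_X^2\right)
     + \frac{1}{2}\log\!\left(2\pi e\,\sigma_Y^2\right)
     - \frac{1}{2}\log\!\left((2\pi e)^2 \sigma_X^2 \sigma_Y^2 (1-\rho^2)\right)
   = -\frac{1}{2}\log\!\left(1-\rho^2\right).

For :math:`\rho = 0.5` this evaluates to approximately :math:`0.144` nats.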

Conditional mutual information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Conditional mutual information measures dependence that remains after adjusting
for a third variable:

.. math::

   I(X;Y \mid Z)
   = H(X \mid Z) + H(Y \mid Z) - H(X,Y \mid Z).

This is the core building block of transfer entropy.
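
Expanding each conditional entropy with
:math:`H(A \mid Z) = H(A, Z) - H(Z)` yields an equivalent form in joint
entropies only, which is convenient when only joint-entropy estimates are
available:

.. math::

   I(X;Y \mid Z)
   = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).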

Transfer entropy
^^^^^^^^^^^^^^^^

Bivariate transfer entropy from :math:`X` to :math:`Y` quantifies predictive
information flow from the past of :math:`X` to the present of :math:`Y` beyond
the information already contained in the past of :math:`Y`:

.. math::

   TE_{X \to Y}
   = I(X_t^-; Y_t \mid Y_t^-)
   = H(Y_t \mid Y_t^-) - H(Y_t \mid Y_t^-, X_t^-).

If a separate interaction delay :math:`u` is used, the source state can be
written more explicitly as

.. math::

   X_{t,u}^- = \left[X_{t-u}, X_{t-u-\tau}, \ldots, X_{t-u-(d-1)\tau}\right].

Keeping the interaction delay :math:`u` separate from the embedding spacing
:math:`\tau` is important in TRENTOOL-style delay reconstruction.
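
Under the linear-Gaussian assumption with a scalar target, the entropy
difference in the definition of :math:`TE_{X \to Y}` reduces to half the log
ratio of two residual variances. The sketch below illustrates this with a
one-sample past (:math:`d = 1`, :math:`\tau = 1`) and no separate interaction
delay. All function names are hypothetical, and the code is not the ``xyz``
implementation.

.. code-block:: python

   import numpy as np

   def residual_variance(target, predictors):
       """Variance of OLS residuals of `target` on `predictors` plus intercept."""
       design = np.column_stack([np.ones(len(target)), predictors])
       coef, *_ = np.linalg.lstsq(design, target, rcond=None)
       return np.var(target - design @ coef)

   def gaussian_te(x, y):
       """Linear-Gaussian TE from x to y in nats, using one lag of each series."""
       y_now, y_past, x_past = y[1:], y[:-1], x[:-1]
       var_restricted = residual_variance(y_now, y_past)
       var_full = residual_variance(y_now, np.column_stack([y_past, x_past]))
       return 0.5 * np.log(var_restricted / var_full)

   # Simulated example: x drives y with a one-sample delay, not vice versa.
   rng = np.random.default_rng(2)
   n = 5000
   x = rng.normal(size=n)
   noise = rng.normal(scale=0.5, size=n)
   y = np.zeros(n)
   for t in range(1, n):
       y[t] = 0.5 * y[t - 1] + 0.6 * x[t - 1] + noise[t]

   te_xy = gaussian_te(x, y)   # clearly positive in this simulation
   te_yx = gaussian_te(y, x)   # close to zero

The asymmetry between ``te_xy`` and ``te_yx`` is what makes transfer entropy a
directed measure, in contrast to mutual information.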

Partial transfer entropy
^^^^^^^^^^^^^^^^^^^^^^^^

Partial transfer entropy adjusts for additional confounding processes
:math:`Z_t^-`:

.. math::

   PTE_{X \to Y \mid Z}
   = I(X_t^-; Y_t \mid Y_t^-, Z_t^-)
   = H(Y_t \mid Y_t^-, Z_t^-)
     - H(Y_t \mid Y_t^-, X_t^-, Z_t^-).

This is the natural choice when the apparent effect of :math:`X` on :math:`Y`
could be mediated or confounded by known controls.
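
The effect of conditioning can be illustrated with a simulated common driver.
The sketch below again assumes the linear-Gaussian setting with one lag per
process. The helper names are hypothetical and do not reflect the ``xyz`` API.

.. code-block:: python

   import numpy as np

   def residual_variance(target, predictors):
       """Variance of OLS residuals of `target` on `predictors` plus intercept."""
       design = np.column_stack([np.ones(len(target)), predictors])
       coef, *_ = np.linalg.lstsq(design, target, rcond=None)
       return np.var(target - design @ coef)

   def gaussian_pte(x, y, z=None):
       """Linear-Gaussian (partial) TE from x to y in nats, one lag per series.

       With z=None this is plain bivariate TE; otherwise z's past is added
       to the conditioning set of both regressions.
       """
       y_now, y_past, x_past = y[1:], y[:-1], x[:-1]
       controls = [y_past] if z is None else [y_past, z[:-1]]
       var_restricted = residual_variance(y_now, np.column_stack(controls))
       var_full = residual_variance(y_now, np.column_stack(controls + [x_past]))
       return 0.5 * np.log(var_restricted / var_full)

   # z drives x instantaneously and y with a one-sample delay;
   # there is no direct coupling from x to y.
   rng = np.random.default_rng(3)
   n = 5000
   z = rng.normal(size=n)
   x = z + 0.3 * rng.normal(size=n)
   y = np.concatenate([[0.0], z[:-1]]) + 0.3 * rng.normal(size=n)

   te_xy = gaussian_pte(x, y)        # spuriously large
   pte_xy = gaussian_pte(x, y, z)    # close to zero once z is controlled

In this simulation the bivariate estimate attributes the influence of
:math:`Z` to :math:`X`, while the partial estimate removes it.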

Self-entropy / information storage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``xyz`` uses the term *self-entropy* for information storage:

.. math::

   SE_Y = I(Y_t; Y_t^-)
        = H(Y_t) - H(Y_t \mid Y_t^-).

This quantifies how much of the present of a process is predictable from its
own past.
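
As a worked example, consider a stationary Gaussian AR(1) process
:math:`Y_t = a\,Y_{t-1} + \varepsilon_t` with :math:`|a| < 1`,
:math:`\varepsilon_t \sim \mathcal{N}(0, \sigma^2)`, and a one-sample past
(:math:`d = 1`, :math:`\tau = 1`). These modeling choices are assumptions made
for illustration. The stationary variance is :math:`\sigma^2 / (1 - a^2)` and
the residual variance is :math:`\sigma^2`, so

.. math::

   SE_Y
   = \frac{1}{2}\log\!\left(2\pi e\,\frac{\sigma^2}{1-a^2}\right)
     - \frac{1}{2}\log\!\left(2\pi e\,\sigma^2\right)
   = -\frac{1}{2}\log\!\left(1-a^2\right).

For :math:`a = 0.8` this gives approximately :math:`0.511` nats.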

Estimator families in ``xyz``
-----------------------------

Gaussian / linear
^^^^^^^^^^^^^^^^^

These estimators assume the relevant distributions are well approximated by
linear regressions with Gaussian residuals. They are fast, interpretable, and
provide analytical F-tests for TE, PTE, and self-entropy.

KSG / nearest-neighbor
^^^^^^^^^^^^^^^^^^^^^^

These estimators are nonparametric and approximate entropies from
nearest-neighbor distances. They are more flexible than Gaussian estimators,
especially for nonlinear dependence, but require more data and more careful
parameter tuning.

Kernel / fixed-radius
^^^^^^^^^^^^^^^^^^^^^

These estimators replace the fixed-:math:`k` neighborhood of KSG with a fixed
radius :math:`r`. They are intuitive and useful for sensitivity analysis, but
their performance can change substantially with the chosen radius.

Discrete / binning
^^^^^^^^^^^^^^^^^^

These estimators quantize the data and estimate probabilities from empirical
frequencies. They are especially appropriate for symbolic or truly discrete
state spaces, but can become sparse in high-dimensional embeddings.
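
A minimal plug-in sketch for already-discrete data, using only the standard
library. The function names are illustrative and are not assumed to match the
``xyz`` API; no bias correction is applied.

.. code-block:: python

   import math
   from collections import Counter

   def plugin_entropy(symbols):
       """Plug-in entropy in nats from empirical symbol frequencies."""
       counts = Counter(symbols)
       n = sum(counts.values())
       return -sum((c / n) * math.log(c / n) for c in counts.values())

   def plugin_mi(xs, ys):
       """Plug-in mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) in nats."""
       joint = list(zip(xs, ys))
       return plugin_entropy(xs) + plugin_entropy(ys) - plugin_entropy(joint)

   xs = [0, 0, 1, 1]
   mi_copy = plugin_mi(xs, xs)                  # y is a copy of x
   mi_indep = plugin_mi(xs, [0, 1, 0, 1])       # all four pairs equally likely

The joint entropy is computed over tuples, so the number of distinct states
grows quickly with the embedding dimension. This is the sparsity problem noted
above.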

How to choose an estimator family
---------------------------------

.. list-table::
   :header-rows: 1

   * - Family
     - Best when
     - Main strengths
     - Main risks
   * - Gaussian
     - Dynamics are approximately linear and homoscedastic
     - Fast, stable, interpretable, analytical significance
     - Misses nonlinear structure
   * - KSG
     - Nonlinear dependence is plausible and sample size is adequate
     - Flexible, widely used, closest to TRENTOOL-style continuous TE
     - Higher variance, more tuning, more expensive
   * - Kernel
     - A local geometric neighborhood view is desirable
     - Simple radius interpretation, useful for robustness sweeps
     - Highly sensitive to ``r``
   * - Discrete
     - Data are symbolic, categorical, or deliberately quantized
     - Conceptually simple, easy to interpret
     - Binning bias and state-space sparsity

ITS / TSTOOL / TRENTOOL alignment
---------------------------------

The continuous nearest-neighbor estimators in ``xyz`` follow the same broad
strategy as ITS/TSTOOL/TRENTOOL:

1. find a neighborhood in the highest-dimensional joint space,
2. project that neighborhood into lower-dimensional marginal spaces,
3. use projected counts to estimate entropy differences with reduced bias.

For the TE/PTE/SE parity tests, ``xyz`` excludes self-matches in the projected
count stage, mirroring the ITS ``range_search(..., past=0)`` behavior.
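
The three steps can be sketched for the simplest case, mutual information
between two variables, using the first Kraskov-Stögbauer-Grassberger
estimator. This is a brute-force illustration with hypothetical function
names, intended for small samples only. It is not the ``xyz`` or ITS
implementation, and the neighbor-counting conventions of those codes may
differ in detail.

.. code-block:: python

   import numpy as np

   def digamma_int(n):
       """Digamma at positive integers: psi(n) = -gamma + sum_{j<n} 1/j."""
       n = np.asarray(n)
       harmonic = np.concatenate(
           [[0.0], np.cumsum(1.0 / np.arange(1, n.max() + 1))]
       )
       return -np.euler_gamma + harmonic[n - 1]

   def ksg_mi(x, y, k=4):
       """KSG (algorithm 1) estimate of I(X;Y) in nats, O(n^2) memory."""
       x = np.asarray(x, dtype=float).reshape(len(x), -1)
       y = np.asarray(y, dtype=float).reshape(len(y), -1)
       n = len(x)
       # Max-norm pairwise distances in each marginal space.
       dx = np.abs(x[:, None, :] - x[None, :, :]).max(axis=2)
       dy = np.abs(y[:, None, :] - y[None, :, :]).max(axis=2)
       # Step 1: neighborhood radius = distance to the k-th joint neighbor.
       dz = np.maximum(dx, dy)
       np.fill_diagonal(dz, np.inf)            # a point is not its own neighbor
       eps = np.sort(dz, axis=1)[:, k - 1]
       # Step 2: project the radius into each marginal space and count
       # strictly closer points, excluding self-matches.
       np.fill_diagonal(dx, np.inf)
       np.fill_diagonal(dy, np.inf)
       nx = (dx < eps[:, None]).sum(axis=1)
       ny = (dy < eps[:, None]).sum(axis=1)
       # Step 3: combine the projected counts.
       return (digamma_int(k) + digamma_int(n)
               - np.mean(digamma_int(nx + 1) + digamma_int(ny + 1)))

   rng = np.random.default_rng(4)
   x = rng.normal(size=500)
   mi_dep = ksg_mi(x, x + 0.5 * rng.normal(size=500))   # dependent pair
   mi_ind = ksg_mi(x, rng.normal(size=500))             # independent pair

Using one joint-space radius for all marginal counts is what lets the
individual entropy biases partially cancel in the difference.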

The TRENTOOL workflow then layers additional methodology on top of those core
estimators: ACT-aware trial selection, Ragwitz embedding search, interaction
delay reconstruction, surrogate testing, and group-level harmonization. Those
workflow components are the bridge between low-level estimator parity and a
full causal-analysis pipeline.