t-Distributed Stochastic Neighbor Embedding (t-SNE)
===================================================

`t-Distributed Stochastic Neighbor Embedding <https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html>`_ (t-SNE) is a non-linear dimensionality reduction technique that projects high-dimensional data into a lower-dimensional space—typically two or three dimensions—while preserving local structure. Unlike Principal Component Analysis (PCA), t-SNE focuses on keeping similar points close together in the embedding, making it particularly effective at revealing clusters and groupings that may not be apparent in the original data.

For example, t-SNE can be applied to multi-element geochemical datasets to visually separate rock types or alteration zones. Samples sharing similar geochemical signatures will appear as tight groups in the embedding, helping geologists identify compositional domains or hydrothermal overprints that span multiple properties simultaneously.

Interface
---------

General parameters
^^^^^^^^^^^^^^^^^^

The general parameters controlling the **t-SNE** application are shown in the :ref:`figure below <tsne_ui_general>`.

.. image:: ../../images/tsne/ui_general.png
    :name: tsne_ui_general
    :align: center

The options are described as follows:

Input
~~~~~

* **Object**: The object containing the data to be reduced in dimensionality.
* **Data**: The data properties to use. Users can select as many ``Float`` data properties as desired. Only rows with valid values across all selected properties are included in the computation.

Output
~~~~~~

* **Data group name**: The name of the output group that will contain the t-SNE components.
* **Prefix**: The prefix added to each component name in the output group. Components are named ``[prefix]_1``, ``[prefix]_2``, etc. For example, if the prefix is set to ``TSNE`` and the number of components is set to ``2``, two output components named ``TSNE_1`` and ``TSNE_2`` will be created.
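
The naming rule can be sketched in a few lines of Python. The helper ``component_names`` is hypothetical and not part of the application; it only illustrates the ``[prefix]_i`` pattern described above.

```python
def component_names(prefix, n_components):
    """Return output names following the [prefix]_i pattern (1-based)."""
    if n_components < 1:
        raise ValueError("n_components must be a positive integer")
    return [f"{prefix}_{i}" for i in range(1, n_components + 1)]


names = component_names("TSNE", 2)
print(names)
```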

Advanced parameters
^^^^^^^^^^^^^^^^^^^

Advanced controls on the `Scikit-Learn t-SNE <https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html>`_ algorithm are available as advanced parameters, as shown in the :ref:`figure below <tsne_ui_optional>`.

.. image:: ../../images/tsne/ui_optional.png
    :name: tsne_ui_optional
    :align: center

Pre-processing
~~~~~~~~~~~~~~

* **Standardize** (*default = True*): If checked, each data property is standardized to zero mean and unit variance before running t-SNE. Standardization is strongly recommended because t-SNE is sensitive to feature scale; properties measured on very different scales can otherwise dominate the distance calculations. Standardization is automatically skipped when **Metric** is set to ``precomputed``.
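
As a rough illustration of what zero-mean, unit-variance standardization does, the sketch below scales each column with the Python standard library. The function name and the sample values are hypothetical, and the use of the population standard deviation is an assumption rather than a confirmed detail of the application.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Scale each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Constant columns have zero spread; divide by 1.0 so they stay at zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical values: the first column is in ppm, the second in percent.
rows = [[1200.0, 0.5], [800.0, 1.5], [1000.0, 1.0]]
scaled = standardize_columns(rows)
```

After scaling, both columns span the same range, so neither dominates the pairwise distances.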

Data points containing any properties with *no-data* values are excluded from the t-SNE computation. The output array restores these rows as *NaN* so that the result aligns with the original object.
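
The exclude-then-restore behaviour can be sketched as follows. This is a minimal illustration that assumes no-data values are represented as *NaN*; the helper names are hypothetical, and the embedding itself is replaced by a stand-in list.

```python
import math


def split_valid_rows(rows):
    """Return (indices, rows) for rows that contain no NaN values."""
    valid_idx = [
        i for i, row in enumerate(rows)
        if not any(math.isnan(v) for v in row)
    ]
    return valid_idx, [rows[i] for i in valid_idx]


def restore_rows(embedded, valid_idx, n_rows, n_components):
    """Place embedded rows back at their original positions, NaN elsewhere."""
    full = [[math.nan] * n_components for _ in range(n_rows)]
    for idx, coords in zip(valid_idx, embedded):
        full[idx] = list(coords)
    return full


rows = [[1.0, 2.0], [math.nan, 3.0], [4.0, 5.0]]
valid_idx, valid_rows = split_valid_rows(rows)

# Stand-in for the embedding step: one output row per valid input row.
embedded = [[0.1, 0.2], [0.3, 0.4]]
output = restore_rows(embedded, valid_idx, len(rows), n_components=2)
```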

t-SNE hyperparameters
~~~~~~~~~~~~~~~~~~~~~

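* **Number of components** (*default = 2*): The number of dimensions to reduce the data to. Must be an integer greater than 0. Set to ``2`` for a 2-D scatter plot or ``3`` for a 3-D embedding. Note that scikit-learn's ``barnes_hut`` **Method** supports at most 3 components; use ``exact`` for higher values.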

* **Perplexity** (*default = 30*): Balances attention between local and global structure. Low values focus on tight local clusters; high values capture broader patterns. The typical range is 5–50, and larger datasets generally benefit from higher values. Perplexity must be less than the number of valid data points (rows without any no-data values); otherwise t-SNE cannot run, and you must either reduce the perplexity or select more data.

* **Early exaggeration** (*default = 12*): Amplifies distances between clusters during the initial phase of optimization, helping the algorithm separate groups before refining their layout. Higher values push clusters further apart in the final embedding.

* **Learning rate** (*optional, default = auto*): Step size for the gradient descent optimization. If left unchecked, the learning rate is set automatically based on the sample size and early exaggeration factor. If set manually, values between 10 and 1000 are typical; too high a value causes points to spread into a uniform ball, while too low a value compresses most points into a dense blob.

* **Max iterations** (*default = 1000*): Maximum number of iterations for the optimization. A minimum of 250 is recommended.

* **Iterations without progress** (*default = 300*): Stops optimization early if no measurable improvement is made for this many iterations (checked every 50 steps). Only applies after the initial 250 iterations.

* **Minimum gradient norm** (*default = 1e-7*): Convergence threshold on the gradient norm. Optimization stops when gradients fall below this value, indicating the embedding has stabilized.

* **Metric** (*default = 'euclidean'*): The distance metric used to calculate pairwise distances between data points. Available options:

  * ``euclidean``: Straight-line distance between points in feature space (default).
  * ``manhattan``: Sum of absolute differences along each dimension.
  * ``cosine``: Similarity based on the angle between two feature vectors.
  * ``precomputed``: Treats the input data as a pre-computed *n × n* pairwise distance matrix. The data must be square, **Initialization** must be set to ``random``, and standardization is automatically skipped.

* **Metric parameters** (*optional*): Additional keyword arguments for the metric function, supplied as a JSON-formatted dictionary (e.g., ``{"p": 3}``). Leave unchecked if not required.

* **Initialization** (*default = 'pca'*): How the embedding is initialized before optimization begins:

  * ``pca``: Uses the leading principal components of the data, one per output component, as the starting point. Provides a globally stable initialization and is recommended for most cases.
  * ``random``: Randomly initializes the embedding. Required when **Metric** is set to ``precomputed``.

* **Method** (*default = 'barnes_hut'*): The gradient calculation algorithm:

  * ``barnes_hut``: Runs in O(N log N) time by grouping distant points—suitable for large datasets.
  * ``exact``: Computes all pairwise interactions in O(N²) time, giving higher accuracy at the cost of speed. Use only for small datasets where precision matters.

* **Angle** (*default = 0.5*): Only used with ``method='barnes_hut'``. Sets how far away a group of points must be before it is approximated as a single summary point. Smaller values are more accurate but slower. Values in the range 0.2–0.8 offer a good speed/accuracy trade-off.

* **Verbosity** (*default = 0*): Verbosity level for logging during optimization.

* **Random state** (*optional*): Seed for the random number generator. Set to an integer for reproducible results. Used when **Initialization** is ``random`` or **Method** is ``barnes_hut``.

* **Number of jobs** (*optional*): The number of parallel jobs to use for the nearest-neighbours search. Set to ``-1`` to use all available processors. Leave unchecked to use a single job.
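
Three of the rules above lend themselves to a short sketch: the perplexity limit, the automatic learning rate, and the JSON format of the metric parameters. The function names are hypothetical. The learning-rate formula is the one documented for scikit-learn's ``learning_rate='auto'`` option; whether the application applies any further checks is not covered here.

```python
import json


def check_perplexity(perplexity, n_valid_rows):
    """Raise if perplexity is not smaller than the number of valid rows."""
    if perplexity >= n_valid_rows:
        raise ValueError(
            f"perplexity ({perplexity}) must be less than the number of "
            f"valid rows ({n_valid_rows})"
        )


def auto_learning_rate(n_samples, early_exaggeration=12.0):
    """Documented 'auto' rule: max(N / early_exaggeration / 4, 50)."""
    return max(n_samples / early_exaggeration / 4.0, 50.0)


def parse_metric_params(text):
    """Parse the JSON dictionary supplied as metric parameters."""
    params = json.loads(text)
    if not isinstance(params, dict):
        raise ValueError("metric parameters must be a JSON object")
    return params


check_perplexity(30, n_valid_rows=5000)   # passes silently
print(auto_learning_rate(5000))           # 5000 / 12 / 4, about 104.17
print(auto_learning_rate(1000))           # below the floor, so 50.0
print(parse_metric_params('{"p": 3}'))
```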

Once the parameters are selected, press **OK** to run the analysis.

Results
-------

The application creates a new group in the output object, as shown in the :ref:`figure below <tsne_group_result>`. The group is named as defined in the **Data group name** parameter. Each t-SNE component is stored as a separate data property, named ``[prefix]_1``, ``[prefix]_2``, etc. Rows corresponding to data points that were excluded due to no-data values are filled with *NaN* in the output.

.. image:: ../../images/tsne/group_results.png
    :name: tsne_group_result
    :align: center

Tutorial
--------
This tutorial demonstrates how to use the t-SNE application to reduce the dimensionality of a dataset.


1. Open the t-SNE application.

    .. image:: ../../images/tsne/tutorial_1.png
        :align: center

2. Select the input object and the data properties to use, either from the dropdowns or by dragging and dropping. All selected data properties must belong to the same object.

   In this example, we would like to use the ``X`` and ``Y`` *geometric* data properties, but the data properties must be of type ``Float``. To convert geometric data to float data, right-click the property you would like to convert and select **Convert to Float**. This creates a new data property of type ``Float`` with the same values as the original geometric property. Select these new float properties as input for t-SNE.

    .. image:: ../../images/tsne/tutorial_2a.png
        :align: center

    .. image:: ../../images/tsne/tutorial_2b.png
        :align: center

    .. image:: ../../images/tsne/tutorial_2c.png
        :align: center

3. Set the output group name and prefix for the t-SNE components. In this example, we leave them both as the default, "TSNE".

    .. image:: ../../images/tsne/tutorial_3.png
        :align: center

4. Set the advanced parameters as desired. In this example, we leave them all as default, except for the **Random state** field, which we set to 42 for reproducibility.

    .. image:: ../../images/tsne/tutorial_4.png
        :align: center

5. Press **OK** to close the UI and run the application, or **Apply** to run the application and keep the UI open.

    .. image:: ../../images/tsne/tutorial_5a.png
        :align: center

    .. image:: ../../images/tsne/tutorial_5b.png
        :align: center

6. Examine the results: a new data group named "TSNE" has been created, containing two data properties named "TSNE_1" and "TSNE_2". These properties contain the t-SNE components for each data point. We can create a 2D Cross Plot of these two components in Geoscience ANALYST to visualize the embedding.

    .. image:: ../../images/tsne/tutorial_6a.png
        :align: center

    .. image:: ../../images/tsne/group_results.png
        :align: center

    .. image:: ../../images/tsne/geochem_pts_xy_crossplot.png
        :align: center