# The NDDataset object

The NDDataset is the main object used by **SpectroChemPy**.

Like numpy ndarrays, NDDatasets can be sliced, sorted and subjected to
mathematical operations.

In addition, an NDDataset may have units, can be masked, and each of its dimensions
can have coordinates, also with units.
This makes NDDatasets aware of unit compatibility, *e.g.*, for binary operations such as
addition or subtraction, or during the application of mathematical functions.
In addition to, or in place of, numerical data for coordinates,
an NDDataset can also have labeled coordinates, where the labels can be different kinds of
objects (strings, datetimes, numpy ndarrays, other NDDatasets, etc.).

This offers a lot of flexibility in using NDDatasets that, we hope, will be useful
for applications.
See the [Examples](../../../gettingstarted/examples/gallery/auto_examples_core/index.rst) for
additional information about such possible applications.

**Below (and in the next sections), we try to give an almost complete view of the
NDDataset features.**

As we will make some reference to the
**[numpy](https://numpy.org/doc/stable/index.html)** library, we also import it here.

In [None]:
import numpy as np

import spectrochempy as scp

We additionally import the three main SpectroChemPy objects that we will use throughout
this tutorial:

In [None]:
from spectrochempy import Coord
from spectrochempy import CoordSet
from spectrochempy import NDDataset

For convenient use of units, we also directly import
**[ur](#Units)**, the unit registry, which contains all available
units.

In [None]:
from spectrochempy import ur

Multidimensional arrays are defined in SpectroChemPy using the `NDDataset` object.

`NDDataset` objects mostly behave like numpy's `numpy.ndarray`
(see for instance the
**[numpy quickstart tutorial](https://numpy.org/doc/stable/user/quickstart.html)**).

However, unlike raw numpy ndarrays, the presence of optional properties makes
them (hopefully) more appropriate for handling spectroscopic information,
one of the major objectives of the SpectroChemPy package:

*  `mask`: Data can be partially masked at will
*  `units`: Data can have units, allowing units-aware operations
*  `CoordSet`: Data can have a set of coordinates, one or several per dimension

Additional metadata can also be added to the instances of this class
through the `meta` property.
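The idea behind the `mask` property can be illustrated with numpy's own `numpy.ma` module. The following is only a sketch of the concept in plain numpy, not SpectroChemPy's API; the values and the masked position are made up for illustration.

```python
import numpy as np

# Made-up values; the last one is treated as an outlier and masked.
values = np.array([10.0, 20.0, 30.0])
masked = np.ma.masked_array(values, mask=[False, False, True])

# Masked entries are ignored by reductions such as the mean.
print(values.mean())  # mean over all three values
print(masked.mean())  # mean over the two unmasked values only
```

The underlying data is left untouched by masking, so a mask can be removed later without any loss of information.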

## 1D-Dataset (unidimensional dataset)

In the following example, a minimal 1D dataset is created from a simple list,
to which we can add some metadata:

In [None]:
d1D = NDDataset(
    [10.0, 20.0, 30.0],
    name="Dataset N1",
    author="Blake and Mortimer",
    description="A dataset from scratch",
    history="creation",
)
d1D

In [None]:
print(d1D)

In [None]:
_ = d1D.plot(figsize=(3, 2))

Except for a few additional metadata such as `author`, `created`, etc., there is not much
difference with respect to a conventional
**[numpy.array](https://numpy.org/doc/stable/reference/generated/numpy.array.html#numpy.array)**.
For example, one can apply numpy
**[ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs)**
directly to an NDDataset or perform basic arithmetic
operations with these objects:

In [None]:
np.sqrt(d1D)

In [None]:
d1D += d1D / 2.0
d1D

As seen above, there are some attributes that are automatically added to the dataset:

* `id`      : This is a unique identifier for the object.
* `name`    : A short and unique name for the dataset. It will be equal to the automatic
`id` if it is not provided.
* `author`  : Author determined from the computer name if not provided.
* `created` : Date and time of creation.
* `modified`: Date and time of modification.

These attributes can be modified by the user, except `id`, `created` and `modified`,
which are read-only.

Some other attributes are defined to describe the data:

* `title`: A long name that will be used in plots or in some other operations.
* `history`: History of the operations performed on the object since its creation.
* `description`: A comment or a description of the object's purpose or contents.
* `origin`: An optional reference to the source of the data.

Here is an example of the use of the NDDataset attributes:

In [None]:
d1D.title = "intensity"
d1D.name = "mydataset"
d1D.history = "created from scratch"
d1D.description = "Some experimental measurements"
d1D

`d1D` is a 1D dataset, *i.e.*, it has a single dimension.

Some attributes are useful to check this kind of information:

In [None]:
d1D.shape  # the shape of a 1D dataset contains a single dimension size

In [None]:
d1D.ndim  # the number of dimensions

In [None]:
d1D.dims  # the name of the dimension (it has been automatically attributed)

**Note**: The names of the dimensions are set automatically. They can be changed,
with <u>the limitation</u> that each
name must be a single letter.

In [None]:
d1D.dims = ["q"]  # change the list of dim names.

In [None]:
d1D.dims

## nD-Dataset (multidimensional dataset)

To create an nD NDDataset, we can provide an nD array-like object to the NDDataset
constructor:

In [None]:
a = np.random.rand(2, 4, 6)
a

In [None]:
d3D = NDDataset(a)
d3D.title = "energy"
d3D.author = "Someone"
d3D.name = "3D dataset creation"
d3D.history = "created from scratch"
d3D.description = "Some example"
d3D.dims = ["u", "v", "t"]
d3D

We can also provide all the information in a single statement:

In [None]:
d3D = NDDataset(
    a,
    dims=["u", "v", "t"],
    title="Energy",
    author="Someone",
    name="3D_dataset",
    history="created from scratch",
    description="a single statement creation example",
)
d3D

Three names are attributed at creation (if they are not provided with the `dims`
attribute, the names 'z', 'y', 'x' are attributed automatically).

In [None]:
d3D.dims

In [None]:
d3D.ndim

In [None]:
d3D.shape

## About the dates and times

The dates and times are stored internally as
[UTC (Coordinated Universal Time)](https://en.wikipedia.org/wiki/Coordinated_Universal_Time).
Timezone information is stored in the `timezone` attribute.
If it is not set, the local timezone is used by default,
which is probably the most common case.

In [None]:
nd = NDDataset()
nd.created

In this case our local timezone has been used by default for the conversion from
UTC datetime.

In [None]:
nd.local_timezone

In [None]:
nd.timezone = "EST"
nd.created

For a list of timezone codes (TZ), see the
[list of tz database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).
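The "store in UTC, convert for display" approach described in this section can be sketched with the standard library alone. This is not SpectroChemPy code: the instant is made up, and `EST` is modelled here as a fixed UTC-5 offset rather than looked up in a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Store the timestamp once, in UTC (a fixed, made-up instant for reproducibility).
created_utc = datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc)

# A fixed-offset stand-in for "EST" (UTC-5); real code would use a tz database.
est = timezone(timedelta(hours=-5), "EST")

# Converting for display changes the wall-clock reading, not the instant itself.
created_est = created_utc.astimezone(est)
print(created_utc.isoformat())
print(created_est.isoformat())
```

Because both objects describe the same instant, they compare equal even though their printed representations differ.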

## About the `history` attribute

The history is saved internally in a list, but it behaves differently from
the usual list.
When an NDDataset is first created, the list is empty:

In [None]:
nd = NDDataset()
nd.history

Assigning a string to the history attribute has two effects. First, the string is
automatically appended to the existing history list. Second, it is
preceded by the time at which it was added.

In [None]:
nd.history = "some history"
nd.history = "another history to append"
nd.history = "..."
nd.history

If you want to erase the history, assign an empty list:

In [None]:
nd.history = []
nd.history

If you want to replace the full history, put brackets around your history line (*i.e.*, assign a list):

In [None]:
nd.history = "Created from scratch"
nd.history = "a second line that will be erased"
nd.history = ["A more interesting message"]
nd.history
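The append-on-string, replace-on-list behaviour described in this section can be mimicked with a property setter. The class below is a hypothetical toy (`HistoryDemo` is not part of SpectroChemPy), and the timestamp format is an assumption; it is only meant to show the mechanism.

```python
from datetime import datetime, timezone


class HistoryDemo:
    """Toy object whose `history` setter appends strings and replaces on lists."""

    def __init__(self):
        self._history = []

    @property
    def history(self):
        return list(self._history)  # return a copy so callers cannot mutate it

    @history.setter
    def history(self, value):
        if isinstance(value, str):
            # A string is appended, prefixed with the time it was added.
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            self._history.append(f"{stamp}> {value}")
        else:
            # A list (possibly empty) replaces the whole history.
            self._history = []
            for item in value:
                self.history = item  # reuse the string branch above


demo = HistoryDemo()
demo.history = "some history"
demo.history = "another history to append"
print(len(demo.history))  # two entries so far

demo.history = ["A more interesting message"]
print(len(demo.history))  # the list assignment replaced them
```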

## Units

One interesting possibility for a NDDataset is to have defined units for the internal
data.

In [None]:
d1D.units = ur.eV  # ur is a registry containing all available units

In [None]:
d1D  # note the eV symbol of the units added to the values field below

This makes units-aware calculations possible:

In [None]:
d1D**2  # note the results in eV^2

In [None]:
np.sqrt(d1D)  # note the result in eV^0.5

In [None]:
time = 5.0 * ur.second
d1D / time  # here we get results in eV/s

Conversion between different units can be done transparently:

In [None]:
d1D.to("J")

In [None]:
d1D.to("K")
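The arithmetic behind these two conversions can be checked by hand with plain numpy. This sketch assumes that the energy-to-temperature conversion corresponds to E = k_B T, and it uses the values 15, 30 and 45 eV that `d1D` holds after the `d1D += d1D / 2.0` step above; the constants are the exact 2019 SI values.

```python
import numpy as np

E_CHARGE = 1.602176634e-19  # J per eV (exact, 2019 SI definition)
K_BOLTZMANN = 1.380649e-23  # J/K (exact, 2019 SI definition)

energies_ev = np.array([15.0, 30.0, 45.0])

energies_j = energies_ev * E_CHARGE        # eV -> J
temperatures_k = energies_j / K_BOLTZMANN  # J -> K, assuming E = k_B * T

print(energies_j)
print(temperatures_k)
```

One electronvolt corresponds to roughly 11604.5 K under this assumption, which is why the temperatures come out so large.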

For more examples on how to use units with NDDataset, see the
[gallery example](gettingstarted/examples/gallery/auto_examples_core/a_nddataset/plot_c_units.html).

## Coordinates

The `d3D` dataset created above has 3 dimensions, but no coordinates for these
dimensions. Here arises a big difference
with simple `numpy` arrays: we can add coordinates to each dimension of an NDDataset.

To get the list of all defined coordinates, we can use the `coordset` attribute:

In [None]:
d3D.coordset  # no coordinates, so it returns nothing (None)

In [None]:
d3D.t  # the same for coordinate  t, v, u which are not yet set

To add coordinates, one way is to set them one by one:

In [None]:
d3D.t = (
    Coord.arange(6) * 0.1
)  # we need a sequence of 6 values for `t` dimension (see shape above)
d3D.t.title = "time"
d3D.t.units = ur.seconds
d3D.coordset  # now returns a list of coordinates

In [None]:
d3D.t

In [None]:
d3D.coordset("t")  # alternative way to get a given coordinate

In [None]:
d3D["t"]  # another alternative way to get a given coordinate

The two other coordinates, `u` and `v`, are still undefined:

In [None]:
d3D.u, d3D.v

When the dataset is printed, only the information for the existing coordinates is
given.

In [None]:
d3D

Programmatically, we can use the attributes `is_empty` or `has_data` to check this:

In [None]:
d3D.v.has_data, d3D.v.is_empty

An error is raised when a coordinate doesn't exist:

In [None]:
try:
    d3D.x
except KeyError as e:
    scp.error_(KeyError, e)

In some cases it can also be useful to get a coordinate from its title instead of its
name. The limitation is that if
several coordinates have the same title, only the first one found in
the coordinate list is
returned, which can be ambiguous.

In [None]:
d3D["time"]

In [None]:
d3D.time

## Labels

It is possible to use labels instead of numerical coordinates. Labels are sequences of
objects. The length of the
sequence must be equal to the size of the dimension.

The labels can be simple strings, *e.g.,*

In [None]:
tags = list("ab")
d3D.u.title = "some tags"
d3D.u.labels = tags  # TODO: avoid repetition
d3D

or more complex objects.

For instance here we use datetime.timedelta objects:

In [None]:
from datetime import timedelta

start = timedelta(0)
times = [start + timedelta(seconds=x * 60) for x in range(6)]
d3D.t = None
d3D.t.labels = times
d3D.t.title = "time"
d3D

In this case, getting a coordinate that has labels but no numerical data
will return the labels:

In [None]:
d3D.time

# More insight on coordinates

## Sharing coordinates between dimensions

Sometimes it is not necessary to have different coordinates for each axis. Some can be
shared between axes.

For example, if we have a square matrix with the same coordinate in the two
dimensions, the second dimension can
refer to the first. Here we create a square 2D dataset, using the `diag` method:

In [None]:
nd = NDDataset.diag((3, 3, 2.5))
nd

and then we add the same coordinate for both dimensions:

In [None]:
coordx = Coord.arange(3)
nd.set_coordset(x=coordx, y="x")
nd
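The idea of one dimension referring to another by name, as in `y="x"` above, can be pictured with a small lookup table. This is a hypothetical toy model (the `coords` dictionary and `resolve` helper are invented for illustration) and says nothing about how SpectroChemPy stores shared coordinates internally.

```python
import numpy as np

# Hypothetical mapping: a dimension maps either to coordinate values
# or to the name of another dimension whose coordinate it shares.
coords = {"x": np.arange(3), "y": "x"}


def resolve(name):
    """Follow name references until actual coordinate values are found."""
    value = coords[name]
    while isinstance(value, str):
        value = coords[value]
    return value


# Both dimensions resolve to the very same array object, not to two copies.
print(resolve("y") is resolve("x"))
```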

## Setting coordinates using `set_coordset`

Let's create 3 `Coord` objects to be used as coordinates for the 3 dimensions of the
previous d3D dataset.

In [None]:
d3D.dims = ["t", "v", "u"]
s0, s1, s2 = d3D.shape
coord0 = Coord.linspace(10.0, 100.0, s0, units="m", title="distance")
coord1 = Coord.linspace(20.0, 25.0, s1, units="K", title="temperature")
coord2 = Coord.linspace(0.0, 1000.0, s2, units="hour", title="elapsed time")

### Syntax 1

In [None]:
d3D.set_coordset(u=coord2, v=coord1, t=coord0)
d3D

### Syntax 2

In [None]:
d3D.set_coordset({"u": coord2, "v": coord1, "t": coord0})
d3D

## Adding several coordinates to a single dimension

We can add several coordinates to the same dimension:

In [None]:
coord1b = Coord([1, 2, 3, 4], units="millitesla", title="magnetic field")

In [None]:
d3D.set_coordset(u=coord2, v=[coord1, coord1b], t=coord0)
d3D

We can retrieve the various coordinates for a single dimension easily:

In [None]:
d3D.v_1

## Math operations on coordinates
Arithmetic operations can be performed on single coordinates:

In [None]:
d3D.u = d3D.u * 2
d3D.u

The ufunc numpy functions can also be applied, and will affect both the magnitude and
the units of the coordinates:

In [None]:
d3D.u = 1.5 + np.sqrt(d3D.u)
d3D.u

A particularly frequent use case is to subtract the initial value from a coordinate. This can be done
directly with the `-` operator:

In [None]:
d3D.u = d3D.u - d3D.u[0]
d3D.u

The operations above will generally *not* work on multiple coordinates, and
will raise an error if attempted:

In [None]:
try:
    d3D.v = d3D.v - 1.5
except NotImplementedError as e:
    scp.error_(NotImplementedError, e)

Only subtraction between multiple coordinates is allowed, and will return a new `CoordSet` where each coordinate
has been subtracted:

In [None]:
d3D.v = d3D.v - d3D.v[0]
d3D.v

It is always possible to carry out operations on a given coordinate
of a CoordSet. This must be done by accessing the coordinate by its name, e.g. `'temperature'` or `'_2'` for
the second coordinate of the `v` dimension:

In [None]:
d3D.v["_2"] = d3D.v["_2"] + 5.0
d3D.v

## Summary of the coordinate setting syntax
Some additional information about coordinate setting syntax

**A.** First syntax (probably the safest, because the name of the dimension is
specified, so this is less prone to
errors!)

In [None]:
d3D.set_coordset(u=coord2, v=[coord1, coord1b], t=coord0)
# or equivalent
d3D.set_coordset(u=coord2, v=CoordSet(coord1, coord1b), t=coord0)
d3D

**B.** Second syntax assuming the coordinates are given in the order of the
dimensions.

Remember that we can check this order using the `dims` attribute of a NDDataset

In [None]:
d3D.dims

In [None]:
d3D.set_coordset((coord0, [coord1, coord1b], coord2))
# or equivalent
d3D.set_coordset(coord0, CoordSet(coord1, coord1b), coord2)
d3D

**C.** Third syntax (from a dictionary)

In [None]:
d3D.set_coordset({"t": coord0, "u": coord2, "v": [coord1, coord1b]})
d3D

**D.** It is also possible to directly use the `coordset` property

In [None]:
d3D.coordset = coord0, [coord1, coord1b], coord2
d3D

In [None]:
d3D.coordset = {"t": coord0, "u": coord2, "v": [coord1, coord1b]}
d3D

In [None]:
d3D.coordset = CoordSet(t=coord0, u=coord2, v=[coord1, coord1b])
d3D

<div class='alert alert-warning'>
<b>WARNING</b>

Do not use a list to set multiple coordinates! Use a tuple.
</div>

This raises an error (a list has another meaning: it is used to set a "same dim"
CoordSet, see examples A and B):

In [None]:
try:
    d3D.coordset = [coord0, coord1, coord2]
except ValueError:
    scp.error_(
        ValueError,
        "Coordinates must be of the same size for a dimension with multiple coordinates",
    )

This works: it uses a tuple `()`, not a list `[]`:

In [None]:
d3D.coordset = (
    coord0,
    coord1,
    coord2,
)  # equivalent to d3D.coordset = coord0, coord1, coord2
d3D

**E.** Setting the coordinates individually

Either a single coordinate

In [None]:
d3D.u = coord2
d3D

or multiple coordinates for a single dimension

In [None]:
d3D.v = [coord1, coord1b]
d3D

or using a CoordSet object.

In [None]:
d3D.v = CoordSet(coord1, coord1b)
d3D

# Methods to create NDDataset

There are many ways to create `NDDataset` objects.

Let's first create 2 coordinate objects, for which we can define `labels` and `units`.
Note the use of the function
`linspace` to generate the data.

In [None]:
c0 = Coord.linspace(
    start=4000.0, stop=1000.0, num=5, labels=None, units="cm^-1", title="wavenumber"
)

In [None]:
c1 = Coord.linspace(
    10.0, 40.0, 3, labels=["Cold", "RT", "Hot"], units="K", title="temperature"
)

The full coordset will be the following:

In [None]:
cs = CoordSet(c0, c1)
cs
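For readers who want to see the numbers involved, the values generated for the two coordinates above can be reproduced with `numpy.linspace`. This is a plain-numpy sketch that leaves units aside and assumes `Coord.linspace` follows the same start/stop/num convention as numpy; the `labelled` dictionary is just an illustrative way of pairing each label with its value.

```python
import numpy as np

# Five evenly spaced values from 4000 down to 1000 (both ends included).
wavenumbers = np.linspace(4000.0, 1000.0, 5)

# Three temperatures, each paired with one label; the lengths must match.
temperatures = np.linspace(10.0, 40.0, 3)
labels = ["Cold", "RT", "Hot"]
labelled = dict(zip(labels, temperatures.tolist()))

print(wavenumbers.tolist())
print(labelled)
```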

Now we will generate the full dataset using the `fromfunction` method.
All the needed information is passed as
parameters.

## Create a dataset from a function

In [None]:
def func(x, y, extra):
    return x * y / extra

In [None]:
ds = NDDataset.fromfunction(
    func,
    extra=100 * ur.cm**-1,  # extra arguments passed to the function
    coordset=cs,
    name="mydataset",
    title="absorbance",
    units=None,
)  # when None, units will be determined from the function results

ds.description = """Dataset example created for this tutorial.
It's a 2-D dataset"""

ds.author = "Blake & Mortimer"
ds
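What such a construction computes can be sketched in plain numpy, under the assumption that the function is evaluated on the grid spanned by the coordinate values (first coordinate along the rows, second along the columns). Units are dropped here, so `extra` is a bare number, and the array names are stand-ins for the `c0` and `c1` coordinates defined earlier.

```python
import numpy as np


def func(x, y, extra):
    return x * y / extra


# Plain-number stand-ins for the two coordinates defined above.
wavenumber = np.linspace(4000.0, 1000.0, 5)  # cm^-1
temperature = np.linspace(10.0, 40.0, 3)     # K

# Build index grids for a (5, 3) array, then look up the coordinate values.
i, j = np.indices((wavenumber.size, temperature.size))
data = func(wavenumber[i], temperature[j], extra=100.0)

print(data.shape)
print(data[0])
```

Each element `data[i, j]` is the function applied to the i-th wavenumber and the j-th temperature, which is the same pattern as `numpy.fromfunction` but with coordinate values in place of raw indices.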

## Using numpy-like constructors of NDDatasets

In [None]:
dz = NDDataset.zeros(
    (5, 3), coordset=cs, units="meters", title="Datasets with only zeros"
)

In [None]:
do = NDDataset.ones(
    (5, 3), coordset=cs, units="kilograms", title="Datasets with only ones"
)

In [None]:
df = NDDataset.full(
    (5, 3), fill_value=1.25, coordset=cs, units="radians", title="with only float=1.25"
)
df

As with numpy, it is also possible to take another dataset as a template:

In [None]:
df = NDDataset.full_like(d3D, dtype="int", fill_value=2)
df

In [None]:
nd = NDDataset.diag((3, 3, 2.5))
nd

## Copying existing NDDataset

Copying an existing dataset is as simple as:

In [None]:
d3D_copy = d3D.copy()

or alternatively:

In [None]:
d3D_copy = d3D[:]

Finally, it is also possible to initialize a dataset using an existing one:

In [None]:
d3Dduplicate = NDDataset(d3D, name=f"duplicate of {d3D.name}", units="absorbance")
d3Dduplicate

## Importing from external dataset

An NDDataset can be created by importing external data.

The **test** data folder contains some data for experimenting with some features of
datasets.

In [None]:
# let's check that this directory exists and display its name:
datadir = scp.preferences.datadir
if datadir.exists():
    print(datadir.name)

Let's load grouped IR spectra acquired using OMNIC:

In [None]:
nd = NDDataset.read_omnic(datadir / "irdata/nh4y-activation.spg")
nd.preferences.reset()
_ = nd.plot()

Even if we do not specify the **datadir**, the application looks in the
default directory first.

Now, let's load an NMR dataset (in the Bruker format).

In [None]:
path = datadir / "nmrdata" / "bruker" / "tests" / "nmr" / "topspin_2d"

# load the data directly (no need to create the dataset first)
nd2 = NDDataset.read_topspin(path, expno=1, remove_digital_filter=True)

# view it...
nd2.x.to("s")
nd2.y.to("ms")

ax = nd2.plot(method="map")