Tutorial: SOMA Objects¶

In this notebook, we’ll go through the various objects available as part of the SOMA API. The dataset used is from Peripheral Blood Mononuclear Cells (PBMC), which is freely available from 10X Genomics.

We’ll start by importing tiledbsoma.

[1]:

import tiledbsoma

Experiment¶

An Experiment is a class that represents a single-cell experiment. It always contains two objects: 1. obs: A DataFrame with primary annotations on the observation axis. 2. ms: A Collection of measurements.

[5]:

experiment = tiledbsoma.open("data/dense/pbmc3k")
experiment

[5]:

<Experiment 'data/dense/pbmc3k' (open for 'r') (2 items)
    'ms': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms' (unopened)
    'obs': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/obs' (unopened)>

Each object can be opened like this:

[6]:

experiment.ms

[6]:

<Collection 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms' (open for 'r') (2 items)
    'raw': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw' (unopened)
    'RNA': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA' (unopened)>

[7]:

experiment.obs

[7]:

<DataFrame 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/obs' (open for 'r')>

Note that by default an Experiment is opened lazily, i.e. only the minimal requested objects are opened.

Also, opening an object doesn’t mean that it will entirely be fetched in memory. It only returns a pointer to the object on disk.

DataFrame¶

A DataFrame is a multi-column table with a user-defined schema. The schema is expressed as an Arrow Schema, and defines the column names and value types.

As an example, let’s take a look at obs, which is represented as a SOMA DataFrame.

We can inspect the schema using .schema:

[8]:

obs = experiment.obs
obs.schema

[8]:

soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: large_string

Note that soma_joinid is a field that exists in each DataFrame and acts as a join key for other objects, such as SparseNDArray (more on this later).

When a DataFrame is accessed, only metadata is retrieved, not actual data. This is important since a DataFrame can be very large and might not fit in memory.

To materialize the dataframe (or a subset) in memory, we call df.read().

If the dataframe is small, we can convert it to an in-memory Pandas object like this:

[9]:

obs.read().concat().to_pandas()

[9]:

	soma_joinid	obs_id	n_genes	percent_mito	n_counts	louvain
0	0	AAACATACAACCAC-1	781	0.030178	2419.0	CD4 T cells
1	1	AAACATTGAGCTAC-1	1352	0.037936	4903.0	B cells
2	2	AAACATTGATCAGC-1	1131	0.008897	3147.0	CD4 T cells
3	3	AAACCGTGCTTCCG-1	960	0.017431	2639.0	CD14+ Monocytes
4	4	AAACCGTGTATGCG-1	522	0.012245	980.0	NK cells
...	...	...	...	...	...	...
2633	2633	TTTCGAACTCTCAT-1	1155	0.021104	3459.0	CD14+ Monocytes
2634	2634	TTTCTACTGAGGCA-1	1227	0.009294	3443.0	B cells
2635	2635	TTTCTACTTCCTCG-1	622	0.021971	1684.0	B cells
2636	2636	TTTGCATGAGAGGC-1	454	0.020548	1022.0	B cells
2637	2637	TTTGCATGCCTCAC-1	724	0.008065	1984.0	CD4 T cells

2638 rows × 6 columns

Here, read() returns an iterator, concat() materializes all rows to memory and to_pandas() returns a Pandas view of the dataframe.

If the dataframe is bigger, we can only select a subset of it before materializing. This will only retrieve the required subset from disk to memory, so very large dataframes can be queried this way. In this example, we will only select the first 10 rows:

[10]:

obs.read((slice(0,10),)).concat().to_pandas()

[10]:

	soma_joinid	obs_id	n_genes	percent_mito	n_counts	louvain
0	0	AAACATACAACCAC-1	781	0.030178	2419.0	CD4 T cells
1	1	AAACATTGAGCTAC-1	1352	0.037936	4903.0	B cells
2	2	AAACATTGATCAGC-1	1131	0.008897	3147.0	CD4 T cells
3	3	AAACCGTGCTTCCG-1	960	0.017431	2639.0	CD14+ Monocytes
4	4	AAACCGTGTATGCG-1	522	0.012245	980.0	NK cells
5	5	AAACGCACTGGTAC-1	782	0.016644	2163.0	CD8 T cells
6	6	AAACGCTGACCAGT-1	783	0.038161	2175.0	CD8 T cells
7	7	AAACGCTGGTTCTT-1	790	0.030973	2260.0	CD8 T cells
8	8	AAACGCTGTAGCCA-1	533	0.011765	1275.0	CD4 T cells
9	9	AAACGCTGTTTCTG-1	550	0.029012	1103.0	FCGR3A+ Monocytes
10	10	AAACTTGAAAAACG-1	1116	0.026316	3914.0	B cells

We can also select a subset of the columns:

[11]:

obs.read((slice(0, 10),), column_names=["obs_id", "n_genes"]).concat().to_pandas()

[11]:

	obs_id	n_genes
0	AAACATACAACCAC-1	781
1	AAACATTGAGCTAC-1	1352
2	AAACATTGATCAGC-1	1131
3	AAACCGTGCTTCCG-1	960
4	AAACCGTGTATGCG-1	522
5	AAACGCACTGGTAC-1	782
6	AAACGCTGACCAGT-1	783
7	AAACGCTGGTTCTT-1	790
8	AAACGCTGTAGCCA-1	533
9	AAACGCTGTTTCTG-1	550
10	AAACTTGAAAAACG-1	1116

Finally, we can use value_filter to retrieve a filtered subset of rows that match a certain condition.

[12]:

obs.read((slice(None),), value_filter="n_genes > 1500").concat().to_pandas()

[12]:

	soma_joinid	obs_id	n_genes	percent_mito	n_counts	louvain
0	26	AAATCAACCCTATT-1	1545	0.024313	5676.0	CD4 T cells
1	59	AACCTACTGTGAGG-1	1652	0.015839	5682.0	CD14+ Monocytes
2	107	AAGCACTGGTTCTT-1	1717	0.023566	6153.0	B cells
3	109	AAGCCATGAACTGC-1	1877	0.014015	7064.0	Dendritic cells
4	247	ACCCAGCTGTTAGC-1	1547	0.020600	5534.0	CD14+ Monocytes
...	...	...	...	...	...	...
70	2508	TTACTCGACGCAAT-1	1603	0.024851	5030.0	Dendritic cells
71	2530	TTATGGCTTATGGC-1	1783	0.022064	6164.0	Dendritic cells
72	2597	TTGAGGACTACGCA-1	1794	0.024440	6342.0	Dendritic cells
73	2623	TTTAGCTGTACTCT-1	1567	0.021160	5671.0	Dendritic cells
74	2632	TTTCGAACACCTGA-1	1544	0.013019	4455.0	Dendritic cells

75 rows × 6 columns

Collection¶

A Collection is a persistent container of named SOMA objects, stored as a mapping of string keys and SOMA object values.

The ms member in an Experiment is implemented as a Collection. Let’s take a look:

[13]:

experiment.ms

[13]:

<Collection 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms' (open for 'r') (2 items)
    'raw': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw' (unopened)
    'RNA': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA' (unopened)>

In this case, we have two members: raw and test_exp_name. They can be accessed as they were dict members:

[14]:

experiment.ms["raw"]

[14]:

<Measurement 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw' (open for 'r') (2 items)
    'X': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw/X' (unopened)
    'var': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw/var' (unopened)>

DenseNDArray¶

A DenseNDArray is a dense, N-dimensional array, with offset (zero-based) integer indexing on each dimension.

DenseNDArray has a user-defined schema, which includes: - the element type, expressed as an Arrow type, indicating the type of data contained within the array, and - the shape of the array, i.e., the number of dimensions and the length of each dimension

In a SOMA single cell experiment, the cell by gene matrix X is typically represented either by DenseNDArray or SparseNDArray. Let’s take a look at our example:

[23]:

X = experiment["ms"]["RNA"].X
X

[23]:

<Collection 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA/X' (open for 'r') (1 item)
    'data': DenseNDArray 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA/X/data' (open for 'r')>

Within the experiment, X is a Collection and the data can be accessed using ["data"]:

[26]:

X = X["data"]
X

[26]:

<DenseNDArray 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA/X/data' (open for 'r')>

We can inspect the DenseNDArray and get useful information by using .schema:

[28]:

X.schema

[28]:

soma_dim_0: int64
soma_dim_1: int64
soma_data: float

In this case, we see there are two dimensions and the data is of type float.

We can see the shape of the matrix by calling .shape. In this case, since this represents a dense matrix, this will be the exact size of the matrix:

[34]:

X.shape

[34]:

(2638, 1838)

Similarly to DataFrame, when opening a DenseNDArray only metadata is fetched, and the array isn’t fetched into memory.

We can convert the matrix into a pyarrow.Tensor using .read():

[35]:

X.read()

[35]:

<pyarrow.Tensor>
type: float
shape: (2638, 1838)
strides: (7352, 4)

From here, we can convert it further to a numpy.ndarray:

[38]:

X.read().to_numpy()

[38]:

array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
        -0.2090951 , -0.5312034 ],
       [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,
        -0.31314576, -0.5966544 ],
       [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
        -0.17087643,  1.379     ],
       ...,
       [-0.2070895 , -0.250464  , -0.046397  , ..., -0.05114426,
        -0.16106427,  2.0414972 ],
       [-0.19032837, -0.2263336 , -0.04399938, ..., -0.00591773,
        -0.13521303, -0.48211113],
       [-0.33378917, -0.2535875 , -0.05271563, ..., -0.07842438,
        -0.13032717, -0.4713379 ]], dtype=float32)

This will only work on small matrices, since a numpy array needs to be in memory.

We can retrieve a subset of the matrix passing coordinates to .read(). Here we’re only retrieving the first 10 rows of the matrix:

[48]:

sliced_X = X.read((slice(0,9),)).to_numpy()
sliced_X

[48]:

array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
        -0.2090951 , -0.5312034 ],
       [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,
        -0.31314576, -0.5966544 ],
       [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
        -0.17087643,  1.379     ],
       ...,
       [-0.15813293, -0.27562705, -0.04569191, ..., -0.08687588,
        -0.2062048 ,  1.6869122 ],
       [ 4.861763  , -0.23054866, -0.04826924, ..., -0.02755091,
        -0.11788268, -0.4664504 ],
       [-0.12453113, -0.23373608, -0.04131226, ..., -0.00758654,
        -0.16255915, -0.50339466]], dtype=float32)

[49]:

sliced_X.shape

[49]:

(10, 1838)

Note that DenseNDArray is always indexed, on each dimension, using zero-based integers. If this dimension matches any other object in the experiment, the soma_joinid column can be used to retrieve the correct slice.

In the following example, we will get the values of X for the gene tagged as ICOSLG. This involves reading the var DataFrame using a value_filter, retrieving the soma_joinid for the gene and passing it as coordinate to X.read:

[104]:

var = experiment.ms["RNA"].var
idx = var.read(value_filter="var_id == 'ICOSLG'").concat()["soma_joinid"].to_numpy()

X.read((None, int(idx[0]))).to_numpy()

[104]:

array([[-0.12167774],
       [-0.05866209],
       [-0.07043106],
       ...,
       [-0.1320983 ],
       [-0.14978862],
       [-0.10383061]], dtype=float32)

SparseNDArray¶

A SparseNDArray is a sparse, N-dimensional array, with offset (zero-based) integer indexing on each dimension. SparseNDArray has a user-defined schema, which includes: - the element type, expressed as an Arrow type, indicating the type of data contained within the array, and - the shape of the array, i.e., the number of dimensions and the length of each dimension

A SparseNDArray is functionally similar to a DenseNDArray, except that only elements that have a nonzero value are actually stored. Elements that are not explicitly stored are assumed to be zeros.

As an example, we will load a version of pbmc3k that has been generated using a SparseNDArray:

[16]:

experiment = tiledbsoma.open("data/sparse/pbmc3k")
X = experiment.ms["RNA"].X["data"]
X

[16]:

<SparseNDArray 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/sparse/pbmc3k/ms/RNA/X/data' (open for 'r')>

Let’s take a look at the schema:

[17]:

X.schema

[17]:

soma_dim_0: int64
soma_dim_1: int64
soma_data: float

This is the same as the DenseNDArray version, which makes sense since it’s still a 2-dimensional matrix with float data.

Let’s look at the shape:

[43]:

X.shape

[43]:

(9223372036854773760, 9223372036854773760)

Since sparse matrices are not represented as contiguous arrays in memory, they don’t have a fixed size like a dense matrix would have. Instead, .shape() returns the capacity of the matrix, which means that those are valid indices for reading/writing to that matrix. These are dependent on the capacity of the system rather than the current bounding box of the array.

The closest concept to size for a SparseNDArray is the non-empty domain which can be defined as the largest coordinates that correspond to a nonzero value (across each dimension). There is currently no direct way to infer the nonzero domain of a SparseNDArray without materializing the array; however, obs.count and var.count provide these values.

We can get the number of nonzero elements by calling .nnz:

[56]:

X.nnz

[56]:

In order to work with a SparseNDArray, we call .read():

[44]:

X.read()

[44]:

<tiledbsoma._sparse_nd_array.SparseNDArrayRead at 0x12cc3bdc0>

This returns a SparseNDArrayRead that can be used for getting iterators. For instance, we can do:

[60]:

tensor = X.read().coos().concat()

This returns an Arrow Tensor that can be used to access the array, or convert it further to different formats. For instance:

[61]:

tensor.to_scipy()

[61]:

<9223372036854773760x9223372036854773760 sparse matrix of type '<class 'numpy.float32'>'
        with 4848644 stored elements in COOrdinate format>

can be used to transform it to a SciPy coo_matrix.

Similarly to DenseNDArrays, we can call .read() with a slice to only obtain a subset of the matrix. As an example:

[101]:

sliced_X = X.read((slice(0,9),)).coos().concat().to_scipy()
sliced_X

[101]:

<9223372036854773760x9223372036854773760 sparse matrix of type '<class 'numpy.float32'>'
        with 18380 stored elements in COOrdinate format>

Let’s verify that the slice is correct. To do that, we can call nonzero() on the scipy.sparse.coo_matrix to obtain the coordinates of the nonzero items, and look at the coordinates over the first dimension:

[104]:

sliced_X.nonzero()[0]

[104]:

array([0, 1, 2, ..., 7, 8, 9])