{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2cf7c05c-f723-489d-8c39-3e2841f655b0",
   "metadata": {},
   "source": [
    "# Tutorial: SOMA Objects\n",
    "\n",
    "In this notebook, we'll go through the various objects available as part of the SOMA API. The dataset used is from Peripheral Blood Mononuclear Cells (PBMC), which is freely available from 10X Genomics. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "167dba53-7da6-4984-bbe7-a5416e60325d",
   "metadata": {},
   "source": [
    "We'll start by importing `tiledbsoma`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ab458224-5353-4e15-baa9-46689729e071",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import tiledbsoma"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca9f7272-09e0-4eda-a569-8796a14bf776",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41358011-b835-4c3a-a75e-79a80f4cc3a1",
   "metadata": {},
   "source": [
    "An `Experiment` is a class that represents a single-cell experiment. It always contains two objects:\n",
    "1. `obs`: A `DataFrame` with primary annotations on the observation axis.\n",
    "2. `ms`: A `Collection` of measurements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c3af1793-e2be-45e1-8128-bb64536673f7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Experiment 'data/dense/pbmc3k' (open for 'r') (2 items)\n",
       "    'ms': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms' (unopened)\n",
       "    'obs': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/obs' (unopened)>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment = tiledbsoma.open(\"data/dense/pbmc3k\")\n",
    "experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18ed8201-084b-4c09-bd45-9ad769318d3c",
   "metadata": {
    "tags": []
   },
   "source": [
    "Each object can be opened like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "228e9411-434e-4c55-8fb4-fef3216dca08",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Collection 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms' (open for 'r') (2 items)\n",
       "    'raw': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw' (unopened)\n",
       "    'RNA': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA' (unopened)>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment.ms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5d92e331-5c6c-4971-b956-442996d5efa9",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<DataFrame 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/obs' (open for 'r')>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment.obs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebbd3605-7601-4142-8979-4748966d91d7",
   "metadata": {
    "tags": []
   },
   "source": [
    "Note that by default an `Experiment` is opened lazily, i.e. only the minimal requested objects are opened. \n",
    "\n",
    "Also, opening an object doesn't mean that it will entirely be fetched in memory. It only returns a pointer to the object on disk."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccb496d3-09c7-4627-9ed1-eca5a87dc4b4",
   "metadata": {
    "tags": []
   },
   "source": [
    "## DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71bb70de-5665-4e25-97bf-32d28d383f66",
   "metadata": {
    "tags": []
   },
   "source": [
    "A `DataFrame` is a multi-column table with a user-defined schema. The schema is expressed as an Arrow Schema, and defines the column names and value types."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7cfb6c9f-9101-4083-908a-c61c6b088110",
   "metadata": {
    "tags": []
   },
   "source": [
    "As an example, let's take a look at `obs`, which is represented as a SOMA DataFrame.\n",
    "\n",
    "We can inspect the schema using `.schema`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "66b67624-3cbe-4401-a297-e008cf18ab0b",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "soma_joinid: int64\n",
       "obs_id: large_string\n",
       "n_genes: int64\n",
       "percent_mito: float\n",
       "n_counts: float\n",
       "louvain: large_string"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs = experiment.obs\n",
    "obs.schema"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfcccee0-6d0a-4a8c-bf5a-897ed38c1749",
   "metadata": {
    "tags": []
   },
   "source": [
    "Note that `soma_joinid` is a field that exists in each `DataFrame` and acts as a join key for other objects, such as `SparseNDArray` (more on this later)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a25cb83c-0e1a-4a2a-be21-8235ff63a647",
   "metadata": {
    "tags": []
   },
   "source": [
    "When a `DataFrame` is accessed, only metadata is retrieved, not actual data. This is important since a DataFrame can be very large and might not fit in memory.\n",
    "\n",
    "To materialize the dataframe (or a subset) in memory, we call `df.read()`. \n",
    "\n",
    "If the dataframe is small, we can convert it to an in-memory Pandas object like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "26676c4f-dfb8-4f48-9bc5-1a66ee085f9e",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>AAACATACAACCAC-1</td>\n",
       "      <td>781</td>\n",
       "      <td>0.030178</td>\n",
       "      <td>2419.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>AAACATTGAGCTAC-1</td>\n",
       "      <td>1352</td>\n",
       "      <td>0.037936</td>\n",
       "      <td>4903.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>AAACATTGATCAGC-1</td>\n",
       "      <td>1131</td>\n",
       "      <td>0.008897</td>\n",
       "      <td>3147.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>AAACCGTGCTTCCG-1</td>\n",
       "      <td>960</td>\n",
       "      <td>0.017431</td>\n",
       "      <td>2639.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>AAACCGTGTATGCG-1</td>\n",
       "      <td>522</td>\n",
       "      <td>0.012245</td>\n",
       "      <td>980.0</td>\n",
       "      <td>NK cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2633</th>\n",
       "      <td>2633</td>\n",
       "      <td>TTTCGAACTCTCAT-1</td>\n",
       "      <td>1155</td>\n",
       "      <td>0.021104</td>\n",
       "      <td>3459.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2634</th>\n",
       "      <td>2634</td>\n",
       "      <td>TTTCTACTGAGGCA-1</td>\n",
       "      <td>1227</td>\n",
       "      <td>0.009294</td>\n",
       "      <td>3443.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2635</th>\n",
       "      <td>2635</td>\n",
       "      <td>TTTCTACTTCCTCG-1</td>\n",
       "      <td>622</td>\n",
       "      <td>0.021971</td>\n",
       "      <td>1684.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2636</th>\n",
       "      <td>2636</td>\n",
       "      <td>TTTGCATGAGAGGC-1</td>\n",
       "      <td>454</td>\n",
       "      <td>0.020548</td>\n",
       "      <td>1022.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2637</th>\n",
       "      <td>2637</td>\n",
       "      <td>TTTGCATGCCTCAC-1</td>\n",
       "      <td>724</td>\n",
       "      <td>0.008065</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2638 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      soma_joinid            obs_id  n_genes  percent_mito  n_counts  \\\n",
       "0               0  AAACATACAACCAC-1      781      0.030178    2419.0   \n",
       "1               1  AAACATTGAGCTAC-1     1352      0.037936    4903.0   \n",
       "2               2  AAACATTGATCAGC-1     1131      0.008897    3147.0   \n",
       "3               3  AAACCGTGCTTCCG-1      960      0.017431    2639.0   \n",
       "4               4  AAACCGTGTATGCG-1      522      0.012245     980.0   \n",
       "...           ...               ...      ...           ...       ...   \n",
       "2633         2633  TTTCGAACTCTCAT-1     1155      0.021104    3459.0   \n",
       "2634         2634  TTTCTACTGAGGCA-1     1227      0.009294    3443.0   \n",
       "2635         2635  TTTCTACTTCCTCG-1      622      0.021971    1684.0   \n",
       "2636         2636  TTTGCATGAGAGGC-1      454      0.020548    1022.0   \n",
       "2637         2637  TTTGCATGCCTCAC-1      724      0.008065    1984.0   \n",
       "\n",
       "              louvain  \n",
       "0         CD4 T cells  \n",
       "1             B cells  \n",
       "2         CD4 T cells  \n",
       "3     CD14+ Monocytes  \n",
       "4            NK cells  \n",
       "...               ...  \n",
       "2633  CD14+ Monocytes  \n",
       "2634          B cells  \n",
       "2635          B cells  \n",
       "2636          B cells  \n",
       "2637      CD4 T cells  \n",
       "\n",
       "[2638 rows x 6 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read().concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "630e8cdd-7ff2-4fb9-8c2a-650e16b3b43b",
   "metadata": {
    "tags": []
   },
   "source": [
    "Here, `read()` returns an iterator, `concat()` materializes all rows to memory and `to_pandas()` returns a Pandas view of the dataframe."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5a1bbc5-742d-483d-b572-fef0c5caa4c4",
   "metadata": {
    "tags": []
   },
   "source": [
    "If the dataframe is bigger, we can only select a subset of it before materializing. This will only retrieve the required subset from disk to memory, so very large dataframes can be queried this way. In this example, we will only select the first 10 rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "32bfed6c-b0b7-41ed-986c-df7d462498c4",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>AAACATACAACCAC-1</td>\n",
       "      <td>781</td>\n",
       "      <td>0.030178</td>\n",
       "      <td>2419.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>AAACATTGAGCTAC-1</td>\n",
       "      <td>1352</td>\n",
       "      <td>0.037936</td>\n",
       "      <td>4903.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>AAACATTGATCAGC-1</td>\n",
       "      <td>1131</td>\n",
       "      <td>0.008897</td>\n",
       "      <td>3147.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>AAACCGTGCTTCCG-1</td>\n",
       "      <td>960</td>\n",
       "      <td>0.017431</td>\n",
       "      <td>2639.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>AAACCGTGTATGCG-1</td>\n",
       "      <td>522</td>\n",
       "      <td>0.012245</td>\n",
       "      <td>980.0</td>\n",
       "      <td>NK cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>AAACGCACTGGTAC-1</td>\n",
       "      <td>782</td>\n",
       "      <td>0.016644</td>\n",
       "      <td>2163.0</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>AAACGCTGACCAGT-1</td>\n",
       "      <td>783</td>\n",
       "      <td>0.038161</td>\n",
       "      <td>2175.0</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>AAACGCTGGTTCTT-1</td>\n",
       "      <td>790</td>\n",
       "      <td>0.030973</td>\n",
       "      <td>2260.0</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>AAACGCTGTAGCCA-1</td>\n",
       "      <td>533</td>\n",
       "      <td>0.011765</td>\n",
       "      <td>1275.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>AAACGCTGTTTCTG-1</td>\n",
       "      <td>550</td>\n",
       "      <td>0.029012</td>\n",
       "      <td>1103.0</td>\n",
       "      <td>FCGR3A+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>AAACTTGAAAAACG-1</td>\n",
       "      <td>1116</td>\n",
       "      <td>0.026316</td>\n",
       "      <td>3914.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    soma_joinid            obs_id  n_genes  percent_mito  n_counts  \\\n",
       "0             0  AAACATACAACCAC-1      781      0.030178    2419.0   \n",
       "1             1  AAACATTGAGCTAC-1     1352      0.037936    4903.0   \n",
       "2             2  AAACATTGATCAGC-1     1131      0.008897    3147.0   \n",
       "3             3  AAACCGTGCTTCCG-1      960      0.017431    2639.0   \n",
       "4             4  AAACCGTGTATGCG-1      522      0.012245     980.0   \n",
       "5             5  AAACGCACTGGTAC-1      782      0.016644    2163.0   \n",
       "6             6  AAACGCTGACCAGT-1      783      0.038161    2175.0   \n",
       "7             7  AAACGCTGGTTCTT-1      790      0.030973    2260.0   \n",
       "8             8  AAACGCTGTAGCCA-1      533      0.011765    1275.0   \n",
       "9             9  AAACGCTGTTTCTG-1      550      0.029012    1103.0   \n",
       "10           10  AAACTTGAAAAACG-1     1116      0.026316    3914.0   \n",
       "\n",
       "              louvain  \n",
       "0         CD4 T cells  \n",
       "1             B cells  \n",
       "2         CD4 T cells  \n",
       "3     CD14+ Monocytes  \n",
       "4            NK cells  \n",
       "5         CD8 T cells  \n",
       "6         CD8 T cells  \n",
       "7         CD8 T cells  \n",
       "8         CD4 T cells  \n",
       "9   FCGR3A+ Monocytes  \n",
       "10            B cells  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read((slice(0,10),)).concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be2bfd77-d54c-4181-a962-6c00610c122a",
   "metadata": {
    "tags": []
   },
   "source": [
    "We can also select a subset of the columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "703fe8ad-7123-4311-a58b-b00a27c7a483",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AAACATACAACCAC-1</td>\n",
       "      <td>781</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AAACATTGAGCTAC-1</td>\n",
       "      <td>1352</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AAACATTGATCAGC-1</td>\n",
       "      <td>1131</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AAACCGTGCTTCCG-1</td>\n",
       "      <td>960</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AAACCGTGTATGCG-1</td>\n",
       "      <td>522</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>AAACGCACTGGTAC-1</td>\n",
       "      <td>782</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>AAACGCTGACCAGT-1</td>\n",
       "      <td>783</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>AAACGCTGGTTCTT-1</td>\n",
       "      <td>790</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>AAACGCTGTAGCCA-1</td>\n",
       "      <td>533</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>AAACGCTGTTTCTG-1</td>\n",
       "      <td>550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>AAACTTGAAAAACG-1</td>\n",
       "      <td>1116</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              obs_id  n_genes\n",
       "0   AAACATACAACCAC-1      781\n",
       "1   AAACATTGAGCTAC-1     1352\n",
       "2   AAACATTGATCAGC-1     1131\n",
       "3   AAACCGTGCTTCCG-1      960\n",
       "4   AAACCGTGTATGCG-1      522\n",
       "5   AAACGCACTGGTAC-1      782\n",
       "6   AAACGCTGACCAGT-1      783\n",
       "7   AAACGCTGGTTCTT-1      790\n",
       "8   AAACGCTGTAGCCA-1      533\n",
       "9   AAACGCTGTTTCTG-1      550\n",
       "10  AAACTTGAAAAACG-1     1116"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read((slice(0, 10),), column_names=[\"obs_id\", \"n_genes\"]).concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67e2b4f4-ba5b-4a9c-a847-9da954b4c467",
   "metadata": {
    "tags": []
   },
   "source": [
    "Finally, we can use `value_filter` to retrieve a filtered subset of rows that match a certain condition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a5ef3a97-abc3-4d80-ab48-1898fa64d566",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>26</td>\n",
       "      <td>AAATCAACCCTATT-1</td>\n",
       "      <td>1545</td>\n",
       "      <td>0.024313</td>\n",
       "      <td>5676.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>59</td>\n",
       "      <td>AACCTACTGTGAGG-1</td>\n",
       "      <td>1652</td>\n",
       "      <td>0.015839</td>\n",
       "      <td>5682.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>107</td>\n",
       "      <td>AAGCACTGGTTCTT-1</td>\n",
       "      <td>1717</td>\n",
       "      <td>0.023566</td>\n",
       "      <td>6153.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>109</td>\n",
       "      <td>AAGCCATGAACTGC-1</td>\n",
       "      <td>1877</td>\n",
       "      <td>0.014015</td>\n",
       "      <td>7064.0</td>\n",
       "      <td>Dendritic cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>247</td>\n",
       "      <td>ACCCAGCTGTTAGC-1</td>\n",
       "      <td>1547</td>\n",
       "      <td>0.020600</td>\n",
       "      <td>5534.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>70</th>\n",
       "      <td>2508</td>\n",
       "      <td>TTACTCGACGCAAT-1</td>\n",
       "      <td>1603</td>\n",
       "      <td>0.024851</td>\n",
       "      <td>5030.0</td>\n",
       "      <td>Dendritic cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71</th>\n",
       "      <td>2530</td>\n",
       "      <td>TTATGGCTTATGGC-1</td>\n",
       "      <td>1783</td>\n",
       "      <td>0.022064</td>\n",
       "      <td>6164.0</td>\n",
       "      <td>Dendritic cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72</th>\n",
       "      <td>2597</td>\n",
       "      <td>TTGAGGACTACGCA-1</td>\n",
       "      <td>1794</td>\n",
       "      <td>0.024440</td>\n",
       "      <td>6342.0</td>\n",
       "      <td>Dendritic cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>73</th>\n",
       "      <td>2623</td>\n",
       "      <td>TTTAGCTGTACTCT-1</td>\n",
       "      <td>1567</td>\n",
       "      <td>0.021160</td>\n",
       "      <td>5671.0</td>\n",
       "      <td>Dendritic cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>74</th>\n",
       "      <td>2632</td>\n",
       "      <td>TTTCGAACACCTGA-1</td>\n",
       "      <td>1544</td>\n",
       "      <td>0.013019</td>\n",
       "      <td>4455.0</td>\n",
       "      <td>Dendritic cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>75 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    soma_joinid            obs_id  n_genes  percent_mito  n_counts  \\\n",
       "0            26  AAATCAACCCTATT-1     1545      0.024313    5676.0   \n",
       "1            59  AACCTACTGTGAGG-1     1652      0.015839    5682.0   \n",
       "2           107  AAGCACTGGTTCTT-1     1717      0.023566    6153.0   \n",
       "3           109  AAGCCATGAACTGC-1     1877      0.014015    7064.0   \n",
       "4           247  ACCCAGCTGTTAGC-1     1547      0.020600    5534.0   \n",
       "..          ...               ...      ...           ...       ...   \n",
       "70         2508  TTACTCGACGCAAT-1     1603      0.024851    5030.0   \n",
       "71         2530  TTATGGCTTATGGC-1     1783      0.022064    6164.0   \n",
       "72         2597  TTGAGGACTACGCA-1     1794      0.024440    6342.0   \n",
       "73         2623  TTTAGCTGTACTCT-1     1567      0.021160    5671.0   \n",
       "74         2632  TTTCGAACACCTGA-1     1544      0.013019    4455.0   \n",
       "\n",
       "            louvain  \n",
       "0       CD4 T cells  \n",
       "1   CD14+ Monocytes  \n",
       "2           B cells  \n",
       "3   Dendritic cells  \n",
       "4   CD14+ Monocytes  \n",
       "..              ...  \n",
       "70  Dendritic cells  \n",
       "71  Dendritic cells  \n",
       "72  Dendritic cells  \n",
       "73  Dendritic cells  \n",
       "74  Dendritic cells  \n",
       "\n",
       "[75 rows x 6 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read((slice(None),), value_filter=\"n_genes > 1500\").concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b60c290-7ce8-4324-9694-2b76b802dd9a",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Collection"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e961b2f8-5e77-4c40-be87-4283fe9da010",
   "metadata": {},
   "source": [
    "A `Collection` is a persistent container of named SOMA objects, stored as a mapping of string keys and SOMA object values.\n",
    "\n",
    "The `ms` member in an Experiment is implemented as a Collection. Let's take a look:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d437b606-8338-4220-966d-59c4bf48fd13",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Collection 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms' (open for 'r') (2 items)\n",
       "    'raw': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw' (unopened)\n",
       "    'RNA': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA' (unopened)>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment.ms"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c02367a7-425d-4135-993c-1dd880b394c5",
   "metadata": {
    "tags": []
   },
   "source": [
    "In this case, we have two members: `raw` and `test_exp_name`. They can be accessed as they were dict members:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "0574abf8-5f72-4a05-a90f-608fdda2db07",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Measurement 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw' (open for 'r') (2 items)\n",
       "    'X': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw/X' (unopened)\n",
       "    'var': 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/raw/var' (unopened)>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment.ms[\"raw\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d87abc9-08d2-4a15-b9f7-eb9ed4f9791e",
   "metadata": {
    "tags": []
   },
   "source": [
    "## DenseNDArray"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d8e320c-1484-4334-948f-9852d8e23f47",
   "metadata": {},
   "source": [
    "A ``DenseNDArray`` is a dense, N-dimensional array, with offset (zero-based) integer indexing on each dimension. \n",
    "\n",
    "`DenseNDArray` has a user-defined schema, which includes:\n",
    "- the element type, expressed as an Arrow type, indicating the type of data contained within the array, and\n",
    "- the shape of the array, i.e., the number of dimensions and the length of each dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9d92b04-6556-4ec2-8efc-4906a632fdea",
   "metadata": {},
   "source": [
    "In a SOMA single cell experiment, the cell by gene matrix X is typically represented either by `DenseNDArray` or `SparseNDArray`. Let's take a look at our example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "c8c2aa17-52d7-4bd5-a5f3-b58c18fdcb11",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Collection 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA/X' (open for 'r') (1 item)\n",
       "    'data': DenseNDArray 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA/X/data' (open for 'r')>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = experiment[\"ms\"][\"RNA\"].X\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f67613f-6075-4bb2-9e47-885dcfeb313e",
   "metadata": {},
   "source": [
    "Within the experiment, `X` is a `Collection` and the data can be accessed using `[\"data\"]`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "75035918-b26f-48b2-a47b-8ea08c308e37",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<DenseNDArray 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/dense/pbmc3k/ms/RNA/X/data' (open for 'r')>"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = X[\"data\"]\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2b161fe-96f9-4c07-843d-211e63bba818",
   "metadata": {
    "tags": []
   },
   "source": [
    "We can inspect the `DenseNDArray` and get useful information by using `.schema`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "b12b2c0e-db32-48f1-b114-f867baf5be76",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "soma_dim_0: int64\n",
       "soma_dim_1: int64\n",
       "soma_data: float"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.schema"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1dea46bf-69b4-4df0-9168-c01548d488e4",
   "metadata": {
    "tags": []
   },
   "source": [
    "In this case, we see there are two dimensions and the data is of type `float`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d7b9e04-abed-492c-8ff2-958f0c922ca0",
   "metadata": {},
   "source": [
    "We can see the shape of the matrix by calling `.shape`. In this case, since this represents a dense matrix, this will be the exact size of the matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "27858aed-aa89-4c45-bccc-38e9cfa5cbb2",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2638, 1838)"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6feae7b-bb6e-4029-85a8-a5375bc53b86",
   "metadata": {
    "tags": []
   },
   "source": [
    "Similarly to `DataFrame`, when opening a `DenseNDArray` only metadata is fetched, and the array isn't fetched into memory. \n",
    "\n",
    "We can convert the matrix into a `pyarrow.Tensor` using `.read()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "2c586c5c-055b-4bc7-9995-851dd802d961",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pyarrow.Tensor>\n",
       "type: float\n",
       "shape: (2638, 1838)\n",
       "strides: (7352, 4)"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "209d9b42-d8d3-4d67-b3a0-b45daaef1ee3",
   "metadata": {
    "tags": []
   },
   "source": [
    "From here, we can convert it further to a `numpy.ndarray`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "6d90e592-7a67-4a41-af08-05aa3807167a",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,\n",
       "        -0.2090951 , -0.5312034 ],\n",
       "       [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,\n",
       "        -0.31314576, -0.5966544 ],\n",
       "       [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,\n",
       "        -0.17087643,  1.379     ],\n",
       "       ...,\n",
       "       [-0.2070895 , -0.250464  , -0.046397  , ..., -0.05114426,\n",
       "        -0.16106427,  2.0414972 ],\n",
       "       [-0.19032837, -0.2263336 , -0.04399938, ..., -0.00591773,\n",
       "        -0.13521303, -0.48211113],\n",
       "       [-0.33378917, -0.2535875 , -0.05271563, ..., -0.07842438,\n",
       "        -0.13032717, -0.4713379 ]], dtype=float32)"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.read().to_numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63e894c8-65fc-486d-b2b7-9b989f8e1cea",
   "metadata": {
    "tags": []
   },
   "source": [
    "This will only work on small matrices, since a `numpy` array needs to be in memory. \n",
    "\n",
    "We can retrieve a subset of the matrix passing coordinates to `.read()`. Here we're only retrieving the first 10 rows of the matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "4a3b1f45-017d-4b92-9f2e-c88e8e3aa234",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,\n",
       "        -0.2090951 , -0.5312034 ],\n",
       "       [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,\n",
       "        -0.31314576, -0.5966544 ],\n",
       "       [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,\n",
       "        -0.17087643,  1.379     ],\n",
       "       ...,\n",
       "       [-0.15813293, -0.27562705, -0.04569191, ..., -0.08687588,\n",
       "        -0.2062048 ,  1.6869122 ],\n",
       "       [ 4.861763  , -0.23054866, -0.04826924, ..., -0.02755091,\n",
       "        -0.11788268, -0.4664504 ],\n",
       "       [-0.12453113, -0.23373608, -0.04131226, ..., -0.00758654,\n",
       "        -0.16255915, -0.50339466]], dtype=float32)"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sliced_X = X.read((slice(0,9),)).to_numpy()\n",
    "sliced_X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "007f1e15-61cd-40b8-bb23-4102662ab3af",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(10, 1838)"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sliced_X.shape"
   ]
  },
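  {
   "cell_type": "markdown",
   "id": "5b0e7c1a-2f4d-4e8a-9c3b-1a7d6e2f9b10",
   "metadata": {},
   "source": [
    "The result has 10 rows because both endpoints of the slice are included, unlike Python and NumPy slices, which exclude the stop value. As a minimal sketch of that correspondence, the following uses plain NumPy on a toy array; the helper `closed_to_half_open` is a hypothetical name introduced here for illustration and is not part of `tiledbsoma`:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def closed_to_half_open(closed):\n",
    "    # Hypothetical helper: turn an inclusive slice into NumPy's half-open form.\n",
    "    stop = None if closed.stop is None else closed.stop + 1\n",
    "    return slice(closed.start, stop)\n",
    "\n",
    "\n",
    "# Toy stand-in for a dense matrix with 20 rows and 3 columns.\n",
    "toy_dense = np.arange(60, dtype=np.float32).reshape(20, 3)\n",
    "\n",
    "# slice(0, 9) read inclusively covers rows 0 through 9, i.e. 10 rows.\n",
    "toy_rows = toy_dense[closed_to_half_open(slice(0, 9))]\n",
    "print(toy_rows.shape)\n",
    "```\n",
    "\n",
    "With NumPy indexing alone, `toy_dense[0:9]` would return only 9 rows."
   ]
  },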
  {
   "cell_type": "markdown",
   "id": "84b630da-669f-4503-b0e0-6316d265608f",
   "metadata": {},
   "source": [
    "Note that `DenseNDArray` is always indexed, on each dimension, using zero-based integers. If this dimension matches any other object in the experiment, the `soma_joinid` column can be used to retrieve the correct slice.\n",
    "\n",
    "In the following example, we will get the values of X for the gene tagged as `ICOSLG`. This involves reading the `var` DataFrame using a `value_filter`, retrieving the `soma_joinid` for the gene and passing it as coordinate to `X.read`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "d6d39b44-33b3-4cb7-8a34-d30b94899ad1",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.12167774],\n",
       "       [-0.05866209],\n",
       "       [-0.07043106],\n",
       "       ...,\n",
       "       [-0.1320983 ],\n",
       "       [-0.14978862],\n",
       "       [-0.10383061]], dtype=float32)"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "var = experiment.ms[\"RNA\"].var\n",
    "idx = var.read(value_filter=\"var_id == 'ICOSLG'\").concat()[\"soma_joinid\"].to_numpy()\n",
    "\n",
    "X.read((None, int(idx[0]))).to_numpy()"
   ]
  },
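  {
   "cell_type": "markdown",
   "id": "8c2d4f6a-1b3e-4a7c-b5d9-3e0f7a9c2d11",
   "metadata": {},
   "source": [
    "The lookup above follows a general pattern: filter an annotation table, take the matching `soma_joinid`, and use it as a coordinate into the matrix. A minimal in-memory sketch of the same pattern with NumPy follows; the arrays and the gene names other than `ICOSLG` are made-up toy data, not values from pbmc3k:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Toy stand-ins (hypothetical data): two columns of a var table and a 4x3 matrix.\n",
    "toy_var_id = np.array(['GENE_A', 'ICOSLG', 'GENE_B'])\n",
    "toy_joinid = np.array([0, 1, 2])\n",
    "toy_matrix = np.arange(12, dtype=np.float32).reshape(4, 3)\n",
    "\n",
    "# Analogue of filtering var on var_id and keeping only soma_joinid.\n",
    "toy_idx = toy_joinid[toy_var_id == 'ICOSLG']\n",
    "\n",
    "# Analogue of reading all rows and a single column; the result stays 2-D.\n",
    "toy_column = toy_matrix[:, [int(toy_idx[0])]]\n",
    "print(toy_column.shape)\n",
    "```\n",
    "\n",
    "The join ID works as a column coordinate only because the second dimension of the matrix is aligned with `var`."
   ]
  },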
  {
   "cell_type": "markdown",
   "id": "3c33cc1d-25b9-4973-b076-f78dce246cdd",
   "metadata": {},
   "source": [
    "## SparseNDArray"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91416b83-72a9-47a8-824c-67eb8987d937",
   "metadata": {
    "tags": []
   },
   "source": [
    "A `SparseNDArray` is a sparse, N-dimensional array, with offset (zero-based) integer indexing on each dimension. `SparseNDArray` has a user-defined schema, which includes:\n",
    "- the element type, expressed as an Arrow type, indicating the type of data\n",
    "      contained within the array, and\n",
    "- the shape of the array, i.e., the number of dimensions and the length of\n",
    "      each dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf955720-db72-4536-b568-134e308d17e0",
   "metadata": {},
   "source": [
    "A `SparseNDArray` is functionally similar to a `DenseNDArray`, except that only elements that have a nonzero value are actually stored. Elements that are not explicitly stored are assumed to be zeros."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "836a3a97-901e-4428-8ee3-d93a49a1ee26",
   "metadata": {
    "tags": []
   },
   "source": [
    "As an example, we will load a version of pbmc3k that has been generated using a `SparseNDArray`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "71f8fed4-4ffd-4f30-a5b0-4e3a4a3730f3",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<SparseNDArray 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/sparse/pbmc3k/ms/RNA/X/data' (open for 'r')>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment = tiledbsoma.open(\"data/sparse/pbmc3k\")\n",
    "X = experiment.ms[\"RNA\"].X[\"data\"]\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76935223-eda5-48c0-89f5-26e5bfdf3628",
   "metadata": {},
   "source": [
    "Let's take a look at the schema:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "41897a5c-2225-49f9-b9f2-3a68a6ad8079",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "soma_dim_0: int64\n",
       "soma_dim_1: int64\n",
       "soma_data: float"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.schema"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff9fcfd3-b456-430c-9a98-6cd46d2dd9d2",
   "metadata": {},
   "source": [
    "This is the same as the `DenseNDArray` version, which makes sense since it's still a 2-dimensional matrix with `float` data.\n",
    "\n",
    "Let's look at the shape:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "42c1d852-6492-4a5e-b1fe-bc9af3f83639",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(9223372036854773760, 9223372036854773760)"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f91e73f-aa79-4fa2-a763-ad792e5641a8",
   "metadata": {},
   "source": [
    "Since sparse matrices are not represented as contiguous arrays in memory, they don't have a fixed size like a dense matrix would have. Instead, `.shape()` returns the _capacity_ of the matrix, which means that those are valid indices for reading/writing to that matrix. These are dependent on the capacity of the system rather than the current bounding box of the array.\n",
    "\n",
    "The closest concept to size for a `SparseNDArray` is the non-empty domain which can be defined as the largest coordinates that correspond to a nonzero value (across each dimension). There is currently no direct way to infer the nonzero domain of a `SparseNDArray` without materializing the array; however, `obs.count` and `var.count` provide these values."
   ]
  },
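  {
   "cell_type": "markdown",
   "id": "d7e1a9b3-4c5f-4d2e-8a6b-9f1c3e5a7b12",
   "metadata": {},
   "source": [
    "Once the coordinates of the stored elements have been materialized, the bounding box can be derived from them directly. A minimal sketch with NumPy, using made-up coordinate arrays in place of real data:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Toy coordinates (hypothetical) of four stored elements in a sparse matrix.\n",
    "toy_row_coords = np.array([0, 2, 2, 5])\n",
    "toy_col_coords = np.array([1, 0, 3, 3])\n",
    "\n",
    "# With zero-based indexing, the bounding box spans the largest coordinate plus one.\n",
    "toy_bounding_shape = (\n",
    "    int(toy_row_coords.max()) + 1,\n",
    "    int(toy_col_coords.max()) + 1,\n",
    ")\n",
    "print(toy_bounding_shape)\n",
    "```\n",
    "\n",
    "This approach has to read every coordinate, so for large arrays the counts from `obs` and `var` are likely the cheaper route."
   ]
  },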
  {
   "cell_type": "markdown",
   "id": "c3aa3b74-4c59-421c-96fe-215989103a41",
   "metadata": {
    "tags": []
   },
   "source": [
    "We can get the number of nonzero elements by calling `.nnz`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "2862f737-4f08-4886-9496-fe7771b4a581",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4848644"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.nnz"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b4f394f-8d6c-4ac5-9d45-997958b319a5",
   "metadata": {
    "tags": []
   },
   "source": [
    "In order to work with a `SparseNDArray`, we call `.read()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "eaa0f9aa-8167-4f26-a52f-4d9636dde37b",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tiledbsoma._sparse_nd_array.SparseNDArrayRead at 0x12cc3bdc0>"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd75223e-6378-4159-8478-0b970ee2d5a4",
   "metadata": {},
   "source": [
    "This returns a SparseNDArrayRead that can be used for getting iterators. For instance, we can do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "00a7899f-2d28-4f07-b438-ab4d4d6bcfe5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "tensor = X.read().coos().concat()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "127721ae-4c0b-42ac-ac9e-20dd9e62a682",
   "metadata": {
    "tags": []
   },
   "source": [
    "This returns an [Arrow Tensor](https://arrow.apache.org/docs/cpp/api/tensor.html) that can be used to access the array, or convert it further to different formats. For instance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "f62472c6-e67c-44f0-8ed2-df9bec3ae3e8",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<9223372036854773760x9223372036854773760 sparse matrix of type '<class 'numpy.float32'>'\n",
       "\twith 4848644 stored elements in COOrdinate format>"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor.to_scipy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4f47d52-2883-4515-afc2-d3f9d9d4ad31",
   "metadata": {
    "tags": []
   },
   "source": [
    "can be used to transform it to a [SciPy coo_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html). "
   ]
  },
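  {
   "cell_type": "markdown",
   "id": "e4a8c2f6-9d1b-4f3a-a7c5-2b6d8f0e4a13",
   "metadata": {},
   "source": [
    "As the output shows, the resulting matrix reports the array's capacity as its shape. One possible way to get a matrix with a practical shape is to rebuild it from its coordinates with an explicit `shape`. The sketch below uses SciPy on toy data; `toy_n_obs` and `toy_n_var` are hypothetical stand-ins for the observation and variable counts:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy import sparse\n",
    "\n",
    "# Toy COO matrix whose declared shape is far larger than its contents.\n",
    "toy_data = np.array([1.5, 2.5, 3.5], dtype=np.float32)\n",
    "toy_i = np.array([0, 1, 2])\n",
    "toy_j = np.array([0, 3, 1])\n",
    "toy_wide = sparse.coo_matrix(\n",
    "    (toy_data, (toy_i, toy_j)), shape=(1_000_000, 1_000_000)\n",
    ")\n",
    "\n",
    "# Hypothetical counts standing in for the number of observations and variables.\n",
    "toy_n_obs, toy_n_var = 3, 4\n",
    "\n",
    "# Rebuild with an explicit shape, then convert to CSR for row slicing.\n",
    "toy_compact = sparse.coo_matrix(\n",
    "    (toy_wide.data, (toy_wide.row, toy_wide.col)), shape=(toy_n_obs, toy_n_var)\n",
    ").tocsr()\n",
    "print(toy_compact.shape, toy_compact.nnz)\n",
    "```\n",
    "\n",
    "The explicit shape must be at least as large as the largest stored coordinate plus one on each dimension, otherwise SciPy rejects the coordinates."
   ]
  },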
  {
   "cell_type": "markdown",
   "id": "a064c7b3-5ac5-4b0f-855a-0ffee47a709c",
   "metadata": {
    "tags": []
   },
   "source": [
    "Similarly to `DenseNDArray`s, we can call `.read()` with a slice to only obtain a subset of the matrix. As an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "id": "d5d9ca87-58cc-44bf-ba48-7e2bf3b6c5a7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<9223372036854773760x9223372036854773760 sparse matrix of type '<class 'numpy.float32'>'\n",
       "\twith 18380 stored elements in COOrdinate format>"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sliced_X = X.read((slice(0,9),)).coos().concat().to_scipy()\n",
    "sliced_X"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4dbf334-e525-45a2-b1cf-8531d064a89c",
   "metadata": {
    "tags": []
   },
   "source": [
    "Let's verify that the slice is correct. To do that, we can call `nonzero()` on the `scipy.sparse.coo_matrix` to obtain the coordinates of the nonzero items, and look at the coordinates over the first dimension:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "d542f63e-7ca8-4e68-8933-cb15f17bc8cb",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 1, 2, ..., 7, 8, 9])"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sliced_X.nonzero()[0]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}