{
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.9.15",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    }
  },
  "nbformat_minor": 5,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "markdown",
      "source": "# Tutorial: SOMA Experiment queries",
      "metadata": {
        "tags": []
      },
      "id": "2b8e72a7-129c-422c-b955-350fb9ee0541"
    },
    {
      "cell_type": "code",
      "source": "import tiledbsoma as soma",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 3,
      "outputs": [],
      "id": "3a5fd5d3"
    },
    {
      "cell_type": "markdown",
      "source": "In this notebook, we'll take a quick look at the SOMA experiment-query API. The dataset used is from Peripheral Blood Mononuclear Cells (PBMC), which is freely available from 10X Genomics.\n",
      "metadata": {
        "tags": []
      },
      "id": "ccc8709a"
    },
    {
      "cell_type": "code",
      "source": "exp = soma.Experiment.open('data/sparse/pbmc3k')",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 4,
      "outputs": [],
      "id": "9b8851d9-27f1-437b-a070-b41a65a5609e"
    },
    {
      "cell_type": "markdown",
      "source": "Using the keys of the `obs` dataframe, we can see what fields are available to query on.",
      "metadata": {
        "tags": []
      },
      "id": "fab7898c"
    },
    {
      "cell_type": "code",
      "source": "exp.obs.keys()",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 5,
      "outputs": [
        {
          "execution_count": 5,
          "output_type": "execute_result",
          "data": {
            "text/plain": "('soma_joinid', 'obs_id', 'n_genes', 'percent_mito', 'n_counts', 'louvain')"
          },
          "metadata": {}
        }
      ],
      "id": "d67dfbc6-0382-4acc-8c56-3670549654f8"
    },
    {
      "cell_type": "code",
      "source": "p = exp.obs.read(column_names=['louvain']).concat().to_pandas()\np",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 6,
      "outputs": [
        {
          "execution_count": 6,
          "output_type": "execute_result",
          "data": {
            "text/plain": "              louvain\n0         CD4 T cells\n1             B cells\n2         CD4 T cells\n3     CD14+ Monocytes\n4            NK cells\n...               ...\n2633  CD14+ Monocytes\n2634          B cells\n2635          B cells\n2636          B cells\n2637      CD4 T cells\n\n[2638 rows x 1 columns]",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>louvain</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>CD4 T cells</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>B cells</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>CD4 T cells</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>CD14+ Monocytes</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>NK cells</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>2633</th>\n      <td>CD14+ Monocytes</td>\n    </tr>\n    <tr>\n      <th>2634</th>\n      <td>B cells</td>\n    </tr>\n    <tr>\n      <th>2635</th>\n      <td>B cells</td>\n    </tr>\n    <tr>\n      <th>2636</th>\n      <td>B cells</td>\n    </tr>\n    <tr>\n      <th>2637</th>\n      <td>CD4 T cells</td>\n    </tr>\n  </tbody>\n</table>\n<p>2638 rows × 1 columns</p>\n</div>"
          },
          "metadata": {}
        }
      ],
      "id": "9e4ede09-2303-4c21-92c1-bf42ed4e7dd1"
    },
    {
      "cell_type": "markdown",
      "source": "Focusing on the `louvain` column, we can now find out what column values are present in the data.",
      "metadata": {
        "tags": []
      },
      "id": "f305fb7c"
    },
    {
      "cell_type": "code",
      "source": "p.groupby('louvain').size().sort_values()",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 7,
      "outputs": [
        {
          "execution_count": 7,
          "output_type": "execute_result",
          "data": {
            "text/plain": "louvain\nMegakaryocytes         15\nDendritic cells        37\nFCGR3A+ Monocytes     150\nNK cells              154\nCD8 T cells           316\nB cells               342\nCD14+ Monocytes       480\nCD4 T cells          1144\ndtype: int64"
          },
          "metadata": {}
        }
      ],
      "id": "00f1ccad-3ee2-4947-8961-8bf9642fbbba"
    },
    {
      "cell_type": "markdown",
      "source": "Now we can query the SOMA experiment -- here, by a few cell types.",
      "metadata": {
        "tags": []
      },
      "id": "fda99535"
    },
    {
      "cell_type": "code",
      "source": "obs_query = soma.AxisQuery(value_filter='louvain in [\"B cells\", \"NK cells\"]')",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 8,
      "outputs": [],
      "id": "e2ed76ca-5821-44c5-a220-ff96568686ec"
    },
    {
      "cell_type": "code",
      "source": "query = exp.axis_query(\"RNA\", obs_query=obs_query)",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 9,
      "outputs": [],
      "id": "f3af70bc-3817-453c-a18c-56dc9aa874da"
    },
    {
      "cell_type": "markdown",
      "source": "Note that the query output is smaller than the original dataset's size -- since we've queried for only a particular pair of cell types.",
      "metadata": {
        "tags": []
      },
      "id": "fb94d898"
    },
    {
      "cell_type": "code",
      "source": "[exp.obs.count, exp.ms[\"RNA\"].var.count]",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 10,
      "outputs": [
        {
          "execution_count": 10,
          "output_type": "execute_result",
          "data": {
            "text/plain": "[2638, 1838]"
          },
          "metadata": {}
        }
      ],
      "id": "2c60568b-0789-4dbf-aff9-4bea2860aef4"
    },
    {
      "cell_type": "code",
      "source": "[query.n_obs, query.n_vars]",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 11,
      "outputs": [
        {
          "execution_count": 11,
          "output_type": "execute_result",
          "data": {
            "text/plain": "[496, 1838]"
          },
          "metadata": {}
        }
      ],
      "id": "28ed8d40-36c5-4642-bd8f-53d35c3074f0"
    },
    {
      "cell_type": "markdown",
      "source": "Here we can take a look at the X data.",
      "metadata": {
        "tags": []
      },
      "id": "c9625771"
    },
    {
      "cell_type": "code",
      "source": "query.X(\"data\").tables().concat().to_pandas()",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 12,
      "outputs": [
        {
          "execution_count": 12,
          "output_type": "execute_result",
          "data": {
            "text/plain": "        soma_dim_0  soma_dim_1  soma_data\n0                1           0  -0.214582\n1                1           1  -0.372653\n2                1           2  -0.054804\n3                1           3  -0.683391\n4                1           4   0.633951\n...            ...         ...        ...\n911643        2636        1833  -0.149789\n911644        2636        1834  -0.325824\n911645        2636        1835  -0.005918\n911646        2636        1836  -0.135213\n911647        2636        1837  -0.482111\n\n[911648 rows x 3 columns]",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>soma_dim_0</th>\n      <th>soma_dim_1</th>\n      <th>soma_data</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>0</td>\n      <td>-0.214582</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>1</td>\n      <td>-0.372653</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1</td>\n      <td>2</td>\n      <td>-0.054804</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>1</td>\n      <td>3</td>\n      <td>-0.683391</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>1</td>\n      <td>4</td>\n      <td>0.633951</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>911643</th>\n      <td>2636</td>\n      <td>1833</td>\n      <td>-0.149789</td>\n    </tr>\n    <tr>\n      <th>911644</th>\n      <td>2636</td>\n      <td>1834</td>\n      <td>-0.325824</td>\n    </tr>\n    <tr>\n      <th>911645</th>\n      <td>2636</td>\n      <td>1835</td>\n      <td>-0.005918</td>\n    </tr>\n    <tr>\n      <th>911646</th>\n      <td>2636</td>\n      <td>1836</td>\n      <td>-0.135213</td>\n    </tr>\n    <tr>\n      <th>911647</th>\n      <td>2636</td>\n      <td>1837</td>\n      <td>-0.482111</td>\n    </tr>\n  </tbody>\n</table>\n<p>911648 rows × 3 columns</p>\n</div>"
          },
          "metadata": {}
        }
      ],
      "id": "65063167-5015-497a-9712-d72c0ecac2ed"
    },
    {
      "cell_type": "markdown",
      "source": "To finish out this introductory look at the experiment-query API, we can convert our query outputs to AnnData format.",
      "metadata": {
        "tags": []
      },
      "id": "db7af8b8"
    },
    {
      "cell_type": "code",
      "source": "adata = query.to_anndata(X_name=\"data\")",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 13,
      "outputs": [],
      "id": "1ed8510b-343a-4f88-8aae-11a5c2069311"
    },
    {
      "cell_type": "code",
      "source": "adata",
      "metadata": {
        "tags": [],
        "trusted": true
      },
      "execution_count": 14,
      "outputs": [
        {
          "execution_count": 14,
          "output_type": "execute_result",
          "data": {
            "text/plain": "AnnData object with n_obs × n_vars = 496 × 1838\n    obs: 'soma_joinid', 'obs_id', 'n_genes', 'percent_mito', 'n_counts', 'louvain'\n    var: 'soma_joinid', 'var_id', 'n_cells'"
          },
          "metadata": {}
        }
      ],
      "id": "b3118504-8c92-48d4-9b83-87176960e4f1"
    },
    {
      "cell_type": "code",
      "source": "",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": [],
      "id": "f2e46ce1-cf7a-43c2-9f0d-bf918fd806bc"
    }
  ]
}