tiledbsoma.DataFrame

class tiledbsoma.DataFrame(handle: _WrapperType_co | DataFrameWrapper | DenseNDArrayWrapper | SparseNDArrayWrapper, *, _dont_call_this_use_create_or_open_instead: str = 'unset')

DataFrame is a multi-column table with a user-defined schema. The schema is expressed as an Arrow Schema, and defines the column names and value types.

Every DataFrame must contain a column called soma_joinid, of type int64, in which negative values are explicitly disallowed. The soma_joinid column contains a unique value for each row in the dataframe and, in some cases (e.g., as part of an Experiment), acts as a join key for other objects, such as SparseNDArray.
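The constraints on soma_joinid can be pictured with a small stand-alone check. This is a minimal sketch in plain Python, not the library's own validation logic: the helper name check_soma_joinids is hypothetical, and the upper bound is simply the largest value an int64 can hold.

```python
INT64_MAX = 2**63 - 1  # largest value an int64 column can hold


def check_soma_joinids(joinids):
    """Hypothetical helper: raise ValueError unless the values fit the
    soma_joinid rules described above (int64, non-negative, unique per row)."""
    seen = set()
    for value in joinids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"soma_joinid must be an integer, got {value!r}")
        if value < 0:
            raise ValueError(f"soma_joinid must be non-negative, got {value}")
        if value > INT64_MAX:
            raise ValueError(f"soma_joinid does not fit in int64: {value}")
        if value in seen:
            raise ValueError(f"soma_joinid values must be unique, {value} is repeated")
        seen.add(value)


check_soma_joinids([0, 1, 2])  # passes silently
```

A check like this could be run on the Python lists that go into pa.Table.from_pydict before calling write.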

Lifecycle

Experimental.

Examples

>>> import pyarrow as pa
>>> import tiledbsoma
>>> schema = pa.schema(
...     [
...         ("soma_joinid", pa.int64()),
...         ("A", pa.float32()),
...         ("B", pa.large_string()),
...     ]
... )
>>> with tiledbsoma.DataFrame.create("./test_dataframe", schema=schema) as df:
...     data = pa.Table.from_pydict(
...         {
...             "soma_joinid": [0, 1, 2],
...             "A": [1.0, 2.7182, 3.1416],
...             "B": ["one", "e", "pi"],
...         }
...     )
...     df.write(data)
>>> with tiledbsoma.DataFrame.open("./test_dataframe") as df:
...     print(df.schema)
...     print("---")
...     print(df.read().concat().to_pandas())
...
soma_joinid: int64
A: float
B: large_string
---
   soma_joinid       A    B
0            0  1.0000  one
1            1  2.7182    e
2            2  3.1416   pi
>>> import pyarrow as pa
>>> import tiledbsoma
>>> schema = pa.schema(
...     [
...         ("soma_joinid", pa.int64()),
...         ("A", pa.float32()),
...         ("B", pa.large_string()),
...     ]
... )
>>> with tiledbsoma.DataFrame.create(
...     "./test_dataframe_2",
...     schema=schema,
...     index_column_names=["A", "B"],
...     domain=[(0.0, 10.0), None],
... ) as df:
...     data = pa.Table.from_pydict(
...         {
...             "soma_joinid": [0, 1, 2],
...             "A": [1.0, 2.7182, 3.1416],
...             "B": ["one", "e", "pi"],
...         }
...     )
...     df.write(data)
>>> with tiledbsoma.DataFrame.open("./test_dataframe_2") as df:
...     print(df.schema)
...     print("---")
...     print(df.read().concat().to_pandas())
...
soma_joinid: int64
---
        A    B  soma_joinid
0  1.0000  one            0
1  2.7182    e            1
2  3.1416   pi            2

In the second example, the index column names are specified explicitly. The domain is optional: if it is omitted, each index column defaults to the largest domain its datatype allows. If the domain is specified, it must be a tuple or list of the same length as index_column_names. Any slot may be None, meaning the largest possible domain for that column. For string and bytes columns, the slot must be None.
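The domain rules above can be summarized as a short check. This is a sketch under stated assumptions, not the validation tiledbsoma itself performs: check_domain and its string_columns parameter are hypothetical names, and the lower-versus-upper bound comparison is an added sanity check that the text above does not mention.

```python
def check_domain(index_column_names, domain, string_columns=()):
    """Hypothetical helper mirroring the domain rules described above."""
    if domain is None:
        # Omitted: the largest possible domain applies to every index column.
        return
    if len(domain) != len(index_column_names):
        raise ValueError("domain must have one slot per index column")
    for name, slot in zip(index_column_names, domain):
        if slot is None:
            # None in a slot: largest possible domain for this column.
            continue
        if name in string_columns:
            raise ValueError(f"domain slot for string/bytes column {name!r} must be None")
        low, high = slot  # each non-None slot is a (minimum, maximum) pair
        if low > high:
            # Assumption: the lower bound must not exceed the upper bound.
            raise ValueError(f"domain for {name!r} has low > high: {slot!r}")


# The arguments from the second example satisfy the rules.
check_domain(["A", "B"], [(0.0, 10.0), None], string_columns={"B"})
```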

__init__(handle: _WrapperType_co | DataFrameWrapper | DenseNDArrayWrapper | SparseNDArrayWrapper, *, _dont_call_this_use_create_or_open_instead: str = 'unset')

Internal-only common initializer steps.

This function is internal; users should open TileDB SOMA objects using the create() and open() factory class methods.

Methods

__init__(handle, *[, ...])

Internal-only common initializer steps.

close()

Release any resources held while the object is open.

create(uri, *, schema[, index_column_names, ...])

Creates the data structure on disk/S3/cloud.

exists(uri[, context, tiledb_timestamp])

Finds whether an object of this type exists at the given URI.

keys()

Returns the names of the columns when read back as a dataframe.

non_empty_domain()

Retrieves the non-empty domain for each dimension, namely the smallest and largest indices in each dimension for which the array/dataframe contains data.

open(uri[, mode, tiledb_timestamp, context, ...])

Opens this specific type of SOMA object.

read([coords, column_names, result_order, ...])

Reads a user-defined subset of data, addressed by the dataframe indexing columns and optionally filtered, and returns the results as one or more Arrow tables.

verify_open_for_writing()

Raises an error if the object is not open for writing.

write(values[, platform_config])

Writes an Arrow table to the persistent object.
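To make the idea of a non-empty domain concrete, the following conceptual sketch computes the smallest and largest value present in each index column of some in-memory rows. It is plain Python and does not call tiledbsoma; the function name sketch_non_empty_domain and the sample rows are made up for illustration, and the sketch does not model an empty dataframe.

```python
def sketch_non_empty_domain(rows, index_column_names):
    """Conceptual stand-in: one (min, max) pair per index column,
    taken over the values actually present in `rows`."""
    if not rows:
        raise ValueError("this sketch requires at least one row")
    return tuple(
        (min(row[name] for row in rows), max(row[name] for row in rows))
        for name in index_column_names
    )


# Made-up sample data: three rows, not necessarily stored in sorted order.
rows = [
    {"soma_joinid": 0, "A": 1.0},
    {"soma_joinid": 5, "A": 3.5},
    {"soma_joinid": 2, "A": 2.25},
]

print(sketch_non_empty_domain(rows, ["soma_joinid"]))       # ((0, 5),)
print(sketch_non_empty_domain(rows, ["soma_joinid", "A"]))  # ((0, 5), (1.0, 3.5))
```

The domain attribute, by contrast, describes the range of values an index column is able to store, whether or not any row currently uses them.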

Attributes

closed

True if the object has been closed.

context

A value storing implementation-specific configuration information.

count

Returns the number of rows in the dataframe.

domain

Returns, for each index column of the dataframe, a tuple of the minimum and maximum values, inclusive, that the column can store.

index_column_names

Returns index (dimension) column names.

metadata

The metadata of this SOMA object.

mode

The mode this object was opened in, either r or w.

schema

Returns the data schema, in the form of an Arrow Schema.

soma_type

A string describing the SOMA type of this object.

tiledb_timestamp

The time at which this object was opened, in UTC.

tiledb_timestamp_ms

The time at which this object was opened, in milliseconds since the Unix epoch.

uri

Accessor for the object's storage URI.
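The two timestamp attributes describe the same instant in different forms. The sketch below shows the conversion between milliseconds since the Unix epoch and a UTC datetime using only the standard library. It assumes that tiledb_timestamp is a timezone-aware datetime; the helper names and the sample value are illustrative and are not read from a real object.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_utc(timestamp_ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    return UNIX_EPOCH + timedelta(milliseconds=timestamp_ms)


def utc_to_millis(moment: datetime) -> int:
    """Convert an aware datetime to whole milliseconds since the Unix epoch."""
    return (moment - UNIX_EPOCH) // timedelta(milliseconds=1)


opened_ms = 1_700_000_000_000  # example value only
opened = millis_to_utc(opened_ms)
print(opened.isoformat())      # 2023-11-14T22:13:20+00:00
print(utc_to_millis(opened))   # 1700000000000
```

Using timedelta arithmetic rather than floating-point division keeps the conversion exact at millisecond resolution.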