tiledbsoma.DataFrame

class tiledbsoma.DataFrame(handle: _WrapperType_co | DataFrameWrapper | DenseNDArrayWrapper | SparseNDArrayWrapper, *, _dont_call_this_use_create_or_open_instead: str = 'unset')

DataFrame is a multi-column table with a user-defined schema. The schema is expressed as an Arrow Schema, and defines the column names and value types.

Every DataFrame must contain a column called soma_joinid, of type int64, with negative values explicitly disallowed. The soma_joinid column contains a unique value for each row in the dataframe, and in some cases (e.g., as part of an Experiment), acts as a join key for other objects, such as SparseNDArray.
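The constraints on soma_joinid can be pictured with a small pure-Python check. This is only a sketch: check_soma_joinids is a hypothetical helper written for illustration, not part of tiledbsoma, and it assumes the IDs are available as a plain Python sequence.

```python
from typing import Sequence

INT64_MAX = 2**63 - 1  # largest value an int64 column can hold


def check_soma_joinids(joinids: Sequence[int]) -> None:
    """Hypothetical helper: raise ValueError if values break the soma_joinid rules."""
    seen = set()
    for value in joinids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError(f"soma_joinid must be an integer, got {value!r}")
        if value < 0 or value > INT64_MAX:
            raise ValueError(f"soma_joinid must be a non-negative int64, got {value}")
        if value in seen:
            raise ValueError(f"duplicate soma_joinid: {value}")
        seen.add(value)


check_soma_joinids([0, 1, 2])  # valid: unique, non-negative, within int64 range
```

A check of this kind, run before writing, would surface duplicate or negative IDs early rather than at write time.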

Lifecycle

Maturing.

Examples

>>> import pyarrow as pa
>>> import tiledbsoma
>>> schema = pa.schema(
...     [
...         ("soma_joinid", pa.int64()),
...         ("A", pa.float32()),
...         ("B", pa.large_string()),
...     ]
... )
>>> with tiledbsoma.DataFrame.create("./test_dataframe", schema=schema) as df:
...     data = pa.Table.from_pydict(
...         {
...             "soma_joinid": [0, 1, 2],
...             "A": [1.0, 2.7182, 3.1214],
...             "B": ["one", "e", "pi"],
...         }
...     )
...     df.write(data)
>>> with tiledbsoma.DataFrame.open("./test_dataframe") as df:
...     print(df.schema)
...     print("---")
...     print(df.read().concat().to_pandas())
...
soma_joinid: int64
A: float
B: large_string
---
   soma_joinid       A    B
0            0  1.0000  one
1            1  2.7182    e
2            2  3.1214   pi

>>> import pyarrow as pa
>>> import tiledbsoma
>>> schema = pa.schema(
...     [
...         ("soma_joinid", pa.int64()),
...         ("A", pa.float32()),
...         ("B", pa.large_string()),
...     ]
... )
>>> with tiledbsoma.DataFrame.create(
...     "./test_dataframe_2",
...     schema=schema,
...     index_column_names=["A", "B"],
...     domain=[(0.0, 10.0), None],
... ) as df:
...     data = pa.Table.from_pydict(
...         {
...             "soma_joinid": [0, 1, 2],
...             "A": [1.0, 2.7182, 3.1214],
...             "B": ["one", "e", "pi"],
...         }
...     )
...     df.write(data)
>>> with tiledbsoma.DataFrame.open("./test_dataframe_2") as df:
...     print(df.schema)
...     print("---")
...     print(df.read().concat().to_pandas())
...
soma_joinid: int64
---
        A    B  soma_joinid
0  1.0000  one            0
1  2.7182    e            1
2  3.1214   pi            2

In the second example, the index column names are specified explicitly. The domain is optional: if it is omitted, each index column defaults to the largest possible domain for its datatype. If the domain is specified, it must be a tuple or list with the same length as index_column_names. Any slot may be None, meaning the largest possible domain is used for that column. For string/bytes columns, the slot must be None.
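The domain rules can be restated as a short validation sketch. The helper check_domain and its string_columns parameter are hypothetical and exist only for illustration; tiledbsoma performs its own validation, and this code does not call the library.

```python
from typing import Optional, Sequence, Tuple

Slot = Optional[Tuple[float, float]]  # one (low, high) pair, or None


def check_domain(
    index_column_names: Sequence[str],
    domain: Optional[Sequence[Slot]],
    string_columns: Sequence[str] = (),
) -> None:
    """Hypothetical helper mirroring the domain rules stated in the text."""
    if domain is None:
        return  # omitted: every index column gets its largest possible domain
    if len(domain) != len(index_column_names):
        raise ValueError("domain must have one slot per index column")
    for name, slot in zip(index_column_names, domain):
        if slot is None:
            continue  # None: largest possible domain for this column
        if name in string_columns:
            raise ValueError(f"domain slot for string/bytes column {name!r} must be None")


# Same arguments as the second example: "A" is float32, "B" is a string column.
check_domain(["A", "B"], [(0.0, 10.0), None], string_columns=["B"])
```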

__init__(handle: _WrapperType_co | DataFrameWrapper | DenseNDArrayWrapper | SparseNDArrayWrapper, *, _dont_call_this_use_create_or_open_instead: str = 'unset')

Internal-only common initializer steps.

This function is internal; users should open TileDB SOMA objects using the create() and open() factory class methods.

Methods

__init__(handle, *[, ...])

Internal-only common initializer steps.

close()

Release any resources held while the object is open.

create(uri, *, schema[, index_column_names, ...])

Creates the data structure on disk/S3/cloud.

exists(uri[, context, tiledb_timestamp])

Finds whether an object of this type exists at the given URI.

keys()

Returns the names of the columns when read back as a dataframe.

non_empty_domain()

Retrieves the non-empty domain for each dimension, namely the smallest and largest indices in each dimension for which the dataframe contains data.

open(uri[, mode, tiledb_timestamp, context, ...])

Opens this specific type of SOMA object.

read([coords, column_names, result_order, ...])

Reads a user-defined subset of data, addressed by the dataframe indexing columns and optionally filtered, and returns the results as one or more Arrow tables.

reopen(mode[, tiledb_timestamp])

Returns a new copy of the SOMAObject with the given mode at the current Unix timestamp.

verify_open_for_writing()

Raises an error if the object is not open for writing.

write(values[, platform_config])

Writes an Arrow table to the persistent object.
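As a rough illustration of how exists(), open(), and read() fit together, the sketch below reads a filtered subset of the dataframe written in the first example. The function name read_subset and the filter expression are illustrative choices. The value_filter keyword is taken from the full read() signature, which is abbreviated in the summary above. The sketch assumes the dataframe is indexed on soma_joinid, as in the first example.

```python
def read_subset(uri: str = "./test_dataframe"):
    """Read rows with soma_joinid 0 through 1, keeping two columns, as pandas."""
    import tiledbsoma  # imported lazily so defining the function has no side effects

    if not tiledbsoma.DataFrame.exists(uri):
        raise FileNotFoundError(f"no SOMA DataFrame at {uri}")

    with tiledbsoma.DataFrame.open(uri) as df:  # opens in read mode by default
        tables = df.read(
            coords=(slice(0, 1),),  # one entry per index column; bounds are inclusive
            column_names=["soma_joinid", "A"],
            value_filter="A > 0.5",  # illustrative filter on a non-index column
        )
        # read() yields one or more Arrow tables; concat() merges them into one.
        return tables.concat().to_pandas()
```

Calling read_subset() after running the first example should return only the rows that satisfy both the coordinate slice and the filter.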

Attributes

closed

True if the object has been closed.

context

A value storing implementation-specific configuration information.

count

Returns the number of rows in the dataframe.

domain

Returns a tuple of minimum and maximum values, inclusive, storable on each index column of the dataframe.

index_column_names

Returns index (dimension) column names.

maxdomain

Returns a tuple of minimum and maximum values, inclusive, storable on each index column of the dataframe; this is the upper limit to which domain can be resized.

metadata

The metadata of this SOMA object.

mode

The mode this object was opened in, either r or w.

schema

Returns data schema, in the form of an Arrow Schema.

soma_type

A string describing the SOMA type of this object.

tiledb_timestamp

The time that this object was opened in UTC.

tiledb_timestamp_ms

The time this object was opened, in milliseconds since the Unix epoch.

tiledbsoma_has_upgraded_domain

Returns True if the array has the upgraded resizable domain feature from TileDB-SOMA 1.15: either the array was created with this support, or it has had .tiledbsoma_upgrade_domain applied to it.

uri

Accessor for the object's storage URI.
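The two timestamp attributes above describe the same instant in different forms. As a minimal sketch of how a value like tiledb_timestamp_ms maps to a UTC datetime, the hypothetical helper ms_to_utc below uses only the standard library; the input value is an arbitrary example, not one read from a real object.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(timestamp_ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    # Integer timedelta arithmetic avoids floating-point rounding of the milliseconds.
    return EPOCH + timedelta(milliseconds=timestamp_ms)


# Arbitrary illustrative value, standing in for an object's tiledb_timestamp_ms.
print(ms_to_utc(1_700_000_000_000).isoformat())
```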