tiledbsoma.DataFrame

class tiledbsoma.DataFrame(handle: _WrapperType_co | DataFrameWrapper | DenseNDArrayWrapper | SparseNDArrayWrapper, *, _dont_call_this_use_create_or_open_instead: str = 'unset')

DataFrame is a multi-column table with a user-defined schema. The schema is expressed as an Arrow Schema, and defines the column names and value types.

Every DataFrame must contain a column called soma_joinid, of type int64, with negative values explicitly disallowed. The soma_joinid column contains a unique value for each row in the dataframe, and in some cases (e.g., as part of an Experiment), acts as a join key for other objects, such as SparseNDArray.
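The constraints on soma_joinid can be checked before data is written. The following is a minimal sketch in plain Python that mirrors only the rules stated above (integer values that fit in int64, non-negative, unique per row). The helper name `validate_joinids` is hypothetical and is not part of tiledbsoma.

```python
from typing import Iterable

INT64_MAX = 2**63 - 1  # largest value an int64 column can hold


def validate_joinids(joinids: Iterable[int]) -> None:
    """Raise ValueError if the values could not serve as a soma_joinid column.

    Hypothetical helper for illustration only; not a tiledbsoma function.
    """
    seen = set()
    for value in joinids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"soma_joinid must be an integer, got {value!r}")
        if value < 0:
            raise ValueError(f"soma_joinid must be non-negative, got {value}")
        if value > INT64_MAX:
            raise ValueError(f"soma_joinid {value} does not fit in int64")
        if value in seen:
            raise ValueError(f"soma_joinid {value} appears more than once")
        seen.add(value)


validate_joinids([0, 1, 2])  # valid input: returns None without raising
```

A check of this kind could run on the "soma_joinid" list before it is passed to pa.Table.from_pydict, as in the examples that follow.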

Lifecycle

Experimental.

Examples

>>> import pyarrow as pa
>>> import tiledbsoma
>>> schema = pa.schema(
...     [
...         ("soma_joinid", pa.int64()),
...         ("A", pa.float32()),
...         ("B", pa.large_string()),
...     ]
... )
>>> with tiledbsoma.DataFrame.create("./test_dataframe", schema=schema) as df:
...     data = pa.Table.from_pydict(
...         {
...             "soma_joinid": [0, 1, 2],
...             "A": [1.0, 2.7182, 3.1214],
...             "B": ["one", "e", "pi"],
...         }
...     )
...     df.write(data)
>>> with tiledbsoma.DataFrame.open("./test_dataframe") as df:
...     print(df.schema)
...     print("---")
...     print(df.read().concat().to_pandas())
...
soma_joinid: int64
A: float
B: large_string
---
   soma_joinid       A    B
0            0  1.0000  one
1            1  2.7182    e
2            2  3.1214   pi

>>> import pyarrow as pa
>>> import tiledbsoma
>>> schema = pa.schema(
...     [
...         ("soma_joinid", pa.int64()),
...         ("A", pa.float32()),
...         ("B", pa.large_string()),
...     ]
... )
>>> with tiledbsoma.DataFrame.create(
...     "./test_dataframe_2",
...     schema=schema,
...     index_column_names=["A", "B"],
...     domain=[(0.0, 10.0), None],
... ) as df:
...     data = pa.Table.from_pydict(
...         {
...             "soma_joinid": [0, 1, 2],
...             "A": [1.0, 2.7182, 3.1214],
...             "B": ["one", "e", "pi"],
...         }
...     )
...     df.write(data)
>>> with tiledbsoma.DataFrame.open("./test_dataframe_2") as df:
...     print(df.schema)
...     print("---")
...     print(df.read().concat().to_pandas())
...
soma_joinid: int64
---
        A    B  soma_joinid
0  1.0000  one            0
1  2.7182    e            1
2  3.1214   pi            2

In this example, index_column_names is given explicitly. The domain argument is optional: if it is omitted, each index column gets the largest domain its datatype allows. If domain is specified, it must be a tuple or list with the same length as index_column_names. Any slot may be None, which selects the largest possible domain for that column. For string and bytes columns, the slot must be None.
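The domain rules above can be expressed as a small pre-flight check. This is a sketch in plain Python under stated assumptions: `check_domain` and `STRING_LIKE` are hypothetical names, not tiledbsoma APIs, column types are passed as plain strings, and the requirement that each non-None slot is a two-element (low, high) pair is inferred from the example rather than stated in the text.

```python
from typing import Dict, Optional, Sequence

# Type names treated as string/bytes in this sketch (an assumption).
STRING_LIKE = {"string", "large_string", "binary", "large_binary"}


def check_domain(
    index_column_names: Sequence[str],
    column_types: Dict[str, str],
    domain: Optional[Sequence[Optional[Sequence]]],
) -> None:
    """Raise ValueError if domain breaks the rules described above.

    Hypothetical helper for illustration only; not a tiledbsoma function.
    """
    if domain is None:
        return  # domain omitted: the defaults apply
    if len(domain) != len(index_column_names):
        raise ValueError("domain must have one slot per index column")
    for name, slot in zip(index_column_names, domain):
        if slot is None:
            continue  # None means "largest possible domain" for this column
        if column_types[name] in STRING_LIKE:
            raise ValueError(f"domain slot for string/bytes column {name!r} must be None")
        if len(slot) != 2:
            # Assumed shape: a (low, high) pair, as in the example above.
            raise ValueError(f"domain slot for {name!r} must be a (low, high) pair")


# Mirrors the call in the second example: A is bounded, B is left as None.
check_domain(["A", "B"], {"A": "float", "B": "large_string"}, [(0.0, 10.0), None])
```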

__init__(handle: _WrapperType_co | DataFrameWrapper | DenseNDArrayWrapper | SparseNDArrayWrapper, *, _dont_call_this_use_create_or_open_instead: str = 'unset')

Internal-only common initializer steps.

This function is internal; users should open TileDB SOMA objects using the create() and open() factory class methods.

Methods

`__init__`(handle, *[, ...])	Internal-only common initializer steps.
`close`()	Release any resources held while the object is open.
`create`(uri, *, schema[, index_column_names, ...])	Creates the data structure on disk/S3/cloud.
`exists`(uri[, context, tiledb_timestamp])	Finds whether an object of this type exists at the given URI.
`keys`()	Returns the names of the columns when read back as a dataframe.
`non_empty_domain`()	Retrieves the non-empty domain for each dimension, namely the smallest and largest indices in each dimension at which the array/dataframe holds data.
`open`(uri[, mode, tiledb_timestamp, context, ...])	Opens this specific type of SOMA object.
`read`([coords, column_names, result_order, ...])	Reads a user-defined subset of data, addressed by the dataframe indexing columns, optionally filtered, and returns the results as one or more Arrow tables.
`verify_open_for_writing`()	Raises an error if the object is not open for writing.
`write`(values[, platform_config])	Writes an Arrow table to the persistent object.
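Several of the methods listed above can be combined to read part of a dataframe. The following is a hedged sketch rather than a definitive recipe: `read_subset` is a hypothetical wrapper, the column name "A" and the default soma_joinid index come from the first example, and the `value_filter` keyword is assumed to be one of the parameters elided from the `read` signature above. The slice is assumed to follow SOMA's closed-interval semantics, so `slice(0, 1)` addresses soma_joinid 0 and 1.

```python
def read_subset(uri: str):
    """Read a filtered slice of a DataFrame and return it as one Arrow table.

    Hypothetical wrapper for illustration only; not a tiledbsoma function.
    """
    # Imported inside the function so the sketch can be defined
    # even where tiledbsoma is not installed.
    import tiledbsoma

    if not tiledbsoma.DataFrame.exists(uri):
        raise FileNotFoundError(f"no SOMA DataFrame at {uri}")

    # Opening without a mode argument is assumed to open for reading.
    with tiledbsoma.DataFrame.open(uri) as df:
        batches = df.read(
            coords=(slice(0, 1),),              # one entry per index column
            column_names=["soma_joinid", "A"],  # return only these columns
            value_filter="A > 2.0",             # assumed filter keyword and syntax
        )
        # read() yields one or more Arrow tables; concat() merges them.
        return batches.concat()
```

With the dataframe from the first example, a call such as `read_subset("./test_dataframe")` would return a pyarrow Table, which can then be converted with `to_pandas()` as shown earlier.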

Attributes

`closed`	True if the object has been closed.
`context`	A value storing implementation-specific configuration information.
`count`	Returns the number of rows in the dataframe.
`domain`	Returns a tuple of minimum and maximum values, inclusive, storable on each index column of the dataframe.
`index_column_names`	Returns index (dimension) column names.
`metadata`	The metadata of this SOMA object.
`mode`	The mode this object was opened in, either `r` or `w`.
`schema`	Returns data schema, in the form of an Arrow Schema.
`soma_type`	A string describing the SOMA type of this object.
`tiledb_timestamp`	The time that this object was opened in UTC.
`tiledb_timestamp_ms`	The time this object was opened, as millis since the Unix epoch.
`uri`	Accessor for the object's storage URI.