This document outlines a proposed design for Document Indexes in DocArray v2.

It tries to tackle, among other things, the shortcomings described here.

Another goal of this redesign is supporting hybrid search and other advanced search techniques.

Principles

I think Document Indexes should follow one principle: “Everything just works out of the box, but you can configure endlessly.”

This echoes FastAPI’s ethos.

For the different vector DBs that DocArray aims to support, this essentially translates to: “Basic usage should be unified across all supported backends. Power usage can deviate between backends.”

Further, the principles that apply to the rest of DocArray v2 still hold: the Zen of Python (particularly “explicit is better than implicit”), dataclasses are a good idea, schema definitions are a good idea, schema definitions through dataclasses are a good idea, Python type hinting is a good idea, etc.

Preliminaries

DocArray has two simple concepts that are needed here: Document and DocList.

A Document is a pydantic-like object that can model your data:

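from docarray import BaseDoc
from docarray.typing import ImageUrl, TorchTensor

class MyDoc(BaseDoc):
    url: ImageUrl  # validated URL pointing to an image
    tensor: TorchTensor[128]  # torch tensor with a fixed shape of (128,)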

And a DocList is a list-like container of Documents:

documents = DocList[MyDoc]([MyDoc(...)])

A Document Index indexes Documents into a vector DB:

db = DocumentIndex[MyDoc](...)  # could be weaviate, qdrant, ES, ...
db.index(documents)
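
To make the intended shape of this interface more concrete, here is a minimal, standard-library-only sketch. It is not the proposed implementation: ToyDoc, ToyDocumentIndex, InMemoryIndex and the find method are hypothetical names invented for illustration, and a brute-force cosine search stands in for a real vector DB. It only tries to show how a schema-parametrized index with a unified basic API could look.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from math import sqrt
from typing import Generic, List, Sequence, Tuple, TypeVar


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Document: a plain dataclass schema."""
    url: str
    embedding: List[float]


T = TypeVar("T", bound=ToyDoc)


class ToyDocumentIndex(ABC, Generic[T]):
    """Hypothetical unified interface shared by every backend."""

    @abstractmethod
    def index(self, docs: Sequence[T]) -> None:
        """Store the given documents."""

    @abstractmethod
    def find(self, query: Sequence[float], limit: int = 1) -> List[Tuple[T, float]]:
        """Return up to `limit` (document, score) pairs, best match first."""


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryIndex(ToyDocumentIndex[ToyDoc]):
    """Assumed toy backend: keeps documents in a list and scans all of them."""

    def __init__(self) -> None:
        self._docs: List[ToyDoc] = []

    def index(self, docs: Sequence[ToyDoc]) -> None:
        self._docs.extend(docs)

    def find(self, query: Sequence[float], limit: int = 1) -> List[Tuple[ToyDoc, float]]:
        scored = [(doc, cosine(query, doc.embedding)) for doc in self._docs]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]


# Basic usage stays the same no matter which backend sits behind the interface.
db = InMemoryIndex()
db.index([ToyDoc("a.png", [1.0, 0.0]), ToyDoc("b.png", [0.0, 1.0])])
best_doc, best_score = db.find([0.9, 0.1], limit=1)[0]
```

In this sketch, a real backend would subclass the same abstract interface and could expose extra, backend-specific options through its constructor, which is one way to read “basic usage unified, power usage can deviate”.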