Milvus 2.6 New Features Explained: Data-in, Data-out, Vector Search Without Complex Preprocessing

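If you use Milvus for vector search, you have probably run into this hassle: before data goes into Milvus, you have to do a pile of preprocessing yourself, using various models to turn text, images, and audio into vectors (that is, "vectorization"), and you also have to extract features and adjust formats.

At query time, you have to convert the query into a vector before you can search.

Once Milvus finally returns the vector results, you then have to look up the original text by the returned IDs.

The whole process is tedious and inefficient.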

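The recently released Milvus 2.6 officially introduces the Data-in, Data-out feature, which greatly simplifies the data processing flow: raw data in, raw results out.

Below is an explanation of what Data-in, Data-out does and how to use it.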

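Data-in, Data-out does not yet have an official name. In code, it is used through Milvus "Function" configuration, and the feature proposal can be tracked at #35856 (https://github.com/milvus-io/milvus/issues/35856).

Functionally, Data-in, Data-out means users no longer need to compute vectors in advance. Instead, embedding and reranker model APIs are integrated directly into the Milvus database, which handles the processing automatically.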

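Introducing this feature brings three major changes:

(1) Insert raw data directly: submit text, images, or other content straight to Milvus.

(2) Configure an embedding provider and let Milvus vectorize: Milvus can connect to various embedding model services, such as OpenAI, AWS Bedrock, Google Vertex AI, Cohere, and Hugging Face.

(3) Search with the raw query: use the raw query directly, without embedding it before searching.

In short, Milvus has written the embedding and reranker handling logic for developers. It works out of the box and keeps application code simpler.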

The workflow below shows how "Data-in, Data-out" works in Milvus.

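Input text: the user inserts raw data (for example, documents) into Milvus.

Generate embeddings: the Function module in Milvus automatically uses the configured model and third-party API to convert the raw data into vector embeddings.

Store embeddings: the generated embeddings are stored in the vector field specified in the Milvus collection.

Query text: the user submits a text query to Milvus.

Semantic search: Milvus internally embeds the query into a vector first, then runs a similarity search for relevant results.

Return results: Milvus returns the top-k matching results to the application.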
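To make the flow above concrete, here is a toy, in-memory sketch of the same idea: text goes in, the store embeds it on insert, embeds the query on search, and hands back the original text. Everything in it is hypothetical and for illustration only. `toy_embed` is a crude hashed bag-of-words stand-in for a real embedding model, and `ToyCollection` is not how Milvus is implemented.

```python
import math


def toy_embed(text, dim=16):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Deterministic bucket per word, based on its character codes
        bucket = sum(ord(ch) for ch in word) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class ToyCollection:
    """Toy store that embeds on insert and on search (illustration only)."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (id, document, vector)

    def insert(self, rows):
        # Data in: callers pass raw text; the vector is computed here
        for row in rows:
            self.rows.append((row["id"], row["document"], self.embed(row["document"])))

    def search(self, query, limit=1):
        # The raw query is embedded internally, then compared with stored vectors
        q = self.embed(query)
        hits = [
            {
                "id": row_id,
                # Vectors are unit length, so the dot product is the cosine similarity
                "distance": sum(a * b for a, b in zip(q, vec)),
                "entity": {"document": doc},
            }
            for row_id, doc, vec in self.rows
        ]
        hits.sort(key=lambda hit: hit["distance"], reverse=True)
        # Data out: each hit carries the original text, so no lookup by ID is needed
        return hits[:limit]


collection = ToyCollection(toy_embed)
collection.insert([
    {"id": 1, "document": "Milvus simplifies semantic search through embeddings."},
    {"id": 2, "document": "Vector embeddings convert text into searchable numeric data."},
])
print(collection.search("semantic search with Milvus", limit=1))
```

The point of the sketch is the shape of the interface: the caller never sees a vector on the way in or on the way out.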

The walkthrough below uses embedding as the example.

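1. Preparation

This walkthrough uses a Docker Compose deployment. To modify its milvus.yaml, see Configure Milvus with Docker Compose (https://milvus.io/docs/configure-docker.md?tab=component#Download-a-configuration-file). Configuration instructions for other deployment methods can also be found in the official documentation.

Locate the credential and function sections, then modify the apikey1.apikey and providers.cohere settings.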

    ...
    credential:
      aksk1:
        access_key_id:  # Your access_key_id
        secret_access_key:  # Your secret_access_key
      apikey1:
        apikey:  # Change this
      gcp1:
        credential_json:  # base64 based gcp credential data
    # Any configuration related to functions
    function:
      textEmbedding:
        providers:
          ...
          cohere:  # Change the settings below
            credential: apikey1  # The name in the credential configuration item
            enable: true  # Whether to enable cohere model service
            url: "https://api.cohere.com/v2/embed"  # Your cohere embedding url, Default is the official embedding url
          ...
    ...
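The key relationship in this configuration is that providers.cohere.credential holds a label (apikey1) that must match an entry under credential. As a rough sketch, assuming the YAML has already been loaded into a plain Python dict by whatever parser you use, a sanity check could look like the following. The helper name `resolve_api_key` and the placeholder key are hypothetical, and this is not how Milvus itself validates the file.

```python
def resolve_api_key(config, provider):
    """Return the API key that a provider's credential label points to.

    Raises ValueError if the provider is missing or disabled, or if the
    label does not resolve to a non-empty apikey.
    """
    providers = config.get("function", {}).get("textEmbedding", {}).get("providers", {})
    settings = providers.get(provider)
    if settings is None:
        raise ValueError(f"provider {provider!r} is not configured")
    if not settings.get("enable"):
        raise ValueError(f"provider {provider!r} is not enabled")

    label = settings.get("credential")
    entry = config.get("credential", {}).get(label) or {}
    api_key = entry.get("apikey")
    if not api_key:
        raise ValueError(f"credential {label!r} has no apikey set")
    return api_key


# A dict mirroring the relevant parts of milvus.yaml (the key is a placeholder)
config = {
    "credential": {"apikey1": {"apikey": "placeholder-key"}},
    "function": {
        "textEmbedding": {
            "providers": {
                "cohere": {
                    "credential": "apikey1",
                    "enable": True,
                    "url": "https://api.cohere.com/v2/embed",
                }
            }
        }
    },
}
print(resolve_api_key(config, "cohere"))
```

Running a check like this before restarting Milvus can catch a mistyped label early, instead of finding out at insert time.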

2. Define the schema

To use the embedding function, the collection needs at least the following three fields:

Primary key field (id): uniquely identifies each entity

Scalar field (document): stores the raw data

Vector field (dense): stores the vector data

    from pymilvus import MilvusClient, DataType, Function, FunctionType

    # Initialize Milvus client
    client = MilvusClient(
        uri="http://localhost:19530",
    )

    # Create a new schema for the collection
    schema = client.create_schema()

    # Add primary field "id"
    schema.add_field("id", DataType.INT64, is_primary=True, auto_id=False)

    # Add scalar field "document" for storing textual data
    schema.add_field("document", DataType.VARCHAR, max_length=9000)

    # Add vector field "dense" for storing embeddings.
    # IMPORTANT: Set `dim` to match the exact output dimension of the embedding model.
    # For instance, OpenAI's text-embedding-3-small model outputs 1536-dimensional vectors.
    # For dense vector, data type can be FLOAT_VECTOR or INT8_VECTOR
    schema.add_field("dense", DataType.FLOAT_VECTOR, dim=1536)  # Choose dim according to the embedding model
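Because dim has to match the model's output dimension exactly, it can be worth checking one sample embedding against the declared value before creating the collection. The sketch below shows the idea with a hypothetical helper, `check_dimension`, and a made-up sample vector; in practice the sample would come from a single call to your embedding provider.

```python
def check_dimension(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from the schema's dim."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, expected {expected_dim}"
            )
    return True


SCHEMA_DIM = 1536  # the dim declared on the "dense" field

# Placeholder for a real embedding returned by the provider
sample_embedding = [0.0] * 1536
print(check_dimension([sample_embedding], SCHEMA_DIM))
```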

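Next, define the embedding function in the schema. The definition below sets:

name: identifies the Function.

function_type: FunctionType.TEXTEMBEDDING marks this as a text embedding function. FunctionType.BM25 and FunctionType.RERANK are also supported; see Full Text Search and Decay Ranker Overview.

input_field_names: the input field (raw data), here document.

output_field_names: the output field (vector data), here dense.

params: configuration parameters. Both provider and model_name must correspond to what is configured in milvus.yaml and must be usable; this example uses the cohere configuration.

Note: make sure every Function has a unique name and unique output_field_names, so that different transformation logic can be told apart and conflicts are avoided.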
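As a small illustration of the uniqueness rule, the sketch below scans a list of function definitions, written as plain dicts rather than pymilvus Function objects, and reports any repeated name or output field. The helper `find_conflicts` and the second function definition are hypothetical and are not part of pymilvus.

```python
from collections import Counter


def find_conflicts(functions):
    """Return function names and output fields that appear more than once."""
    name_counts = Counter(fn["name"] for fn in functions)
    output_counts = Counter(
        field for fn in functions for field in fn["output_field_names"]
    )
    return {
        "names": sorted(name for name, count in name_counts.items() if count > 1),
        "output_fields": sorted(
            field for field, count in output_counts.items() if count > 1
        ),
    }


functions = [
    {"name": "cohere_embedding", "output_field_names": ["dense"]},
    # Hypothetical second function that wrongly writes to the same vector field
    {"name": "another_embedding", "output_field_names": ["dense"]},
]
print(find_conflicts(functions))
```

An empty result for both keys means the definitions do not collide.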

    # Define embedding function (example: Cohere provider)
    text_embedding_function = Function(
        name="cohere_embedding",                   # Unique identifier for this embedding function
        function_type=FunctionType.TEXTEMBEDDING,  # Type of embedding function
        input_field_names=["document"],            # Scalar field to embed
        output_field_names=["dense"],              # Vector field to store embeddings
        params={                                   # Provider-specific configuration (highest priority)
            "provider": "cohere",                  # Embedding model provider
            "model_name": "embed-v4.0",            # Embedding model
            # "credential": "apikey1",             # Optional: Credential label
            # Optional parameters:
            # "dim": "1536",                       # Optionally shorten the vector dimension
            # "user": "user123"                    # Optional: identifier for API tracking
        }
    )

    # Add the embedding function to your schema
    schema.add_function(text_embedding_function)

3. Configure the index

With the required fields and the built-in function defined, you can set up an index for the collection. For convenience, the index type here is AUTOINDEX.

    # Prepare index parameters
    index_params = client.prepare_index_params()

    # Add AUTOINDEX to automatically select optimal indexing method
    index_params.add_index(
        field_name="dense",
        index_type="AUTOINDEX",
        metric_type="COSINE"
    )
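The index uses metric_type="COSINE", so results are ranked by cosine similarity between the query vector and each stored vector. For readers who want to see what that score measures, here is a minimal pure-Python version of the standard formula. It is a reference sketch of the math only, not Milvus's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The score depends only on direction, not on vector length, which is why it is a common choice for text embeddings.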

4. Create the collection

Using the schema and index defined above, create a collection named demo:

    # Create collection named "demo"
    client.create_collection(
        collection_name='demo',
        schema=schema,
        index_params=index_params
    )

5. Insert data

Now you can insert raw data directly, with no need to vectorize it first.

    # Insert sample documents
    client.insert('demo', [
        {'id': 1, 'document': 'Milvus simplifies semantic search through embeddings.'},
        {'id': 2, 'document': 'Vector embeddings convert text into searchable numeric data.'},
        {'id': 3, 'document': 'Semantic search helps users find relevant information quickly.'},
    ])

6. Vector search

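Once the data is inserted, searches can also use raw text directly. Milvus automatically embeds the query into the required vector, retrieves by similarity, and returns the matching results.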

    # Perform semantic search
    results = client.search(
        collection_name='demo',
        data=['How does Milvus handle semantic search?'],  # Use text query rather than query vector
        anns_field='dense',  # Use the vector field that stores embeddings
        limit=1,
        output_fields=['document'],
    )
    print(results)

    # Example output:
    # data: ["[{'id': 1, 'distance': 0.8821347951889038, 'entity': {'document': 'Milvus simplifies semantic search through embeddings.'}}]"]
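Since output_fields includes document, each hit already carries the original text, so no second lookup by ID is needed. The sketch below pulls the documents out of a result shaped like the example output above: one list of hits per query, with each hit holding id, distance, and entity. It runs on a plain list of dicts as a stand-in. The helper `documents_from_results` is hypothetical, and whether the real return value behaves exactly like nested lists and dicts is an assumption to verify against your pymilvus version.

```python
def documents_from_results(results, field="document"):
    """Collect the requested output field from every hit, grouped per query."""
    return [[hit["entity"][field] for hit in hits] for hits in results]


# Stand-in for the search result shown in the example output
sample_results = [
    [
        {
            "id": 1,
            "distance": 0.8821347951889038,
            "entity": {
                "document": "Milvus simplifies semantic search through embeddings."
            },
        }
    ]
]

for texts in documents_from_results(sample_results):
    for text in texts:
        print(text)
```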

For more vector search features, see Basic Vector Search (https://milvus.io/docs/single-vector-search.md) and Query (https://milvus.io/docs/get-and-scalar-query.md).

Source: 同行者一点号1
