🚀 免费试用 Zilliz Cloud,完全托管的 Milvus,体验 10 倍的性能提升!立即试用>

milvus-logo
LFAI
  • Home
  • Blog
  • 语义搜索与全文搜索:我要在 Milvus 2.5 中选择哪一个?

语义搜索与全文搜索:我要在 Milvus 2.5 中选择哪一个?

  • Engineering
December 17, 2024
David Wang, Jiang Chen

Milvus 是领先的高性能向量数据库,长期以来一直专注于利用深度学习模型的向量嵌入进行语义搜索。这项技术为检索增强生成(RAG)、搜索引擎和推荐系统等人工智能应用提供了动力。随着 RAG 和其他文本搜索应用的日益普及,社区已经认识到将传统文本匹配方法与语义搜索(即混合搜索)相结合的优势。这种方法尤其适用于严重依赖关键词匹配的场景。为了满足这一需求,Milvus 2.5 引入了全文搜索(FTS)功能,并将其与 2.4 版以来已有的稀疏向量搜索和混合搜索功能集成在一起,形成了强大的协同效应。

混合搜索是一种结合多种搜索路径结果的方法。用户可以通过各种方式搜索不同的数据字段,然后对结果进行合并和排序,以获得综合结果。在当今流行的 RAG 应用程序中,典型的混合方法是将语义搜索与全文搜索相结合。具体来说,这涉及使用 RRF(互惠排名融合)合并基于密集嵌入的语义搜索和基于 BM25 的词性匹配的结果,以增强结果排名。

在本文中,我们将使用 Anthropic 提供的数据集进行演示,该数据集由九个代码库中的代码片段组成。这类似于 RAG 的一个常用案例:人工智能辅助编码机器人。由于代码数据包含大量定义、关键词和其他信息,因此基于文本的搜索在这种情况下会特别有效。同时,在大型代码数据集上训练的密集嵌入模型可以捕捉到更高层次的语义信息。我们的目标是通过实验观察将这两种方法结合起来的效果。

我们将分析具体案例,以便对混合搜索有更清晰的认识。作为基线,我们将使用在大量代码数据上训练的高级密集嵌入模型(Voyage-2)。然后,我们将选择混合搜索结果优于语义搜索结果和全文搜索结果(前 5 名)的案例,分析这些案例背后的特点。

方法通过@5
全文搜索0.7318
语义搜索0.8096
混合搜索0.8176
混合搜索(添加词缀)0.8418

除了逐个分析质量外,我们还通过计算整个数据集的 Pass@5 指标扩大了评估范围。该指标衡量的是在每个查询的前 5 个结果中找到的相关结果的比例。我们的研究结果表明,虽然先进的嵌入模型建立了坚实的基础,但将它们与全文搜索整合在一起会产生更好的结果。通过检查 BM25 结果和微调特定场景的参数,可以进一步提高性能。

我们检查了三种不同搜索查询的具体检索结果,比较了语义搜索、全文搜索和混合搜索。您也可以在此软件仓库中查看完整代码

查询:日志文件是如何创建的?

该查询旨在询问如何创建日志文件,正确答案应该是创建日志文件的 Rust 代码片段。在语义搜索结果中,我们看到了一些介绍日志头文件的代码和获取日志记录器的 C++ 代码。不过,这里的关键是 "logfile "变量。在混合搜索结果 #hybrid 0 中,我们找到了这个相关结果,它自然来自全文搜索,因为混合搜索合并了语义搜索和全文搜索结果。

除了这个结果,我们还可以在 #hybrid 2 中找到不相关的测试模拟代码,尤其是重复出现的短语 "长字符串,以测试如何处理这些字符串"。这就需要了解全文搜索中使用的 BM25 算法背后的原理。全文搜索的目的是匹配更多不常见的单词(因为常见单词会降低文本的独特性,阻碍对象的识别)。假设我们对大量自然文本进行统计分析。在这种情况下,我们很容易得出结论:"how "是一个非常常见的词,对相关性得分的贡献很小。然而,在本例中,数据集由代码组成,而代码中出现 "how "一词的次数并不多,因此在这种情况下,"how "就成了关键搜索词。

基本事实:正确答案是创建日志文件的 Rust 代码。

use {
    crate::args::LogArgs,
    anyhow::{anyhow, Result},
    simplelog::{Config, LevelFilter, WriteLogger},
    std::fs::File,
};

pub struct Logger;

impl Logger {
    pub fn init(args: &impl LogArgs) -> Result<()> {
        let filter: LevelFilter = args.log_level().into();
        if filter != LevelFilter::Off {
            let logfile = File::create(args.log_file())
                .map_err(|e| anyhow!("Failed to open log file: {e:}"))?;
            WriteLogger::init(filter, Config::default(), logfile)
                .map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?;
        }
        Ok(())
    }
}

语义搜索结果

##dense 0 0.7745316028594971 
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "logunit.h"
#include <log4cxx/logger.h>
#include <log4cxx/simplelayout.h>
#include <log4cxx/fileappender.h>
#include <log4cxx/helpers/absolutetimedateformat.h>



 ##dense 1 0.769859254360199 
        void simple()
        {
                LayoutPtr layout = LayoutPtr(new SimpleLayout());
                AppenderPtr appender = FileAppenderPtr(new FileAppender(layout, LOG4CXX_STR("output/simple"), false));
                root->addAppender(appender);
                common();

                LOGUNIT_ASSERT(Compare::compare(LOG4CXX_FILE("output/simple"), LOG4CXX_FILE("witness/simple")));
        }

        std::string createMessage(int i, Pool & pool)
        {
                std::string msg("Message ");
                msg.append(pool.itoa(i));
                return msg;
        }

        void common()
        {
                int i = 0;

                // In the lines below, the logger names are chosen as an aid in
                // remembering their level values. In general, the logger names
                // have no bearing to level values.
                LoggerPtr ERRlogger = Logger::getLogger(LOG4CXX_TEST_STR("ERR"));
                ERRlogger->setLevel(Level::getError());



 ##dense 2 0.7591114044189453 
                log4cxx::spi::LoggingEventPtr logEvt = std::make_shared<log4cxx::spi::LoggingEvent>(LOG4CXX_STR("foo"),
                                                                                                                                                                                         Level::getInfo(),
                                                                                                                                                                                         LOG4CXX_STR("A Message"),
                                                                                                                                                                                         log4cxx::spi::LocationInfo::getLocationUnavailable());
                FMTLayout layout(LOG4CXX_STR("{d:%Y-%m-%d %H:%M:%S} {message}"));
                LogString output;
                log4cxx::helpers::Pool pool;
                layout.format( output, logEvt, pool);



 ##dense 3 0.7562235593795776 
#include "util/compare.h"
#include "util/transformer.h"
#include "util/absolutedateandtimefilter.h"
#include "util/iso8601filter.h"
#include "util/absolutetimefilter.h"
#include "util/relativetimefilter.h"
#include "util/controlfilter.h"
#include "util/threadfilter.h"
#include "util/linenumberfilter.h"
#include "util/filenamefilter.h"
#include "vectorappender.h"
#include <log4cxx/fmtlayout.h>
#include <log4cxx/propertyconfigurator.h>
#include <log4cxx/helpers/date.h>
#include <log4cxx/spi/loggingevent.h>
#include <iostream>
#include <iomanip>

#define REGEX_STR(x) x
#define PAT0 REGEX_STR("\\[[0-9A-FXx]*]\\ (DEBUG|INFO|WARN|ERROR|FATAL) .* - Message [0-9]\\{1,2\\}")
#define PAT1 ISO8601_PAT REGEX_STR(" ") PAT0
#define PAT2 ABSOLUTE_DATE_AND_TIME_PAT REGEX_STR(" ") PAT0
#define PAT3 ABSOLUTE_TIME_PAT REGEX_STR(" ") PAT0
#define PAT4 RELATIVE_TIME_PAT REGEX_STR(" ") PAT0
#define PAT5 REGEX_STR("\\[[0-9A-FXx]*]\\ (DEBUG|INFO|WARN|ERROR|FATAL) .* : Message [0-9]\\{1,2\\}")


 ##dense 4 0.7557586431503296 
                std::string msg("Message ");

                Pool pool;

                // These should all log.----------------------------
                LOG4CXX_FATAL(ERRlogger, createMessage(i, pool));
                i++; //0
                LOG4CXX_ERROR(ERRlogger, createMessage(i, pool));
                i++;

                LOG4CXX_FATAL(INF, createMessage(i, pool));
                i++; // 2
                LOG4CXX_ERROR(INF, createMessage(i, pool));
                i++;
                LOG4CXX_WARN(INF, createMessage(i, pool));
                i++;
                LOG4CXX_INFO(INF, createMessage(i, pool));
                i++;

                LOG4CXX_FATAL(INF_UNDEF, createMessage(i, pool));
                i++; //6
                LOG4CXX_ERROR(INF_UNDEF, createMessage(i, pool));
                i++;
                LOG4CXX_WARN(INF_UNDEF, createMessage(i, pool));
                i++;
                LOG4CXX_INFO(INF_UNDEF, createMessage(i, pool));
                i++;

                LOG4CXX_FATAL(INF_ERR, createMessage(i, pool));
                i++; // 10
                LOG4CXX_ERROR(INF_ERR, createMessage(i, pool));
                i++;

                LOG4CXX_FATAL(INF_ERR_UNDEF, createMessage(i, pool));
                i++;
                LOG4CXX_ERROR(INF_ERR_UNDEF, createMessage(i, pool));
                i++;


混合搜索结果

##hybrid 0 0.016393441706895828 
use {
    crate::args::LogArgs,
    anyhow::{anyhow, Result},
    simplelog::{Config, LevelFilter, WriteLogger},
    std::fs::File,
};

pub struct Logger;

impl Logger {
    pub fn init(args: &impl LogArgs) -> Result<()> {
        let filter: LevelFilter = args.log_level().into();
        if filter != LevelFilter::Off {
            let logfile = File::create(args.log_file())
                .map_err(|e| anyhow!("Failed to open log file: {e:}"))?;
            WriteLogger::init(filter, Config::default(), logfile)
                .map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?;
        }
        Ok(())
    }
}

 
##hybrid 1 0.016393441706895828 
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "logunit.h"
#include <log4cxx/logger.h>
#include <log4cxx/simplelayout.h>
#include <log4cxx/fileappender.h>
#include <log4cxx/helpers/absolutetimedateformat.h>


 
##hybrid 2 0.016129031777381897 
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
    };
}


 
##hybrid 3 0.016129031777381897 
        void simple()
        {
                LayoutPtr layout = LayoutPtr(new SimpleLayout());
                AppenderPtr appender = FileAppenderPtr(new FileAppender(layout, LOG4CXX_STR("output/simple"), false));
                root->addAppender(appender);
                common();

                LOGUNIT_ASSERT(Compare::compare(LOG4CXX_FILE("output/simple"), LOG4CXX_FILE("witness/simple")));
        }

        std::string createMessage(int i, Pool & pool)
        {
                std::string msg("Message ");
                msg.append(pool.itoa(i));
                return msg;
        }

        void common()
        {
                int i = 0;

                // In the lines below, the logger names are chosen as an aid in
                // remembering their level values. In general, the logger names
                // have no bearing to level values.
                LoggerPtr ERRlogger = Logger::getLogger(LOG4CXX_TEST_STR("ERR"));
                ERRlogger->setLevel(Level::getError());


 
##hybrid 4 0.01587301678955555 
std::vector<std::string> MakeStrings() {
    return {
        "a", "ab", "abc", "abcd",
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "

查询:如何初始化日志记录器?

这个查询与前一个查询非常相似,正确答案也是相同的代码片段,但在本例中,混合搜索(通过语义搜索)找到了答案,而全文搜索却没有找到。造成这种差异的原因在于语料库中单词的统计权重与我们对问题的直观理解不一致。模型未能认识到 "how "一词的匹配在这里并不那么重要。与 "如何 "相比,"记录仪 "一词在代码中出现的频率更高,这导致 "如何 "在全文搜索排名中变得更加重要。

地面实况

use {
    crate::args::LogArgs,
    anyhow::{anyhow, Result},
    simplelog::{Config, LevelFilter, WriteLogger},
    std::fs::File,
};

pub struct Logger;

impl Logger {
    pub fn init(args: &impl LogArgs) -> Result<()> {
        let filter: LevelFilter = args.log_level().into();
        if filter != LevelFilter::Off {
            let logfile = File::create(args.log_file())
                .map_err(|e| anyhow!("Failed to open log file: {e:}"))?;
            WriteLogger::init(filter, Config::default(), logfile)
                .map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?;
        }
        Ok(())
    }
}

全文搜索结果

##sparse 0 10.17311954498291 
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
    };
}



 ##sparse 1 9.775702476501465 
std::vector<std::string> MakeStrings() {
    return {
        "a", "ab", "abc", "abcd",
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "


 ##sparse 2 7.638711452484131 
//   union ("x|y"), grouping ("(xy)"), brackets ("[xy]"), and
//   repetition count ("x{5,7}"), among others.
//
//   Below is the syntax that we do support.  We chose it to be a
//   subset of both PCRE and POSIX extended regex, so it's easy to
//   learn wherever you come from.  In the following: 'A' denotes a
//   literal character, period (.), or a single \\ escape sequence;
//   'x' and 'y' denote regular expressions; 'm' and 'n' are for


 ##sparse 3 7.1208391189575195 
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "logunit.h"
#include <log4cxx/logger.h>
#include <log4cxx/simplelayout.h>
#include <log4cxx/fileappender.h>
#include <log4cxx/helpers/absolutetimedateformat.h>



 ##sparse 4 7.066349029541016 
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include <log4cxx/filter/denyallfilter.h>
#include <log4cxx/logger.h>
#include <log4cxx/spi/filter.h>
#include <log4cxx/spi/loggingevent.h>
#include "../logunit.h"

混合搜索结果


 ##hybrid 0 0.016393441706895828 
use {
    crate::args::LogArgs,
    anyhow::{anyhow, Result},
    simplelog::{Config, LevelFilter, WriteLogger},
    std::fs::File,
};

pub struct Logger;

impl Logger {
    pub fn init(args: &impl LogArgs) -> Result<()> {
        let filter: LevelFilter = args.log_level().into();
        if filter != LevelFilter::Off {
            let logfile = File::create(args.log_file())
                .map_err(|e| anyhow!("Failed to open log file: {e:}"))?;
            WriteLogger::init(filter, Config::default(), logfile)
                .map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?;
        }
        Ok(())
    }
}

 
##hybrid 1 0.016393441706895828 
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
    };
}


 
##hybrid 2 0.016129031777381897 
std::vector<std::string> MakeStrings() {
    return {
        "a", "ab", "abc", "abcd",
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "
        "long string to test how those are handled. Here goes more text. "

 
##hybrid 3 0.016129031777381897 
                LoggerPtr INF = Logger::getLogger(LOG4CXX_TEST_STR("INF"));
                INF->setLevel(Level::getInfo());

                LoggerPtr INF_ERR = Logger::getLogger(LOG4CXX_TEST_STR("INF.ERR"));
                INF_ERR->setLevel(Level::getError());

                LoggerPtr DEB = Logger::getLogger(LOG4CXX_TEST_STR("DEB"));
                DEB->setLevel(Level::getDebug());

                // Note: categories with undefined level
                LoggerPtr INF_UNDEF = Logger::getLogger(LOG4CXX_TEST_STR("INF.UNDEF"));
                LoggerPtr INF_ERR_UNDEF = Logger::getLogger(LOG4CXX_TEST_STR("INF.ERR.UNDEF"));
                LoggerPtr UNDEF = Logger::getLogger(LOG4CXX_TEST_STR("UNDEF"));


 
##hybrid 4 0.01587301678955555 
//   union ("x|y"), grouping ("(xy)"), brackets ("[xy]"), and
//   repetition count ("x{5,7}"), among others.
//
//   Below is the syntax that we do support.  We chose it to be a
//   subset of both PCRE and POSIX extended regex, so it's easy to
//   learn wherever you come from.  In the following: 'A' denotes a
//   literal character, period (.), or a single \\ escape sequence;
//   'x' and 'y' denote regular expressions; 'm' and 'n' are for

在观察中,我们发现在稀疏向量搜索中,许多低质量的搜索结果都是由于匹配了 "如何 "和 "什么 "等低信息量的词造成的。通过检查数据,我们意识到这些词对搜索结果造成了干扰。缓解这一问题的一种方法是将这些词添加到一个停止词列表中,并在匹配过程中忽略它们。这将有助于消除这些常用词的负面影响,提高搜索结果的质量。

在添加了停顿词以过滤掉 "How "和 "What "等低信息量的词之后,我们分析了一个经过微调的混合搜索比语义搜索表现更好的案例。这种情况下的改进是由于在查询中匹配了 "RegistryClient "一词,这让我们找到了仅靠语义搜索模型无法调用的结果。

此外,我们注意到混合搜索减少了结果中低质量匹配的数量。在这种情况下,混合搜索方法成功地整合了语义搜索和全文搜索,从而获得了更多相关结果,并提高了准确性。

查询:测试方法中是如何创建 RegistryClient 实例的?

混合搜索有效地检索到了与创建 "RegistryClient "实例相关的答案,而单独的语义搜索未能找到该答案。添加停止词有助于避免 "如何 "等术语带来的不相关结果,从而提高匹配质量,减少低质量结果。

/** Integration tests for {@link BlobPuller}. */
public class BlobPullerIntegrationTest {

  private final FailoverHttpClient httpClient = new FailoverHttpClient(true, false, ignored -> {});

  @Test
  public void testPull() throws IOException, RegistryException {
    RegistryClient registryClient =
        RegistryClient.factory(EventHandlers.NONE, "gcr.io", "distroless/base", httpClient)
            .newRegistryClient();
    V22ManifestTemplate manifestTemplate =
        registryClient
            .pullManifest(
                ManifestPullerIntegrationTest.KNOWN_MANIFEST_V22_SHA, V22ManifestTemplate.class)
            .getManifest();

    DescriptorDigest realDigest = manifestTemplate.getLayers().get(0).getDigest();

语义搜索结果


 

##dense 0 0.7411458492279053 
    Mockito.doThrow(mockRegistryUnauthorizedException)
        .when(mockJibContainerBuilder)
        .containerize(mockContainerizer);

    try {
      testJibBuildRunner.runBuild();
      Assert.fail();

    } catch (BuildStepsExecutionException ex) {
      Assert.assertEquals(
          TEST_HELPFUL_SUGGESTIONS.forHttpStatusCodeForbidden("someregistry/somerepository"),
          ex.getMessage());
    }
  }



 ##dense 1 0.7346029877662659 
    verify(mockCredentialRetrieverFactory).known(knownCredential, "credentialSource");
    verify(mockCredentialRetrieverFactory).known(inferredCredential, "inferredCredentialSource");
    verify(mockCredentialRetrieverFactory)
        .dockerCredentialHelper("docker-credential-credentialHelperSuffix");
  }



 ##dense 2 0.7285804748535156 
    when(mockCredentialRetrieverFactory.dockerCredentialHelper(anyString()))
        .thenReturn(mockDockerCredentialHelperCredentialRetriever);
    when(mockCredentialRetrieverFactory.known(knownCredential, "credentialSource"))
        .thenReturn(mockKnownCredentialRetriever);
    when(mockCredentialRetrieverFactory.known(inferredCredential, "inferredCredentialSource"))
        .thenReturn(mockInferredCredentialRetriever);
    when(mockCredentialRetrieverFactory.wellKnownCredentialHelpers())
        .thenReturn(mockWellKnownCredentialHelpersCredentialRetriever);



 ##dense 3 0.7279614210128784 
  @Test
  public void testBuildImage_insecureRegistryException()
      throws InterruptedException, IOException, CacheDirectoryCreationException, RegistryException,
          ExecutionException {
    InsecureRegistryException mockInsecureRegistryException =
        Mockito.mock(InsecureRegistryException.class);
    Mockito.doThrow(mockInsecureRegistryException)
        .when(mockJibContainerBuilder)
        .containerize(mockContainerizer);

    try {
      testJibBuildRunner.runBuild();
      Assert.fail();

    } catch (BuildStepsExecutionException ex) {
      Assert.assertEquals(TEST_HELPFUL_SUGGESTIONS.forInsecureRegistry(), ex.getMessage());
    }
  }



 ##dense 4 0.724872350692749 
  @Test
  public void testBuildImage_registryCredentialsNotSentException()
      throws InterruptedException, IOException, CacheDirectoryCreationException, RegistryException,
          ExecutionException {
    Mockito.doThrow(mockRegistryCredentialsNotSentException)
        .when(mockJibContainerBuilder)
        .containerize(mockContainerizer);

    try {
      testJibBuildRunner.runBuild();
      Assert.fail();

    } catch (BuildStepsExecutionException ex) {
      Assert.assertEquals(TEST_HELPFUL_SUGGESTIONS.forCredentialsNotSent(), ex.getMessage());
    }
  }

混合搜索结果


 ##hybrid 0 0.016393441706895828 
/** Integration tests for {@link BlobPuller}. */
public class BlobPullerIntegrationTest {

  private final FailoverHttpClient httpClient = new FailoverHttpClient(true, false, ignored -> {});

  @Test
  public void testPull() throws IOException, RegistryException {
    RegistryClient registryClient =
        RegistryClient.factory(EventHandlers.NONE, "gcr.io", "distroless/base", httpClient)
            .newRegistryClient();
    V22ManifestTemplate manifestTemplate =
        registryClient
            .pullManifest(
                ManifestPullerIntegrationTest.KNOWN_MANIFEST_V22_SHA, V22ManifestTemplate.class)
            .getManifest();

    DescriptorDigest realDigest = manifestTemplate.getLayers().get(0).getDigest();


 
##hybrid 1 0.016393441706895828 
    Mockito.doThrow(mockRegistryUnauthorizedException)
        .when(mockJibContainerBuilder)
        .containerize(mockContainerizer);

    try {
      testJibBuildRunner.runBuild();
      Assert.fail();

    } catch (BuildStepsExecutionException ex) {
      Assert.assertEquals(
          TEST_HELPFUL_SUGGESTIONS.forHttpStatusCodeForbidden("someregistry/somerepository"),
          ex.getMessage());
    }
  }


 
##hybrid 2 0.016129031777381897 
    verify(mockCredentialRetrieverFactory).known(knownCredential, "credentialSource");
    verify(mockCredentialRetrieverFactory).known(inferredCredential, "inferredCredentialSource");
    verify(mockCredentialRetrieverFactory)
        .dockerCredentialHelper("docker-credential-credentialHelperSuffix");
  }


 
##hybrid 3 0.016129031777381897 
  @Test
  public void testPull_unknownBlob() throws IOException, DigestException {
    DescriptorDigest nonexistentDigest =
        DescriptorDigest.fromHash(
            "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");

    RegistryClient registryClient =
        RegistryClient.factory(EventHandlers.NONE, "gcr.io", "distroless/base", httpClient)
            .newRegistryClient();

    try {
      registryClient
          .pullBlob(nonexistentDigest, ignored -> {}, ignored -> {})
          .writeTo(ByteStreams.nullOutputStream());
      Assert.fail("Trying to pull nonexistent blob should have errored");

    } catch (IOException ex) {
      if (!(ex.getCause() instanceof RegistryErrorException)) {
        throw ex;
      }
      MatcherAssert.assertThat(
          ex.getMessage(),
          CoreMatchers.containsString(
              "pull BLOB for gcr.io/distroless/base with digest " + nonexistentDigest));
    }
  }
}

 
##hybrid 4 0.01587301678955555 
    when(mockCredentialRetrieverFactory.dockerCredentialHelper(anyString()))
        .thenReturn(mockDockerCredentialHelperCredentialRetriever);
    when(mockCredentialRetrieverFactory.known(knownCredential, "credentialSource"))
        .thenReturn(mockKnownCredentialRetriever);
    when(mockCredentialRetrieverFactory.known(inferredCredential, "inferredCredentialSource"))
        .thenReturn(mockInferredCredentialRetriever);
    when(mockCredentialRetrieverFactory.wellKnownCredentialHelpers())
        .thenReturn(mockWellKnownCredentialHelpersCredentialRetriever);

结论

通过分析,我们可以就不同检索方法的性能得出几个结论。在大多数情况下,语义搜索模型可以通过把握查询的整体意图来帮助我们获得良好的结果,但当查询包含我们想要匹配的特定关键词时,它就显得力不从心了。

在这些情况下,Embeddings 模型并不能明确表示这种意图。另一方面,全文搜索可以直接解决这个问题。不过,它也会带来尽管匹配了单词但结果却不相关的问题,这会降低整体结果的质量。因此,通过分析特定结果和应用有针对性的策略来提高搜索质量,从而识别和处理这些负面情况至关重要。采用 RRF 或加权 Rerankers 等排序策略的混合搜索通常是一个不错的基准选项。

随着 Milvus 2.5 中全文检索功能的发布,我们旨在为社区提供灵活多样的信息检索解决方案。这将允许用户探索各种搜索方法的组合,满足 GenAI 时代日益复杂多样的搜索需求。请查看如何使用 Milvus 2.5 实现全文检索和混合搜索的代码示例。

Like the article? Spread the word

扩展阅读