🖲️AI Search Engine Multilingual Evaluation Report (v1.0)
1. Abstract
With the advent of ChatGPT, conversational search engine technology has rapidly gained widespread attention. Several general-purpose question-answering search engines, such as Perplexity and iAsk, have emerged in the market, as well as other search solutions focusing on specific vertical domains.
We believe that these conversational search products have a significant advantage over traditional keyword-based search engines in providing direct answers, and they may become a disruptive paradigm in the evolution of search technology. However, in actual use, we have also noticed some issues, particularly in terms of the accuracy and reliability of the answers. Inaccurate responses and so-called "hallucination answers" (i.e., answers that are irrelevant to the user's query or completely nonsensical) occur frequently, which seriously affects the user experience.
This work sets out to identify the best-performing AI search engine or AI search product. Considering the linguistic diversity of the global user base, we selected a variety of languages, including English, Japanese, Simplified Chinese, and Russian, to conduct preliminary tests and assessments of the accuracy of these question-answering search engine products.
In this report, we will detail our evaluation methodology, testing process, and the conclusions we have drawn. Our goal is to provide developers, researchers, and end-users with a comprehensive performance evaluation, to better understand the performance of these question-answering search engines in different linguistic environments, and to point out their current limitations and directions for improvement.
In this performance evaluation of question-answering search engines, we arrived at the following key findings:
Overall, the performance of the evaluated products did not meet our expectations. However, it is noteworthy that Metaso, a company primarily serving the Chinese market, performed the best overall in the evaluation, slightly surpassing Perplexity.
The comprehensive data analysis of all products indicates that the highest accuracy is achieved in answering questions in English. By contrast, Russian questions have the lowest accuracy rate, and the accuracy for Japanese questions is also relatively low.
In terms of performance by language, Perplexity leads by a significant margin in answering English questions, and its performance in Simplified Chinese is also quite impressive. Metaso stands out in both Simplified and Traditional Chinese. However, neither product reaches a satisfactory level in other languages. iAsk demonstrated relatively balanced capabilities across different languages, but its overall results fall in the middle of the range. You.com, on the other hand, performed well only in answering questions in English.
Note 1: For this evaluation, the free versions of each product were chosen (the in-depth mode was selected for Metaso). The assessment of the Pro versions will be conducted later.
Note 2: This evaluation focuses solely on the accuracy of the answers, disregarding other aspects such as the language and format of the responses.
2. Product Selection
To conduct this performance evaluation of question-answering search engines, we referred to the list of leading AI search engine products provided by the aicpb.com website. After excluding traditional keyword search engines (like bing.com), we selected four of the most outstanding AI question-answering search engine products on the market for our evaluation.
These products are considered to be the best in the industry, and their technology and market performance represent the highest level of current AI search engines. Our evaluation aims to examine the performance of these leading products, particularly their ability to understand and respond to user queries, as well as their accuracy and reliability in handling queries in different languages.
The specific information about the aforementioned four products is as follows (referenced from statistics at aicpb.com):
Rank | Product Name | Category | Monthly Visits (March) | MoM Change |
2 | Perplexity | AI Search Engine | 64.14M | 25.17% |
3 | You.com | AI Search Engine | 10.44M | 14.61% |
5 | Metaso | AI Search Engine | 7.21M | 551.36% |
7 | iAsk | AI Search Engine | 3.43M | 15.43% |
3. Evaluation Data
To comprehensively assess the performance of the aforementioned AI question-answering search engine products in a multilingual environment, we selected six major languages including English, Japanese, Simplified Chinese, etc., for testing. The distribution of test cases in multiple languages is as follows:
Evaluation Languages | Language Proportion |
English | 20% |
Japanese | 20% |
Simplified Chinese | 20% |
Traditional Chinese | 20% |
Russian | 10% |
Korean | 10% |
Total | 100% |
In this evaluation, we designed five usage scenarios (news, local search, technical consulting, product search, and business consulting) to simulate the types of queries users may make in their daily lives.
Search Scenario | Evaluation Criteria | Number of Cases | Percentage of Total |
Technical Consulting | Assessing the ability to understand and answer specialized technical questions | 20 | 20% |
News | Evaluating the capability to track, process real-time information, and provide the latest news events | 20 | 20% |
Local Search | Evaluating the retrieval of local information (such as restaurants, businesses, attractions, transportation, etc.) | 20 | 20% |
Product Search | Evaluating the provision of accurate and useful product information | 20 | 20% |
Business Consulting | Assessing the ability to perform professional information analysis and retrieval within the business sector | 20 | 20% |
Total | | 100 | 100% |
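The two splits above can be checked mechanically. The following is a minimal sketch, not the authors' tooling: the `EvalCase` record, its field names, and both helper functions are hypothetical, and the per-language counts assume that the 100 cases divide exactly according to the stated proportions.

```python
from collections import Counter
from dataclasses import dataclass

# Proportions and case counts copied from the two tables above.
LANGUAGE_SHARE = {
    "English": 0.20,
    "Japanese": 0.20,
    "Simplified Chinese": 0.20,
    "Traditional Chinese": 0.20,
    "Russian": 0.10,
    "Korean": 0.10,
}
SCENARIO_CASES = {
    "Technical Consulting": 20,
    "News": 20,
    "Local Search": 20,
    "Product Search": 20,
    "Business Consulting": 20,
}


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical record for one test case; field names are assumptions."""
    language: str
    scenario: str
    question: str
    ground_truth: str


def expected_language_counts(total_cases: int) -> dict[str, int]:
    """Convert the language proportions into whole-number case counts."""
    return {lang: round(share * total_cases) for lang, share in LANGUAGE_SHARE.items()}


def distribution(cases, field: str) -> Counter:
    """Count cases by one attribute, e.g. 'language' or 'scenario'."""
    return Counter(getattr(case, field) for case in cases)


total = sum(SCENARIO_CASES.values())
print(expected_language_counts(total))
```

Comparing `distribution(cases, "language")` against `expected_language_counts(total)` is one way to confirm that a loaded test set matches the published split.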
We recognize that while these five scenarios cover a range of common query types, they may not fully encompass all the potential needs of users. Therefore, we plan to continuously expand our test collection in future evaluations, by introducing more categories and scenarios to ensure that our assessment can more comprehensively reflect the actual performance and user experience of the products.
Our complete set of test cases is open-sourced and can be accessed through the link below:
4. Evaluation Method and Results
Testing Method
Accuracy: Human Evaluation
The actual answers are manually compared with the Ground truth. If the answer is a "match", it is marked as "True"; otherwise, it is "False".
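Once every answer carries a manual True/False label, the Accuracy figure is simply the share of True labels within a group. The sketch below illustrates this under stated assumptions: the row layout (`product`, `language`, `match`) and the function name `accuracy_by` are hypothetical, and the sample rows reuse only the verdicts of the Japanese accident case and the Korean Galaxy S24 case from the Case Analysis section.

```python
from collections import defaultdict


def accuracy_by(labels, key):
    """Return the fraction of rows marked as a match, grouped by `key`.

    `labels` is an iterable of dicts with the (assumed) keys
    'product', 'language', and 'match' (True if the answer matched
    the Ground truth in the manual review).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in labels:
        group = row[key]
        totals[group] += 1
        hits[group] += int(row["match"])
    return {group: hits[group] / totals[group] for group in totals}


# Manual verdicts taken from two cases in the Case Analysis section.
labels = [
    {"product": "Perplexity", "language": "Japanese", "match": False},
    {"product": "Metaso", "language": "Japanese", "match": True},
    {"product": "iAsk", "language": "Japanese", "match": True},
    {"product": "You.com", "language": "Japanese", "match": False},
    {"product": "Perplexity", "language": "Korean", "match": False},
    {"product": "Metaso", "language": "Korean", "match": True},
    {"product": "iAsk", "language": "Korean", "match": False},
    {"product": "You.com", "language": "Korean", "match": False},
]

print(accuracy_by(labels, "product"))
print(accuracy_by(labels, "language"))
```

The same function gives per-language accuracy when called with `key="language"`, which is how the per-language comparisons in this report could be tabulated.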
Answer Correctness: refer to methods in Ragas
We use the Answer correctness metric from the Ragas framework, with the GPT-4-Turbo model as the evaluator. The formula for calculating "Answer correctness" is given in the metric description:
Metric Description: https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html
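As we understand the Ragas documentation, Answer correctness combines a factual F1 score, computed from statements the evaluator model classifies as true positives (TP), false positives (FP), and false negatives (FN), with a semantic similarity score. The sketch below reproduces that arithmetic only; it does not call Ragas. The function names are hypothetical, and the default weights of 0.75 (factual) and 0.25 (semantic) are an assumption about the library defaults that this report does not state.

```python
def factual_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over classified statements: TP / (TP + 0.5 * (FP + FN))."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted average of factual F1 and semantic similarity.

    `weights` is (factual, semantic); the default values are assumed,
    not confirmed by this report.
    """
    w_fact, w_sem = weights
    weighted = w_fact * factual_f1(tp, fp, fn) + w_sem * semantic_similarity
    return weighted / (w_fact + w_sem)


# Example: 2 supported statements, 1 unsupported, 1 missing, similarity 0.8.
print(round(answer_correctness(tp=2, fp=1, fn=1, semantic_similarity=0.8), 4))
```

Because the score is a weighted average, an answer that finds no relevant information can still earn partial credit through semantic similarity, which is worth keeping in mind when comparing AC with the manual Accuracy labels.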
Evaluation Metrics
Metric | Definition of Evaluation | Advantages | Disadvantages |
Accuracy | Manual evaluation of the actual answers based on Ground truth and key words | Precise assessment | Labor-intensive |
Answer correctness | GPT-4-Turbo comparison of Ground truth with actual answers | Less labor-intensive | Average quality of assessment |
The Accuracy results are summarized in the Abstract. The table below lists the overall Accuracy of each product alongside its Answer Correctness (AC) score.
Product Name | Accuracy | AC |
Perplexity | 65% | 65.22% |
Metaso | 66% | 62.7% |
iAsk | 60% | 62.27% |
You.com | 34% | 58.27% |
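The two metrics do not always agree, and the size of the disagreement can be read directly off the table. The following sketch uses only the figures above; the names `RESULTS`, `metric_gap`, and `rank` are hypothetical helpers introduced for illustration.

```python
# (Accuracy %, Answer Correctness %) copied from the table above.
RESULTS = {
    "Perplexity": (65.0, 65.22),
    "Metaso": (66.0, 62.7),
    "iAsk": (60.0, 62.27),
    "You.com": (34.0, 58.27),
}


def metric_gap(results):
    """AC minus Accuracy in percentage points, per product."""
    return {name: round(ac - acc, 2) for name, (acc, ac) in results.items()}


def rank(results, index):
    """Products ordered from best to worst on one metric (0 = Accuracy, 1 = AC)."""
    return sorted(results, key=lambda name: results[name][index], reverse=True)


print(metric_gap(RESULTS))
print("By Accuracy:", rank(RESULTS, 0))
print("By AC:", rank(RESULTS, 1))
```

On these figures the largest gap is You.com, whose AC is about 24 points above its manual Accuracy, and the top two products swap places between the two rankings. One plausible reading is that the automated metric awards partial credit to answers that human reviewers mark as False.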
5. Case Analysis
News
Ground truth:最近の埼玉県行田市の事故での死傷者数は以下の通りです:\n死亡者数: 1人\n負傷者数: 3人
Comment: Perplexity and You.com did not find relevant results. The answer provided by Metaso is basically consistent with the Ground truth. The response from iAsk is in English, but the description and source of the answer are correct.
Product Name | Answer | Accuracy |
Perplexity | 申し訳ありませんが、与えられた検索結果には埼玉県行田市での事故の死傷者数に関する情報がありません。3の埼玉県警察のウェブサイトには県内の交通事故発生状況が掲載されていますが、行田市に限定した情報はありません。したがって、この質問に対する正確な答えを出すことはできません。 | False |
Metaso | しかし、この質問に対する直接的な回答として、最新の事故で1人死亡、3人けがをしたことが正确です。 | True |
iAsk | The recent accident in Saitama Prefecture, specifically in the city of Gyoda, resulted in 4 people being killed or injured. The accident involved a 76-year-old female driver whose car veered out of its lane and collided with a total of 4 vehicles, including cars and trucks. The incident occurred on a single-lane city road in Gyoda. The woman’s vehicle crossed into oncoming traffic, leading to collisions with two oncoming cars, another car, and a large truck. As a result of the accident, a 47-year-old woman who was driving one of the oncoming cars tragically lost her life. Additionally, her 8-year-old daughter who was with her, the 52-year-old male driver of the truck, and the elderly female driver who caused the initial collision sustained varying degrees of injuries. The Gyoda Police Station is currently investigating the exact cause of the accident while monitoring the recovery of the injured parties. These sources were utilized to gather information regarding the recent accident in Gyoda City, Saitama Prefecture, providing details about the number of casualties and the circumstances surrounding the tragic incident. | True |
You.com | 最近の埼玉県行田市の事故の死傷者数に関する情報は、提供された検索結果からは見つかりませんでした。申し訳ありませんが、詳細な情報を提供することができません。お手数ですが、別の情報 | False |
Local Search
Ground truth:「進撃の巨人」は、諫山創による日本の漫画作品で、2009年9月9日に発行が開始されました。そして、2021年4月9日に連載が完結しました。最新刊(34巻)は2021年6月9日に発売されました⁵。この作品は世界中で非常に人気があり、多くの派生作品が生まれています。
Comment: For the question about the publication date of "進撃の巨人", both Perplexity and iAsk provided the correct dates, with Perplexity offering more detailed information. The responses from Metaso and You.com were inaccurate. Perplexity's answer better demonstrates its advantage in information gathering and processing.
Product Name | Answer | Accuracy |
Perplexity | 進撃の巨人の原作漫画は、2009年9月から2021年4月まで講談社の「別冊少年マガジン」にて連載されていました。13 コミックスは全34巻で、最終巻である34巻が2021年6月9日に発売されました。2 つまり、進撃の巨人の漫画連載期間は2009年9月から2021年4月の約12年間で、コミックス化は2021年6月に完結しています。 | True |
Metaso | 「進撃の巨人」の漫画版は2010年3月17日に発行されました。 | False |
iAsk | 進撃の巨人は2009年9月に発行されました。 | True |
You.com | 進撃の巨人は2013年に発行されました。 | False |
Technical Consulting
Question:Anthoropic Claude のデータポリシーは?
Ground truth:ユーザーの個人データは、モデルのトレーニングに使用されないため、ユーザーデータを利用してモデルを構築することはありません。
Comment: Perplexity's response was irrelevant to the question and did not answer it correctly. Metaso's answer was the most accurate, providing a detailed explanation of Anthropic Claude's data policy. Although iAsk's response did not match the Ground truth directly, verification of its cited source confirmed that its answer was still correct. You.com, on the other hand, did not find any relevant information.
Product Name | Answer | Accuracy |
Perplexity | Anthropic のAIアシスタントClaudeのトレーニングデータは2023年8月までのものであり、それ以降のイベントについては認識していないことが分かります。4 具体的には、Anthropic のサポート記事によると、「Claudeのデータは常に更新していますが、Claudeは2023年8月までのデータで訓練されました。この日付より後のイベントについて尋ねた場合、Claudeはそれらを認識していません。」と説明されています。4 つまり、Anthropic Claude のトレーニングデータは2023年8月までのものであり、それ以降の最新の情報については認識していないということが分かります。 | False |
Metaso | Anthoropic Claudeのデータポリシーは、業界標準のベストプラクティスを使用しており、ユーザーが入力したプロンプトや会話はモデルのトレーニングに使用されません。これは、用户的な情報の漏洩を防ぐための措置であり、用户的なデータの処理や保持に関する透明性と安全性を保証しています。これは、用户的な情報の漏洩を最小限に抑えるための措置であり、用户的なデータの処理や保持に関する透明性と安全性を保証しています。 | True |
iAsk | Anthropic Claude のデータポリシーは以下の通りです: Anthropicは、個人データを保持する期間や使用方法について明確なポリシーを設けています。消費者向け製品やベータ/評価製品の場合、個人データは、目的や基準に合理的に必要な限り保持されます。一方、ビジネスやエンタープライズのお客様向けには、サービス契約でデータ保持期間が説明されています。 主なポイント: データ保持期間: 消費者向け製品やベータ/評価製品の場合、受信または生成後90日以内にバックエンドでプロンプトと出力が自動的に削除されます。 特別なケース: 例外として、信頼と安全の分類器にフラグが立てられたプロンプトは最大2年間保持されます。 その他の規定: オプトインや明示的同意がある場合、データは10年間保持されることもあります。 Anthropicは個人データの性質や関連活動、対話時期と期間、正当な利益と法的義務を考慮してデータ保持期間を決定します。法律上必要な場合や使用ポリシー違反を防ぐために必要な場合を除き、個人データは適切な措置の下で管理されます。 | True |
You.com | Anthropic Claude のデータポリシーについては、具体的な情報が見つかりませんでした。データポリシーに関する詳細な情報を提供することができず、申し訳ありません。Anthropic の公式ウェブサイトや公式文書を参照することで、より詳細な情報を入手することができるかもしれません | False |
Product Search
Question:갤럭시 S24의 화면 해상도는 어떻게 되나요?
Ground truth:갤럭시 S24 기본 모델의 화면 해상도는 2340 x 1080 픽셀입니다
Comment: Perplexity provided a resolution that does not exist, while iAsk gave a higher resolution that is incorrect for the base model. You.com's answer was also incorrect. Metaso performed the best, providing the accurate resolution and additional information about the different models.
Product Name | Answer | Accuracy |
Perplexity | 갤럭시 S24의 화면 해상도는 1440 x 3088 픽셀입니다.5 화면 해상도 갤럭시 S24는 6.1인치 OLED 디스플레이를 탑재하며, 1440 x 3088 픽셀의 높은 해상도를 지원합니다.5 이는 매우 선명하고 생생한 화질을 제공할 것으로 예상됩니다. 디스플레이 주요 사양 디스플레이 종류: 6.1인치 OLED5 해상도: 1440 x 3088 픽셀5 재생률: 120Hz5 따라서 갤럭시 S24는 QHD+ 해상도의 고화질 디스플레이를 탑재하여 영화, 게임, 웹서핑 등 다양한 멀티미디어 활용에 최적화된 시청 경험을 제공할 것으로 보입니다.5 | False |
Metaso | 三星Galaxy S24系列的屏幕分辨率根据不同的型号有所不同。Galaxy S24采用了6.2英寸的Dynamic AMOLED屏幕,分辨率为2340x1080,支持120Hz自适应刷新率。而Galaxy S24+和Galaxy S24 Ultra则提供了更高的分辨率。S24+的屏幕分辨率为3120x1440,这是一个QHD+级别的分辨率。同样,Galaxy S24 Ultra也配备了分辨率为3120x1440的第二代动态AMOLED显示屏。因此,可以总结说,三星Galaxy S24系列中,标准版S24的屏幕分辨率为2340x1080,而S24+和S24 Ultra的屏幕分辨率为3120x1440。 | True |
iAsk | 갤럭시 S24의 화면 해상도는 QHD+ (3120x1440)입니다. Top 3 Authoritative Sources Used in Answering this Question: BetaNews.net Samsung Newsroom Official Samsung Galaxy Unpacked Event 2024 Coverage The information regarding the Galaxy S24’s display resolution of QHD+ (3120x1440) was sourced from multiple reliable outlets, including BetaNews.net, Samsung Newsroom, and the official coverage of the Samsung Galaxy Unpacked Event 2024. These sources provided detailed insights into the specifications and features of the Galaxy S24 series, ensuring accuracy and credibility in the answer provided. | False |
You.com | Galaxy S24 Display Resolution The rumored display resolution for the Samsung Galaxy S24 is 2,316 x 1,080 pixels (FHD+). This would be the same resolution as the Galaxy S23 and S23+, which have displays that are 6.1 inches and 6.6 inches respectively. The Galaxy S24 is expected to have a 6.1 inch AMOLED display like the S23. Leaks and rumors so far suggest the S24 will have a similar screen size and resolution as the S23 for the base model. If true, this would be lower than the Quad HD | False |
Business Consulting
Question:Monica.im в 2023 году имеет сколько пользователей?
Ground truth:В 2023 году у Monica.im уже более 1,5 миллионов пользователей.
Comment: Metaso provided the correct answer directly, while Perplexity and iAsk seemed more focused on explaining their search process rather than actually solving the problem. You.com's response contained no useful information.
Product Name | Answer | Accuracy |
Perplexity | К сожалению, в предоставленных поисковых результатах нет информации о количестве пользователей Monica.im в 2023 году. Результаты содержат различную информацию, не связанную с этим вопросом, такую как описание сайта для знакомств, рекламу юридических услуг, инструкции по использованию инструментов и другие случайные темы. Поскольку запрошенные данные отсутствуют, я не могу дать точного ответа на этот вопрос. Monica.im - это сайт знакомств и социальная сеть, но без дополнительной информации из надежных источников невозможно определить количество ее пользователей в 2023 году. | False |
Metaso | Monica.im 在 2023 年已有超过 150 万用户。 | True |
iAsk | Monica.im User Count in 2023: To determine the number of users Monica.im had in 2023, we need to gather information from reliable sources that provide data on the user base of Monica.im during that specific year. Researching Monica.im User Count in 2023: Official Reports: Checking any official reports or announcements made by Monica.im or its parent company regarding user statistics for the year 2023. Industry Analysis: Reviewing industry reports or analyses that may have covered the growth and user numbers of AI assistants like Monica.im in 2023. Tech News Outlets: Exploring tech news websites or publications that might have reported on the user metrics or milestones achieved by Monica.im in 2023. Financial Statements: Examining any financial statements or disclosures from the company behind Monica.im that could potentially include user count information. User Reviews and Feedback: Considering user reviews, feedback, or testimonials that might indirectly hint at the popularity and scale of Monica.im’s user base in 2023. Answer: After conducting a thorough search across various sources, unfortunately, I couldn’t find specific data on the exact number of users Monica.im had in 2023. The information may not be readily available or publicly disclosed through official channels or reports accessible at this time. Top 3 Authoritative Sources Used: Official Reports Industry Analysis Tech News Outlets | False |
You.com | В 2023 году количество пользователей Monica.im не указано в предоставленных источниках. Нет информации о точном числе пользователей на указанный год. | False |
6. Conclusion
This evaluation conducted an in-depth multilingual performance analysis of four leading AI question-answering search engine products on the market. By testing in six language environments including English, Japanese, Simplified Chinese, and others, across five categories of scenarios such as real-time news, local search, technical consulting, product search, and business consulting, we have reached the following conclusions:
Overall Performance: All participating AI question-answering search engines showed potential in some areas, but the overall performance did not meet the expected level. Metaso and Perplexity had a clear lead in the tests, but there is still considerable room for improvement.
Language Accuracy: The accuracy rate for answers to English questions was generally higher than for other languages, with Russian and Japanese having lower accuracy rates, indicating that cross-language processing capabilities remain a challenge.
Product-Specific Performance: Perplexity excelled in answering questions in English, while Metaso shone in handling Simplified and Traditional Chinese. iAsk performed relatively evenly across all languages, while You.com was almost unusable for non-English queries.
In the future, we plan to expand our evaluation test set, adding more languages and query scenarios to provide a more comprehensive and detailed performance assessment. We believe that with technological advancements and the enrichment of datasets, AI question-answering search engines will be able to better meet the diverse needs of global users and play an even more important role in the future of search technology.