⚗️AI Search Engine Multilingual Evaluation Report - Complex Query (v1.1)
Conclusion
In our last assessment, we observed that existing AI search engines fell short on intricate questions. Because such complex questions come up frequently in everyday work and life, this evaluation focuses on how well AI search engines resolve them.
During the evaluation, we found that the Basic versions of the various products were not satisfactory. We therefore added Perplexity Pro to the testing scope to see how well the best product on the market can perform. After rigorous testing, we reached the following conclusions:
Perplexity Pro significantly outperformed the other products, achieving an accuracy rate of 80%, while the free versions of the other products were not up to par.
When the retrieved sources are insufficient, LLMs tend to fall back on their own knowledge to infer an answer, which leads to frequent hallucinations.
The LLMs generating answers for Metaso and Perplexity (basic) performed poorly, often providing incorrect answers even when relevant information was available.
Evaluation Data
In our evaluation, complex problems are defined as those involving multiple points of information, with answers that require consolidation or reasoning. To ensure the validity of our tests, we deliberately avoided questions that could be directly answered by a single webpage. Our test cases were primarily constructed around the following two scenarios:
Comparative Questions: These involve comparing entities from different time periods or groups.
Composite Questions: These contain two distinct sub-questions that require a comprehensive answer.
The multilingual distribution of our test cases remains consistent with the previous edition. Our complex problem test cases have been made open source:
Dataset: https://github.com/sparticleinc/ASEED/blob/main/datasets/complex_search.csv
Case Analysis
Question: Which movie is longer, Hamlet or Gone with the Wind?
Ground truth: Hamlet is the longer movie. The key details are: Hamlet (1996), directed by Kenneth Branagh, has a runtime of 242 minutes (4 hours 2 minutes); Gone with the Wind (1939) has a runtime of 221 minutes (3 hours 41 minutes).
Comment: Perplexity Pro performed exceptionally well, providing accurate runtime information and the correct answer. iAsk and You.com also arrived at the correct conclusion. Perplexity (basic), however, was inconsistent and gave contradictory conclusions, which suggests that the quality of its large language model (LLM) has room for improvement. Metaso exhibited similar issues: despite having sufficient data, its responses were still inaccurate.
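The runtime comparison in this case is simple arithmetic, and a short Python sketch makes the check explicit. It uses only the two runtimes quoted in the ground truth above; the helper name format_runtime is hypothetical and not part of the evaluation tooling.

```python
def format_runtime(minutes: int) -> str:
    """Convert a runtime in minutes to an 'H hours M minutes' string."""
    hours, rest = divmod(minutes, 60)
    return f"{hours} hours {rest} minutes"


# Runtimes (in minutes) as quoted in the ground truth for this case.
runtimes = {
    "Hamlet (1996)": 242,
    "Gone with the Wind (1939)": 221,
}

# The longer movie is the one with the larger runtime.
longer = max(runtimes, key=runtimes.get)
difference = max(runtimes.values()) - min(runtimes.values())

for title, minutes in runtimes.items():
    print(f"{title}: {minutes} minutes ({format_runtime(minutes)})")
print(f"Longer movie: {longer}, by {difference} minutes")
```

A correct answer therefore has to retrieve both runtimes and compare them, which is why a response can fail even when both figures appear in its sources.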
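Question: 劉德華比周傳雄(小剛)大幾歲? (How many years older is Andy Lau (劉德華) than Steve Chou (周傳雄, also known as 小剛)?)
Ground truth: Andy Lau (劉德華) was born on September 27, 1961, and Steve Chou (周傳雄, 小剛) was born on June 7, 1969. Andy Lau is therefore about 8 years older than Steve Chou.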
Comment: Perplexity Pro excelled, providing accurate birth dates and the correct answer. All other platforms answered incorrectly, primarily because they failed to retrieve the relevant content. Metaso attempted to infer the answer from the data it had available, but both of its deductions were incorrect.
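The age gap in this case can be worked out directly from the two birth dates in the ground truth. The following is a minimal Python sketch, assuming those dates are correct; the function name age_gap is hypothetical and only serves the illustration.

```python
from datetime import date


def age_gap(older: date, younger: date) -> tuple:
    """Return the gap between two birth dates as (whole years, remaining whole months)."""
    months = (younger.year - older.year) * 12 + (younger.month - older.month)
    # If the day of the month has not been reached yet, the last month is incomplete.
    if younger.day < older.day:
        months -= 1
    return divmod(months, 12)


# Birth dates as stated in the ground truth for this case.
andy_lau = date(1961, 9, 27)
steve_chou = date(1969, 6, 7)

years, months = age_gap(andy_lau, steve_chou)
print(f"{years} years, {months} months")  # prints "7 years, 8 months"
```

The exact gap is 7 years and 8 months, which the ground truth rounds to about 8 years.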