{"id":1369,"date":"2024-11-18T08:44:02","date_gmt":"2024-11-18T07:44:02","guid":{"rendered":"https:\/\/solr.pl\/?p=1369"},"modified":"2024-11-18T08:44:03","modified_gmt":"2024-11-18T07:44:03","slug":"apache-solr-embeddings-how-to-start","status":"publish","type":"post","link":"https:\/\/solr.pl\/en\/2024\/11\/18\/apache-solr-embeddings-how-to-start\/","title":{"rendered":"Apache Solr &#038; Embeddings &#8211; How to Start?"},"content":{"rendered":"\n<p>Semantic search and everything related to machine learning has become a very popular topic. To be honest, it\u2019s not only semantic search itself but, due to the massive popularity of so-called Large Language Models and more, many organizations using Solr are trying to implement various query logic based on machine models, RAG techniques, or rescoring. Since it\u2019s been relatively quiet on this front for a while, it\u2019s time to delve into the topic. In this post, we\u2019ll focus on search and explore what Solr has to offer for data retrieval based on vectors.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Short Problem Description<\/h2>\n\n\n\n<p>Let\u2019s assume that our documents consist of a series of fields used for full-text search, along with a field that we want to use for semantic search. This field contains tags that describe the document.<\/p>\n\n\n\n<p>It\u2019s important to highlight that this is not an ideal data example\u2014for instance, we won\u2019t be using context, the tags themselves are single words, and we also want to build a single vector for the document, which doesn\u2019t make the best use of tags. Nonetheless, we\u2019ll proceed with this approach and index the prepared data in Solr.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Solr Preparation<\/h2>\n\n\n\n<p>Preparing the data is a bit more complex. 
We won\u2019t create our own model but will instead use one available on Hugging Face, specifically <a href=\"https:\/\/huggingface.co\/sentence-transformers\/all-MiniLM-L6-v2\">https:\/\/huggingface.co\/sentence-transformers\/all-MiniLM-L6-v2<\/a>. This is one of many models that enable work with sentences and paragraphs. It\u2019s important to remember that calculating the vector involves not only analyzing the words or the sentence itself but also the context in which they occur. The resulting embedding encodes all the relevant information, allowing algorithms to find similar documents. We\u2019ll take a shortcut and use this model to process the tags field, but as I mentioned earlier\u2014the model isn\u2019t the main focus today; we\u2019re interested in Solr.<\/p>\n\n\n\n<p>To do this, we\u2019ll use Solr 9.7.0 and a field based on the <strong>solr.DenseVectorField<\/strong> class. We\u2019ll start by creating a new field type that looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">{\n  \"name\":\"knn_vector_384\",\n  \"class\":\"solr.DenseVectorField\",\n  \"vectorDimension\":384,\n  \"similarityFunction\":\"cosine\",\n  \"knnAlgorithm\":\"hnsw\"\n}<\/code><\/pre>\n\n\n\n<p>Here, we have the name <strong>knn_vector_384<\/strong>, the class <strong>solr.DenseVectorField<\/strong>, a vector dimension of 384, the cosine similarity method for vector comparison, and the proximity calculation algorithm &#8211; hnsw &#8211; which is currently the only one available. The full documentation for the <strong>solr.DenseVectorField<\/strong> class can be found in the <a href=\"https:\/\/solr.apache.org\/guide\/solr\/latest\/query-guide\/dense-vector-search.html\">official documentation<\/a>. 
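<\/p>\n\n\n\n<p>To get an intuition for what the <strong>cosine<\/strong> similarity function compares, here is a tiny self-contained sketch in plain Python. It is for illustration only; Solr computes this internally during vector search:<\/p>\n\n\n\n
```python
import math

# cosine similarity: the comparison function we chose for our vector field;
# it measures the angle between two vectors, ignoring their magnitudes
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# two toy 3-dimensional vectors; vectors pointing the same way would give 1.0
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # close to 0.5
```
\n\n\n\n<p>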
It\u2019s worth mentioning that we\u2019ll be using a model that generates 384-dimensional vectors, which is why we define our type in this way.<\/p>\n\n\n\n<p>Our documents will consist of five fields:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>identifier (<em>id<\/em> field),<\/li>\n\n\n\n<li>name (<em>name<\/em> field),<\/li>\n\n\n\n<li>embeddings (<em>vector<\/em> field),<\/li>\n\n\n\n<li>product category (<em>category<\/em> field),<\/li>\n\n\n\n<li>product tags (<em>tags<\/em> field).<\/li>\n<\/ul>\n\n\n\n<p>To create the test collection and prepare it for indexing the documents we can use the following commands:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">$ bin\/solr create -c test -s 1 -rf 1\n\n$ curl -XPOST -H 'Content-type:application\/json' 'http:\/\/localhost:8983\/solr\/test\/schema' --data-binary '{\n  \"add-field-type\" : {\n    \"name\":\"knn_vector_384\",\n    \"class\":\"solr.DenseVectorField\",\n    \"vectorDimension\":384,\n    \"similarityFunction\":\"cosine\",\n    \"knnAlgorithm\":\"hnsw\"\n  },\n  \"add-field\" : [\n      {\n        \"name\":\"vector\",\n        \"type\":\"knn_vector_384\",\n        \"indexed\":true,\n        \"stored\":false\n      },\n      {\n        \"name\":\"name\",\n        \"type\":\"text_general\",\n        \"multiValued\":false,\n        \"indexed\":true,\n        \"stored\":true\n      },\n      {\n        \"name\":\"category\",\n        \"type\":\"string\",\n        \"multiValued\":false,\n        \"indexed\":true,\n        \"stored\":true\n      },\n      {\n        \"name\":\"tags\",\n        \"type\":\"string\",\n        \"multiValued\":true,\n        \"indexed\":true,\n        \"stored\":true\n      }\n    ]\n}'<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Data Preparation<\/h2>\n\n\n\n<p>Preparing the data is a bit more complex. 
As mentioned above, we\u2019ll use the <a href=\"https:\/\/huggingface.co\/sentence-transformers\/all-MiniLM-L6-v2\">sentence-transformers\/all-MiniLM-L6-v2<\/a> model from Hugging Face to turn the contents of the <strong>tags<\/strong> field into vectors.<\/p>\n\n\n\n<p>To index the documents, we will use the following Python code, stored in a file called <strong>index.py<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">import pysolr\nimport uuid\nfrom typing import List, Dict\nimport torch\nfrom transformers import AutoTokenizer, AutoModel\nimport numpy as np\n\nclass DocumentEmbedder:\n    def __init__(self):\n        self.model_name = \"sentence-transformers\/all-MiniLM-L6-v2\"\n        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)\n        self.model = AutoModel.from_pretrained(self.model_name)\n        \n        self.device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n        self.model.to(self.device)\n\n    def mean_pooling(self, model_output, attention_mask):\n        token_embeddings = model_output[0]\n        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()\n        return torch.sum(token_embeddings * input_mask_expanded, 1) \/ torch.clamp(input_mask_expanded.sum(1), min=1e-9)\n\n    def get_embedding(self, text: str) -&gt; np.ndarray:\n        encoded_input = self.tokenizer(\n            text,\n            padding=True,\n            truncation=True,\n            max_length=512,\n            return_tensors='pt'\n        )\n        \n        encoded_input = {k: v.to(self.device) for k, v in encoded_input.items()}\n\n        with torch.no_grad():\n            model_output = self.model(**encoded_input)\n\n  
      sentence_embeddings = self.mean_pooling(model_output, encoded_input['attention_mask'])\n        sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)\n        \n        return sentence_embeddings.cpu().numpy()\n\nclass SolrIndexer:\n    def __init__(self, solr_url: str = 'http:\/\/localhost:8983\/solr\/test'):\n        self.solr = pysolr.Solr(solr_url, always_commit=True)\n        self.embedder = DocumentEmbedder()\n\n    def create_document(self, name: str, tags: str, category: str) -&gt; Dict:\n        \"\"\"Create a document with embeddings.\"\"\"\n        embedding = self.embedder.get_embedding(tags)\n        \n        doc = {\n            'id': str(uuid.uuid4()),\n            'name': name,\n            'tags': tags.split(', '),  \n            'category': category,\n            'vector': embedding.flatten().tolist()\n        }\n        return doc\n\n    def index_documents(self, documents: List[Dict]):\n        \"\"\"Index multiple documents to Solr.\"\"\"\n        try:\n            self.solr.add(documents)\n            print(f\"Successfully indexed {len(documents)} documents\")\n        except Exception as e:\n            print(f\"Error indexing documents: {str(e)}\")\n\ndef main():\n    documents = [\n        {\"name\": \"Apple iPhone 13\", \"tags\": \"phone, smartphone, screen, iOS\", \"category\": \"phone\"},\n        {\"name\": \"Apple iPhone 14\", \"tags\": \"phone, smartphone, screen, iOS\", \"category\": \"phone\"},\n        {\"name\": \"Apple iPhone 15\", \"tags\": \"phone, smartphone, screen, iOS\", \"category\": \"phone\"},\n        {\"name\": \"Samsung Galaxy S24\", \"tags\": \"phone, smartphone, screen, Android\", \"category\": \"phone\"},\n        {\"name\": \"Apple iPod\", \"tags\": \"music, screen, iOS\", \"category\": \"music player\"},\n        {\"name\": \"Samsung Microwave\", \"tags\": \"kitchen, cooking, electric\", \"category\": \"household\"}\n    ]\n    \n    indexer = SolrIndexer()\n    \n    
solr_documents = []\n    for doc in documents:\n        solrdoc = indexer.create_document(doc['name'], doc['tags'], doc['category'])\n        solr_documents.append(solrdoc)\n        print(f\"Created document: {solrdoc['name']}\")\n        print(f\"Vector length: {len(solrdoc['vector'])}\")\n        \n    indexer.index_documents(solr_documents)\n\nif __name__ == \"__main__\":\n    main()\n<\/code><\/pre>\n\n\n\n<p>A few comments about the code:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>DocumentEmbedder<\/strong> class is responsible for creating the vector; if we want to switch models, we only need to change the value of <strong>self.model_name<\/strong>. However, it\u2019s important to remember that we configured Solr to work with 384-dimensional vectors, so if the new model produces vectors of a different dimension, a different Solr configuration will be required.<\/li>\n\n\n\n<li>The <strong>SolrIndexer<\/strong> class handles the indexing of documents.<\/li>\n\n\n\n<li>The documents to be indexed are defined in the <strong>documents<\/strong> variable.<\/li>\n<\/ul>\n\n\n\n<p>To run the code, there are dependencies that must be installed. We tested the code on a macOS system running on an ARM-based processor. 
The <strong>requirements.txt<\/strong> file looks as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">--extra-index-url https:\/\/download.pytorch.org\/whl\/cpu\ntransformers==4.37.2\ntorch&gt;=2.2.0\nnumpy&gt;=1.24.3\nsentencepiece==0.1.99\npysolr==3.9.0\n<\/code><\/pre>\n\n\n\n<p>Note that the <strong>--extra-index-url<\/strong> option goes on its own line, and that we don\u2019t list a <strong>uuid<\/strong> package, because the <strong>uuid<\/strong> module used by the code is part of the Python standard library. You can install the dependencies and run the code using the following commands:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">$ pip install -r requirements.txt\n\n$ python index.py<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Querying<\/h2>\n\n\n\n<p>Once the documents are successfully indexed, we can start running queries against them. For semantic search we can use one of the two available query parsers:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>knn<\/em><\/li>\n\n\n\n<li><em>vectorSimilarity<\/em><\/li>\n<\/ul>\n\n\n\n<p>When using the <strong>knn<\/strong> parser, we get the top K documents, whereas with the <strong>vectorSimilarity<\/strong> parser, we retrieve documents that meet or exceed a specified vector similarity threshold. An example query looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">http:\/\/localhost:8983\/solr\/test\/query?q={!knn f=vector topK=10}[...]<\/code><\/pre>\n\n\n\n<p>As shown, the above query uses the <strong>knn<\/strong> parser, operates on data from the field named <strong>vector<\/strong>, retrieves the top 10 documents, and in the <strong>[&#8230;]<\/strong> section, our 384-dimensional vector should be placed. How can we generate it? 
We can modify the previous code and calculate the vector for our query in the <strong>main<\/strong> function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">.\n.\n.\n\ndef main():\n    embedder = DocumentEmbedder()\n    embeddings = embedder.get_embedding(\"song player\").flatten().tolist()\n    print(f\"Embeddings: {str(embeddings)}\")\n\nif __name__ == \"__main__\":\n    main()<\/code><\/pre>\n\n\n\n<p>The generated vector can now be used in Solr. Let\u2019s try querying the <strong>tags<\/strong> field with <strong>song player<\/strong>, like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">http:\/\/localhost:8983\/solr\/test\/select?q=tags:\"song player\"<\/code><\/pre>\n\n\n\n<p>The results, as expected, will be empty:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">{\n  \"responseHeader\":{\n    \"zkConnected\":true,\n    \"status\":0,\n    \"QTime\":1,\n    \"params\":{\n      \"q\":\"tags:\\\"song player\\\"\"\n    }\n  },\n  \"response\":{\n    \"numFound\":0,\n    \"start\":0,\n    \"numFoundExact\":true,\n    \"docs\":[ ]\n  }\n}<\/code><\/pre>\n\n\n\n<p>This is the situation we might have expected &#8211; we don\u2019t have a document with that exact tag. Our <strong>Apple iPod<\/strong> document has one of its <strong>tags<\/strong> field values set to <strong>music<\/strong>. Close, but not the same as <strong>song player<\/strong>, at least from the perspective of full-text search, and we aren\u2019t using any synonyms.<\/p>\n\n\n\n<p>So, let\u2019s see how semantic search handles this. To do this, we need to generate a vector from our phrase. 
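<\/p>\n\n\n\n<p>As a convenience, the embedding step and the query can also be combined in a single script. Below is a minimal sketch; the <strong>knn_search<\/strong> and <strong>format_vector<\/strong> helpers are our own illustration, not part of <strong>index.py<\/strong>, and the sketch assumes that the <strong>DocumentEmbedder<\/strong> class from <strong>index.py<\/strong> is importable and that Solr runs locally with our <strong>test<\/strong> collection:<\/p>\n\n\n\n
```python
def format_vector(vector):
    # the knn parser expects the vector as a bracketed, comma-separated list
    return '[' + ','.join(str(v) for v in vector) + ']'

def knn_search(phrase, top_k=10):
    # imports are kept local so the sketch can be read and tested without Solr
    import pysolr
    from index import DocumentEmbedder  # the indexing script shown earlier

    solr = pysolr.Solr('http://localhost:8983/solr/test')
    vector = DocumentEmbedder().get_embedding(phrase).flatten().tolist()
    query = '{!knn f=vector topK=' + str(top_k) + '}' + format_vector(vector)
    return solr.search(query, fl='name,tags,score')

# usage, with Solr running and the collection indexed:
# for doc in knn_search('song player'):
#     print(doc['name'], doc['score'])
```
\n\n\n\n<p>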
Using the code we\u2019ve seen above, our <strong>song player<\/strong> query, after vector generation, will look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">http:\/\/localhost:8983\/solr\/test\/query?fl=name,tags,score,*&amp;q={!knn f=vector topK=10}[-2.00312417e-02,-8.43894295e-03,4.63341828e-03,-7.11187869e-02,-5.41122817e-02,8.86119753e-02,1.43207222e-01,6.48824051e-02,-2.42583733e-02,5.58690503e-02,1.17570441e-02,-8.10808595e-03,5.87443374e-02,-8.77943709e-02,-2.66626403e-02,9.57359672e-02,-3.80967893e-02,6.07994124e-02,-2.05643158e-02,-4.67214771e-02,-1.52253941e-01,-9.71766468e-03,-6.08050935e-02,2.88447887e-02,-2.94179078e-02,1.03829138e-01,-8.31280462e-03,1.05928726e-01,-5.67206480e-02,-7.29391426e-02,8.77882075e-03,3.76622789e-02,1.22778863e-01,7.63954269e-03,-1.07958838e-01,-2.39758939e-02,-4.40737829e-02,-1.58879627e-02,-8.71627405e-02,-1.86017044e-02,-1.60255414e-02,3.48703228e-02,-3.19783390e-02,-1.32465595e-03,-6.66615218e-02,-5.59806228e-02,-3.83943357e-02,-3.26939276e-03,3.50921564e-02,8.62423256e-02,-4.67025265e-02,3.27739231e-02,1.36479884e-02,2.14965921e-02,9.94029175e-03,8.50510597e-03,6.12449907e-02,1.26747936e-01,1.20850094e-02,8.25305879e-02,-3.12836142e-03,-7.23205507e-02,-1.25491228e-02,-4.12022136e-02,8.32084790e-02,-4.56760749e-02,1.24245733e-02,2.99928896e-03,-1.72226392e-02,1.40204923e-02,-5.27366474e-02,8.62174854e-02,-1.23909842e-02,-5.11718355e-03,-5.19047305e-03,-3.26551199e-02,3.38410065e-02,-2.57184170e-02,4.93465960e-02,2.16306932e-02,5.44718355e-02,-5.13004661e-02,-6.29040822e-02,-1.14791520e-01,2.42656711e-02,-5.12294583e-02,-1.41068548e-02,-1.69403590e-02,-3.93591821e-02,2.21420880e-02,-7.29445517e-02,3.97052318e-02,4.06186096e-02,2.38778368e-02,-2.60719825e-02,4.29015867e-02,8.55240896e-02,-1.29460618e-01,-6.10361844e-02,1.73135296e-01,5.50046675e-02,2.11313833e-02,8.09541941e-02,8.49610151e-05,1.86103079e-02,-7.11245313e-02,-2.20125029e-03,8.21124241e-02,-3.17892991e-02,-5.61453328e-02,3.08496933e-02
,-2.24805139e-02,-6.99905232e-02,-1.38647379e-02,2.40420792e-02,1.32726450e-02,3.10667846e-02,1.00519679e-01,-3.77691817e-03,1.09965838e-02,-2.87625156e-02,-3.69284451e-02,-4.66967747e-02,1.17862746e-02,-6.05449453e-02,1.80788971e-02,-7.05301017e-03,-2.43471990e-33,4.00696788e-03,-8.39618146e-02,4.76996899e-02,4.31838222e-02,6.59582578e-03,-2.13763528e-02,1.40098054e-02,4.72629964e-02,-7.40025416e-02,-1.17481779e-02,4.74894270e-02,-2.08981428e-02,-4.10624146e-02,6.36240691e-02,1.19205341e-01,-6.63071573e-02,-5.37127964e-02,1.66804232e-02,-5.87310642e-03,-1.00764409e-02,2.95753870e-02,7.51189962e-02,3.69122699e-02,4.02592048e-02,3.01825050e-02,-3.49015556e-02,3.05062942e-02,-5.49735427e-02,7.76632428e-02,-1.31180901e-02,1.32043369e-03,-7.29682148e-02,4.55190837e-02,-3.90759930e-02,-2.00935621e-02,2.90088002e-02,-4.90487851e-02,-4.68467623e-02,-3.78382057e-02,-8.62234011e-02,-8.21669679e-03,-1.02636190e-02,-4.22154590e-02,-1.01420805e-02,-9.29822177e-02,-4.51174043e-02,1.08679747e-02,4.72158194e-02,1.13880262e-02,2.26492099e-02,1.08083403e-02,4.73796055e-02,-2.11352482e-02,2.94824075e-02,3.32471356e-02,-6.10656738e-02,6.39934689e-02,7.52783706e-03,2.01779045e-02,8.42025783e-03,7.92021900e-02,7.49795884e-02,3.60657759e-02,-5.19200675e-02,-2.61804424e-02,4.30282280e-02,9.71625224e-02,-5.95051460e-02,9.75757018e-02,2.11117323e-02,-3.37768793e-02,-2.30997894e-02,5.96088655e-02,3.78986225e-02,-8.59187767e-02,2.64272131e-02,-5.98726794e-02,-7.63988122e-02,-8.67564604e-03,-4.02248465e-02,-7.74084628e-02,-7.17599411e-03,-6.49854094e-02,5.07407561e-02,-4.55319695e-02,-2.79121399e-02,-3.54578048e-02,-8.02342817e-02,-1.16430726e-02,-3.30321328e-03,-2.42636316e-02,4.07439955e-02,8.91774427e-03,1.66885648e-02,-5.11291474e-02,2.29379760e-33,-2.25291476e-02,-3.75496261e-02,7.23349378e-02,2.80879121e-02,1.11529857e-01,9.52087902e-03,3.09992954e-02,3.44425067e-02,-1.44962538e-02,3.70973274e-02,-3.42840664e-02,-4.15693037e-03,4.41893972e-02,-5.78483054e-03,2.27168929e-02,2.73053981e-02
,-7.61534553e-03,3.05881090e-02,2.22553536e-02,-2.11785771e-02,-8.38738605e-02,-3.54573503e-02,5.55671491e-02,-3.25255133e-02,-6.17663078e-02,-3.31615508e-02,1.28180429e-01,-1.09783243e-02,-4.22303416e-02,8.85861460e-03,1.03141055e-01,-1.19012371e-02,-5.68600930e-02,-9.76827294e-02,-3.66187817e-03,6.69727027e-02,5.01095466e-02,3.45795639e-02,-4.87284325e-02,-2.88722366e-02,3.69899273e-02,4.80774455e-02,9.51483287e-03,1.05140977e-01,9.12032463e-03,7.83540029e-03,3.00276410e-02,1.10446520e-01,2.02773716e-02,6.35717139e-02,-6.48668930e-02,-3.63331586e-02,5.26458211e-02,-8.43572319e-02,-6.71359748e-02,-3.95149644e-03,-3.19298953e-02,-3.83070633e-02,-1.30710788e-02,3.45022231e-02,3.50211672e-02,-1.17928414e-02,-2.96711233e-02,-1.07177338e-02,-3.55232134e-02,1.01115681e-01,3.25268582e-02,3.25036608e-02,-1.87526215e-02,-2.26221383e-02,1.12875113e-02,5.46618700e-02,-9.37254541e-03,5.56712523e-02,-4.75139245e-02,8.30354728e-03,-1.24001637e-01,7.75059313e-02,-2.27639657e-02,-7.71744251e-02,3.50985602e-02,-1.18714431e-02,2.85322741e-02,-1.23035219e-02,3.05858813e-03,-3.02043278e-02,7.86332041e-02,2.63012294e-02,-2.10798401e-02,-2.86972001e-02,4.53165434e-02,5.47545776e-02,-1.12766981e-01,4.01742896e-03,-7.09661469e-03,-1.23211663e-08,-8.58379975e-02,1.30043477e-02,2.08473746e-02,-4.64787520e-02,5.16857654e-02,1.19483555e-02,3.52647863e-02,-1.09199435e-01,3.83973494e-02,3.78849730e-02,3.93284820e-02,-7.54635110e-02,8.52666888e-03,7.53920805e-03,2.69943215e-02,-2.04220396e-02,-5.33770137e-02,9.14978534e-02,-5.69638461e-02,4.34034877e-02,1.58696324e-02,6.27059937e-02,1.16019472e-02,-3.03725917e-02,2.53463928e-02,-1.08486209e-02,-4.15410772e-02,6.96198866e-02,1.25364019e-02,8.21540307e-04,3.80332284e-02,7.23319054e-02,2.98110046e-03,-6.22513592e-02,6.56076372e-02,2.09525451e-02,2.19415948e-02,-1.34995428e-03,-9.06020775e-02,2.12142467e-02,-1.63224037e-03,9.72291678e-02,-3.10167447e-02,-3.09801828e-02,-1.90541670e-02,-3.96139771e-02,2.09502075e-02,-9.64867976e-03,1.07652992e-02,2.9
2998832e-02,-6.14691451e-02,2.42477246e-02,-4.87532094e-02,2.83491425e-02,8.26478079e-02,4.19540368e-02,-3.42742465e-02,5.80971912e-02,-2.19746046e-02,1.28769483e-02,-1.10125542e-02,3.03942598e-02,2.98435502e-02,5.43411188e-02]<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">The Results<\/h2>\n\n\n\n<p>The query we just ran returns the following results:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">{\n  \"responseHeader\":{\n    .\n    .\n    .\n  },\n  \"response\":{\n    \"numFound\":6,\n    \"start\":0,\n    \"maxScore\":0.7036504,\n    \"numFoundExact\":true,\n    \"docs\":[{\n      \"id\":\"0030eb55-95ee-4042-b574-1f1f4d2d0dd4\",\n      \"name\":\"Apple iPod\",\n      \"tags\":[\"music\",\"screen\",\"iOS\"],\n      \"_version_\":1815543355046625280,\n      \"_root_\":\"0030eb55-95ee-4042-b574-1f1f4d2d0dd4\",\n      \"score\":0.7036504\n    },{\n      \"id\":\"f5122858-22bf-47fe-84b4-f0fa5e608d22\",\n      \"name\":\"Samsung Galaxy S24\",\n      \"tags\":[\"phone\",\"smartphone\",\"screen\",\"Android\"],\n      \"_version_\":1815543355045576704,\n      \"_root_\":\"f5122858-22bf-47fe-84b4-f0fa5e608d22\",\n      \"score\":0.5939434\n    },{\n      \"id\":\"9b6aa3b5-f02a-4969-ae21-daaafc58ece4\",\n      \"name\":\"Apple iPhone 13\",\n      \"tags\":[\"phone\",\"smartphone\",\"screen\",\"iOS\"],\n      \"_version_\":1815543355005730816,\n      \"_root_\":\"9b6aa3b5-f02a-4969-ae21-daaafc58ece4\",\n      \"score\":0.5706577\n    },{\n      \"id\":\"bbb744ae-ca90-4663-8a8e-fdbe8e00d935\",\n      \"name\":\"Apple iPhone 14\",\n      \"tags\":[\"phone\",\"smartphone\",\"screen\",\"iOS\"],\n      \"_version_\":1815543355042430976,\n      \"_root_\":\"bbb744ae-ca90-4663-8a8e-fdbe8e00d935\",\n      \"score\":0.5706577\n    },{\n      \"id\":\"f750c960-229b-4fd2-a9f7-c891afbb19d3\",\n      \"name\":\"Apple iPhone 15\",\n      \"tags\":[\"phone\",\"smartphone\",\"screen\",\"iOS\"],\n      \"_version_\":1815543355044528128,\n      
\"_root_\":\"f750c960-229b-4fd2-a9f7-c891afbb19d3\",\n      \"score\":0.5706577\n    },{\n      \"id\":\"756e14fc-6320-4fee-9713-7d0c6123fa8a\",\n      \"name\":\"Samsung Microwave\",\n      \"tags\":[\"kitchen\",\"cooking\",\"electric\"],\n      \"_version_\":1815543355047673856,\n      \"_root_\":\"756e14fc-6320-4fee-9713-7d0c6123fa8a\",\n      \"score\":0.5591891\n    }]\n  }\n}<\/code><\/pre>\n\n\n\n<p>As we can see, we retrieved all the documents, with our <strong>Apple iPod<\/strong> having the highest score, which is now calculated based not on document relevance to the query but on vector similarity. In our case, this is a very limited sample of documents and a tag field that is far from ideal, but we have a functioning, relatively simple example. The <strong>knn<\/strong> parser returned the top 10 documents, which, in our case, means all of them.<\/p>\n\n\n\n<p>If we only want to retrieve documents with a score above a certain threshold, we can use the <strong>vectorSimilarity<\/strong> parser. 
For example, if we want to fetch documents with a similarity score equal to or greater than <strong>0.7<\/strong>, our query would look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">http:\/\/localhost:8983\/solr\/test\/query?q={!vectorSimilarity f=vector minReturn=0.7}[...]<\/code><\/pre>\n\n\n\n<p>And the results of the above query would look as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">{\n  \"responseHeader\":{\n    .\n    .\n    .\n  },\n  \"response\":{\n    \"numFound\":1,\n    \"start\":0,\n    \"maxScore\":0.7036504,\n    \"numFoundExact\":true,\n    \"docs\":[{\n      \"id\":\"0030eb55-95ee-4042-b574-1f1f4d2d0dd4\",\n      \"name\":\"Apple iPod\",\n      \"tags\":[\"music\",\"screen\",\"iOS\"],\n      \"_version_\":1815543355046625280,\n      \"_root_\":\"0030eb55-95ee-4042-b574-1f1f4d2d0dd4\",\n      \"score\":0.7036504\n    }]\n  }\n}<\/code><\/pre>\n\n\n\n<p>And now we only have a single document with a score of at least 0.7.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Filtering<\/h2>\n\n\n\n<p>The performance of the solution will depend on the number of candidates\u2014that is, the documents that Solr will need to process. To reduce this number, we can use filtering with the <strong>preFilter<\/strong> parameter. 
For example, if we only want to retrieve phones (where the <strong>category<\/strong> field is equal to <strong>phone<\/strong>), our query would look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">http:\/\/localhost:8983\/solr\/test\/query?q={!knn f=vector topK=10 preFilter=category:phone}[...]<\/code><\/pre>\n\n\n\n<p>This time the results would look as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">{\n  \"responseHeader\":{\n    .\n    .\n    .\n  },\n  \"response\":{\n    \"numFound\":4,\n    \"start\":0,\n    \"maxScore\":0.5939434,\n    \"numFoundExact\":true,\n    \"docs\":[{\n      \"id\":\"a0f6fe9b-e82a-437e-b8ef-c12b6ee8d4a3\",\n      \"name\":\"Samsung Galaxy S24\",\n      \"tags\":[\"phone\",\"smartphone\",\"screen\",\"Android\"],\n      \"category\":\"phone\",\n      \"_version_\":1815544743947403264,\n      \"_root_\":\"a0f6fe9b-e82a-437e-b8ef-c12b6ee8d4a3\",\n      \"score\":0.5939434\n    },{\n      \"id\":\"ff2a471f-6d31-4dc1-9ec0-f5eacbfc1540\",\n      \"name\":\"Apple iPhone 13\",\n      \"tags\":[\"phone\",\"smartphone\",\"screen\",\"iOS\"],\n      \"category\":\"phone\",\n      \"_version_\":1815544743943208960,\n      \"_root_\":\"ff2a471f-6d31-4dc1-9ec0-f5eacbfc1540\",\n      \"score\":0.5706577\n    },{\n      \"id\":\"9d2955b2-ba6d-4f5d-a477-66909725b362\",\n      \"name\":\"Apple iPhone 14\",\n      \"tags\":[\"phone\",\"smartphone\",\"screen\",\"iOS\"],\n      \"category\":\"phone\",\n      \"_version_\":1815544743945306112,\n      \"_root_\":\"9d2955b2-ba6d-4f5d-a477-66909725b362\",\n      \"score\":0.5706577\n    },{\n      \"id\":\"93e5bd31-78b5-45b9-9c61-a0d8e8c0badb\",\n      \"name\":\"Apple iPhone 15\",\n      \"tags\":[\"phone\",\"smartphone\",\"screen\",\"iOS\"],\n      \"category\":\"phone\",\n      \"_version_\":1815544743946354688,\n      \"_root_\":\"93e5bd31-78b5-45b9-9c61-a0d8e8c0badb\",\n      \"score\":0.5706577\n    }]\n  }\n}<\/code><\/pre>\n\n\n\n<p>This time, we only received 
phones, meaning the documents were properly filtered.<\/p>\n\n\n\n<p>Solr\u2019s filtering behavior is also defined by the query itself. If the <strong>preFilter<\/strong> parameter is absent, Solr will use the standard parameters known from full-text search, such as <strong>fq<\/strong>. We can construct our filtered query using the <strong>fq<\/strong> parameter, and it would look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">http:\/\/localhost:8983\/solr\/test\/query?fq=category:phone&amp;q={!knn f=vector topK=10}[...]<\/code><\/pre>\n\n\n\n<p>The results would be the same &#8211; the four documents we already saw.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>We\u2019ve seen some of the capabilities Solr offers when it comes to using vectors. We\u2019ve explored what the two parsers allow and how to use them. Of course, this example is very simple; for our own projects, we first need to invest time in finding and preparing a model, or in fine-tuning an existing one. The right model depends heavily on the use case. There are many models to choose from, such as <a href=\"https:\/\/huggingface.co\/Snowflake\/snowflake-arctic-embed-m-v1.5\">Snowflake\/snowflake-arctic-embed-m-v1.5<\/a>, or models like <a href=\"https:\/\/huggingface.co\/Marqo\/marqo-ecommerce-embeddings-B\">Marqo\/marqo-ecommerce-embeddings-B<\/a> or <a href=\"https:\/\/huggingface.co\/Marqo\/marqo-ecommerce-embeddings-L\">Marqo\/marqo-ecommerce-embeddings-L<\/a>. It\u2019s important to spend time selecting the right model, testing it, and evaluating its performance. In semantic search, model selection is crucial because the model is responsible for creating the vectors that will be used for searching.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Semantic search and everything related to machine learning has become a very popular topic. 
To be honest, it\u2019s not only semantic search itself but, due to the massive popularity of so-called Large Language Models and more, many organizations using Solr<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27],"tags":[654,164],"class_list":["post-1369","post","type-post","status-publish","format-standard","hentry","category-solr-en","tag-embeddings","tag-solr-2"],"_links":{"self":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/1369","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/comments?post=1369"}],"version-history":[{"count":4,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/1369\/revisions"}],"predecessor-version":[{"id":1376,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/1369\/revisions\/1376"}],"wp:attachment":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/media?parent=1369"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/categories?post=1369"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/tags?post=1369"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}