add elasticsearch notebook

JKusio · JKusio · commit 4cec037d8897 · 2024-03-18T23:10:22.000+01:00
diff --git a/02-ElasticSearch.ipynb b/02-ElasticSearch.ipynb
@@ -0,0 +1,267 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# ElasticSearch\n",
+    "\n",
+    "In this notebook we'll implement a simple search using ElasticSearch.\n",
+    "Because ES does not support polish language natively, we'll use polish analyzer plugin to add polish language support.\n",
+    "\n",
+    "There are many ways we can add this support. The simplest is to set a polish analyzer to selected properties. \n",
+    "\n",
+    "We can also set a custom tokenizers. That would let us have more control over the way the fields are tokenized. Some words like \"e-mail\" might be split into \"e\" and \"mail\" which is not something we want. \n",
+    "\n",
+    "There is also a plugin created by Allegro, which might work better for some parts. It's also possible to add a dictionary of synonyms.\n",
+    "\n",
+    "In later notebooks we'll look at how each of this methods works and what result can we achieve."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 81,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'test_index'})"
+      ]
+     },
+     "execution_count": 81,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from elasticsearch import Elasticsearch\n",
+    "\n",
+    "es = Elasticsearch('http://localhost:9200')\n",
+    "\n",
+    "mapping = {\n",
+    "    \"mappings\": {\n",
+    "        \"properties\": {\n",
+    "            \"content\": {\n",
+    "                \"type\": \"text\",\n",
+    "                \"analyzer\": \"polish\"\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "es.options(ignore_status=404).indices.delete(index='test_index')\n",
+    "es.options(ignore_status=400).indices.create(index='test_index', body=mapping)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Analyze the query"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 76,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "matematyka\n",
+      "super\n",
+      "podobać\n",
+      "trójkąt\n",
+      "równania\n"
+     ]
+    }
+   ],
+   "source": [
+    "text = \"Matematyka jest super! Bardzo mi się podobają trójkąty i równania.\"\n",
+    "\n",
+    "analysis = es.indices.analyze(index='test_index', body={\n",
+    "    'analyzer': 'polish',\n",
+    "    'text': text\n",
+    "})\n",
+    "\n",
+    "for token in analysis['tokens']:\n",
+    "    print(token['token'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Add data to elastic search"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 78,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Tokens for document 0 :\n",
+      "historia\n",
+      "polski\n",
+      "ciekawy\n",
+      "mieszko\n",
+      "pierwszey\n",
+      "władca\n",
+      "polski\n",
+      "966\n",
+      "przyjąć\n",
+      "chrzesć\n",
+      "polski\n",
+      "stała\n",
+      "kraj\n",
+      "chrześcijański\n",
+      "\n",
+      "Tokens for document 1 :\n",
+      "maria\n",
+      "curia\n",
+      "skłodowski\n",
+      "wybitny\n",
+      "polski\n",
+      "naukowczynić\n",
+      "otrzymać\n",
+      "dwie\n",
+      "nagroda\n",
+      "nobl\n",
+      "pierwy\n",
+      "kobieta\n",
+      "otrzymać\n",
+      "nagroda\n",
+      "chemia\n",
+      "pasja\n",
+      "odkryć\n",
+      "pierwiastek\n",
+      "promieniotwórczy\n",
+      "\n",
+      "Tokens for document 2 :\n",
+      "trygonometria\n",
+      "działem\n",
+      "matematyka\n",
+      "zajmować\n",
+      "on\n",
+      "badać\n",
+      "zależność\n",
+      "międyć\n",
+      "bok\n",
+      "kąt\n",
+      "trójkąt\n",
+      "trygonometria\n",
+      "ważny\n",
+      "nawigacja\n",
+      "dzięk\n",
+      "móc\n",
+      "określić\n",
+      "położen\n",
+      "statek\n",
+      "morze\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "sentences = [\n",
+    "    \"Historia Polski jest bardzo ciekawa. Mieszko I był pierwszym władcą Polski. W roku 966 przyjął chrzest. W ten sposób Polska stała się krajem chrześcijańskim.\",\n",
+    "    \"Maria Curie-Skłodowska była wybitną polską naukowczynią. Otrzymała dwie Nagrody Nobla. Była pierwszą kobietą, która otrzymała tę nagrodę. Chemia była jej pasją. Odkryła pierwiastki promieniotwórcze.\",\n",
+    "    \"Trygonometria jest działem matematyki. Zajmuje się ona badaniem zależności między bokami i kątami trójkątów. Trygonometria jest bardzo ważna w nawigacji. Dzięki niej możemy określić położenie statku na morzu.\",\n",
+    "]\n",
+    "\n",
+    "for i, sentence in enumerate(sentences):\n",
+    "    es.index(index='test_index', id=i, body={'content': sentence})\n",
+    "\n",
+    "response = es.search(index='test_index', body={\"query\": {\"match_all\": {}}})\n",
+    "\n",
+    "# For each document in the response, analyze its 'content' field\n",
+    "for hit in response['hits']['hits']:\n",
+    "    document = hit['_source']\n",
+    "    content = document['content']\n",
+    "    \n",
+    "    # Analyze the content\n",
+    "    analysis = es.indices.analyze(index='test_index', body={\n",
+    "        'analyzer': 'polish',\n",
+    "        'text': content\n",
+    "    })\n",
+    "    \n",
+    "    # Print the tokens\n",
+    "    print(\"Tokens for document\", hit['_id'], \":\")\n",
+    "    for token in analysis['tokens']:\n",
+    "        print(token['token'])\n",
+    "    print()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Query the data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 79,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Liczba wyników: 1\n",
+      "Dokument: Trygonometria jest działem matematyki. Zajmuje się ona badaniem zależności między bokami i kątami trójkątów. Trygonometria jest bardzo ważna w nawigacji. Dzięki niej możemy określić położenie statku na morzu.\n",
+      "Wynik: 0.9768399\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "query = {\n",
+    "    \"query\": {\n",
+    "        \"match\": {\n",
+    "            \"content\": {\n",
+    "                \"query\": \"matematyka\",\n",
+    "                \"analyzer\": \"polish\"\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "results = es.search(index='test_index', body=query)\n",
+    "\n",
+    "print('Liczba wyników:', results['hits']['total']['value'])\n",
+    "for hit in results['hits']['hits']:\n",
+    "    print('Dokument:', hit['_source']['content'])\n",
+    "    print('Wynik:', hit['_score'])\n",
+    "    print()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "env",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}