This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

Commit 4cec037

committed Mar 18, 2024
add elasticsearch notebook
1 parent b2ca2cd commit 4cec037

1 file changed

+267
-0
lines changed

02-ElasticSearch.ipynb

+267
@@ -0,0 +1,267 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Elasticsearch\n",
    "\n",
    "In this notebook we'll implement a simple search using Elasticsearch.\n",
    "Because ES does not support the Polish language natively, we'll use a Polish analyzer plugin to add it.\n",
    "\n",
    "There are several ways to add this support. The simplest is to set the `polish` analyzer on selected properties.\n",
    "\n",
    "We can also define custom tokenizers, which gives us more control over how the fields are tokenized. Some words like \"e-mail\" might otherwise be split into \"e\" and \"mail\", which is not something we want.\n",
    "\n",
    "There is also a plugin created by Allegro, which might work better in some cases. It's also possible to add a dictionary of synonyms.\n",
    "\n",
    "In later notebooks we'll look at how each of these methods works and what results we can achieve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'test_index'})"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from elasticsearch import Elasticsearch\n",
    "\n",
    "es = Elasticsearch('http://localhost:9200')\n",
    "\n",
    "mapping = {\n",
    "    \"mappings\": {\n",
    "        \"properties\": {\n",
    "            \"content\": {\n",
    "                \"type\": \"text\",\n",
    "                \"analyzer\": \"polish\"\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "}\n",
    "\n",
    "es.options(ignore_status=404).indices.delete(index='test_index')  # ignore 404: the index may not exist yet\n",
    "es.options(ignore_status=400).indices.create(index='test_index', body=mapping)  # recreate it with the Polish mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyze the query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "matematyka\n",
      "super\n",
      "podobać\n",
      "trójkąt\n",
      "równania\n"
     ]
    }
   ],
   "source": [
    "text = \"Matematyka jest super! Bardzo mi się podobają trójkąty i równania.\"\n",
    "\n",
    "analysis = es.indices.analyze(index='test_index', body={\n",
    "    'analyzer': 'polish',\n",
    "    'text': text\n",
    "})\n",
    "\n",
    "for token in analysis['tokens']:\n",
    "    print(token['token'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add data to Elasticsearch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tokens for document 0 :\n",
      "historia\n",
      "polski\n",
      "ciekawy\n",
      "mieszko\n",
      "pierwszey\n",
      "władca\n",
      "polski\n",
      "966\n",
      "przyjąć\n",
      "chrzesć\n",
      "polski\n",
      "stała\n",
      "kraj\n",
      "chrześcijański\n",
      "\n",
      "Tokens for document 1 :\n",
      "maria\n",
      "curia\n",
      "skłodowski\n",
      "wybitny\n",
      "polski\n",
      "naukowczynić\n",
      "otrzymać\n",
      "dwie\n",
      "nagroda\n",
      "nobl\n",
      "pierwy\n",
      "kobieta\n",
      "otrzymać\n",
      "nagroda\n",
      "chemia\n",
      "pasja\n",
      "odkryć\n",
      "pierwiastek\n",
      "promieniotwórczy\n",
      "\n",
      "Tokens for document 2 :\n",
      "trygonometria\n",
      "działem\n",
      "matematyka\n",
      "zajmować\n",
      "on\n",
      "badać\n",
      "zależność\n",
      "międyć\n",
      "bok\n",
      "kąt\n",
      "trójkąt\n",
      "trygonometria\n",
      "ważny\n",
      "nawigacja\n",
      "dzięk\n",
      "móc\n",
      "określić\n",
      "położen\n",
      "statek\n",
      "morze\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sentences = [\n",
    "    \"Historia Polski jest bardzo ciekawa. Mieszko I był pierwszym władcą Polski. W roku 966 przyjął chrzest. W ten sposób Polska stała się krajem chrześcijańskim.\",\n",
    "    \"Maria Curie-Skłodowska była wybitną polską naukowczynią. Otrzymała dwie Nagrody Nobla. Była pierwszą kobietą, która otrzymała tę nagrodę. Chemia była jej pasją. Odkryła pierwiastki promieniotwórcze.\",\n",
    "    \"Trygonometria jest działem matematyki. Zajmuje się ona badaniem zależności między bokami i kątami trójkątów. Trygonometria jest bardzo ważna w nawigacji. Dzięki niej możemy określić położenie statku na morzu.\",\n",
    "]\n",
    "\n",
    "for i, sentence in enumerate(sentences):\n",
    "    es.index(index='test_index', id=i, body={'content': sentence}, refresh=True)  # refresh so the search below sees the document\n",
    "\n",
    "response = es.search(index='test_index', body={\"query\": {\"match_all\": {}}})\n",
    "\n",
    "# For each document in the response, analyze its 'content' field\n",
    "for hit in response['hits']['hits']:\n",
    "    document = hit['_source']\n",
    "    content = document['content']\n",
    "\n",
    "    # Analyze the content\n",
    "    analysis = es.indices.analyze(index='test_index', body={\n",
    "        'analyzer': 'polish',\n",
    "        'text': content\n",
    "    })\n",
    "\n",
    "    # Print the tokens\n",
    "    print(\"Tokens for document\", hit['_id'], \":\")\n",
    "    for token in analysis['tokens']:\n",
    "        print(token['token'])\n",
    "    print()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of results: 1\n",
      "Document: Trygonometria jest działem matematyki. Zajmuje się ona badaniem zależności między bokami i kątami trójkątów. Trygonometria jest bardzo ważna w nawigacji. Dzięki niej możemy określić położenie statku na morzu.\n",
      "Score: 0.9768399\n",
      "\n"
     ]
    }
   ],
   "source": [
    "query = {\n",
    "    \"query\": {\n",
    "        \"match\": {\n",
    "            \"content\": {\n",
    "                \"query\": \"matematyka\",\n",
    "                \"analyzer\": \"polish\"\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "}\n",
    "\n",
    "results = es.search(index='test_index', body=query)\n",
    "\n",
    "print('Number of results:', results['hits']['total']['value'])\n",
    "for hit in results['hits']['hits']:\n",
    "    print('Document:', hit['_source']['content'])\n",
    "    print('Score:', hit['_score'])\n",
    "    print()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
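
The notebook's introduction mentions that custom analysis can stop words like "e-mail" from being split into "e" and "mail". A minimal sketch of one way to do that is shown here. It is not part of the commit. It assumes the Polish plugin in use is Stempel (`analysis-stempel`), which provides the `polish_stop` and `polish_stem` token filters. It also swaps in a mapping character filter rather than a custom tokenizer, because rewriting the word before tokenization is the smallest change. The names `build_polish_index_body`, `keep_email`, `polish_custom`, and `test_index_custom` are hypothetical.

```python
import json


def build_polish_index_body(field="content"):
    """Build an index body with a custom Polish analyzer (hypothetical names).

    Assumes the Stempel plugin is installed, since `polish_stop` and
    `polish_stem` are token filters provided by that plugin.
    """
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    # Rewrite "e-mail" before tokenization so that the
                    # standard tokenizer cannot split it on the hyphen.
                    "keep_email": {
                        "type": "mapping",
                        "mappings": ["e-mail => email", "E-mail => email"],
                    }
                },
                "analyzer": {
                    # Same filter chain as a rebuilt `polish` analyzer,
                    # with the character filter added in front.
                    "polish_custom": {
                        "type": "custom",
                        "char_filter": ["keep_email"],
                        "tokenizer": "standard",
                        "filter": ["lowercase", "polish_stop", "polish_stem"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "polish_custom"}
            }
        },
    }


body = build_polish_index_body()
print(json.dumps(body, indent=2))

# With a running cluster and the `es` client from the notebook, the body
# could be passed the same way as `mapping` above (index name is made up):
# es.indices.create(index='test_index_custom', body=body)
```

The function only builds a dictionary, so it can be inspected without a running cluster. Whether the resulting tokens are better than those of the built-in `polish` analyzer would still need to be checked with `es.indices.analyze` against the new index.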
