OpenSearch Cluster Information | |
Version | 2.5 |
Domain name | opensearch-eu-west-1 |
VPC Endpoint | https://vpc-opensearch-eu-west-1-taklxygqetyxg6krtrswbiztby.eu-west-1.es.amazonaws.com |
Availability Zone(s) | 1-AZ |
Dedicated Master Node Available ? | No |
Warm and cold data storage Available ? | No |
EBS Volume Type | General Purpose (SSD) - gp3 |
EBS Volume Size Per Node | 10 GiB |
Number of OpenSearch Nodes | 2 |
Provisioned IOPS | 3000 IOPS |
Provisioned Throughput (MiB/s) | 125 MiB/s |
Snapshot Backup Frequency | Hourly |
{
  "mappings": {
    "properties": {
      "account_id": {
        "type": "keyword"
      },
      "description_term": {
        "type": "keyword"
      },
      "description_text": {
        "type": "text"
      },
      "transaction_date": {
        "type": "date",
        "format": "strict_year_month_day"
      },
      "chq_no": {
        "type": "keyword"
      },
      "highlight": {
        "type": "boolean"
      },
      "sl_no": {
        "type": "integer"
      },
      "is_debit": {
        "type": "boolean"
      },
      "transaction_amount": {
        "type": "double"
      },
      "category": {
        "type": "keyword"
      }
    }
  }
}
account_id (keyword)
- Identifies the set of transaction records under an account.
description_term (keyword)
- Narration / description of the transaction made. This field needs to support full-text searches.
description_text (text)
- The same narration / description, but indexed as type text. See the term vs. full-text comparison: https://opensearch.org/docs/latest/query-dsl/term-vs-full-text
transaction_date (date)
- Formatted as strict_year_month_day, which stores the transaction date in YYYY-MM-DD format.
chq_no (keyword)
- Cheque number if the transaction is made using a cheque. We populate this for every entry for benchmarking and solution testing.
highlight (boolean)
- Set to true when the user highlights a transaction.
sl_no (integer)
- Serial number / order number of the transaction.
is_debit (boolean)
- true if the transaction is a debit transaction.
transaction_amount (double)
- Transaction amount.
category (keyword)
- Category of the transaction.
The user should be able to do full-text searches on the description field. Consider the example records below:
"description_term": "TDS DEDUCTED Yukta Wahal Debt 36108473986#",
"description_term": "TDS DEDUCTED Yukta Upadhyay Social, 82225554583#",
"description_term": "TDS DEDUCTED Yukta Upadhyay Rate 85373069044#",
"description_term": "TDS DEDUCTED Yukta Tiwari (CFA) 15646092048#",
"description_term": "TDS DEDUCTED Yukta Holkar Cost 39464312918#",
If a user searches for text like "TDS" and "ukta", all 5 entries above should show up. Fuzziness is not yet part of our search, so we are skipping it.
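A minimal sketch of how such a search could be expressed, assuming each search word is wrapped in * and sent as a wildcard query against description_term, with every word required to match. The helper names build_wildcard_query and matches_locally are hypothetical, and the local check uses Python's fnmatch only to mimic the pattern semantics; it is not how OpenSearch evaluates the query.

```python
from fnmatch import fnmatchcase


def build_wildcard_query(words, field="description_term", account_id=None):
    """Return a search body that requires every word to occur somewhere in `field`."""
    must = [{"wildcard": {field: {"value": f"*{word}*"}}} for word in words]
    bool_query = {"must": must}
    if account_id is not None:
        # Restrict the search to one account (exact match on the keyword field).
        bool_query["filter"] = [{"term": {"account_id": account_id}}]
    return {"query": {"bool": bool_query}}


def matches_locally(description, words):
    """Mimic the case-sensitive *word* patterns in plain Python, for illustration only."""
    return all(fnmatchcase(description, f"*{word}*") for word in words)


# The five example records from this page.
records = [
    "TDS DEDUCTED Yukta Wahal Debt 36108473986#",
    "TDS DEDUCTED Yukta Upadhyay Social, 82225554583#",
    "TDS DEDUCTED Yukta Upadhyay Rate 85373069044#",
    "TDS DEDUCTED Yukta Tiwari (CFA) 15646092048#",
    "TDS DEDUCTED Yukta Holkar Cost 39464312918#",
]

body = build_wildcard_query(["TDS", "ukta"])
hits = [r for r in records if matches_locally(r, ["TDS", "ukta"])]
```

Because description_term is a keyword field with no normalizer, the patterns are case-sensitive: "ukta" matches "Yukta", but "tds" would not match "TDS".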
We will generate data similar to what an actual transaction description looks like. Below is a sample generated document.
{
  "account_id": "37503487134",
  "description_term": "TDS DEDUCTED Yukta Gaur Bankruptcy 54636237417#",
  "description_text": "TDS DEDUCTED Yukta Gaur Bankruptcy 54636237417#",
  "transaction_date": "2023-12-04",
  "chq_no": "CHQ-9649",
  "highlight": false,
  "sl_no": 10156058,
  "is_debit": true,
  "transaction_amount": 4276,
  "category": "GstinConsentLink"
}
The code for generating the above data is given below:
import random

import indian_names  # third-party package used for person names

counter = 0
# The full lists are elided on this page; only the examples given are shown here.
category_list = ["sendLoanAgreement", "blackBoxInvoker", "loanDisbursal"]  # etc.
description_list = ["Accounting", "Return"]  # etc.
transaction_list = ["ACCOUNT", "CAPITAL", "CREDIT", "NEFT/SBIN", "UPI/GPAY"]  # etc.
accounts_list = ["71983548866", "53012970390", "62310812672"]  # 90+ fixed account numbers


def randN(N):
    """Return a random integer with exactly N digits."""
    low = pow(10, N - 1)
    high = pow(10, N) - 1
    return random.randint(low, high)


def index_document():
    """Build one synthetic transaction document and return it with its id."""
    global counter
    counter = counter + 1
    person_name = indian_names.get_full_name()
    # Shape: "<transaction word> <person name> <description word> <11 digits>#"
    description_modified_desc = (
        random.choice(transaction_list) + ' ' + person_name + ' '
        + random.choice(description_list) + ' ' + str(randN(11)) + '#'
    )
    # get_fake_date() is a helper that is not shown on this page.
    transaction_date = get_fake_date()
    formatted_date = str(transaction_date)
    doc_id = counter
    document = {
        'account_id': random.choice(accounts_list),
        'description_term': description_modified_desc,
        'description_text': description_modified_desc,
        'transaction_date': formatted_date,
        'chq_no': "CHQ-" + str(randN(4)),
        'highlight': False,
        'sl_no': counter,
        'is_debit': bool(random.getrandbits(1)),
        # Whole-number amount with 4 digits, stored as a float.
        'transaction_amount': float(randN(4)),
        'category': random.choice(category_list)
    }
    return document, doc_id
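The generator returns a (document, id) pair, which maps naturally onto the newline-delimited format of the _bulk API: one action line followed by one source line per document. The following is a rough sketch under that assumption; to_bulk_body is a hypothetical helper and not necessarily how the data was loaded for this benchmark.

```python
import json


def to_bulk_body(index_name, docs_with_ids):
    """Serialise (document, id) pairs into a _bulk request body (NDJSON)."""
    lines = []
    for document, doc_id in docs_with_ids:
        # Action line: index this document under the given id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        # Source line: the document itself.
        lines.append(json.dumps(document))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

Batching a few thousand documents per request is a common starting point, but the right batch size depends on document size and cluster capacity.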
Since an app-form has multiple accounts under it and each account can have an average of 100000 records, we assume each account has a total of 700000 records for our testing. We will also create a few indexes with a high number of documents on the same cluster to see how other indexes affect cluster performance. So overall, these are the cases under which we will test the feasibility and performance of OpenSearch:
other indexes with a high number of documents, say one billion.
Since we are targeting full-text searches, wildcard queries fit our use case. But wildcard queries are costly, so we need to check their performance in the cases defined above. We will also try n-grams or knn search to see whether they can optimise the solution we are looking for.
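To make the n-gram idea concrete, here is a sketch of what a trigram analyzer definition might look like, together with a plain-Python approximation of the tokens it would produce. The names trigram_tokenizer and trigram_analyzer are hypothetical, the regex only approximates the letter/digit character classes for ASCII text, and this is not necessarily the configuration used in our tests.

```python
import re

# Hypothetical index settings: a custom analyzer built on the "ngram" tokenizer.
NGRAM_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}


def char_ngrams(text, n=3):
    """Approximate the analyzer: lowercase, split on non-alphanumerics, emit n-grams."""
    grams = []
    for chunk in re.findall(r"[a-z0-9]+", text.lower()):
        grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


def could_match(query, description, n=3):
    """True if every n-gram of the query also occurs among the description's n-grams."""
    wanted = set(char_ngrams(query, n))
    return bool(wanted) and wanted <= set(char_ngrams(description, n))
```

With this approach a search for "ukta" becomes a lookup of the terms "ukt" and "kta" instead of a scan with a leading wildcard. Requiring all n-grams to be present does not guarantee they are adjacent, so a phrase-style query or a post-filter may be needed to rule out false positives.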
AppFormIds List - appforms.csv sample
appFormId
3040c75d-7003-4d6b-b00a-68e73437d702,
7d12d071-ebce-41d5-9113-1a13a004a3f5,
5c368295-e5ee-47de-acdd-e8b4f3c163bd,
21a107b1-e66d-447e-b131-70894e67671a,
1ee5feb6-5bc1-4a54-b8fb-541a106139ce,
3d5f6809-d0a6-4edc-a7e0-0307fd3fa895,
7924c198-b018-4c04-9156-c68d1f36cdfc,
1dd876a8-4919-414c-a1b6-335de41589e2,
d5a29a35-9c09-43e5-afcd-3beb07d66b04,
93cbbad1-20df-43b4-aa19-8eba38cd9e5b
AccountIds List
- Every appForm above has these accounts, and each account has 45000 records.
accountId
dwxsjhdrxo
dwwdcjyfmw
igsnodlzai
yzmiyspdqu
cgerslhcuh
bbqhszautv
obyzkusfgz
omcjodrnjj
jaialvjana
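Overall Indexes Information - 21 indexes, each with 10 accounts and 900000 documents. Columns: health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size.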
green open aa2c2031-8196-46f3-a065-2c607560102e yNUfCKW6SPq-7op88a0GEA 5 1 900000 0 395.3mb 198.3mb
green open bda1457a-a67a-4595-9c3f-210fa9807e23 plaRLA1LQhCauXpC60Uebw 5 1 900000 0 393.9mb 197.4mb
green open 21a107b1-e66d-447e-b131-70894e67671a nbPvfmfvScaUkH02JFQjkw 5 1 900000 0 395.3mb 198.3mb
green open 93cbbad1-20df-43b4-aa19-8eba38cd9e5b K--JVUYTQyOvaLdbekvzvw 5 1 900000 0 395mb 198.1mb
green open 2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 QuYAzePYRLqKi1cI6OMkHw 5 1 900000 0 394.6mb 198mb
green open 3040c75d-7003-4d6b-b00a-68e73437d702 WbiGMb0HTrag8DifOx7bpQ 5 1 900000 0 394.5mb 197.8mb
green open ea5b97df-dcc7-4e12-aba1-ac6e8a366f85 LGeDLdhSSTS3Um_GT4aUVg 5 1 900000 0 394.4mb 197.7mb
green open 5c368295-e5ee-47de-acdd-e8b4f3c163bd 4q-VyHWtSlq_Ewm-GHTWdA 5 1 900000 0 395.3mb 198.4mb
green open 492ceac5-f21b-4c51-b3db-8794193ce0ae xXbkbT9pQZ6ja0RPxnLsHA 5 1 900000 0 395.2mb 198.1mb
green open f2610cc4-9347-4c08-9a9e-bbbb6c17699d 3PCmpJqkT2u7XWSIqamVHw 5 1 900000 0 394.8mb 198.3mb
green open 3d5f6809-d0a6-4edc-a7e0-0307fd3fa895 DQ8voAMJTDm4Mdh12oP3Ng 5 1 900000 0 395.2mb 198.2mb
green open 7924c198-b018-4c04-9156-c68d1f36cdfc P8nBEejsQ3iAIIlALZzalw 5 1 900000 0 394.4mb 197.7mb
green open 1445f76e-501f-44ef-aecb-7e9492a70883 OhaVCBbYSTms7W2iYl9X8A 5 1 900000 0 395.7mb 198.7mb
green open 2a7ade8e-5b74-44d4-81f9-5e2935072186 9K1dYgJRRC6YbcxdIleoXw 5 1 900000 0 394.3mb 197.8mb
green open 24550098-3b7d-40b1-981b-d60d08f1ab62 qR0oPd0zRuS766NDcq6eaQ 5 1 900000 0 395mb 198.2mb
green open d5a29a35-9c09-43e5-afcd-3beb07d66b04 3peRTZbURv--DE7ofA-Iaw 5 1 900000 0 395.3mb 198.2mb
green open 7d12d071-ebce-41d5-9113-1a13a004a3f5 Bp7Pq-FkRUypN3zQF2IKIQ 5 1 900000 0 395.5mb 198.4mb
green open e4329c5e-6d5e-4111-afc3-3eba2ad2c2ed KLLAu1sRQYqhUSe7DegIMw 5 1 900000 0 395mb 198.1mb
green open 1ee5feb6-5bc1-4a54-b8fb-541a106139ce NMcdlNVLQ5GJfYkk7ieXWg 5 1 900000 0 395.4mb 198.3mb
green open 1dd876a8-4919-414c-a1b6-335de41589e2 hJpjTQhwSsG3NKTxInE6Jg 5 1 900000 0 394.9mb 198.1mb
green open 094d6f38-767a-4007-84e3-9a62b4a505e2 PnC3oynTSIO-SqDhxpkA8w 5 1 900000 0 395.2mb 198.2mb
Overall space taken by these documents in OpenSearch cluster ~ 8.4 GiB
Number of parallel requests / second | Time Taken to complete each request | Search Latency - Average | Search Latency Maximum | Max CPU Utilisation | JVM Pressure |
20 | ~ 3.2 seconds | 273 ms - First Search | 600 ms -First Search | 7-8% - First Search | 51.8 % |
40 - First Iteration | ~ 5.5 seconds | 173 ms | 189 - 173 ms | 7 % | 60-70 % |
40 - Second Iteration | ~ 5.3 seconds | 0 ms | 0 ms | 7 % | 60-70 % |
60 | ~ 6-7 seconds | 0 ms | 0 ms | 7 % | 65 % |
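The load-generation harness is not shown on this page. As a rough sketch, one way to fire N searches in parallel and record the client-side time per request, using only the standard library, is shown below. run_search is a hypothetical callable standing in for whatever issues the actual search request.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def timed_call(run_search, term):
    """Run one search and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    run_search(term)
    return time.perf_counter() - start


def fire_parallel(run_search, terms, parallelism):
    """Issue one search per term, up to `parallelism` at a time, and summarise timings."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        durations = list(pool.map(lambda term: timed_call(run_search, term), terms))
    return {
        "requests": len(durations),
        "avg_s": mean(durations),
        "max_s": max(durations),
    }
```

Timings measured this way include network and client overhead, so they are expected to be higher than the search latency reported by the cluster itself.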
Log files - With focus on Response time:
Now, we shall increase the number of shards to 4.
Number of parallel requests / second | Time Taken to complete each request | Search Latency - Average | Search Latency Maximum | Max CPU Utilisation | JVM Pressure |
20 | ~ 2.5 seconds | 161 ms - First Search | 161 ms -First Search | 7-8% - First Search | 51.8 % |
40 | ~ 2.5 - 3 seconds | 0 ms | 0 ms | 7 % | 56 % |
60 | ~ 3 seconds | 0 ms | 0 ms | 7 % | 65 % |
As data grows, it is beneficial to add more nodes, because in our use case the data under an index is distributed across shards. Below is how each index is split across multiple shards (55~56 shards in our case). Some indexes were removed for brevity. So if we go with the above metrics,
index shard prirep state docs store ip node
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 2 r STARTED 180078 39.4mb x.x.x.x 1ae11744b7d470c7b73a32b20af4833d
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 2 p STARTED 180078 39.4mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 1 p STARTED 179547 39.5mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 1 r STARTED 179547 39.5mb x.x.x.x 1b6dec9baf668d9954e33d5931a65368
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 3 p STARTED 180471 39.8mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 3 r STARTED 180471 39.8mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 4 r STARTED 180491 39.8mb x.x.x.x 1ae11744b7d470c7b73a32b20af4833d
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 4 p STARTED 180491 39.8mb x.x.x.x 1b6dec9baf668d9954e33d5931a65368
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 0 r STARTED 179413 39.2mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 0 p STARTED 179413 39.2mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
aa2c2031-8196-46f3-a065-2c607560102e 2 p STARTED 180119 39.5mb x.x.x.x 1ae11744b7d470c7b73a32b20af4833d
aa2c2031-8196-46f3-a065-2c607560102e 2 r STARTED 180119 39.5mb x.x.x.x 1b6dec9baf668d9954e33d5931a65368
aa2c2031-8196-46f3-a065-2c607560102e 1 r STARTED 179087 39.4mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
aa2c2031-8196-46f3-a065-2c607560102e 1 p STARTED 179087 39.4mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
aa2c2031-8196-46f3-a065-2c607560102e 3 r STARTED 180314 39.9mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
aa2c2031-8196-46f3-a065-2c607560102e 3 p STARTED 180314 39.9mb x.x.x.x 1b6dec9baf668d9954e33d5931a65368
aa2c2031-8196-46f3-a065-2c607560102e 4 r STARTED 180594 39.8mb x.x.x.x 1ae11744b7d470c7b73a32b20af4833d
aa2c2031-8196-46f3-a065-2c607560102e 4 p STARTED 180594 39.8mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
aa2c2031-8196-46f3-a065-2c607560102e 0 p STARTED 179886 39.5mb x.x.x.x 1ae11744b7d470c7b73a32b20af4833d
aa2c2031-8196-46f3-a065-2c607560102e 0 r STARTED 179886 39.5mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
094d6f38-767a-4007-84e3-9a62b4a505e2 2 p STARTED 180207 39.7mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
094d6f38-767a-4007-84e3-9a62b4a505e2 2 r STARTED 180207 39.7mb x.x.x.x 1b6dec9baf668d9954e33d5931a65368
094d6f38-767a-4007-84e3-9a62b4a505e2 1 p STARTED 180119 39.8mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
094d6f38-767a-4007-84e3-9a62b4a505e2 1 r STARTED 180119 39.8mb x.x.x.x 1b6dec9baf668d9954e33d5931a65368
094d6f38-767a-4007-84e3-9a62b4a505e2 3 p STARTED 179461 39.5mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
094d6f38-767a-4007-84e3-9a62b4a505e2 3 r STARTED 179461 39.5mb x.x.x.x 1b6dec9baf668d9954e33d5931a65368
094d6f38-767a-4007-84e3-9a62b4a505e2 4 p STARTED 180139 39.5mb x.x.x.x 1ae11744b7d470c7b73a32b20af4833d
094d6f38-767a-4007-84e3-9a62b4a505e2 4 r STARTED 180139 39.5mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
094d6f38-767a-4007-84e3-9a62b4a505e2 0 p STARTED 180074 39.5mb x.x.x.x 1ae11744b7d470c7b73a32b20af4833d
094d6f38-767a-4007-84e3-9a62b4a505e2 0 r STARTED 180074 39.5mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
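To check how evenly shards land on nodes, the listing above can be tallied per node. The sketch below parses rows in the column order shown (index, shard, prirep, state, docs, store, ip, node) and uses the ten rows of the first index as input; summarize_shards is a hypothetical helper.

```python
from collections import Counter, defaultdict

SAMPLE = """
index shard prirep state docs store ip node
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 2 r STARTED 180078 39.4mb x.x.x.x 1ae11744b7d470c7b73a32b20af4833d
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 2 p STARTED 180078 39.4mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 1 p STARTED 179547 39.5mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 1 r STARTED 179547 39.5mb x.x.x.x 1b6dec9baf668d9954e33d5931a65368
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 3 p STARTED 180471 39.8mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 3 r STARTED 180471 39.8mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 4 r STARTED 180491 39.8mb x.x.x.x 1ae11744b7d470c7b73a32b20af4833d
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 4 p STARTED 180491 39.8mb x.x.x.x 1b6dec9baf668d9954e33d5931a65368
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 0 r STARTED 179413 39.2mb x.x.x.x 66f2a1be95ed117c009e7c888bf82233
2862c1b7-24b0-4fbc-b6bf-cda9ca413ab2 0 p STARTED 179413 39.2mb x.x.x.x a6c067bc9b1f8eafd66d5759260407f4
"""


def summarize_shards(cat_shards_text):
    """Count shard copies per node and sum primary-shard docs per index."""
    shards_per_node = Counter()
    primary_docs = defaultdict(int)
    for line in cat_shards_text.strip().splitlines():
        parts = line.split()
        if len(parts) != 8 or parts[0] == "index":
            continue  # skip the header and any malformed rows
        index, _shard, prirep, _state, docs, _store, _ip, node = parts
        shards_per_node[node] += 1
        if prirep == "p":
            primary_docs[index] += int(docs)
    return shards_per_node, dict(primary_docs)


shards_per_node, primary_docs = summarize_shards(SAMPLE)
```

For this index the primary shards add up to the 900000 documents reported in the index listing, and the ten shard copies are spread across four nodes.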
1. With wildcard queries for searching description/narration, the use of wildcards is very expensive. The Elasticsearch docs point this out, and since OpenSearch is a fork of Elasticsearch, its handling of wildcard and range queries is the same.
2. If we really want to use OpenSearch, then we will be using it as a database and not as a search engine (which is its primary use case).
3. Even with the introduction of trigrams (n-gram with n=3), there was minimal difference in performance.
4. Performance did not improve when the size of the data was fixed and the number of nodes was increased.
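The cost described in point 1 can be illustrated with a toy model of a sorted term dictionary. This is a simplification for intuition only, not how Lucene is implemented: a prefix pattern such as TDS* can seek directly into the sorted terms, while a pattern with a leading wildcard such as *ukta* has to examine every term. The function names are hypothetical.

```python
from bisect import bisect_left


def prefix_lookup(sorted_terms, prefix):
    """Find terms starting with `prefix`; return (matches, number of terms examined)."""
    start = bisect_left(sorted_terms, prefix)  # binary search to the first candidate
    matches = []
    examined = 0
    for term in sorted_terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break  # sorted order: no later term can have the prefix
        matches.append(term)
    return matches, examined


def infix_lookup(sorted_terms, fragment):
    """Find terms containing `fragment`; every term has to be examined."""
    matches = [term for term in sorted_terms if fragment in term]
    return matches, len(sorted_terms)
```

On a keyword field every distinct description is a single term, so the number of terms to examine for a leading-wildcard pattern grows with the number of distinct descriptions in the index.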