JSON Search Indexes come to Object Storage

Hermann Bär
Oracle Developers
Jul 9, 2024

We hear from so many customers who need to search through many files in Object Storage to find the information they need: log or audit files, documents for data exchange, archived data. And that's not easy. JSON documents can be complex, and with their flexible format it takes time and effort to find what you are looking for; sometimes you don't even know exactly how it's spelled. It's about time to evolve Oracle's technology to make your life much easier by applying JSON search capabilities to your data in Object Stores. Let's make it easy to find what you are looking for.

Oracle has had full-text search and indexing for many years — more than I can count or remember. It is a very mature and comprehensive technology nowadays, and we continuously enhance it, like JSON search indexes for database tables in 2021 or text indexing on files in Object Storage at the end of 2022. And now, as of July 2024, we have just released support for JSON search indexes on documents (files) in Object Storage for Autonomous Database Serverless.

If you store JSON documents in text files in your Object Storage, you can leverage JSON search indexing capabilities on top of these files. And here is the kicker: you can not only identify the files with the content you are searching for, but also transparently access these files and return only the matching JSON documents, just like with a regular table in the database. Here is how it works.

Identify the JSON documents of interest in your Object Store.

It starts with (obviously) some text files with JSON documents in the Object Store, in a bucket you can access. This bucket can be public, or a private bucket for which you need a credential for authentication and authorization. The bucket can be in the Oracle Object Store or any other supported Object Store.

For the sake of this introductory blog, we assume you already have access to a bucket that contains your JSON files of interest (for more details on how to set up a bucket with credential access, see the ‘text indexing on files in Object Storage’ blog). Here is the list of the JSON files in my little test bucket.

SQL> select object_name, bytes
from dbms_cloud.list_objects('my_credential','https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/my_tenancy/b/my_bucket/o/JSONforText/');

OBJECT_NAME BYTES
------------------------------ ----------
airportDelays.json 7542274
comments.json 18231014
companies.json 38563354
data.json 18832901
grades.json 37840131
inspections.json 28000656
listingsAndReviews.json 99656721
movies_mongo.json 43456152
posts.json 17215951
routes.json 16137490
stories.json 11947834
transactions.json 20141601
trips.json 7112796
tweets.json 49887461

I borrowed these files from various sample repositories on GitHub and other publicly available places. Most of these files are in ndjson format, the default file format supported by this feature, but I also included one file to force some errors so we can quickly see how to analyze those (while ndjson is currently not a formal standard, it is commonly defined as line-delimited documents where each document is followed by a newline (\n) character). Note that you have some control over the supported file formats, e.g., delimiters, compressed or not, but that's beyond this blog and something the documentation will help with.
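To illustrate the ndjson convention just described, here is a minimal Python sketch (the documents themselves are made up): each line of the file is one complete JSON document, terminated by a newline.

```python
import json

# Two newline-delimited JSON documents, as they would appear in an
# ndjson file in Object Storage (content is made up for illustration).
ndjson_text = '{"id": 1, "name": "alpha"}\n{"id": 2, "name": "beta"}\n'

# Each non-empty line is a complete, self-contained JSON document.
docs = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

print(len(docs))        # 2
print(docs[0]["name"])  # alpha
```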

The JSON search indexing functionality is very similar to the core text indexing capabilities for files in Object Storage, but it does more. It allows you to identify the files containing your data of interest and transparently access individual documents within these files directly with SQL — but one step at a time.

Let’s get started and create our first JSON search index.

When you have identified the location of your JSON document files and have ensured that you can access them, it is time to create your JSON index with a simple PL/SQL procedure call. We will call our index JSONTEST. Who could have guessed that name?

SQL> BEGIN
2 DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX (
3 credential_name => 'my_credential',
4 location_uri => 'https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/my_tenancy/b/my_bucket/o/JSONforText/',
5 index_name => 'jsontest',
6 format => JSON_OBJECT ('json_index' value 'true', 'json_index_doc_len' value '100000')
7 );
8 END;
9 /

PL/SQL procedure successfully completed.

That looks familiar if you have used text indexing on files in the Object Storage before, but with two differences in the format option:

  • We explicitly specify that we want to build a JSON search index on our external files.
  • We explicitly specify a maximum length for our JSON documents. If a JSON document within a file exceeds this length, it will not be indexed but will appear in the error table.
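As a rough illustration of that second point, here is a hypothetical Python sketch of the kind of length check involved (the indexer's actual implementation is internal; the 100,000-byte limit simply mirrors the json_index_doc_len value above):

```python
# Hypothetical sketch of the json_index_doc_len behavior: documents whose
# serialized length exceeds the limit are not indexed and are reported
# as errors instead. The limit mirrors the 100000 used in the example.
JSON_INDEX_DOC_LEN = 100_000

def partition_docs(lines):
    """Split ndjson lines into indexable documents and oversized ones."""
    indexed, too_long = [], []
    for line in lines:
        (indexed if len(line) <= JSON_INDEX_DOC_LEN else too_long).append(line)
    return indexed, too_long

lines = ['{"small": true}', '{"big": "' + "x" * 200_000 + '"}']
indexed, too_long = partition_docs(lines)
print(len(indexed), len(too_long))  # 1 1
```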

That returns quickly because this procedure only initiates the indexing process, which is kicked off as a scheduler job in the background. Let’s check that the process is running fine.

SQL> select state from all_scheduler_jobs where job_name = 'JSONTEST$JOB';

STATE
--------------------
RUNNING

Our job is running fine. Note that the job is named in line with the name of your JSON search index, namely <yourIndexName>$JOB. If something were wrong, like a wrong URI or a credential without the proper authorization, you’d spot this immediately in the run details of your job (and the job will not be in a running state). Here is an example where I intentionally used the wrong URI:

SQL> select log_date, status, additional_info from all_scheduler_job_run_details where job_name = 'JSONTEST$JOB';

LOG_DATE STATUS ADDITIONAL_INFO
------------------------------- ------------------------------ ------------------------------------------------
02/07/24 17:25:39.331062000 GMT FAILED ORA-20404: Object not found - https://objectstorage.eu-frankfurt-
1.oraclecloud.com/n/my_non_existing_tenancy/b/mybucket/o/JSONforText/
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD$PDBCS_240607_0", line 2061

Note that the state of the scheduler job does not necessarily reveal an initial problem. If you used the procedure's defaults, as I did here, the indexing job is set up with a refresh interval of 5 minutes, so if its initial invocation fails, the state of the job will simply show 'SCHEDULED' until the next attempt.

Depending on the number and size of your files, the indexing might take some time, but when it’s finished, you’ll find an entry in the scheduler log that says the indexing has been completed successfully.

The job that creates (or maintains) your JSON search index also tracks errors on individual documents. I promised you some errors with the files I picked, so let’s look at what I did. Errors are tracked in a table called <your JSON search index>$TXTIDX_ERR, in my case JSONTEST$TXTIDX_ERR:

SQL> select distinct object_name from jsontest$txtidx_err;

OBJECT_NAME
------------------------------
airportDelays.json

The file ‘airportDelays.json’ generated errors, so let’s check them out. You could inspect the file visually and possibly spot what’s wrong. Or, better yet, use the functionality of DBMS_CLOUD to fetch files from Object Storage together with the information in the error table. I am using the latter approach for illustration, and I only want to see the file content for the first error entry to get an idea (there’s also a minimum fetch size of 10 bytes that’s reflected in the code):

SQL> with err as (
2 select object_path || object_name as objname,
3 nvl(offset,0) offset,
4 greatest(nvl(offset,0) + length,10) len,
5 rank()
6 over(partition by object_path || object_name
7 order by id asc
8 ) rnk
9 from jsontest$txtidx_err
10 where object_name = 'airportDelays.json'
11 )
12 select objname, to_clob(
13 dbms_cloud.get_object(
14 credential_name => 'my_credential',
15 object_uri => objname,
16 startoffset => offset,
17 endoffset => len
18 )
19 ) data
20 from err
21 where rnk = 1;

OBJNAME DATA
----------------------------------------------------------------- --------------------------------------------------------------------------------
https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/my_tenancy [ {
/b/my_bucket/o/JSONforText/airportDelays.json

You can see that there is something wrong with the documents in ‘airportDelays.json’ and that this file does not contain newline-delimited documents, the default for JSON search indexing on files in the Object Store. If you want to process these kinds of files, check whether your files use another JSON document separator, which you can specify with the format parameter JSON_INDEX_DOC_SEP. If that’s not the case, you’d need to massage your files to leverage this new functionality (there are plans to broaden the support for file formats). The good news is that all of my other files were indexed successfully.
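If, as with my ‘airportDelays.json’, a file holds one big JSON array rather than newline-delimited documents, a small script can massage it into ndjson before you upload it. A minimal Python sketch (the content is made up):

```python
import json

# A file body containing a single JSON array of documents, like the
# "[ {" fragment seen in the error output above (content is made up).
array_text = '[{"airport": "FRA", "delay": 12}, {"airport": "JFK", "delay": 3}]'

# Rewrite the array as ndjson: one document per line, newline-terminated.
docs = json.loads(array_text)
ndjson_text = "".join(json.dumps(doc) + "\n" for doc in docs)

print(ndjson_text, end="")
```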

So what has happened so far?

The indexing process has created a table called JSONTEST$TXTIDX (the index name you provided, plus “$TXTIDX”), plus some other JSON and text search-specific tables you don’t have to worry about. This main table contains the same columns as the text index table on files in Object Storage, plus additional ones like LINE_NUMBER, OFFSET, or LENGTH. These extra columns identify the individual JSON documents in your files, each of which was indexed individually.
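Conceptually, OFFSET and LENGTH are enough to cut a single document out of a file without parsing anything else. Here is a simplified Python sketch of the idea (the byte values are hand-computed for this example, and the real column semantics may differ in detail):

```python
import json

# A small ndjson payload; suppose the index recorded the byte offset and
# length of the second document (values hand-computed for this sketch).
ndjson_bytes = b'{"id": 1}\n{"id": 2, "name": "beta"}\n{"id": 3}\n'
offset, length = 10, 25

# Slice exactly one document out of the file by byte range.
doc_bytes = ndjson_bytes[offset:offset + length]
doc = json.loads(doc_bytes)
print(doc["name"])  # beta
```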

You can query this table, but in most cases you should be fine without touching it. Instead of dealing with offsets and byte lengths of JSON documents in a file, you want to find the individual JSON documents of interest and access them directly.

For direct access, the procedure has automatically created a view with the name of your JSON index, which, in our case, is JSONTEST. This view has the following columns:

SQL> desc jsontest

Name Null? Type
-------------- ----- ---------------------------
OBJECT_NAME VARCHAR2(4000)
OBJECT_PATH VARCHAR2(4000)
FILE_LINE_JSON CLOB
MTIME TIMESTAMP(6) WITH TIME ZONE
DATA CLOB

The main column of interest is DATA, which “contains” our JSON documents from the external files. The interesting tidbit, however, is that this is a wrapper view that streams the JSON documents of interest directly from the Object Storage, based on the indexing information of our index table JSONTEST$TXTIDX!

A sample query selecting all documents where the JSON field ‘$.business_name’ contains ‘michael’ would look as follows:

SQL> select json_value(data,
2 '$.business_name' returning clob) as record
3 from jsontest
4 where json_textcontains ( file_line_json,'$.business_name','michael' )
5 fetch first 10 rows only;

RECORD
--------------------------------------------------------------------------------
ST.MICHAEL GROCERY CORP
WU, MICHAEL A.
MICHAEL G. FANELLI, INC.
ST MICHAEL GROCERY CORP
MICHAEL'S PLACE SALON MASTER HAIR CUTTER, INC.
MICHAEL'S STORES, INC
MICHAEL'S #9025
S. MICHAEL & SON INC.
MICHAEL D. THOMAS FUNERAL SERVICES INC.
WU, MICHAEL A.
10 rows selected.

So, the view itself acts like an external table with direct access to the data in the Object Storage, but it only fetches the documents of interest, based on each document’s offset and length found in the JSON index information inside the database. Now, that is all internal “stuff”. All you need to know is that you specify the JSON search predicate on column FILE_LINE_JSON, while the actual data is in column DATA, streamed from the Object Storage.

You can see in the execution plan how the JSON search is driven by the internal table to derive the individual JSON documents in your files in the Object Store:

SQL> select * from dbms_xplan.display_cursor();

PLAN_TABLE_OUTPUT
____________________________________________________________________________________________________
SQL_ID btu33a2tgx8sn, child number 1
-------------------------------------
select json_value(data, '$.business_name' returning clob) as
record from jsontest where json_textcontains (
file_line_json,'$.business_name','michael' ) fetch first 10 rows only

Plan hash value: 2883663787

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3 (100)| |
|* 1 | VIEW | | 10 | 159K| 3 (0)| 00:00:01 |
|* 2 | WINDOW NOSORT STOPKEY | | 11 | 1485 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID| JSONTEST$TXTIDX | 11 | 1485 | 3 (0)| 00:00:01 |
|* 4 | DOMAIN INDEX | JSONTEST$IDX | | | 0 (0)| |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=10)
2 - filter(ROW_NUMBER() OVER ( ORDER BY NULL )<=10)
4 - access("CTXSYS"."CONTAINS"("JSONTEST$TXTIDX"."FILE_LINE_JSON",'(michael) INPATH
(/business_name)')>0)

26 rows selected.

You can also get higher-level information from this view, such as which objects in the Object Store actually contain JSON documents matching on the JSON scalar field ‘$.business_name’:

SQL> select distinct object_name  from jsontest
2 where json_textcontains ( file_line_json,'$.business_name','michael' )
3 fetch first 10 rows only;

OBJECT_NAME
------------------------------
inspections.json

Here is an interesting last side note: a JSON search index can also be used as a regular text index, where the search is not limited to a specific JSON field but instead looks at all the data. So, when we search for the same text as before, ‘michael’, with a generic CONTAINS query, you will see more files matching this string.

SQL> select distinct object_name from jsontest
2 where contains(
3 file_line_json,
4 'michael'
5 ) > 0;

OBJECT_NAME
------------------------------
movies_mongo.json
tweets.json
inspections.json

That’s about it for now, folks. Feel free to play more with JSON search indexes and your data.

As always, the documentation has more details and information, specifically about the parameters. We’d also love to hear from you, positive or negative. We’re always working on making things better for you, our customer.


I work in Product Management in the Oracle Database organization. I thrive in helping customers leverage Oracle technology to make them successful.