JSON Search Indexes come to Object Storage

Hermann Bär
Oracle Developers
Jul 9, 2024

We hear from so many customers who need to search through many files in Object Storage to find the information they need: log or audit files, documents for data exchange, archived data. And that's not easy. JSON documents can be complex, and with their flexible format it takes time and effort to find what you are looking for; sometimes you don't even know exactly how it's spelled. It's about time to evolve Oracle's technology to make your life much easier by applying JSON search capabilities to your data in Object Stores. Let's make it easy to find what you are looking for.

Oracle has had full-text search and indexing for many years — more than I can count or remember. It is a very mature and comprehensive technology nowadays, and we continuously enhance it, like JSON search indexes for database tables in 2021 or text indexing on files in Object Storage at the end of 2022. And now, as of July 2024, we have just released support for JSON search indexes on documents (files) in Object Storage for Autonomous Database Serverless.

If you store JSON documents in text files in your Object Storage, you can leverage JSON search indexing capabilities on top of these files. And here is the kicker: you can not only identify the files with the content you are searching for, but also transparently access these files and return only the matching JSON documents, just like with a regular table in the database. Here is how it works.

Identify the JSON documents of interest in your Object Store.

It starts with (obviously) some text files with JSON documents in the Object Store, in a bucket you can access. This bucket can be public, or a private bucket for which you need a credential for authentication and authorization. The bucket can be in the Oracle Object Store or any other supported Object Store.

For the sake of this introductory blog, we assume you already have access to a bucket that contains your JSON files of interest (for more details on how to set up a bucket with credential access, see the ‘text indexing on files in Object Storage’ blog). Here is the list of the JSON files in my little test bucket.

SQL> select object_name, bytes
from dbms_cloud.list_objects('my_credential','https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/my_tenancy/b/my_bucket/o/JSONforText/');

OBJECT_NAME BYTES
------------------------------ ----------
airportDelays.json 7542274
comments.json 18231014
companies.json 38563354
data.json 18832901
grades.json 37840131
inspections.json 28000656
listingsAndReviews.json 99656721
movies_mongo.json 43456152
posts.json 17215951
routes.json 16137490
stories.json 11947834
transactions.json 20141601
trips.json 7112796
tweets.json 49887461

I borrowed these files from various sample repositories on GitHub and other publicly available places. Most of these files are in ndjson format, the default file format supported by this feature, but I also included one file to force some errors so we can quickly see how to analyze those (while ndjson is currently not a formal standard, it is commonly defined as line-delimited documents where each document is followed by a newline (\n) character). Note that you have some control over the supported file formats, e.g., delimiters, compressed or not, but that's beyond this blog and something the documentation will help with.
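To illustrate the ndjson convention just described, here is a minimal Python sketch (the documents themselves are made up): each line of the file is one complete JSON document, terminated by a newline.

```python
import json

# Two newline-delimited JSON documents, as they would appear in an
# ndjson file in Object Storage (content is made up for illustration).
ndjson_text = '{"id": 1, "name": "alpha"}\n{"id": 2, "name": "beta"}\n'

# Each non-empty line is a complete, self-contained JSON document.
docs = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

print(len(docs))        # 2
print(docs[0]["name"])  # alpha
```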

The JSON search indexing functionality is very similar to the core text indexing capabilities for files in Object Storage, but it does more. It allows you to identify the files containing your data of interest and transparently access individual documents within these files directly with SQL — but one step at a time.

Let’s get started and create our first JSON search index.

When you have identified the location of your JSON document files and have ensured that you can access them, it is time to create your JSON index with a simple PL/SQL procedure call. We will call our index JSONTEST. Who could have guessed that name?

SQL> BEGIN
2 DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX (
3 credential_name => 'my_credential',
4 location_uri => 'https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/my_tenancy/b/my_bucket/o/JSONforText/',
5 index_name => 'jsontest',
6 format => JSON_OBJECT ('json_index' value 'true', 'json_index_doc_len' value '100000')
7 );
8 END;
9 /

PL/SQL procedure successfully completed.

That looks familiar if you have used text indexing on files in the Object Storage before, but with two differences in the format option:

  • We explicitly specify that we want to build a JSON search index on our external files.
  • We explicitly specify a maximum length for our JSON documents. If a JSON document within a file exceeds this length, it will not be indexed but will appear in the error table.
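As a rough illustration of that second point, here is a hypothetical Python sketch of the kind of length check involved (the indexer's actual implementation is internal; the 100,000-byte limit simply mirrors the json_index_doc_len value above):

```python
# Hypothetical sketch of the json_index_doc_len behavior: documents whose
# serialized length exceeds the limit are not indexed and are reported
# as errors instead. The limit mirrors the 100000 used in the example.
JSON_INDEX_DOC_LEN = 100_000

def partition_docs(lines):
    """Split ndjson lines into indexable documents and oversized ones."""
    indexed, too_long = [], []
    for line in lines:
        (indexed if len(line) <= JSON_INDEX_DOC_LEN else too_long).append(line)
    return indexed, too_long

lines = ['{"small": true}', '{"big": "' + "x" * 200_000 + '"}']
indexed, too_long = partition_docs(lines)
print(len(indexed), len(too_long))  # 1 1
```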

That returns quickly because this procedure only initiates the indexing process, which is kicked off as a scheduler job in the background. Let’s check that the process is running fine.

SQL> select state from all_scheduler_jobs where job_name = 'JSONTEST$JOB';

STATE
--------------------
RUNNING

Our job is running fine. Note that the job is named in line with the name of your JSON search index, namely <yourIndexName>$JOB. If something were wrong, like a wrong URI or a credential without the proper authorization, you’d spot this immediately in the run details of your job (and the job will not be in a running state). Here is an example where I intentionally used the wrong URI:

SQL> select log_date, status, additional_info from all_scheduler_job_run_details where job_name = 'JSONTEST$JOB';

LOG_DATE STATUS ADDITIONAL_INFO
------------------------------- ------------------------------ ------------------------------------------------
02/07/24 17:25:39.331062000 GMT FAILED ORA-20404: Object not found - https://objectstorage.eu-frankfurt-
1.oraclecloud.com/n/my_non_existing_tenancy/b/mybucket/o/JSONforText/
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD$PDBCS_240607_0", line 2061

Note that the state of the scheduler job does not necessarily reveal an initial problem. If you used the procedure's defaults, as I did here, the indexing job is set up with a refresh interval of 5 minutes, so if its initial invocation fails, the state of the job will simply show 'SCHEDULED' until the next attempt.

Depending on the number and size of your files, the indexing might take some time, but when it’s finished, you’ll find an entry in the scheduler log that says the indexing has been completed successfully.

The job that creates (or maintains) your JSON search index also tracks errors on individual documents. I promised you some errors with the files I picked, so let’s look at what I did. Errors are tracked in a table called <your JSON search index>$TXTIDX_ERR, in my case JSONTEST$TXTIDX_ERR:

SQL> select distinct object_name from jsontest$txtidx_err;

OBJECT_NAME
------------------------------
airportDelays.json

The file ‘airportDelays.json’ generated errors, so let’s check them out. You could inspect the file visually and possibly spot what’s wrong. Or, better yet, use the functionality of DBMS_CLOUD to fetch files from Object Storage together with the information in the error table. I am using the latter approach for illustration, and I only want to see the file content for the first error entry to get an idea (there’s also a minimum fetch size of 10 bytes that’s reflected in the code):

SQL> with err as (
2 select object_path || object_name as objname,
3 nvl(offset,0) offset,
4 greatest(nvl(offset,0) + length,10) len,
5 rank()
6 over(partition by object_path || object_name
7 order by id asc
8 ) rnk
9 from jsontest$txtidx_err
10 where object_name = 'airportDelays.json'
11 )
12 select objname, to_clob(
13 dbms_cloud.get_object(
14 credential_name => 'my_credential',
15 object_uri => objname,
16 startoffset => offset,
17 endoffset => len
18 )
19 ) data
20 from err
21 where rnk = 1;

OBJNAME DATA
----------------------------------------------------------------- --------------------------------------------------------------------------------
https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/my_tenancy [ {
/b/my_bucket/o/JSONforText/airportDelays.json

You can see that there is something wrong with the documents in ‘airportDelays.json’ and that this file does not contain newline-delimited documents, the default for JSON search indexing on files in the Object Store. If you want to process these kinds of files, check whether your files use another JSON document separator, which you can specify with the format parameter JSON_INDEX_DOC_SEP. If that’s not the case, you’d need to massage your files to leverage this new functionality (there are plans to broaden the support for file formats). The good news is that all of my other files were indexed successfully.
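If, as with my ‘airportDelays.json’, a file holds one big JSON array rather than newline-delimited documents, a small script can massage it into ndjson before you upload it. A minimal Python sketch (the content is made up):

```python
import json

# A file body containing a single JSON array of documents, like the
# "[ {" fragment seen in the error output above (content is made up).
array_text = '[{"airport": "FRA", "delay": 12}, {"airport": "JFK", "delay": 3}]'

# Rewrite the array as ndjson: one document per line, newline-terminated.
docs = json.loads(array_text)
ndjson_text = "".join(json.dumps(doc) + "\n" for doc in docs)

print(ndjson_text, end="")
```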

So what has happened so far?

The indexing process has created a table called JSONTEST$TXTIDX (the index name you provided, plus “$TXTIDX”), plus some other JSON and text search-specific tables you don’t have to worry about. This main table contains the same columns as the text index table on files in Object Storage, plus additional ones like LINE_NUMBER, OFFSET, or LENGTH. These extra columns identify the individual JSON documents in your files, each of which was indexed individually.
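Conceptually, OFFSET and LENGTH are enough to cut a single document out of a file without parsing anything else. Here is a simplified Python sketch of the idea (the byte values are hand-computed for this example, and the real column semantics may differ in detail):

```python
import json

# A small ndjson payload; suppose the index recorded the byte offset and
# length of the second document (values hand-computed for this sketch).
ndjson_bytes = b'{"id": 1}\n{"id": 2, "name": "beta"}\n{"id": 3}\n'
offset, length = 10, 25

# Slice exactly one document out of the file by byte range.
doc_bytes = ndjson_bytes[offset:offset + length]
doc = json.loads(doc_bytes)
print(doc["name"])  # beta
```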

You can query this table, but in most cases you should be fine without touching it. Instead of dealing with offsets and byte lengths of JSON documents in a file, you want to find the individual JSON documents of interest and access them directly.

For direct access, the procedure has automatically created a view with the name of your JSON index, which, in our case, is JSONTEST. This view has the following columns:

SQL> desc jsontest

Name Null? Type
-------------- ----- ---------------------------
OBJECT_NAME VARCHAR2(4000)
OBJECT_PATH VARCHAR2(4000)
FILE_LINE_JSON CLOB
MTIME TIMESTAMP(6) WITH TIME ZONE
DATA CLOB

The main column of interest is DATA, which “contains” our JSON documents from the external files. The interesting tidbit, however, is that this is a wrapper view that streams the JSON documents of interest directly from the Object Storage, based on the indexing information of our index table JSONTEST$TXTIDX!

A sample query selecting all documents where the JSON field ‘$.business_name’ contains ‘michael’ would look as follows:

SQL> select json_value(data,
2 '$.business_name' returning clob) as record
3 from jsontest
4 where json_textcontains ( file_line_json,'$.business_name','michael' )
5 fetch first 10 rows only;

RECORD
--------------------------------------------------------------------------------
ST.MICHAEL GROCERY CORP
WU, MICHAEL A.
MICHAEL G. FANELLI, INC.
ST MICHAEL GROCERY CORP
MICHAEL'S PLACE SALON MASTER HAIR CUTTER, INC.
MICHAEL'S STORES, INC
MICHAEL'S #9025
S. MICHAEL & SON INC.
MICHAEL D. THOMAS FUNERAL SERVICES INC.
WU, MICHAEL A.
10 rows selected.

So, the view itself acts like an external table with direct access to the data in the Object Storage, but it only fetches the documents of interest, based on each document’s offset and length found in the JSON index information inside the database. Now, that is all internal “stuff”. All you need to know is that you specify the JSON search predicate on column FILE_LINE_JSON, while the actual data is in column DATA, streamed from the Object Storage.

You can see in the execution plan how the JSON search is driven by the internal table to derive the individual JSON documents in your files in the Object Store:

SQL> select * from dbms_xplan.display_cursor();

PLAN_TABLE_OUTPUT
____________________________________________________________________________________________________
SQL_ID btu33a2tgx8sn, child number 1
-------------------------------------
select json_value(data, '$.business_name' returning clob) as
record from jsontest where json_textcontains (
file_line_json,'$.business_name','michael' ) fetch first 10 rows only

Plan hash value: 2883663787

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3 (100)| |
|* 1 | VIEW | | 10 | 159K| 3 (0)| 00:00:01 |
|* 2 | WINDOW NOSORT STOPKEY | | 11 | 1485 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID| JSONTEST$TXTIDX | 11 | 1485 | 3 (0)| 00:00:01 |
|* 4 | DOMAIN INDEX | JSONTEST$IDX | | | 0 (0)| |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=10)
2 - filter(ROW_NUMBER() OVER ( ORDER BY NULL )<=10)
4 - access("CTXSYS"."CONTAINS"("JSONTEST$TXTIDX"."FILE_LINE_JSON",'(michael) INPATH
(/business_name)')>0)

26 rows selected.

You can also get higher-level information from this view, such as which objects in the Object Store actually contain JSON documents matching on the JSON scalar field ‘$.business_name’:

SQL> select distinct object_name  from jsontest
2 where json_textcontains ( file_line_json,'$.business_name','michael' )
3 fetch first 10 rows only;

OBJECT_NAME
------------------------------
inspections.json

Here is an interesting last side note: a JSON search index can also be used as a regular text index, where the search is not limited to a specific JSON field but instead looks at all the data. So, when we search for the same text as before, ‘michael’, with a generic CONTAINS query, you will see more files matching this string.

SQL> select distinct object_name from jsontest
2 where contains(
3 file_line_json,
4 'michael'
5 ) > 0;

OBJECT_NAME
------------------------------
movies_mongo.json
tweets.json
inspections.json

That’s about it for now, folks. Feel free to play more with JSON search indexes and your data.

As always, the documentation has more details and information, specifically about the parameters. We’d also love to hear from you, positive or negative. We’re always working on making things better for you, our customer.


I work in Product Management in the Oracle Database organization. I thrive in helping customers leverage Oracle technology to make them successful.