Milvus, Vector Database, JSON, Vectors, Unstructured Database, Python, REST, NYC Data, Open Data, Arrays
As I have been exploring the features of Milvus, I found two interesting field types: one for JSON and one for Arrays. Both let you store more complex data to augment your vectors. Inverted indexes for Arrays and JSON are on the roadmap, so these are types you should start embracing wherever they make sense.
JSON has been a widely used data format for a number of years. It has been used as a native storage format, for processing in front ends, and as a data interchange format, especially for RESTful endpoints. It is now also being embraced as a format for AI Agent flows.
One of the nice things about Milvus is that if we aren’t fully sure what our schema will be we can use a dynamic schema.
When you use Dynamic Fields you are actually using a JSON Field. The dynamic field in a collection is a reserved JSON field named $meta.
I found a really good use for the JSON field type when I started looking at Motor Vehicle Collision data for New York City, which is available from a convenient, frequently updated REST endpoint that returns JSON.
Let’s start getting data!
Quick Video Walk Through
API INGEST
For example, if we want to ingest information about the latest street cameras in New York City, that is a REST API with JSON data. We did that recently and you can read about that one.
As you can see, working with JSON data is pretty common. Fortunately, Milvus can store it directly, so we do not need to put our data in multiple spots.
For our example today we will use Motor Vehicle Collisions in New York City.
You can take a look at the data dictionary here:
The data is served from a REST endpoint that returns 1,000 records at a time:
https://data.cityofnewyork.us/resource/h9gi-nx95.json?$limit=1000
We can raise the limit on this and also page through the results.
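As a rough sketch of paging, the snippet below builds one URL per page using the Socrata `$limit`, `$offset` and `$order` query parameters. The helper name `page_url` and the choice of sorting by `crash_date` are my own assumptions for illustration, and nothing here calls the network.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"


def page_url(page, page_size=1000, order="crash_date DESC"):
    """Build the URL for one page of results (page numbers start at 0)."""
    params = {
        "$order": order,              # a stable sort keeps pages from overlapping
        "$limit": page_size,          # rows per request
        "$offset": page * page_size,  # rows to skip before this page
    }
    # safe="$" leaves the "$" in the parameter names unescaped
    return f"{BASE_URL}?{urlencode(params, safe='$')}"


# Print the first three page URLs
for page in range(3):
    print(page_url(page))
```

Each URL can then be passed to `pd.read_json` or any HTTP client. Stop paging when a request returns fewer rows than the page size.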
City of New York Open Data Documentation
NYC Open Data (Need to Sign Up)
How to Use Socrata Data
NYC Vision Zero View
Example Usage
Querying Giant Data Sets Via API
For more details on how to work with Socrata data feeds, including limits, offsets, paging and getting started, see the articles listed above.
Let’s examine the REST endpoint first to make sure we are getting data back.
import pandas as pd
df = pd.read_json('https://data.cityofnewyork.us/resource/h9gi-nx95.json?$order=crash_date+DESC&$limit=50')
Example JSON Record
{
"crash_date":"2021-09-11T00:00:00.000",
"crash_time":"2:39",
"on_street_name":"WHITESTONE EXPRESSWAY","off_street_name":"20 AVENUE",
"number_of_persons_injured":"2","number_of_persons_killed":"0",
"number_of_pedestrians_injured":"0","number_of_pedestrians_killed":"0",
"number_of_cyclist_injured":"0","number_of_cyclist_killed":"0",
"number_of_motorist_injured":"2","number_of_motorist_killed":"0",
"contributing_factor_vehicle_1":"Aggressive Driving/Road Rage",
"contributing_factor_vehicle_2":"Unspecified","collision_id":"4455765",
"vehicle_type_code1":"Sedan",
"vehicle_type_code2":"Sedan"
}
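To give a feel for how a record like the one above could become a row for Milvus, here is a minimal sketch. It assumes a collection with an `id` primary key and a JSON field named `crash` (the names that show up in the search output fields later in this post). The helper names `describe` and `to_row` are hypothetical, the summary sentence is only loosely modeled on the `details` value seen in the results further down, and the vector embedding step is left out.

```python
def describe(record):
    """Build a one-sentence text summary of a crash record."""
    vehicles = " and ".join(
        v
        for v in (
            record.get("vehicle_type_code1", ""),
            record.get("vehicle_type_code2", ""),
        )
        if v  # skip vehicle slots that are missing or empty
    )
    return (
        f"Crash occurred on {record.get('on_street_name', '')} "
        f"with off street {record.get('off_street_name', '')} "
        f"at {record.get('crash_time', '')} on {record.get('crash_date', '')} "
        f"with vehicles {vehicles} "
        f"including {record.get('number_of_persons_injured', '0')} injuries"
    )


def to_row(record):
    """Wrap a raw API record as a row with an id and a JSON payload."""
    crash = dict(record)  # copy so the original record is not modified
    crash["details"] = describe(record)
    # Assumption: the primary key is a string; it must match your schema type
    return {"id": record["collision_id"], "crash": crash}


# Abridged copy of the example record above
sample = {
    "crash_date": "2021-09-11T00:00:00.000",
    "crash_time": "2:39",
    "on_street_name": "WHITESTONE EXPRESSWAY",
    "off_street_name": "20 AVENUE",
    "number_of_persons_injured": "2",
    "collision_id": "4455765",
    "vehicle_type_code1": "Sedan",
    "vehicle_type_code2": "Sedan",
}
row = to_row(sample)
print(row["crash"]["details"])
```

Keeping the whole record inside one JSON field means new keys from the API do not require a schema change.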
Hurrying to the next meetup, don’t want to miss a minute.
For a crash free experience attend my meetups in Manhattan.
Let’s look at our notebook.
SOURCE CODE
NOTEBOOK
FILTER AND FIND
What is really cool is that we can do some pretty powerful filters on that JSON data.
For example, we can look at the JSON field borough and filter for only crashes in “Manhattan” where the first vehicle involved is a Taxi, Sedan or Bus.
crash["borough"] like "MANHATTAN" && crash["vehicle_type_code1"] in ["Taxi", "Sedan", "Bus"]
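If you build these expressions from Python values, a small helper keeps the quoting consistent. This is only a sketch: `json_filter` is a hypothetical function of mine, not part of pymilvus, and it assumes plain string values with no embedded quotes or backslashes.

```python
import json


def json_filter(field, like=None, any_of=None):
    """Build a filter expression over keys of one JSON field."""
    clauses = []
    for key, value in (like or {}).items():
        # json.dumps adds the double quotes around the value
        clauses.append(f'{field}["{key}"] like {json.dumps(value)}')
    for key, values in (any_of or {}).items():
        # a Python list serializes to the ["a", "b"] form used by "in"
        clauses.append(f'{field}["{key}"] in {json.dumps(list(values))}')
    return " && ".join(clauses)


expr = json_filter(
    "crash",
    like={"borough": "MANHATTAN"},
    any_of={"vehicle_type_code1": ["Taxi", "Sedan", "Bus"]},
)
print(expr)
```

The resulting string can be passed as the `filter` argument of a search.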
## Attu generated code (filter added manually)
## milvus_client and query_vector are assumed to be defined earlier in the notebook
search_results = milvus_client.search(
    collection_name="nyccollisions",  # Collection name
    data=query_vector,  # Replace with your query vector
    filter='crash["borough"] like "MANHATTAN" && crash["vehicle_type_code1"] in ["Taxi", "Sedan", "Bus"] && crash["on_street_name"] like "11 AVENUE"',
    search_params={
        "metric_type": "L2",
        "params": {"nprobe": 16},  # Search parameters
    },
    limit=250,  # Max. number of search results to return
    output_fields=["id", "crash"],  # Fields to return in the search results
    consistency_level="Eventually",
)
# Each entry in search_results holds the hits for one query vector
for hits in search_results:
    for hit in hits:
        print(f"{hit}")
Reference
We can filter and explore our data with a Jupyter notebook or we can use the Milvus GUI — Attu.
Attu (Milvus GUI)
Filter With Attu (Milvus GUI)
As you can see, you can easily add additional filters and connect them with AND or OR. In my filter on the JSON field, I limit the borough to Manhattan, the first vehicle to the three most common types, and the street name to the one I am checking out.
Adding a simple filter to a Vector Search is really easy.
Download Attu today.
Are there any collisions near the upcoming meetups?
Here is the filter condition to find any crashes near our upcoming meetups:
crash["borough"] like "MANHATTAN" &&
crash["vehicle_type_code1"] in ["Taxi", "Sedan", "Bus"] &&
crash["on_street_name"] like "11 AVENUE"
The result of our query is a clue to where the meetups will be held.
{
"details":"Crash occurred on 11 AVENUE with off street WEST 40 STREET MANHATTAN, NY 10018 with lat/long 40.759525 -73.99925 at 19:45 on 2024-05-02T00:00:00.000 with vehicles Bus and Sedan including 1 injuries",
"on_street_name":"11 AVENUE","off_street_name":"WEST 40 STREET",
"crash_date":"2024-05-02T00:00:00.000","crash_time":"19:45",
"borough":"MANHATTAN","zip_code":"10018","latitude":"40.759525",
"longitude":"-73.99925","location":"11 AVENUE MANHATTAN, NY 10018",
"number_of_persons_injured":"1","number_of_persons_killed":"0",
"number_of_pedestrians_injured":"0","number_of_pedestrians_killed":"0",
"number_of_cyclist_injured":"0","number_of_cyclist_killed":"0",
"number_of_motorist_injured":"1","number_of_motorist_killed":"0",
"contributing_factor_vehicle_1":"Passing or Lane Usage Improper",
"vehicle_type_code1":"Bus","contributing_factor_vehicle_2":"Unspecified",
"vehicle_type_code2":"Sedan","cross_street_name":"",
"contributing_factor_vehicle_3":"","vehicle_type_code_3":"",
"contributing_factor_vehicle_4":"","vehicle_type_code_4":""
}
You can export your Vector Search results:
"score","id","crash"
0.016393441706895828,"4729465","{""details"":""Crash occurred on 11 AVENUE with off street WEST 20 STREET MANHATTAN, NY 10011 with lat/long 40.74681 -74.007965 at 10:14 on 2024-05-29T00:00:00.000 with vehicles Sedan and including 0 injuries"",""on_street_name"":""11 AVENUE"",""off_street_name"":""WEST 20 STREET"",""crash_date"":""2024-05-29T00:00:00.000"",""crash_time"":""10:14"",""borough"":""MANHATTAN"",""zip_code"":""10011"",""latitude"":""40.74681"",""longitude"":""-74.007965"",""location"":""11 AVENUE MANHATTAN, NY 10011"",""number_of_persons_injured"":""0"",""number_of_persons_killed"":""0"",""number_of_pedestrians_injured"":""0"",""number_of_pedestrians_killed"":""0"",""number_of_cyclist_injured"":""0"",""number_of_cyclist_killed"":""0"",""number_of_motorist_injured"":""0"",""number_of_motorist_killed"":""0"",""contributing_factor_vehicle_1"":""Driver Inattention/Distraction"",""vehicle_type_code1"":""Sedan"",""contributing_factor_vehicle_2"":""Unspecified"",""vehicle_type_code2"":"""",""cross_street_name"":"""",""contributing_factor_vehicle_3"":"""",""vehicle_type_code_3"":"""",""contributing_factor_vehicle_4"":"""",""vehicle_type_code_4"":""""}"
0.016129031777381897,"4721167","{""details"":""Crash occurred on 11 AVENUE with off street WEST 40 STREET MANHATTAN, NY 10018 with lat/long 40.759525 -73.99925 at 23:30 on 2024-04-30T00:00:00.000 with vehicles Sedan and including 0 injuries"",""on_street_name"":""11 AVENUE"",""off_street_name"":""WEST 40 STREET"",""crash_date"":""2024-04-30T00:00:00.000"",""crash_time"":""23:30"",""borough"":""MANHATTAN"",""zip_code"":""10018"",""latitude"":""40.759525"",""longitude"":""-73.99925"",""location"":""11 AVENUE MANHATTAN, NY 10018"",""number_of_persons_injured"":""0"",""number_of_persons_killed"":""0"",""number_of_pedestrians_injured"":""0"",""number_of_pedestrians_killed"":""0"",""number_of_cyclist_injured"":""0"",""number_of_cyclist_killed"":""0"",""number_of_motorist_injured"":""0"",""number_of_motorist_killed"":""0"",""contributing_factor_vehicle_1"":""Driver Inattention/Distraction"",""vehicle_type_code1"":""Sedan"",""contributing_factor_vehicle_2"":"""",""vehicle_type_code2"":"""",""cross_street_name"":"""",""contributing_factor_vehicle_3"":"""",""vehicle_type_code_3"":"""",""contributing_factor_vehicle_4"":"""",""vehicle_type_code_4"":""""}"
0.01587301678955555,"4722900","{""details"":""Crash occurred on 11 AVENUE with off street WEST 18 STREET MANHATTAN, NY 10011 with lat/long 40.745415 -74.00821 at 10:45 on 2024-05-07T00:00:00.000 with vehicles Sedan and Bus including 0 injuries"",""on_street_name"":""11 AVENUE"",""off_street_name"":""WEST 18 STREET"",""crash_date"":""2024-05-07T00:00:00.000"",""crash_time"":""10:45"",""borough"":""MANHATTAN"",""zip_code"":""10011"",""latitude"":""40.745415"",""longitude"":""-74.00821"",""location"":""11 AVENUE MANHATTAN, NY 10011"",""number_of_persons_injured"":""0"",""number_of_persons_killed"":""0"",""number_of_pedestrians_injured"":""0"",""number_of_pedestrians_killed"":""0"",""number_of_cyclist_injured"":""0"",""number_of_cyclist_killed"":""0"",""number_of_motorist_injured"":""0"",""number_of_motorist_killed"":""0"",""contributing_factor_vehicle_1"":""Unsafe Lane Changing"",""vehicle_type_code1"":""Sedan"",""contributing_factor_vehicle_2"":""Unspecified"",""vehicle_type_code2"":""Bus"",""cross_street_name"":"""",""contributing_factor_vehicle_3"":"""",""vehicle_type_code_3"":"""",""contributing_factor_vehicle_4"":"""",""vehicle_type_code_4"":""""}"
0.015625,"4722264","{""details"":""Crash occurred on 11 AVENUE with off street WEST 40 STREET MANHATTAN, NY 10018 with lat/long 40.759525 -73.99925 at 19:45 on 2024-05-02T00:00:00.000 with vehicles Bus and Sedan including 1 injuries"",""on_street_name"":""11 AVENUE"",""off_street_name"":""WEST 40 STREET"",""crash_date"":""2024-05-02T00:00:00.000"",""crash_time"":""19:45"",""borough"":""MANHATTAN"",""zip_code"":""10018"",""latitude"":""40.759525"",""longitude"":""-73.99925"",""location"":""11 AVENUE MANHATTAN, NY 10018"",""number_of_persons_injured"":""1"",""number_of_persons_killed"":""0"",""number_of_pedestrians_injured"":""0"",""number_of_pedestrians_killed"":""0"",""number_of_cyclist_injured"":""0"",""number_of_cyclist_killed"":""0"",""number_of_motorist_injured"":""1"",""number_of_motorist_killed"":""0"",""contributing_factor_vehicle_1"":""Passing or Lane Usage Improper"",""vehicle_type_code1"":""Bus"",""contributing_factor_vehicle_2"":""Unspecified"",""vehicle_type_code2"":""Sedan"",""cross_street_name"":"""",""contributing_factor_vehicle_3"":"""",""vehicle_type_code_3"":"""",""contributing_factor_vehicle_4"":"""",""vehicle_type_code_4"":""""}"
0.015384615398943424,"4729921","{""details"":""Crash occurred on 11 AVENUE with off street WEST 17 STREET MANHATTAN, NY 10011 with lat/long 40.7447 -74.008354 at 16:49 on 2024-06-03T00:00:00.000 with vehicles Taxi and Bike including 0 injuries"",""on_street_name"":""11 AVENUE"",""off_street_name"":""WEST 17 STREET"",""crash_date"":""2024-06-03T00:00:00.000"",""crash_time"":""16:49"",""borough"":""MANHATTAN"",""zip_code"":""10011"",""latitude"":""40.7447"",""longitude"":""-74.008354"",""location"":""11 AVENUE MANHATTAN, NY 10011"",""number_of_persons_injured"":""0"",""number_of_persons_killed"":""0"",""number_of_pedestrians_injured"":""0"",""number_of_pedestrians_killed"":""0"",""number_of_cyclist_injured"":""0"",""number_of_cyclist_killed"":""0"",""number_of_motorist_injured"":""0"",""number_of_motorist_killed"":""0"",""contributing_factor_vehicle_1"":""Failure to Yield Right-of-Way"",""vehicle_type_code1"":""Taxi"",""contributing_factor_vehicle_2"":""Unspecified"",""vehicle_type_code2"":""Bike"",""cross_street_name"":"""",""contributing_factor_vehicle_3"":"""",""vehicle_type_code_3"":"""",""contributing_factor_vehicle_4"":"""",""vehicle_type_code_4"":""""}"
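An export like this can be read back with the standard library. The sketch below assumes the three-column layout shown above (`score`, `id`, `crash`). The helper name `load_export` is hypothetical, and the sample row is abridged from the export to keep it short.

```python
import csv
import io
import json


def load_export(text):
    """Parse exported search results into a list of dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "score": float(row["score"]),
                "id": row["id"],
                # the JSON field is exported as a quoted string, so decode it
                "crash": json.loads(row["crash"]),
            }
        )
    return rows


# One abridged row in the same format as the export above
sample_export = (
    '"score","id","crash"\n'
    '0.015625,"4722264","{""on_street_name"":""11 AVENUE"",'
    '""vehicle_type_code1"":""Bus""}"\n'
)
results = load_export(sample_export)
print(results[0]["crash"]["on_street_name"])
```

The `csv` module handles the doubled quotes, so the `crash` column arrives as ordinary JSON text.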
AGENT to LLM
As we can see, JSON is a very useful field data type for Milvus, but there is another field data type that is also very useful: Arrays. An important difference between them is that all elements of an Array must be of the same type and the Array has a fixed maximum capacity. An insert can contain fewer elements than that maximum.
ARRAY FIELDS
Often data won't be a simple string or number and may require an array of values. Fortunately, Milvus supports this. This is another cool type that, along with JSON, lets you add some pretty useful extra data to your collections.
Adding an array field to a schema is straightforward. These are the parameters you need:
field_name="ArrayFieldName"
You will need a field_name, as always. Set this to whatever string makes sense for this field and its context.
datatype=DataType.ARRAY
For arrays, the data type must be DataType.ARRAY, not that this is surprising.
element_type=DataType.INT64
All elements must share one data type. You can set this to any primitive type like Varchar, Int8, Int16, Int32, Int64, Bool, Float or Double.
max_capacity=5
This is the maximum number of elements that your array can contain. You can have fewer elements than this capacity, or exactly that many. You cannot have more, so set this appropriately. If the number of elements in your arrays varies greatly, you may end up with a lot of unused capacity. Choose arrays as a type carefully for your use case.
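The two constraints above (one element type, a fixed maximum capacity) can be checked on the client before an insert. This is a plain-Python sketch under my own assumptions: `check_array` and the `ELEMENT_TYPES` mapping are hypothetical and not part of pymilvus, and Milvus applies its own validation on insert regardless.

```python
# Hypothetical mapping from a few Milvus element type names to Python types
ELEMENT_TYPES = {"INT64": int, "VARCHAR": str, "BOOL": bool, "DOUBLE": float}


def check_array(values, element_type, max_capacity):
    """Return values unchanged if they fit the array field, else raise."""
    py_type = ELEMENT_TYPES[element_type]
    if len(values) > max_capacity:
        raise ValueError(
            f"{len(values)} elements exceeds max_capacity={max_capacity}"
        )
    for value in values:
        # exact type check: bool is a subclass of int, so isinstance is too loose
        if type(value) is not py_type:
            raise TypeError(f"{value!r} is not of element type {element_type}")
    return values


print(check_array([1, 2, 3], "INT64", 5))
```

Running a check like this while loading data surfaces bad rows early, instead of in the middle of a batch insert.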
ROADMAP
In the upcoming Milvus 2.5, we will get to try out our new inverted indexes for Arrays and JSON, so I will update when that happens. I will also go through the amazing list of new features and updates and give them a test run.
For a deeper dive into JSON and Array Data Types, take a look at the resources below.
RESOURCES
NOTES
When you name your fields, keep them simple, with only alphanumeric characters and underscores. This one has gotten me before, and you don't want to keep changing your schema. Also make sure your ids match the type you use in your schema. If your ID is a String and the schema expects an int64, you will get an error.
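A quick way to catch awkward field names before creating a collection is a small regular expression check. This is a sketch: `is_simple_field_name` is a hypothetical helper, and rejecting names that start with a digit is my assumption on top of the rule stated above.

```python
import re

# Letters, digits and underscores only; must not start with a digit (assumption)
FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_simple_field_name(name):
    """Return True if the name uses only letters, digits and underscores."""
    return FIELD_NAME.fullmatch(name) is not None


for candidate in ("crash", "vehicle_type_code1", "crash-date", "1st_vehicle"):
    print(candidate, is_simple_field_name(candidate))
```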
TIP
If you want to use Milvus Lite it does not currently work on ARM or Windows.
By: Tim Spann