TL;DR: I loaded and VECTORized my 19.000+ photos into my Oracle 23ai database. Converged Database, yes (link). Some Surprising findings came out. Discovering things about Vectors, python, ai-models, and SQL-Operators for Vectors.
Background: I want Text-Searching through my photos.
After 25 years of taking digital pictures, I have way too many photos. I need some clever mechanism to search them...
So when Oracle 23ai came along with the some "ai" built in, I started tweaking around with VECTORs and Vector-Search.
End-Goal is to be able to enter a text-question, and find photos related to that question. (this may take some time / tweaks / blogs...)
I am Not There Yet, not by far. And after months of tweaking, I'm not sure if I will ever reach my goal. But I did Learn some interesting things, tech and non-tech, in the process.
Loading the image: The Easy Part.
Oracle 23ai is a "Converged Database" and should be able to handle any Data. So naturally, I want to load my data into a Table. The Table:
Next, the example python-code, to open and insert the image from a file:
This is of course very much a standard operation. Easy. If you set out to build something like this, check the documentation on the oracle-python driver, and pay attention to the input/output handlers. Maybe more on those later.
In the subsequent version of my experiments, I decided to not load the binary-images anymore, as loading took time and bloated my (containerised) database. There are arguments pro- and contra-, but it is good to know you Can Effortlessly store the images in your Oracle Database.
For your own real-world applications, you can make your own decision: keep images on separate storage or store them in your Converged Database (often Recommended!).
Loading a Vector: some caveat around datatype. RTFM...
The Table to hold vectors, in this case for a model that generates 2048 elements of Float32, like mine did (I used ResNet50 obtained via Hugging Face):
In this case I store the vectors separate from the image-table, because I expect to generate multiple vectors, from different models, for every image. Hence a separate table, related to the image-table via img_id (an FK, of course).
The example python-code (you can see I did some debugging/checking):
Notice a few things:
Python Datatype: Some models generate numpy-arrays, and other produce (standard?) python arrays of <class 'list'>. I'm converting everything to list, and sometimes I print the contents to a file, dbgdata.
Oracle VECTOR datatype: in my case I ended up with "garbage" data in my vectors a few times. That was mostly fixed by trial+error: I tried FLOAT32 and FLOAT64 in the vector-definitions, and I tried "d" and "f" in the array-constructors. The resulting combi of "d" and FLOAT32 finally worked (d-for-Double was counterintuitive for a float-32, but would be a different blog, and not relevant here... )
If you go about inserting VECTORs from python, I strongly Recommend to read The Fantastic Manual (RTFM link). Also, because this version is not net GA at time of Writing (Nov2024), things may still change slightly. My type-conversion might have been caused by old-versions or outdated-release.
Notably about input-handlers and output-handlers you will need to convert between python-classes and oracle-column-type. RTFM and Test! I had some trouble + confusion there. Hence I sometimes ended up printing vectors out from python, and even built a small utility to inspect vectors inside my database (link to victor.py).
Also, the python-array Must be of same length as the defined VECTOR column. You may find out the hard way, especially if you start using various models to generate vectors, and their lengths differ. This is probably a reasons to create a separate table to hold the vectors from each model.
After some trial+err (okok, some of the dbg-code is still there), my vectors got correctly inserted, All Good.
This Works: Vector-similarity search.
When doing Cosine-Similarity comparisons on the image-vectors, I get the "expected" results: Similar images yield Cos-Similar vectors.
To simplify, I created a view to encapsulate the Cosine-Similarity:
Here the calculation is embedded in the view, which makes the typing of SQL a lot easier. The test-code immediately following the create-view sort-of proves that it works. In the limited test-set of my first 200 images it found two that it thinks, according to this "model", are very similar and the filenames are probably no coincidence.
Notice the use of <=> as an operator, rather than calling:
vector_distance ( vec1, vec2)
I prefer an Operator, it reads more like a Query. The available operators are:
vector1 <=> vector2 : Cosine Distance
vector1 <-> vector2 : Euclidean Distance
vector1 <#> vector2 : Negative Dot Product.
Those operators are identical to what is being used in some other SQL implementations. Not sure if they are part of any Standard yet.
And this View worked like a Charm. The two files, 1-view and 2-view in my test-set were indeed Very Similar. I opened them to double-check:
More testing: I also happen to know that image nr 7 is a powerline-pylon (dutch site on pylons). Let's use the view to check if I have more photos of pylons:
After a testrun, I included the "easy open" statements for the files with a high Cosine-Similarity, all above 0.9. And Bingo:
It found 4 other files, and ... the file itself, which of course has a Cos-Similarity of 1.000 with itself. Yes, all Pylons.
All this looks hopeful: I can find images, when I can provide an "Example". The Cosine-Similarity will find images that resemble. The higher the cos-sim, the more the image should look alike.
The view works, I can encapsulate the vector-operators inside views, which is a Great Convenience for writing complex queries.
And more importantly: Similar Images have a high Cosine-Similarity. This is as close to QED as I get at this point.
Of course, I'm not the first to try this. Notably That JeffSmith (Hi Jeff!) did something very similar, using a REST-API looking for a cute dog. As did lots of others on t'interweb.
The Limitation: no Text-Search yet.
OK, now I can "embed" or generate vectors on the images and use them for "similarity" search. But to search the (embeddings of the) images with "type text to search" has not worked yet: None of the models I explored could encode my Text good enough so that a the text-vector and the image-vector have a satisfactory high cosine-similarity.
This means I was not able to use "question-text" yet to search my images, despite the "promise" of several models that it would be possible.
The details would be boring, and not very useful yet as I dont understand all the details enough myself yet.
But in the process I stumbled across several other interesting "findings". More in subsequent blogs.
Some Learnings:
The "models" used in embedding are a whole separate topic. Let me just Recommend the Hugging Face site (link). But there are many more sources and this field will probale evolve rapidly.
Models and Vectors: They only work if you compare like-for-like, e.g. vectors from the same model (duh). To this end, it is probably better to have separate tables for each model. I'm experimenting. Hence ...
Datamodel: must be flexible. I will experiment (already have) with various models for "embedding" and I will have to adjust my datamodel accordingly. For example, it may or may not be smart to store vectors of variable lengths (different models) in the same table. And in several models there are "captions" and "keywords" generated that need to be stored somewhere, preferably in a way that is easily searchable by SQL and by some vector-embedding-mechanism. This datamodel will Evolve.
Indexing: No need yet, to my surprise. So far, all my experiments with vector-search have been (sub)second queries, even on 100K or larger tables. This was a nice surprise. But in future, I think we should explore vector-indexing. The manual on those is worth a study (link) already. Also, I'm sure other DBA-geeks will soon dig into the indexing and explain-plans for vector-search queries. Either from necessity or CTD. Or simply becaus that is what DBA's do, right?
SQL Rules: Once the VECTOR-data is inside the Database, you can use the VECTOR-column and vector-operations in SQL-statements like any other (Thankfully!). I used a VIEW to simplify things, encapsulate the joins, calculate the Cosine-Similarity, and facilitate searching. Neat, Simple, Convenient!
Conclusion: It is a Journey...
[small]
Caveat: This is not one of those smooth (self- marketing-) blogs where the author has prepared 7 blogposts, neatly leading up to a "desired outcome" to demonstrate how Wonderful their huggingface-model is, and how clever the products they use... This is just a DBA-grunt searching his way around new-ish technology.
I've done some interesting coding and running with various models. I need (want) to do More on this topic to Explore.
For example: the generation of Captions and Keywords: This may provide the key to Text-Searching of image. And I found some surprisingly Accurate captions on my own collection.
More Later.
But I'm Enjoying the Journey.
No comments:
Post a Comment