Friday, 22 November 2024

Oracle 23ai - Converged Database, playing with images and text

TL;DR: Investigating the use of VECTORs in Oracle 23ai, I stumble across all kinds of (python) models. Some of those pre-trained models think they can recognise my photos. Some Surprises. But I am still nowhere near my end-goal: text-searching my pictures. Having the data in Oracle 23ai helps, though: I can use SQL.


Background

I'm on a quest to search through my own photos using a text-query, like "photos with pylons in them?". So far I have successfully managed to store images and VECTORs generated from those images. Searching image-to-image works OK: give it an example-image like pylon.jpg, and it finds similar, resembling images. But it is hard to find a model (a vectoriser?) that allows text-search.

In the process, I'm finding other interesting things about "Artificial Intelligence" and the use of python and Oracle. For example, there is "AI" that can generate captions, descriptive text, for the images you give it. So, Let's Try That.

But let me insert some caution: Automated captions, Labels, are not always correct. Some of them become funny. Hence...



Yes, the warning is there for a reason.


Caption-generation: halfway to searching the images?

If an "AI" can generate descriptions of my pictures, that would be a step in the right direction. And I came across several models that pretend they can. A lot of my searches lead to the Hugging Face site (link), where you can find descriptions of the models, example python-code. And where the necessary "data" from the pre-learned moded can be (automatically!) downloaded.

Not all models can be made to work as easily as the blogs/pr-posts or example-code suggest, but I got a few to work fairly quickly.

Disclaimer: A lot of this is "black box" to me. And when someone asks me for details, I might reply: "I have No Idea, but it seems to work".


Found a model that "worked" for me

The one that seemed to work calls itself "Vit-GPT2-coco-en" (link)

ViT-GPT-2 : Vision Transformer - Generative Pretrained Transformer 2  

(Yes, I found out later that there are also GPT-1, GPT-3 and GPT-4. In my search, I just happened to stumble upon an implementation of ViT-GPT-2 that worked more or less effortlessly in my situation - I might explore higher versions later, as they look Promising).

I took the example code, did some copy-paste from this page (link), and managed to generate captions for my stored photos. To use this model, I needed the following at the start of my py-file:
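The original snippet is not reproduced here, but it boils down to something like the sketch below. The checkpoint-name and the exact processor-class are indicative (copy them from the model-page you use), and may differ per transformers-version:

    # sketch: load the pre-trained ViT-GPT2 captioning model via transformers
    # (checkpoint-name is indicative, take it from the Hugging Face model-page)
    from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

    checkpoint = "ydshieh/vit-gpt2-coco-en"

    model             = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    feature_extractor = ViTImageProcessor.from_pretrained(checkpoint)
    tokenizer         = AutoTokenizer.from_pretrained(checkpoint)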

This tells python to use the transformers (a way to run a model, I think) and it sets three functions (function-pointers): feature-extractor, tokenizer and model. Those functions will be used to "extract" the caption.

With careful copy/paste from the example-code, I constructed a function to generate the caption for an image. The function takes the file-path to the image, and returns the caption as a string:
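In outline it looks like this (a sketch along the lines of the Hugging Face example code; the generate-parameters are indicative):

    from PIL import Image

    def img_caption(img_path):
        # open the image and make sure it is RGB
        image = Image.open(img_path)
        if image.mode != "RGB":
            image = image.convert("RGB")
        # feature-extractor -> pixel-values, model -> token-ids, tokenizer -> text
        pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
        output_ids   = model.generate(pixel_values, max_length=16, num_beams=4)
        caption      = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
        return caption.strip()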

Notice how it calls the three functions defined (imported) earlier. The actual "magic" is packed inside these three functions. And inside a set of data, which is the "pre-learned" part of the model. On first call, you will notice that the transformer-functions will download a sizeable amount of data. Make sure you have a good+fast internet connection, as the amount of download-data can be considerable, sometimes several GBs.

For more information on the transformers, you can study the Hugging Face site. For the moment, I just accept they work.

Once I got it to work, I added a table to my datamodel to hold the captions, in a way that allows me to run multiple models and store their captions in the same table, distinguished by model_name.
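The table looked roughly like this (names are indicative, the real DDL also carries the FK and some housekeeping-columns):

    create table img_caption
    ( img_id      number         not null   -- refers to the image-table
    , model_name  varchar2(100)  not null   -- which model generated this caption
    , caption     varchar2(4000)
    );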

(Yes, yes, I know, the model-column should relate to a lookup-table to prevent storing the name a thousand times.. Later!).

Then I just let the model loop over all my 19500 photos, and waited... Almost 10hrs, on my sandbox-machine, a vintage MB-Pro from 2013.

Note: I'm deliberately Not using any Hosting-provider, a SaaS, or an API-service, out of principle. I try to keep my source-data, my pictures, on my own platform, inside my own "bricks and mortar". I fully appreciate that a hosting-provider or cloud-service, or even an API, could do this better+faster. But I'm using my own "Infrastructure" at the moment. I even went so far as to run the caption-generator for a while without any internet connection, just to make sure it was "local".


The Result: Captions.

The resulting text-captions turned out to be primitive but often accurate descriptions of my photos. Here are some Queries on them:
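For example, a simple group-by to see the most frequent captions (the model_name value is indicative):

    select caption, count(*)
    from   img_caption
    where  model_name = 'vit-gpt2-coco-en'
    group  by caption
    order  by count(*) desc
    fetch  first 20 rows only;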

It is clear that most of my pictures contain a "motorcycle" in various locations. No mystery there. Let's clean that list by adding

  where caption not like '%motorcycle%'

Now the highest count is of some "metal object with hole"? Some folks may already guess what that is, as I ride a certain type of motorcycle...

In more detail, let me go back to a previous blog (link): there I had identified 5 pictures with "powerline-pylons" on them, and the vector-search had correctly identified them as "similar". What would the caption-algorithm make of them? Let's Query for the captions of those 5 pictures:
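Something along these lines, re-using the similarity-view from the previous blog to pick up the pylon-pictures around image nr 7 (view- and column-names are indicative, see that blog):

    select c.img_id, c.caption
    from   img_caption c
    where  c.img_id in ( select s.id2
                         from   v_img_sim s
                         where  s.id1 = 7       -- the known pylon-picture
                         and    s.cos_sim > 0.9 )
    order  by c.img_id;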

Hmm. Only one of the five is correctly identified as a power-line. Check the pictures on the previous blog. The two findings of a kite are explained by the triangular shape combined with the lines leading to it. The train... was not a train, but could be mistaken for one, sort-of.

Main Finding (and I checked other pictures as well...): There are a lot of Wrong-Captions in the list. Doing a check for "false positives" or "false negatives" would be rather time-consuming. So while some of the captions were Surprisingly Accurate, there is still a lot of noise in the results. Hence the "Warning".

Oh, and that "metal object with a hole" ?  It was the Oil-window on my engine... I tend to hold-down the phone to take a picture of it Every Morning to have an idea of oil-consumption between maintenance. I mostly delete the pictures, but when I keep them, they look like this:

Correctly Identified! And no worries, the modern engine of a BMW-R1250RT uses virtually No oil between maintenance. But it is a good habit to check.

There were several other Surprising "accurate finds" and a lot of "funny misses", but that is not the topic of this blog. The use of this model's caption mechanism is helpful, but still rather crude and error-prone.


Learnings.

Python: At this stage, I mostly learned about python (and Oracle). Not much vector-logic happening in my database, yet.

Models, AI.. : I learned there is A Lot More to Learn... (grinning at self...). Others are exploring the use of ONNX models which can be loaded into the database (link). I am sticking to python to run my models, for the moment.

Hugging Face: I got quite skilled at copy-paste of example-code and making that work. The HF site and how it enables use of models was quite useful to me (link again)

Detail on housekeeping, cached-data: The models download a lot of data (the pre-learned knowledge, possibly a neural-network?). By default this will be in a hidden directory stored under your home-dir: ~/.cache/huggingface. Relevant in case you want to clean up after testing 20 or so models. My cache had grown close to 100G. It is only cache, so if you clean up some useful data by accident it will re-download on the next run.

Generated Captions: Sometimes Surprisingly Accurate, sometimes a bit hmmm. Results might say something about my picture-taking-habit, not sure what yet, but I may have to diversify my subjects a little. Too many pictures of motorcycles.

Converged Database: Give it any SQL you want.

Having the data in an RDBMS is still a large blessing. Inside a Converged Database (link), you have the liberty to combine data from any table to look for the information you need at any given time. And I Know how to formulate SQL to gain some insights. 

SQL Rules!


Concluding...: Not there yet.

My original goal was (still is) to enter a question as text and have the system report back with the possible resulting images. It seems That is going to take a lot more googling and experimenting still.

Sunday, 17 November 2024

Vector functions in 23ai - and unexpected insights.

TL;DR: I loaded and VECTORized my 19.000+ photos into my Oracle 23ai database. Converged Database, yes (link). Some Surprising findings came out. Discovering things about Vectors, python, ai-models, and SQL-Operators for Vectors.


Background: I want Text-Searching through my photos.

After 25 years of taking digital pictures, I have way too many photos. I need some clever mechanism to search them...

So when Oracle 23ai came along with some "ai" built in, I started tweaking around with VECTORs and Vector-Search.

End-Goal is to be able to enter a text-question, and find photos related to that question. (this may take some time / tweaks / blogs...)

I am Not There Yet, not by far. And after months of tweaking, I'm not sure if I will ever reach my goal. But I did Learn some interesting things, tech and non-tech, in the process.


Loading the image: The Easy Part.

Oracle 23ai is a "Converged Database" and should be able to handle any Data. So naturally, I want to load my data into a Table. The Table:

Next, the example python-code, to open and insert the image from a file:
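A stripped-down sketch of that insert, using the python-oracledb driver (connect-details are placeholders, adjust to your own environment):

    import oracledb

    # connect-details are placeholders
    con = oracledb.connect(user="scott", password="...", dsn="localhost/freepdb1")
    cur = con.cursor()

    def insert_img(file_path):
        # read the image-file as bytes and insert it into the BLOB column
        with open(file_path, "rb") as f:
            img_data = f.read()
        cur.execute("insert into img (filename, img) values (:1, :2)",
                    [file_path, img_data])
        con.commit()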

This is of course very much a standard operation. Easy. If you set out to build something like this, check the documentation on the oracle-python driver, and pay attention to the input/output handlers. Maybe more on those later.

In subsequent versions of my experiments, I decided not to load the binary images anymore, as loading took time and bloated my (containerised) database. There are arguments pro and contra, but it is good to know you Can Effortlessly store the images in your Oracle Database.

For your own real-world applications, you can make your own decision: keep images on separate storage or store them in your Converged Database (often Recommended!). 


Loading a Vector: some caveats around datatypes. RTFM...

The Table to hold vectors, in this case for a model that generates 2048 elements of Float32, like mine did (I used ResNet50 obtained via Hugging Face):
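Roughly like this (table- and column-names are indicative; the real thing also carries the FK-constraint):

    create table img_vector_resnet
    ( img_id      number         not null   -- FK to the image-table
    , model_name  varchar2(100)  not null
    , vec         vector ( 2048, FLOAT32 )
    );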

In this case I store the vectors separately from the image-table, because I expect to generate multiple vectors, from different models, for every image. Hence a separate table, related to the image-table via img_id (an FK, of course).

The example python-code (you can see I did some debugging/checking):
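In outline it did something like this (get_embedding() is a stand-in for the ResNet50 feature-extraction, and the connection/cursor are the ones from the insert-example above):

    import array

    def store_vector(img_id, img_path, dbgfile=None):
        vec = get_embedding(img_path)          # stand-in: returns 2048 floats (list or numpy-array)
        vec_list = list(vec)                   # normalise numpy-array / list to a plain python list
        if dbgfile:
            print(vec_list, file=dbgfile)      # debug: dump the raw values to dbgdata
        vec_arr = array.array("d", vec_list)   # the combination that worked for me: "d" + FLOAT32 column
        cur.execute("""insert into img_vector_resnet (img_id, model_name, vec)
                       values (:1, :2, :3)""",
                    [img_id, "resnet50", vec_arr])
        con.commit()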

Notice a few things:

Python Datatype: Some models generate numpy-arrays, and others produce (standard?) python arrays of <class 'list'>. I'm converting everything to list, and sometimes I print the contents to a file, dbgdata.

Oracle VECTOR datatype: in my case I ended up with "garbage" data in my vectors a few times. That was mostly fixed by trial+error: I tried FLOAT32 and FLOAT64 in the vector-definitions, and I tried "d" and "f" in the array-constructors. The resulting combination of "d" and FLOAT32 finally worked (d-for-Double was counterintuitive for a float-32, but that would be a different blog, and not relevant here...).

If you go about inserting VECTORs from python, I strongly Recommend reading The Fantastic Manual (RTFM link). Also, because this version is not yet GA at time of Writing (Nov 2024), things may still change slightly. My type-conversion issues might have been caused by an older version or outdated release.

Notably, read up on input-handlers and output-handlers: you will need them to convert between python classes and Oracle column types. RTFM and Test! I had some trouble + confusion there. Hence I sometimes ended up printing vectors out from python, and even built a small utility to inspect vectors inside my database (link to victor.py).

Also, the python-array Must be of the same length as the defined VECTOR column. You may find out the hard way, especially if you start using various models to generate vectors, and their lengths differ. This is probably a reason to create a separate table to hold the vectors from each model.

After some trial+err (okok, some of the dbg-code is still there), my vectors got correctly inserted, All Good.


This Works: Vector-similarity search.

When doing Cosine-Similarity comparisons on the image-vectors, I get the "expected" results: Similar images yield Cos-Similar vectors. 

To simplify, I created a view to encapsulate the Cosine-Similarity:
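The view boils down to something like this (names and the rounding are indicative, not the literal DDL):

    create or replace view v_img_sim as
    select a.img_id                              as id1
         , b.img_id                              as id2
         , round ( 1 - ( a.vec <=> b.vec ), 3 )  as cos_sim   -- <=> is the cosine-distance operator
    from   img_vector_resnet a
         , img_vector_resnet b
    where  a.model_name = b.model_name;

    -- the test immediately after creating it: most-similar pairs in the first 200 images
    select id1, id2, cos_sim
    from   v_img_sim
    where  id1 < id2 and id2 <= 200
    order  by cos_sim desc
    fetch  first 5 rows only;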

Here the calculation is embedded in the view, which makes the typing of SQL a lot easier. The test-code immediately following the create-view sort-of proves that it works. In the limited test-set of my first 200 images it found two that it thinks, according to this "model", are very similar and the filenames are probably no coincidence.

Notice the use of <=> as an operator, rather than calling:

vector_distance ( vec1, vec2)

I prefer an Operator, it reads more like a Query. The available operators are: 

vector1 <=> vector2  : Cosine Distance

vector1 <-> vector2 : Euclidean Distance

vector1 <#> vector2 : Negative Dot Product.

Those operators are identical to what is being used in some other SQL implementations.  Not sure if they are part of any Standard yet.

And this View worked like a Charm. The two files, 1-view and 2-view in my test-set were indeed Very Similar. I opened them to double-check:


More testing: I also happen to know that image nr 7 is a powerline-pylon (dutch site on pylons). Let's use the view to check if I have more photos of pylons:
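The query is roughly (again, names as per the view-sketch above):

    select s.id2, s.cos_sim, i.filename
    from   v_img_sim s
         , img       i
    where  s.id1     = 7        -- the known pylon-picture
    and    s.cos_sim > 0.9
    and    i.id      = s.id2
    order  by s.cos_sim desc;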

After a testrun, I included the "easy open" statements for the files with a high Cosine-Similarity, all above 0.9. And Bingo:

It found 4 other files, and ... the file itself, which of course has a Cos-Similarity of 1.000 with itself. Yes, all Pylons.

All this looks hopeful: I can find images, when I can provide an "Example". The Cosine-Similarity will find images that resemble it. The higher the cos-sim, the more the images should look alike.

The view works, I can encapsulate the vector-operators inside views, which is a Great Convenience for writing complex queries.

And more importantly: Similar Images have a high Cosine-Similarity. This is as close to QED as I get at this point.

Of course, I'm not the first to try this. Notably That JeffSmith (Hi Jeff!) did something very similar, using a REST-API looking for a cute dog. As did lots of others on t'interweb.


The Limitation: no Text-Search yet. 

OK, now I can "embed" or generate vectors on the images and use them for "similarity" search. But to search the (embeddings of the) images with "type text to search" has not worked yet: None of the models I explored could encode my Text good enough so that a the text-vector and the image-vector have a satisfactory high cosine-similarity.

This means I was not able to use "question-text" yet to search my images, despite the "promise" of several models that it would be possible. 

The details would be boring, and not very useful, as I don't understand all of them well enough myself yet.

But in the process I stumbled across several other interesting "findings". More in subsequent blogs.

The search continues. 


Some Learnings:

The "models" used in embedding are a whole separate topic. Let me just Recommend the Hugging Face site (link). But there are many more sources and this field will probale evolve rapidly. 

Models and Vectors: They only work if you compare like-for-like, e.g. vectors from the same model (duh). To this end, it is probably better to have separate tables for each model. I'm experimenting. Hence ...

Datamodel: must be flexible. I will experiment (already have) with various models for "embedding" and I will have to adjust my datamodel accordingly. For example, it may or may not be smart to store vectors of variable lengths (different models) in the same table. And in several models there are "captions" and "keywords" generated that need to be stored somewhere, preferably in a way that is easily searchable by SQL and by some vector-embedding-mechanism. This datamodel will Evolve.

Indexing: No need yet, to my surprise. So far, all my experiments with vector-search have been (sub)second queries, even on 100K or larger tables. This was a nice surprise. But in future, I think we should explore vector-indexing. The manual on those is worth a study (link) already. Also, I'm sure other DBA-geeks will soon dig into the indexing and explain-plans for vector-search queries. Either from necessity or CTD. Or simply because that is what DBAs do, right?

SQL Rules: Once the VECTOR-data is inside the Database, you can use the VECTOR-column and vector-operations in SQL-statements like any other (Thankfully!). I used a VIEW to simplify things, encapsulate the joins, calculate the Cosine-Similarity, and facilitate searching.  Neat, Simple, Convenient!


Conclusion: It is a Journey...


Caveat: This is not one of those smooth (self-marketing) blogs where the author has prepared 7 blogposts, neatly leading up to a "desired outcome" to demonstrate how Wonderful their huggingface-model is, and how clever the products they use are... This is just a DBA-grunt searching his way around new-ish technology.


I've done some interesting coding and running with various models. I need (want) to do More on this topic to Explore.

For example: the generation of Captions and Keywords: This may provide the key to Text-Searching of images. And I found some surprisingly Accurate captions on my own collection.

More Later.

But I'm Enjoying the Journey.



Friday, 15 November 2024

Look Inside my Vector, Victor. 23ai vector inspection.

TL;DR: I want to see the element-values of my vectors. For larger vectors, this poses a problem, as SQL*Plus does not easily display the values in the far-end of a sizeable VECTOR. I'll put some SQL and Python code at the end of the blog.


Background: Inspecting VECTORs

Since Oracle 23ai can store and process VECTOR columns, I am experimenting with some of that. Notably "models" that turn text and images into vectors so they can be compared (searched) using cosine-similarity and other hip methods. The searching sometimes yields remarkable results (surprisingly Clever stuff!), and sometimes just ...  Nothing (???). This vector-search to detect similarity, the heart of the Artificial Intelligence in this gizmo, is notably dependent on the models that generate the VECTORs and on the vector-functions in the database.

Those very clever models are to me "just python code" that I call to generate vector-arrays. However, the type of the elements inside the arrays (or lists / matrices / tensors) depends on their origin-packages: Float, Binary, Integer, even String-values are possible as vector-elements.

From the conversions between one datatype and another, and from the plumbing of data between python and oracle, I have sometimes ended up with errors or just "wrong data" in my vector-columns.

And, understandably, if you try to compare Pears and Oranges, you end up with un-reliable results, or even straight up errors.

Hence the need to sometimes verify, to "inspect" my VECTORs in detail.


The Problem: Trust No 1.

Some of the models I used generate normal python lists, others use numpy (link) and generate numpy-arrays. Then there is Oracle that wants to know if a VECTOR will contain FLOAT32, or FLOAT64, or BINARY values. And if the values from python do not match the type defined for the column, I seem to end up with either errors, or with scrambled (wrong!) values in the vector-column.

Hence the Need for input-handlers and output-handlers. You have to know this when processing data between python and oracle (link to docu)

Combine that complexity with the fact that sometimes my cosine-values were suspicious (i.e. probably wrong).

And all this led me to want to "inspect" my vector-values using both python and SQL*Plus or SQLcl, just to see if the "generated values" had also ended up correctly inside the VECTOR columns in my database.


Attempt nr 1: SQL*Plus to view a vector.

For small vectors, there is no problem. Here is my super-simple demo-set (also used to calculate known cos-distance and euclidean-dist, Highschool-Math, but different story): 
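The demo-set is roughly this (the values and the dimension are indicative; the table- and column-names match what you will see further down):

    create table dv1
    ( id   number
    , vct  vector ( 3, FLOAT64 )
    );

    insert into dv1 values ( 1, '[1, 2, 3]' );
    insert into dv1 values ( 2, '[4, 5, 6]' );
    insert into dv1 values ( 3, '[7, 8, 9]' );
    commit;

    select id, vct from dv1;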

As long as I can sensibly fit the values of the vector-elements on a screen, I can display them. With some effort, I can even read the floating-point notation of more complicated vectors:

Now, at some point, the values will become too many and too large to display. Others have suggested to increase buffers, linesize, pagesize etc... But There is a Limit to what you can sensibly Display and Decipher.

If this works for you: You are Good. No need to read further. But in my case: Not Suitable for vectors of 512, 784, or 2048 elements... 


Attempt nr 2: regex to un-pivot a vector.

When looking at the values from SQL*Plus, SQLcl and SQLDeveloper, the data is presented as CSV. And I happen to know a trick to un-pivot CSV data:
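The trick is the classic regexp_substr + connect-by-level un-pivot; roughly like this (from_vector() is used here to get a textual form of the vector, the exact conversion may differ per version):

    with v as ( select from_vector ( vct ) as txt from dv1 where id = 1 )
    select level                                    as elem#
         , regexp_substr ( txt, '[^,]+', 1, level ) as element   -- note: keeps the '[' and ']'
    from   v
    connect by level <= regexp_count ( txt, ',' ) + 1;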

That works for the small vectors (I should also remove the '[' but I'm Lazy...).

It also works with larger vectors, with more elements. But with a vector of 512 elements:

The fact that there is an implicit use of VARCHAR with maxsize (32K?) will limit my regex options.

So, at this point, if You are able to display Your Vectors: You are Good. 

But in my case, I had to keep looking...


My Solution: a Python program to "print out" the values.

Since most (all) of my models are currently in python, I looked at something python-y to read+display my vectors. The Requirements that came to mind were:

1) Query any vector, in an ad-hoc way, as I would normally do with SQL.

2) Display All Elements in a readable way to a terminal or write to file, whereby a tee or redirect would also work. 

3) Some easy way to check and compare VECTOR contents (a hashtotal came to mind as a first, crude, way to check a vector).

After some experimenting, this resulted in the program called victor.py. It does approximately the following (a condensed sketch of the code follows the list):

a) Connect to the database (I'll work on command-line or connection-name later) and report about the connect: username, global_name, so that the user can confirm "where the program is".

b) Determine which SQL to use. The Query must have a VECTOR (or list) as first column in the return-cursor. It will not complain if there are more columns, but they will be ignored (for the moment).

b1) IF there is an arg[1], THEN assume that is the SQL to use to get the vector(s). Currently, there is a need to "quote" the SQL statement to make it just arg[1].

b2) ELSE ask for an SQL statement. 

c) Open a cursor with the statement, and start returning values. If the SQL contains a syntax error, that will result in a nasty error-stack, but the actual error will be at the bottom of the stack, Readable. 

c1) IF the first column is not a list (vector), THEN report a warning.

c2) Else start processing the VECTOR, report its length, the type of elements (float or...), and print the elements 1-by-1. At the end of the element-list: print some summary (vectorlength, sumtotal, rownum in cursor..).

c3) Prompt for Next-record or Control-C to stop

d) Report some closing message: nr of records, and the SQL-used in case the user forgot... Then Exit the program, which closes cursors and disconnects the session (maybe check this!).
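For the impatient, a condensed sketch of that flow (connect-details are placeholders; the real victor.py has more checks and comments):

    #!/usr/bin/env python3
    # victor.py (condensed sketch): inspect VECTOR columns, element by element
    import sys
    import array
    import oracledb

    def main():
        # a) connect and report where we are (connect-details are placeholders)
        con = oracledb.connect(user="scott", password="...", dsn="localhost/freepdb1")
        print("connected as", con.username, "via", con.dsn)

        # b) get the SQL: either from argv[1] (quoted), or prompt for it
        sql = sys.argv[1] if len(sys.argv) > 1 else input("sql> ")
        sql = sql.strip().rstrip(";")          # tolerate a leftover semicolon

        # c) open a cursor and walk the result-set
        cur = con.cursor()
        cur.execute(sql)
        nr_rows = 0
        try:
            for row in cur:
                nr_rows += 1
                vec = row[0]                   # only the first column is used
                if not isinstance(vec, (list, array.array)):
                    print("warning: first column is not a vector/list but", type(vec))
                else:
                    print("Vector Found. element-type:", type(vec[0]), " dimension:", len(vec))
                    for i, val in enumerate(vec, start=1):
                        print(i, val)
                    print("-- dim:", len(vec), " sum-total:", sum(vec), " row-nr:", nr_rows)
                input("Enter for next record, Ctrl-C to stop... ")
        except KeyboardInterrupt:
            print("\nstopped by user.")

        # d) closing message, then exit (closes cursor and connection)
        print("rows processed:", nr_rows, " sql was:", sql)

    if __name__ == "__main__":
        main()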

I'll try to include the py-source and a demo-sql at the bottom of this blog.

So far, the resulting code, victor.py, worked for me. Let me show some runs. Or as they say at #POUG: 


This is Blog, Let's Demo:

Calling the program with no arguments: It will prompt for a query. And we'll give it a very simple query: select * from dual ;

Notice the gentle warning. The program needs a Query that returns a VECTOR as the first column of the query. 

So, let's give it a vector. We'll use the new-and-improved feature of a select without a from, and give it just a vector, a vector of Two Elements, two numeric elements with values 1 and 2:

Result! 

The program detects a list (vector) and reports "success": Vector Found. It then reports the python-type of the objects (class 'float'), and the detected dimension, the length of the vector. This vector seems to have Two Elements.

Next it prints out the elements each on a separate line. This list can be quite long as typical vectors are 100s of elements or more, hence this will typically scroll off screen.

And it closes with a "summary": it found a vector of 2 elements, of which the sum-total was 3, and this was the 1st record of the resultset. 

I'll let you test for yourself what happens with : 

select to_vector ( '[1,2]' ) 

union all 

select to_vector ( '[3,4,5]' )

And then re-try that but leave out the "all" keyword, hehehe.


Now for Real Serious Data:

Next it was time for some real Tables. Start small, using the example from the start of the blog. The data in table DV1 consists of 3 records, inserted as: ...

For the query, we answer the prompt with 

select vct, id from dv1 ; 

and the result is:

And Voila: The first of the vectors pops out from a table, I can "read" every element of it, and it waits for me to press "enter" before fetching the next record. 

Also note that the ID field in the select is ignored: it is returned from the cursor, but not used in any of the code. The advantage is that this allows you to copy/paste queries with some leftover-columns in case you are in a hurry and don't want to do too much editing. In later versions, I might try to use those additional columns for more information.

If we hit "enter" three times, all three records will be fetched, displayed, and when the cursor is finished, Victor.py will tell you and present a summary:

Three records processed, and just to remind you it re-displays the SQL you made it process to get those results. 

At this point, notice that the use of a semicolon to terminate the query is not needed (python doesn't want it), but if you happen to type or copy a ";" at the end the program will ignore it (I got too many errors from leftover semicolons while testing with copy-paste of queries..). In future, I may want to use multi-line input and then the ";" becomes an important End-Signal. Later.

But Now I want to see a Real Vector: the table IMG_VECTOR contains a VECTOR column with data from the CLIP-model, vectors of dimension 512. The table typically holds 100K records so I may not want to see all of the result-set. At this point I really needed a "Control-C" whenever I forgot to limit the results via a Where-Clause. 

Let me try and inspect the first few records of this table, and I will use the "command line" to pass the SQL. 

Notice the double-quotes around the SQL, something I might remove in future releases, but for the moment I need them to treat the SQL as a single (argv[1]) item.

When the cmd-line contains the SQL, Victor doesn't need to prompt for an SQL, and can start straight away to fetch and display the first vector. The 512 lines will scrolllll all the way, but I'll show the top:

And at the bottom, I can either hit "Enter" after each record, or Control-C to stop fetching. I've hit Control-C after the 2nd record:

Stopping the fetch means Victor will once more show the SQL, will report how many rows were processed, and will then gracefully Exit. 

Job done.


Closing Remarks: It worked for me.

In your case,  you may be happy with just SQLcl, SQLDeveloper, or SQL*Plus, or you may already have your own utility. 

I do expect Oracle to come up with some "vector inspection" option in its Tools (SQLDeveloper, Apex, SQLcl, PL/SQL-arrays...). There may already be a few when you read this.

I'll try to include or link the python code below, and may even put it out on github when it is ready for that. 

The code contains a number of comments from ideas that I think _could_ be done to improve it, but for the moment it works for me. I'm happy at this point, not sure if I need more.

Feel free to send me suggestions with improvements. But chances are you can improve it yourself, or some other tool(s) out there is already doing a better job. 

----------------------- notes and source code below ------------

Note1: "Trust No 1" is a famous quote from Agent Mulder, which he apparently also used as his computer-password. Those Were The Days, before Tokens and 2FA and Biometrics.

Note2: "What is our Vector Victor" is a quote from the 1980 movie Airplane (link). You will hear a lot of greybeards using that phrase when talking VECTOR. I could not resist...

Below: the SQL to set up the demo, and the py-code from my python-thingy.. Feel free to plagiarise.

------------------------------- victor.sql ---------------------------

Link to the file...

------------------------------- victor.py ---------------------------

Link to the file...




Tuesday, 12 November 2024

Oracle 23ai: Backup of Vector Data.

TL;DR: If you have datatype VECTOR in your tables, you can no longer use exp/imp for quick-backup of (small-ish test)sets. Switch to datapump, or use other ways to save your testsets.


Background:

Experimenting with the new VECTOR datatype in Oracle 23ai, I did a lot of testing and tweaking. And sometimes, I want to save a (test)set: tables with data in them, often with related pieces of PL/SQL that act on the data. The old-fashioned exp tool was ideal for that: it will back up a schema-owner with all objects and data in just one file, and restoring is dead-simple (if you are a legacy-dba like me).

Some of my testing still uses exp. And That became a problem when I got columns of type VECTOR.


Oops: exp throws errors.

My trusted go-to solution for quickly saving datasets is good-old exp. It did not work in this case:


That table did not get exported...

Notice that I am using the latest version of both the database and the exp-tool. I'm running this from the server, but I got the same results at my client-side with a slightly older exp version.

This error is more or less expected, as the exp/imp toolset is no longer supported "for general use" (link). The tool is not aware of the new VECTOR type, and apparently it refuses to treat it as "just a BLOB". I can ask Oracle if they would consider upgrading exp, but I suspect the answer is "No".

NB: One of the reasons exp was deprecated is notably the security-risk: anyone with SQL*Net access could use it to Very Easily extract data from a database. And we all know "security is king".

I needed to look for Alternatives...


SQLdeveloper: Always Helpful.

My first thought was: Scripts. I like scripts, so I would want a file with INSERT statements. SQLDeveloper is Good for That, it is free to use, and you can find the latest version here (link). It can generate insert-scripts to recreate and populate a table. And that worked for some of my vector-tables but not for all. 

Of corse the "insert of a vector" can be bulky: Some of mine are 512 or even 2048 floating point values of type FLOAT64, and that takes a lot of text when written out as an insert-statement.

So, for some of my vector-columns I would Not get the actual data in the script, but rather an insert of an EMPTY_CLOB():

My guess is those vectors (these are size 2048) were too large for a 32K varchar2, and the tool just decided it couldn't handle it. I'm also guessing there is a limit to the size of any individual quoted vector that SQL*Plus can re-insert. After all, you cannot spool BLOB columns for insertion via SQL either.

So this was only a solution for Some cases, but not for All of my testsets. 

But if this scripting works in your case: You Are already Good. SQLDeveloper can do the job for you.


Re-generate the embeddings from source:

For small datasets, I can re-generate the vectors from the original image or text. It may take 2min, 10min, or even 60min, depending on the model and the dataset. But for small and moderate test-sets, I can always re-run the embedding. Just be patient. The major downside is: I now need to save my source-data with my test-sets. In my case, a lot of images and textfiles. This increases the size of my "tests" by a few orders of magnitude, and it makes it much more tedious to (wait for the) restore of testsets when I mess them up or want to re-run tests frequently.

Generally Possible, but not Practical.


Use Datapump: This worked in All Cases - Problem Solved.

The replacement of exp and imp is of course the DataPump variety: expdp and impdp. And they are included in the Client-Stack if you download that, as every DEV and DBA should from here (link).

Here is my expdp, started from my local laptop:
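The command itself is nothing special; something like this (connect-string, schema and filenames are placeholders):

    expdp scott/secret@myhost/freepdb1 schemas=scott \
          directory=DATA_PUMP_DIR dumpfile=vector_tests.dmp logfile=vector_tests.log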

No issues.

The biggest Caveat of datapump is that the files will be "server side". In the example, it is located in the dpdump directory of the pdb where my user-data is located.

Your dmp-file ends up in a Directory, usually the dpdump directory, on the Server. You can run datapump itself from your client-machine (I do that a lot), but to access the dmp-files beyond just expdp/impdp you need access to the location on the server where datapump stores them. I for one like to kick them into the git-repo with my testcases.

Your DBA will know where the files go by default, but you might want to create a separate directory for them.

My preferred way of working would be to "map" the dpdump-directory or the volume where expdp and impdp have their files. Check if that is possible (and allowed) in your situation. Alternatively, you can resort to some copy- or transfer-scheme to get/put your files in a place where you want to keep them.

For assistance with those "files on the server side", you may need to contact your DBA or Sysadmin. 


Lesson: Use SQL-Scripts or Datapump (and say Hi to your DBA)

The VECTOR datatype may be challenging when you need to backup/restore testsets quickly and flexibly. My preferred way is to have "sql-scripts", with insert-statements. When such scripts cannot do the job, because your vectors are too large or for any other reason, I would recommend using expdp and impdp.

In using those tools you may need some help from your local, friendly DBA. Always a good idea to be on Friendly Terms with these ppl.


Monday, 11 November 2024

Oracle 23ai: a Vector is not a CLOB, Versions are Important.

TL;DR: When experimenting with new (brand)features in Oracle 23ai, in my case VECTOR-datatype, I ran into some small anomalies. I'll present some tips, tricks, and workarounds in case someone needs them.

In short: Always Use Latest versions of client-tools like SQL*Plus and SQLcl if possible.

Note: As I write + experiment, I find more+more, this can turn into a small series...


Background: 

Discovering the VECTOR type and the promise of "AI-applications", I did the RTFM (link). And also dug up some (python) code to generate and experiment with vectors from commonly-used AI models. More on that in later blogs, when I can get it to Do Exactly What I Want...

My confusion started when I toyed with scripts to create tables with vector-type columns, and put data in them. I wanted to try-out operations, queries, indexes, and whatever else there is to discover on this new datatype (hint: There is a Lot! The future is going to be Interesting).

Further background: My Test-Platform is often a trusty old MB-Pro, which is a "vintage" Intel-based model from 2013. On that laptop, the client-stack is version 19.8 and the included binary sqlplus executable seems Identical (size and checksum) to the one in the latest download, which is labelled 19.16 (as we write: Nov 2024...).

For "server", I sometimes use VBoxes but mostly I use containers generated from the images provided by Oracle. For this test I was using : 

    podman pull container-registry.oracle.com/database/free:latest

This one contained Oracle database Version 23.5.0.24.07, the Free version because GA is still pushed into the future (I'll let Tim comment on that, I Agree).

And when I experiment, I want everything to be "scripted" and repeatable. Ideally, the scripts are also self-explanatory. That way, in future, I can pick up the script, and run it to see, to repeat/review, or to demo what I was trying to achieve.


First Anomaly: VECTOR in SQL*Plus shows up as ... CLOB ?

The script I had in mind to generate my tables with VECTOR columns worked fine (... Table Created, no errors). But when I did a DESC on the table it showed... CLOB ? 

Check this small demo-table with some vectors and a json column:
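The demo-table is roughly (names and dimensions are indicative):

    create table tst_vec
    ( id    number
    , vct1  vector (    3, FLOAT32 )
    , vct2  vector (  512, FLOAT32 )
    , vct3  vector ( 2048, FLOAT64 )
    , jsn   json
    );

    desc tst_vec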

Notice how the table is created, but the VECTOR columns are reported as CLOB.

Also, as an aside, the JSON type is reported as UNDEFINED ?

The DESC is relevant here, because the AI models are quite Specific in the Length and Type of vectors they generate and use. The Length has to match Precisely with the vectors from the model, otherwise they won't insert (link). And the type of the vectors (FLOAT32 or FLOAT64) has to match that of the model. Furthermore, vectors of non-matching length+type cannot be used in vector-to-vector arithmetic operations. Hence my frequent use of DESC to verify my exact vector definitions.

Note: I can elaborate on type-conversions and the tricky-ness of using python-models with numpy and other libraries as described here (link), but that is not the topic of this post.

Just to do the modern-thing, I also checked this using SQLcl, only to find that the DESC of the table still didn't report all the details. Hmmmm... But when I downloaded (Aha!) the Latest version of SQLcl, "Release 24.3.1.0 Production Build: 24.3.1.311.1631", downloaded on 11Nov2024, finally the DESC came out Correct:

This was the first hint that You Always Need Latest Versions. But what if your corporate-lockdown-PC doesn't allow you to...? I wanted to look a little further. Not least because I am kind of addicted to running all demos as scripts in SQL*Plus from my terminal, or Any 80x25 Terminal for that matter. Simple is Always Best...


Reason: VECTORS (and JSON) are stored similarly to LOBs.

Using SQL*Plus, checking the view USER_TAB_COLUMNS: this showed me the definition, type and dimensions reported by the catalog were "correct". 

Furthermore, the USER_INDEXES told me there were 4 indexes of type LOB on the table. This leads me to suppose that the data in the VECTOR columns (and the JSON column) is stored very similar to LOB-data (link to lob-storage-docu).
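Both checks are plain dictionary-queries, roughly like this (table-name from the demo above):

    select column_name, data_type
    from   user_tab_columns
    where  table_name = 'TST_VEC'
    order  by column_id;

    select index_name, index_type
    from   user_indexes
    where  table_name = 'TST_VEC';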

There is your first "workaround": query the data-dictionary from the catalog.



Workarounds: Use Latest Client-tools, or ... use the dictionary.

At this point, I went "into the server" by attaching to the container where my database runs. 

    podman exec -it ora23 bash

I correctly assumed that a container-image that provides a certain version, also holds the SQL*Plus and other client tools for Exactly That version.

And Yes, running the CREATE and DESCribe in SQL*Plus from the container (from "the server" if you like), the columns show up with the correct datatype and dimensions, exactly as per the create-statement:


From the server-side: Notice the versions of sqlplus and of the database. And the correct DESC results after the table-create.

Main cause of confusion is the fact that the client-tools for my antique "Vintage" intel-macbook are stuck on v19. I just have to work around that.

All Good.


Lesson: Always aim for the Latest Versions, even your client-tools.

Use the Latest Versions of client-stack. Always. And Verify. Don't hesitate to be suspicious and Re-Verify. In my rather suspicious mind, I recommend running your latest scripts with the SQL*Plus version that comes with the database-release. But you may need to upgrade on your PC or obtain server-access for that. Safest is to use the client-tools that come with the version itself, e.g. from the same install(er) or the same container. That is probably the executable that has the latest enhancements and improvements built inside.

And, if you can, getting the latest version of SQLcl is always a Good Practice.