Friday, 15 November 2024

Look Inside my Vector, Victor. 23ai vector inspection.

 TL;DR: I want to see the element-values of my vectors. For larger vectors, this poses a problem, as SQL*Plus does not easily display the values in the far-end of a sizeable VECTOR. I'll put some SQL and Python code at end of blog.


Background: Inspecting VECTORs

Since Oracle 23ai can store and process VECTOR columns, I am experimenting with some of that. Notably "models" that turn text and images into vectors so they can be compared (searched) using cosine-similarity and other hip methods. The searching sometimes yields remarkable results (surprisingly Clever stuff!), and sometimes just ...  Nothing (???). This vector-search to detect similarity, the heart of the Artificial Intelligence in this gizmo, is notably dependent on the models that generate the VECTORs and on the vector-functions in the database.

Those very clever models are to me "just python code" that I call to generate vector-arrays. However, the type of the elements inside the arrays ( or lists / matrices / tensors) depend on their origin-packages: Float, Binary, Integer, even String-values are possible as vector-elements.

From the conversions between one datatype and another, and from the plumbing of data between python and oracle, I have sometimes ended up with errors or just "wrong data" in my vector-columns a few times.

And, understandably, if you try to compare Pears and Oranges, you end up with un-reliable results, or even straight up errors.

Hence the need to sometimes verify, to "inspect" my VECTORs in detail.


The Problem: Trust No 1.

Some of the models I used generate normal python lists, other use numpy (link) and generate numpy-arrays. Then there is Oracle that wants to know if a VECTOR will contain FLOAT32, or FLOAT64, or BINARY values. And if the values from python do not match the values defined for the column, I seem to end up with either errors, or with scrambled (wrong!) values in the vector-column. 

Hence the Need for input-handlers and output-handlers. You have to know this when processing data between python and oracle (link to docu)

Combine that complexity with the fact that sometimes my cosine-values were suspicious (e.g. probably wrong). 

And all this lead me to want to "inspect" my vector-values using both python and SQL*Plus or SQLcl, just to see if the "generated values" had also ended up correctly inside the VECTOR columns in my database. 


Attempt nr 1: SQL*Plus to view a vector.

For small vectors, there is no problem. Here is my super-simple demo-set (also used to calculate known cos-distance and euclidean-dist, Highschool-Math, but different story): 

As long as I can sensibly fit the values of the vector-elements on a screen, I can display them. With some effort, I can even read the floating-point notation of more complicated vectors:

Now, at some point, the values will become too many and too large to display. Others have suggested to increase buffers, linesize, pagesize etc... But There is a Limit to what you can sensibly Display and Decipher.

If this works for you: You are Good. No need to read further. But in my case: Not Suitable for vectors of 512, 784, or 2048 elements... 


Attempt nr 2: regex to un-pivot a vector.

when looking at the values from SQL*Plus, SQLcl and SQLDeveloper, the data is presented as CSV. And I happen to know a trick to un-pivot CSV data:

That works for the small vectors (I should also remove the '[' but I'm Lazy...).

It also works with larger vectors, with more elements. But with a vector of 512 elements:

The fact that there is an implicit use of VARCHAR with maxsize (32K?) will limit my regex options.

So, at this point, if You are able to display Your Vectors: You are Good. 

But in my case, I had to keep looking...


My Solution: a Python program to "print out" the values.

Since most (all) of my models are currently in python, I looked at something python-y to read+display my vectors. The Requirements that came to mind were:

1) Query any vector, in an ad-hoc way, as I would normally do with SQL.

2) Display All Elements in a readable way to a terminal or write to file, whereby a tee or redirect would also work. 

3) Some easy way to check and compare VECTOR contents (a hashtotal came to mind as a first, crude, way to check a vector).

After some experimenting, this resulted in the program called victor.py. It does approximately the follwing:

a) Connect to the database (I'll work on command-line or connection-name later) and report about the connect: username, global_name, so that the user can confirm "where the program is".

b) Determine which SQL to use. The Query must have a VECTOR (or list) as first column in the return-cursor. It will not complain if there are more columns, but they will be ignored (for the moment).

b1) IF there is an arg[1], THEN assume that is the SQL to use to get the vector(s). Currently, there is a need to "quote" the SQL statement to make it just arg[1].

b2) ELSE ask for an SQL statement. 

c) Open a cursor with the statement, and start returning values. If the SQL contains a syntax error, that will result in a nasty error-stack, but the actual error will be at the bottom of the stack, Readable. 

c1) IF the first column is not a list (vector), THEN report a warning.

c2) Else start processing the VECTOR, report its length, the type of elements (float or...), and print the elements 1-by-1. At the end of the element-list: print some summary (vectorlength, sumtotal, rownum in cursor..).

c3) Prompt for Next-record or Control-C to stop

d) Report some closing message: nr of records, and the SQL-used in case the user forgot... Then Exit the program, which closes cursors and disconnects the session (maybe check this!).

I'll try to include the py-source and a demo-sql at the bottom of this blog.

So far, the resulting code, victor.py, worked for me. Let me show some runs. Or as they say at #POUG: 


This is Blog, Let's Demo:

Calling the program with no arguments: It will prompt for a query. And we'll give it a very simple query: select * from dual ;

Notice the gentle warning. The program needs a Query that returns a VECTOR as the first column of the query. 

So, let's give it a vector. We'll use the new-and-improved feature of a select without a from, and give it just a vector, a vector of Two Elements, two numeric elements with values 1 and 2:

Result! 

The program detects a list (vector) and reports "success": Vector Found. It then reports the python-type of the objects (class 'float'), and the detected dimension, the length of the vector. This vector seems to have Two Elements.

Next it prints out the elements each on a separate line. This list can be quite long as typical vectors are 100s of elements or more, hence this will typically scroll off screen.

And it closes with a "summary": it found a vector of 2 elements, of which the sum-total was 3, and this was the 1st record of the resultset. 

I'll let you test for yourself what happens with : 

select to_vector ( '[1,2]' ) 

union all 

select to_vector ( '[3,4,5]' 

And then re-try that but leave out the "all" keyword, hehehe.


Now for Real Serious Data:

Next it was time for some real Tables. Start small, using the example from the start of the blog. The data in table DV1 consists of 3 records, inserted as: ...

For the query, we answer the prompt with 

select vct, id from dv1 ; 

and the result is:

And Voila: The first of the vectors pops out from a table, I can "read" every element of it, and it waits for me to press "enter" before fetching the next record. 

Also note that the ID field in the select is ignored: it is returned from the cursor, but not used in any of the code. The advantage is this allows you to copy/paste queries with some leftover-columns in case you are in a hurry and dont want to do too much editing. In later versions, I might try to use those additional columns for more information.

If we hit "enter" three times, all three records will be fetched, displayed, and when the cursor is finished, Victor.py will tell you and present a summary:

Three records processed, and just to remind you it re-displays the SQL you made it process to get those results. 

At this point, notice that the use of a semicolon to terminate the query is not needed (python doesn't want it), but if you happen to type or copy a ";" at the end the program will ignore it (I got too many errors from leftover semicolons while testing with copy-paste of queries..). In future, I may want to use multi-line input and then the ";" becomes an important End-Signal. Later.

But Now I want to see a Real Vector: the table IMG_VECTOR contains a VECTOR column with data from the CLIP-model, vectors of dimension 512. The table typically holds 100K records so I may not want to see all of the result-set. At this point I really needed a "Control-C" whenever I forgot to limit the results via a Where-Clause. 

Let me try and inspect the first few records of this table, and I will use the "command line" to pass the SQL. 

Notice the double-quotes around the SQL, something I might remove in future releases. but for the moment I need it to treat the SQL as a single (argv[1]) item.

When the cmd-line contains the SQL, Victor doesn't need to prompt for an SQL, and can start straight away to fetch and display the first vector. The 512 lines will scrolllll all the way, but I'll show the top:

And at the bottom, I can either hit "Enter" after each record, or Control-C to stop fetching. Ive hit Control-C after the 2nd record:

Stopping the fetch means Victor will once more show the SQL, will report how many rows were processed, and will then gracefully Exit. 

Job done.


Closing Remarks: It worked for me.

In your case,  you may be happy with just SQLcl, SQLDeveloper, or SQL*Plus, or you may already have your own utility. 

I do expect Oracle to come up with some "vector inspection" option in its Tools (SQLDeveloper, Apex, SQLcl, PL/SQL-arrays...). There may already be a few when you read this.

I'll try to include or link the python code below, and may even put it out on github when it is ready for that. 

The code contains a number of comments from ideas that I think _could_ be done to improve it, but for the moment it works for me. I'm happy at this point, not sure if I need more.

Feel free to send me suggestions with improvements. But chances are you can improve it yourself, or some other tool(s) out there is already doing a better job. 

----------------------- notes and source code below ------------

Note1: "Trust No 1" is a famous quote from Agent Mulder, which he apparently also used as his computer-password. Those Were The Days, before Tokens and 2FA and Biometrics.

Note2: "What is our Vector Victor" is a quote from the 1980 movie Airplane (link). You will hear a lot of greybeards using that phrase when talking VECTOR. I could not resist...

Below: the SQL to set up the demo, and the py-code from my python-thingy.. Feel free to plagiarise.

------------------------------- victor.sql ---------------------------

Link to the file...

------------------------------- victor.py ---------------------------

Link to the file...




No comments: