Tuesday, 12 November 2024

Oracle 23ai: Backup of Vector Data.

TL;DR: If you have datatype VECTOR in your tables, you can no longer use exp/imp for quick backups of (small-ish test)sets. Switch to datapump, or use other ways to save your testsets.


Background:

Experimenting with the new VECTOR datatype in Oracle 23ai, I did a lot of testing and tweaking. And sometimes I want to save a (test)set: tables with data in them, often with related pieces of PL/SQL that act on the data. The old-fashioned exp tool was ideal for that: it backs up a schema-owner with all objects and data in just one file, and restoring is dead-simple (if you are a legacy-dba like me).
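For reference, such a quick schema-level save and restore with the classic tools can be sketched like this (the credentials, service name, and file names are placeholders of my own, not from an actual system):

```shell
# Classic exp: dump one schema-owner with all objects and data into a
# single file (user/password/service are hypothetical placeholders).
exp scott/tiger@freepdb1 owner=scott file=scott_testset.dmp log=scott_exp.log

# And the dead-simple restore with classic imp:
imp scott/tiger@freepdb1 fromuser=scott touser=scott file=scott_testset.dmp
```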

Some of my testing still uses exp. And that became a problem when I got columns of type VECTOR.


Oops: exp throws errors.

My trusted go-to solution for quickly saving datasets is good-old exp. It did not work in this case:


That table did not get exported...

Notice that I am using the latest version of both the database and the exp-tool. I'm running this from the server, but I got the same results at my client-side with a slightly older exp version.

This error is more or less expected, as the exp/imp toolset is no longer supported "for general use" (link). The tool is not aware of the new VECTOR type, and apparently it refuses to treat it as "just a BLOB". I can ask Oracle if they would consider upgrading exp, but I suspect the answer is "No".

NB: Notably, one of the reasons exp was deprecated is the security-risk: anyone with SQL*Net access could use it to Very Easily extract data from a database. And we all know "security is king".

I needed to look for Alternatives...


SQLdeveloper: Always Helpful.

My first thought was: Scripts. I like scripts, so I would want a file with INSERT statements. SQLDeveloper is Good for That: it is free to use, and you can find the latest version here (link). It can generate insert-scripts to recreate and populate a table. And that worked for some of my vector-tables, but not for all.

Of course, the "insert of a vector" can be bulky: some of mine hold 512 or even 2048 floating-point values of type FLOAT64, and that takes a lot of text when written out as an insert-statement.

So, for some of my vector-columns I would Not get the actual "data" in but rather an insert of an EMPTY_CLOB():

My guess is those vectors (these are size 2048) were too large for a 32K varchar2, and the tool just decided it couldn't handle them. I'm also guessing there is a limit to the size of any individual quoted vector that SQL*Plus can re-insert. After all, you can not spool BLOB columns for insertion via SQL either.
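A quick back-of-envelope check supports that guess. Assuming each FLOAT64 value is written out as roughly 16 characters of text (sign, digits, exponent, separator — my own ballpark figure), a 2048-dimension vector already overshoots the 32K varchar2 limit:

```shell
# Rough size estimate for a 2048-dimension vector written as a text literal.
dims=2048
chars_per_value=16           # ballpark: e.g. "-1.2345678E+002," incl. separator
total=$((dims * chars_per_value))
echo "approx literal size: ${total} characters"
echo "varchar2 hard limit: 32767 characters"
if [ "${total}" -gt 32767 ]; then
  echo "too large for a 32K varchar2"
fi
```

A 512-dimension vector of the same type stays comfortably under that limit, which matches the observation that the scripting worked for some tables but not for all.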

So this was only a solution for Some cases, but not for All of my testsets. 

But if this scripting works in your case: You Are already Good. SQLDeveloper can do the job for you.


Re-generate the embeddings from source:

For small datasets, I can re-generate the vectors from the original image or text. It may take 2min, 10min, or even 60min, depending on the model and the dataset. But for small and moderate test-sets, I can always re-run the embedding. Just be patient. Major downside: I now need to save the source-data of my test-sets. In my case, a lot of images and textfiles. This increases the size of my "tests" by a few orders of magnitude, and it makes it much more tedious to restore (and wait for) testsets when I mess them up or want to re-run tests frequently.

Generally Possible, but not Practical.


Use Datapump: This worked in All Cases - Problem Solved.

The replacement of exp and imp is of course the DataPump variety: expdp and impdp. And they are included in the Client-Stack if you download that, as every DEV and DBA should from here (link).

Here is my expdp, started from my local laptop:
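A comparable invocation, with placeholder schema, connect string, and file names of my own, could look like this; a parameter file keeps the call repeatable:

```shell
# Hypothetical Data Pump export, run from the client machine.
cat > scott_expdp.par <<'EOF'
schemas=scott
directory=DATA_PUMP_DIR
dumpfile=scott_testset.dmp
logfile=scott_expdp.log
EOF

expdp scott/tiger@//localhost:1521/freepdb1 parfile=scott_expdp.par
```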

No issues.

The biggest Caveat of datapump is that the files will be "server side". In the example, the file is located in the dpdump directory of the pdb where my user-data is located.

Your dmp-file ends up in a Directory, usually the dpdump directory, on the Server. You can run datapump itself from your client-machine (I do that a lot), but to access the dmp-files beyond just expdp/impdp you need access to the location on the server where datapump stores them. I for one like to kick them into the git-repo with my testcases.

Your DBA will know where the files go by default, but you might want to create a separate directory for them.
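Creating such a dedicated directory is a one-time DBA task; a minimal sketch, with a path and names I made up, could be:

```shell
# Sketch: register a directory object for dump files and let the test
# schema use it (run as a privileged user; the path must exist on the server).
sqlplus -s system/oracle@freepdb1 <<'EOF'
create directory exp_dir as '/opt/oracle/dumpfiles';
grant read, write on directory exp_dir to scott;
EOF
```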

My preferred way of working would be to "map" the dpdump-directory or the volume where expdp and impdp have their files. Check if that is possible (and allowed) in your situation. Alternatively, you can resort to some copy- or transfer-scheme to get/put your files in a place where you want to keep them.
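With a container-based setup like mine, one such transfer-scheme is simply podman cp (the container name and paths below are assumptions for a 23ai Free container, not verified defaults):

```shell
# Copy the dump file out of the container into the local git-repo
# with the testcases (paths are hypothetical).
podman cp ora23:/opt/oracle/admin/FREE/dpdump/scott_testset.dmp \
    ./testcases/scott_testset.dmp
```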

For assistance with those "files on the server side", you may need to contact your DBA or Sysadmin. 


Lesson: Use SQL-Scripts or Datapump (and say Hi to your DBA)

The VECTOR datatype may be challenging when you need to backup/restore testsets quickly and flexibly. My preferred way is to have "sql-scripts", with insert-statements. When such scripts can not do the job, because your vectors are too large, or for any other reason, I would recommend using expdp and impdp.

When using those tools you may need some help from your local, friendly DBA. Always a good idea to be on Friendly Terms with these people.


Monday, 11 November 2024

Oracle 23ai: a Vector is not a CLOB, Versions are Important.

TL;DR: When experimenting with brand-new features in Oracle 23ai, in my case the VECTOR-datatype, I ran into some small anomalies. I'll present some tips, tricks, and workarounds in case someone needs them.

In short: Always Use Latest versions of client-tools like SQL*Plus and SQLcl if possible.

Note: As I write and experiment, I keep finding more and more; this can turn into a small series...


Background: 

Discovering the VECTOR type and the promise of "AI-applications", I did the RTFM (link). And I also dug up some (python) code to generate and experiment with vectors from commonly-used AI models. More on that in later blogs, when I can get it to Do Exactly What I Want...

My confusion started when I toyed with scripts to create tables with vector-type columns, and put data in them. I wanted to try-out operations, queries, indexes, and whatever else there is to discover on this new datatype (hint: There is a Lot! The future is going to be Interesting).

Further background: My Test-Platform is often a trusty old MB-Pro, which is a "vintage" Intel-based model from 2013. On that laptop, the client-stack is version 19.8 and the included binary sqlplus executable seems Identical (size and checksum) to the one in the latest download, which is labelled 19.16 (as we write: Nov 2024...).

For "server", I sometimes use VBoxes, but mostly I use containers generated from the images provided by Oracle. For this test I was using:

    podman pull container-registry.oracle.com/database/free:latest

This one contained Oracle database Version 23.5.0.24.07, the Free version, because GA is still pushed into the future (I'll let Tim comment on that; I Agree).
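For completeness, starting a disposable container from that image can be sketched as follows (the container name, port mapping, and password are my own choices, not from the original setup):

```shell
# Run a throwaway 23ai Free database container; ORACLE_PWD sets the
# SYS/SYSTEM password on first startup.
podman run -d --name ora23 -p 1521:1521 \
    -e ORACLE_PWD=MyPassword1 \
    container-registry.oracle.com/database/free:latest
```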

And when I experiment, I want everything to be "scripted" and repeatable. Ideally, the scripts are also self-explanatory. That way, in future, I can pick up the script, and run it to see, to repeat/review, or to demo what I was trying to achieve.


First Anomaly: VECTOR in SQL*Plus shows up as ... CLOB ?

The script I had in mind to generate my tables with VECTOR columns worked fine (... Table Created, no errors). But when I did a DESC on the table it showed... CLOB ? 

Check this small demo-table with some vectors and a json column:

Notice how the table is created, but the VECTOR columns are reported as CLOB.

Also, as an aside, the JSON type is reported as UNDEFINED ?
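A table along the same lines (my own names and dimensions, not the exact ones from the screenshot) can be created like this:

```shell
# Sketch: create a demo table with two vector columns and a json column,
# then DESCribe it to see how the client reports the types.
sqlplus -s scott/tiger@freepdb1 <<'EOF'
create table t_vec
( id       number generated always as identity
, v_small  vector(512,  float32)
, v_large  vector(2048, float64)
, payload  json
);
desc t_vec
EOF
```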

The DESC is relevant here, because the AI models are quite Specific in the Length and Type of vectors they generate and use. The Length has to match Precisely with the vectors from the model, otherwise they won't insert (link). And the type of the vectors (FLOAT32 or FLOAT64) has to match that of the model. Furthermore, vectors of non-matching length+type can not be used in vector-to-vector arithmetic operations. Hence my frequent use of DESC to verify my exact vector definitions.

Note: I can elaborate on type-conversions and the tricky-ness of using python-models with numpy and other libraries as described here (link), but that is not the topic of this post.

Just to do the modern-thing, I also checked this using SQLcl, only to find that the DESC of the table still didn't report all the details. Hmmmm... But when I downloaded (Aha!) the Latest version of SQLcl, "Release 24.3.1.0 Production Build: 24.3.1.311.1631", downloaded on 11Nov2024, the DESC finally came out Correct:

This was the first hint that You Always Need Latest Versions. But what if your corporate-lockdown-PC doesn't allow you to...? I wanted to look a little further. Not in the least because I am kind of addicted to running all demos as scripts in SQL*Plus from my terminal, or Any 80x25 Terminal for that matter. Simple is Always Best...


Reason: VECTORS (and JSON) are stored similarly to LOBs.

Using SQL*Plus to check the view USER_TAB_COLUMNS showed me that the definition, type, and dimensions reported by the catalog were "correct".

Furthermore, USER_INDEXES told me there were 4 indexes of type LOB on the table. This leads me to suppose that the data in the VECTOR columns (and the JSON column) is stored very similarly to LOB-data (link to lob-storage-docu).

There is your first "workaround": query the data-dictionary from the catalog.
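That dictionary query can be sketched as follows (the table name is my own example); on my reading, DATA_TYPE in the catalog carries the full vector definition even when an old client's DESC does not:

```shell
# Sketch: query the catalog instead of trusting an old client's DESC.
sqlplus -s scott/tiger@freepdb1 <<'EOF'
select column_name, data_type
from   user_tab_columns
where  table_name = 'T_VEC'
order  by column_id;
EOF
```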



Workarounds: Use Latest Client-tools, or ... use the dictionary.

At this point, I went "into the server" by attaching to the container where my database runs. 

    podman exec -it ora23 bash

I correctly assumed that a container-image that provides a certain version, also holds the SQL*Plus and other client tools for Exactly That version.

And Yes, running the CREATE and DESCribe in SQL*Plus from the container (from "the server" if you like), the columns show up with the correct datatype and dimensions, exactly as per the create-statement:


From the server-side: Notice the versions of sqlplus and of the database. And the correct DESC results after the table-create.

Main cause of confusion is the fact that the client-tools for my antique "Vintage" intel-macbook are stuck on v19. I just have to work around that.

All Good.


Lesson: Always aim for the Latest Versions, even your client-tools.

Use the Latest Versions of the client-stack. Always. And Verify. Don't hesitate to be suspicious and Re-Verify. In my rather suspicious mind, I recommend running your latest scripts with the SQL*Plus version that comes with the database-release. But you may need to upgrade on your PC or obtain server-access for that. Safest is to use the client-tools that come with the version itself, e.g. from the same install(er) or the same container. That is probably the executable that has the latest enhancements and improvements built in.

And, if you can, always getting the latest version of SQLcl is a Good Practice.