Tuesday, 12 November 2024

Oracle 23ai: Backup of Vector Data.

TL;DR: If you have datatype VECTOR in your tables, you can no longer use exp/imp for quick-backup of (small-ish test)sets. Switch to datapump, or use other ways to save your testsets.


Background:

Experimenting with the new VECTOR datatype in Oracle 23ai, I did a lot of testing and tweaking. And sometimes I want to save a (test)set: tables with data in them, often with related pieces of PL/SQL that act on the data. The old-fashioned exp tool was ideal for that: it would back up a schema-owner with all objects and data in just one file, and restoring is dead-simple (if you are a legacy-dba like me). 

Some of my testing still uses exp. And that became a problem when I got columns of type VECTOR.


Oops: exp throws errors.

My trusted go-to solution for quickly saving datasets is good-old exp. It did not work in this case:
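For reference, this is the kind of invocation I mean: a classic full-schema export into one dmp-file. A sketch only; the schema name SCOTT, the password and the PDB1 service name are placeholders for your own values:

```shell
# Legacy exp: dump one schema-owner, all objects and data, into a single file.
# SCOTT/secret and the pdb1 service name are hypothetical.
exp scott/secret@pdb1 owner=scott file=scott.dmp log=scott_exp.log

# On a table with a VECTOR column, exp raises an error for that table
# and skips it; the rest of the schema is still exported.
```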


That table did not get exported...

Note that I am using the latest version of both the database and the exp-tool. I'm running this from the server, but I got the same results at my client-side with a slightly older exp version.

This error is more or less expected, as the exp/imp toolset is no longer supported "for general use" (link). The tool is not aware of the new VECTOR type, and apparently it refuses to treat it as "just a BLOB". I can ask Oracle if they would consider upgrading exp, but I suspect the answer is "No". 

NB: One of the reasons exp was deprecated is the security-risk: anyone with SQL*Net access could use it to Very Easily extract data from a database. And we all know "security is king".

I needed to look for Alternatives...


SQLdeveloper: Always Helpful.

My first thought was: Scripts. I like scripts, so I would want a file with INSERT statements. SQLDeveloper is Good for That: it is free to use, and you can find the latest version here (link). It can generate insert-scripts to recreate and populate a table. And that worked for some of my vector-tables, but not for all. 

Of course, the "insert of a vector" can be bulky: some of mine are 512 or even 2048 floating point values of type FLOAT64, and that takes a lot of text when written out as an insert-statement.

So, for some of my vector-columns I would Not get the actual "data" in the script, but rather an insert of an EMPTY_CLOB():
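For illustration, the generated scripts looked roughly like this. Table and column names are made up here, and the vector literal is shortened:

```sql
-- A small vector survives as a quoted literal that SQL*Plus can re-insert:
INSERT INTO my_vectors (id, embedding)
VALUES (1, '[0.0123,-0.4567,0.89, ... ]');  -- full literal elided

-- But for the large (2048-dimension) vectors, the script only contained:
INSERT INTO my_vectors (id, embedding)
VALUES (2, EMPTY_CLOB());
```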

My guess is those vectors (these are size 2048) were too large for a 32K varchar2, and the tool just decided it couldn't handle it. I'm also guessing there is a limit to the size of any individual quoted vector that SQL*Plus can re-insert. After all, you can not spool BLOB columns for insertion via SQL either.

So this was only a solution for Some cases, but not for All of my testsets. 

But if this scripting works in your case: You Are already Good. SQLDeveloper can do the job for you.


Re-generate the embeddings from source:

For small datasets, I can re-generate the vectors from the original image or text. It may take 2min, 10min, or even 60min, depending on the model and the dataset. But for small and moderate test-sets, I can always re-run the embedding. Just be patient. Major downside: I now need to save my source-data alongside my test-sets. In my case, a lot of images and textfiles. This increases the size of my "tests" by a few orders of magnitude, and it makes it much more tedious to (wait for a) restore of testsets when I mess them up or want to re-run tests frequently.
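As a sketch of what such a re-generation can look like in 23ai SQL, assuming an ONNX embedding model has already been loaded into the database (via DBMS_VECTOR.LOAD_ONNX_MODEL) under the name my_model; the table and column names are hypothetical:

```sql
-- Re-compute the embedding for every row from the stored source text.
-- VECTOR_EMBEDDING is the 23ai SQL function; my_model must be an
-- embedding model previously loaded into this database.
UPDATE my_docs d
   SET d.embedding = VECTOR_EMBEDDING(my_model USING d.doc_text AS data);
COMMIT;
```

For images, the embedding typically happens outside the database (in my case python code), which makes the re-run even more of a wait.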

Generally Possible, but not Practical.


Use Datapump: This worked in All Cases - Problem Solved.

The replacement of exp and imp is of course the DataPump variety: expdp and impdp. And they are included in the Client-Stack if you download that, as every DEV and DBA should, from here (link).

Here is my expdp, started from my local laptop:
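A sketch of the command, with placeholder names for the schema, password, service and dumpfile:

```shell
# Data Pump export of one schema, run from the client machine;
# the dmp-file is written server-side, into the directory object.
expdp scott/secret@pdb1 schemas=scott \
      directory=DATA_PUMP_DIR dumpfile=scott.dmp logfile=scott_exp.log

# And the matching restore:
impdp scott/secret@pdb1 schemas=scott \
      directory=DATA_PUMP_DIR dumpfile=scott.dmp logfile=scott_imp.log
```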

No issues.

The biggest Caveat of datapump is that the files will be "server side". In the example, the dmp-file is located in the dpdump directory of the pdb where my user-data is located.

Your dmp-file ends up in a Directory, usually the dpdump directory, on the Server. You can run datapump itself from your client-machine (I do that a lot), but to access the dmp-files beyond just expdp/impdp you need access to the location on the server where datapump stores them. I, for one, like to check them into the git-repo with my testcases. 

Your DBA will know where the files go by default, but you might want to create a separate directory for them.
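Creating such a directory is a one-off, privileged action. A sketch, with a made-up path, directory name and user:

```sql
-- Run as a privileged user: create a directory object that points
-- at a server-side path, then let the test-user write dump files there.
CREATE DIRECTORY vector_dumps AS '/u01/app/oracle/dumpfiles';
GRANT READ, WRITE ON DIRECTORY vector_dumps TO scott;
```

With that in place, expdp and impdp can take directory=VECTOR_DUMPS instead of the default.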

My preferred way of working would be to "map" the dpdump-directory or the volume where expdp and impdp have their files. Check if that is possible (and allowed) in your situation. Alternatively, you can resort to some copy- or transfer-scheme to get/put your files in a place where you want to keep them.
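If mapping is not possible, a simple pull over ssh may already do. A sketch, assuming you have shell access to the database server; hostname and paths are placeholders:

```shell
# Pull the dump file from the server-side dpdump directory
# into the local git-repo with the testcases.
scp oracle@dbserver:/u01/app/oracle/admin/pdb1/dpdump/scott.dmp \
    ./testsets/scott.dmp
```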

For assistance with those "files on the server side", you may need to contact your DBA or Sysadmin. 


Lesson: Use SQL-Scripts or Datapump (and say Hi to your DBA)

The VECTOR datatype may be challenging when you need to backup/restore testsets quickly and flexibly. My preferred way is to have "sql-scripts" with insert-statements. When such scripts can not do the job, because your vectors are too large or for any other reason, I would recommend using expdp and impdp.

In using those tools you may need some help from your local, friendly DBA. Always a good idea to be on Friendly Terms with these people.

