Friday, 20 December 2024

Oracle 23ai, Domains, Enums and Where-Clauses.

TL;DR: In Oracle 23ai, the DOMAIN type is a nice addition to implement more sophisticated Constraints. Someone suggested using a DOMAIN of type ENUM as a column-constraint, acting as an LoV. Works like a charm. But the behaviour of some where-clauses may not be what you think. Let me show you some very simple examples, and some workarounds, both bad and Good.

A prism, just because I used colors as an example.


Background

After seeing a presentation on the use of ENUM-Domains as List of Values on a column, I did some scripting to see how it could work (previous blogs here + here). Now I tried to use it in Where-Conditions. 

So I have a few domains, one of which is color:
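It was created along these lines (a minimal sketch; I am recalling the rainbow-colors from my demo-script, your member-names may differ):

  create domain color_enum as enum (
    red, orange, yellow, green, blue, indigo, violet
  );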


And I created a table that uses the domain in a column-definition, like this:
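Something like this sketch (table- and index-names are just what I use for illustration):

  create table t_colors (
    id     number      primary key,
    color  color_enum        -- the column-type is the domain, which brings the LoV-check
  );

  create index t_colors_color_ix on t_colors ( color );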

The column is defined as type "color_enum" and thus inherits the "LoV" defined in the Domain. As I consider the color-column to be akin to a Foreign-Key, I have already created an index on it. 

When I ask for the INFO of the table, using SQLcl, this is what I get:


The column COLOR apparently got defined as number(38,0), e.g. an Integer, because it can not hold decimals. But the same column also has the DOMAIN specified as COLOR_ENUM. The info-command also shows I have indexes on both the ID and on the Color-column.

Note: I'm using SQLcl (link) here. In SQL*Plus you would need to do a DESC of the table. Latest versions of SQL*Plus will show the number + domain, but older versions will just show the column as number(38,0).


Do some Inserts

We need some data in the table so let's...
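A sketch of the inserts (values picked for illustration):

  insert into t_colors ( id, color ) values ( 1, color_enum.red );   -- the enum-name, used as a constant
  insert into t_colors ( id, color ) values ( 2, 3 );                -- the plain enum-value (3 = yellow)
  commit;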

Notice how I can use either the enum.name, which is effectively a set of Constants, or an actual numeric value to insert data. I did not have to visit an LoV table to get (read or join) my values.

Just curious: how does it handle "wrong values": 
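For example (a sketch), trying a color that is not a member of the enum:

  insert into t_colors ( id, color ) values ( 8, color_enum.purple );   -- purple is not in the domain: error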

There is no "purple" defined in the Domain-Enum and an error pops up. But the description of the error is a little vague in my opinion. It seems to read "color_enum.purple  as a reference to a column. I think the actual error here is more of an "invalid value" or "check constraint violated". But I can see why the error is what it is.

And what if we try to insert a number out of range:
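Again a sketch, with an arbitrary number that is not a valid enum-value:

  insert into t_colors ( id, color ) values ( 9, 42 );   -- outside the defined enum-values: check-constraint error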

When trying to insert a number outside the valid range, the Error is more Clear. It refers to a "check constraint" and it mentions the DOMAIN. This should give even a developer enough clues to fix it.

And we have Proven that the Domain acts as both Check-Constraint and LoV. It is very similar to an FK (as tested in a previous blog as well).

Next is doing some Retrieval.


Selecting from the table.

The result of our inserts is a table with the following data:

No surprises here. But notice again that the columns "id" and "color" are numbers, and we need a display-function to obtain the "name" of the color. 

Now suppose we want to search for records by the name of the color. We need to put the condition against the domain_display():
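Something along these lines (a sketch, using my demo-table from above):

  select id, color, domain_display ( color ) as color_name
  from   t_colors
  where  domain_display ( color ) like 'RED%';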

Notice the Table Access Full. But also notice the filter: the condition is on the display-function. It seems to employ a CASE-WHEN statement to pick up the values from the DOMAIN. 

I suspect that using the function to search for a value like 'RED%' is akin to a type-mismatch (e.g. comparing varchars to numbers), which stops us from using the index on the numeric column color. And a Full Table Access can be a bad thing on large tables.

It works much better if we use constants or plain numbers to look for our records:
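For example (a sketch), comparing the column directly to the constant or to the number:

  select id, color
  from   t_colors
  where  color = color_enum.red;     -- or:  where color = 1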

If the type of the comparison matches the indexed, numeric, value, the query Can use the index. And this is the desired outcome. 

The Warning is Clear: Do Not Use that function in a search-criterion. Unless...


Function Based Index?

Because domain_display() is a function, we could consider adding a Function Based Index. Let's try that:
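A sketch of the kind of FBI I tried (index-name just for illustration):

  create index t_colors_display_ix on t_colors ( domain_display ( color ) );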

That works. With the FBI, the where-clause has an index that matches the domain_display, and it uses that index. Maybe this is what you need. But I am generally hesitant to add more indexes. I would prefer to use matching types in the where-clause: constants, or other numeric conditions.


More Exotic: join the DOMAIN:

For a more exotic idea: you can use the ENUM-domain as a record-source in the from-clause, and add a join to the Query. Not that strange, because this is what you would do if you were using an FK-Constraint anyway:

And it works. Apparently I can use the domain as a record-source with no problem. Let's have a look at the explain-plan:

There is a list of "filter" operations on the domain, but  the query does use the intended (numeric) Index. This is how you would to it when the constraint was a classic FK to an LoV table. And this still works as well.


Summary: DOMAIN-ENUM can be used as LoV.

The DOMAIN can act as LoV and FK. They bring both the "check constraint" and provide readable "constants" for use in code, without the need for a separate table to keep the values. In some cases, this will bring you a reduction of IO-Effort and make your queries more efficient (see earlier blogs). And the same limitations apply when searching for the "description of the LoV".  

Advantage or not: You cannot mess with a Domain as easily as with a table. A DOMAIN cannot be edited in the same way as an LoV table (which I have been known to edit in a live system...). 

Furthermore, I showed a workaround with an FBI. But I do Not Recommend putting Function-Based-Indexes on domain-columns indiscriminately. Every extra index should be given careful Consideration.


Oh, and: RTFM

There is a lot more to DOMAINS than just LoV. Domains can/should (IMHO) be used for Sophisticated Constraints on one or more (groups of) columns. Do Go check the Documentation.

Or check this page by Tim for copy-paste usable examples. After all: RTFM = Read Tim's Fantastic (summary of the) Manual (link).

Enjoy.


-- -- -- -- -- -- end of blogpost, appendix below -- -- -- -- -- -- 

Appendix

Script to replay the demos..

demo_enum_5where.sql : just hit enter several times to walk through the demo.

Happy Coding (Tip: just use SQL wherever you can!)


Tuesday, 17 December 2024

Oracle 23ai, Domains, Enums, use-cases.

 TL;DR: The new DOMAIN type in Oracle 23ai can be used as LoV. And in some cases that can save you Effort: you do not have to visit the FK-tables. I'll show some examples, and I'll demonstrate a few pitfalls to avoid.


Background.

It all started when someone suggested using DOMAINs of type ENUM as an alternative for FK-tables with List-of-Value lookup-data. For example, if you want your table to have a field "color", limited to just the colors of the rainbow. You can use a check-constraint (simple!), or a table with color-values and an FK-constraint, or, in Oracle 23ai, you can use a DOMAIN. We will compare ENUM-domains with FK-LoV tables. 


Datamodel:  a fact table with two LoVs

For this little demo I will use an old-fashioned datamodel with two lookup tables and one long fact table. And the queries on that 3-table model will be compared to the same queries against a single table where the LoVs are held in two DOMAINs of type ENUM.

First the traditional model; it looks like this:


(I love a primitive diagram...) 

And the Fact table is defined as follows:
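Roughly like this (a sketch; the exact names and column-sizes in my script may differ):

  create table t_data_2lov (
    id        number         primary key,
    color_id  number         not null references t_color_lov,
    roman_id  number         not null references t_roman_lov,
    payload   varchar2(200)
  );

  create index t_data_2lov_color_ix on t_data_2lov ( color_id );
  create index t_data_2lov_roman_ix on t_data_2lov ( roman_id );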


All very Classic: Data-table with LoVs and Foreign Keys. The FKs are the "constraints" that ensure correct data. I also put indexes on the FK-fields, but you could omit those.

Now if you want, you can replace those FKs with DOMAINS of type ENUM and use that for the column-definitions:

I started by defining two DOMAINs; the color-one looks like:


I created a similar domain with 1400 Roman Numerals (too long to show, but there will be a complete re-play script at the bottom of the blog). 

Now create the table and use those domains:
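Again a sketch, with assumed names:

  create table t_data_2enum (
    id       number         primary key,
    color    color_enum,      -- LoV enforced by the domain, no FK or lookup-table needed
    roman    roman_enum,
    payload  varchar2(200)
  );

  create index t_data_2enum_color_ix on t_data_2enum ( color );
  create index t_data_2enum_roman_ix on t_data_2enum ( roman );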


Note how two of the columns have their type defined by a DOMAIN-ENUM. There is no FK, and you can leave out the LoV tables completely. But the values for the columns are now Constrained by the domain-definition. For good measure, I created indexes on them, as I expect to be "searching" for those at some time. 

We filled both tables with Identical data, and gathered statistics on them. Now we are ready to do some comparisons.


Aggregate Queries

Let's try to aggregate some color-records, and have a Query that _needs_ to visit the table (otherwise, the optimiser could sneakily use just an index):
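The aggregate against the classic model looked something like this (a sketch; column-names assumed, the sum over the payload forces a visit to the table-blocks):

  select c.name as color_name,
         count(*),
         sum ( length ( d.payload ) )
  from   t_data_2lov d
         join t_color_lov c on ( c.id = d.color_id )
  group by c.name
  order by c.name;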


Access is via Index-full-scan and table-access-full. The join is done by hash-join (more efficient than a Nested Loop in this case). Looks like expected behaviour. I happen to know that the LoV table t_color_lov is 8 blocks, and the data-table is 768 blocks in size. Check the statistics of the query:


Just over 700 gets. More or less as expected as there will be some empty blocks in the table. 

Now try this using the table that Does Not have FKs, but uses the DOMAIN. First the explain:
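The domain-version of the same aggregate would look roughly like this (a sketch):

  select domain_display ( d.color ) as color_name,
         count(*),
         sum ( length ( d.payload ) )
  from   t_data_2enum d
  group by domain_display ( d.color )
  order by 1;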


We see 1 Full-Scan of the table. No other tables involved, because the constraints are done by the DOMAIN, not by FKs. 

The query-stats look like:


Maybe somewhat to our surprise, there is little gain here. Both queries are Equally efficient. The 7 records for the color-LoV probably fit into 1 block, hence the "enum" case consumes just 1 block less.

[ soap-box: ]

Now, YMMV, because I heard of ppl who got bitten by "too many LoVs, too many joins". And they claim to fix this with either Check-Constraints or simply removing the FK-constraints. But I would Strongly Suggest to examine your Explain-plans first. The optimiser is well capable of Scanning the small tables just once and using hash- or merge-joins to avoid (nested) Loops. Domains are, IMHO, not a way to solve those bad-queries (but they could, potentially...)

[\steps off soap-box]

I also think Connor McDonald, and others, have previously proven, sufficiently, that: Joins, Good Joins, Well Done Joins, ... Are Not Expensive.

And in the case of relatively Large data-tables, and retrieving "a lot of records", the elimination of Joins to small tables by using DOMAINS will only yield very limited benefit (1 block out of 700...). 


Single-record retrieval queries.

Let's compare what happens when we retrieve just a couple of data-records. The relative overhead of joining to LoV tables might weigh a little heavier on those "precise" cases. We are retrieving Two records and want to see color and roman (e.g. get those from the LoV or from the domain):
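The classic version looked something like this (a sketch; the id-values are just examples):

  select d.id, c.name as color, r.name as roman, d.payload
  from   t_data_2lov d
         join t_color_lov c on ( c.id = d.color_id )
         join t_roman_lov r on ( r.id = d.roman_id )
  where  d.id in ( 42, 4242 )
  order by d.id;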


We notice the unique scan to retrieve the two records from data_2lov, a full-scan on the color_lov (7 records, effectively just 1 block) to pick out the 1 record needed, followed by the use of the unique-index on roman_lov to retrieve the roman-numerals, one for each record. The predicate information confirms this.

Then we check the statistics, they show 13 block-gets:


By many criteria, 13 block-gets is an Efficient statement. 

Note for SQL-geeks: Yes, There is room for minor improvements. By using a "covering index" on the data_table. Or by the use of IOTs. Maybe, with some extra indexing, this query could be brought back by another 3 block-gets (eliminating 1 step for every table) to 10 gets. I consider that out of scope for now, as not a lot of systems seem to go that far in optimising. But for critical systems I would recommend the use of IOTs (Index Organized Tables, docu here) for the LoVs. Should we add a demo of Extreme-Indexing... ? (just for the fun of it...?). Maybe Later (reserve space for link to script).


Now let's retrieve the same info from the data-table using the DOMAINS:
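Again a sketch of what that query looked like:

  select d.id,
         domain_display ( d.color ) as color,
         domain_display ( d.roman ) as roman,
         d.payload
  from   t_data_2enum d
  where  d.id in ( 42, 4242 )
  order by d.id;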

The SQL and the Explain-Plan show only One Table involved in this Query. The color and the roman are retrieved via the domain_display ( column ) function. For good measure, we added the same order-by as in the previous query.

The Statistics now show :


Only 6 gets. Because we only need to use the PK to retrieve each record. Probably two index-blocks (index-root and 1 branch) and one table-block (by index rowid) for each of the two numbers in the in-list. Total of 6 gets.

Hence, even though the original effort per query was low (13 gets is an efficient query by most standards), the savings are Significant: We reduced the gets by half. And if you have more of those LoVs per record, those savings can add up. 

Note: Possibly you get even more benefit, notably if your datamodel is missing some indexes. Or if you have stale/wrong/missing statistics whereby your optimiser is "deteriorating your query-plans". In those cases, the elimination of LoV joins will (appear to) help. But in those cases, you have other (datamodel + optimiser) problems!

But in short: the elimination of some joins, especially in complex or high-frequency SQL may Benefit from the use of DOMAINS. QED.


My interpretations (and ymmv...) 

1st: For Large, Aggregate-type queries, there is No big Advantage.

On large queries, where you scan significant parts of the "data" or fact-table, the additional, single, Full-Scan of the lookup-tables is probably negligible. Small compared to the effort of scanning the data-table. Hence if your work consists of large, aggregate-type queries, there is little gain in avoiding LoV tables. Provided you have your indexes done well and statistics up-to-date. Also c.f. the Soap-Box statement above...

2nd: Possible Gain on fetching small-sets or Single-Records. 

On retrieval of small sets, or individual records, Especially if you have to retrieve many LoV-values, the additional table-access for the LoVs Can be a larger part of your Query-effort. And here you can save on the additional Hop(s) to the LoV table(s). This can be beneficial If your work consists of a lot of small, OLTP-style SQL-statements where you need to access LoVs Many and Often. Again, provided you have your indexes Done Correctly.

3rd, Note: I have some remaining notes on using domain-expressions in where-clauses as well. Maybe put those in a followup to prevent a "too-long-read". 


Closing Remarks. 

In summary, I think the use-case for "reducing effort on LoV" should be focussed on the cases where the LoV is a relatively large part of the total-effort. That is mostly (if done correctly) the case in retrieving small sets or high-frequency, quick-verifications. 

So there Is in specific situations a Good Case for using DOMAIN-ENUMs.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- 

demo_enum_4select.sql : the complete script, just hit enter 9 times.

demo_enum_4extreme_index.sql : remind me to add this if I get there...


Monday, 16 December 2024

Oracle 23ai - enums, what is behind the scenes ?

 TL;DR: The DOMAINs in Oracle 23ai are an additional way to add constraints to your Datamodel and you should investigate this to make your database "Smarter". When testing the simplest of domains, the ENUM, they appear Quite Efficient. And as I can never resist trying some extreme cases, I did hit some limit as well.


Background 

Domains are introduced as a way to define complex and sophisticated (check-)Constraints, and to (re)use those in multiple places in a database. Enforcement of the constraint is by the RDBMS, hence this keeps your Logic as close to the Data as possible and avoids re-coding things in various application-components. Domains can be Very Useful in this respect. You can start your own investigation here (link).

This blog is about the simplest possible case: a List of Values (LoV) constraint. For a relatively simple LoV we have the choice between using a Domain of type ENUM or an FK-constraint to limit the values for a column. The nature of an FK to a lookup-table is very much "Relational", as you can define the valid options in a separate table and let the RDBMS enforce the constraint. The nature of a DOMAIN is more declarative, and resembles a sophisticated CHECK-Constraint.

Note: A word of Thanks to Chris Saxon, who did the presentation that triggered me to doodle the scripts that ultimately lead to this set of tests. It was Fun to do.

A reference to Asterix, because of the Roman Numerals in this blog...


Testing, a small and a large Enum.

I could not resist so I created two domain-ENUMs and did some SQL on them. Here is what I did (I'll put links to full scripts at the bottom..).

My first Enum, a list of colors:


That was Easy. And now for a really long one: the list of valid Roman Numbers in Oracle...
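The member-list for that one is far too long to type by hand; a sketch of how such a list could be generated (assuming the member-names are the trimmed roman numerals, with their number as enum-value):

  select trim ( to_char ( level, 'RN' ) ) || ' = ' || level || ','  as enum_member
  from   dual
  connect by level <= 3999;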



This shows I can potentially turn a LoV table of 4000 values into a "DOMAIN", and maybe gain some performance by avoiding a join to the lookup-table?

And after creation, you can query them like normal tables. Let's see how that goes: The list of values can be SELECTed and sorted at will:



Because the queries on the rather large "roman_enum" were a bit sluggish, and as I am Always curious, I checked on the effort behind the scenes. The autoexplain looked ok: no buffer-gets or disk-io:



This query is causing a lot of round-trips because the result-set of 3999 records has to be pushed to the client, but Still no buffer-gets or disk-io visible. Hence operations on DOMAINS should be buffer- and IO-Efficient. Promising. 

But queries on the large domain did use up a lot of cpu-time, which could be a warning. I ignored the warning at that time, because a full-query on such a large domain is probably a weird case anyway.

Here are the statistics from v$sqlarea in the shared pool; they showed me similar good information:
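The check was along these lines (a sketch; filter and formatting to taste):

  select sql_id, executions, buffer_gets, disk_reads,
         round ( cpu_time / 1000 )     as cpu_ms,
         round ( elapsed_time / 1000 ) as ela_ms,
         substr ( sql_text, 1, 40 )    as sql_text
  from   v$sqlarea
  where  lower ( sql_text ) like '%enum%'
  order by last_active_time;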



The Good Part: No logical- or physical-IO. That two of the queries showed buffer-gets was apparently due to them being the first-time queries against the DOMAINs after purging the shared pool to allow clean measurement of the statistics.

But the fact that the "composition" of the relatively large set for roman-numbers had taken a lot of CPU- and Elapsed-time was "interesting".


Now use DOMAIN on Columns in a Table:

To get some real usage, I needed tables with columns that use the (enum) domains in their type-definition. So I created two very simple tables:



Notice how the 2nd column of each table only has a definition that refers to an enum-domain. There is no other information to say that the values in that column are actually numeric (in my cases). If you RTFM deeper, you will find that the actual enum_values that go in are not limited to numbers. But I have not tested that (yet).

But let's insert some data and see what it looks like. First the Colors...



As you can see, you can insert data using either a number (e.g. the enum_value) or the "constant" from the DOMAIN (e.g. the enum_name). And on query, the value is reported as the enum_value (e.g. similar to reporting the key of an FK table). If you want to display the enum_name, you need to use a function: 

Select DOMAIN_DISPLAY ( enum_value_column ) from Table;

The table-definition will then use the domain defined for that column to "look up" and display the description. The lookup or display does not seem to incur additional IO; it will probably need a few extra CPU ticks. I might test that later.


A Quirk to display the large domain?

Here is what happened when I query the table with the roman_enum, first using select-star:



No surprise (yet). 

You can see I used numbers that could represent years, and because of the domain-definition, each enum_value or "row-number" corresponds to the enum_name with the equivalent roman-number. I know some ppl like to use Roman-Numerals for years, hence my use of this domain.

But now let me try to display the "roman numeral value" using the DOMAIN_DISPLAY function:



Resulting in an Error ?? 

Also, the error was "instant", there was no noticable (cpu-)time consumed before throwing the error. And the error was always like a syntax- or typo-style error. Well, as far as I could see, there were no syntax-errors in my queries... ? (I am notirious for typos, so I did check several times, and re-typed the statement in a few ways, no relief, just errors while trying to retrieve my "roman number"...). 

Almost as if the RDBMS told me I should just revert to TO_CHAR( rn, 'RN'), which is a lot simpler anyway.

I defined a few more, different, smaller, enums, which all Worked Fine... ???

Then I did a little searching (in my own database, the datamodel, the data-dictionary), and found that the data-dictionary has a view "USER_DOMAINS". It contains a CLOB-column called DATA_DISPLAY, in which there is a CASE statement that seems to compose my enum:



I then got suspicious and checked the size of that column, using dbms_lob.getlength() for my domains:
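Something like this (a sketch; assuming the NAME column of user_domains holds the domain-name):

  select name,
         dbms_lob.getlength ( data_display ) as display_length
  from   user_domains
  order by 2;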



Hmmm. I went back and cut my roman-numeral-domain in half, and in half again, and then did some testing to find the max-size where domain_display still worked... 

And the workaround was more or less as I suspected: Smaller Domain. In my case, if I got the size of the clob below 32K, I could display the data. Here is a small demo, contained in a separate file: demo_enum_size.sql:


Voila. That works. But in this case I seem to be limited to about 1400 values.


Summary: Domains are Useful. but Test !

In practice I would probably never use a DOMAIN that large. And there are other practicalities, like you can not (yet) alter a domain without (probably) causing a window-of-non-constraint, or some outage. This needs more Testing (...). In general for such long lists, I (still) think a real LoV TABLE with FK is more appropriate. And I know an LoV table can be "edited" in a LIVE system, BTDT ;-).

But the fact is that DOMAINs have advantages too. DOMAINs are declarative, they do not need access to "data" (buffers or disk), and they can implement constraints that are much more sophisticated than the old-fashioned "check". That makes them potentially Very Interesting. I should test more...

So maybe TBC.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Appendix: Scripts used to compose this blog..

demo_enum.sql : self running, just hit enter some 15 times..

demo_enum_2size.sql: self running, shorter, just to work around the error. 


Edit: Almost forgot:

demo_enum_io.sql : to display statistics from v$sqlarea, show the CPU + IO

Friday, 6 December 2024

Oracle 23ai - unloading of vector data.

 TL;DR: For demo and repeatable-test cases, I like to use scripts to set/reset my test-cases. Vectors can pose a problem here because they are not always easily "scriptable". Not Yet anyway. SQLcl can help with the unload commands.



Pop-in an image of "unloading", because I am in the end just a farm-boy who likes tractors and combine-harversters and stuff...


Background: Scripts, Reproducible cases.

Always looking for ways to quickly test or demo something, I am a Big Fan of script-tools like SQLcl and SQL*Plus (and bash, and sometimes python). Good Demo- and Test-cases should be easy to reproduce, on any platform, even 10yrs later. The oldest "test" I got is a few scripts and one bench.dmp file from 2001, created with Oracle v8.1.7, and it is still "usable" (not very relevant anymore, but still usable, that is Backward Compatibility for you...). 

The creation and manipulation of test-data for VECTOR columns sometimes poses a problem. See also earlier blog (link) where I pointed out that the (ancient) exp/imp tools do not handle VECTOR data, but datapump does.

This blog is partly a call for help to Oracle (Hi Jeff!) to help facilitate the simple spool+reload of Vector-Data. I'm fairly sure there are several solutions, but not sure which ones to explore or use in my specific cases.


This Worked: SQLcl unload (as insert-into)

SQL*Plus and earlier versions of SQLcl had given me several errors, seemingly treating the VECTOR as a CLOB or LOB. But hey: Use Latest Version! Right?

Using SQLcl version 24.2.0, I found the following to work, using the unload command:


This generated a runnable sql-file that would correctly re-insert my data, even when the vector itself was (much) larger than 4K. My Tested case used vectors that were still under 32K. Might try bigger ones later.

The secret was, apparently, in the trick of splitting the vector into smaller chunks, each of which occupied a line of just under 512 characters (linesize 512?). Each line was a set of numbers converted with TO_CLOB ( ... ), and concatenated with its peers. The resulting CLOB was inserted into the VECTOR column, and this gave the desired result: my vectors were re-inserted. This is a screenshot of the unloaded file:

Notice the repeated TO_CLOB(...) statements that are concatenated together to form the complete vector (which is apparently the equivalent of a clob...). 

Note: If you look inside user_indexes and user_segments, you can see that vector-columns have corresponding LOB-index and LOB-segment entries: Storage-wise they are apparently treated similar to LOB/BLOB/CLOB columns.

I was relatively happy with this "concatenate-clobs trick" for several reasons:

. The unload generated correct, runnable scripts.

. No intermediate (staging) tables or additional pl/sql (helper-) procedures needed.

. Even SQL*Plus was able to insert this data. I had a Very Portable Demo!

But there were some clouds on the horizon...


Possible Problem with new version SQLcl ?

Then, I ran into an error after downloading SQLcl v24.3. The unload-command had changed slightly: it generated the insert of each VECTOR as one single string. 

Hence, unloading with SQLcl version 24.3.2.0, build ..3301718, then trying to reload resulted in this error:

ORA-01704: string literal too long...

When I opened the unloaded file, the 3rd line in the sql-file was now 30.621 characters long (hm.. just under 32K, that hardly seemed a coincidence - I should try with a longer vector some time...). There were no more TO_CLOB (...) concatenations, and the vector was just a quoted string of values, looking like this:

This simple quoting of the vector will work for small vectors (below 4K or even 32K), but not for seriously large vectors where the string-representation will surpass these values. Also, I am not sure what would happen to binary vectors: how would they be represented as quoted string ? (More stuff to test!.. I know, I know, but Time... Maybe later).


Final notes: Multiple Solutions.

From working with LOBs and CLOBs myself a long time ago, I know how Tricky these things can be: They mostly require the use of some other language than just SQL(*Plus) to manipulate them. They need to be split into chunks that are manageable by whichever language/system/host is doing the processing. But as a simple SQL user who just wants to run Easy Test- and Demo-scripts, I am hoping the solution will remain Simple.

By documenting this "case" I hope Oracle can either fix this (by reverting back to using multiple TO_CLOB?), or work around this problem in some other way.

One suggestion I got in private-message was to try using JSON: get the VECTOR in some JSON format and use that to unload/load the data, possibly in smaller chunks. And of course you can also revert to additional PL/SQL, Java(script) or Python to manipulate vectors. But all those solutions involve some additional (bespoke) coding on my part. And I just want Simple-Demo cases.

Suggestions Always Welcome ! 


----- End of Blogpost ------ 


Attachements, Addendums (just some files)


Below are several links to files to test/demo for yourself:

ini_unload.sql:  Set up the test-case, create the table and load nine records with VECTOR(2048, FLOAT64). The data is generated from some of my experiments with vectors.

tst_unload.sql:  Demonstrate the working (or error) of unload-as-inserts using SQLcl. This script will run the actual demo and possibly hit the "error" when it runs the script generated from the unload-command.

If you want to see the two unloaded files: 

The one with "working" inserts and 

The one that generated the error..


Happy Scripting.



Friday, 22 November 2024

Oracle 23ai - Converged Database, playing with images and text

TL;DR: Investigating the use of VECTORs in Oracle 23ai, I stumbled across all kinds of (python) models. Some of those pre-trained models think they can recognise my photos. Some Surprises. But I am still nowhere near my end-goal: text-searching my pictures. But having the data in Oracle 23ai helps. I can use SQL.


Background

I'm on a quest to search through my own photos using a text-query, like "photos with pylons in them?". So far I have successfully managed to store images and VECTORs generated from the images. Searching image-to-image works OK: Give it an example-image like pylon.jpg, and it finds similar, resembling images. But it is hard to find a model (a vectoriser?) that allows text-search.

In the process, I'm finding other interesting things about "Artificial Intelligence" and the use of python and Oracle. For example, there is "AI" that can generate captions, descriptive text, to the images you give it. So, Let's Try That.

But let me insert some caution: Automated captions, Labels, are not always correct. Some of them become funny. Hence...



Yes, the warning is there for a reason.


Caption-generation: halfway to searching the images?

If an "AI" can generate descriptions of my pictures, that would be a step in the right direction. And I came across several models that pretend they can. A lot of my searches lead to the Hugging Face site (link), where you can find descriptions of the models, example python-code. And where the necessary "data" from the pre-learned moded can be (automatically!) downloaded.

Not all models can be made to work as easily as the blogs/pr-posts or example-code suggest, but I got a few to work fairly quickly. 

Disclaimer: A lot of this is "black box" to me. And when someone asks me for details, I might reply: "I have No Idea, but it seems to work".


Found a model that "worked" for me

The one that seemed to work calls itself "Vit-GPT2-coco-en" (link)

ViT-GPT-2 : Vision Transformer - Generative Pretrained Transformer 2  

(Yes, I found out later that there are a GPT-1 and a GPT-3 and -4. In my search, I just happened to stumble upon an implementation of ViT-GPT-2 that worked more or less effortlessly in my situation - I might explore higher versions later, as they look Promising).

I took the example code, did some copy paste, from this page (link), and managed to generate captions for my stored-photos. To use this model, I needed the following at the start of my py-file:

This tells python to use the transformers (a way to run a model, I think) and it sets three functions (function-pointers): feature-extractor, tokenizer and model. Those functions will be used to "extract" the caption. 

With careful copy/paste from the example-code, I constructed a function to generate  the caption for an image. The function takes the file-path to the image, and returns the caption as a string:

Notice how it calls the three functions defined (imported) earlier. The actual "magic" is packed inside these three functions. And inside a set of data, which is the "pre-learned" part of the model. On first call, you will notice that the transformer-functions will download a sizeable amount of data. Make sure you have a good+fast internet connection, as the amount of download-data can be considerable, sometimes several GBs.

For more information on the transformers, you can study the Hugging Face site. For the moment, I just accept they work.

Once I got it to work, I added a table to my datamodel to hold the captions, in a way where I could possibly run multiple models and store them in the same table based on the model_name.
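A sketch of what that table could look like (names and sizes assumed for illustration):

  create table img_captions (
    img_id      number         not null,   -- refers to the id of the stored photo
    model_name  varchar2(128)  not null,   -- which model produced this caption
    caption     varchar2(4000),
    constraint img_captions_pk primary key ( img_id, model_name )
  );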

(Yes, yes, I know, the model-column should relate to a lookup-table to prevent storing the name a thousand times.. Later!).

Then I just let the model loop over all my 19500 photos, and waited... Almost 10hrs, on my sandbox-machine, a vintage MB-Pro from 2013.

Note: I'm deliberately Not using any Hosting-provider, a SaaS, or an API-service, out of principle. I try to keep my source-data, my pictures, on my own platform, inside my own "bricks and mortar". I fully appreciate that a hosting-provider or cloud-service, or even an API, could do this better+faster. But I'm using my own "Infrastructure" at the moment. I even went so far as to run the caption-generator for a while without any internet connection, just to make sure it was "local".


The Result: Captions.

The resulting text-captions turned out to be primitive but often accurate descriptions of my photos. Here are some Queries on them:
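For example, counting the most frequent captions (a sketch; table- and column-names as assumed above, the model-name string is how I stored it):

  select caption, count(*)
  from   img_captions
  where  model_name = 'vit-gpt2-coco-en'
  group by caption
  order by count(*) desc
  fetch first 10 rows only;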

It is clear that most of my pictures contain a "motorcycle" in various locations. No mystery there. Let's clean that list by adding

  where caption not like '%motorcycle%'

Now the highest count is of some "metal object with hole"? Some folks may already guess what that is, as I ride a certain type of motorcycle...

But more in detail, let me go back to a previous blog (link): I had identified 5 pictures with "powerline-pylons" on them, and the vector-search had correctly identified them as "similar". What would the caption-algorithm think of them? Let's Query for the captions of those 5 pictures:

Hmm. Only one of the five is correctly identified as a power-line. Check the pictures on the previous blog. The two findings of a kite are explained by the triangular shape combined with the lines leading to them. The train... was not a train, but could be mistaken for one, sort-of.

Main Finding (and I checked other pictures as well...): There are a lot of Wrong Captions in the list. Doing a check for "false positives" or "false negatives" would be rather time-consuming. So while some of the captions were Surprisingly Accurate, there is a lot of noise in the results still. Hence the "Warning".

Oh, and that "metal object with a hole" ?  It was the Oil-window on my engine... I tend to hold-down the phone to take a picture of it Every Morning to have an idea of oil-consumption between maintenance. I mostly delete the pictures, but when I keep them, they look like this:

Correctly Identified! And no worries, the modern engine of a BMW-R1250RT uses virtually No oil between maintenance. But it is a good habit to check. 

There were several other Surprising "accurate finds" and a lot of "funny misses", but that is not the topic of this blog. The use of (this-model) caption mechanism is helpful, but still rather crude, and error-prone.


Learnings.

Python: At this stage, I learned about python (and Oracle). Not much vector-logic happening at my database, yet.

Models, AI...: I learned there is A Lot More to Learn... (grinning at self...). Others are exploring the use of ONNX models which can be loaded into the database (link). I am sticking to python to run my models, for the moment.

Hugging Face: I got quite skilled at copy-paste of example-code and making that work. The HF site and how it enables use of models was quite useful to me (link again)

Detail on housekeeping, cached-data: The models download a lot of data (the pre-learned knowledge, possibly a neural-network?). By default this will be in a hidden directory stored under your home-dir: ~/.cache/huggingface. Relevant in case you want to clean up after testing 20 or so models. My cache had grown to close to 100G. It is only cache, so if you clean up some useful data by accident it will re-download on the next run.

Generated Captions: Sometimes Surprisingly Accurate, sometimes a bit hmmm. The results might say something about my picture-taking habits, not sure what yet, but I may have to diversify my subjects a little. Too many pictures of motorcycles.

Converged Database: Give it any SQL you want.

Having the data in an RDBMS is still a large blessing. Inside a Converged Database (link), you have the liberty to combine data from any table to look for the information you need at any given time. And I Know how to formulate SQL to gain some insights. 

SQL Rules!


Concluding...: Not there yet.

My original goal was (still is) to enter a question as text and have the system report back with the possible resulting images. It seems That is going to take a lot more googling and experimenting still.