Monday, 10 March 2025

The Generic Tips (was: python and Oracle)

TL;DR: Some Very Basic Tips to write better programs and tools. In my recent case: using python and Oracle. But in the handover to colleagues, I came to realise there are some tips+tricks that need reiterating every few years. So Here Goes (again). 

My opinion is of course strongly "biased" towards an Oracle Database, but that is not always a bad thing.

The actual blog on my Oracle-python-tricks will have to wait a bit.. but it is in the pipeline (# mkfifo pythonstuff ; link to future blog...) 


The things that tend to wash up on the desk of a DBA...



Background: Been There Done That (have T-shirts).

Originally, I set out to describe some of the python programs (functions, really) that I (re)used to make my life and that of my customer easier. As it turned out, we ended up discussing some "habits" and practices rather than the actual code. So just a few generic tips, which I also used to explain things to customers. Hence I wrote that up before I dipped into my python code.

YMMV and "Never say Never": The tips applied to myself and my customer-situation. But I came across enough situations to say they apply in many cases. However, Your Mileage May Vary. So if you think they do not apply to you, I am fine with that. I'm Not going to go religious on it.

My content here applies to the type of (simple-) jobs that tend to wash up on the desk of a DBA: Interfaces, ad-hoc reports, occasional maintenance-jobs, loading "metadata". That sort of thing.

Also, do not expect High-Tech. In my case, with the combination of python and Oracle, I have only built relatively simple programs. And I do nearly all my work just from a "terminal" on either linux, mac or sometimes even windows. I use mainly vi, SQL-Developer (and occasionally some IDE or even notepad) as editors, and often still use print/printf or dbms_output statements for debugging (showing my age eh...). 



Non-functional "Requirements": try to include in Specification.

Requirements, Specifications, or whatever name you have, often state: "ingest these datasets" or "produce a list of gizmos+items in that format", or "return a set of records with this data..." . And that is expected. 

The un-spoken or so called "Non Functional Requirements" are often along the lines of:

1) Make sure it runs in future infrastructure (if we move to "the other cloud"...)

2) Allow running- and maintenance by our outsourced-vendor-resources.

2a) Include some (unit-, integration-) test capability, a testset...

3) Make sure it runs inside 1 min (One Minute, even if 10G of xml-data....)

4) Produce some output or report to allow control or verification (and keep that)

5) Allow only "authorised" people/tools to run it, and/or prevent DoS situations.

 etc...

Those requirements are often unspecified but implied when you are asked to do "just this simple program". But it helps to write them down (briefly!) even if nobody wants to read them. You may also need some trace of those Non-Functionals to justify the choices that you make (e.g. why didn't you build this program in Go, in C++, or as a bash-script using just grep and awk?).

Note: The checklist of 1 to ... etc. above would be Tip-nr-0, if I were so pedantic as to start numbering at Zero...

So here come the "Generic Tips" (after weeding out half)...



Tip nr 1: Write it down, Briefly but Clearly

Why? Well, See above...

But also: the reader of your writings is probably yourself or your replacement-person in 3 years, or even in 20yrs (Yes.. I've seen them). Hence, it helps if you can introduce your program with a few lines of both the "functional" and "non-functional" requirements (what does it need to do!).

By writing (or just summarising) what your program needs to do, you can also "verify" it with the user, the architect or whoever is asking for it to be built.

You may also want to include some "justification", e.g. why you chose PL/SQL over Python or FORTRAN (hehehe).



Tip nr 2: Draw one or more ERDs. For better Understanding!

To understand a program, it rarely helps to read the source-code. But a look at the tables, or the ERD of the underlying data, can do miracles for understanding. I tend to create my ERDs from the tables in the database and rarely work with other models than the Table-Relationship-Diagram. 

Even if you have very simple data, it helps to have an ERD. Not only will an ERD help you define tables (if you even use tables at all...) and the way those tables relate. But a brief documentation of each data-element will also allow you to communicate the purpose and usage of the data-elements. 

In creating the ERD you will most likely "discover" a few things about the data you process. I notably walk down the lines (relations) and verify things like .. "This order has zero, one, or more details...", and then ask "Is that Correct...?"

When you have to explain your program to colleagues, or have to ask "more questions" from the users: an ERD is a great tool to communicate, and much more readable than "open this source code" (the user may not have the dev-tools installed...).



Tip 3: SmartDB: Use. Your. Database (e.g. use SQL and PL/SQL)

If the data is moving into or out of an RDBMS: Consider using SQL and PL/SQL as much as possible.

Even if your organisation prefers using VMware virtual machines, or your architect prefers the use of "spot-instances", or k8s, or some serverless FaaS solution... If your data ends up in a database, that database is the Safest, the Longest-Lasting, and probably the Easiest to Maintain place to "store + run your program".

I recall my managers, in the 90s and 00s demanding things like "use FORTRAN to convert incoming data", or "only C will do this fast enough". In the end most of that code ended up being converted (reduced!) into PL/SQL with good results.

The excuses to not use a database are plenty:

- Database CPU is too expensive (maybe, but the database is often the most efficient place to process and keep Data)

- PL/SQL requires a DBA... (well, a pl/sql person, actually, so..?)

- This is so simple, you should use grep/awk for this (OK, try that. And then introduce a small change in requirements... or start processing GBs of data...)

I must say: I have broken this rule of #SmartDB many times, often at the expense of myself and my customer. Nothing lives as long as the database...

Of course, YMMV. In some cases FORTRAN, C++ or python is advised because of available tools or libraries (as was sometimes my case). The main Valid reason to do processing outside of the RDBMS is, in my opinion, the use of certain libraries for specific functions (math, geo, or some specific engineering- or scientific-purpose library).

But In General, by solving the problems "inside the database", the code was the most efficient and the easiest to maintain (some links to the praise of #SmartDB...?).



Tip 4: Provide some Test Cases.

This may sound trivial, but it can really help. I'm not specifically advocating TDD (Test Driven Development), or some other "framework" that may require a lot of methodology (and work). But you need some test-cases, if only to prove that the original code you delivered still works.

Ideally, you have scripts (test_setup.sh and test_run.sh).

And those tests should not "break" or otherwise affect your production data. Hence this needs some thought, and probably needs to be embedded into the processes of the customer. 

All customers have Test-platforms. Some customers even have separate platforms where they run Production systems.



Tip 5: Plan Liberally, every IT tweak takes more time than expected.

The best known book on this is probably "The Mythical Man Month" by F.P. Brooks. One of my early managers made me read it. Recommended!

My rule of thumb is to multiply the original time-estimate. For simple systems: multiply by 3.14 (approximately pi, Archimedes...). For more complicated systems: multiply by 9.8 (approximately pi-squared, or g).

Most people can remember pi and g (earth gravity). And simply stating "I multiplied by 10" would sound rather unscientific.



Tip x : Stick with the tool(s), unless...

I also tend to include a warning: stick with the tool (bash, python, sqlcl, pl/sql...). If every little maintenance job completely re-writes the program (in Java, because that is the future standard...), you end up with a hodgepodge of code. If some future coder Insists on re-writing (in .cpp using chatgpt...): think hard, and possibly find a coder that Does understand the original tools/languages it was written in.

When I was an engineering student we were taught some rules around robustness, and maintainability. Such as how to choose components or parts: 

- Fewer Components make a system more Robust.

- Fewer Different Components make a system easier to Maintain.

I never forgot those.. and those simple rules also apply to IT-systems.



Oops... Writing Too Much ? 

This text is getting way too long; nobody reads that much anymore (so I went back and weeded out half of it).

I could add several more rules. Notably around the fact that the Oracle RDBMS now includes several neat features that make your life easier, while still staying at the #SmartDB level, e.g. "With the Database". There are now options to process JSON, to build REST-services (recommended), and the use of APEX for a simple window-on-data (recommended). And you can now include ONNX models (new in Oracle23ai!). But all those are above + beyond the scope of what I wanted to write.

So there.



Summary: 

Write Some Specs, for yourself and your colleagues.

Plan carefully (and maybe read The Mythical Man Month...)

Make an ERD, it helps the understanding and facilitates discussions.

Use #SmartDB whenever possible, it is more efficient.

Provide some Test-cases.

Limit your number of tools/languages, prevent code-chaos.


That is about it.. Next instalment should be : python Tips + python Code ...

 

Friday, 20 December 2024

Oracle 23ai, Domains, Enums and Where-Clauses.

TL;DR: In Oracle23ai, the DOMAIN type is a nice addition to implement more sophisticated Constraints. Someone suggested using a DOMAIN of type ENUM as a column-constraint, acting as an LoV. Works like a charm. But the behaviour of some where-clauses may not be what you think. Let me show you some very simple examples, and some workarounds, both bad and Good.

A prism, just because I used colors as an example.


Background

After seeing a presentation on the use of ENUM-Domains as a List of Values on a column, I did some scripting to see how it could work (previous blogs here + here). Now I tried to use it in Where-Conditions. 

So I have a few domains, one of which is color:
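Roughly along these lines (a sketch; the exact member-names are an assumption, based on the rainbow-colors used in these posts):

  -- sketch: member names assumed; the enum values default to 1..7
  create domain color_enum as enum (red, orange, yellow, green, blue, indigo, violet);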


And I created a table that uses the domain in a column-definition, like this:
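A sketch of such a table (the table-name t_color is my own, assumed name; the column simply uses the domain as its type):

  create table t_color (
    id     number primary key,
    color  color_enum
  );
  create index t_color_color_ix on t_color (color);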

The column is defined as type "color_enum" and thus inherits the "LoV" defined in the Domain. As I consider the color-column to be akin to a Foreign-Key, I have already created an index on it. 

When I ask for the INFO of the table, using SQLcl, this is what I get:


The column COLOR apparently got defined as number(38,0), i.e. an Integer, because it cannot hold decimals. But the same column also has the DOMAIN specified as COLOR_ENUM. The info-command also shows I have indexes on both the ID and on the Color-column.

Note: I'm using SQLcl (link) here. In SQL*Plus you would need to do a DESC of the table. The latest versions of SQL*Plus will show the number + domain, but older versions will just show the column as number(38,0).


Do some Inserts

We need some data in the table so let's...
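Something like this (using the sketch-table from above):

  -- both the enum 'constant' and the plain numeric value are accepted
  insert into t_color (id, color) values (1, color_enum.red);
  insert into t_color (id, color) values (2, color_enum.blue);
  insert into t_color (id, color) values (3, 4);   -- 4 = green, in this sketch
  commit;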

Notice how I can use either the enum.name, which is effectively a set of Constants, or an actual numeric value to insert data. I did not have to visit an LoV table to get (read or join) my values.

Just curious: how does it handle "wrong values": 
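For example (again on the sketch-table):

  -- purple is not a member of color_enum, so this fails
  insert into t_color (id, color) values (4, color_enum.purple);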

There is no "purple" defined in the Domain-Enum and an error pops up. But the description of the error is a little vague in my opinion. It seems to read "color_enum.purple" as a reference to a column. I think the actual error here is more of an "invalid value" or "check constraint violated". But I can see why the error is what it is.

And what if we try to insert a number out of range:
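Like this:

  -- 42 is outside the defined enum values, so the domain check fires
  insert into t_color (id, color) values (5, 42);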

When trying to insert a number outside the valid range, the Error is more Clear. It refers to a "check constraint" and it mentions the DOMAIN. This should give even a developer enough clues to fix it.

And we have Proven that the Domain acts as both Check-Constraint and LoV. It is very similar to an FK (as tested in the previous blog as well).

Next is doing some Retrieval.


Selecting from the table.

The result of our inserts is a table with the following data:

No surprises here. But notice again that the columns "id" and "color" are numbers, and we need a display function to obtain the "name" of the color. 

Now suppose we want to search for records by the name of the color. We need to put the condition against the domain_display():
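A sketch of such a query (names as in the sketches above):

  -- this is the variant that resulted in a Table Access Full
  select id, domain_display(color) as color_name
    from t_color
   where domain_display(color) like 'RED%';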

Notice the Table Access Full. But also notice the filter: the condition on the display-function. It seems to employ a CASE-WHEN statement to pick up the values from the DOMAIN. 

I suspect that using the function to find values like 'RED%' is akin to a type-mismatch (e.g. comparing varchars to numbers), which stops us from using the index on the numeric column color. And a Full Table Access can be a bad thing on large tables.

It works much better if we use constants or plain numbers to look for our records:
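For example:

  -- the numeric comparison (constant or plain number) can use the index on color
  select id, domain_display(color) as color_name
    from t_color
   where color = color_enum.red;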

If the type of the comparison matches the indexed, numeric, value, the query Can use the index. And this is the desired outcome. 

The Warning is Clear: Do Not Use that function in a search-criterion. Unless...


Function Based Index?

Because domain_display() is a function, we could consider adding a Function-Based Index. Let's try that:
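Roughly like this (a sketch; I would test it before relying on it):

  create index t_color_display_fbi on t_color ( domain_display(color) );

  select id
    from t_color
   where domain_display(color) like 'RED%';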

That works. With the FBI, the where-clause has an index that matches the domain_display, and it uses that index. Maybe this is what you need. But I am generally hesitant to add more indexes. I would prefer to use matching types in the where-clause: constants, or other numeric conditions.


More Exotic: join the DOMAIN:

For a more exotic idea: you can use the ENUM-domain as a record-source in the from-clause, and add a join to the Query. Not that strange, because this is what you would do if you were using an FK-Constraint anyway:

And it works. Apparently I can use the domain as a record-source with no problem. Let's have a look at the explain-plan:

There is a list of "filter" operations on the domain, but the query does use the intended (numeric) Index. This is how you would do it if the constraint were a classic FK to an LoV table. And this still works as well.


Summary: DOMAIN-ENUM can be used as LoV.

The DOMAIN can act as LoV and FK. It brings both the "check constraint" and readable "constants" for use in code, without the need for a separate table to keep the values. In some cases, this will bring you a reduction of IO-Effort and make your queries more efficient (see earlier blogs). And the same limitations apply when searching for the "description of the LoV".  

Advantage or not: You cannot mess with a Domain as easily as with a table. A DOMAIN cannot be edited in the same way as an LoV table (which I have been known to edit in a live system...). 

Furthermore, I showed a workaround with an FBI. But I do Not Recommend putting Function-Based-Indexes on domain-columns indiscriminately. Every extra index should be given careful Consideration.


Oh, and: RTFM

There is a lot more to DOMAINS than just LoV. Domains can/should (IMHO) be used for Sophisticated Constraints on one or more (groups of) columns. Do Go check the Documentation.

Or check this page by Tim for copy-paste usable examples. After all: RTFM = Read Tim's Fantastic (summary of the) Manual (link).

Enjoy.


-- -- -- -- -- -- end of blogpost, appendix below -- -- -- -- -- -- 

Appendix

Script to replay the demos..

demo_enum_5where.sql : just hit enter several times to walk through the demo.

Happy Coding (Tip: just use SQL wherever you can!)


Tuesday, 17 December 2024

Oracle 23ai, Domains, Enums, use-cases.

TL;DR: The new DOMAIN type in Oracle23ai can be used as an LoV. And in some cases that can save you Effort, because a query does not have to visit the FK-tables. I'll show some examples, and I'll demonstrate a few pitfalls to avoid.


Background.

It all started when someone suggested using DOMAINs of type ENUM as an alternative for FK-tables with List-of-Value lookup-data. For example, if you want your table to have a field "color", limited to just the colors of the rainbow. You can use a check-constraint (simple!), or a table with color-values and an FK-constraint, or, in Oracle 23ai, a DOMAIN. We will compare ENUM-domains with the FK-LoV tables. 


Datamodel:  a fact table with two LoVs

For this little demo I will use an old-fashioned datamodel with two lookup tables and one large fact table. And the queries on that 3-table model will be compared to the same queries against a single table where the LoVs are held in two DOMAINS of type ENUM.

First, the traditional model; it looks like this:


(I love a primitive diagram...) 

And the Fact table is defined as follows:
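In outline, something like this (the names t_color_lov, roman_lov and data_2lov appear later in this post; the column-names and the payload filler are my assumptions):

  create table t_color_lov (
    id    number       primary key,
    name  varchar2(30) not null
  );
  create table roman_lov (
    id    number       primary key,
    name  varchar2(30) not null
  );
  create table data_2lov (
    id        number primary key,
    color_id  number references t_color_lov,
    roman_id  number references roman_lov,
    payload   varchar2(200)                   -- filler, to give the table some size
  );
  create index data_2lov_color_ix on data_2lov (color_id);
  create index data_2lov_roman_ix on data_2lov (roman_id);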


All very Classic: Data-table with LoVs and Foreign Keys. The FKs are the "constraints" that ensure correct data. I also put indexes on the FK-fields, but you could omit those.

Now if you want, you can replace those FKs with DOMAINS of type ENUM and use that for the column-definitions:

I started by defining two DOMAINS; the color-one looks like this:


I created a similar domain with 1400 Roman Numerals (too long to show, but there will be a complete re-play script at the bottom of the blog). 

Now create the table and use those domains:
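A sketch (the table-name data_2enum is an assumption; the two enum-columns simply use the domains as their type):

  create table data_2enum (
    id       number primary key,
    color    color_enum,
    roman    roman_enum,
    payload  varchar2(200)
  );
  create index data_2enum_color_ix on data_2enum (color);
  create index data_2enum_roman_ix on data_2enum (roman);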


Note how two of the columns have their type defined by a DOMAIN-ENUM. There is no FK, and you can leave out the LoV tables completely. But the values for the columns are now Constrained by the domain-definition. For good measure, I created indexes on them, as I expect to be "searching" for those at some time. 

We filled both tables with Identical data, and gathered statistics on them. Now we are ready to do some comparisons.


Aggregate Queries

Let's try to aggregate some color-records, and use a Query that _needs_ to visit the table (otherwise, the optimiser could sneakily just use only an index):
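Something in this direction (a sketch; the max() on the payload-column from the sketch above is what forces the table visit):

  select c.name, count(*), max(d.payload)
    from data_2lov d
    join t_color_lov c on c.id = d.color_id
   group by c.name;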


Access is via Index-full-scan and table-access-full. The Join is done by hash-join (more efficient than a Nested Loop in this case). Looks like expected behaviour. I happen to know that the LoV table t_color_lov is 8 blocks, and the data-table is 768 blocks in size. Check the statistics of the query:


Just over 700 gets. More or less as expected as there will be some empty blocks in the table. 

Now try this using the table that Does Not have FKs, but is using DOMAINs. First the explain:
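The equivalent sketch against the single enum-table:

  select domain_display(color) as color_name, count(*), max(payload)
    from data_2enum
   group by domain_display(color);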


We see 1 Full-Scan of the table. No other tables involved, because the constraints are done by the DOMAIN, not by FKs. 

The query-stats look like:


Maybe somewhat to our surprise, there is little gain here. Both queries are about Equally efficient. The 7 records for the color-LoV probably fit into 1 block, hence the "enum" case consumes just 1 block less.

[ soap-box: ]

Now, YMMV, because I have heard of people who got bitten by "too many LoVs, too many joins". And they claim to fix this with either Check-Constraints or by simply removing the FK-constraints. But I would Strongly Suggest examining your Explain-plans first. The optimiser is well capable of Scanning the small tables just once and using hash- or merge-joins to avoid (nested) Loops. Domains are, IMHO, not a way to solve those bad queries (but they could, potentially...)

[/steps off soap-box]

I also think Connor McDonald, and others, have previously proven, sufficiently, that: Joins, Good Joins, Well-Done Joins, ... Are Not Expensive.

And in the case of relatively Large data-tables, and retrieving "a lot of records", the elimination of Joins to small tables by using DOMAINS will only yield very limited benefit (1 block out of 700...). 


Single-record retrieval queries.

Let's compare what happens when we retrieve just 1 data-record. The relative overhead of joining to LoV tables might weigh a little heavier on those "precise" cases. We are retrieving Two records and want to see color and roman (e.g. get those from the LoV or from the domain):
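Sketched like this (the two id-values as binds, just to show the shape of the query):

  select d.id, c.name as color, r.name as roman
    from data_2lov d
    join t_color_lov c on c.id = d.color_id
    join roman_lov   r on r.id = d.roman_id
   where d.id in (:id1, :id2)
   order by d.id;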


We notice the unique scan to retrieve the two records from data_2lov, a full-scan on the color_lov (7 records, effectively just 1 block) to pick out the 1 record needed, followed by the use of the unique-index on roman_lov to retrieve the roman-numerals, one for each record. The predicate information confirms this.

Then we check the statistics, they show 13 block-gets:


By many criteria, 13 block-gets is an Efficient statement. 

Note for SQL-geeks: Yes, there is room for minor improvements, by using a "covering index" on the data-table, or by using IOTs. Maybe, with some extra indexing, this query could be brought down by another 3 block-gets (eliminate 1 step for every table) to 10 gets. I consider that out of scope for now, as not a lot of systems seem to go that far in optimising. But for critical systems I would recommend the use of IOTs (Index Organized Tables, docu here) for the LoVs. Should we add a demo of Extreme-Indexing...? (just for the fun of it...?). Maybe Later (reserve space for link to script).


Now let's retrieve the same info from the data-table using the DOMAINS:
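Sketched like this:

  select id, domain_display(color) as color, domain_display(roman) as roman
    from data_2enum
   where id in (:id1, :id2)
   order by id;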

The SQL and the Explain-Plan show only One Table involved in this Query. The color and the roman are retrieved from the domain_display( column ) function. For good measure, we added the same order-by as in the previous query.

The Statistics now show :


Only 6 gets. Because we only need to use the PK to retrieve each record. Probably two index-blocks (index-root and 1 branch) and one table-block (by index rowid) for each of the two numbers in the in-list. Total of 6 gets.

Hence, even though the original effort per query was low (13 gets is an efficient query by most standards), the savings are Significant: We reduced the gets by half. And if you have more of those LoVs per record, those savings can add up. 

Note: Possibly you get even more benefit, notably if your datamodel is missing some indexes. Or if you have stale/wrong/missing statistics whereby your optimiser is "deteriorating your query-plans". In those cases, the elimination of LoV joins will (appear to) help. But in those cases, you have other (datamodel + optimiser) problems!

But in short: complex or high-frequency SQL may Benefit from the elimination of some joins through the use of DOMAINS. QED.


My interpretations (and ymmv...) 

1st: For Large, Aggregate-type queries, there is No big Advantage.

On large queries, where you scan significant parts of the "data" or fact-table, the additional, single, Full-Scan of the lookup-tables is probably negligible. Small compared to the effort of scanning the data-table. Hence if your work consists of large, aggregate-type queries, there is little gain in avoiding LoV tables. Provided you have your indexes done well and statistics up-to-date. Also c.f. the Soap-Box statement above...

2nd: Possible Gain on fetching small-sets or Single-Records. 

On retrieval of small sets, or individual records, Especially if you have to retrieve many LoV-values, the additional table-access for the LoVs Can be a larger part of your Query-effort. And here you can save on the additional Hop(s) to the LoV table(s). This can be beneficial If your work consists of a lot of small, OLTP-style SQL-statements where you need to access LoVs Many times and Often. Again, provided you have your indexes Done Correctly.

3rd, Note: I have some remaining notes on using domain-expressions in where-clauses as well. Maybe put those in a followup to prevent a "too-long-read". 


Closing Remarks. 

In summary, I think the use-case for "reducing effort on LoV" should be focussed on the cases where the LoV is a relatively large part of the total-effort. That is mostly (if done correctly) the case in retrieving small sets or high-frequency, quick-verifications. 

So there Is, in specific situations, a Good Case for using DOMAIN-ENUMs.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- 

demo_enum_4select.sql : the complete script, just hit enter 9 times.

demo_enum_4extreme_index.sql : remind me to add this if I get there...


Monday, 16 December 2024

Oracle 23ai - enums, what is behind the scenes ?

 TL;DR: The DOMAINs in Oracle 23ai are an additional way to add constraints to your Datamodel and you should investigate this to make your database "Smarter". When testing the simplest of domains, the ENUM, they appear Quite Efficient. And as I can never resist trying some extreme cases, I did hit some limit as well.


Background 

Domains are introduced as a way to define complex and sophisticated (check-)Constraints, and to (re)use those in multiple places in a database. Enforcement of the constraint is done by the RDBMS, hence this keeps your Logic as close to the Data as possible and avoids re-coding things in various application-components. Domains can be Very Useful in this respect. You can start your own investigation here (link).

This blog is about the simplest possible case: a List of Values (LoV) constraint. For a relatively simple LoV we have the choice between using a Domain of type ENUM or an FK-constraint to limit the values for a column. The nature of an FK to a lookup-table is very much "Relational", as you can define the valid options in a separate table and let the RDBMS enforce the constraint. The nature of a DOMAIN is more declarative, and resembles a sophisticated CHECK-Constraint.

Note: A word of Thanks to Chris Saxon, who did the presentation that triggered me to doodle the scripts that ultimately lead to this set of tests. It was Fun to do.

A reference to Asterix, because of the Roman Numerals in this blog...


Testing, a small and a large Enum.

I could not resist so I created two domain-ENUMs and did some SQL on them. Here is what I did (I'll put links to full scripts at the bottom..).

My first Enum, a list of colors:


That was Easy. And now for a really long one: the list of valid Roman Numbers in Oracle...
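For the curious: the member-list for such a long enum can be generated along these lines (a sketch of the idea, not necessarily how it was done here):

  -- generate "NAME = value," lines to paste into: create domain roman_enum as enum ( ... );
  select to_char(level, 'FMRN') || ' = ' || level || ',' as enum_member
    from dual
  connect by level <= 3999;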



This shows I can potentially turn a LoV table of 4000 values into a "DOMAIN", and maybe gain some performance by avoiding a join to the lookup-table?

And after creation, you can query them as normal tables. Let's see how that goes: the list of values can be SELECTed and sorted at will:



Because the queries on the rather large "roman_enum" were a bit sluggish, and as I am Always curious, I checked on the effort behind the scenes. The autoexplain looked ok: no buffer-gets or disk-io:



This query is causing a lot of round-trips because the result-set of 3999 records has to be pushed to the client, but Still no buffer-gets or disk-io visible. Hence operations on DOMAINS should be buffer- and IO-Efficient. Promising. 

But queries on the large domain did use up a lot of cpu-time, which could be a warning. I ignored the warning at that time, because a full-query on such a large domain is probably a weird case anyway.

Here are the statistics from v$sqlarea in the shared pool; they showed me similar good information:
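Roughly this kind of query (my own approximation, using standard v$sqlarea columns; not necessarily the exact demo_enum_io.sql):

  select sql_id, executions, buffer_gets, disk_reads,
         round(cpu_time/1e6, 3)     as cpu_sec,
         round(elapsed_time/1e6, 3) as ela_sec,
         substr(sql_text, 1, 40)    as sql_text
    from v$sqlarea
   where lower(sql_text) like '%enum%'
   order by elapsed_time desc;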



The Good Part: No logical- or physical-IO. That two of the queries showed buffer-gets was apparently due to them being the first-time queries to the DOMAINS after the purging of the shared-pool to allow clean measurement of the statistics.

But the fact that the "composition" of the relatively large set for roman-numbers had taken a lot of CPU- and Elapsed-time was "interesting".


Now use DOMAIN on Columns in a Table:

To get some real usage, I needed tables with columns that use the (enum) domains in their type-definition. So I created two very simple tables:



Notice how the 2nd column of each table only has a definition that refers to an enum-domain. There is no other information to say that the values in that column are actually numeric (in my case). If you RTFM deeper, you will find that the actual enum_values that go in are not limited to numbers. But I have not tested that (yet).

But let's insert some data and see what it looks like. First the Colors...



As you can see, you can insert data using either a number (e.g. the enum_Value) or the "constant" from the DOMAIN (e.g. the enum_Name). And on query, the value is reported as the enum_value (e.g. similar to reporting the key of an FK table). If you want to display the enum_name, you need to use a function: 

Select DOMAIN_DATA ( enum_value_column ) from  Table;

The function will then use the domain defined for that column to "look up" and display the description. The lookup or display does not seem to incur additional IO; it will probably need a few extra CPU ticks. I might test that later.


A Quirk to display the large domain?

Here is what happened if I query the table with the roman_enum, first using select-star:



No surprise (yet). 

You can see I used numbers that could represent years, and because of the domain-definition, each enum_value or "row-number" corresponds to the enum_name with the equivalent roman number. I know some people like to use Roman Numerals for years, hence my use of this domain.

But now let me try to display the "roman numeral value" using the DOMAIN_DATA function:



Resulting in an Error ?? 

Also, the error was "instant"; there was no noticeable (cpu-)time consumed before throwing the error. And the error always looked like a syntax- or typo-style error. Well, as far as I could see, there were no syntax-errors in my queries...? (I am notorious for typos, so I did check several times, and re-typed the statement in a few ways; no relief, just errors while trying to retrieve my "roman number"...). 

Almost as if the RDBMS told me I should just revert to TO_CHAR( rn, 'RN'), which is a lot simpler anyway.

I defined a few more, different, smaller, enums, which all Worked Fine... ???

Then I did a little searching (in my own database, the datamodel, the data-dictionary), and found that the data-dictionary has a view "USER_DOMAINS". It contains a CLOB-column called DATA_DISPLAY, in which there is a CASE statement that seems to compose my enum:



I then got suspicious and checked the size of that column, using dbms_lob.getlength() for my domains:
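For example (DATA_DISPLAY is the CLOB-column mentioned above; the NAME column of user_domains is assumed here):

  select name, dbms_lob.getlength(data_display) as display_length
    from user_domains
   order by 2 desc;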



Hmmm. I went back and cut my roman-numeral-domain in half, and in half again, and then did some testing to find the max size where the display-function still worked... 

And the workaround was more or less as I suspected: Smaller Domain. In my case, if I got the size of the clob below 32K, I could display the data. Here is a small demo, contained in a separate file: demo_enum_size.sql:


Voila. That works. But in this case I seem to be limited to about 1400 values.


Summary: Domains are Useful. but Test !

In practice I would probably never use a DOMAIN that large. And there are other practicalities, like you can not (yet) alter a domain without (probably) causing a window-of-non-constraint, or some outage. This needs more Testing (...). In general for such long lists, I (still) think a real LoV TABLE with FK is more appropriate. And I know an LoV table can be "edited" in a LIVE system, BTDT ;-).

But the fact is that DOMAINS have advantages too. That DOMAINs are declarative, that domains do not need access to "data" (buffers or disk), and that domains can implement constraints that are much more sophisticated than an old-fashioned "check", makes them potentially Very Interesting. I should test more...

So maybe TBC.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Appendix: Scripts used to compose this blog..

demo_enum.sql : self running, just hit enter some 15 times..

demo_enum_2size.sql: self running, shorter, just to work around the error. 


Edit: Almost forgot:

demo_enum_io.sql : to display statistics from v$sqlarea, show the CPU + IO

Friday, 6 December 2024

Oracle 23ai - unloading of vector data.

TL;DR: For demo and repeatable test-cases, I like to use scripts to set/reset my test-cases. Vectors can pose a problem here because they are not always easily "scriptable". Not Yet anyway. SQLcl can help with the unload commands.



Pop in an image of "unloading", because I am, in the end, just a farm-boy who likes tractors and combine-harvesters and stuff...


Background: Scripts, Reproducible cases.

Always looking for ways to quickly test or demo something, I am a Big Fan of script-tools like SQLcl and SQL*Plus (and bash, and sometimes python). Good Demo- and Test-cases should be easy to reproduce, on any platform, even 10yrs later. The oldest "test" I got is a few scripts and one bench.dmp file from 2001, created with Oracle v8.1.7, and it is still "usable" (not very relevant anymore, but still usable, that is Backward Compatibility for you...). 

The creation and manipulation of test-data for VECTOR columns sometimes poses a problem. See also earlier blog (link) where I pointed out that the (ancient) exp/imp tools do not handle VECTOR data, but datapump does.

This blog is partly a call for help to Oracle (Hi Jeff!) to help facilitate the simple spool+reload of Vector-Data. I'm fairly sure there are several solutions, but not sure which ones to explore or use in my specific cases.


This Worked: SQLcl unload (as insert-into)

SQL*Plus and earlier versions of SQLcl had given me several errors, seemingly treating the VECTOR as a CLOB or LOB. But hey: Use Latest Version! Right?

Using SQLcl, version 24.2.0 I found the following to work, using the unload :
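For reference, one way to get insert-format output from SQLcl looks like this (a minimal sketch, not necessarily the exact unload command used here; t_vect is an assumed table-name):

  set sqlformat insert
  spool unload_vect.sql
  select * from t_vect;
  spool off
  set sqlformat ansiconsole   -- switch back to a readable on-screen format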


This generated a runnable sql-file that would correctly re-insert my data, even when the vector itself was (much) larger than 4K. My Tested case used vectors that were still under 32K. Might try bigger ones later.

The secret was, apparently, in the trick of splitting the vector into smaller chunks, each occupying a line of just under 512 characters (linesize 512?). Each line was a set of numbers converted with TO_CLOB ( ... ), and concatenated with its peers. The resulting CLOB was inserted into the VECTOR column, and this gave the desired result: my vectors were re-inserted. This is a screenshot of the unloaded file:

Notice the repeated TO_CLOB(...) statements that are concatenated together to form the complete vector (which is apparently the equivalent of a clob...). 

Note: If you look inside user_indexes and user_segments, you can see that vector-columns have corresponding lob-segments and lob-indexes: Storage-wise they are apparently treated similarly to LOB/BLOB/CLOB columns.

I was relatively happy with this "concatenate-clobs trick" for several reasons:

. The unload generated correct, runnable scripts.

. No intermediate (staging) tables or additional pl/sql (helper-) procedures needed.

. Even SQL*Plus was able to insert this data. I had a Very Portable Demo!

But there were some clouds on the horizon...


Possible Problem with new version SQLcl ?

Then I ran into an error after downloading SQLcl v24.3. The unload-command had changed slightly: it now generated an insert with the VECTOR as one single value. 

Hence, unloading with SQLcl version 24.3.2.0, build ..3301718, then trying to reload resulted in this error:

ORA-01704: string literal too long...

When I opened the unloaded file, the 3rd line in the sql-file was now 30,621 characters long (hm.. just under 32K, that seemed quite a coincidence - I should try with a longer vector some time...). There were no more TO_CLOB (...) concatenations, and the vector was just a quoted string of values, looking like this:

This simple quoting of the vector will work for small vectors (below 4K or even 32K), but not for seriously large vectors where the string-representation surpasses these limits. Also, I am not sure what would happen to binary vectors: how would they be represented as a quoted string? (More stuff to test!.. I know, I know, but Time... Maybe later).


Final notes: Multiple Solutions.

From working with LOBs and CLOBs myself a long time ago, I know how Tricky these things can be: they mostly require the use of some language other than just SQL(*Plus) to manipulate them. They need to be split into chunks that are manageable by whichever language/system/host is doing the processing. But as a simple SQL user who just wants to run Easy Test- and Demo-scripts, I am hoping the solution will remain Simple.

By documenting this "case" I hope Oracle can either fix this (by reverting back to using multiple TO_CLOB?), or work around this problem in some other way.

One suggestion I got in private-message was to try using JSON: get the VECTOR in some JSON format and use that to unload/load the data, possibly in smaller chunks. And of course you can also revert to additional PL/SQL, Java(script) or Python to manipulate vectors. But all those solutions involve some additional (bespoke) coding on my part. And I just want Simple-Demo cases.

Suggestions Always Welcome ! 


----- End of Blogpost ------ 


Attachments, Addendums (just some files)


Below are several links to files to test/demo for yourself:

ini_unload.sql:  Set up the test-case: create the table and load nine records with VECTOR(2048, FLOAT64). The data is generated from some of my experiments with vectors.

tst_unload.sql:  Demonstrate the working (or error) of unload-as-inserts using SQLcl. This script will run the actual demo and possibly hit the "error" when it runs the script generated from the unload-command.

If you want see the two unloaded files: 

The one with "working" inserts and 

The one that generated the error..


Happy Scripting.



Friday, 22 November 2024

Oracle 23ai - Converged Database, playing with images and text

TL;DR: Investigating the use of VECTORs in Oracle 23ai, I stumble across all kinds of (python) models. Some of those pre-trained models think they can recognise my photos. Some Surprises. But I am still nowhere near my end-goal: text-searching my pictures. But having the data in Oracle 23ai helps: I can use SQL.


Background

I'm on a quest to search through my own photos using a text-query, like "photos with pylons in them?". So far I have successfully managed to store images and VECTORs generated from the images. Searching image-to-image works OK: give it an example-image like pylon.jpg, and it finds similar, resembling images. But it is hard to find a model (a vectoriser?) that allows text-search.

In the process, I'm finding other interesting things about "Artificial Intelligence" and the use of python and Oracle. For example, there is "AI" that can generate captions, descriptive text, to the images you give it. So, Let's Try That.

But let me insert some caution: Automated captions, Labels, are not always correct. Some of them become funny. Hence...



Yes, the warning is there for a reason.


Caption-generation: halfway to searching the images?

If an "AI" can generate descriptions of my pictures, that would be a step in the right direction. And I came across several models that pretend they can. A lot of my searches lead to the Hugging Face site (link), where you can find descriptions of the models and example python-code. And where the necessary "data" from the pre-learned model can be (automatically!) downloaded.

Not all models can be made to work as easily as the blogs/pr-posts or example-code suggest, but I got a few to work fairly quickly. 

Disclaimer: A lot of this is "black box" to me. And when someone asks me for details, I might reply: "I have No Idea, but it seems to work".


Found a model that "worked" for me

The one that seemed to work calls itself "Vit-GPT2-coco-en" (link)

ViT-GPT-2 : Vision Transformer - Generative Pretrained Transformer 2  

(Yes, I found later that there are a GPT-1 and a GPT-3 and -4. In my search, I just happened to stumble upon an implementation of ViT-GPT-2 that worked more or less effortlessly in my situation - I might explore higher versions later, as they look Promising).

I took the example code, did some copy paste, from this page (link), and managed to generate captions for my stored-photos. To use this model, I needed the following at the start of my py-file:

This tells python to use the transformers (a way to run a model, I think) and it sets three functions (function-pointers): feature-extractor, tokenizer and model. Those functions will be used to "extract" the caption. 

With careful copy/paste from the example-code, I constructed a function to generate  the caption for an image. The function takes the file-path to the image, and returns the caption as a string:

Notice how it calls the three functions defined (imported) earlier. The actual "magic" is packed inside these three functions, and inside a set of data, which is the "pre-learned" part of the model. On first call, you will notice that the transformer-functions will download a sizeable amount of data. Make sure you have a good+fast internet connection, as the amount of download-data can be considerable, sometimes several GBs.

For more information on the transformers, you can study the Hugging Face site. For the moment, I just accept they work.

Once I got it to work, I added a table to my datamodel to hold the captions, in a way where I could possibly run multiple models and store them in the same table based on the model_name.
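A sketch of what such a table could look like (all names here are my own, assumed ones):

  create table img_captions (
    img_id      number        not null,  -- points to the stored image/vector record
    model_name  varchar2(100) not null,  -- which model produced the caption
    caption     varchar2(1000),
    constraint img_captions_pk primary key (img_id, model_name)
  );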

(Yes, yes, I know, the model-column should relate to a lookup-table to prevent storing the name a thousand times.. Later!).

Then I just let the model loop over all my 19500 photos, and waited... Almost 10hrs, on my sandbox-machine, a vintage MB-Pro from 2013.

Note: I'm deliberately Not using any Hosting-provider, SaaS, or API-service, out of principle. I try to keep my source-data, my pictures, on my own platform, inside my own "bricks and mortar". I fully appreciate that a hosting-provider or cloud-service, or even an API, could do this better+faster. But I'm using my own "Infrastructure" at the moment. I even went so far as to run the caption-generator for a while without any internet connection, just to make sure it was "local".


The Result: Captions.

The resulting text-captions turned out to be primitive but often accurate descriptions of my photos. Here are some Queries on them:
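In the spirit of (names as in the sketch above; the model-name value is just an example):

  select caption, count(*) as cnt
    from img_captions
   where model_name = 'vit-gpt2-coco-en'
   group by caption
   order by cnt desc
   fetch first 15 rows only;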

It is clear that most of my pictures contain a "motorcycle" in various locations. No mystery there. Let's clean that list by adding

  Where not-like '%motorcycle%'

Now the highest count is of some "metal object with hole"? Some folks may already guess what that is, as I ride a certain type of motorcycle...

But more in detail, let me go back to a previous blog (link): I had identified 5 pictures with "powerline-pylons" on them, and the vector-search had correctly identified them as "similar". What would the caption-algorithm think of them? Let's Query for the captions of those 5 pictures:

Hmm. Only one of the five is correctly identified as a power-line. Check the pictures on the previous blog. The two findings of a kite are explained from the triangular-shape combined with the lines leading to them. The train... was not a train, but could be mistaken as such, sort-of.

Main Finding (and I checked other pictures as well...): There are a lot of Wrong-Captions in the list. Doing a check for "false positive" or "false negative" would be rather time consuming. So while some of the captions were Surprisingly Accurate, there is a lot of noise in the results still. Hence the "Warning".

Oh, and that "metal object with a hole"? It was the Oil-window on my engine... I tend to hold down the phone to take a picture of it Every Morning to have an idea of oil-consumption between maintenance. I mostly delete the pictures, but when I keep them, they look like this:

Correctly Identified! And no worries, the modern engine of a BMW-R1250RT uses virtually No oil between maintenance. But it is a good habit to check. 

There were several other Surprising "accurate finds" and a lot of "funny misses", but that is not the topic of this blog. The use of (this-model) caption mechanism is helpful, but still rather crude, and error-prone.


Learnings.

Python: At this stage, I learned about python (and Oracle). Not much vector-logic happening at my database, yet.

Models, AI..: I learned there is A Lot More to Learn... (grinning at self...). Others are exploring the use of ONNX models which can be loaded into the database (link). I am sticking to python to run my models, for the moment.

Hugging Face: I got quite skilled at copy-paste of example-code and making that work. The HF site and how it enables use of models was quite useful to me (link again)

Detail on housekeeping, cached data: The models download a lot of data (the pre-learned knowledge, possibly a neural-network?). By default this will be in a hidden directory under your home-dir: ~/.cache/huggingface. Relevant in case you want to clean up after testing 20 or so models. My cache had grown to close to 100G. It is only cache, so if you clean up some useful data by accident it will re-download on the next run.

Generated Captions: Sometimes Surprisingly Accurate, sometimes a bit hmmm. The results might say something about my picture-taking habits, not sure what yet, but I may have to diversify my subjects a little. Too many pictures of motorcycles.

Converged Database: Give it any SQL you want.

Having the data in an RDBMS is still a large blessing. Inside a Converged Database (link), you have the liberty to combine data from any table to look for the information you need at any given time. And I Know how to formulate SQL to gain some insights. 

SQL Rules!


Concluding...: Not there yet.

My original goal was (still is) to enter a question as text and have the system report back with the possible resulting images. It seems That is going to take a lot more googling and experimenting still.