Friday, 20 December 2024

Oracle 23ai, Domains, Enums and Where-Clauses.

TL;DR: In Oracle 23ai, the DOMAIN type is a nice addition for implementing more sophisticated constraints. Someone suggested using a DOMAIN of type ENUM as a column constraint, acting as a List of Values (LoV). Works like a charm. But the behaviour of some where-clauses may not be what you think. Let me show you some very simple examples, and some workarounds, both bad and Good.

A prism, just because I used colors as an example.


Background

After seeing a presentation on the use of ENUM-Domains as a List of Values on a column, I did some scripting to see how it could work (previous blogs here + here). Now I tried to use it in where-conditions.

So I have a few domains, one of which is color:
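For reference, the definition goes something like this (a minimal sketch; with no explicit values, the rainbow colors get the default enum-values 1 through 7):

  create domain color_enum as enum (
    red,      -- enum_value 1
    orange,
    yellow,
    green,
    blue,
    indigo,
    violet    -- enum_value 7
  );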


And I created a table that uses the domain in a column-definition, like this:
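Roughly like this (table- and index-names are mine, just for illustration):

  create table t_colors (
    id     number generated always as identity primary key,
    color  color_enum
  );

  -- index the color-column, just like I would index an FK-column
  create index t_colors_color_ix on t_colors ( color );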

The column is defined as type "color_enum" and thus inherits the LoV defined in the Domain. As I consider the color-column to be akin to a Foreign Key, I have already created an index on it.

When I ask for the INFO of the table, using SQLcl, this is what I get:


The column COLOR apparently got defined as number(38,0), i.e. an integer, because it cannot hold decimals. But the same column also has the DOMAIN specified as COLOR_ENUM. The info-command also shows I have indexes on both the ID and the color-column.

Note: I'm using SQLcl (link) here. In SQL*Plus you would need to do a DESC of the table. Latest versions of SQL*Plus will show the number + domain, but older versions will just show the column as number(38,0).


Do some Inserts

We need some data in the table so let's...
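Something along these lines (assuming the t_colors sketch from above):

  -- use the enum-name, effectively a constant...
  insert into t_colors ( color ) values ( color_enum.red );
  -- ...or the underlying numeric enum-value (3 = yellow)
  insert into t_colors ( color ) values ( 3 );
  commit;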

Notice how I can use either the enum.name, which is effectively a set of constants, or an actual numeric value to insert data. I did not have to visit an LoV table to get (read or join) my values.

Just curious: how does it handle "wrong values": 
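For example (there is no purple in the domain, so this statement throws an error):

  insert into t_colors ( color ) values ( color_enum.purple );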

There is no "purple" defined in the Domain-Enum and an error pops up. But the description of the error is a little vague in my opinion. It seems to read "color_enum.purple  as a reference to a column. I think the actual error here is more of an "invalid value" or "check constraint violated". But I can see why the error is what it is.

And what if we try to insert a number out of range:
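Like this (42 is outside the valid enum-values 1..7):

  insert into t_colors ( color ) values ( 42 );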

When trying to insert a number outside the valid range, the error is clearer. It refers to a "check constraint" and it mentions the DOMAIN. This should give even a developer enough clues to fix it.

And we have proven that the Domain acts as both Check-Constraint and LoV. It is very similar to an FK (as tested in previous blogs as well).

Next is doing some Retrieval.


Selecting from the table.

The result of our inserts is a table with the following data:

No surprises here. But notice again that the columns "id" and "color" are numbers, and we need a display function to obtain the "name" of the color.

Now suppose we want to search for records by the name of the color. We need to put the condition against the domain_display():
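The query then looks something like this sketch:

  select id, domain_display ( color ) as color
  from   t_colors
  where  domain_display ( color ) like 'RED%';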

Notice the Table Access Full. But also notice the filter: the condition on the display-function. It seems to employ a CASE-WHEN construct to pick up the values from the DOMAIN.

I suspect that using the function to find values like 'RED%' is akin to a type-mismatch (e.g. comparing varchars to numbers), which stops us from using the index on the numeric column color. And a Full Table Scan can be a bad thing on large tables.

It works much better if we use constants or plain numbers to look for our records:
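Both of these variants compare numbers to numbers (a sketch; the constant resolves to its numeric enum-value):

  select id from t_colors where color = color_enum.red;
  select id from t_colors where color = 1;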

If the type in the comparison matches the indexed, numeric value, the query can use the index. And this is the desired outcome.

The Warning is Clear: Do Not Use that function in a search-criterion. Unless...


Function Based Index?

Because domain_display() is a function, we could consider adding a Function-Based Index. Let's try that:
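In my case that was something like this (the index-name is mine):

  create index t_colors_disp_fbi on t_colors ( domain_display ( color ) );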

That works. With the FBI, the where-clause has an index that matches the domain_display() expression, and uses that index. Maybe this is what you need. But I am generally hesitant to add more indexes. I would prefer to use matching types in the where-clause: constants, or other numeric conditions.


More Exotic: join the DOMAIN:

For a more exotic idea: you can use the ENUM-domain as a record-source in the from-clause, and add a join to the query. Not that strange, because this is what you would do when using an FK-Constraint anyway:

And it works. Apparently I can use the domain as a record-source with no problem. Let's have a look at the explain-plan:

There is a list of "filter" operations on the domain, but the query does use the intended (numeric) index. This is how you would do it if the constraint was a classic FK to an LoV table. And that still works as well.


Summary: DOMAIN-ENUM can be used as LoV.

The DOMAIN can act as LoV and FK. It brings both the "check constraint" and readable "constants" for use in code, without the need for a separate table to keep the values. In some cases, this will bring you a reduction of IO-effort and make your queries more efficient (see earlier blogs). And the same limitations apply when searching for the "description" of the LoV.

Advantage or not: you cannot mess with a Domain as easily as with a table. A DOMAIN cannot be edited in the same way as an LoV table (which I have been known to edit in a live system...).

Furthermore, I showed a workaround with an FBI. But I do Not Recommend putting Function-Based Indexes on domain-columns indiscriminately. Every extra index should be given careful Consideration.


Oh, and: RTFM

There is a lot more to DOMAINS than just LoV. Domains can/should (IMHO) be used for sophisticated constraints on one or more (groups of) columns. Do go check the Documentation.

Or check this page by Tim for copy-paste usable examples. After all: RTFM = Read Tim's Fantastic (summary of the) Manual (link).

Enjoy.


-- -- -- -- -- -- end of blogpost, appendix below -- -- -- -- -- -- 

Appendix

Script to replay the demos..

demo_enum_5where.sql : just hit enter several times to walk through the demo.

Happy Coding (Tip: just use SQL wherever you can!)


Tuesday, 17 December 2024

Oracle 23ai, Domains, Enums, use-cases.

TL;DR: The new DOMAIN type in Oracle 23ai can be used as LoV. And in some cases that can save you effort: it does not have to visit the FK lookup-tables. I'll show some examples, and I'll demonstrate a few pitfalls to avoid.


Background.

It all started when someone suggested using DOMAINs of type ENUM as an alternative for FK-tables with List-of-Values lookup-data. For example, if you want your table to have a field "color", limited to just the colors of the rainbow. You can use a check-constraint (simple!), or a table with color-values and an FK-constraint, or, in Oracle 23ai, a DOMAIN. We will compare ENUM-domains with FK-LoV tables.


Datamodel: a fact table with two LoVs

For this little demo I will use an old-fashioned datamodel with two lookup-tables and one long fact-table. And the queries on that 3-table model will be compared to the same queries against a single table where the LoVs are held in two DOMAINS of type ENUM.

First the traditional model; it looks like this:


(I love a primitive diagram...) 

And the Fact table is defined as follows:
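In outline it is something like this (a sketch; names reconstructed from the statistics further down, and the payload-column is a stand-in for the real data-fields):

  create table t_color_lov (
    id    number primary key,
    name  varchar2(20) not null
  );

  create table t_roman_lov (
    id    number primary key,
    name  varchar2(20) not null
  );

  create table data_2lov (
    id       number primary key,
    color    number references t_color_lov,
    roman    number references t_roman_lov,
    payload  varchar2(200)
  );

  create index data_2lov_color_ix on data_2lov ( color );
  create index data_2lov_roman_ix on data_2lov ( roman );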


All very classic: a data-table with LoVs and Foreign Keys. The FKs are the "constraints" that ensure correct data. I also put indexes on the FK-fields, but you could omit those.

Now if you want, you can replace those FKs with DOMAINS of type ENUM and use that for the column-definitions:

I started by defining two DOMAINS; the color-one looks like this:


I created a similar domain with 1400 Roman Numerals (too long to show, but there will be a complete re-play script at the bottom of the blog). 

Now create the table and use those domains:
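Sketched (again, names are mine):

  create table data_2enum (
    id       number primary key,
    color    color_enum,
    roman    roman_enum,
    payload  varchar2(200)
  );

  create index data_2enum_color_ix on data_2enum ( color );
  create index data_2enum_roman_ix on data_2enum ( roman );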


Note how two of the columns have their type defined by a DOMAIN-ENUM. There is no FK, and you can leave out the LoV tables completely. But the values for the columns are now Constrained by the domain-definition. For good measure, I created indexes on them, as I expect to be "searching" for those at some time. 

We filled both tables with identical data, and gathered statistics on them. Now we are ready to do some comparisons.


Aggregate Queries

Let's try to aggregate some color-records, and use a query that _needs_ to visit the table (otherwise, the optimiser could sneakily use just an index):
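For example (a sketch; the max(payload) forces the optimiser to actually fetch table-blocks):

  select c.name, count(*), max ( d.payload )
  from   data_2lov   d
  join   t_color_lov c on c.id = d.color
  group  by c.name;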


Access is via index-full-scan and table-access-full. The join is done by hash-join (more efficient than a Nested Loop in this case). Looks like expected behaviour. I happen to know that the LoV table t_color_lov is 8 blocks, and the data-table is 768 blocks in size. Check the statistics of the query:


Just over 700 gets. More or less as expected as there will be some empty blocks in the table. 

Now try this using the table that Does Not have FKs, but uses the DOMAINs. First the explain:
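Along these lines (a sketch):

  select domain_display ( color ) as color, count(*), max ( payload )
  from   data_2enum
  group  by domain_display ( color );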


We see 1 Full-Scan of the table. No other tables involved, because the constraints are done by the DOMAIN, not by FKs. 

The query-stats look like:


Maybe somewhat to our surprise, there is little gain here. Both queries are equally efficient. The 7 records for the color-LoV probably fit into 1 block, hence the "enum" case consumes just 1 block less.

[ soap-box: ]

Now, YMMV, because I have heard of people who got bitten by "too many LoVs, too many joins". And they claim to fix this with either check-constraints or by simply removing the FK-constraints. But I would Strongly Suggest examining your explain-plans first. The optimiser is well capable of scanning the small tables just once and using hash- or merge-joins to avoid (nested) loops. Domains are, IMHO, not the way to solve those bad queries (but they could, potentially...).

[/steps off soap-box]

I also think Connor McDonald, and others, have previously proven, sufficiently, that: Joins, Good Joins, Well-Done Joins, ... Are Not Expensive.

And in the case of relatively Large data-tables, and retrieving "a lot of records", the elimination of Joins to small tables by using DOMAINS will only yield very limited benefit (1 block out of 700...). 


Single-record retrieval queries.

Let's compare what happens when we retrieve just a few data-records. The relative overhead of joining to LoV tables might weigh a little heavier in those "precise" cases. We are retrieving two records and want to see color and roman (i.e. get those from the LoV or from the domain):
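Something like this (the id-values are made up for the illustration):

  select d.id, c.name as color, r.name as roman
  from   data_2lov   d
  join   t_color_lov c on c.id = d.color
  join   t_roman_lov r on r.id = d.roman
  where  d.id in ( 1001, 2002 )
  order  by d.id;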


We notice the unique scan to retrieve the two records from data_2lov, a full-scan on the color_lov (7 records, effectively just 1 block) to pick out the 1 record needed, followed by the use of the unique index on roman_lov to retrieve the roman numerals, one for each record. The predicate information confirms this.

Then we check the statistics, they show 13 block-gets:


By many criteria, 13 block-gets is an Efficient statement. 

Note for SQL-geeks: Yes, there is room for minor improvements, by using a "covering index" on the data-table, or by using IOTs. Maybe, with some extra indexing, this query could be brought back by another 3 block-gets (eliminate 1 step for every table) to 10 gets. I consider that out of scope for now, as not a lot of systems seem to go that far in optimising. But for critical systems I would recommend the use of IOTs (Index Organized Tables, docu here) for the LoVs, as sketched below. Should we add a demo of Extreme-Indexing...? (just for the fun of it...?). Maybe Later (reserve space for link to script).
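For reference, turning an LoV into an IOT is a one-keyword change (sketch):

  create table t_color_lov (
    id    number primary key,
    name  varchar2(20) not null
  )
  organization index;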


Now let's retrieve the same info from the data-table using the DOMAINS:
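Sketched:

  select id,
         domain_display ( color ) as color,
         domain_display ( roman ) as roman
  from   data_2enum
  where  id in ( 1001, 2002 )
  order  by id;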

The SQL and the Explain-Plan show only One Table involved in this query. The color and the roman are retrieved via the domain_display ( column ) function. For good measure, we added the same order-by as in the previous query.

The Statistics now show :


Only 6 gets. Because we only need to use the PK to retrieve each record. Probably two index-blocks (index-root and 1 branch) and one table-block (by index rowid) for each of the two numbers in the in-list. Total of 6 gets.

Hence, even though the original effort per query was low (13 gets is an efficient query by most standards), the savings are Significant: We reduced the gets by half. And if you have more of those LoVs per record, those savings can add up. 

Note: Possibly you will get even more benefit, notably if your datamodel is missing some indexes, or if you have stale/wrong/missing statistics whereby your optimiser is "deteriorating" your query-plans. In those cases, the elimination of LoV-joins will (appear to) help. But in those cases, you have other (datamodel + optimiser) problems!

But in short: especially complex or high-frequency SQL may Benefit from the elimination of some joins through the use of DOMAINS. QED.


My interpretations (and ymmv...) 

1st: For Large, Aggregate-type queries, there is No big Advantage.

On large queries, where you scan significant parts of the "data" or fact-table, the additional, single, full-scan of the lookup-tables is probably negligible: small compared to the effort of scanning the data-table. Hence if your work consists of large, aggregate-type queries, there is little gain in avoiding LoV tables. Provided you have your indexes done well and your statistics up-to-date. Also c.f. the Soap-Box statement above...

2nd: Possible Gain on fetching small-sets or Single-Records. 

On retrieval of small sets, or individual records, especially if you have to retrieve many LoV-values, the additional table-access for the LoVs can be a larger part of your query-effort. And here you can save on the additional hop(s) to the LoV table(s). This can be beneficial if your work consists of a lot of small, OLTP-style SQL-statements that need to access LoVs Many and Often. Again, provided you have your indexes Done Correctly.

3rd, Note: I have some remaining notes on using domain-expressions in where-clauses as well. Maybe put those in a followup to prevent a "too-long-read". 


Closing Remarks. 

In summary, I think the use-case for "reducing effort on LoV" should be focussed on the cases where the LoV is a relatively large part of the total effort. That is mostly (if done correctly) the case when retrieving small sets or doing high-frequency, quick verifications.

So in specific situations there Is a Good Case for using DOMAIN-ENUMs.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- 

demo_enum_4select.sql : the complete script, just hit enter 9 times.

demo_enum_4extreme_index.sql : remind me to add this if I get there...


Monday, 16 December 2024

Oracle 23ai - enums, what is behind the scenes?

TL;DR: The DOMAINs in Oracle 23ai are an additional way to add constraints to your datamodel and you should investigate this to make your database "Smarter". When testing the simplest of domains, the ENUM, they appear Quite Efficient. And as I can never resist trying some extreme cases, I did hit a limit as well.


Background 

Domains are introduced as a way to define complex and sophisticated (check-)constraints, and to (re)use those in multiple places in a database. Enforcement of the constraint is by the RDBMS, hence this keeps your Logic as close to the Data as possible and avoids re-coding things in various application-components. Domains can be Very Useful in this respect. You can start your own investigation here (link).

This blog is about the simplest possible case: a List of Values (LoV) constraint. For a relatively simple LoV we have the choice between using a Domain of type ENUM or an FK-constraint to limit the values for a column. The nature of an FK to a lookup-table is very much "Relational", as you can define the valid options in a separate table and let the RDBMS enforce the constraint. The nature of a DOMAIN is more declarative, and resembles a sophisticated CHECK-Constraint.

Note: A word of Thanks to Chris Saxon, who did the presentation that triggered me to doodle the scripts that ultimately lead to this set of tests. It was Fun to do.

A reference to Asterix, because of the Roman Numerals in this blog...


Testing, a small and a large Enum.

I could not resist, so I created two domain-ENUMs and did some SQL on them. Here is what I did (I'll put links to full scripts at the bottom..).

My first Enum, a list of colors:


That was Easy. And now for a really long one: the list of valid Roman Numbers in Oracle...
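Typing 3999 roman numerals is not an option, so the enum-list itself was generated. Roughly like this (a sketch: spool the output and wrap it in "create domain roman_enum as enum ( ... );"):

  select trim ( to_char ( level, 'RN' ) ) || ' = ' || level
         || case when level < 3999 then ',' end as ddl_line
  from   dual
  connect by level <= 3999;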



This shows I can potentially turn an LoV table of 4000 values into a "DOMAIN", and maybe gain some performance by avoiding a join to the lookup-table?

And after creation, you can query them as normal tables. Let's see how that goes: the list of values can be SELECTed and sorted at will:
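For example, this just works (using the color-domain from above):

  select * from color_enum;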



Because the queries on the rather large "roman_enum" were a bit sluggish, and as I am Always curious, I checked on the effort behind the scenes. The autoexplain looked OK: no buffer-gets or disk-IO:



This query is causing a lot of round-trips because the result-set of 3999 records has to be pushed to the client, but Still no buffer-gets or disk-io visible. Hence operations on DOMAINS should be buffer- and IO-Efficient. Promising. 

But queries on the large domain did use up a lot of cpu-time, which could be a warning. I ignored the warning at that time, because a full-query on such a large domain is probably a weird case anyway.

Here are the statistics from v$sqlarea in the shared pool; they showed me similarly good information:



The Good Part: no logical or physical IO. That two of the queries showed buffer-gets was apparently due to them being the first queries against the DOMAINS after the purging of the shared pool (done to allow a clean measurement of the statistics).

But the fact that the "composition" of the relatively large set for roman-numbers had taken a lot of CPU- and Elapsed-time was "interesting".


Now use DOMAIN on Columns in a Table:

To get some real usage, I needed tables with columns that use the (enum) domains in their type-definition. So I created two very simple tables:



Notice how the 2nd column of each table only has a definition that refers to an enum-domain. There is no other information to say that the values in that column are actually numeric (in my cases). If you RTFM deeper, you will find that the actual enum_values that go in are not limited to numbers. But I have not tested that (yet).

But let's insert some data and see what it looks like. First the colors...



As you can see, you can insert data using either a number (e.g. the enum_value) or the "constant" from the DOMAIN (e.g. the enum_name). And on query, the value is reported as the enum_value (similar to reporting the key of an FK table). If you want to display the enum_name, you need to use a function:

Select DOMAIN_DISPLAY ( enum_column ) from Table;

The function then uses the domain defined for that column to "look up" and display the description. The lookup or display does not seem to incur additional IO; it will probably need a few extra CPU-ticks. I might test that later.


A Quirk to display the large domain?

Here is what happened when I queried the table with the roman_enum, first using select-star:



No surprise (yet). 

You can see I used numbers that could represent years, and because of the domain-definition, each enum_value or "row-number" corresponds to the enum_name with the equivalent roman numeral. I know some people like to use Roman Numerals for years, hence my use of this domain.

But now let me try to display the "roman numeral value" using the DOMAIN_DISPLAY function:



Resulting in an Error ?? 

Also, the error was "instant": there was no noticeable (cpu-)time consumed before throwing the error. And the error always looked like a syntax- or typo-style error. Well, as far as I could see, there were no syntax errors in my queries...? (I am notorious for typos, so I did check several times, and re-typed the statement in a few ways; no relief, just errors while trying to retrieve my "roman number"...).

Almost as if the RDBMS told me I should just revert to TO_CHAR( rn, 'RN'), which is a lot simpler anyway.
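Which, for the record, does the job in a single call (the 'RN' format mask covers 1..3999 and pads the result to 15 characters):

  select to_char ( 2024, 'RN' ) as roman from dual;
  -- MMXXIV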

I defined a few more, different, smaller, enums, which all Worked Fine... ???

Then I did a little searching (in my own database, the datamodel, the data-dictionary), and found that the data-dictionary has a view "USER_DOMAINS". It contains a CLOB-column called DATA_DISPLAY, in which there is a CASE statement that seems to compose my enum:



I then got suspicious and checked the size of that column, using dbms_lob.getlength() for my domains:
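Using something like:

  select name, dbms_lob.getlength ( data_display ) as display_len
  from   user_domains
  order  by display_len desc;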



Hmmm. I went back and cut my roman-numeral domain in half, and in half again, and then did some testing to find the max size where domain_display() still worked...

And the workaround was more or less as I suspected: a Smaller Domain. In my case, if I got the size of the clob below 32K, I could display the data. Here is a small demo, contained in a separate file: demo_enum_size.sql:


Voila. That works. But in this case I seem to be limited to about 1400 values.


Summary: Domains are Useful, but Test!

In practice I would probably never use a DOMAIN that large. And there are other practicalities: you can not (yet) alter a domain without (probably) causing a window-of-non-constraint, or some outage. This needs more Testing (...). In general, for such long lists, I (still) think a real LoV TABLE with FK is more appropriate. And I know an LoV table can be "edited" in a LIVE system, BTDT ;-).

But the fact is that DOMAINS have advantages too. Domains are declarative, they do not need access to "data" (buffers or disk), and they can implement constraints that are much more sophisticated than an old-fashioned "check". That makes them potentially Very Interesting. I should test more...

So maybe TBC.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Appendix: Scripts used to compose this blog..

demo_enum.sql : self running, just hit enter some 15 times..

demo_enum_2size.sql: self running, shorter, just to work around the error. 


Edit: Almost forgot:

demo_enum_io.sql : to display statistics from v$sqlarea, show the CPU + IO

Friday, 6 December 2024

Oracle 23ai - unloading of vector data.

TL;DR: For demo and repeatable test-cases, I like to use scripts to set/reset my test-cases. Vectors can pose a problem here because they are not always easily "scriptable". Not Yet anyway. SQLcl can help with the unload commands.



Pop-in an image of "unloading", because I am in the end just a farm-boy who likes tractors and combine-harvesters and stuff...


Background: Scripts, Reproducible cases.

Always looking for ways to quickly test or demo something, I am a Big Fan of script-tools like SQLcl and SQL*Plus (and bash, and sometimes python). Good Demo- and Test-cases should be easy to reproduce, on any platform, even 10yrs later. The oldest "test" I got is a few scripts and one bench.dmp file from 2001, created with Oracle v8.1.7, and it is still "usable" (not very relevant anymore, but still usable, that is Backward Compatibility for you...). 

The creation and manipulation of test-data for VECTOR columns sometimes poses a problem. See also earlier blog (link) where I pointed out that the (ancient) exp/imp tools do not handle VECTOR data, but datapump does.

This blog is partly a call for help to Oracle (Hi Jeff!), to facilitate the simple spool+reload of Vector-Data. I'm fairly sure there are several solutions, but not sure which ones to explore or use in my specific cases.


This Worked: SQLcl unload (as insert-into)

SQL*Plus and earlier versions of SQLcl had given me several errors, seemingly treating the VECTOR as a CLOB or LOB. But hey: Use the Latest Version! Right?

Using SQLcl version 24.2.0, I found the following to work, using the unload command:
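From memory it was along these lines (t_vectors is a stand-in name for my test-table; the key part is SQLcl's insert-format):

  set sqlformat insert
  spool unload_vectors.sql
  select * from t_vectors;
  spool off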


This generated a runnable sql-file that would correctly re-insert my data, even when the vector itself was (much) larger than 4K. My Tested case used vectors that were still under 32K. Might try bigger ones later.

The secret was, apparently, in the trick of splitting the vector into smaller chunks, each occupying a line of just under 512 characters (linesize 512?). Each line was a set of numbers converted with TO_CLOB ( ... ), and concatenated with its peers. The resulting CLOB was inserted into the VECTOR column, and this gave the desired result: my vectors were re-inserted. This is a screenshot of the unloaded file:

Notice the repeated TO_CLOB(...) statements that are concatenated together to form the complete vector (which is apparently the equivalent of a clob...). 
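In toy-size, the generated inserts had this shape (a sketch: the real column was VECTOR(2048, FLOAT64) and each TO_CLOB chunk was just under 512 characters):

  insert into t_vectors ( id, v ) values (
    1,
    to_clob ( '[0.01,0.02,' ) ||
    to_clob ( '0.03,0.04]' )
  );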

Note: If you look inside user_indexes and user_segments, you can see that vector-columns have corresponding LOB-segments and LOB-indexes: storage-wise they are apparently treated similarly to LOB/BLOB/CLOB columns.

I was relatively happy with this "concatenate-clobs trick" for several reasons:

. The unload generated correct, runnable scripts.

. No intermediate (staging) tables or additional pl/sql (helper-) procedures needed.

. Even SQL*Plus was able to insert this data. I had a Very Portable Demo!

But there were some clouds on the horizon...


Possible Problem with a newer SQLcl version?

Then, I ran into an error after downloading SQLcl v24.3. The unload-command had changed slightly: it now generated the insert with the vector as one single literal.

Hence, unloading with SQLcl version 24.3.2.0, build ..3301718, then trying to reload resulted in this error:

ORA-01704: string literal too long...

When I opened the unloaded file, the 3rd line in the sql-file was now 30.621 characters long (hm... just under 32K, quite the coincidence - I should try with a longer vector some time...). There were no more TO_CLOB (...) concatenations, and the vector was just a quoted string of values, looking like this:

This simple quoting of the vector will work for small vectors (below 4K or even 32K), but not for seriously large vectors where the string-representation surpasses these limits. Also, I am not sure what would happen to binary vectors: how would they be represented as a quoted string? (More stuff to test!.. I know, I know, but Time... Maybe later).


Final notes: Multiple Solutions.

From working with LOBs and CLOBs myself a long time ago, I know how tricky these things can be: they mostly require the use of some language other than just SQL(*Plus) to manipulate them. They need to be split into chunks that are manageable by whichever language/system/host is doing the processing. But as a simple SQL user who just wants to run easy test- and demo-scripts, I am hoping the solution will remain Simple.

By documenting this "case" I hope Oracle can either fix this (by reverting back to using multiple TO_CLOB?), or work around this problem in some other way.

One suggestion I got in a private message was to try using JSON: get the VECTOR in some JSON format and use that to unload/load the data, possibly in smaller chunks. And of course you can also revert to additional PL/SQL, Java(script) or Python to manipulate vectors. But all those solutions involve some additional (bespoke) coding on my part. And I just want Simple Demo-cases.

Suggestions Always Welcome ! 


----- End of Blogpost ------ 


Attachments, Addenda (just some files)


Below are several links to files to test/demo for yourself:

ini_unload.sql:  Set up the test-case, create the table and load nine records with VECTOR(2048, FLOAT64). The data is generated from some of my experiments with vectors.

tst_unload.sql:  Demonstrate the working (or error) of unload-as-inserts using SQLcl. This script will run the actual demo and possibly hit the "error" when it runs the script generated from the unload-command.

If you want to see the two unloaded files:

The one with "working" inserts and 

The one that generated the error..


Happy Scripting.