Monday, 7 April 2025

Oracle 23ai and python: some simple code.

TL;DR: Python and Oracle 23ai: A few helpful ideas, programs and code-snippets that made my life with python and Oracle easier. This one is about Connecting and Connection-info.


Background:

Below are some of the small pieces of code that I split off into separate py-files and that I include in most of my (oracle-db connected) programs.

Maybe they can help you.

As with every language, after a while you develop habits and small tools that (hopefully) make you more productive or more secure in your work. In my case, with the combination of python and Oracle, I have only built very simple programs. And I do nearly all my work just from a terminal on either linux, mac or sometimes even windows. I use vi (and occasionally notepad or an IDE) as editors, and often still use print/f or dbms_output statements for debugging (showing my age eh...).

Last year I built a small python program to print out the contents of vectors stored in Oracle (blog on "victor" here...). And I have a few more "tools" that I built to help in my occasional-program or in my regular dba-ish tasks.

Caveat 1: Of course you will build your own. I know ppl re-use code much better when they first type it themselves. I Know you will do this differently, and you will do it Much Better.

Caveat 2: Don't expect high-tech: What I did with python is very simple, most of it is "pushed to the DBA" type of work. Things that I would call "plumber-stuff": processing incoming files, updating some meta-data like LoV lists, producing simple interface-data, or very simple reports-to-mail. The kind of stuff that never seems to go away, no matter how sophisticated the new ecosystems become.


Environment-settings and other variables: python-dotenv.

As most programs have to work in various environments, I needed a way to quickly set or obtain the various variables for LOCAL, DEV, TEST, PRE and PROD. The easiest solution turned out to be a tool called python-dotenv.

If you don't know of this tool yet, read the intro (link), and play with it.

I am still grateful to Gianni and Jeff and a few others who tipped me about this gizmo. Thx Folks!

In my case, it started by keeping the oracle-connect information plus a few other more system- or env-specific variables that I need to distinguish between various environments. It also coincided nicely with the preference of my cstmr to "keep stuff in config files".

And suddenly I found python-dotenv very tempting for multiple use-cases, and I started to hoard a lot of "settable variables" in the file. Very Useful.
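
To give an idea: a minimal sketch of how I use it (file-names and variable-names here are just illustrative, yours will differ):

  # .env.dev -- one such file per environment (LOCAL, DEV, TEST, PRE, PROD)
  DB_USER=scott
  DB_PASSWORD=tiger
  DB_HOST=db-dev.example.local
  DB_PORT=1521
  DB_SERVICE=devpdb1

  # python: read the file and pick up the values
  from dotenv import load_dotenv
  import os

  load_dotenv ( '.env.dev' )        # or just load_dotenv() for a default .env
  db_user = os.getenv ( 'DB_USER' )
  db_host = os.getenv ( 'DB_HOST' )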


Open a db-connection, use a function (part 1)

As all my programs relate to an RDBMS, I need all programs to "connect" to the database. The information needed to connect was easily stored in a dotenv-file, but the creation of a connection needed some attention. For several reasons:

- We want credentials to be separate from code. Not hardcoded (and this time we mean it, looking at you Scott and Tiger..). Credentials need to be both safely stored and easily modifiable (some systems still change passwords every 32 days...)

- We want to verify that a connection is to the Correct, Intended database. Preferably with a note to some logfile or stdout stream (see part 2).

The easiest part was to pick up the credentials from the dotenv-file. To this end, we created a function that picks up the connect-info (including tiger...) from an env-file. So far so good:

The first attempt looked like this:
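
A minimal sketch of that first attempt (the full ora_login.py is linked at the bottom of this post; the variable-names are simply the ones from my illustrative .env above):

  # ora_login.py -- sketch of the first attempt, reads connect-info from a dotenv file
  import os
  import oracledb
  from dotenv import load_dotenv

  def ora_logon ():
      load_dotenv ()                                  # read the .env for this environment

      db_user = os.getenv ( 'DB_USER' )
      db_pwd  = os.getenv ( 'DB_PASSWORD' )
      # easy-connect syntax: host:port/service
      db_dsn  = os.getenv ( 'DB_HOST' ) + ':' + os.getenv ( 'DB_PORT' ) \
                + '/' + os.getenv ( 'DB_SERVICE' )

      # show (some of) the info on stdout, just to verify -- but never the password
      print ( ' ora_logon: connecting as', db_user, 'to', db_dsn )

      ora_conn = oracledb.connect ( user=db_user, password=db_pwd, dsn=db_dsn )
      return ora_conn                                 # caller holds the connection-object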

Notice several things:

- We print some of the info to stdout, just to verify. Comment it out if you like, but I Want this check. More Later.

- We use easy-connect syntax.

- The function returns the connection-object, so the connection does not have to be a global object.

If you want more complicated connects, if you prefer to use tns-aliases, or complete tns-entries, or other types of dsn: that would require a slight re-code. But you can stick to the concept of using dotenv, as it is able to use multi-line variables. Hence you have a lot of options. I didn't have a need for more complicated connection-specifications yet, but in most cases, I would store the information in dotenv.
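
For example, a full descriptor can simply live in the dotenv-file as a (quoted) multi-line value; a hypothetical entry:

  # .env -- a multi-line dsn, if you prefer a full descriptor over easy-connect
  DB_DSN="(DESCRIPTION=
            (ADDRESS=(PROTOCOL=TCP)(HOST=db-dev.example.local)(PORT=1521))
            (CONNECT_DATA=(SERVICE_NAME=devpdb1)))"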


Open a db-connection, and Verify to stdout (part 2)

A very common mistake, and a rather Impactful Mistake, is to connect some infrequently used, adhoc-running metadata-update-tool to the wrong database... Ooops! (of course, in Your case, the passwords between Dev and Prod would be different, but some systems are made by Murphy... )

My previous blogpost on this was "how to set SQL-Prompt"..

https://simpleoracledba.blogspot.com/2023/06/do-you-know-where-you-are.html

And I wanted something similar for my python-programs.

Easy... In python, it is very simple to take any valid SELECT, have it executed, and print the results to stdout:

I inserted this snippet of code into the ora_logon function:

  cursor = ora_conn.cursor()
  for row in cursor.execute ( sql_show_conn ):
      print ( ' ora_login:', row[1] )

And now I can define sql_show_conn with any select I want, notably the ones I use in SQLcl and SQL*Plus to show my connection:

My favorite is:

  # customize this sql to show connection info on logon
  sql_show_conn = """
    select 2 as ordr
         , 'version : ' || substr ( banner_full, -12 ) as txt
    from  v$version
    UNION ALL
    select 1
         , 'user    : ' || user || ' @ ' || global_name || ' ' as txt
    from  global_name gn
    UNION ALL
    select 3
         , 'prompt  : ' || user
           || ' @ ' || db.name
           || ' @ ' || SYS_CONTEXT ( 'USERENV', 'SERVER_HOST' )
           || decode ( SYS_CONTEXT ( 'USERENV', 'SERVER_HOST' )
                     , '98b6d46dd637',    ' (xe-dev)'
                     , 'ip-nnn-nn-2-109', ' (PROD)'
                     , 'ip-nnn-nn-0-226', ' (ACC)'
                     , '2c4a51017a44',    ' (dckr-23ai)'
                     , 'b88d087b53d3',    ' (dckr-23.7)'
                     , ' (-envname-)' )
           || ' > '
    from  v$database db
    order by 1
    """


This is the equivalent of setting a fancy prompt in SQLcl to indicate "where you are" (link to old blog). 

This SQL, actually a UNION of three queries, roughly translates to: 1) show user and global_name, 2) show the version and 3) use a fancy decode to show which environment you are at, which I often use to list out dev/test/acc/prod and other varieties we use.

Because at a certain cstmr, the pdb-names, global-names and service-names are often identical across several environments, we use the container-name or host-name to identify "where we are", and decode that to some "meaningful name" like "PROD".

And you can make the verification-query as fancy as you want (or just keep it simple). In some cases, it was turned into a local linux-style MOTD, and in some cases into a complete Welcome-Message including some instructions and warnings.

You can add items to this cursor to your heart's content, and the union + order trick allows you to re-arrange the outcomes. Knock yourself Out (some devs really do!)

The result, in my case, comes out like this:
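
It is just three lines on stdout, something along these lines (an anonymised example; the user, global_name, db-name and version will of course differ in your case):

   ora_login: user    : SCOTT @ DEVPDB1.EXAMPLE.LOCAL 
   ora_login: version : 23.7.0.25.01
   ora_login: prompt  : SCOTT @ FREE @ b88d087b53d3 (dckr-23.7) > 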


The complete, include-able function, together with some test-code at the bottom of the file, can be found here: link to ora_login.py.


Summary: Feel free to re-use, but...

From thinking about and coding my logon-function, and from using dotenv, I got some good benefits.

But, big Caveat: Tooling is Personal, and YMMV.

Over time I have taken to using a lot of self-built tools and scripts, and some tools+utils I picked up from 't internet or other coders (and mostly modified for my own use).

I did find that it is difficult to get others to re-use tools that I crafted, or any tool from anyone else.

I have a few other snippets of python code that I (re-)use a lot, and I will see how those develop over time (if at all..). So far, if my code or just my ideas do inspire some of you to improve your code: Job done.

My main tip to others on code-reuse is generally: Read, Consider, Try, ... and in case of doubt:  DIY. 

The most (re)used scripts and code will probably be the "built by you" version.

Enjoy !

-- -- -- -- -- end of this blogpost, for now -- -- -- --

Monday, 10 March 2025

The Generic Tips (was: python and Oracle)

TL;DR: Some Very Basic Tips, to write better programs and tools. In my recent case: using python and Oracle. But in the handover to colleagues, I came to realise there are some tips+tricks that need re-iterating every few years. So Here Goes (again). 

My opinion is of course strongly "biased" towards an Oracle Database, but that is not always a bad thing.

The actual blog on my Oracle-python-tricks will have to wait a bit.. but it is in the pipeline (# mkfifo pythonstuff ; link to future blog...)


The things that tend to wash up on the desk of a DBA...



Background: Been There Done That (have T-shirts).

Originally, I set out to describe some of the python programs (functions, really) that I (re)used to make my life and that of my cstmr easier. As it turned out, we ended up discussing some "habits" and practices rather than the actual code. So just a few generic tips, which I also used to explain things to cstmrs. Hence I wrote that up before I dipped into my python code.

YMMV and "Never say Never": The tips applied to myself and my cstmr-situation. But I came across enough situations to say they apply in many cases. However, Your Mileage May Vary. So if you think they do not apply to you, I am fine with that. I'm Not going to go religious on it.

My content here applies to the type of (simple-) jobs that tend to wash up on the desk of a DBA: Interfaces, ad-hoc reports, occasional maintenance-jobs, loading "meta data", that sort of thing.

Also, do not expect High-Tech. In my case, with the combination of python and Oracle, I have only built relatively simple programs. And I do nearly all my work just from a "terminal" on either linux, mac or sometimes even windows. I use mainly vi, SQL-Developer (and occasionally some IDE or even notepad) as editors, and often still use print/f or dbms_output statements for debugging (showing my age eh...).



Non-functional "Requirements": try to include in Specification.

Requirements, Specifications, or whatever name you have, often state: "ingest these datasets" or "produce a list of gizmos+items in that format", or "return a set of records with this data..." . And that is expected. 

The un-spoken or so called "Non Functional Requirements" are often along the lines of:

1) Make sure it runs in future infrastructure (if we move to "the other cloud"...)

2) Allow running- and maintenance by our outsourced-vendor-resources.

2a) Include some (unit-, integration-) test capability, a testset...

3) Make sure it runs inside 1 min (One Minute, even if 10G of xml-data....)

4) Produce some output or report to allow control or verification (and keep that)

5) Allow only "authorised" ppl/tools to run, and/or prevent DoS situations.

 etc...

Those requirements are often unspecified but implied when you are asked to do "just this simple program". But it helps to write them down (briefly!) even if nobody wants to read them. You may also need some trace of those Non-Functionals to justify the choices that you make (e.g. why didn't you build this program in Go, in C++, or as a bash-script using just grep and awk?).

Note: The checklist of 1 to ... etc  above would be Tip-nr-0, if I was so pedantic as to start numbering at Zero...

So here come the "Generic Tips" (after weeding out half)...



Tip nr 1: Write it down, Briefly but Clearly

Why? Well, See above...

But also: the reader of your writings is probably yourself or your replacement-person in 3 years, or even in 20yrs (Yes.. I've seen them). Hence, it helps if you can introduce your program with a few lines of both the "functional" and "non-functional" requirements (what does it need to do!).

By writing (or just summarising) what your program needs to do, you can also "verify" it with the user, the architect or whoever is asking for it to be built.

You may also want to include some "justification", e.g. why you chose PL/SQL over Python or FORTRAN (hehehe).



Tip nr 2: Draw one or more ERDs. For better Understanding!

To understand a program: it rarely helps to read the source-code. But a look at the tables, or the ERD of the underlying data, can do miracles for understanding. I tend to create my ERDs from the tables in the database and rarely work with other models than the Table-Relationship-Diagram.

Even if you have very simple data, it helps to have an ERD. Not only will an ERD help you define tables (if you even use tables at all...) and the way those tables relate. But a brief documentation of each data-element will also allow you to communicate purpose and usage of the data-elements.

In creating the ERD you will most likely "discover" a few things about the data you process. I notably walk down the lines (relations) and verify things like .. "This order has zero, one, or more details...", and then ask "Is that Correct...?"

When you have to explain your program to colleagues, or have to ask "more questions" from the users: an ERD is a great tool to communicate, and much more readable than "open this source code" (the user may not have the dev-tools installed...).



Tip 3: SmartDB: Use. Your. Database (e.g. use SQL and PL/SQL)

If the data is moving into or out of an RDBMS: Consider using SQL and PL/SQL as much as possible.

Even if your organisation prefers using VMware virtual machines, or your architect prefers the use of "spot-instances", or k8s, or some serverless FaaS solution... If your data ends up in a database, that database is the Safest, the Longest-Lasting, and probably the Easiest to Maintain place to "store + run your program".

I recall my managers, in the 90s and 00s demanding things like "use FORTRAN to convert incoming data", or "only C will do this fast enough". In the end most of that code ended up being converted (reduced!) into PL/SQL with good results.

The excuses to not use a database are plenty:

- Database CPU is too expensive (maybe, but the database is often the most efficient place to process and keep Data)

- PL/SQL requires a DBA... (well, a pl/sql person, actually, so ..?)

- This is so simple, you should use grep/awk for this (ok, try that. and then introduce a small change in requirements... or start processing GBs of data...)

I must say: I have broken this rule of #SmartDB many times, often at the expense of myself and my cstmr. Nothing lives as long as the database...

Of course, YMMV. In some cases FORTRAN, C++ or python is advised bcse of available tools or libraries (as was sometimes my case). The main Valid reason to do processing outside of the RDBMS is, in my opinion, the use of certain libraries for specific functions (math, geo, or some specific engineering- or scientific-purpose library).

But In General, by solving the problems "inside the database", the code was the most efficient and the easiest to maintain (some links to the praise of #SmartDB...?).



Tip 4: Provide some Test Cases.

This may sound trivial, but it can really help. I'm not specifically advocating TDD (Test Driven Development), or some other "framework" that may require a lot of methodology (and work). But you need some test-cases, if only to prove that the original code you delivered still works.

Ideally, you have scripts (test_setup.sh and test_run.sh), for example:
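
In my python-flavoured cases, the "run" part can be as small as a sanity-check script; a rough sketch (the table-name and the expected count are placeholders, and I re-use the ora_logon helper from the connection-post):

  # test_run.py -- minimal sanity-test: connect, run one known query, check the outcome
  import ora_login

  def test_row_count ():
      ora_conn = ora_login.ora_logon ()
      cursor   = ora_conn.cursor ()
      # hypothetical check: the set loaded by test_setup should contain 42 records
      cursor.execute ( 'select count(*) from test_gizmos' )
      n_rows = cursor.fetchone () [0]
      assert n_rows == 42, f'expected 42 test-records, found {n_rows}'
      print ( 'test_row_count: OK' )

  if __name__ == '__main__':
      test_row_count ()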

And those tests should not "break" or otherwise affect your production data. Hence this needs some thought, and probably needs to be embedded into the processes of the customer. 

All cstmrs have Test-platforms. Some cstmrs even have separate platforms where they run Production systems.



Tip 5: Plan Liberally, every IT tweak takes more time than expected.

The best known book on this is probably "The Mythical Man Month" by F.P. Brooks. One of my early managers made me read it. Recommended!

My rule of thumb is to sometimes multiply the original time-estimate. For simple systems: multiply by 3.14 ( approximately pi, Archimedes...). For more complicated systems: multiply by 9.8 ( approx pi-square or g)

Most ppl can remember pi and g (earth gravity).  Because simply stating "I multiplied by 10" sounds rather un-scientific.



Tip x : Stick with the tool(s), unless...

I also tend to include a warning: stick with the tool (bash, python, sqlcl, pl/sql...). If every little-maintenance job completely re-writes the program (in Java, bcse that is the future standard... ), you end up with a hodgepodge of code. If some future coder Insists on re-writing (in .cpp using chatgpt...): think hard, and possibly find a coder that Does understand the original tools/languages it was written in.

When I was an engineering student we were taught some rules around robustness, and maintainability. Such as how to choose components or parts: 

- Fewer Components make a system more Robust.

- Fewer Different Components make a system easier to Maintain.

I never forgot those.. and those simple rules also apply to IT-systems.



Oops... Writing Too Much ? 

This text is getting way too long, nobody reads that much anymore (so I went back and weeded out half of it).

I could add several more rules. Notably around the fact that the Oracle RDBMS now includes several neat features that make your life easier, while still staying at the #SmartDB level, e.g. "With the Database". There are now options to process JSON, to build REST-services (recommended), and the use of APEX for a simple window-on-data (recommended). And you can now include ONNX models (new in Oracle 23ai!). But all those are above + beyond the scope of what I wanted to write.

So there.



Summary: 

Write Some Specs, for yourself and your colleagues.

Plan carefully (and maybe read The Mythical Man Month...)

Make an ERD, it helps the understanding and facilitates discussions

Use #SmartDB whenever possible, it is more efficient.

Provide some Test-cases.

Limit your nr of tools/languages, prevent code-chaos.


That is about it.. Next instalment should be : python Tips + python Code ...

 

Friday, 20 December 2024

Oracle 23ai, Domains, Enums and Where-Clauses.

TL;DR: In Oracle23ai, the DOMAIN type is a nice addition to implement more sophisticated Constraints. Someone suggested using a DOMAIN of type ENUM as a column-constraint, acting as an LoV. Works like a charm. But the behaviour of some where-clauses may not be what you think. Let me show you some very simple examples, and some workarounds, both bad and Good.

A prism, just because I used colors as an example.


Background

After seeing a presentation on the use of ENUM-Domains as a List of Values on a column, I did some scripting to see how it could work (previous blogs here + here). Now I tried to use it in Where-Conditions.

So I have a few domains, one of which is color:


And I created a table that uses the domain in a column-definition, like this:

The column is defined as type "color_enum" and thus inherits the LoV defined in the Domain. As I consider the color-column to be akin to a Foreign-Key, I have already created an index on it.
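
For reference, a rough python-flavoured sketch of the objects involved (I ran the originals as a SQLcl script; the names, the connect-details and the exact 23ai enum-syntax here are from memory, so treat it as an illustration):

  # sketch: create the color enum-domain and a table that uses it (python-oracledb)
  import oracledb

  conn = oracledb.connect ( user='demo', password='demo', dsn='localhost/freepdb1' )   # placeholder
  cur  = conn.cursor ()

  # an enum-domain: the names act as constants, the values are stored as numbers
  cur.execute ( """
    create domain color_enum as
      enum ( red = 1, orange = 2, yellow = 3, green = 4, blue = 5, indigo = 6, violet = 7 )""" )

  # the column is simply typed as color_enum, plus an index because I treat it like an FK
  cur.execute ( """
    create table t_colored_things
    ( id     number generated always as identity primary key
    , color  color_enum
    , descr  varchar2(100) )""" )
  cur.execute ( 'create index t_colored_things_color_ix on t_colored_things ( color )' )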

When I ask for the INFO of the table, using SQLcl, this is what I get:


The column COLOR apparently got defined as number(38,0), i.e. an Integer, because it can not hold decimals. But the same column also has the DOMAIN specified as COLOR_ENUM. The info-command also shows I have indexes on both the ID and on the Color-column.

Note: I'm using SQLcl (link) here. In SQL*Plus you would need to do a DESC of the table. Latest versions of SQL*Plus will show the number + domain, but older versions will just show the column as number(38,0).


Do some Inserts

We need some data in the table so let's...

Notice how I can use either the enum.name, which is effectively a set of Constants, or an actual numeric value to insert data. I did not have to visit an LoV table to get (read or join) my values.
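
Continuing the little python sketch from above (same conn and cur; values are again illustrative), those inserts look roughly like this:

  # insert using the enum-name (a constant) and using the plain numeric value
  cur.execute ( "insert into t_colored_things ( color, descr ) values ( color_enum.red, 'a red thing' )" )
  cur.execute ( "insert into t_colored_things ( color, descr ) values ( 4, 'a green thing' )" )
  conn.commit ()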

Just curious: how does it handle "wrong values": 

There is no "purple" defined in the Domain-Enum and an error pops up. But the description of the error is a little vague in my opinion. It seems to read "color_enum.purple" as a reference to a column. I think the actual error here is more of an "invalid value" or "check constraint violated". But I can see why the error is what it is.

And what if we try to insert a number out of range:

When trying to insert a number outside the valid range, the Error is more Clear. It refers to a "check constraint" and it mentions the DOMAIN. This should give even a developer enough clues to fix it.

And we have Proven that the Domain acts as both Check-Constraint and LoV. It is very much similar to an FK (as tested in previous blog as well).

Next is doing some Retrieval.


Selecting from the table.

The result of our inserts is a table with the following data:

No surprises here. But notice again that the columns "id" and "color" are numbers, and we need a display function to obtain the "name" of the color.

Now suppose we want to search for records by the name of the color. We need to put the condition against the domain_display():

Notice the Table Access Full. But also notice the filter, the condition on the display-function: it seems to employ a CASE-WHEN statement to pick up the values from the DOMAIN.

I suspect that the use of the function to find values like 'RED%' is akin to a type-mismatch (e.g. comparing varchars to numbers), which stops us from using the index on the numeric column color. And a Full Table Access can be a bad thing on large tables.

It works much better if we use constants or plain numbers to look for our records:

If the type of the comparison matches the indexed, numeric value, the query Can use the index. And this is the desired outcome.
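
Side by side, the two flavours of that search look roughly like this in the sketch from above (again illustrative names; the first flavour full-scans, the second can use the index on the numeric column):

  # slow flavour: condition on the display-function -> table access full
  cur.execute ( """
    select id, descr
    from   t_colored_things
    where  domain_display ( color ) like 'RED%'""" )

  # fast flavour: compare against the enum-constant (a number) -> index on color can be used
  cur.execute ( """
    select id, descr
    from   t_colored_things
    where  color = color_enum.red""" )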

The Warning is Clear: Do Not Use that function in a search-criterion. Unless...


Function Based Index?

Because the domain_display () is a function, we could consider adding a Function Based Index. Let's try that:
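
My attempt was (roughly) the following, again typed from memory and using the illustrative names of the sketch above:

  # function-based index on the display-function, so the 'like' search can use an index
  cur.execute ( 'create index t_colored_things_disp_fbi on t_colored_things ( domain_display ( color ) )' )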

That works. With the FBI the where-clause has an index that matches the domain_display, and uses that index. Maybe this is what you need. But I am generally hesitant to add more indexes. I would prefer to use matching types in the where-clause: constants, or other numeric conditions.


More Exotic: join the DOMAIN:

For a more exotic idea: you can use the ENUM-domain as a record-source in the from-clause, and add a join to the Query. Not that strange, because this is what you would do when using an FK-Constraint anyway:

And it works. Apparently I can use the domain as a record-source with no problem. Let's have a look at the explain-plan:

There is a list of "filter" operations on the domain, but the query does use the intended (numeric) Index. This is how you would do it when the constraint was a classic FK to an LoV table. And this still works as well.


Summary: DOMAIN-ENUM can be used as LoV.

The DOMAIN can act as LoV and FK. They bring both the "check constraint" and provide readable "constants" for use in code, without the need for a separate table to keep the values. In some cases, this will bring you a reduction of IO-Effort and make your queries more efficient (see earlier blogs). And the same limitations apply when searching for the "description of the LoV".

Advantage or not: You cannot mess with a Domain as easily as with a table. A DOMAIN cannot be edited in the same way as an LoV table (which I have been known to edit in a live system...).

Furthermore, I showed a workaround with an FBI. But I do Not Recommend putting Function-Based-Indexes on domain-columns indiscriminately. Every extra index should be given careful Consideration.


Oh, and: RTFM

There is a lot more on DOMAINS than just LoV. Domains can/should (IMHO) be used for Sophisticated Constraints on one or more (groups of) columns. Do Go check the Documentation.

Or check this page by Tim for copy-paste usable examples. After all: RTFM = Read Tim's Fantastic (summary of the) Manual (link).

Enjoy.


-- -- -- -- -- -- end of blogpost, appendix below -- -- -- -- -- -- 

Appendix

Script to replay the demos..

demo_enum_5where.sql : just hit enter several times to walk through the demo.

Happy Coding (Tip: just use SQL wherever you can!)


Tuesday, 17 December 2024

Oracle 23ai, Domains, Enums, use-cases.

TL;DR: The new DOMAIN type in Oracle23ai can be used as LoV. And in some cases that can save you Effort: the query does not have to visit the FK-tables. I'll show some examples, and I'll demonstrate a few pitfalls to avoid.


Background.

It all started when someone suggested to use DOMAINs of type ENUM as an alternative for FK-tables with List-of-Value lookup-data. For example, if you want your table to have a field "color", limited to just the colors of the rainbow. You can use a check-constraint (simple!), or a table with color-values and an FK-constraint, or in Oracle 23ai, you can use a DOMAIN. We will compare ENUM-domains with the FK-LoV tables.


Datamodel:  a fact table with two LoVs

For this little demo I will use an old-fashioned datamodel with two lookup tables, and one long fact table. And the queries on that 3-table model will be compared to the same queries against a single table where the LoVs are held in two DOMAINS of type ENUM.

First the traditional model; it looks like this:


(I love a primitive diagram...) 

And the Fact table is defined as follows:


All very Classic, Data-table with LoVs and Foreign Keys. The FKs are the "constraints" that ensure correct data. I also put indexes on the FK-fields, but you could omit those.

Now if you want, you can replace those FKs with DOMAINS of type ENUM and use that for the column-definitions:

I started by defining two DOMAINS; the color-one looks like:


I created a similar domain with 1400 Roman Numerals (too long to show, but there will be a complete re-play script at the bottom of the blog). 

Now create the table and use those domains:
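
A rough python-flavoured sketch of that table, in the same style as the other posts in this series (table- and column-names are made up for the example, the real demo lives in the replay-script at the bottom):

  # sketch: the fact table with the two LoV-columns typed by enum-domains, no FKs needed
  import oracledb

  conn = oracledb.connect ( user='demo', password='demo', dsn='localhost/freepdb1' )   # placeholder
  cur  = conn.cursor ()

  cur.execute ( """
    create table t_data_2enum
    ( id       number generated always as identity primary key
    , color    color_enum                      -- constrained by the domain, 7 values
    , roman    roman_enum                      -- constrained by the domain, ~1400 values
    , payload  varchar2(200) )""" )

  # indexes, because I expect to search on these columns at some point
  cur.execute ( 'create index t_data_2enum_color_ix on t_data_2enum ( color )' )
  cur.execute ( 'create index t_data_2enum_roman_ix on t_data_2enum ( roman )' )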


Note how two of the columns have their type defined by a DOMAIN-ENUM. There is no FK, and you can leave out the LoV tables completely. But the values for the columns are now Constrained by the domain-definition. For good measure, I created indexes on them, as I expect to be "searching" for those at some time. 

We filled both tables with Identical data, and did a generate_statistics on them. Now we are ready to do some comparisons.


Aggregate Queries

Let's try to aggregate some color-records, and have a Query that _needs_ to visit the table (otherwise, the optimiser could sneakily just use an index):


Access is via Index-full-scan and table-access full. Join is done by hash-join (more efficient than Nested Loop in this case). Looks like expected behaviour. I happen to know that the LoV table t_color_lov is 8 blocks, and the data-table is 768 blocks in size. Check the statistics of the query:


Just over 700 gets. More or less as expected as there will be some empty blocks in the table. 

Now try this using the table that Does Not have FKs, but is using DOMAIN. First the explain:


We see 1 Full-Scan of the table. No other tables involved, because the constraints are done by the DOMAIN, not by FKs. 

The query-stats look like:


Maybe somewhat to our surprise, there is little gain here. Both queries are Equally efficient. The 7 records for color-LoV probably fit into 1 block, hence the case of the "enum" is consuming just 1 block less.

[ soap-box: ]

Now, YMMV, because I heard of ppl who got bitten by "too many LoVs, too many joins". And they claim to fix this with either Check-Constraints or simply removing the FK-constraints. But I would Strongly Suggest to examine your Explain-plans first. The optimiser is well capable of scanning the small tables just once and using hash- or merge-joins to avoid (nested) Loops. Domains are, IMHO, not a way to solve those bad-queries (but they could, potentially...)

[\steps off soap-box]

I also think Connor McDonald, and others have previously proven, sufficiently, that : Joins, Good Joins, Well Done Joins, ... Are Not Expensive.

And in the case of relatively Large data-tables, and retrieving "a lot of records", the elimination of Joins to small tables by using DOMAINS will only yield very limited benefit (1 block out of 700...). 


Single-record retrieval queries.

Let's compare what happens when we retrieve just a few data-records. The relative overhead of joining to LoV tables might weigh a little heavier on those "precise" cases. We are retrieving Two records and want to see color and roman (e.g. get those from the LoV or from the domain):


We notice the unique scan to retrieve the two records from data_2lov, a full-scan on the color_lov (7 records, effectively just 1 block) to pick out the 1 record needed, followed by the use of the unique-index on roman_lov to retrieve the roman-numerals, one for each record. The predicate information confirms this.

Then we check the statistics, they show 13 block-gets:


By many criteria, 13 block-gets is an Efficient statement. 

Note for SQL-geeks: Yes, There is room for minor improvements. By using a "covering index" on the data_table. Or from the use of  IOTs. Maybe, with some extra indexing, this query could be brought back another 3 block-gets (eliminate 1 step for every table) to 10 gets. I consider that out of scope for now as not a lot of systems seem to go that far in optimising.  But for critical systems I would recommend use of IOTs (Index Organized Tables, docu here) for the LoVs. Should we add a demo of Extreme-Indexing... ? (just for the fun of it...?). Maybe Later (reserve space for link to script).


Now let's retrieve the same info from the data-table using the DOMAINS:
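
In the sketch from above, that query looks something like this (the two ids are just example values):

  # fetch two records and let domain_display() provide the readable values, no joins needed
  cur.execute ( """
    select id
         , domain_display ( color ) as color_name
         , domain_display ( roman ) as roman_name
    from   t_data_2enum
    where  id in ( :1, :2 )
    order  by id""", [ 1914, 1918 ] )

  for row in cur:
      print ( row )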

The SQL and the Explain-Plan show only One Table involved in this Query. The color and the roman are retrieved from the domain_display ( column ) function. For good measure, we added the same order-by as in the previous query.

The Statistics now show :


Only 6 gets. Because we only need to use the PK to retrieve each record. Probably two index-blocks (index-root and 1 branch) and one table-block (by index rowid) for each of the two numbers in the in-list. Total of 6 gets.

Hence, even though the original effort per query was low (13 gets is an efficient query by most standards), the savings are Significant: We reduced the gets by half. And if you have more of those LoVs per record, those savings can add up. 

Note: Possibly you get even more benefit, notably if your datamodel is missing some indexes. Or if you have stale/wrong/missing statistics whereby your optimiser is "deteriorating your query-plans". In those cases, the elimination of LoV joins will (appear to) help. But in those cases, you have other (datamodel + optimiser) problems!

But in short: the elimination of some joins, especially in complex or high-frequency SQL may Benefit from the use of DOMAINS. QED.


My interpretations (and ymmv...) 

1st: For Large, Aggregate-type queries, there is No big Advantage.

On large queries, where you scan significant parts of the "data" or fact-table, the additional, single, Full-Scan of the lookup-tables is probably negligible. Small compared to the effort of scanning the data-table. Hence if your work consists of large, aggregate-type queries, there is little gain in avoiding LoV tables. Provided you have your indexes done well and statistics up-to-date. Also c.f. Soap-Box statement above...

2nd: Possible Gain on fetching small-sets or Single-Records. 

On retrieval of small sets, or individual records, Especially if you have to retrieve many LoV-values, the additional table-access for the LoVs Can be a larger part of your Query-effort. And here you can save on the additional Hop(s) to the LoV table(s). This can be beneficial If your work consists of a lot of small, OLTP-style SQL-statements where you need to access LoVs Many and Often. Again, provided you have your indexes Done Correctly.

3rd, Note: I have some remaining notes on using domain-expressions in where-clauses as well. Maybe put those in a followup to prevent a "too-long-read". 


Closing Remarks. 

In summary, I think the use-case for "reducing effort on LoV" should be focussed on the cases where the LoV is a relatively large part of the total-effort. That is mostly (if done correctly) the case in retrieving small sets or high-frequency, quick-verifications. 

So there Is in specific situations a Good Case for using DOMAIN-ENUMs.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- 

demo_enum_4select.sql : the complete script, just hit enter 9 times.

demo_enum_4extreme_index.sql : remind me to add this if I get there...


Monday, 16 December 2024

Oracle 23ai - enums, what is behind the scenes ?

 TL;DR: The DOMAINs in Oracle 23ai are an additional way to add constraints to your Datamodel and you should investigate this to make your database "Smarter". When testing the simplest of domains, the ENUM, they appear Quite Efficient. And as I can never resist trying some extreme cases, I did hit some limit as well.


Background 

Domains are introduced as a way to define complex and sophisticated (check-)Constraints, and to (re)use those in multiple places in a database. Enforcement of the constraint is by the RDBMS, hence this keeps your Logic as close to the Data as possible and avoids re-coding things in various application-components. Domains can be Very Useful in this respect. You can start your own investigation here (link).

This blog is about the simplest possible case: a List of Values (LoV) constraint. For a relatively simple LoV we have the choice between using a Domain of type ENUM or an FK-constraint to limit the values for a column. The nature of an FK to a lookup-table is very much "Relational", as you can define the valid options in a separate table and let the RDBMS enforce the constraint. The nature of a DOMAIN is more declarative, and resembles a sophisticated CHECK-Constraint.

Note: A word of Thanks to Chris Saxon, who did the presentation that triggered me to doodle the scripts that ultimately lead to this set of tests. It was Fun to do.

A reference to Asterix, because of the Roman Numerals in this blog...


Testing, a small and a large Enum.

I could not resist so I created two domain-ENUMs and did some SQL on them. Here is what I did (I'll put links to full scripts at the bottom..).

My first Enum, a list of colors:


That was Easy. And now for a really long one: the list of valid Roman Numbers in Oracle...



This shows I can potentially turn a LoV table of 4000 into a "DOMAIN", and maybe gain some performance by avoiding a join to lookup-table ?
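
The create-statement itself was generated, not typed; a rough python sketch of how you could do that (I let the database produce the roman numerals with TO_CHAR(n, 'RN'); the domain-name and connect-details are again illustrative):

  # sketch: generate and run a big enum-domain with roman numerals 1 .. 3999
  import oracledb

  conn = oracledb.connect ( user='demo', password='demo', dsn='localhost/freepdb1' )   # placeholder
  cur  = conn.cursor ()

  # let the database do the roman-numeral conversion
  cur.execute ( "select level, trim ( to_char ( level, 'RN' ) ) from dual connect by level <= 3999" )
  entries = [ f'{name} = {num}' for num, name in cur ]

  # stitch the enum-names and values into one (long) create-domain statement
  ddl = 'create domain roman_enum as enum ( ' + ', '.join ( entries ) + ' )'
  cur.execute ( ddl )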

And after creation, you can query them as normal tables. Let's see how that goes: The list of values can be SELECTed and sorted at will:



Because the queries on the rather large "roman_enum" were a bit sluggish, and as I am Always curious, I checked on the effort behind the scenes. The autoexplain looked ok: no buffer-gets or disk-io:



This query is causing a lot of round-trips because the result-set of 3999 records has to be pushed to the client, but Still no buffer-gets or disk-io visible. Hence operations on DOMAINS should be buffer- and IO-Efficient. Promising. 

But queries on the large domain did use up a lot of cpu-time, which could be a warning. I ignored the warning at that time, because a full-query on such a large domain is probably a weird case anyway.

Here are the statistics from v$sqlarea in the shared pool; they showed me similar good information:



The Good Part: No logical- or physical-IO. That two of the queries showed buffer-gets was apparently due to them being the first-time-query on the DOMAINS after the purging of the sh-pool to allow clean measurement of the statistics.

But the fact that the "composition" of the relatively large set for roman-numbers had taken a lot of CPU- and Elapsed-time was "interesting".


Now use DOMAIN on Columns in a Table:

To get some real usage, I needed tables with columns that use the (enum) domains in their type-definition. So I created two very simple tables:



Notice how the 2nd column of each table only has a definition that refers to an enum-domain. There is no other information to say that the values in that column are actually numeric (in my cases). If you RTFM deeper, you will find that the actual enum_values that go in are not limited to numbers. But I have not tested that (yet).

But let's insert some data and see what it looks like. First the Colors...



As you can see, you can insert data using either a number (e.g. the enum_Value) or the "constant" from the DOMAIN (e.g. the enum_Name). And on query, the value is reported as the enum_value (e.g. similar to reporting the key of an FK table). If you want to display the enum_name, you need to use a function:

Select DOMAIN_DISPLAY ( enum_value_column ) from Table;

The table-definition will then use the domain defined for that column to "look up" and display the description. The lookup or display does not seem to incur additional IO, it will probably need a little extra CPU ticks, I might test that later.


A Quirk to display the large domain?

Here is what happened when I query the table with the roman_enum, first using select-star:



No surprise (yet). 

You can see I used numbers that could represent years, and because of the domain-definition, each enum_value or "row-number" corresponds to the enum_name with the equivalent roman-number. I know some ppl like to use Roman-Numerals for years, hence my use of this domain.

But now let me try to display the "roman numeral value" using the DOMAIN_DISPLAY function:



Resulting in an Error ?? 

Also, the error was "instant", there was no noticeable (cpu-)time consumed before throwing the error. And the error was always like a syntax- or typo-style error. Well, as far as I could see, there were no syntax-errors in my queries... ? (I am notorious for typos, so I did check several times, and re-typed the statement in a few ways, no relief, just errors while trying to retrieve my "roman number"...).

Almost as if the RDBMS told me I should just revert to TO_CHAR( rn, 'RN'), which is a lot simpler anyway.

I defined a few more, different, smaller, enums, which all Worked Fine... ???

Then did a little searching (in my own database, the datamodel, data-dictionary), and found that the data-dictionary had a view "USER_DOMAINS". It contains a CLOB-column called DATA_DISPLAY, in which there is a CASE statement that seems to compose my enum:



I then got suspicious and checked the size of that column, using dbms_lob.getlength() for my domains:



Hmmm. I went back and cut my roman-numeral-domain in half, and in half again, and then did some testing to find the max-size where display_data still worked... 

And the workaround was more or less as I suspected: Smaller Domain. In my case, if I got the size of the clob below 32K, I could display the data. Here is a small demo, contained in a separate file: demo_enum_size.sql:


Voila. That works. But in this case I seem to be limited to about 1400 values.


Summary: Domains are Useful. but Test !

In practice I would probably never use a DOMAIN that large. And there are other practicalities, like you can not (yet) alter a domain without (probably) causing a window-of-non-constraint, or some outage. This needs more Testing (...). In general for such long lists, I (still) think a real LoV TABLE with FK is more appropriate. And I know an LoV table can be "edited" in a LIVE system, BTDT ;-).

But the fact is that DOMAINS have advantages too. DOMAINs are declarative, they do not need access to "data" (buffers or disk), and they can implement constraints that are much more sophisticated than the old-fashioned "check". That makes them potentially Very Interesting. I should test more...

So maybe TBC.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Appendix: Scripts used to compose this blog..

demo_enum.sql : self running, just hit enter some 15 times..

demo_enum_2size.sql: self running, shorter, just to work around the error. 


Edit: Almost forgot:

demo_enum_io.sql : to display statistics from v$sqlarea, show the CPU + IO

Friday, 6 December 2024

Oracle 23ai - unloading of vector data.

 TL;DR: For demo and repeatable-test cases, I like to use scripts to set/reset my test-cases. Vectors can pose a problem here bcse they are not always easily "scriptable". Not Yet anyway. SQLcl can help with the unload commands.



Pop-in an image of "unloading", because I am in the end just a farm-boy who likes tractors and combine-harvesters and stuff...


Background: Scripts, Reproducible cases.

Always looking for ways to quickly test or demo something, I am a Big Fan of script-tools like SQLcl and SQL*Plus (and bash, and sometimes python). Good Demo- and Test-cases should be easy to reproduce, on any platform, even 10yrs later. The oldest "test" I got is a few scripts and one bench.dmp file from 2001, created with Oracle v8.1.7, and it is still "usable" (not very relevant anymore, but still usable, that is Backward Compatibility for you...). 

The creation and manipulation of test-data for VECTOR columns sometimes poses a problem. See also earlier blog (link) where I pointed out that the (ancient) exp/imp tools do not handle VECTOR data, but datapump does.

This blog is partly a call for help to Oracle (Hi Jeff!) to help facilitate the simple spool+reload of Vector-Data. I'm fairly sure there are several solutions, but not sure which ones to explore or use in my specific cases.


This Worked: SQLcl unload (as insert-into)

SQL*Plus and earlier versions of SQLcl had given me several errors, seemingly treating the VECTOR as a CLOB or LOB. But hey: Use Latest Version! Right?

Using SQLcl, version 24.2.0, I found the following to work, using the unload command:


This generated a runnable sql-file that would correctly re-insert my data, even when the vector itself was (much) larger than 4K. My Tested case used vectors that were still under 32K. Might try bigger ones later.

The secret was, apparently, in the trick to split the vector into smaller chunks, which each occupied a line of just under 512 characters (linesize 512?). Each line was a set of numbers converted TO_CLOB ( ... ), and concatenated with its peers. The resulting CLOB was inserted into the VECTOR column, and this gave the desired result: my vectors were re-inserted. This is a screenshot of the unloaded-file:

Notice the repeated TO_CLOB(...) statements that are concatenated together to form the complete vector (which is apparently the equivalent of a clob...). 

Note: If you look inside the user_indexes and user_segments,  you can see that vector-columns have corresponding ilob and lob indexes: Storage-wise they are apparently treated similar to LOB/BLOB/CLOB columns.

I was relatively happy with this "concatenate-clobs trick" for several reasons:

. The unload generated correct, runnable scripts.

. No intermediate (staging) tables or additional pl/sql (helper-) procedures needed.

. Even SQL*Plus was able to insert this data. I had a Very Portable Demo!

But there were some clouds on the horizon...


Possible Problem with new version SQLcl ?

Then, I ran into an error after downloading sqlcl v24.3. The unload-command had changed slightly: it generated an insert of a single VECTOR. 

Hence, unloading with SQLcl version 24.3.2.0, build ..3301718, then trying to reload resulted in this error:

ORA-01704: string literal too long...

When I opened the unloaded file, the 3rd line in the sql-file was now 30.621 characters long (hm.. just under 32K, that seemed a coincidence - I should try with longer vector some time... ). There were no more TO_CLOB (...) concatenations, and the vector was just a quoted string of values, looking like this:

This simple quoting of the vector will work for small vectors (below 4K or even 32K), but not for seriously large vectors where the string-representation will surpass these values. Also, I am not sure what would happen to binary vectors: how would they be represented as quoted string ? (More stuff to test!.. I know, I know, but Time... Maybe later).


Final notes: Multiple Solutions.

From working with LOBS and CLOBS myself a long time ago, I know how Tricky these things can be: They mostly require use of some other language than just SQL(*Plus) to manipulate them. They need to be split into chunks that are manageable by whichever language/system/host is doing the processing. But as a simple SQL user who just wants to run Easy Test- and Demo-scripts, I am hoping the solution will remain Simple.

By documenting this "case" I hope Oracle can either fix this (by reverting back to using multiple TO_CLOB?), or work around this problem in some other way.

One suggestion I got in private-message was to try using JSON: get the VECTOR in some JSON format and use that to unload/load the data, possibly in smaller chunks. And of course you can also revert to additional PL/SQL, Java(script) or Python to manipulate vectors. But all those solutions involve some additional (bespoke) coding on my part. And I just want Simple-Demo cases.

Suggestions Always Welcome ! 


----- End of Blogpost ------ 


Attachements, Addendums (just some files)


Below are several links to files to test/demo for yourself:

ini_unload.sql:  Set up the test-case, create table and load nine records with VECTOR(2048, FLOAT64), The data is generated from some of my experiments with vectors.

tst_unload.sql:  Demonstrate the working (or error) of unload-as-inserts using SQLcl. This script will run the actual demo and possibly hit the "error" when it runs the script generated from the unload-command.

If you want see the two unloaded files: 

The one with "working" inserts and 

The one that generated the error..


Happy Scripting.