Sunday, November 12, 2023

On SQLAlchemy

I've been using SQLAlchemy and reading about it for a few months now, and I don't get it. I don't mean I don't get SQLAlchemy; I've learned how to use it. But I don't get why people use it, and are actually enthusiastic about it.

I'm writing this article because I can't seem to find any articles on the internet that are critical of it, and I feel there should be one.

To explain why I don't like SQLAlchemy, I will go through the advantages and disadvantages that are commonly mentioned and respond to each of them.

Let's abbreviate SQLAlchemy to Alchemy and explain what Alchemy does. It is an ORM (Object-Relational Mapper: it converts database records to Python objects, and the other way around), and it has a query generation language that allows you to write queries using method chaining. Alchemy changed its syntax in version 2.

Is there an alternative to Alchemy? Yes: you can use psycopg to perform plain SQL queries and write simple functions yourself that create objects from database records.
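To make that concrete, here is a minimal sketch of the plain-SQL approach, assuming psycopg 3 and an invented books table:

# A minimal sketch of the plain-SQL alternative, using psycopg 3.
# The books table and the connection string are invented examples.
from dataclasses import dataclass

import psycopg

@dataclass
class Book:
    id: int
    title: str

def fetch_books(conninfo: str) -> list[Book]:
    with psycopg.connect(conninfo) as conn:
        rows = conn.execute("SELECT id, title FROM books ORDER BY id").fetchall()
    return [Book(*row) for row in rows]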

I read SQLAlchemy 2 In Practice, which is a good book, but the author doesn't even ask the question why we should use Alchemy, and implies that it is the only way to work with a database. That got me thinking.

Grinberg actually wrote the book because he created a tutorial on Flask and found that most questions were asked about Alchemy:

Many of the questions are from developers that need to know how to do something a little more complex than what I present in the article, but are unable to figure out how to do it on their own.

Advantages

Easy to set up

If you know what to do, creating a few simple models to set up the ORM is indeed easy, while creating ORM functionality manually also takes some time. So setting up the models and generating database tables is easy enough. It gets trickier when you need to set up a many-to-many table, especially when it has extra columns; see the sketch below.
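This is roughly what that trickier case looks like in SQLAlchemy 2.x, using the association-object pattern (the model names are invented for the example; this is not code from the book):

# Sketch of a many-to-many relationship with an extra column (SQLAlchemy 2.x).
from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship

class Base(DeclarativeBase):
    pass

class Student(Base):
    __tablename__ = "student"
    id: Mapped[int] = mapped_column(primary_key=True)
    enrollments: Mapped[list["Enrollment"]] = relationship(back_populates="student")

class Course(Base):
    __tablename__ = "course"
    id: Mapped[int] = mapped_column(primary_key=True)
    enrollments: Mapped[list["Enrollment"]] = relationship(back_populates="course")

# The extra column ("grade") forces you to model the link table as a full
# association object instead of a plain secondary table.
class Enrollment(Base):
    __tablename__ = "enrollment"
    student_id: Mapped[int] = mapped_column(ForeignKey("student.id"), primary_key=True)
    course_id: Mapped[int] = mapped_column(ForeignKey("course.id"), primary_key=True)
    grade: Mapped[str]
    student: Mapped["Student"] = relationship(back_populates="enrollments")
    course: Mapped["Course"] = relationship(back_populates="enrollments")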

Alchemy gives abstraction to the backend database. So, an average developer does not have to worry about SQL statements.

This is not only untrue, it's deceptive. You need to know SQL, and you need to know the syntax of Alchemy as well. If you know how to do something in SQL and want to do it in Alchemy, you have to find out how to express it, and Alchemy's syntax has many quirks and idiosyncrasies.

The transition to other databases becomes easier.

This is true: if you have used SQLite and want to switch to Postgres, and you haven't used any SQLite-specific constructs, you can switch without changing any code. But this is not a big advantage. Most of the syntax is standard SQL anyway, and how often does it happen that you swap one database for another? Finally, swapping databases and expecting no side effects is naive; in practice it may bring quite substantial problems with it.

Readability and Maintainability: SQLAlchemy code tends to be more readable and easier to maintain than raw SQL queries. It leverages Python classes and objects to represent database tables and rows, making your code intuitive and self-explanatory.

It is true that adding a join to a query requires less code. Apart from that, plain SQL is just as readable and just as easy to maintain.

Security: SQLAlchemy includes built-in protection against SQL injection. It automatically escapes user input, effectively thwarting malicious SQL code execution.

Every modern database library includes protection against SQL injection.

Queries are optimized and may work better than SQL if you wrote it yourself unless you are an SQL veteran.

This is also deceptive. Alchemy is slower by default, simply because it forms a wrapper around SQL. Furthermore, since Alchemy has a different conceptual model than SQL, it is not obvious which SQL queries are actually performed. Only by learning the ins and outs of Alchemy, and tweaking many of the model declarations, will you be able to ensure that Alchemy performs efficient queries. Of course there will be edge cases in which an Alchemy query is more efficient than a plain SQL query, but only when that SQL query was written by someone with little experience in SQL.

On StackOverflow, there are, at the time of writing, 16,000 questions with SQLAlchemy tag. Therefore, you should not be concerned about unsolvable bugs when using this tool.

Quite a relief!

Disadvantages

There could be instances where Alchemy might become inefficient. Therefore, knowing SQL is always desired.

When performance matters you will need to learn Alchemy from A to Z. Alchemy does still allow you to use plain SQL queries, so that may be an escape hatch.
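That escape hatch looks roughly like this (the query is an invented example, and an existing Session is assumed):

# Dropping down to plain SQL from within SQLAlchemy.
from sqlalchemy import text

# `session` is assumed to be an existing SQLAlchemy Session.
rows = session.execute(
    text("SELECT id, title FROM books WHERE id = :id"),
    {"id": 42},
).all()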

Performance overhead: SQLAlchemy introduces a slight performance overhead compared to raw SQL queries. While this overhead is generally inconsequential for most applications, it may become a concern for high-performance scenarios.

True

Learning Curve: SQLAlchemy has a learning curve, particularly for developers new to ORMs. It takes some time to become proficient in SQLAlchemy’s API and concepts.

Believe me, it takes a lot of time. Alchemy has many fixes for problems it created itself. It is a completely different mental model you need to acquire, and you constantly need to switch between this model and the plain SQL model.

Additional disadvantages

These disadvantages come from me.

Alchemy mixes domain model code with storage code.

It's good practice to separate domain model code from database storage code. It makes you aware of database queries and allows you to focus on them. Alchemy combines the two: it may send a database request when you merely read an attribute. This is not a problem for a simple operation, but when you operate on thousands of objects, you will need to know exactly what Alchemy does internally, and tweak the model to ensure that it performs the right kind of queries.
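A sketch of what I mean, with invented models: the innocent attribute access in the loop can fire one extra query per object (the well-known "N+1" problem), and the fix is to know to ask for an eager-loading strategy:

# Invented models: each `book.author` access may lazily emit its own SELECT.
from sqlalchemy import select
from sqlalchemy.orm import selectinload

books = session.scalars(select(Book)).all()
for book in books:
    print(book.author.name)  # potentially one query per book

# The fix: tell SQLAlchemy up front how to load the relationship.
books = session.scalars(select(Book).options(selectinload(Book.author))).all()
for book in books:
    print(book.author.name)  # no extra queries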

Asynchronous access

When writing asynchronous code with Alchemy, you need to make changes in a number of places to prevent Alchemy from making implicit database calls. Reading about that in Grinberg's book positively scared me.

Additional advantages

Alembic

To be fair, Alchemy comes with a tool that is actually quite nice, and which is not often mentioned among the advantages I read: Alembic, the database migration tool that generates migration scripts based on the changes you make to the model. This allows you to keep your model up to date and have Alembic take care of migrations.
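A typical Alembic round trip after changing a model is `alembic revision --autogenerate` followed by `alembic upgrade head`. The programmatic equivalent looks roughly like this (the message and config path are example values, and an initialized Alembic environment is assumed):

# Generate a migration script from the detected model changes, then apply it.
from alembic import command
from alembic.config import Config

cfg = Config("alembic.ini")  # assumes an initialized Alembic project
command.revision(cfg, message="add grade column", autogenerate=True)
command.upgrade(cfg, "head")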

Conclusion

As this is a critical article, I recommend against using SQLAlchemy. Writing plain SQL queries is just easier and gives you more control, and creating simple ORM functionality and writing migration scripts doesn't need to be complicated either.

References

https://www.analyticsvidhya.com/blog/2022/07/a-brief-introduction-to-sql-alchemy/

https://python.plainenglish.io/sqlalchemy-vs-raw-sql-queries-in-python-a-comparative-example-b7a838ebef82

https://trantriducs.medium.com/orm-and-sqlalchemy-the-magic-wand-in-database-management-a2d02877a57a


Monday, November 6, 2023

Some remarks on "Puzzles of Reference"

Intro

I'm reading "Puzzles of Reference" by Herman Cappelen & Josh Dever, and I have some remarks about it.

Chapter 2: A Puzzle about Belief, Reference, and Agency

The chapter is about two names that refer to the same entity, but at the same time seem to refer to two different entities.

Examples: 
  • Marilyn Monroe and Norma Mortenson
  • Clark Kent and Superman

Why is it that you can't just substitute "Norma Mortenson" for "Marilyn Monroe" in the sentence

Jill just discovered that Marilyn Monroe is Norma Mortenson

It surprises me that the book doesn't even mention the distinction between sense and reference that was introduced by Frege. The authors must know about it. Frege's example is the Morning Star and the Evening Star, which both refer to the planet Venus.

So the answer to the puzzle is that Marilyn Monroe and Norma Mortenson are names of two senses that have the same referent. The sense of a name is the thought it expresses.

The authors come closest to this option in their "believing under a guise", but it's much less clear than the sense solution.

To stay with Kripke's Causal-Communicative Model, you might say that if a name has two senses, there are not one but two causal chains from name to referent.

Postscript 1:

About the sentence "Clark Kent is Superman": what does it mean? If it meant that they are identical, you should be able to replace one name with the other. It also proves hard to implement such identity in a system. I came up with this idea: "Clark Kent is Superman" should be compared to sentences like "Clark Kent is male". Element belongs to set. To be precise: the entity named "Clark Kent" belongs to the set of entities named "Superman".

Postscript 2:

"Superman" may also be the name of a thief pretending to be him

Superman has been seen taking gold and gems from a jewelry store. 

What would be the meaning of "Superman" in this sentence? The sense of "Superman" here is the real Superman (the thought it expresses is that it is really Superman), but the referent is the thief. The causal chain takes a twist after the sense.

Postscript 3:

Elon Musk changed Twitter into X.

Are Twitter and X two names for the same entity? Are they still the same thing? If not, what about the Morning Star and the Evening Star? Don't they have different characteristics? What does it mean they are the same, really? Is it the physical aspect? What about Superman and Clark Kent, are they fictionally physically the same? What about the caterpillar and the butterfly? Are they the same? Is it the individual? What about the tree trunk that's turned into a boat?

Old Man Oak was chopped down and turned into a boat. Now he spends his time transporting small Indians to the other shore.

Old Man Oak was chopped down and turned into millions of tiny matches. Now he spends his time lighting candles and starting furnace fires.

Isn't identity just a man-made concept that is just as fluid as any other?

To say that senses S1 and S2 are the same means that they have the same referent. Is that important? For a computer system? Only if there is a distinction between the sense layer and the referent layer. Most systems have just one layer. Is this the sense layer or the referent layer? Is the name of a customer in a database the sense of the person or the referent? It's usually taken to represent the referent. But we could also agree that the database is about senses. It depends on the application. However, to represent the sentence "Clark Kent is Superman" in a database, we must not only represent Clark Kent and Superman as entities, but also a third, more abstract entity $9822811. Only then can we say:

name($797483, "Clark Kent")
name($188294, "Superman")
referent($797483, $9822811)
referent($188294, $9822811)

and this would then represent the sentence

Clark Kent is Superman

Postscript 4:

I now think a name has a meaning, a subjective referent, and a collective referent, and that none of these is required.

The meaning is defined by a logical expression. So this is different from the sense of Frege, which is just a thought, and thus individual to a person.

The referent is the entity the name refers to. One can hold that the referent can't be fictional, but to me this is just a matter of choice, and we could just as well agree that fictional entities can be referents.

A subjective referent is the entity that an individual means by a name. A collective referent is what's meant by the name in general. Usually they are the same, of course.

Some examples:

  • No meaning, only referent: Proper names like John McCarthy
  • No referent, only meaning: Jack the Ripper (before we knew who it was; well there always was the subjective referent of the murderer himself)
  • Both meaning and referent: The Morning Star (meaning: seen in the eastern sky before or at sunrise), Superman (meaning: the guy with the cape and the letter S on his chest), Octavianus (meaning: before 27 BC)
  • No meaning, no referent: Jabberwocky

Octavianus was later called Augustus, so when we name the man we must take into account the time of the event. The same goes for Cassius Clay / Muhammad Ali.

Postscript 5:

There's also a possible solution to this problem that I would call the cognitive variant. It would say that a referent is just a theoretical construct that has no basis in the human mind and no relevance to ordinary communication.

To a human, a name belongs to an idea. John McCarthy is an idea, and so are Superman and Jabberwocky. An idea is a mental entity with a name, possibly associated with perceptions and knowledge.

Leaving referents to scientists and logicians, we continue with just ideas. Can we solve all our problems with them?

Clark Kent is Superman: this means there exists a two-way relationship between CK and S through which each idea inherits information from the other. Starting out with two ideas, Clark Kent and Superman, we learn that they are the same thing. At that point we form a mental connection between the two ideas: the connection is-a. This is an active connection. Now, whenever we need information about Clark Kent and we don't find it, we make a deduction: since Clark Kent is-a Superman, we can also take this information from him. It also works the other way around. It may be called two-way prototypical inheritance.
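A toy sketch of this two-way lookup (entirely my own illustration, not from the book):

# Toy model of two-way prototypical inheritance between ideas.
class Idea:
    def __init__(self, name, **facts):
        self.name = name
        self.facts = facts
        self.same_as = []  # active "is-a" connections to other ideas

    def lookup(self, key, seen=None):
        seen = seen if seen is not None else set()
        if key in self.facts:
            return self.facts[key]
        seen.add(self)
        for other in self.same_as:  # deduce via the connected idea
            if other not in seen:
                value = other.lookup(key, seen)
                if value is not None:
                    return value
        return None

clark = Idea("Clark Kent", employer="Daily Planet")
superman = Idea("Superman", can_fly=True)
clark.same_as.append(superman)   # "Clark Kent is Superman"
superman.same_as.append(clark)   # ...and the connection works both ways

print(clark.lookup("can_fly"))   # True, inherited from the Superman idea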

Superman has been seen taking gold and gems from a jewelry store: this is not problematic at all. The idea of Superman in the minds of the newspaper readers is just extended with an extra, albeit shocking, event. There is no referent in the mind, so there's no problem either.

After 27 BC, Octavianus was called Augustus: with the new name, Augustus, we also create a new entity. Everything we knew about Octavianus is copied to the new entity, and we create the relation is. We may be able to do a shallow copy, which means that we don't actually copy anything up front; but when something changes in one entity, it won't change in the other.

So, is a name a causal chain (Kripke) or a description (Frege)? Well, there is a chain of ideas, but it doesn't end in the referent. It ends with the first person that had the idea. And there are also descriptions of ideas, but they don't describe the idea uniquely, and there may be many descriptions for the same idea.

An idea may be linked to a perception. We form an idea based on our perception or on our thoughts, and we give this idea a name, to make it easier to talk about. This is the origin of the name, but it's one step away from the real referent, which is the object or person we perceived. But referents are problematic. Does a rainbow have a referent? It has no objective existence. Does that matter? Only when you're interested in objective existence.

Postscript 6


People use constructs to think. Constructs may be real, which means that they have a substrate in physical reality, or fictional. 

The relationship between a construct and reality is an informational relation, no other connection exists. This relation is neither physical, nor imaginary, but exists nevertheless.

A construct can be an entity or a statement or relation about multiple entities.

A name is an attribute of a construct. 

A construct refers to an entity. I think we must let go of the idea that a referent must be real. The entity may be real or fictional. A real entity has a physical substrate.  A fictional entity has a mental substrate (it exists in the imagination of people).

The truth of a real construct should be found in reality. Only reality determines the truth value of a real construct. 

The truth of a fictional construct is decided by the author of the entity. Only J.K. Rowling may decide what happens next to Harry Potter.

The identification of constructs (i.e. "Clark Kent is Superman") doesn't change the relationship between the constructs and the object in reality/imagination, because this relationship is purely informational, and it already existed. Instead, both constructs are extended with an equals relationship. This relationship can be used in deductions about the constructs from that moment on ("Since Clark Kent is Superman, Clark Kent was in London when Superman met his nemesis").

Chapter 3: A Puzzle about Referring to What Isn't There

The chapter is about names like "Sherlock Holmes". Sherlock Holmes doesn't exist, and so how can you refer to him?

Well, existence (reality) is contrasted with fiction. So you can't refer to the real Sherlock Holmes, but you can still refer to the fictional Sherlock Holmes. And anything you say about Sherlock Holmes can have a truth value in the fictional world:

  • Sherlock Holmes lived at 221B Baker Street (true)
  • Sherlock Holmes doesn't exist (true)
  • Sherlock Holmes is a detective (true)
  • Sherlock Holmes was first introduced in A Study in Scarlet (true)
  • Sherlock Holmes was a fictional character who was the basis for the television character Gregory House (I have no idea, but whatever)

In "Sherlock Holmes is a detective" the word "is" is a copula, which has no relation to existence.

In "Sherlock Holmes lived at 221B Baker Street", living seems to presuppose existence, but "to live" is simply a very flexible verb: "the living planet", "his ideas are still very much alive", "the life cycle of software components".

The book gives us three "Realist Views" (views that claim that these names do refer). Of these, "Creationism" comes closest to what I think is the case: "N-names refer to fictional characters". But then the book says: "Because Sherlock Holmes is an abstract entity created by human activity, Sherlock Holmes is not a detective. Detectives are concrete, not abstract." This sounds logical, but since "abstract" is simply another word for "fictional" in this paragraph, it flatly denies exactly what I am trying to propose, without reason. Then follows the odd claim that "creationists" don't believe in their own stance: "Creationists take typical in-fiction claims to be false."

The puzzle can be solved simply by allowing references to fictional entities; a thought that is apparently unacceptable to the authors.

The reason for this probably goes back to Frege, who restricted reference to existing entities, because his aim was scientific truth and reasoning. In the common use of language, however, references can very well be to fictional entities.



Sunday, September 10, 2023

Combining the logical and imperative paradigm in a programming language

For my natural language execution engine I created a logical programming language to define the meaning of natural language phrases, and to specify the details of the execution. The language was a simplified version of Prolog, but I added expressions, conditionals, loops and mutable variables later, to make it easier to use. It got easier to use, but it also became more unpredictable to the application programmer (which is just me for the moment). 

The problem was that the logical constructs could create many bindings and one would just not expect that from an imperative language. A simple for loop would add multiple variables to the application and each goal of the body would create not just one set of bindings, but one for every iteration of the loop. The number of bindings could easily become unwieldy and extra care had to be taken to suppress them. This could be done by adding an extra set of mutable variables. It was possible, but very uncomfortable.      

After rethinking the language I decided to split the logical part and the imperative part and gave them each a separate language construct. 

Logical


A logical programming language has facts and inference rules, like this:

male(peter).
male(john).
female(mary).
parent(peter, john).
parent(mary, john).
father(X, Y) :- male(X), parent(X, Y).
mother(X, Y) :- female(X), parent(X, Y).

When a goal like father(X, john) is executed, X can receive multiple bindings: one for each X that satisfies both male(X) and parent(X, john). If no such X can be found, the inference rule stops and the goal fails.

The essence of this paradigm is thus that each combination of variables in some scope can have zero, one, or multiple bindings (values). A condition that can fail serves as a conditional for the rest of the goals; a condition that can have multiple bindings serves as a for loop for the rest of the goals.
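As a sketch of that essence (my own illustration in Python, not the engine's actual code), a goal can be modeled as a generator that yields zero, one, or more bindings:

# Each goal yields zero, one, or more variable bindings.
# Zero yields = failure (a conditional); many yields = a for loop.
FACTS = {
    "male": [("peter",), ("john",)],
    "female": [("mary",)],
    "parent": [("peter", "john"), ("mary", "john")],
}

def match(predicate, args, binding):
    for fact in FACTS[predicate]:
        new = dict(binding)
        ok = True
        for arg, value in zip(args, fact):
            if arg.isupper():                      # a variable, e.g. "X"
                ok = new.setdefault(arg, value) == value
            else:                                  # a constant must match
                ok = arg == value
            if not ok:
                break
        if ok:
            yield new

def father(binding):
    # father(X, Y) :- male(X), parent(X, Y).
    for b in match("male", ("X",), binding):
        yield from match("parent", ("X", "Y"), b)

print(list(father({"Y": "john"})))  # [{'Y': 'john', 'X': 'peter'}]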

My current syntax for facts and inference rules is:

too_old(A) :- [ birth(A, Birth) Age := age(Birth) Age > 40 ]

The conditions can only be literals (birth(A, Birth)), assignments, and boolean expressions. There are no loops or conditionals. The variables are immutable and can hold multiple bindings.

This construct is great for definitions, and for interacting with the database.

Imperative

An imperative programming language has functions (procedures) like this:

function hypotenuse(Width, Height) {
    WidthSquared = Width * Width
    HeightSquared = Height * Height
    Hypo = sqrt(WidthSquared + HeightSquared)
    return Hypo
}

Importantly, this paradigm yields only one binding per function call. Loops and conditionals must be written explicitly.

My current syntax for a function is:

hypotenuse(Width, Height) => Hypo {
    WidthSquared := Width * Width
    HeightSquared := Height * Height
    Hypo := go:sqrt(WidthSquared + HeightSquared)
}

Note that the returned value is specified by its variable, not its type, and that no explicit return statement is needed.

The body of the function can contain assignments, loops, and conditionals. It has no literals. All variables are mutable.

This construct is great for calculations.

Combining logical and imperative

To use logical literals in a function, use the explicit for loop. This way it is clear that there can be multiple bindings, and any new variables are scoped to the for loop.

for [size(E, S) E > 5] {
    /* statements */
}




Sunday, July 17, 2022

The necessity of consciousness

Why do living organisms experience consciousness? I hypothesize that living beings acquired consciousness because it was the only way they could learn about what is good and what is bad for their concerns.

Organisms have concerns like staying alive, staying well-nourished, staying healthy, reproducing, and so on.

Consciousness as I use it here is the mental experience of information: seeing, smelling, touching, feeling. Not the physical aspect of neurons firing, but the mental aspect of experiencing.

Learning always requires feedback. This feedback can come from an external source, a teacher, or from yourself, as you set a goal to mimic someone or something else.

Experience

But how do you learn that sprouts taste awful? Or that being yelled at hurts? Or that a painting by Picasso is strange? Of course someone may tell you about it, and you may copy that opinion. But what if copying someone else's opinion is not good enough, and it's better to find out for yourself?

In these cases you learn by trying. You learn from experience.

Can this learning be done by mental processes alone? Is the firing of neurons, the strengthening of axons, or, a bit higher up, the creation of a neural representation enough to learn about the taste of pepper, the smell of poop, or the bewilderment caused by an Escher painting?

It isn't. What's missing is the judgement. Of good and bad, of awful and sweet. Of interest and boredom. This judgement is not in the representation, and it is not induced from the representation. The judgement is in the experience of the representation.

Smell a rotten egg. Certain chemicals enter the nose. The nose fires certain neurons. Certain areas in the cortex are activated. The amygdala is involved. Neural patterns are formed. So what? Why would that instinctively cause you to recoil? It doesn't. Not unless experience is involved. And no matter the kind of representation, it is no substitute for experience.

This experience is important. It connects your sensory input to your inner concerns. It teaches you what is good for you and what is bad for you.

Dualism

But if experience is different from a neuronal representation, what is it? I have to admit that my idea of this is not far from the view that Descartes described 400 years ago: dualism, with a neuronal mind/body on the one hand, and an experiencing psyche on the other. Experience (consciousness) is a different substance than matter.

For this hypothesis to work, the neural mind must influence the conscious substance. The idea that comes to mind is the way electricity influences magnetism, and vice versa: two different substances that turn out to be part of the same continuum. In this case matter and consciousness would both be part of a matter-consciousness universe.


[Diagram: neural representation → conscious experience → perception of conscious experience → concerns]

This diagram attempts to show how mind and consciousness interact. The mind creates a representation. Certain representations cause a change in consciousness. In order for this experience to be functional, it needs to be noticed by the mind again: there needs to be perception of the conscious experience.

Necessity

But why is this necessary? Can't experience be emulated? It can be faked, for sure. But it can't be emulated, because experience is what it's all about. Experience drives existence. You would not buy a painting unless it was beautiful. You would not eat all those potato chips unless they tasted so damn good. Experience is the messenger of the deep concerns that make each of us a living being.




Saturday, January 15, 2022

Processing SHRDLU demo sentence "Why?" with NLI-GO

For several years I have been working on a natural language understanding program called NLI-GO.  It allows the user to interact with a database through natural language. Since I needed example sentences to extend its functionality, and SHRDLU is a great inspiration, I thought it would be a good idea to recreate SHRDLU's famous sample dialog in NLI-GO. And it is a good idea. Each sentence creates new challenges and puzzles that need to be solved, and this allows NLI-GO to grow.

I am still working on it. The dialog has 44 interactions, and the last few months I have been working on interaction 25. It goes like this:

Person: Why?
Computer: TO GET RID OF IT

As usual, this interaction proved more complicated than I had envisioned at first. Now that I have finished it, I would like to explain some of the techniques I used.

Some dialog context

The question "why?" follows a series of interactions. Of these interactions the ones that matter are:

Person: Will you please stack up both of the red blocks 
  and either a green cube or a pyramid?
Computer: OK.

This is followed by the sequence

Person: Had you touched any pyramid before you put the 
  green one on the little cube?
Computer: YES, THE GREEN ONE

Person: When did you pick it up?
Computer: WHILE I WAS STACKING UP THE RED CUBE, A LARGE RED BLOCK AND A LARGE GREEN CUBE

Person: Why?
Computer: TO GET RID OF IT

So the person asks the computer when it picked up the green pyramid, and then why it picked up that pyramid. SHRDLU remembers that it was stacking up some blocks and needed to get rid of this pyramid in order to move the block that was below it.

SHRDLU's actions

This is a lot to take in. These are not your average QA questions one would ask a database in an NLI system. So this is a good time to go into the action structure of SHRDLU.

When the user tells SHRDLU to perform some command ("build a stack", "pick up a block"), the actions needed to perform this command are executed. While they are executing, SHRDLU also stores the fact that such and such an action was performed in its database.

This is what the start of a "pick up" action looks like in NLI-GO:

pick_up(EventId, Subject, Object)
start_time(EventId, Start)
end_time(EventId, End)
parent_event(EventId, ParentEventId)

Note that this doesn't require an event-based database, if such a database even exists. A normal relational database can keep track of these events, but they need to be inserted explicitly by the application.

I am currently using an internal database (a data structure) to store these actions, but they could be stored in a relational database as well. `pick_up`, `start_time`, `end_time`, and `parent_event` would then be the names of tables.

Most commands are not simple. They consist of a hierarchy of actions. If a "pick up" action needs to move some object aside, it starts a "get rid of" action. This action receives a pointer to the action that originated it, stored as the `parent_event`.

Because these actions are stored, SHRDLU has a memory of them. Because each action has a reference to the action that started it, SHRDLU can tell why it performed an action, simply by following the `parent_event` link.
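As a sketch (my own illustration, not actual NLI-GO code), answering "why?" is then little more than a lookup along that link:

# Toy event memory; the descriptions and ids are invented.
events = {
    "e1": "stack up the red cube, a large red block and a large green cube",
    "e2": "get rid of the green pyramid",   # started by e1
    "e3": "pick up the green pyramid",      # started by e2
}
parent_event = {"e3": "e2", "e2": "e1"}

def why(event_id):
    parent = parent_event.get(event_id)
    return events[parent] if parent else "I did it on my own initiative"

print(why("e3"))  # "get rid of the green pyramid" -> "TO GET RID OF IT"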

SHRDLU has a simple discrete sense of time. Each time a basic action is performed, the internal time is updated by 1.

Ellipsis

To understand the question "Why?", an understanding system needs to find out what is meant here, exactly, as if asking: "Why what?" Clearly part of the question is left out. This is called ellipsis.

NLI-GO treats this problem by declaring the ellipsis explicitly in the grammar. Here is the grammar rule that deals with it:

{ rule: interrogative_clause(P1) -> 'why',
  ellipsis: [prev_sentence]//mem_vp(P1),
  sense: go:intent(why, P1) $mem_vp }

The rule `interrogative_clause(P1) -> 'why'` rewrites the clause to the single word "why". The value of `ellipsis`, `[prev_sentence]//mem_vp(P1)`, is a path that leads to the missing part of the sentence.

In trying to match this rule, NLI-GO follows the path, starting at the current `interrogative_clause` node of the active syntax tree:

  • [prev_sentence] tells it to visit the previous sentence in the dialog; this is "When did you pick it up". NLI-GO is now at the root of this sentence
  • //mem_vp tells it to visit all mem_vp nodes anywhere below the current node 

The syntax is somewhat similar to that of XPath (used to navigate XML).

If such a `mem_vp` is found, the node is copied into the active sentence. The syntax tree of "Why?" is thereby extended, and the complete sentence now looks like this:

Why did you pick it up?

Resolving "it"

Something that may seem self-evident, but isn't really, is that "it" in the new sentence must refer to the same green pyramid as "it" in the previous sentence did.

NLI-GO should not try to resolve this "it" anew; it should simply inherit the value of "it" that had been resolved in the "When" question. If it didn't, "it" might resolve to another object, and that is counter-intuitive.

And yes, this is of course exactly what happened at an earlier stage of the software: "it" first referred to SHRDLU, which is not even an "it", but NLI-GO had no idea.

The response: "it" again

The response to the question should be

To get rid of it

However, the response that NLI-GO initially gave was this:

To get rid of the green pyramid

It proved not so simple to get NLI-GO to just use a pronoun.

Centering theory

Interaction 25 is the first interaction that uses a pronoun in its response. Why would SHRDLU suddenly use a pronoun in its response? To the reader this is obvious. The object had been referred to as "it" before, so it is natural to keep referring to it that way. The fact that an object may be in the spotlight of a dialog has been worked out in centering theory. If an object was the preferred center of the previous sentence, it must be the preferred center in the current sentence again. Also: a sentence's subject is more likely to become a new center than an object, and an object in turn is more likely than any other entity.
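A minimal sketch of those two rules as I understand them (my own illustration, not NLI-GO's actual implementation):

# Rule 1: an entity that was the preferred center of the previous
# sentence keeps its pronoun.
def refer(entity, previous_center):
    return "it" if entity == previous_center else "the " + entity

# Rule 2: when a new center is chosen, subjects outrank objects, and
# objects outrank other entities.
ROLE_RANK = {"subject": 0, "object": 1, "other": 2}

def new_center(mentions):
    # mentions: (entity, role) pairs from the current sentence
    return min(mentions, key=lambda m: ROLE_RANK[m[1]])[0]

print(refer("green pyramid", "green pyramid"))  # -> "it"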

Concluding remarks

This seemingly innocuous sentence has kept me busy for four months. I had not seen it coming. It has been both a source of joy, for learning new concepts, and a continuous source of frustration. I needed to rewrite existing structures to enable these new features, and this broke several existing interactions. Still, the new framework is not at all robust; it needs a lot of work to make it simple to use. At the same time, I still feel this is a great field to work in, and it will eventually provide a level of determinism that machine learning will never reach. So it's worth it.



Thursday, May 13, 2021

Designing Software with the Needs of Human Beings in Mind

Some developers seem to think that writing tests for a software application is the best and only thing needed to write good software. It isn't. There's a lot more to it than that.

When you design or architect a piece of software, you need to be aware of the needs of all people that are somehow involved in it.

By needs, I mean the human needs or concerns that form the purpose of the software's existence. To write good software you need to be aware of these.

Needs

A need is something that is essential to the well-being of a person. I use the term need in the sense of Maslow's hierarchy of needs. People have a variety of needs and they expect some of these needs to be fulfilled by the software application.

These are some of the human needs that are related to software:

  • Basic needs
  • Safety
  • Esteem / Respect
  • Belonging
  • Creativity

There are several types of people involved in your software:

  • end-users: the people that actually use your software
  • developers: you, your fellow developers, others
  • business: managers, CEO's, business oriented people

These people have many different needs, most of which you don't know about. But in some way they are reflected in different software aspects.

Aspects

Software has a number of needs-related characteristics. Let's call them aspects. These characteristics can be reduced to one or more basic human needs.  

Functionality

This aspect simply says that the software should do what it's supposed to do. Have lots of features and no bugs. 

The functions of a software application help a person to fulfill all kinds of needs: from buying food to spiritual growth. This is part of why it's so interesting to be a developer: your work can be used in almost all fields of life.

Lack of functionality brings disappointment to users. They will try competing products, if possible. Good software helps people to reach their goals and fulfill their needs.

Needs: all types of needs
Elements: features, usability, documentation, short time-to-market, bug-free
People: end-users, business

Speed

Processing speed is a pervasive aspect of software. Faster is almost always better. Why? Because whatever you want from a piece of software, your need will be fulfilled earlier if the software is faster. 

Slow software leads to frustration, an emotion that expresses that needs are not fulfilled fast enough. Only fast software can bring people in a state of flow.

Speed is tricky, because at the start of a project, the software you create is fast enough. It just becomes slower and slower as you add more features, or add more data. So you need to think ahead. Create a design with the working software in mind. Picture your customers using it, and estimate the amount of data they will be processing. Plan for scalability. 

Needs: all types of needs
Elements: hardware, architecture, scalability, optimization
People: end-users, developers, business

Maintainability

Maintainability is about the time it takes for someone to change the software. The time it takes to learn it; the ease of making changes.

If your software is maintainable, you and other developers are able to be creative while developing it further. Maintaining software builds a community of developers and this leads to a sense of belonging. Working on well-maintainable software lifts your self-esteem and the esteem of the other developers.

Needs: esteem, belonging, creativity
Elements: elegance, simplicity, architecture, design patterns, documentation, tests, code conventions, logging

People: developers, business

Security

Security is both a software aspect and a basic human need. People need to be safe, and feel that their software is safe to use.

Needs: safety
Elements: privacy, risk analysis, input checking
People: end-users, business

Low Cost

And then there is cost: software development costs money, of course. I call this "low cost", because the lower the cost, the better.

You can also think of cost in a broader sense: the negative effects of the software to society, and to the environment.

The cost of software is not simply reducible to a human need. 

Needs: any, indirectly
Elements: timely delivery, cost of development, cost of maintenance, environmental cost
People: end-users, business

About these aspects

These aspects are not cast in stone. I just made them up ;) I picked them to raise your awareness of these things. You may find another subdivision more useful.

The aspects are orthogonal: they don't overlap, and one cannot be reduced to another.

But even though the aspects are orthogonal, they are not independent. Every new feature may reduce the speed. All aspects need maintenance. Everything that must be made costs money.

Note the "people" parts of the aspects: if "developers" are not listed among them, you will need to place yourself in the position of other people to understand their needs. Security, for example, is not directly important to the developer; the developer will need to consider what the end-user and the business consider important when it comes to security.

In architecture design these aspects are usually called "stakes", and the people involved, stakeholders. The point this blog tries to make is that these stakes are based on human needs, which makes it all a bit more personal.

Balance

You will find yourself thinking: yes, I understand, these are all worthy causes, but it takes way too much time to build software that fulfills all these aspects!

You are right. And this is exactly the point! First you need to be aware of all these aspects. Then you need to weigh all of them. Then you need to come to the right balance: which aspects are more important and which are less important, in the current project? You can just keep this in mind; or you might put it in writing if you intend to set a standard.

But you need to consider these aspects early on in the software trajectory. They are architectural decisions. Changing your stance on these issues may be very hard and costly.

Balancing aspects is the most important role of a software designer or architect.

Finally

You may have noticed that writing tests is just a small aspect of software design. It helps to write maintainable code, but it is not the whole picture. Some projects benefit more from focusing on other aspects. And in some projects automated tests are just a waste of time. It all depends.



Monday, February 1, 2021

How do I write my own parser? (for JSON)

 This blog post first appeared on http://techblog.procurios.nl/k/n618/news/view/14605/14863/how-do-i-write-my-own-parser-(for-json).html

05 December 2008

If no parser is available for the file you need to read, writing one yourself may be easier than you think. What file structures are manageable? What would be the design of such a parser? How do you make sure it is complete? Here we describe the process for building a JSON parser in C#, and release the source code.

By Patrick van Bergen

[Download the JSON parser / generator for C#]

The software is subject to the MIT license: you are free to use it in any way you like, but it must keep its license.

For our synchronisation-module (which we use to synchronize data between diverse business applications) we chose JSON for data exchange. JSON is just a little better suited for a PHP web-environment than XML, because:

  • The PHP functions json_encode() and json_decode() allow you to convert data structures from and to JSON strings
  • JSON can be sent directly to the browser in an Ajax request
  • It takes up less space than XML, which is important in server-to-browser traffic.
  • A JSON string can be composed of only ASCII characters, while still being able to express all UNICODE characters, thus avoiding all possible conversion issues a transport may carry.

So JSON is very convenient for PHP. But of course we wanted to be able to synchronize with Windows applications as well, and because C# is better suited to that environment, this part of the module was written in that language. The .Net framework just didn't have its own JSON parser / encoder, and the open-source software written for this task often contained a whole package of classes and constraints, and sometimes the JSON implementation wasn't even complete.

We just wanted a single class that could be imported and that used the most basic building blocks of our application: the ArrayList and the Hashtable. Also, all aspects of JSON would have to be implemented, there should be a JSON generator, and of course it should be fast.

More reasons to write our own parser weren't necessary. Writing a parser happens to be a very satisfying thing to do. It is the best way to learn a new programming language thoroughly, especially if you're using unit testing to guarantee that the parser / generator matches the language specification exactly. JSON's specification is easy to find: the website http://www.json.org/ is as clear as one could wish for.

You start by writing the unit tests. You should really write all tests before starting the implementation, but such patience is seldom found in a programmer. You can at least start by writing some obvious tests that help you to create a consistent API. This is an example of a simple object test:

string json;
Hashtable o;
bool success = true;

json = "{\"name\":123,\"name2\":-456e8}";
o = (Hashtable)JSON.JsonDecode(json);
success = success && ((double)o["name"] == 123);
success = success && ((double)o["name2"] == -456e8);

Eventually you should write all tests needed to check all aspects of the language, because your users (other programmers) will assume that the parser just works.

OK. Parsers. Parsers are associated with specialized software: so-called compiler compilers (of which Yacc is the most well known). Using this software will make sure that the parser is fast, but it does not do all the work for you. What's more, it can be even easier to write the entire parser yourself than to do all the preparatory work for the compiler compiler.

The compiler compiler is needed for languages with a high level of ambiguity. A language expression is parsed from left to right. If a language contains many structures that cannot be identified at the start of the parse, it is advisable to use a tool that is able to manage the emerging complexity.

Unambiguous languages are better suited to building the parser manually, using recursive functions to process the recursive nature of the language. The parser looks ahead one or more tokens to identify the next construct. For JSON it is even sufficient to look ahead a single token. This classifies it as an LL(1) language (see also http://en.wikipedia.org/wiki/LL_parser).

A parser takes as input a string of tokens. Tokens are the most elementary building blocks of a language, like "+", "{", "[", but also complete numbers like "-1.345e5" and strings like "'The Scottish highlander looked around.'". The parse phase is usually preceded by a tokenization phase. In our JSON parser this step is integrated into the parser, because in almost all cases it is enough to read just the next character in the string to determine the next token. This saves the allocation of a token table in memory.

The parser takes a string as input and returns a C# data structure, consisting of ArrayLists, Hashtables, a number of scalar value types, and null. The string is processed from left to right. An index (pointer) keeps track of the current position in the string at any moment. At each level of the parse process the parser performs these steps:

  • Look ahead 1 token to determine the type of the next construct
  • Choose the function to parse the construct
  • Call this function and integrate the returned value in the construct that is currently built.

A nice example is the recursive function "ParseObject" that parses an object:

protected Hashtable ParseObject(char[] json, ref int index)
{
    Hashtable table = new Hashtable();
    int token;

    // {
    NextToken(json, ref index);

    bool done = false;
    while (!done) {
        token = LookAhead(json, index);
        if (token == JSON.TOKEN_NONE) {
            return null;
        } else if (token == JSON.TOKEN_COMMA) {
            NextToken(json, ref index);
        } else if (token == JSON.TOKEN_CURLY_CLOSE) {
            NextToken(json, ref index);
            return table;
        } else {

            // name
            string name = ParseString(json, ref index);
            if (name == null) {
                return null;
            }

            // :
            token = NextToken(json, ref index);
            if (token != JSON.TOKEN_COLON) {
                return null;
            }

            // value
            bool success = true;
            object value = ParseValue(json, ref index, ref success);
            if (!success) {
                return null;
            }

            table[name] = value;
        }
    }

    return table;
}

The function is only called if a look-ahead has determined that the construct starts with an opening curly brace, so this token may be skipped. Next, the string is parsed as long as the closing brace is not found and the end of the string is not reached (the latter is a syntax error, but one that needs to be caught). Between the braces there are a number of "name": value pairs, separated by commas. This algorithm can be found literally in the function, which makes it very insightful and thus easy to debug. The function builds a Hashtable and returns it to the calling function. The parser mainly consists of these types of functions.

If you create your own parser, you will always need to take into account that the incoming string may be grammatically incorrect. Users expect the parser to be able to tell on which line the error occurred. Our parser only remembers the index, but it also contains an extra function that returns the immediate context of the position of the error, comparable to the error messages that MySQL generates.
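Such a context function can be as simple as this (a Python sketch of the idea, not the original C# code):

# Return the text just before the error position, as a hint to the user.
def error_context(json_text, index, width=20):
    start = max(0, index - width)
    return "Syntax error after: ..." + json_text[start:index]

print(error_context('{"name": 123,, "x": 1}', 14))
# Syntax error after: ...{"name": 123,,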

If you want to know more about parsers, it is good to know that there exists a standard work on this subject, which recently (2006) saw its second edition:

Compilers: Principles, Techniques, and Tools, Aho, A.V., Sethi, R., and Ullman, J.D. (1986)

Semantic web marvels in a relational database - part II: Comparing alternatives

This blog article first appeared on http://techblog.procurios.nl/k/n618/news/view/34441/14863/Semantic-web-marvels-in-a-relational-database---part-II-Comparing-alternatives.html

15 June 2009

In this article I will compare the basic technical details of current relational database alternatives.

By Patrick van Bergen

In the first article I explained the relational database mapping of our semantic web implementation. In this article I will place this work into perspective by exploring related techniques.

In the last few years developers have been looking for ways to overcome certain shortcomings of relational database systems. RDBMSes are general-purpose data stores that are flexible enough to store any type of data. However, there are several cases in which the relational model proves inefficient:

  • An object has many attributes (100+), many of which are optional. It would be wasting space to store all these attributes in separate columns.
  • Many attributes with multiple values. Since each of these attributes needs a separate table, the object data will be distributed over many tables. This is inefficient in terms of development time, maintenance, as well as query time.
  • Class inheritance. Since most software is Object Oriented these days the objects in code will need to be mapped to the database structure. In the case of class inheritance, where attributes are inherited from superclasses, it is a big problem to store objects in, and query them from, an RDBMS efficiently.
  • Types and attributes are not objects. In an RDBMS the data of a model is separate from the metadata (attribute names, datatypes, foreign key constraints, etc.). Types and attributes are not like normal objects. This is inefficient in areas where types and attributes need to be added, changed and removed regularly, just like any other data. It is inefficient to write separate code to manipulate and query types and attributes. In short, first order predicate logic no longer suffices for many new applications. The second order is needed.
  • Scalability. This is an aspect often named as the reason to leave the RDBMS. However, since relational databases have been optimized for decades, they do scale. Nevertheless, in this age of global, real-time web applications, the techniques provided by RDBMS manufacturers may prove to be inadequate, or simply too expensive.

In the following I will provide a simple understanding of the basic principles of alternative database techniques, along with some pointers to more in-depth information. I hope you will forgive me my non-expert view on these subjects; for detailed information on any given subject, look elsewhere. This article is meant to be just a quick overview, aimed at conveying the basic concepts through examples.

RDBMS, or Row-Oriented database

In a relational database management system, pieces of data are grouped together in a record. In this article I will consider the case where the data stored is meant to represent the attributes of an object. Seen this way, a record is a group of attributes of an object. Here's an example of such a table of objects:

object id | color | width | height | name
--------- | ----- | ----- | ------ | --------
3         | red   | 100   | 100    | my box
4         | green | 50    | 500    | old beam



Need more attributes? Add more columns. Need an attribute with multiple values? Add a table and link it to the first. The RDBMS chooses speed over flexibility. Speed was a big deal 40 years ago, when this database type was designed. And it still is a big deal today. For large amounts of simple data, there is absolutely no need to leave this model.

Semantic net

Storing semantic information as triples is an old idea in the field of Knowledge Representation. As early as 1956, semantic nets were used for this purpose. In this technique the relations between objects are represented by plain labels. Each "record" stores only a single attribute, or one element of an array attribute. Most notable are the absence of metadata and the fact that object data is distributed over many records.

object id | predicate | value
--------- | --------- | --------
3         | color     | red
3         | width     | 100
3         | height    | 100
3         | name      | my box
4         | color     | green
4         | width     | 50
4         | height    | 500
4         | name      | old beam


Need more attributes? No need to change the table structure. Need an attribute with multiple values? Same thing. 
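
A minimal sketch of such a triple table (the names are hypothetical) shows why the schema never needs to change:

CREATE TABLE `triple` (
    `object_id`    int(11) NOT NULL,
    `predicate`    varchar(255) NOT NULL,
    `value`        longtext
);

-- a new attribute, or a second value for an existing attribute,
-- is just another row; no ALTER TABLE is needed:
INSERT INTO `triple` (`object_id`, `predicate`, `value`)
VALUES (3, 'material', 'cardboard'),
       (3, 'color', 'dark red');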

Entity-Attribute-Value

The Entity-Attribute-Value model of knowledge representation uses some form of triples, just like the semantic web. Wikipedia describes its primary use as follows: "Entity-Attribute-Value model (EAV), also known as object-attribute-value model and open schema is a data model that is used in circumstances where the number of attributes (properties, parameters) that can be used to describe a thing (an "entity" or "object") is potentially very vast, but the number that will actually apply to a given entity is relatively modest. In mathematics, this model is known as a sparse matrix."

Attribute metadata is stored in separate attribute tables, which are not triples. EAV is a sort of middle ground between semantic nets and the semantic web: attributes have explicit properties, but these are fixed in number.

EAV can be used to model classes and relationships as in EAV/CR.

EAV is used in cloud-computing databases like Amazon's SimpleDB and Google's App Engine.

object id  attribute id  value
3          1             red
3          2             100
3          3             100
3          4             my box
4          1             green
4          2             50
4          3             500
4          4             old beam

attribute id  name    datatype  unique
1             color   char(6)   true
2             width   double    true
3             height  double    true
4             name    string    true


Need more attributes? Add them in the attribute table. Attributes with multiple values? No extra work. The schema of the attributes is stored in the database explicitly, but attributes are treated differently from objects.
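
A minimal sketch of the two tables above, with a query that fetches all attributes of object 3 together with their names (table and column names are my assumptions):

CREATE TABLE `attribute` (
    `attribute_id`    int(11) NOT NULL,
    `name`            varchar(255) NOT NULL,
    `datatype`        varchar(32) NOT NULL,
    `unique`          tinyint(1) NOT NULL,
    PRIMARY KEY (`attribute_id`)
);

CREATE TABLE `object_attribute` (
    `object_id`       int(11) NOT NULL,
    `attribute_id`    int(11) NOT NULL,
    `value`           longtext,
    KEY (`object_id`, `attribute_id`)
);

SELECT a.`name`, oa.`value`
FROM `object_attribute` oa
JOIN `attribute` a ON a.`attribute_id` = oa.`attribute_id`
WHERE oa.`object_id` = 3;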

Column-Oriented databases

From Wikipedia: "A column-oriented DBMS is a database management system (DBMS) which stores its content by column rather than by row."

object id  color
3          red
4          green

object id  width
3          100
4          50

object id  height
3          100
4          500

object id  name
3          my box
4          old beam

Google's BigTable is based, in part, on column-orientation. Its tables use reversed URIs as object and column identifiers, and have a "third dimension" in that older revisions of the data are stored in the same table.
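
To get a feeling for the trade-off: reconstructing a complete "row" from the tables above takes one join per attribute. A sketch, with hypothetical table names (a real column store performs this reassembly internally, far more efficiently):

SELECT c.`object_id`, c.`color`, w.`width`, h.`height`, n.`name`
FROM `object_color` c
JOIN `object_width`  w ON w.`object_id` = c.`object_id`
JOIN `object_height` h ON h.`object_id` = c.`object_id`
JOIN `object_name`   n ON n.`object_id` = c.`object_id`;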

Correlation databases

A correlation database is "value based": every constant value is stored only once. All values are stored together, except that values are grouped by datatype. All values are indexed. "In addition to typical data values, the data value store contains a special type of data for storing relationships between tables...but with a CDBMS, the relationship is known by the dictionary and stored as a data value."

I have not found a clear example of what this data structure looks like, but we can infer that the internal structure must look something like the following. Note: I may be completely wrong here!

The values table (actually there is one table per major datatype: integers, strings, dates, etc.):

value id  value
1         red
2         green
3         100
4         50
5         500
6         my box
7         old beam
8         <object 1>
9         <object 2>
10        <relationship color>
11        <relationship width>
12        <relationship height>
13        <relationship name>

Then there is at least one table containing the relationships (or "associations") between the values. The relationships are themselves stored as values:

value id 1  association  value id 2
8           10           1
8           11           3
8           12           3
8           13           6
9           10           2
9           11           4
9           12           5
9           13           7
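
To make the structure concrete, here is how "give me the names of all red objects" might be answered under this inferred layout; every table name and id below is my guess, not documented CDBMS behaviour:

SELECT name_v.`value`
FROM `association` color_a
JOIN `association` name_a ON name_a.`value_id_1` = color_a.`value_id_1`
JOIN `value` name_v       ON name_v.`value_id`   = name_a.`value_id_2`
WHERE color_a.`association` = 10    -- <relationship color>
  AND color_a.`value_id_2`  = 1     -- "red"
  AND name_a.`association`  = 13;   -- <relationship name>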


Hierarchical model, Network model, Navigational database

For the sake of completeness I have to mention these models. The hierarchical model stores tree-like structures only, requiring each piece of data to have a single "parent". The network model allows a piece of data to have multiple parents. Both models were superseded by the relational model, but they are still used for special-purpose applications. A navigational database allows you to traverse such trees / DAGs by following paths.

Object-Oriented databases

In an object-oriented database all attributes of a class are stored together. From what I've read on the internet, I conclude that the actual storage structure of an OODBMS is something of an implementation detail. This means that the performance characteristics of the database depend heavily on the implementation chosen. Development of this model was initially in the hands of the ODMG, but control was transferred to the Java Community Process, which built the Java Data Objects specification. This specification names the conditions for such a database, but does not prescribe an implementation.

Some special properties:

  • Class inheritance is supported in the data model.
  • Object nesting: an object can contain (not just link to) other objects.

Mapped to an RDBMS via a so-called ORM (Object-Relational Mapping), objects are commonly stored in a standard relational way: one column per (single-valued) attribute. To implement inheritance, the columns of all base classes of an object are joined. This can be done at design time (create one big table containing the columns of all parent classes) or at query time (join the parent class tables), as the sketch after the tables below illustrates.

class id  object id  color  width  height  name
101       3          red    100    100     my box
101       4          green  50     500     old beam

class id  class name  parent class
101       Object
102       Bar         101
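
A minimal sketch of the query-time variant: assume hypothetical tables `class_object` (holding the attributes declared on Object) and `class_bar` (holding only what Bar adds, say `name`); fetching a complete Bar instance then joins the parent class table:

SELECT o.`object_id`, o.`color`, o.`width`, o.`height`, b.`name`
FROM `class_object` o
JOIN `class_bar` b ON b.`object_id` = o.`object_id`;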


Document based databases

A document-based database is a different beast altogether. It lacks a database schema completely, and a complete object is stored in a single cell. In the case of CouchDB, this is done by encoding the object (or: document) in JSON. Real-time querying of the source table is thus impossible; one needs to create views on the data.

object id  document
3          {"color":"red","width":100,"height":100,"name":"my box"}
4          {"color":"green","width":50,"height":500,"name":"old beam"}


Triplestores

Some triplestores are publicly available, and they commonly have an RDF interface. Their performance can be measured using the Lehigh University Benchmark (LUBM). The most advanced open source triplestores are Sesame and ARC.

object id  attribute id  value
3          101           red
3          102           100
3          103           100
3          104           my box
4          101           green
4          102           50
4          103           500
4          104           old beam
101        104           color
102        104           width
103        104           height
104        104           name

Very little has been made public about the way triplestores are implemented in a relational database. A laudable exception is the Jena2 database schema. Unfortunately, that schema appears to be very inefficient, since the URIs are not indexed but used literally.

A charmingly simple implementation, which seems resource intensive, was made for expasy4j: triples are stored in a single table, but for query speed a separate column is reserved for each datatype.
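
The idea looks roughly like this (a sketch of the approach, not the actual expasy4j schema); each row fills exactly one of the value columns, leaving the others NULL:

CREATE TABLE `triple` (
    `subject`         varchar(255) NOT NULL,
    `predicate`       varchar(255) NOT NULL,
    `value_string`    varchar(255),
    `value_text`      longtext,
    `value_integer`   int(11),
    `value_double`    double,
    `value_datetime`  datetime
);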

Another, somewhat better implementation was made for OpenLink Virtuoso: it uses indexed URIs, but places all constants in a single field with datatype "ANY".

Conclusion

I hope this article has shown you why developers are looking for alternatives to the familiar RDBMS, and which forms these alternatives have taken. The field is quite diverse at the moment, and developments are being made by many different parties. It will be interesting to see how this evolves and which alternative(s) will eventually become the successor of the relational database.


Semantic web marvels in a relational database - part I: Case Study

This blog article first appeared on http://techblog.procurios.nl/k/n618/news/view/34300/14863/Semantic-web-marvels-in-a-relational-database---part-I-Case-Study.html 

01 June 2009

You have heard about the semantic web. You know it is described as the future of the Web. But you may still be wondering how this vision will make your applications better. Can it speed up application development? Can it help you build complex data structures? Can you use object-oriented principles? This article shows how it can be done. And more.

By Patrick van Bergen

The semantic web is a framework developed by the W3C under supervision of Tim Berners-Lee. Its basic assumption is that data should be self-descriptive in a global way: data does not just express numbers, dates and text, it also explicitly expresses the types of relationship these fields have to their objects. Using this uniform data structure, it becomes easier to interchange data between different servers and, most of all, data can be made accessible to global search engines.

That is a big thing. But is that all? Can't you just provide an RDF import / export tool for your data and be done with it? Are there any intrinsic reasons why you would base your entire data structure on the semantic web?

In a series of two articles I will explain how we at Procurios implemented semantic web concepts, what the theoretical background of our implementation is, and what benefits a semantic web has over a traditional relational database. In this first article I explain how we implemented a semantic web in a relational database (we used MySQL), added an object-oriented layer on top, and even created a data revision control system from it.

Triples

In a classic relational database, data is stored in records. Each record contains multiple fields, holding data that may belong to some object. The relation between a field and the object it belongs to is not represented as data in the database; it is only available as metadata in the form of the column (name, datatype, collation, foreign keys). An object is not modelled explicitly, but rather via a series of linked tables.

A semantic web is a network of interrelated triples ("subject-predicate-object" triplets) whose predicates are part of the data themselves. Moreover, each object has an identifier that is not just an integer that means something only inside the database; it is a URI that may have a distinct meaning worldwide.

A triple is a record containing three values: either (uri, uri, uri) or (uri, uri, value). In the first form the triple relates one object to another, as in the fact "Vox inc. is a supplier" ("Vox inc.", "is a", and "supplier" are all semantic subjects identified by a URI). In the second form the triple links a constant value to a subject, as in "Vox inc.'s phone number is 0842 020 9090". A naive implementation would look like this:

CREATE TABLE `triple` (
    `subject`               varchar(255) NOT NULL,
    `predicate`             varchar(255) NOT NULL,
    `object`                longtext
);

This table provides a complete implementation of the semantic web. However, it is too slow to be used in any serious application. There are various ways in which this basic form can be optimized, but to my knowledge there is no best practice available. Several problems have to be solved:

  • How to identify a triple uniquely? (If your application needs this; the combination of subject, predicate and object is not in itself unique.)
  • How to search fast, given a subject and a predicate? ("Give me the names of this set of people.")
  • How to search fast, given a predicate and an object? ("Give me the persons whose name begins with `Moham`.")

To solve these problems we came up with the following changes:

  • Create a single triple index table that only stores triple ids.
  • Create separate triple tables for each of the major datatypes needed (varchar(255), longtext, integer, double, datetime)
  • The triple tables reference the index table by foreign key.
  • Add two extra indexes for the two ways the tables are used: a subject-predicate combined key and a predicate-object combined key.

Here's the triple index table (we are using MySQL):

CREATE TABLE `triple` (
    `triple_id`                int(11) NOT NULL auto_increment,
    PRIMARY KEY (`triple_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

and here's the triple table for the datatype "datetime" (the other datatypes are handled similarly):

CREATE TABLE `triple_datetime` (
    `triple_id`                int(11) NOT NULL,
    `subject_id`              int(11) NOT NULL,
    `predicate_id`          int(11) NOT NULL,
    `object`                   datetime NOT NULL,
    `active`                   tinyint(1) NOT NULL DEFAULT '1',
    PRIMARY KEY (`triple_id`),
    KEY (`subject_id`, `predicate_id`),
    KEY (`predicate_id`, `object`),
    CONSTRAINT `triple_datetime_ibfk_1` FOREIGN KEY (`triple_id`) REFERENCES `triple` (`triple_id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The table definition should speak for itself, except for the field "active". That field is not needed at this point, but I will need it in the section on data revision control below.

The predicate_id refers to a separate "uri" table where the full URIs of these predicates are stored. This is not strictly necessary; the URIs could be stored in the triple_longtext table as well.

The two combined keys have an interesting side effect: the application developer never again needs to be concerned about using the right keys; effective keys are present by default.

Building SQL queries by hand to query this triplestore is a daunting task; a special query language is needed to be effective. More about that below.

All data of a given object can be queried by selecting all triples with a given subject id (one query per datatype triple table). That seems inefficient, and it is: compared to the situation where an object can be stored in a single record, the triplestore is always slower. However, in a more complex situation a relational database requires you to join many tables to fetch all data. We use 5 separate queries (one per datatype table) to fetch all object data from the triplestore; this turned out to be faster than a single UNION over the five queries. The same 5 queries fetch all data of any desired number of objects. Here are the queries needed to fetch the object data of three objects identified by the ids 12076, 12077, and 12078:

SELECT `subject_id`, `predicate_id`, `object` FROM `triple_varchar` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `subject_id`, `predicate_id`, `object` FROM `triple_longtext` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `subject_id`, `predicate_id`, `object` FROM `triple_integer` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `subject_id`, `predicate_id`, `object` FROM `triple_double` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `subject_id`, `predicate_id`, `object` FROM `triple_datetime` WHERE `subject_id` IN (12076, 12077, 12078);

You can see that the object data is fetched without having to provide explicit type or attribute information: the type of an object is itself stored in one of its triples. This is useful in the case of inheritance, where the exact type of an object can only be determined at runtime.

Arrays and multilinguality

Many object attributes have an array datatype (an unordered set). To model these in a relational database you would need a separate table for each such attribute, and querying all attributes of a series of objects, including these array attributes, is far from easy. In the triplestore you model an unordered set as a series of triples with the same subject and predicate but a different object. When you query all object data, you get the array values the same way you get the scalar values.
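
For example, assuming a hypothetical predicate id 7 for "color", and assuming the rows with ids 101 and 102 were already created in the `triple` index table:

INSERT INTO `triple_varchar` (`triple_id`, `subject_id`, `predicate_id`, `object`)
VALUES (101, 12076, 7, 'red'),
       (102, 12076, 7, 'green');

-- both set members come back from the ordinary object-data query:
SELECT `object` FROM `triple_varchar`
WHERE `subject_id` = 12076 AND `predicate_id` = 7;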

Multilinguality is also a hassle in relational databases: for each attribute that needs to be available in more than one language, the table structure needs to be adjusted, and it is hard to avoid data duplication. In a triplestore you can treat a multilingual attribute almost like an array element; the only difference is that the predicates are similar but not the same. We use the following URIs to represent the language variants of an attribute: http://our-business.com/supplier/description#nl, http://our-business.com/supplier/description#en, http://our-business.com/supplier/description#de (in the tables these predicates are replaced by their integer ids for faster querying).
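
Fetching one language variant is then a matter of picking the right predicate id (the ids below are hypothetical):

-- 21 = description#nl, 22 = description#en, 23 = description#de
SELECT `object` FROM `triple_longtext`
WHERE `subject_id` = 12076 AND `predicate_id` = 22;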

Data revision control

Version control is pretty common for developers when it comes to storing previous versions of their code: it allows you to track changes, revert to a previous version, and work on the same file together. Yet when it comes to data, version control is very uncommon, and I think that is mainly because the overhead of creating such a system in a traditional relational database is huge.

One of the requirements for our framework was that some form of data-change history should be available. And when you think about it, it is actually really simple to keep track of all changes made to the data if you use triples, because from a version-control point of view, all that changes in each revision is that some triples are added and others are removed.

So all that is needed is two more tables: one to keep track of the revision data (who made the change, when, and a short description for future reference), and another to track all triples added or removed in the revision:

CREATE TABLE `revision` (
    `revision_id`              int(11) NOT NULL auto_increment,
    `user_id`                  int(11),
    `revision_timestamp`       int(11) NOT NULL,
    `revision_description`     varchar(255),
    PRIMARY KEY (`revision_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE IF NOT EXISTS `mod_lime_revision_action` (
    `action_id`                   int(11) NOT NULL AUTO_INCREMENT,
    `revision_id`                 int(11) NOT NULL,
    `triple_id`                   int(11) NOT NULL,
    `action`                   enum ('ACTIVATE', 'DEACTIVATE') NOT NULL,
    `section_id`                  int(11),
    PRIMARY KEY  (`action_id`),
    CONSTRAINT `revision_triple_ibfk_1` FOREIGN KEY (`revision_id`) REFERENCES `revision` (`revision_id`) ON DELETE CASCADE,
    CONSTRAINT `revision_triple_ibfk_2` FOREIGN KEY (`triple_id`) REFERENCES `triple` (`triple_id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Each time a user changes the data, a new revision is stored in the database, along with a compact description of the change and the list of all triples that were activated or deactivated. A triple that was already available in an inactive state is simply made active again; if no such triple is present, an active one is created. Triples are never really removed; they are only made inactive.

If you query the triplestore (the set of all triples), you need to ensure that only the active triples are queried.
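
In practice that means every query from the section above gets one extra condition; for example:

SELECT `subject_id`, `predicate_id`, `object`
FROM `triple_varchar`
WHERE `subject_id` IN (12076, 12077, 12078)
  AND `active` = 1;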

With this information, you can:

  • List all revisions made to the data, showing who made the change and when, along with a small description of the change.
  • Revert to a previous revision by performing the intervening revisions backwards: activate the deactivated triples, and deactivate the activated ones (see the sketch after this list). It is even possible to undo a single revision that is not the last one, but beware that later revisions may depend on it.
  • Work together on an object by merging the changes made by two users, using the difference in data between the start and end revisions.
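
As a minimal sketch, undoing a single revision (the id 42 is hypothetical) could look like this, repeated for each datatype triple table:

UPDATE `triple_datetime` t
JOIN `mod_lime_revision_action` a ON a.`triple_id` = t.`triple_id`
SET t.`active` = IF(a.`action` = 'ACTIVATE', 0, 1)
WHERE a.`revision_id` = 42;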

 


Object database

Businesses are used to working with objects; a web of data needs to be structured before it can be used for common business purposes. To this end we decided to build an object-oriented layer on top of the triplestore. Even though the Web Ontology Language (OWL) was designed for this purpose, we did not use it: we needed only a very small subset, we wanted complete freedom in our modelling activities, and processing speed was very high on our priority list. I will not cover all the details here, since it is a very extensive project, but I want to mention the following features:

  • The database was set up as a repository: no direct database access is possible for the application developer. Object creation, modification, destruction, and querying are done via the repository API. This provided the OOP principles of information hiding and modularity.
  • Object types could be associated with PHP classes. This is not a requirement, but it proved really easy to generate object types from PHP classes. This provided us with the principle of polymorphism.
  • Not only are simple objects modelled as objects (a set of triples having the same subject), but object types are as well. Furthermore, the attributes of the types are themselves modelled as objects. Objects and their types can be used in the same query.
  • Object types can be subtyped. The triplestore allows us to query objects of a given type and all its subtypes in a straightforward way.
  • The attributes of objects can be subtyped as well. This allows you to add datatype restrictions to the attributes of subtypes that were not applicable higher up the type hierarchy.

These features are very powerful. It is possible to build a real object database using only triples as the data structure. Types and attributes are treated the same as normal objects, which means that the same code can be used to manipulate normal data as well as metadata. Implementing inheritance is also relatively easy, since object data is no longer chunked into single rows.

Query language

After some time we felt that the simple queries we were performing on the object database were too constraining; we wanted the same power that SQL provides. On top of that, since we continue to use normal relational tables as well, the object queries needed to be able to combine the object database with the relational tables. For these reasons the semantic web query language of choice, SPARQL, was insufficient for our purposes. We now build SQL-like queries using method chaining on a query object; the object then creates the SQL query.

I mention this because you really need to build or use a new query language when you start working with either a triplestore or an object database. The underlying triple store is too Spartan for an application developer: the lower-level SQL queries consist of many self-joins, connecting the subject of one triple to the object of another, and are very hard to read.
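
To illustrate, the generated SQL for a query as simple as "give me the names of all red objects" might look like this (the predicate ids 7 and 8 are hypothetical):

SELECT name_t.`object`
FROM `triple_varchar` AS color_t
JOIN `triple_varchar` AS name_t ON name_t.`subject_id` = color_t.`subject_id`
WHERE color_t.`predicate_id` = 7       -- color
  AND color_t.`object` = 'red'
  AND name_t.`predicate_id` = 8        -- name
  AND color_t.`active` = 1
  AND name_t.`active` = 1;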

Afterword

I wrote this article because I think this sort of detailed information about emerging data structures is lacking on the internet. It is also great to work for a company (Procurios!) that agrees with me that knowledge should be given back to the community where possible. Gerbert Kaandorp of Backbase once asked me very seriously what I had done to promote the semantic web, as if it were some kind of holy mission. I hope this article has made a small contribution, and that it inspires some of you to build your own semantic-web-based object datastore. Let me know what you think!
