Science Forums

A Paper on Computational Linguistics


Introduction to Computational Linguistics

The idea of talking to a computer using common, everyday speech has intrigued most of us at one time or another. Conversing in a natural language with a computer is the goal of computational linguistics. There have been many attempts, from various angles, at writing programs that interpret a natural language, but none has met with complete success. In fact, most address only one major aspect of the problem, while admitting to flaws that generally pertain to the other approaches. However, we have learned much from these Natural Language Processing programs, or NLPs. Most importantly, we have learned what a good NLP should do, and we have identified many of the problems facing us.

The ideal NLP should be a reflection of what another person does during a conversation. The NLP must listen, understand or interpret the conversation, and then respond in an intelligent manner.

Therefore, a ‘good’ NLP should accomplish as much of the following as possible. It should provide a means of obtaining the input sentence or paragraph. Next it must apply a method of interpretation, which requires many things. First the sentence(s) must be parsed, tokenizing the words and sentences. Then we use the lexicon to correct spelling errors, to locate the various identifiers or quantifiers of each word, and to list the possible meanings of each. We then check for syntactic validity to eliminate ‘nonsense’ input and to supply any missing words or clarifying sentence structures. Then we must do the same for semantic issues.
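The first interpretation steps above can be sketched in a few lines of code. This is only a toy illustration: the tokenizer, the lexicon contents, and the sense labels are all invented for the example, not taken from my actual implementation.

```python
# Sketch of the first interpretation steps: tokenize the input, then
# look each token up in a (toy) lexicon that maps a spelling to its
# candidate senses.  All entries here are illustrative.

import re

TOY_LEXICON = {
    "fly": ["Fly(N1)", "Fly(V1)"],   # the insect vs. the action
    "birds": ["Birds(N1)"],
    "can": ["Can(V1)", "Can(N1)"],
}

def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def lookup(tokens, lexicon):
    """Return the candidate senses for every token (empty list if unknown)."""
    return {t: lexicon.get(t, []) for t in tokens}

tokens = tokenize("Birds can fly.")
senses = lookup(tokens, TOY_LEXICON)
# 'fly' is still ambiguous at this stage -- exactly the situation the
# later syntactic and semantic checks are meant to resolve.
```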

At this point we should have an input sentence(s) with all the words identified by part-of-speech and a rough layout of what the input concept is, that is to say we have the verb, noun, subject, predicate, etc. all identified. Unfortunately, at this point we often still have many issues unresolved, which usually results in multiple meaning combinations, when we really want just one, namely, the correct or intended one.

And since every input sentence will generally have multiple interpretations, one of our main goals is to reduce the erroneous interpretations. After we have evaluated the input with our syntax checker, and then reduced our various interpretations even more with our admittedly poor semantics tests, my methodology utilizes a knowledge base to further eliminate any erroneous interpretations. Of course, if we still fail to resolve our input sentence, then we are reduced to asking the user for clarification. This is the only logical way to resolve some of the more difficult interpretive issues that arise.

It should be noted that we (people) do the same thing when we fail to understand or are confused by what somebody says to us. When we first encounter the sentence and begin identifying the words, we will generally have multiple interpretations. This is to be expected since part of our goal is to find the one and only correct interpretation. Keep this in mind while examining the methodologies discussed in the following sections.


The Lexicon

What is a lexicon? The lexicon, or dictionary, is actually fairly self-explanatory. The lexicon is an alphabetical listing of ALL words in the English language, i.e., a dictionary. The lexicon that I have employed has a few differences from what you would find in a standard dictionary. The primary difference is that I have a separate entry for each word definition or meaning.

This expansion is necessary for several reasons. I feel that we need to do more than simply deal with the words in a language. What we really want to deal with are the ideas and concepts that are expressed by the words. And by creating word nodes with relatively unique meanings we not only simplify our code, we also make our processes easier to understand and accomplish.

While this approach definitely increases the number of nodes substantially, the reduction in nodal connections and the savings in programming complexity make this approach far more desirable. Also, it is my belief that since a word used as a verb will have one set of connections while the same word, used as a noun, will be connected to a different set of words (ex. fly, as in a bug, vs. fly, as in to fly), we should treat them as different words. Basically, we want to consider a word with more than one meaning to be different words with the same spelling.

The file structure of the lexicon is also greatly simplified by this approach. We now have only one meaning to deal with, as opposed to several. My lexical entries look like the following example.


Word(Part of Speech and Number), Properties Field, Definition
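For instance, following the format above, the entry for one sense of the word cat might look like this (the properties field shown is illustrative, using the flag codes from the chart later in this section):

```
Cat(N2), APC00SR000, A(IndArt1) Small(Adj1,Adj2) Furry(Adj3,Adj4,Adj5) Mammal(N2)
```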









Note that by incorporating the part of speech into the word and a different number for each definition, we have created a unique identifier for each word/definition entry. Also note that we further identify each word in the definition by a similar technique. We do this to uniquely identify interpretations when we are re-defining our words, as discussed later. Thus every word in our lexicon, whether it is an entry or is used in a definition, is uniquely identified.
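The identifier scheme can be sketched as a pair of helper functions. These helpers are my own illustration of the Word(POSn) convention described above, not part of the described program.

```python
# Sketch: building and splitting the unique identifier formed by fusing
# the word, its part of speech, and a per-definition number.

def make_entry_id(word, pos, number):
    """Fly + N + 1 -> 'Fly(N1)'."""
    return f"{word}({pos}{number})"

def split_entry_id(entry_id):
    """'Fly(N1)' -> ('Fly', 'N', 1).  Single-sense identifiers only."""
    word, rest = entry_id.rstrip(")").split("(")
    pos = rest.rstrip("0123456789")
    number = int(rest[len(pos):])
    return word, pos, number

assert make_entry_id("Fly", "N", 1) == "Fly(N1)"
assert split_entry_id("Fly(N1)") == ("Fly", "N", 1)
```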

This uniqueness based upon definition is very important. It allows us to express each sentence or idea uniquely, and we have a relatively unique definition for each word. This in turn simplifies word evaluation and reduces processing complexity and time.

The word properties flag field gives important additional information about each word. We can thus identify and distinguish words that have various properties such as animate vs. inanimate vs. not applicable. The following chart lists the field properties that I have implemented as of this writing.


Vocabulary Record Flags_Array(10) Descriptions


Flag 1:N/A 0, Animate A, Inanimate I, Either B

Flag 2:N/A 0, A Physical Obj P, Not Physical Obj N, Either B

Flag 3:N/A 0, Abstract A, Concrete C, Either B

Flag 4:N/A 0, A Physical Act P, A Non-Physical Act N, Either B

Flag 5:N/A 0, Past Tense P, Current or Future Tense C, Either B

Flag 6:N/A 0, Plural P, Singular S, Either B

Flag 7:N/A 0, Root Word R, Not Root Word N

Flag 8:N/A 0, Math Numeric M

Flag 9:N/A 0, Verb Priority Number

Flag 10:End Node 1, Otherwise 0
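One possible in-memory encoding of this chart is sketched below. The field names are my paraphrase of the flag descriptions, and the example flag string is invented.

```python
# Decode a 10-character flag string using the chart above.  Flags 1-8
# are coded letters, flag 9 is a verb priority number, and flag 10
# marks an end node.

FLAG_MEANINGS = [
    ("animacy",      {"0": "n/a", "A": "animate", "I": "inanimate", "B": "either"}),
    ("physical_obj", {"0": "n/a", "P": "physical", "N": "not physical", "B": "either"}),
    ("abstractness", {"0": "n/a", "A": "abstract", "C": "concrete", "B": "either"}),
    ("physical_act", {"0": "n/a", "P": "physical act", "N": "non-physical act", "B": "either"}),
    ("tense",        {"0": "n/a", "P": "past", "C": "current/future", "B": "either"}),
    ("number",       {"0": "n/a", "P": "plural", "S": "singular", "B": "either"}),
    ("root",         {"0": "n/a", "R": "root word", "N": "not root"}),
    ("math_numeric", {"0": "n/a", "M": "numeric"}),
]

def decode_flags(flags):
    """Decode a 10-character flag string into named properties."""
    decoded = {name: table[ch] for (name, table), ch in zip(FLAG_MEANINGS, flags)}
    decoded["verb_priority"] = flags[8]
    decoded["end_node"] = flags[9] == "1"
    return decoded

props = decode_flags("APC00SR001")
# props["animacy"] == "animate", props["end_node"] is True
```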


Now that the introduction to our lexicon has been accomplished, we will investigate the realities behind this approach. During the creation of my lexicon, I made some interesting discoveries. The most notable is that it is a very time-consuming ordeal. The definitions of the first 500 words actually created about 1500 other words that had to be entered and therefore defined. And the 500 words had approximately 6.5 definitions each, which meant that 500 words became about 3250 unique lexical entries.

The 6.5 definitions per word are for the root words only; words with suffixes and prefixes that are used in a definition I have made into separate words, and similarly for the plural versions of each word. Thus the word achieve is considered a different word from achievement, achieves, or achieved. This method is simply a continuation of my approach of creating different words for each definition of a word.

Another problem arose concerning the definition words. Sometimes a particular meaning is stated in more than one way. It became necessary to modify the definitions by omitting some definitions entirely or by rewording some definitions so as to utilize words already in the lexicon. My justification for these omissions is based upon the fact that I wish to start with the vocabulary of a child and that I only require a test lexicon, but ideally a complete lexicon will be available to the program.

Another problem to be examined is the fact that the words in the definition cannot actually be restricted to one lexical entry or definition for that word, as exemplified below. There are two solutions to this problem. We can include the part of speech and number for all valid choices for that word in the definition, as shown below in method one, or we can try to expand each definition, as in method two.
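The two methods can be illustrated with a small sketch. Method one keeps a single definition whose words each carry every valid sense; method two expands that into one fully disambiguated definition per sense combination. The senses shown are invented for the example.

```python
# Illustration of the two methods for handling ambiguous definition words.

from itertools import product

# Method one: each definition word lists all of its valid (pos, number) tags.
method_one = [
    ("A", ["IndArt1"]),
    ("Small", ["Adj1", "Adj2"]),
    ("Furry", ["Adj3", "Adj4", "Adj5"]),
    ("Mammal", ["N2"]),
]

# Method two: expand into every tag combination -> 1 * 2 * 3 * 1 = 6 definitions.
def expand(definition):
    tag_lists = [[f"{w}({t})" for t in tags] for w, tags in definition]
    return [" ".join(combo) for combo in product(*tag_lists)]

method_two = expand(method_one)
# len(method_two) == 6 -- the combinatorial growth is why method one
# is the practical choice.
```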



























Obviously method one is much more practical than method two, especially since method two would have to be expanded even further when each definition word has been similarly expanded. And we cannot simplify our definitions beyond method one due to the fact that many words have very different definitions or meanings, depending upon how they are used in a sentence or definition.

As you can see, this is a very complicated and labor-intensive part of the process. And even with the help of several programs to assist in the process, it still ultimately takes the efforts of a person to correctly identify all the choices for each word. And each definition leads to numerous expansions very quickly. While we could just choose the single best meaning for each word in our definitions, there are too many instances where more than one meaning is applicable.

Vocabulary and Vocabulary Nodal Descriptions

The vocabulary is an important aspect of the program. How the vocabulary is built up initially as well as how additional words are added to the finished product will be discussed here. There are approximately 500 words that can be used to define almost all other words. These will make up the initial vocabulary and will be used, in addition to the entries in the lexicon, to create our finished vocabulary.

We want our finished vocabulary to consist of about 20,000 of the most commonly used words. The vocabulary can be expanded to meet specific user requirements by simply introducing new words on an as-needed basis, or by specific requests for additional custom subsets. We could, in principle, use ALL the words in the lexicon in our vocabulary, but at some point we would be sure to exceed current machine capabilities; this is the fundamental difference between the lexicon and the vocabulary. Other differences also exist and will be discussed shortly. We will address the creation of the vocabulary network concurrently with the vocabulary nodes and definition processes.

We must take our list of 500 words and create one node for each unique instance of the word. We will then have to examine each nodal entry, verifying that our words are defined utilizing only words in this initial vocabulary, making modifications to the definitions in the lexicon as needed. Thus our initial vocabulary will be a self-contained subset of the finished vocabulary. And it is this subset that will ultimately be the stopping point for most of our redefinitions of the vocabulary definitions, as discussed later.

Some additions to this word base will occur; specific command words, for example, will be utilized to execute commands or perform some specific single action and therefore can be viewed as an end-node, so to speak. But most of these will be treated as belonging to a special category of words and will be utilized in a special manner covered later.

Next we create our expanded vocabulary from our lexicon by creating a node for each word/definition, filling in the word, part of speech/number, and the definition. (See Vocabulary Node description below.) After we have created all the word nodes, we then make our related word connections based upon the multiple entries, a related words list, and the definitions. We process the words one at a time and use their definitions to create additional related word connections based on similar meanings. We then use our syntax to generate all possible previous and next words for each vocabulary node.

We also verify that we have included all words used in the definitions in our vocabulary. Words in the definitions that do not yet exist within our vocabulary should be handled immediately to ensure inclusion in the vocabulary. This can be a recursive action that, barring loops, should eventually proceed to completion. Thus our final vocabulary will also be a self-contained subset of the English language.
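The closure process just described can be sketched as a simple worklist algorithm: starting from a seed vocabulary, keep pulling in any definition word that is not yet present until nothing new appears. The toy lexicon below is illustrative only.

```python
# Sketch of making the vocabulary self-contained: follow definition
# words recursively until every reachable entry is included.

def close_vocabulary(seeds, lexicon):
    """Return the set of entries reachable from the seeds through definitions."""
    vocab, frontier = set(), list(seeds)
    while frontier:
        entry = frontier.pop()
        if entry in vocab:
            continue          # already processed; this bars the loops noted above
        vocab.add(entry)
        frontier.extend(lexicon.get(entry, []))
    return vocab

toy_lexicon = {
    "Cat(N2)": ["Small(Adj1)", "Mammal(N2)"],
    "Mammal(N2)": ["Animal(N1)"],
    "Animal(N1)": [],     # treated here as an end node
    "Small(Adj1)": [],
}
vocab = close_vocabulary(["Cat(N2)"], toy_lexicon)
# vocab now contains all four entries, so it is self-contained.
```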

Additions to the vocabulary that are made after all the various networks have been created are not as difficult a task as it may seem. We merely create a new word node(s), fill in the appropriate initial information, as outlined above, and check our word definition to be sure we haven’t introduced any new words.

It should be noted that we will be referring to our vocabulary nodes as a network since that is the easiest way to view them. The word nodes must contain certain information necessary to correctly create the connections within the vocabulary network. They must also contain the word properties and definitions. The nodal structure below is what I have implemented for ease of use and structural identification.


Vocabulary Record Structure

WordPOS# : The actual word with part of speech and number
Part Of Speech : The word’s part of speech
Flags Array(10) : 10-element array of word properties flags
Definition : The definition section, 30 words max, holds
Definition Word Array(30) : one word in each cell of the array
Definition Word Pointer Array(30,30) : and the word’s vocabulary node pointers
Related Words Record : The related words record holds
Related Word : the related word,
Relationship : the relationship,
Related Word Pointer : a pointer to the vocabulary node,
Next Related Word Record Pointer : and a pointer to the next related word record
Command Record : The record of command information for the word
Command : The lines of code for actual execution
Flags Array(10) : Informational descriptors for the command
Next Command Record Ptr : Pointer to the next command record
Sentence Usage Record : The sentence usage record holds
Sentence Usage Pointer : a pointer to the sentence containing the word,
Usage Type Indicator Array(10) : word/sentence relationship flags (fact, etc.),
Next Sentence Usage Record Pointer : and a pointer to the next sentence usage record
Alphabetical List Pointer : This points to the next word based on alphabet


The word is self-explanatory. The part of speech can be abbreviated, but should follow immediately after the word as should the number, and in fact should be considered to be part of the word to uniquely identify word entries. This simplifies the task of identifying which word node we are looking at. For processing purposes, we consider the part of speech to be part of the word, ex. Fly(N1) would be the actual name for the node, not Fly. Similarly, the definition should contain the actual words of the definition, complete with part of speech and number, ex. Cat(N2), A(IndArt1) Small(Adj1,Adj2) Furry(Adj3,Adj4,Adj5) Mammal(N2). This applies to all the words in our entries in the vocabulary nodes and knowledge base. We also have the part of speech field and a flag array for our word properties and miscellaneous flags.

The definition section is also implemented as an array. And we include our word node pointer, which is actually set when the final program is run. The two-dimensional array is for multiple word(pos#) combinations as described above. I have made the assumption that there will not be more than 30 different word(pos#) combinations for any given word in the definition section of a word.

Next, we have a related words list that allows us to make connections to the other word nodes for this word, closely related or synonymous words, and some additional words like opposites. This is how we link multiple definitions, plurality, different tense versions, etc. We are basically connecting all words with the same root word to each other, as well as words that have similar or opposite meanings. We will have to utilize several identifiers to distinguish between the various types of related words, as the following chart demonstrates.

RelatedWordsRec Relationship flags

Flag 1: Not Used

Flag 2: Not Used

Flag 3: N/A 0, Equivalent or Similar word = E

Flag 4: N/A 0, Opposite of word = O

Flag 5: N/A 0, Past = P, Current = C, or Future = F Tense of word

Flag 6: N/A 0, Plural = P, Singular = S Either = B

Flag 7: Not Used

Flag 8: Not Used

Flag 9: Not Used

Flag 10:Not Used


The command field is primarily for verbs. This is the meta-language, as I call it. It holds the commands that are associated with the word, followed by some descriptors that allow the program to match the command’s attributes to the object’s attributes. Most commands that are executed for a word are other words, i.e., they execute a series of other words’ commands. The exceptions are the base words, which do not call other words as commands. The command fields of these words do not contain any vocabulary words; their command field is the actual hard code that executes a command. The meta-language is covered in more detail in its own section.

The next field is the sentence usage field. The sentence usage field is a linked list of pointers to every sentence where the word is used. They are not created until we actually implement the sentence utilization processes when we create our knowledgebase, as discussed later. This field is extremely large due to the size of our knowledge base. It is a critical field because it allows us to quickly recall information related to the word by locating all instances of the word in our knowledge base.

Also, by including a usage type field, we can facilitate the process of locating additional information by allowing us to identify how the word is used prior to actually examining the sentence. This field will be discussed in more detail in the KnowledgeBase section.

The last field is the alphabetical list pointer. This allows us to keep an alphabetical listing of all our word nodes and traverse the network alphabetically. It also guarantees that we can access every node, in the event that we have some isolated words. And this is the means by which we locate our word nodes for new sentences and user supplied input.
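The alphabetical chain can be sketched as a simple linked-list walk. The node names and the dictionary representation here are my own illustration, not the actual record layout.

```python
# Sketch of the alphabetical list pointer: each node stores the name of
# the next node alphabetically, so the whole network can be walked even
# if some nodes have no other connections.

nodes = {
    "Ant(N1)": {"next_alpha": "Bee(N1)"},
    "Bee(N1)": {"next_alpha": "Cat(N2)"},
    "Cat(N2)": {"next_alpha": None},   # end of the alphabetical chain
}

def walk_alphabetical(nodes, first):
    """Yield every node name by following the alphabetical pointers."""
    current = first
    while current is not None:
        yield current
        current = nodes[current]["next_alpha"]

order = list(walk_alphabetical(nodes, "Ant(N1)"))
```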




What is a network? A network, in its simplest form, can be viewed as a graph containing labeled nodes with interconnecting labeled, directed arcs.


In its more complicated form, a network will contain a great many nodes and such a vast array of arcs that it is usually incomprehensible when viewed as a whole. The nodes of a network are usually objects, with the arcs representing the relationships or paths between them. Of course, many kinds of networks exist, each with its own node and arc definitions. The network is really just a convenient way for us to view what we hope is happening or what we want to happen, and is therefore arbitrarily defined to be whatever the programmer decides.

We will use several networks to allow us to accomplish our various tasks in a more comprehensible manner. We will first examine the definition network of one node of the vocabulary network. This sub-network should be viewed as a series of sequential inter-nodal connections from each word in the definition to the vocabulary node of each word.


As you can see, this can be viewed as a recursive issue, since the words in the definition will generally have definitions of their own, and so on. But by leaving the definitions as they are defined in our dictionary, we can utilize the redefining (discussed in more detail later) to create relationships when we actually process user supplied input (sentences, questions, etc.). Also, we do not wish to include truths or facts about the word in the node (ex. cats run); instead they will be handled in the sentence and idea networks described later.

Now let us expand the vocabulary network to its full extent. We will create ALL syntactically valid connections between our words. Here we define valid as the ability of a word to precede or follow a given word within a sentence. Thus each word node will have connections that point to it from other words and connections to other words. But we do not have connections to ALL other words because we will restrict the connections to those that actually can be made based upon syntax.

We accomplish this task by utilizing the syntax to generate the connections. It must be noted that our syntax must be capable of handling past, present, and future tense since our vocabulary structure implements different word nodes for these cases. This process must be repeated when adding new words as well.

We create all syntactically valid connections because we do not specifically store semantic information in the vocabulary network. The semantic knowledge is stored in our knowledgebase instead. This is because we only store the ability of one word to follow or precede another word, syntactically speaking, in our vocabulary network, not the knowledge of whether it makes sense, (ex. cats fly.) It is desirable to leave the possibility for these non-valid (from a semantic viewpoint) cases to exist since we also desire our program to be capable of handling unusual or ‘fantasy’ cases just as we humans can.
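Generating the previous/next links can be sketched with a tiny part-of-speech adjacency table standing in for the full syntax. The table and word senses below are illustrative only.

```python
# Sketch of generating syntactically valid next-word connections from a
# toy grammar: which part of speech may immediately follow which.

CAN_FOLLOW = {"N": {"V"}, "V": {"N"}, "Adj": {"N"}}

words = ["Cats(N1)", "Chase(V1)", "Small(Adj1)"]

def pos_of(entry):
    """'Small(Adj1)' -> 'Adj'."""
    return entry.split("(")[1].rstrip(")0123456789")

def next_links(words):
    """For each word, list every word that may syntactically follow it."""
    return {
        w: [x for x in words if x != w and pos_of(x) in CAN_FOLLOW.get(pos_of(w), set())]
        for w in words
    }

links = next_links(words)
# Note that 'Cats(N1) Chase(V1)' is allowed here purely on syntax --
# no semantic check has been applied yet, by design.
```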


Semantic Networks


A semantic network is a network where we have further restricted the way we can use words in a sentence such that we only have semantically valid sentence structures, assuming the sentence has already been evaluated for syntax. This implies the ability to actually store the trace of nodal connections through the network, which is not what our vocabulary network is designed to do. Instead, we can create an additional network based on sentences and ideas by breaking up sentences into subject/predicate, action/object, etc.; basically identifying the intended meaning of the sentence. This requires us to create nodes that represent the main components of sentences. In this way we have a network of related sentence components that is also a knowledge base. (See diagram below.) We also establish connections between the words in the sentence and the word nodes in the vocabulary network.

It is my opinion that semantic analysis can only be accomplished by the use of a substantial amount of knowledge upon which to evaluate the sentence from a semantic viewpoint. This is justified since we know that young children gain better sentence understanding as they increase their knowledge.


Thus we address the semantics issues in a roundabout way. Once we have introduced a substantial amount of information in the form of sentences, we have basically created a history of sentences, or a knowledge base, which is what the human mind utilizes in its semantic evaluation of data. I believe this knowledgebase, which we have in our own minds, is the basis for the feeling we have when we say something ‘sounds right’.

Of course, we still require a vocabulary network to trace our sentences through, make syntactic analysis, and to create additional relationships, but we are not going to store the actual path in the vocabulary network. Instead, it is stored as a separate informational base or network with connections to the vocabulary network nodes. Thus our semantic rules, and even a copy of our syntax, can be stored as informational rules or sentences in our knowledgebase.

We should view the sentence network as the above diagram suggests, but the actual connections are far more complicated. Each word that has a sentence usage field, discussed earlier, actually connects the sentences through the vocabulary network. By our example, the predicate ‘ran fast’ has connections to all sentences and vocabulary words that contain the word ‘ran’ in any of its forms, similarly for ‘fast’. This is discussed in more detail shortly, but the true complexity should be obvious at this point. In this way we can trace related concepts back and forth, utilizing additional relationships from our vocabulary network to increase our ability to analyze user input.
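The trace just described can be sketched with a toy sentence-usage table: two sentences are related whenever they share a tracked word. The sentences and word senses below are invented for the illustration.

```python
# Sketch of tracing related sentences through the vocabulary network's
# sentence-usage lists.

sentence_usage = {
    "Ran(V1)": ["The dog ran fast", "The cat ran away"],
    "Fast(Adv1)": ["The dog ran fast", "Planes are fast"],
}

def related_sentences(sentence, usage):
    """Find every sentence sharing a tracked word with the given sentence."""
    related = set()
    for word, sentences in usage.items():
        if sentence in sentences:
            related.update(s for s in sentences if s != sentence)
    return related

rel = related_sentences("The dog ran fast", sentence_usage)
# 'ran' links us to the cat sentence, 'fast' to the plane sentence.
```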

We also store facts about topics and words in this network (discussed in more detail in the knowledgebase section), so it must be stressed that when we initially create our knowledge base, we must not introduce false information or incorrect semantic knowledge. Introduction of false knowledge can impact the processing of user supplied input sentences and our responses, as described later.

It is these connections that form the overall semantic network. One other aspect of the sentence network is also important to cover. We will require a method to identify actions/objects as well as descriptors/objects and other relationships within the sentence. This can best be accomplished similarly to the way we identified parts of speech in the vocabulary word nodes. How we identify and use this information is discussed later with the sentence and idea descriptions.

Thus, it is these connections that will, when enough knowledge has been introduced, form the heart of my NLP.


Sentence Concept


While the word is the basic unit of information transfer, it is the sentence that is the basic unit of conceptual or idea transfer. Thus our need to address language processing from a sentence viewpoint. The sentence nodal network consists of the sentence as the node identifier, along with a sentence type identifier and any additional flags. The sentence also has a word list of the sentence’s words and, by way of the vocabulary, their definitions. Also, each sentence is composed of a kernal, which is where the main noun and verb, as well as a secondary or auxiliary noun and verb, can be found. In my model I have also included modifiers to these words in the kernal. These modifiers are stored as a linked list due to the potential for an unlimited number of descriptors. Additionally, each sentence has a subject and predicate, a base redefinition, an idea interpretation, and a list of pointers to any groupings and paragraphs that the sentence is a member of. My sentence node description is covered in the Sentence Node Description section.

Now we take a look at each basic sentence type, keeping in mind that we will only deal with simple sentences, because complex sentences are to be broken up into simple sentences and grouped into paragraphs, as discussed in the paragraph section. The types of sentences are declarative, imperative, exclamatory, and question.

A declarative sentence makes a statement and has a subject and predicate. These are the sentences that the knowledgebase is comprised of.

An imperative sentence is basically a command where, (for our purposes), the program is considered to be the subject performing the specified action. This type of sentence is one of the two types of user supplied input.

A question is the other type of user supplied input. It should be treated either as a declarative to be judged true or false, or as a query into our knowledge base.

An exclamatory sentence is just a different way of expressing one of the other three, and as such we do not handle exclamatory sentences since they have no bearing on our problem or our solution.

Therefore our sentence types can be reduced to three choices which we can represent as D, C, and Q for our identifier flags. These flags are important because ALL information in my model is linked through the vocabulary nodes, at least. And we wish to be able to distinguish between knowledgebase declarative sentences, user supplied commands, and queries when we are trying to find related information in our attempts to satisfy a user request, as discussed in the Response and Command Capabilities section.

One of the methods by which we interpret the sentence is by redefining the words in the sentence as far as is practical. We call this redefinition the sentence’s base definition. This is where the base 500-word vocabulary comes into play. The base 500 words are considered to be end nodes and can thus be used as the stopping point for the re-definition process. There are, however, exceptions, notably names and command words; these are defined, but are considered end nodes. Ex. Spot is defined as a name for a dog, but this can be viewed as more of an identification or association process than the redefinition process that occurs for ordinary words.
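The redefinition process can be sketched as repeated expansion down to end nodes. The base words and definitions below are tiny stand-ins for the real 500-word base, and the depth cap is my own safety addition.

```python
# Sketch of base redefinition: replace each non-end-node word with its
# definition until only base words (or names/command words, i.e. words
# with no stored definition) remain.

BASE_WORDS = {"Animal(N1)", "Small(Adj1)", "A(IndArt1)"}   # stand-ins for the base 500

DEFINITIONS = {
    "Cat(N2)": ["A(IndArt1)", "Small(Adj1)", "Mammal(N2)"],
    "Mammal(N2)": ["Animal(N1)"],
}

def redefine(words, max_depth=10):
    """Expand non-base words through their definitions down to end nodes."""
    for _ in range(max_depth):
        expanded, changed = [], False
        for w in words:
            if w in BASE_WORDS or w not in DEFINITIONS:
                expanded.append(w)        # end node: stop redefining here
            else:
                expanded.extend(DEFINITIONS[w])
                changed = True
        words = expanded
        if not changed:
            break
    return words

base = redefine(["Cat(N2)"])
# 'Cat(N2)' bottoms out as base-vocabulary words only.
```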

Another important aspect of sentence interpretation is through the use of related sentences. The rules, properties, and virtually all related information for each word in the sentence must often be examined to correctly determine the right interpretation and sentence structure. Thus our need for an extensive knowledge base, i.e. one that contains both the information sought and a substantial number of instances of word usage.

The knowledge base should therefore contain a lot of informational sentences like ‘birds fly’, ‘birds have feathers’, ‘planes fly’, and ‘fish swim’. The relational information is the nodal traces that allow us to lump birds and planes together as things that fly. Thus an extensive knowledge base will give the program a large amount of relational information and increase the ability of our program to achieve the desired resultant action regardless of how the sentence is phrased. The example of a typical sentence node below shows what these connections are, and the next section discusses the sentence node in detail.
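The relational lumping just described can be sketched over a toy knowledge base of subject/predicate pairs; the representation is my own simplification of the sentence network.

```python
# Sketch: given simple declaratives, collect every subject that shares
# a predicate -- lumping birds and planes together as things that fly.

knowledge_base = [
    ("birds", "fly"),
    ("birds", "have feathers"),
    ("planes", "fly"),
    ("fish", "swim"),
]

def things_that(predicate, kb):
    """Return all subjects recorded with the given predicate."""
    return {subject for subject, pred in kb if pred == predicate}

flyers = things_that("fly", knowledge_base)
# flyers groups birds and planes, regardless of how either fact was phrased.
```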


The Current Sentence Node Is:

OrigSent =AN(indart1)ADVERB(n3)DESCRIBES(v1)ANOTHER(adj4)ADVERB(n1)

SentIn =AN(indart1)ADVERB(n3)DESCRIBES(v1)ANOTHER(adj4)ADVERB(n1)

Sent Type=D


SentWords=AN(indart1) ADVERB(n3) DESCRIBES(v1) ANOTHER(adj4) ADVERB(n1)



Subject =AN(indart1) ADVERB(n1)


Predicate=DESCRIBES(v1) ANOTHER(adj4) ADVERB(n1)


SubjIdea= ADVERB(n1) DOES(v1) SOMETHING(n1)


Kernal MainNoun=ADVERB(n1)

Kernal MainVerb=DESCRIBES(v1)

Kernal AuxNoun=ADVERB(n1)

Kernal AuxNounModifier=ANOTHER(adj4)



The following Sentences Have Related Ideas:









Sentence Node Description


The sentence nodal structure is viewed as a record with the following fields:

Sentence Node Record Structure

Sentence Record : Record header

Sentence : The processing copy of the sentence

Type : Type of sentence

Original Sentence : Copy of the original sentence as inputted

Informational Flags : 10 Flag indicators for sentence

Word List : Sequential list of individual words and

their pointers in the sentence

Word1 : Sentences with more than 30 words are

Word2 : probably run-on sentences and are broken

: : up into the individual sentences that

Word30 : they are comprised of. Thus 30 words

WordPtr1 : Is the maximum allowable for a sentence

WordPtr2 :

: :

WordPtr30 :

SubjectString : The words of the subject as one string

Subject : Sequential listing of the subject words

Word1 : and their vocabulary node pointers

Word2 : Up to 30 words are allowed

: :

Word30 :

WordPtr1 :

WordPtr2 :

: :

WordPtr30 :

PredicateString : The words of the predicate as one string

Predicate : Sequential listing of the predicate words

Word1 : and their vocabulary node pointers

Word2 : Up to 30 words are allowed

: :

Word30 :

WordPtr1 :

WordPtr2 :

: :

WordPtr30 :

Idea Interpretation : Idea interpretation section header

Sentence Idea : The idea interpretation of the sentence

Subject Idea : The idea interpretation of the subject

Predicate Idea : The idea interpretation of the predicate

Related Ideas : Linked list of pointers to related ideas

Idea Pointer : Pointer to the related sentence

Relationship : The relationship flag field of the

related sentence

Next Pointer : Pointer to the next related idea

Kernal : Kernal header

Main Noun : Main noun of the sentence

Main Noun Ptr : Pointer to the vocabulary node for the

main noun

AuxNoun : Auxiliary noun

AuxNounPtr : Pointer to the vocab node for the

auxiliary noun

Main Verb : Main verb of the sentence

Main Verb Ptr : Pointer to the vocabulary node for the

main verb

AuxVerb : Auxiliary verb of the sentence

AuxVerbPtr : Pointer to the vocab node for the

auxiliary verb

MainNounModifier : Main noun modifier

MainNounModPtr : Pointer to the vocabulary node for the

main noun modifier

AuxNounModifier : Auxiliary noun modifier

AuxNounModPtr : Pointer to the vocab node for the

auxiliary noun modifier

MainVerbModifier : Main verb modifier

MainVerbModPtr : Pointer to the vocab node for the main

verb modifier

AuxVerbModifier : Auxiliary verb modifier

AuxVerbModPtr : Pointer to the vocab node for the

auxiliary verb modifier

Command Record : The command information for the sentence

SentenceCommand : The actual command

DescriptorFlags(10) : Informational descriptors for the command

Next Command Rec Ptr : Pointer to the next command record

Paragraph Member Record : Linked list of pointers to any paragraph


Paragraph Pointer : Actual pointer to the paragraph

Paragraph Flags : 10 flag indicators for relationship

Next Paragraph Pointer : Pointer to the next paragraph usage

Entry Order : Sentence entry order into knowledge base

Date Time Stamp : Date and Time stamp of sentence

Note: I have omitted the base redefinition section for now.


The sentence nodal record has the full sentence with the sentence type as the record identifier. The various fields and their descriptions are reasonably self-explanatory, but we will elaborate on some of the more difficult ones.

The Informational Flags are there because, as we develop and implement the code, we will discover a need to include additional information about the sentence. These flags actually mirror the vocabulary sentence usage flag field as well as covering additional information. How many flags this will entail will be decided as development progresses, so we should leave room for potentially many flags. The primary use implemented so far is the sentence structure type, which is stored as a number in the ninth and tenth fields. The chart below shows some of the structure types that have been implemented and their numbers.


1  something1 is something2          KernalNoun is an element of AuxNoun

2  something2 is something1          AuxNoun is an element of KernalNoun

3  something1 describes something2   KernalNoun is a property of AuxNoun

4  something2 describes something1   AuxNoun is a property of KernalNoun

5  something1 changes something2     KernalNoun modifies AuxNoun

6  something2 changes something1     AuxNoun modifies KernalNoun

7  something1 does something2        KernalNoun performs action KernalVerb on AuxNoun

8  something1 does some action       KernalNoun performs action KernalVerb

9  something1 does some action called action2   KernalNoun performs action AuxVerb

10 something1 is equal or is the same as something2   KernalNoun equals/is the same as AuxNoun
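As a concrete illustration of how the structure type could be read back out of the flag field, here is a minimal sketch. The function name and the exact flag layout beyond "a number in the ninth and tenth fields" are assumptions, not the paper's actual implementation.

```python
def structure_type(flags: str) -> int:
    """Extract the sentence structure type from a 10-character flag field.

    The paper states the type is stored as a number in the ninth and
    tenth fields; everything else about the layout is assumed here.
    """
    if len(flags) != 10:
        raise ValueError("flag field must be exactly 10 characters")
    return int(flags[8:10])  # ninth and tenth positions

print(structure_type("0000000007"))   # structure type 7: noun performs verb on aux noun
print(structure_type("0000000010"))   # structure type 10: equality
```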


The idea representation will be discussed in detail shortly. The related idea pointers are a linked list that allows us to quickly locate other sentences with related ideas. The kernal is simply the parts of the sentence as listed.

The command section is similar to the command section for the vocabulary word node commands, but its commands are generally considered to be complex commands, i.e. more than one command. As mentioned, commands are covered in detail under Response And Command Capabilities.

The paragraph section is simply a list of sentences. We have a pointer to the paragraph containing the sentence. We also have a flag field that allows us to categorize how the sentence is used in the paragraph as well as storing information on the properties of the paragraph.

The entry order field is necessary for portability. When we actually export our sentence file with the intention of reloading it into the knowledge base, it is imperative that we retain the correct ordering for the same reason that we deliberately chose to start with preschool sentences and work up to more difficult sentences and knowledge. And when used in conjunction with the time and date field, we have a history of when information was introduced, which is of particular importance for user supplied input.
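The export/reload behavior described above can be sketched as follows. The record layout and file format here are illustrative assumptions; the point is only that sorting by entry order preserves the original teaching sequence, with the timestamp retained as history.

```python
import json
import os
import tempfile

def export_sentences(sentences, path):
    """Write sentences sorted by entry order so reloading re-teaches in sequence."""
    ordered = sorted(sentences, key=lambda s: s["entry_order"])
    with open(path, "w") as f:
        json.dump(ordered, f)

def reload_sentences(path):
    with open(path) as f:
        return json.load(f)  # already in the original teaching order

# Hypothetical records: a preschool sentence must be re-taught before
# anything that builds on it.
kb = [
    {"entry_order": 2, "text": "A DOG CAN RUN", "timestamp": "2007-01-02T10:00"},
    {"entry_order": 1, "text": "A DOG IS AN ANIMAL", "timestamp": "2007-01-01T09:00"},
]
path = os.path.join(tempfile.gettempdir(), "kb_export.json")
export_sentences(kb, path)
print([s["text"] for s in reload_sentences(path)])
# ['A DOG IS AN ANIMAL', 'A DOG CAN RUN']
```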


Sentence Processing


The best way to view sentence processing is as a multi-layer compiler. The base 500 word vocabulary, and any special command words, should all have commands associated with them, similar to a common programming language compiler. And since, by definition of the 500 base words, every other word in our vocabulary can be defined in terms of the 500 base words, we should therefore have a command interpretation for ALL words in our vocabulary. Of course, the reality is that this approach does not work for many sentences. Thus each sentence also requires that a command capability be available as part of the record structure.

Therefore, initial sentence processing consists of filling in all the fields of information in the sentence record structure, and then evaluating the command field. It sounds simple, but identifying all the informational fields for a sentence is a daunting task that requires many methods for each field.

We must have several ways of getting this information, all of which, except the last recourse of asking the user, are dependent on examining and accessing the information in the knowledgebase and vocabulary record of each word.

The various methods are actually based on the structure of the knowledgebase. Namely, we can find the information through existing previous instances, through various existing relations, or through the use of our sentence rules on language, i.e. English 101. The first two are fairly obvious and actually set up the third. By finding all instances of the primary words in our sentence, as well as tracing all related words, we should also have found all instances of sentences that pertain to these words, including our English 101 sentence rules. The inclusion of all the pointers in the vocabulary node and the sentence node and their necessity should now be apparent.
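The "existing previous instances" method can be sketched with a simple word-usage index standing in for the usage pointer lists in the vocabulary nodes. All names and data here are illustrative assumptions.

```python
from collections import defaultdict

# usage index: word -> sentence ids (stands in for the vocabulary node's
# usage pointer lists described earlier)
usage = defaultdict(set)
sentences = {1: "A DOG IS AN ANIMAL", 2: "AN AIRPLANE FLIES", 3: "A DOG CAN RUN"}
for sid, text in sentences.items():
    for word in text.split():
        usage[word].add(sid)

def related(query_words):
    """All stored sentences sharing at least one word with the query."""
    hits = set()
    for w in query_words:
        hits |= usage.get(w, set())
    return sorted(hits)

print(related(["DOG", "FLY"]))  # [1, 3]: both sentences mentioning DOG
```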


The Idea Concept


Isolating the intended meaning or idea behind each sentence is an important aspect of how we understand or comprehend a sentence. Finding the intended meaning has also been a major problem. I prefer a two-pronged approach to the problem. Our first method is through the use of a syntax-like set of rules. Our second method is the implementation of our knowledge base, which should have all the information required already in place. We can now discuss each methodology in greater detail.

The syntax methodology is actually fairly straightforward. This methodology allows us to accomplish our goal by utilizing a syntax-like set of rules to locate the object, action performed, and any descriptors. And when used in combination with the subject, predicate, and idea fields for each word, we have the ability to create an idea interpretation for the subject, and predicate. We then put these together to form the idea for the sentence. The idea interpretation for each sentence and subject/predicate is, of course, also part of our network with traceable connections back to the vocabulary word nodes. Thus we also have connections to other sentences and their idea interpretations through the use of the pointers in the word’s usage field. We shall also have a related ideas field in our idea section of our sentence structure. In this way we will be creating an idea network which in turn will give us the full power and capabilities associated with a network. The related ideas are determined through the commonality of our vocabulary. This field also helps to determine our groupings, as discussed later.
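A minimal sketch of composing the sentence idea from the subject and predicate ideas might look like this. The combination rule and the idea representation are assumptions; the paper only says the two interpretations are put together to form the sentence idea.

```python
def subject_idea(words, pos_tags):
    """Object of the idea: the nouns of the subject (a simplifying assumption)."""
    return {"object": [w for w, p in zip(words, pos_tags) if p == "n"]}

def predicate_idea(words, pos_tags):
    """Action of the idea: the verbs of the predicate."""
    return {"action": [w for w, p in zip(words, pos_tags) if p == "v"]}

def sentence_idea(subj, pred):
    # Putting the two partial interpretations together
    return {**subj, **pred}

subj = subject_idea(["A", "DOG"], ["indart", "n"])
pred = predicate_idea(["CAN", "FLY"], ["v", "v"])
print(sentence_idea(subj, pred))
# {'object': ['DOG'], 'action': ['CAN', 'FLY']}
```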

The knowledge base methodology is actually straightforward in theory, but difficult in practice, i.e. programming it is a nightmare! Once we have established our substantial knowledge base, we will already have the information in place that will allow us to determine the idea inherent in a given sentence (English 101 is a MUST for our knowledge base). But the problem is accessing the knowledge stored in the knowledge base and when. Initially, we want to have the idea knowledge incorporated in each sentence, but we have the obvious problem of no knowledge of English 101 to draw upon. The solution to this is to set the information for the idea fields in our input as a separate field. Basically, we have to spell out everything, at least initially.

By determining the base definitions and the idea intent of the sentence, we have created additional interpretations upon which we can evaluate the user supplied input and enhance our ability to create a suitable response or perform the required action. In this way we can more accurately interpret a given command or query, regardless of how it is stated.


The Next Step – Sentence Paragraphs and Themes


The next step with NLPs is to examine the sentence differently. We want to look at a sentence not by itself, but as one of many and what that implies. Each sentence is actually one (or more) idea concepts that are part of many, instead of ‘just one sentence’.

The paragraph theme methodology is basically a methodology that allows us to group sentences into paragraph structures. We create these paragraphs from our input when we encounter either extended sentences or multiple sentences that effectively already form a paragraph.

Basically, paragraphs are handled by keeping a list of sentences making sure to retain their order. We process each sentence as described above. We then take our list of sequential sentences with all the appropriate informational fields filled in and analyze them so as to create a contiguous list of ideas that form a theme for our paragraph. This allows us to implement contextual inference in our analysis of each sentence. Thus pronouns can easily be traced back to their corresponding noun.
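The pronoun tracing mentioned above can be sketched as follows: within an ordered paragraph, substitute each pronoun with the most recent preceding noun. Real resolution would also consult gender, number, and the idea fields; this naive most-recent-noun rule is an assumption for illustration.

```python
PRONOUNS = {"HE", "SHE", "IT", "THEY"}

def resolve_pronouns(tagged_sentences):
    """tagged_sentences: ordered list of sentences, each [(word, pos), ...]."""
    last_noun = None
    resolved = []
    for sent in tagged_sentences:
        out = []
        for word, pos in sent:
            if word in PRONOUNS and last_noun:
                out.append(last_noun)   # trace the pronoun back to its noun
            else:
                out.append(word)
            if pos == "n":
                last_noun = word        # remember the latest candidate antecedent
        resolved.append(out)
    return resolved

para = [[("A", "indart"), ("DOG", "n"), ("BARKS", "v")],
        [("IT", "pn"), ("RUNS", "v")]]
print(resolve_pronouns(para))  # [['A', 'DOG', 'BARKS'], ['DOG', 'RUNS']]
```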

This is also how we handle sentences that express more than one concept or idea. We break up the sentence into as many sentences as necessary, but we keep them grouped together, forming a new paragraph of sentences.

The paragraph network nodal structure actually consists of a paragraph identifier consisting of the main paragraph theme. A typical node in the theme network contains a pointer to each sentence, and a copy of the pertinent sentence information, namely, the sentence property and idea fields. We also have pointers to other paragraphs containing related themes and a flag field to indicate the type of relationship. We now have our theme items tied into our sentence and idea networks, thus creating the ability to find related themes and, in essence, creating a paragraph theme network as well.

The exact nodal description or methodology is described below.




Paragraph Record : Record Header

Paragraph Number : Unique identifying number

Theme : Theme for the paragraph, 1 sentence

Type : Paragraph type

Paragraph Flags : 10 Flags of paragraph properties

Sentence List Pointer : Array of Pointers to the sentences

in the paragraph

Related Themes Record Ptrs : These are the pointers to the theme

related paragraphs

NextParagraphPtr : Pointer to the next paragraph record
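The paragraph record above could be represented in code roughly as follows; field names follow the listing, while the types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ParagraphRecord:
    paragraph_number: int                  # unique identifying number
    theme: str                             # theme for the paragraph, one sentence
    paragraph_type: str
    paragraph_flags: str = "0" * 10        # 10 flags of paragraph properties
    sentence_list: List[int] = field(default_factory=list)       # sentence ids
    related_theme_ptrs: List[int] = field(default_factory=list)  # theme-related paragraphs
    next_paragraph: Optional["ParagraphRecord"] = None

p = ParagraphRecord(1, "DOGS ARE PETS", "statement")
print(p.paragraph_number, p.theme)
```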


One other important aspect of the paragraph record structure should be mentioned. Originally, it was my intention to only allow one copy of a specific sentence to exist in the knowledgebase, and only one instance of a particular paragraph. But I have modified this for paragraphs so that the sentences in each paragraph may exist whether or not they exist elsewhere in the knowledgebase. The justification for this is simple, there are some situations where the meaning of the sentence is determined contextually by the sentences associated with it. And in the case of pronouns, the nouns that are represented by them are determined by another sentence in the paragraph.

While the theme and idea networks are very powerful and can be quite capable, there is one particular achievement that I feel should be specifically mentioned. These networks allow the NLP to have a history, a present, and even a future since we can utilize our history to match with our current theme to predict where a conversation will progress. We also have the ability to relate our input with our knowledge base to evaluate the input and make future predictions as well as generating intelligent responses. In this way we can even contribute to conversations with different users based on our previous experience and knowledge.


The KnowledgeBase


The knowledgebase is the heart of the memory and therefore of our NLP. We store all knowledge except vocabulary in our knowledgebase. The various command capabilities, situational responses, and 'how to' methodologies are but a part of the information stored here. Everything that we teach our program, and anything that will be taught, is stored in the memory of the knowledgebase.

This will require an extremely large amount of information in the form of potentially millions of sentences. We build the knowledge base by entering sentences, starting with rudimentary sentences typical for a preschooler, gradually increasing the difficulty level until we reach (hopefully) the level of the average high school graduate. At some point, we begin to teach the program what to do with the knowledge, i.e. English 101.

Once we have implemented a substantial amount of knowledge or sentences, we will have a great deal of relational information, but we have yet to deal with any types of commands or rules. The easiest way to accomplish this is to include in our knowledge base statement sentences that tell the program to respond in a certain way when it encounters a certain condition or input. Ex. when someone says ‘thank you’, you say ‘you are welcome’. It is through the implementation of many simple commands that we will eventually get to the greater abilities of windows commands and advanced data searches. If this approach sounds familiar, it should. This is the way WE learned. As you can imagine, the greatest difficulty will be deciding what to teach and when to teach it. But if we mimic our own schooling methods I believe we will be able to overcome this obstacle with minimum difficulty.
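The condition/response teaching described above ("when someone says 'thank you', you say 'you are welcome'") can be sketched as storing each rule as a condition-response pair. The storage form is an assumption; in the actual design these would be statement sentences in the knowledgebase.

```python
rules = {}

def teach(condition: str, response: str):
    """Store a taught rule: respond this way when this input is encountered."""
    rules[condition.upper()] = response

def respond(user_input: str) -> str:
    # Fall back to the program's stock apology when nothing has been taught
    return rules.get(user_input.upper(), "I Am Sorry. I Do Not Know The Answer.")

teach("thank you", "you are welcome")
print(respond("THANK YOU"))   # you are welcome
print(respond("hello"))       # I Am Sorry. I Do Not Know The Answer.
```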

Thus our knowledge base program has rules hard coded into it that only deal with how to process sentences and create the relationships as described above. We do not hard code any knowledge or rules that deal with language processing. Instead, we teach our program through the input sentences. In this way we eliminate the need for interpretive coding because we remove the need to make our code self-programming. At the same time, we create a knowledge base that is portable by simply downloading and transporting the entire sentence listing. In this way we also have a methodology to substantially increase or upgrade our knowledge and command capabilities with additional sentence packages.

One important aspect of the knowledgebase should be mentioned. For most applications we want our knowledgebase to be read-only. By this we mean that user queries and commands do not increase the number of sentences or knowledge stored in the system. This is important to ensure the integrity of our knowledgebase, thus preventing a number of undesirable results. The most notable of these is the prevention of falsehoods or lies creeping into the system, and by having a locked down knowledgebase we can have predictable and duplicable results.

There is one possible exception to this rule. Not all user interaction needs be in the form of a query or a command to perform an immediate computer-related action. We can use this system for analysis of information as well as a reference lookup. If we were to input a series of books, we would be able to ascertain or assimilate specific information that may be stored in the books. The relational connections inherent in this system allow for unprecedented searched capabilities. In the event of this type of advanced search for information, we should have a separate flag that will allow us to keep the new knowledge available while at the same time distinguishable from our locked-down knowledgebase. This way we have the best of both worlds. The users can customize their knowledgebase, while the software supplier maintains liability only as regards to the original knowledgebase.


Responses and Queries


How the program actually responds to a given input is also an important aspect of the program. When the initial 500-word vocabulary is built, an initial given set of responses and capabilities must also be built into the system. Through sentence type analysis, we can evaluate our input into various categories. This will determine the response for user input.

Responses consist of queries back to the user for more information, acknowledgments that a command has been performed, responses containing the information required in response to a query, or even a comment concerning topics related to the subject matter of a dialogue. Through our learning methodology, we have basically created the ability to allow the user to phrase the input in an unlimited number of ways while still causing the same action to be performed. And, through our knowledge base, we have the equivalent of a command generator already in place to handle all the 'little' steps needed to perform a given task when that task requires multiple steps.

There are basically two types of user input currently employed in my program: commands and queries. Commands will be discussed in the next section. There are many types of queries, but a query can basically be considered a data retrieval: a question that we would like answered. The types of queries are not unlimited. There are three basic categories: those that can be answered with information stored in the vocabulary, those whose answer can be found within the knowledgebase, and those whose answer requires a command or set of commands to be executed to locate the information.
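The three-way dispatch over those query categories might be sketched as follows. The category tests here are crude assumptions; the real program decides via sentence-type analysis of the fully parsed input.

```python
def answer(query, vocabulary, knowledgebase, commands):
    words = query.rstrip("?").split()
    # 1. vocabulary lookup: definition-style queries
    if words[:2] == ["WHAT", "IS"] and words[-1] in vocabulary:
        return vocabulary[words[-1]]
    # 2. knowledge-base lookup: a stored sentence covers every content word
    for sentence in knowledgebase:
        if all(w in sentence.split() for w in words if w.isalpha() and len(w) > 2):
            return sentence
    # 3. command execution: e.g. arithmetic or an external search
    for w in words:
        if w in commands:
            return commands[w](words)
    return "I Am Sorry. I Do Not Know The Answer."

vocab = {"DOG": "A DOG IS AN ANIMAL"}
kb = ["A DOG CAN NOT FLY"]
cmds = {"+": lambda ws: str(int(ws[ws.index("+") - 1]) + int(ws[ws.index("+") + 1]))}

print(answer("WHAT IS A DOG?", vocab, kb, cmds))   # A DOG IS AN ANIMAL
print(answer("CAN A DOG FLY?", vocab, kb, cmds))   # A DOG CAN NOT FLY
print(answer("WHAT IS 8 + 6?", vocab, kb, cmds))   # 14
```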

The first type of query is handled by accessing the information from our vocabulary. The following excerpt is an example of this type of query.


Query- WHAT(pn2)IS(v1)A(indart1)DOG(n1)?(n2)

Response -A(indart1)DOG(n1)IS(v1)AN(indart1)ANIMAL(n2)THAT(adj4)HAS(v1)




As you can plainly tell, this is a direct lookup of information as it is stored in any dictionary. This should be distinguished from a command query by the following example.


Query- WHAT(pn2)IS(v1)8(n2)+(v1)6(n2)?(n2)

Response- 8(n2)+(v1)6(n2)=(v3)14(n2)


Above, we have a command being executed for the word node +(v1).
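The execution of that node might be sketched as follows: the words on either side of '+' become the arguments of a built-in system call, and the result is echoed back in the response format shown. The function and tokenization are illustrative assumptions.

```python
def execute_plus(tokens, i):
    """tokens: the query word list; i: index of the '+' node."""
    left, right = int(tokens[i - 1]), int(tokens[i + 1])
    result = left + right                      # the hard-coded SysCall action
    return f"{left} + {right} = {result}"      # echoed back as the response

print(execute_plus(["WHAT", "IS", "8", "+", "6"], 3))  # 8 + 6 = 14
```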


The next type of query is more complicated as it seeks information stored in the knowledgebase. This type of query requires us to examine the subject and predicate and find related sentences through the vocabulary words of the query. The next two examples derive their answers from the information stored in the knowledgebase.





Query- CAN(v1)A(indart1)DOG(n1)FLY(v1)?(n2)

Response- NO, A(indart1)DOG(n1)CAN(v1)NOT(adv1)FLY(v1)


Query- DOES(v1)AN(indart1)AIRPLANE(n1)FLY(v1)?(n2)

Response- YES, AN(indart1)AIRPLANE(n1)FLIES(v1)


This type of query is once more a simple knowledge lookup. But what do we do with more complicated queries, where the information is not directly available? The following example, combined with the sentences specified, exemplifies this problem.





Query- DOES(v1)A(indart1)MACHINE(n1)FLY(v1)?(n2)


Response1- I am sorry, but I do not know the answer.

Response2- YES, SOME(adj5)MACHINES(n1)FLY(v1)


Response 1 is generated if we restrict ourselves to a simple knowledge lookup. But we would obviously prefer Response 2 as an answer. Response 2 can only be achieved through what I call first degree logic. It is a very tricky methodology to employ because without exercising extreme care, the results can easily yield incorrect responses, as exemplified below.






Query- DOES(v1)AN(indart1)ALARM(n2)FLY(v1)?(n2)

Response- YES, SOME(adj5)ALARMS(n1)FLY(v1)


Obviously, the response is incorrect, yet it is a logical derivation of the sentences in the knowledgebase. The above example can easily be avoided by the inclusion of the below sentence into the knowledgebase.
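The first degree logic described above can be sketched with an is-a table plus direct facts: when no direct fact exists, check whether some known instance of the queried category satisfies the verb. The tables and rule here are illustrative assumptions, and, as the ALARM example shows, the category relations must be curated carefully or the same rule licenses wrong answers.

```python
is_a = {"AIRPLANE": "MACHINE", "DOG": "ANIMAL"}     # instance -> category
facts = {("AIRPLANE", "FLY"): True, ("DOG", "FLY"): False}

def does(noun, verb):
    if (noun, verb) in facts:                  # direct knowledge lookup
        return facts[(noun, verb)]
    for instance, category in is_a.items():    # first-degree inference
        if category == noun and facts.get((instance, verb)):
            return True                        # "YES, SOME <noun>S <verb>"
    return None                                # "I do not know the answer."

print(does("MACHINE", "FLY"))  # True: an airplane is a machine, and airplanes fly
print(does("DOG", "FLY"))      # False: direct lookup
print(does("CHAIR", "FLY"))    # None: nothing known either way
```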




The third type of query, although touched upon briefly in the math example, is much more complicated. This query type often requires the program to go outside the vocabulary and the knowledgebase for the answer. An example of this type of query would be a search of the internet, or maybe a search of the documentation available on the user’s PC. How these are accomplished will be touched upon shortly.

Command Responses and Capabilities

The desired capabilities and the use to which the program is to be put will determine some of the vocabulary and the commands. The commands that each word has available are stored in the node as mentioned above. The exact command, its requirements, and its functionality are stored in the Command Record section of a word’s vocabulary node. And just as we have stored commands for single words in our vocabulary nodes, we also store commands in some of our sentence nodes.

Some commands are actually just system commands. Many mathematical problems can best be solved this way. Also, many commands are computer related and should be handled as system calls, since they interact with other programs through a set of established commands. It should also be noted that some sentence commands may have to implement this methodology for cases that deal with Windows commands or machine-related commands. Commands are covered in more detail in the next section on Meta-Language, but we shall look at a simple example now.



CommandWord: +(v1)

WordCommand: SysCall

DescripFlags(10): MMM0000099

NextCommandWordRec PTR to the next CommandWordRec


The command word is '+(v1)', and the command is a system call. The Ms in the flag field specify that the previous, next, and result words are or should be mathematical numerics. The 99 at the end of the flag field signifies that this is a mathematical command.

And the flag field similarly identifies the argument word's features, i.e. the same word characteristics as are found in the vocabulary flag field. It is by matching the arguments and their properties to the command specified arguments and properties that we can often determine the appropriate command to be used under a given set of circumstances.
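The argument-matching step might be sketched as follows, assuming (as an illustration, not the actual layout) that the first three flag positions correspond to the previous, next, and result words of the MMM0000099 field shown above.

```python
def args_match(flags: str, prev_word: str, next_word: str, result_word: str) -> bool:
    """Check candidate arguments against a command's descriptor flags.

    'M' positions demand mathematical-numeric words; '0' positions impose
    no constraint. The position-to-argument mapping is an assumption.
    """
    for flag, word in zip(flags[:3], (prev_word, next_word, result_word)):
        if flag == "M" and not word.lstrip("-").isdigit():
            return False   # a non-numeric word rules this command out
    return True

print(args_match("MMM0000099", "8", "6", "14"))    # True: all numeric
print(args_match("MMM0000099", "DOG", "6", "14"))  # False: wrong command for these arguments
```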

There is one other approach that I believe I have basically been orienting toward. When we create our lexicon, we not only define each word, we also make it actually mean something. In other words, we re-make our definitions into idea concepts for each word. Thus we must have an idea even for words that are not command words, i.e. adjectives, adverbs, prepositional phrases, etc. And it will be these idea concepts that form our ‘idea network’. And by creating an ‘idea syntax’, we can eventually categorize the sentences and queries to enable us to determine the idea meaning for each based on historical information within the knowledge base.


The Meta - Language


The meta-language is what allows us to execute code that processes the user input, whether it is a command or a query. The base 500 words are the key. They are the only words that can actually cause some action to take place, i.e. they cause some hard-coded action to take place. The commands of all the other words are other words, not code. These words can be any word in the vocabulary, but ultimately we reach a point where all the command words are the base 500 and we have finally executed all the code necessary for the user’s input to be satisfactorily handled.
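That reduction process can be sketched as a recursive expansion: a word's command list is expanded word by word until only base-500 words, which carry hard-coded actions, remain. All the entries below are made-up assumptions, including the depth cap standing in for loop detection.

```python
BASE_500 = {"FIND", "REPLY"}   # base words: these execute hard-coded actions

word_commands = {
    "LOCATE": ["FIND"],             # defined in terms of a base word
    "ANSWER": ["LOCATE", "REPLY"],  # defined in terms of other words
}

def expand(word, depth=0):
    """Reduce a command word to the base-500 actions it ultimately invokes."""
    if depth > 20:
        raise RecursionError("circular command definition")  # never-ending loop guard
    if word in BASE_500:
        return [word]
    actions = []
    for w in word_commands.get(word, []):
        actions += expand(w, depth + 1)
    return actions

print(expand("ANSWER"))  # ['FIND', 'REPLY']
```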

The primary focus is, of course, the 500 key words. What they are and what makes them unique is as important as the hard code they execute. These relationships are what determine whether you get circular never-ending loops in your processing or reliable output. The commands are like word definitions, they are a list of other words that represent the actions taken when a sentence is processed. When you evaluate the words in a sentence you identify all the parts of the sentence, and you have the information needed to execute the sentence. The main verb is the starting point.


Example 1: WHAT CAN FLY? is our user query.

WHAT - when the first word in the sentence is WHAT

FIND(OBJ1A, OBJ1B)    OBJ1A = rest of query line

REPLY(OBJ1B)          OBJ1B = the answer


Starting with the main verb, in this case a query whose main verb is the word WHAT, we want to execute the commands for WHAT. The command that matches in this case is the command where WHAT is also the first word in the query’s structure. In this case we know that the rest of the sentence/query is comprised of descriptions of what we are looking for. So the action is quite generic in the sense that we merely redefine the problem into a FIND command and send the answer back with a REPLY command.


Example 2: FIND(OBJ1A, OBJ1B) is our command.

FIND - when the first word in the sentence is FIND

Call FindRelatedKBSentences

Call ElimBogusRelatedKBSentences

If OBJ1B is only one sentence, then answer found

Otherwise multiple sentences have the answer. (or none do)



The command for the word FIND that we execute for WHAT is a simple noun/subject or verb/predicate match on related sentences which is then restricted by matching restricting or modifying words and checking for negations and conflicting sentences. These are already part of the whole concept of the relational setup already in place, they just need to be checked and evaluated.

Of course, this is where the problems become complex. How many commands will there be? At the minimum, we will have to program the 500 words, and each will probably have more than one command, since each word can generally be used to mean more than one thing. However, many of the 500 base words may not have any bearing on what the program can do. Thus it may be possible to restrict our concern to those words that are actually required to allow the program to function.

What about the other verbs? The other verbs will have a command section that is composed of other words, preferably comprised of words from the base 500.

How many of them will be the cause of a tremendous number of machine-boggling commands that seem to be looping into infinity? At this point, who can say? But through careful attention and testing, I don’t believe that this will be a problem. The best solution to this problem is to only use words from the base 500 to form the command section for words.

Can we use our dictionary, evaluating our commands based on what we find there? No. While it might seem logical, you must consider that a word's definition often has nothing to do with the actions or commands associated with the use of the word in a sentence. But what we can do is put a list of commands in each unique node in the vocabulary. This means we have created a pseudo-dictionary for each word. And yes, this would become extreme rather quickly if we were required to do this many times for each word in the vocabulary. Another interesting possibility: what if we re-wrote our dictionary in terms of the base 500 words? Could we use the dictionary then? For the same reasons as before, NO. But we may be able to base our commands on the words stored there.

Can we use knowledge stored in our knowledgebase to form commands? Yes. We can actually re-teach what the words are and do by essentially redefining what each word is and what it means, as well as when to use it. This is the hard part. Knowledge must be built upon knowledge one layer at a time. This layering of knowledge is what forms the knowledgebase. It is what determines the program's ability to handle information beyond the abilities of a normal relational system. Thus the knowledge base can actually allow the program to supersede the original program and the original data through the use of the defining sentences and paragraphs that can be used to form an adequate reply to a query or command.

Unfortunately, my meta-language is far from complete, as is my discussion of it here. In fact, I have but a small fraction of the data that will be required to make the program fully capable. However, the method seems to work correctly with the limited test data currently available.




As you may have noticed, there are three main sections to my NLP. The first is the creation of a vocabulary. The second is the creation of the knowledgebase. And the third is the user interaction section. And while all three of these deserve several chapters, if not whole books, I have had to condense their information into this single paper for introductory reasons.

The most important aspect of the vocabulary is that it must be self-contained. And it must have the following features. It must have unique word/meaning nodes, and each node should have all the pertinent information normally included in a dictionary listing plus the word features such as animate vs. inanimate, a physical object vs. a non-physical object, etc.

The most notable feature of the knowledgebase is its ability to hold information, information that can allow the program to exceed its original programming by redefining information to a newer form. The next-most notable feature of the knowledgebase is its size. The more sentences the better. And of course the different representations of each sentence that allow for additional links to the information stored here. The ability of the knowledgebase to contain paragraphs and idea based groupings is also a step forward in our quest for a perfect NLP.

The last step is the user interface. The restrictions to which we put our NLP are going to help determine our commands and queries. The commands executed as a result of user-interfacing, while extensive and mostly undiscovered as yet, enable the program to achieve extraordinary results and capabilities.

In conclusion, while many of the problems associated with NLPs have been researched and, where possible, resolved, the semantic issues as well as the actual thinking process have, for the most part, eluded resolution. By creating a knowledge base in the same way that we humans build our own, it is hoped that many of these issues can be resolved. Furthermore, through the use of this knowledge base, it is hoped that we can incorporate methodologies and rules that will allow the program to perform as an adult human in most respects. By evaluating input into its base idea or intent, and then reflecting on the content through associated words and concepts, I believe we can create a superior simulation of the processes of the human mind.


Printout of a Typical Output for the Program


The following is a list of the output.

























I Am Sorry. I Can Not Execute The Command.



I Am Sorry. I Do Not Know The Answer.



I Am Sorry. I Do Not Know The Answer.























CHAIR(n1) Is Not In The Vocabulary

I Can Not Process
































I Am Sorry. I Do Not Know The Answer.



1 + 0 = 1



2 + 1 = 3



3 + 2 = 5



5 + 3 = 8



1 + 4 = 5



7 + 5 = 12



8 + 6 = 14



1 + 7 = 8



9 + 8 = 17



1 + 9 = 10



1 - 0 = 1



2 - 1 = 1



5 - 3 = 2



1 - 4 = -3



7 - 5 = 2



1 - 7 = -6



19 - 18 = 1






1 DOG - 9 DOGS = -8 DOGS



1 DOG - 2 DOGS = -1 DOG



1 DOG + 1 DOG = 2 DOGS



3 DOGS - 2 DOGS = 1 DOG



8 DOGS - 6 DOGS = 2 DOGS



1 DOG - 9 DOGS = -8 DOGS



1 DOG - 2 DOGS = -1 DOG



1 DOG + 1 DOG = 2 DOGS



3 DOGS - 2 DOGS = 1 DOG



8 DOGS - 6 DOGS = 2 DOGS



1 - 7 = -6



Link to post
Share on other sites

Being somewhat familiar with the topic, I found this interesting...


...however, it's almost a full paper, but without enough context and with too much detail for those who are not experts to make any sense of it at all. Conversely, to those of us who are familiar with the topic, it's mostly an overview, or at least seems so from a cursory read-through.


We're all about discussion here, and it's often best to keep your explanations brief, with pointers elsewhere for background information; but most importantly, try to pose a question or issue quickly so we can decide what to do with your thread!


Lexed and parsed, but not grokked,


Link to post
Share on other sites
In all fairness, Scott asked me where to post this. Since our Member articles section currently is closed, I told him to post it in the most relevant forum.


So this is very welcome stuff, Scott! :)

I missed that, so apologies, Scott.


But I'd like to know what you'd like us to do with this. Are there issues you want to bring out or input you'd like to get?


What did that human say,



Hello Buffy,

Since you brought it up, yes, there are some things I could use help with. My lexicon needs reviewing and correcting. My knowledge base is in sore need of expansion. And I need an adequate syntax checker/parser. Also, I get the feeling my paper belongs someplace other than where it is within the framework of this forum. There I would be able to include the diagrams that didn't want to paste when I placed it here. I do not mind if the powers that be move the paper to its rightful place, even if some feel that place is the circular file :hihi:


And I need an adequate syntax checker/parser.
For your paper or for your linguistic analysis program? :hihi: :hihi: Sorry! Couldn't resist!


I do not mind if the powers that be move the paper to it's rightfull place, even if some feel it is the circular file:hihi:

I think your paper is really interesting--definitely not destined for /dev/null! I think it would be really great if you could lead a discussion on it! We do have some changes coming around here that will give these longer pieces a "place to go," but for now this is a good enough approach.


So, what would be really nice now is if you could point out a few issues contained within the paper so we can talk about them. We're all about discussion around here, and the person who starts the thread is the owner and has to decide where to take it, so that's your job.


Cheers, :)


  • 4 weeks later...


In case you haven't noticed, I have been extremely busy lately, 12+ hrs a day Mon-Sat and 8 on Sun; it is my busy season. But I will try to make a few hours a week available. As to where I want to go with this paper, I want to interface it with Windows in such a way that the program can execute commands and perform tasks not only in Windows but on the Internet. Also, I want to implement a few more aspects of the program that I have fudged, like the syntax checker/identifier. As to a discussion, that is possible as long as I have plenty of notice. As for a topic, how about what type of knowledge should be incorporated into the knowledge base, and how in-depth it has to be to actually be useful?




I am not using Lex/Yacc.

Actually, there is no hardcoding for grammar. Grammar is (will be) stored in the knowledge base. This is one of the methods that has enabled me to use compiled code instead of interpreted. As far as the code is concerned, any structure is allowed, at least until it is analyzed for syntax.

The structure of the code should not be an issue as far as implementation goes; the input/output are what I need a Windows interface for. The only issue might be the language, IBM PL/I. Right now I simply read a file representing user input and write the output to a file.

But implementation is only one of the issues I want to address soon. I really need a good syntax checker/analyzer for the user input and for the generation of the knowledge base. I say generation of the knowledge base because the knowledge base file is more than a list of simple sentences; it also has the sentence structure and fields all laid out to reduce computation time on startup.

The other major area actually is the grammar. I was hoping to get the syntax analyzer program so that I wouldn't have to manually set the fields within the knowledge base for the sentences that contain the rules for grammar. The 115(?) sentences in the test knowledge base were quite tedious :doh:
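The grammar-as-data idea described here, rules stored in the knowledge base rather than hardcoded, can be illustrated with a tiny table-driven checker. Everything below, including the rule table, tag names, and lexicon entries, is invented for illustration; the actual program is PL/I and its rule format is not shown in the thread.

```python
# Hypothetical illustration of grammar-as-data: part-of-speech patterns live
# in a table (standing in for the knowledge base flat file), and one generic
# routine checks input against them, so new rules need no code changes.

GRAMMAR = [                       # illustrative rule table, loaded as data
    ("DECLARATIVE", ["DET", "NOUN", "VERB"]),
    ("COMMAND",     ["VERB", "DET", "NOUN"]),
]

LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB", "fetch": "VERB"}

def classify(words):
    """Return the first grammar rule whose POS pattern matches the input."""
    tags = [LEXICON.get(w.lower()) for w in words]
    for name, pattern in GRAMMAR:
        if tags == pattern:
            return name
    return None

print(classify(["the", "dog", "barks"]))   # DECLARATIVE
```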



I think you misunderstood my question: lex and yacc will do nothing for analyzing the linguistic input to the system, and I understand maintaining (and manipulating!) the language input grammars in a database. lex and yacc are only good for "well-structured" languages like computer languages.


But it's clear that you've developed one that you use in your paper, a structured translation of the input language, and *that* was what I assumed you needed a "syntax checker" for--something that lex and yacc are really good at.


Maybe you want to start with a quick overview of your architecture: what are all the pieces and how are they supposed to work together?


Also, PL/I is indeed problematic: not many people are familiar with it, and getting compilers that are compatible may be hard to do: there's a Gnu PL/I project that seems rather incomplete, and I think IBM's PL/I for Windows costs a small fortune for most folks (I see references to a "Personal" edition, but I can't find it). Of course, I have to pass along that my compiler professor liked to say that "PL/I is like a gigantic Swiss Army Knife: it's got a tool for everything, but by the time you've found the one you want, you've cut yourself."


Is there a reason you picked PL/I?


Language Diva,




The exact structure is something that I will have to get back to you on; it is slightly to moderately complex.

The reason I picked IBM PL/I is because I have a couple of copies of it from when I worked as a senior IBM mainframe programmer in the Factory at Sabre; they were a gift so that I could work for free at home. And since the two languages in primary use at IBM shops are COBOL and PL/I, I chose PL/I.

And you are right, it is a little expensive, about the price of a good used car. But it is the best compiler that I have, and it is the language that I am best at.




is this what you meant by structure? There are only three main sections.

CALL CREATEVOCAB; /* CreateFlatVocabFile */
/* This procedure sets up the vocabulary network and fills in the */
/* various fields */
/*** Set Definition Words Pointers ***/
/*** Adding Related Words ***/

/* This procedure sets up the initial knowledge base using the */
/* initial sentences as input */
CALL FINDSENTWORDS; /* set vocab word ptrs for sentence words */
CALL SETSENTWORDS2; /* set vocab word ptrs for sentence words */

/* CALL ExpandKnowledgeBase; */

I am light on the user input section, but I think you get the idea from the knowledge base as to how this thing works.



is this what you meant by structure? There are only three main sections.

This looks like the structure of your program as far as specifying the modules. The structure I was referring to, though, is that in your paper you have an obvious syntax for the knowledge rules, as well as a "translation" (?) of the user input, that is well structured and could be parsed with a simplistic lexical analyzer (possibly akin to your FINDSENTWORDS) and an LALR-or-somesuch-style parser (possibly akin to your FINDSENTPARTS). These two pieces are what lex and yacc do, and they do what you asked for when you said you were looking for a "good syntax checker/analyzer for the user input and for the generation of the knowledge base."
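The two lex/yacc-style pieces named here, a lexical analyzer and a parser, can be sketched minimally in Python. The rule syntax below (`NAME -> word word | word ...`) is invented purely for illustration and is not the paper's actual knowledge-rule format.

```python
# Minimal sketch of what a lex/yacc-style front end does: a tokenizer splits
# structured rule text into typed tokens, and a tiny checker validates them
# against a toy grammar for rules of the form "NAME -> word word | word".
import re

TOKEN_RE = re.compile(r"\s*(?:(?P<WORD>[A-Za-z]+)|(?P<ARROW>->)|(?P<BAR>\|))")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"bad character at {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse_rule(text):
    """Accept rules of the form: NAME -> word word | word ..."""
    toks = tokenize(text)
    if len(toks) < 3 or toks[0][0] != "WORD" or toks[1][0] != "ARROW":
        return False
    # every remaining token must be a word or an alternative separator
    return all(kind in ("WORD", "BAR") for kind, _ in toks[2:])

print(parse_rule("SENTENCE -> subject verb | verb object"))   # True
```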


You might want to dive into a discussion both of your approach to building your knowledge base and of how you use it (a really fascinating topic all on its own), and also of how you make sense of the "user input."


Apropos of nothing, I'll mention that in terms of linguistic parsing, the most popular approach these days has to do with creating a knowledge base of existing, pre-translated text that is used to match input, rather than the older and in the past much more popular approach of trying to fully parse and analyze the text, and only later looking for idioms and context within the knowledge base for disambiguation.
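A toy version of the Matching approach described above: store pre-analyzed examples and pick the stored example with the greatest word overlap with the input, rather than parsing the input from scratch. The example base and the "meaning" labels below are invented; real systems use vastly larger example bases and far better similarity measures.

```python
# Hypothetical sketch of example-based matching: each stored sentence carries
# a pre-assigned meaning label, and the input is resolved to the label of the
# stored sentence sharing the most words with it.

EXAMPLES = {                       # illustrative "pre-translated" fragments
    "how old are you": "QUERY-AGE",
    "what is your name": "QUERY-NAME",
    "close the door": "COMMAND-CLOSE",
}

def best_match(sentence):
    words = set(sentence.lower().split())
    scored = [
        (len(words & set(ex.split())), meaning)
        for ex, meaning in EXAMPLES.items()
    ]
    score, meaning = max(scored)
    return meaning if score > 0 else None

print(best_match("what is your name please"))   # QUERY-NAME
```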


Infinite lookahead,



Thx, am looking forward to more critiques!

And don't knock yerself fer bein a layperson; I do home improvements myself for a living :-)

But on a more serious note, I wrote this paper for the general comp sci undergrad; I wanted to reach more than just those few who are specialists in the field. Besides, when I worked at a place where I was surrounded by 2000+ programmers, I learned that the best way to avoid creating more questions than I answered was to use simpler words. In fact, I had to go down to the vocabulary of a sixth grader to make almost everything clear. The fact that you find this paper understandable and interesting, combined with your admission to being a layperson as far as comp ling goes, is a testament to my success.




Where can I find yacc and lex?

But as far as looking things up in the knowledge base as a precursor to parsing user input, I don't think I agree that it is better, at least for my approach (I am old school, after all). But who can tell? Maybe when the knowledge base has enough sentences. Although, you should also realize that I keep track of all user input and refer to previous user input when parsing/processing input.

One of my main issues is the knowledge base. The construction of knowledge, i.e., where do you start and in what order do you introduce the sentences so as to be able to build upon earlier knowledge, together with my opinion that there should be 1 - 5 million sentences in the knowledge base, has me a little concerned. The intended size of the knowledge base is why I created so many pointers as shortcuts to move around quickly, no matter how big a programming headache this gave me! But the size also has me a little concerned about the hardware this may require. I have had to buy a new PC more than once over the years to run the program. Four years ago the vocabulary network took 12 hrs to initialize, until I rewrote it slightly to include alpha shortcuts, which brought the time down to 45 minutes. My current PC (< 6 mos old) runs the entire main program in about 20 secs, so hopefully I will be able to expand the vocabulary and KB quite a bit before hitting the hardware wall again.
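The "alpha shortcuts" speedup described above (12 hours down to 45 minutes) is consistent with a first-letter index over the vocabulary. The Python sketch below is hypothetical: the actual PL/I pointer network is far more elaborate, but this shows why such an index cuts lookup and initialization time.

```python
# Hypothetical sketch of "alpha shortcuts": instead of scanning the whole
# vocabulary list for each word, bucket entries by first letter so a lookup
# only searches one bucket, turning long linear scans into short ones.
from collections import defaultdict

def build_index(vocab):
    index = defaultdict(list)
    for word in vocab:
        index[word[0].lower()].append(word)
    return index

def lookup(index, word):
    """Search only the bucket for the word's first letter."""
    return word in index.get(word[0].lower(), [])

idx = build_index(["apple", "ant", "bear", "cat"])
print(lookup(idx, "bear"))   # True
print(lookup(idx, "dog"))    # False
```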

But I digress; what I want yacc/lex for is the expansion of the KB. I do this as a separate program that creates the flat file used by the main program; therefore the program layout shouldn't be a concern.


where can i find yacc and lex?
"lex" ("lexical analyzer") and "yacc" ("yet another compiler compiler") are the original names of the programs built into Berkeley Unix, and they are embedded in the licensed versions of that OS (see FreeBSD or OpenBSD). Good freeware versions are flex and bison.

But as far as looking up in the knowledgebase as a precursor to parsing user input, i don't think i agree that it is better, at least for my approach, (i am old school, after all). But who can tell?
A lot of the people who have worked on this problem are old school, and it's taken them a long time--and a lot of practical testing--to realize that blind sentence-fragment matching provides better translation of language, due both to idiomatic meanings and to "world knowledge" (implied understanding based on the context of, or concepts related to, the actual language content).
Maybe when the knowledgebase has enough sentences, although you should also realize that i also keep track of all user input and do refer to previous user input when parsing/processing input.
And this is where the implementation issue comes in. Matching is better than Parsing only if you have an *enormous* knowledge base, at least to start with (it's a point of debate whether these knowledge bases can be pruned over time to eliminate redundancy in some automated fashion).


It actually sounds to me like you're backing into this approach too:

The construction of knowledge, ie where do you start, what order do you introduce the sentences so as to be able to build upon earlier knowledge, and the fact that it is my opinion that there should be 1 - 5 million sentences in the knowledgebase has me concerned a little.
I guess I'm still a little bit unclear as to what exactly you're putting in the knowledgebase. This quote seems to indicate you're actually storing the original input (that's the Matching approach), as opposed to the distilled "meaning" of the input.


Maybe you want to describe a bit about what the process is that you're using for both building the knowledge base and then how you use it.


Skipping whitespace,


