Jump to content
Science Forums

Recursive Equation For Identifying The Most Likely Pattern From Input


Recommended Posts

I recently had to create a bail predictor for work in order to save our underwriters time on finding out exactly what the bail amount is going to be for any given charge. I admit that I took the easy way out on the pattern recognition search algorithm by using exponents, but in the process I discovered something that may be useful in the future.

 

What I did was take an input (like "pattern"), look at its length, and then calculate the length of the input as well as all of the patterns within the input. The recursive math looks something like this.

 

Pattern(length of 7)

7

6,6

5,5,5,5

4,4,4,4,4,4,4,4

3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3

2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2

1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

 

As you see, it gets exponentially larger, but there's no mathematical equation or operator that I know of that will give you the addition of all of these numbers (it's hierarchical).

 

I just took the easy way out and took the length of the pattern and raised it to the power of its own length. It works the same way, maybe even better... that's why I don't know why I'm writing this out to begin with but whatever.

 

The hierarchical recursive pattern above comes out to a strength 247 whereas just doing the exponent comes out to a much higher number.

Link to comment
Share on other sites

Not really clear on why the standard summation ([math]\sum\nolimits[/math]) operator doesn't get you what you need here. Too lazy to work it out but I bet if you did a sum over n to 1 you could do it.

 

 

I had a polynomial once. My doctor removed it, :phones:

Buffy

Link to comment
Share on other sites

I'd do reverse order n to 1 since the initial item has to be n. Also, many people forget that concatenation is actually a legitimate mathematical operator (well, except for some stuck-in-the-19th-century-purists).

 

My mathematics is simple: one plus one = one, :phones:

Buffy

Link to comment
Share on other sites

7

6,6

5,5,5,5

4,4,4,4,4,4,4,4

3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3

2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2

1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

The hierarchical recursive pattern above comes out to a strength 247 whereas just doing the exponent comes out to a much higher number.

As you see, it gets exponentially larger, but there's no mathematical equation or operator that I know of that will give you the addition of all of these numbers (it's hierarchical).

Using sigma notation, you could write the equation for this

[math]f(n) = \sum_{i=0}^n (n-i) 2^i[/math]

 

It can be written more simply as this exponental equation:

[math]f(n) = 2^{n+1} –n -2[/math]

 

It’s a fun family of series. I wonder what it's like for sequences than a simple n, n-1 … 0 and bases other than 2?

Link to comment
Share on other sites

Wow you guys just took this wayyyy over my head. I deal with computational linguistics and the most basic algebra. I really need to take advanced algebra to understand some more stuff that is necessary to my profession... I just can't find the motivation.

 

I'm wondering if your equations are correct though. Like I said though, taking the length of the pattern and raising it to the power of itself is sufficient for the task that I'm trying to perform. I just wanted to point out the hierarchical nature of calculating the exact statistical likelihood of an input matching a pre-existing pattern.

 

It goes like this.

 

                              input = good

 

                                 good = 4

                            /                      \

                 goo=3                            ood=3

                /          \                            /        \

       go=2            oo=2              oo=2         od=2

     /        \            /      \              /      \         /        \ 

g=1        o=1  o=1     o=1       o=1  o=1   o=1   d=1

 

 

26 is the answer.

Link to comment
Share on other sites

The reason it has to be done this way (or the easier way of using exponents) is because you don't want some arbitrary pattern coming along and matching to the input when the only thing that the pattern and the input have in common is a bunch of single letters.

 

An example is like so-

 

Input= "Hi how are ya?"

 

matching pattern should be "Hi, how are you?"

 

pattern matched = "Hello ivan, are you ready to have some fun?"

 

A response to the pattern matched could be "yes!" when you want a relevant answer like "I'm fine."

 

The reason it would match the wrong pattern is because the one matched is longer than the one it should be matched to and contains a lot of matching letters and patterns from the input therefor making the statistical strength higher for the wrong pattern.

Link to comment
Share on other sites

To make this more complicated (and since I have the time), I'll calculate the example above.

 

input = "Hi how are ya?"

 

matching pattern = "Hi, how are you?"

 

"Hi" is in the matching pattern, "how are y" is there too, and "a" and "?" is there which gives us the sum exponent of 4 + 387420489 + 1 + 1 = 387420495

 

in the bad pattern ("Hello ivan, are you ready to have some fun?") we only get "H"(1), "i"(1), "h"(1), "o"(4), "are y"(3125), "a"(3), "?"(1). This comes to a sum of 3136. Obviously not a good match if you're using exponents which therefor gives you a relevant response.

Edited by Poppins
Link to comment
Share on other sites

I think you need to back up and tell us what problem you're trying to solve, since actually the answers we all gave are literal solutions to the problem statement "how do you generate the sequence of integers '7 6,6 5,5,5,5...'" which clearly does not have a whole lot to do with the higher level problem you're trying to solve.

 

Even in the last several posts there are a bunch of possible interpretations of what you're trying to accomplish:

 

  • Are you trying to come up with a computationally simple mechanism to *estimate* the probability of two patterns having a close match of the sequence of characters?
  • Are you trying to come up with a computationally simple mechanism to *estimate* the probability of two sentences being *semantically* similar?
  • Are you trying to come up with a statistical approximation of the distribution of characters in two strings based on locality of individual characters?
  • Something completely different?

 

People are often amazed at the statistics that software projects fail 80% of the time: as a software professional, I can tell you that usually has a lot more to do with bad specifications for the problem one wants to solve than the inability of the programmers to build it (and I say that as someone who's spent most of their time in marketing coming up with such specifications and seeing how horribly botched they can get before a single line of code is written).

 

 

That's the thing about people who think they hate computers. What they really hate is lousy programmers, :phones:

Buffy

Link to comment
Share on other sites

Wow you guys just took this wayyyy over my head. I deal with computational linguistics and the most basic algebra. I really need to take advanced algebra to understand some more stuff that is necessary to my profession... I just can't find the motivation.

I agree you should master at least practical, engineering math to be a good computer programmer, Poppins, but should note I didn’t use any algebra to finding the expressions for the numeric series you described.

[math]f(n) = \sum_{i=0}^n (n-i) 2^i[/math]

is just a simple program written in an old, widely understood nonprocedural programming language. Written in a generic procedural language, it’s

function f(n) {
 i=0, s=0
  loop while i<n {
    s = s + (n-i)*2^i
    i = i + 1
  }
}
return s
I found

[math]f(n) = 2^{n+1} –n -2[/math]

simply by noticing that the difference between the value f(n) differed by the power of 2 term by n +2, by writing and running this tiny, 1-line MUMPS program:

f N=1:1 x "s S=0,J=1 f I=0:1:N s S=N-I*J+S,J=2*J" w N,?4,S,?12,2**(N+1),?20,2**(N+1)-S,!

which produced output starting:

1 1 4 3

2 4 8 4

3 11 16 5

4 26 32 6

5 57 64 7

6 120 128 8

 

Whether the exponential expression is better – that is, computationally more efficient – than the Sigma notation equivalent program, is a slightly deeper question, involving how the “^” or “**” exponentiation operator is implemented. For a good, and most of commercial and public domain computer language implementations, the answer is yes.

 

I'm wondering if your equations are correct though ...

good = 4 … 26 is the answer.

2^(4+1) -4 -2 = 32 -4 -2 = 26

 

As a quick, easy check, I compared the expression to the output of f(n) up to the limits of limits of my language interpreter’s built-in math precision, so unless I misunderstand what you’re sketching, the equation is correct.

 

Finding short expressions for numeric sequences if fun, but I don’t think it has much to do with determining how closely strings match for a given rule, which I gather is what you’re interested in, Poppins. On the subject of string matching...

input = "Hi how are ya?"

matching pattern = "Hi, how are you?"

 

"Hi" is in the matching pattern, "how are y" is there too, and "a" and "?" is there which gives us the sum exponent of 4 + 387420489 + 1 + 1 = 387420495

How did you calculate 387420489? f(9)=1013, not 387420489. Your 4, 1, and 1 are what I expect.

 

Are you familiar with the diff algorithm? It’s a venerable (and IMHO beautifully elegant) solution to “pattern matching” or “similarity of strings” problems I think are similar to the one you’re thinking of.

 

The diff function’s input are 2 sequences (usually of lines of newline-delimited text, but stings can also be used), its output a sequence of delete and inserts to change one input into the other. A simple “goodness of match” value can be gotten by counting the number of deletes and inserts. Here’s such a count for your examples:

Hi how are ya? | Hi, how are you? | 4

Hello ivan, are you ready to have some fun? | Hi, how are you? | 35

 

So, according to diff, “Hi how are ya?”, which needs only 4 character deletes and inserts to be transformed into “Hi, how are you?” is a better fit than “Hello ivan, are you ready to have some fun?”, which needs 35.

Link to comment
Share on other sites

I think you need to back up and tell us what problem you're trying to solve, since actually the answers we all gave are literal solutions to the problem statement "how do you generate the sequence of integers '7 6,6 5,5,5,5...'" which clearly does not have a whole lot to do with the higher level problem you're trying to solve.

 

Even in the last several posts there are a bunch of possible interpretations of what you're trying to accomplish:

 

  • Are you trying to come up with a computationally simple mechanism to *estimate* the probability of two patterns having a close match of the sequence of characters?
  • Are you trying to come up with a computationally simple mechanism to *estimate* the probability of two sentences being *semantically* similar?
  • Are you trying to come up with a statistical approximation of the distribution of characters in two strings based on locality of individual characters?
  • Something completely different?

 

People are often amazed at the statistics that software projects fail 80% of the time: as a software professional, I can tell you that usually has a lot more to do with bad specifications for the problem one wants to solve than the inability of the programmers to build it (and I say that as someone who's spent most of their time in marketing coming up with such specifications and seeing how horribly botched they can get before a single line of code is written).

 

 

That's the thing about people who think they hate computers. What they really hate is lousy programmers, :phones:

Buffy

 

Here is and example of the problem that I was trying to solve:

 

The search was domestic violence and every search produces 10 results. The results would be something like this-

 

1. Burglary

2. Criminal Sexual Conduct 1st degree

3. Criminal Sexual Conduct 2nd degree

4. Criminal Sexual Conduct 3rd degree

5. Criminal Sexual Conduct 1st degree (Multiple variables)

6. Domestic Violence

7. Breaking & Entering

8. Operating a chop shop

9. Larceny from the person > $250 < $1000

10. Indecent Exposure

 

What I wanted was a better list for all of the Domestic Violence charges that can be associated with Domestic Violence. (I compiled them based on statute numbers).

 

My solution was like a band-aid. Take the string, find what is in the string. Take len(string) and ** (exponentiate it) to len(string) (which, btw, is how I got the number 387420489 because I took len("how are y") and ** to len("how are y").

 

I assume that I need nothing more. I have a pretty interesting algorithm for this task (as a few professors have pointed out so far). I'm going to show it to you guys because I like the fun in sharing code... and because of how interesting the output is.

poi = 'Awesome dude'
npoi = ''
while len(poi + npoi) != 0:
    if len(poi) != 0:
        npoi = poi[-1] + npoi
    if len(poi) == 1:
        poi = ''
    else:
        poi = poi[:-1]
    if len(poi) == 0:
        poi = npoi[1:]
        npoi = ''

poi stands for "point of interest", npoi stands for "next point of interest". I'm probably going to have to write another post to show the output so that's what I'm going to do.

Link to comment
Share on other sites

I made it print spaces instead of character for the poi and the npoi.

while len(poi + npoi) != 0:
    print ' ' * len(poi), '||'
    print ' ' * len(npoi), '||'
    ...
             ||
 ||
            ||
  ||
           ||
   ||
          ||
    ||
         ||
     ||
        ||
      ||
       ||
       ||
      ||
        ||
     ||
         ||
    ||
          ||
   ||
           ||
  ||
            ||
            ||
 ||
           ||
  ||
          ||
   ||
         ||
    ||
        ||
     ||
       ||
      ||
      ||
       ||
     ||
        ||
    ||
         ||
   ||
          ||
  ||
           ||
           ||
 ||
          ||
  ||
         ||
   ||
        ||
    ||
       ||
     ||
      ||
      ||
     ||
       ||
    ||
        ||
   ||
         ||
  ||
          ||
          ||
 ||
         ||
  ||
        ||
   ||
       ||
    ||
      ||
     ||
     ||
      ||
    ||
       ||
   ||
        ||
  ||
         ||
         ||
 ||
        ||
  ||
       ||
   ||
      ||
    ||
     ||
     ||
    ||
      ||
   ||
       ||
  ||
        ||
        ||
 ||
       ||
  ||
      ||
   ||
     ||
    ||
    ||
     ||
   ||
      ||
  ||
       ||
       ||
 ||
      ||
  ||
     ||
   ||
    ||
    ||
 
 
It looks like DNA :)
Link to comment
Share on other sites

I agree you should master at least practical, engineering math to be a good computer programmer, Poppins, but should note I didn’t use any algebra to finding the expressions for the numeric series you described.

[math]f(n) = \sum_{i=0}^n (n-i) 2^i[/math]

is just a simple program written in an old, widely understood nonprocedural programming language. Written in a generic procedural language, it’s

function f(n) {
 i=0, s=0
  loop while i<n {
    s = s + (n-i)*2^i
    i = i + 1
  }
}
return s
I found

[math]f(n) = 2^{n+1} –n -2[/math]

simply by noticing that the difference between the value f(n) differed by the power of 2 term by n +2, by writing and running this tiny, 1-line MUMPS program:

f N=1:1 x "s S=0,J=1 f I=0:1:N s S=N-I*J+S,J=2*J" w N,?4,S,?12,2**(N+1),?20,2**(N+1)-S,!

which produced output starting:

1 1 4 3

2 4 8 4

3 11 16 5

4 26 32 6

5 57 64 7

6 120 128 8

 

Whether the exponential expression is better – that is, computationally more efficient – than the Sigma notation equivalent program, is a slightly deeper question, involving how the “^” or “**” exponentiation operator is implemented. For a good, and most of commercial and public domain computer language implementations, the answer is yes.

 

2^(4+1) -4 -2 = 32 -4 -2 = 26

 

As a quick, easy check, I compared the expression to the output of f(n) up to the limits of limits of my language interpreter’s built-in math precision, so unless I misunderstand what you’re sketching, the equation is correct.

 

Finding short expressions for numeric sequences if fun, but I don’t think it has much to do with determining how closely strings match for a given rule, which I gather is what you’re interested in, Poppins. On the subject of string matching...

How did you calculate 387420489? f(9)=1013, not 387420489. Your 4, 1, and 1 are what I expect.

 

Are you familiar with the diff algorithm? It’s a venerable (and IMHO beautifully elegant) solution to “pattern matching” or “similarity of strings” problems I think are similar to the one you’re thinking of.

 

The diff function’s input are 2 sequences (usually of lines of newline-delimited text, but stings can also be used), its output a sequence of delete and inserts to change one input into the other. A simple “goodness of match” value can be gotten by counting the number of deletes and inserts. Here’s such a count for your examples:

Hi how are ya? | Hi, how are you? | 4

Hello ivan, are you ready to have some fun? | Hi, how are you? | 35

 

So, according to diff, “Hi how are ya?”, which needs only 4 character deletes and inserts to be transformed into “Hi, how are you?” is a better fit than “Hello ivan, are you ready to have some fun?”, which needs 35.

 

 

The generic code you posted looks almost exactly like R. I'm not familiar with the diff algorithm but I'm about to check it out. All I'm doing today is some fun database maintenance and making sure that my programs are gathering data correctly. I've gathered over 100,000 records over the weekend without supervision, pretty amazing huh?

Link to comment
Share on other sites

If you guys got python try this one out, it's pretty fun to look at.

poi = raw_input('>')
npoi = ''
while len(poi + npoi) != 0:
	if len(poi) > len(npoi):
		print ' ' * len(npoi), '||', '=' * (len(poi)-len(npoi)), '||'
	else:
		print ' ' * len(poi), '||', '=' * (len(npoi)-len(poi)), '||'
	if len(poi) != 0:
		npoi = poi[-1] + npoi
	if len(poi) == 1:
		poi = ''
	else:
		poi = poi[:-1]
	if len(poi) == 0:
		poi = npoi[1:]
		npoi = ''
 || ================== ||
  || ================ ||
   || ============== ||
    || ============ ||
     || ========== ||
      || ======== ||
       || ====== ||
        || ==== ||
         || == ||
          ||  ||
         || == ||
        || ==== ||
       || ====== ||
      || ======== ||
     || ========== ||
    || ============ ||
   || ============== ||
  || ================ ||
 || ================= ||
  || =============== ||
   || ============= ||
    || =========== ||
     || ========= ||
      || ======= ||
       || ===== ||
        || === ||
         || = ||
         || = ||
        || === ||
       || ===== ||
      || ======= ||
     || ========= ||
    || =========== ||
   || ============= ||
  || =============== ||
 || ================ ||
  || ============== ||
   || ============ ||
    || ========== ||
     || ======== ||
      || ====== ||
       || ==== ||
        || == ||
         ||  ||
        || == ||
       || ==== ||
      || ====== ||
     || ======== ||
    || ========== ||
   || ============ ||
  || ============== ||
 || =============== ||
  || ============= ||
   || =========== ||
    || ========= ||
     || ======= ||
      || ===== ||
       || === ||
        || = ||
        || = ||
       || === ||
      || ===== ||
     || ======= ||
    || ========= ||
   || =========== ||
  || ============= ||
 || ============== ||
  || ============ ||
   || ========== ||
    || ======== ||
     || ====== ||
      || ==== ||
       || == ||
        ||  ||
       || == ||
      || ==== ||
     || ====== ||
    || ======== ||
   || ========== ||
  || ============ ||
 || ============= ||
  || =========== ||
   || ========= ||
    || ======= ||
     || ===== ||
      || === ||
       || = ||
 
This one is fun to look at too.
 
while True:
    poi = raw_input('DNA> ')
    npoi = ''
    while len(poi + npoi) != 0:
        if len(poi) > len(npoi):
            print ' ' * len(npoi), '0=' + '=' * (len(poi)-len(npoi)) + '=0'
        else:
            print ' ' * len(poi), '0=' + '=' * (len(npoi)-len(poi)) + '=0'
        if len(poi) != 0:
            npoi = poi[-1] + npoi
        if len(poi) == 1:
            poi = ''
        else:
            poi = poi[:-1]
        if len(poi) == 0:
            poi = npoi[1:]
            npoi = ''
Edited by Poppins
Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
×
×
  • Create New...