Science Forums

Parsing an inputted text into individual words



I have an input file that contains words. I am supposed to scan this text and save each word into some data structure (not part of my question). My question is: how do I get the words but ignore the symbols? The given text contains symbols like - , : ; ( ) _ + and so on. Here is what I have so far:

 

while (fscanf(fpname, "%s", stun) != EOF)

{

******

****

***

pname= AllocateName( stun);

*******

*****

***

*

 

}

 

 

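#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *AllocateName(char *stun)
{
    char *name;
    char let;
    size_t num;

    num = strlen(stun);
    if (num > 0)
    {
        /* Drop one trailing '.', ',' or ':' by ending the string early. */
        --num;
        let = stun[num];
        if (let == '.' || let == ',' || let == ':')
        {
            stun[num] = '\0';
        }
    }

    /* One extra byte for the terminating '\0'. */
    if (!(name = malloc(strlen(stun) + 1)))
    {
        printf("problem allocating name\n");
        exit(2);
    }

    /* Copy the cleaned word into the new buffer. */
    strcpy(name, stun);

    return name;
}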

 

Yes, yes. It only handles words that end with a period, a comma, or a colon.

So it is not a perfect solution. There can be words like:

 

(log)(base4)(12) <-- considered one word

 

but not this:

 

I have the utter-most-hatred-this-lab, where "utter", "most", "hatred", "this", "lab" are considered individual words, without the god damn '-'. In other words, if I save "utter-most-hatred-this-lab" as a string in stun, then it must be broken up into some unknown number of pieces, each individually allocated on the heap!
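One possible way to break a hyphenated token into separately allocated pieces is `strtok` plus `malloc`. This is only a sketch: the names `split_on_hyphens` and `copy_piece` are made up for illustration, and the caller is assumed to supply an array large enough for the pieces.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy one piece onto the heap (NULL on failure). */
static char *copy_piece(const char *piece)
{
    char *copy = malloc(strlen(piece) + 1);
    if (copy != NULL)
        strcpy(copy, piece);
    return copy;
}

/*
 * Split `token` at every '-' and store a heap copy of each non-empty
 * piece in `pieces`. Returns the number of pieces stored (at most `max`).
 * strtok() overwrites the hyphens with '\0', so `token` must be writable.
 */
size_t split_on_hyphens(char *token, char **pieces, size_t max)
{
    size_t count = 0;
    char *p;

    assert(token != NULL && pieces != NULL);

    for (p = strtok(token, "-"); p != NULL && count < max; p = strtok(NULL, "-")) {
        char *copy = copy_piece(p);
        if (copy == NULL)
            break;              /* out of memory: stop early */
        pieces[count++] = copy;
    }
    return count;
}
```

Called on a writable copy of "utter-most-hatred-this-lab", this stores five strings, each of which the caller later releases with `free`.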

 

example of the input text:

 

 

..........data are used, consisting ......

 

.................so-called B-trees.............

 

.................................nodes (leaves)........................

........in O(log(n)) time............................

 

......................................(Used in internet routers.)...................

 

 

The "O(log(n))" is considered to be one word. I am not sure how I should proceed.



I haven't evaluated your code; however, in your example no word begins with a symbol.

I think you meant 'parsing', not 'pausing', yes? :eek:



yes

 

Do you have any idea how I should proceed? How can I process the input string so that "O(log(n))" stays one word, while in "(Used in internet routers.)" each word inside the parentheses becomes an independent string?



Perhaps by referring to a library of the specified symbols and then specifying that whatever follows one of them begins a word, unless it too is a symbol? :eek:

PS: That only takes care of the beginning of a word!


One way is to scanf a string and put it into "stun".

Then I look for opening parentheses, and whenever I find one I push it onto a stack.

When I spot a closing parenthesis, I pop from the stack.

If the stack is empty at the end, I treat the whole god damn stun as one string.

If the stack is not empty, I keep only the letters and discard the opening and closing parentheses.


If, by word, you mean consecutive alpha characters, wouldn't it be enough to cycle through the text using a function such as:

 

int isalpha(int c){return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');}

 

to find where a word starts and when it has ended?

 

Of course, if you want "one word" to be one word you can accommodate that too, but I don't see how you could work around cases where something starts with a single " unless you can suppose that it's the last " in the text.
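To make the "where a word starts and when it has ended" idea concrete, here is a minimal sketch that uses the standard `isalpha` from `<ctype.h>` instead of a hand-written one. The function name `next_alpha_run` is hypothetical, and the sketch assumes a word is nothing more than a run of letters, which is not quite what the original poster wants for tokens like O(log(n)).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Find the next run of alphabetic characters in `text`, searching from
 * index `from` (which must not be past the end of the string).
 * On success, store the run's start index and length and return 1;
 * return 0 when no letters remain.
 */
int next_alpha_run(const char *text, size_t from, size_t *start, size_t *len)
{
    size_t i = from;

    assert(text != NULL && start != NULL && len != NULL);

    /* Skip everything that is not a letter. */
    while (text[i] != '\0' && !isalpha((unsigned char)text[i]))
        i++;
    if (text[i] == '\0')
        return 0;

    /* Walk to the end of the run; isalpha('\0') is false, so this stops. */
    *start = i;
    while (isalpha((unsigned char)text[i]))
        i++;
    *len = i - *start;
    return 1;
}
```

Calling it repeatedly with `from = start + len` walks through a line one word at a time; on "so-called B-trees." it finds "so", "called", "B" and "trees".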


Wait, :) I can see now it isn't so simple. Sorry, but I had a bit of trouble with the clarity of your posts.

So, you want O(log(n)) to be handled as one word, but not (Used in internet routers.) or happy-go-lucky. How about O(log(n-m))? Perhaps Turtle is right: if there's a space before the '(' then it isn't a single word. Wouldn't it be enough to have a simple count, rather than a stack, incrementing at '(' and decrementing at ')', so as to know when things like (log)(base4)(12) or O(log(n-m)) have ended?



So, you want O(log(n)) to be handled as one word but not (Used in internet routers.) or happy-go-lucky, how about O(log(n-m))? Perhaps Turtle is right, if there's a space before the '(' then it isn't like a single word.

 

 

By design, there is no word like O(log(n-m)), and there is never a space between the '(' and the next character.

 

Wouldn't it be enough to have a simple count, rather than a stack, incrementing at '(' and decrementing at ')' so as to know when things like (log)(base4)(12) or O(log(n-m)) have ended?

 

By design, things like (log)(base4)(12) are considered a single word.


Feel free to check my javascript word counter.

 

http://st10.startlogic.com/~thedawgs/mostuff/southie/extras/CountWords.html

 

The "Count Words" button does just that. But the "Clean Text" button only deletes hard returns to allow natural word wrapping and limits returns between paragraphs to two. It doesn't flag special characters, but it does differentiate in a way you can utilize. ASCII codes. Here's a taste.

 

   // --- validate at least one character
  for ( i = 0; i < text.length; i++ )
  {
     if ( text.charCodeAt( i ) > 32 )
     {
        // --- count first word
        wordCount++;
        break;
     }
  }

So say you were going to pipe each word in a string into individual strings or an array you would do it kinda like:

 

initialize vars

loop through string

---if charCode > 64 && < 91 || > 96 && < 123 (alpha chars only)

------pipe me into var (put sequential alpha chars into string)

---else (we have a non-alpha char)

------pipe var into array and clear var (save string without this and start over)

repeat

 

You could be more specific of course and look for individual chars such as spaces or increment a counter or whatever.


If there are words like: (home

The '(' would be deleted, and home would be a word to be allocated.

If it was: (home), then (home) would be one word.

 

Here is my thought (in C).

Let ary be the string with everything in it.

Here is my code:

1) set ptr to ary
2) loop (*ptr not equal to '\0')
2.1) if *ptr is '(', then push it onto a stack
2.2) if *ptr is ')', then pop the matching '(' from the stack
2.3) increment ptr by 1
3) end loop

4) if (stack is empty)
4.1) set det to ary
4.2) loop (*det != '\0')
4.2.1) if (isalpha(*det) || *det == '(')
4.2.2) strung[num] = *det
4.2.3) increment num
4.2.4) end if
4.2.5) increment det
4.3) end loop
4.4) set strung[num] to '\0'
5) end if

 

Is this a viable way? Is there a better way to do this?
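For comparison, here is one way the rule from the two examples above ("(home" becomes "home", "(home)" stays as it is) might be written in C. It is a sketch under assumptions: `clean_token` and `is_balanced` are invented names, and a plain counter stands in for the stack since only one kind of bracket is involved.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if the parentheses in `s` pair up, 0 otherwise. */
static int is_balanced(const char *s)
{
    int depth = 0;
    for (; *s != '\0'; s++) {
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (depth == 0)
                return 0;
            depth--;
        }
    }
    return depth == 0;
}

/*
 * Return a heap copy of `token`. Balanced tokens such as "(home)" are
 * copied unchanged; unbalanced ones such as "(home" keep only their
 * letters. The caller frees the result; NULL means malloc failed.
 */
char *clean_token(const char *token)
{
    char *out;
    size_t n = 0;

    assert(token != NULL);

    out = malloc(strlen(token) + 1);
    if (out == NULL)
        return NULL;

    if (is_balanced(token)) {
        strcpy(out, token);
        return out;
    }
    for (; *token != '\0'; token++) {
        if (isalpha((unsigned char)*token))
            out[n++] = *token;
    }
    out[n] = '\0';
    return out;
}
```

Note that a balanced token keeps any trailing punctuation (for example "used," stays "used,"), so stripping a final comma or period would still be a separate step.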


Oh, sorry. I misunderstood your situation. You could go the long route and attempt to ascertain the purpose of each '(' or ')'. Or you could just flag the 'delimiters', the characters that will always constitute word breaks, such as spaces.

 

My counter, for example only counts spaces and carriage returns. It would be just as easy, though, to count multiple delimiters at the same time; like spaces, hyphens, commas, and periods. If a character such as a parenthesis does not always act as a delimiter, though, ignore it. The counter would then count ((x)(base10)log(y)) as one word and (this phrase) as two words because of the space alone. Just be careful not to over-increment.

 

I'm no pro, but I would set an incrementor and a flag: wordCount/doCount or similar. (I hate little names, they confuse me.) You would initialize both of these and then create a flip-flop situation.

 

wordCount = 0;
doCount = true;

loop through string
if (char == alphabetical) {
  if (doCount == true) {
     wordCount++;
     doCount = false;
  }
}
else if (char == delimiter) { // but ignore parentheses
  doCount = true;
}
end loop

 

This prevents double counting. You just need to specify your alpha characters and delimiters to define words and word breaks.

 

var wordCount = 0;
var doCount = true;
var ptr = 0;
var code;

// charCodeAt() returns NaN past the end of the string, which ends the loop
while ( code = ary.charCodeAt(ptr) ) {
  if ( code > 64 &&   // between 'A' and
       code < 91 ||   // 'Z' or
       code > 96 &&   // between 'a' and
       code < 123 ) { // 'z'
     if ( doCount ) {
        wordCount++;
        doCount = false;
     }
  }
  else if ( code == 32 ||   // space or
            code == 44 ||   // comma or
            code == 45 ||   // hyphen or
            code == 46 ||   // period or
            code == 58 ) {  // colon
            // and so on (ignore parentheses)
     doCount = true;
  }
  ptr++;
}

This is JavaScript, though. I don't know C very well yet, but the same logic would apply.
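A rough C translation of the flip-flop counter might look like this. Treat it as a sketch: the name `count_words` and the exact delimiter set are assumptions, and it only counts words rather than storing them.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Count words in `text`. A word starts at the first letter seen after a
 * delimiter (or at the start of the text). Parentheses and digits are
 * neither letters nor delimiters, so they never start or end a word.
 */
int count_words(const char *text)
{
    const char *delimiters = " \t\r\n,-.:;";   /* assumed delimiter set */
    int word_count = 0;
    int do_count = 1;

    assert(text != NULL);

    for (; *text != '\0'; text++) {
        if (isalpha((unsigned char)*text)) {
            if (do_count) {
                word_count++;
                do_count = 0;
            }
        } else if (strchr(delimiters, *text) != NULL) {
            do_count = 1;
        }
    }
    return word_count;
}
```

As described above, this counts ((x)(base10)log(y)) as one word and (this phrase) as two, because only the space acts as a word break.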

