
Software Risk Analysis


Pyrotex


SOFTWARE PROBABILISTIC RISK ANALYSIS (SPRA)

 

I have finally determined what my new job is all about. They want me to come up with a way of assigning a probability to a module of software--the probability that it will execute in such a way that loss of assets (LOA) will occur.

 

This isn't Quality Assurance. This isn't Software Reliability Analysis. This isn't Software Verification and Validation. So, don't go there! :bow: :)

 

You may ask, "what's the difference between SW PRA and SQA, SRA, SVV?" Excellent question. When I have a good answer, I'll get back to you. But for the nonce, just accept that it is a different animal altogether.

 

--What is the probability that LOA will occur because of SW?--

 

Hardware PRA is fairly straightforward, and I've been learning a lot about that lately. But software PRA is... is... currently undefined. The best I have done so far is build the following metaphor:

 

The "mission" takes place in "TEFO", a 4-dimensional space, with dimensions of time, environment, function and (human) operation. Knowing what you want to do, you carve out a cavern of "nominality" (nominal, expected). Knowing what you want the mission to accomplish, you settle on regions within TEFO which define the expected Environment the mission will deal with (high G's, high speeds, etc), the Functions the mission will have to perform (takeoff, navigation, communication, etc), the human Operations the people (including crew) involved will execute -- and all as a function of Time. Within TEFO, we carve out a cavern that makes room for the mission.

 

Within the TEFO cavern, we assemble our Requirements--like a scaffolding that defines the outer limits of our mission. Hung from that scaffolding, the actual mission is constructed out of hardware, software and people. Within the volume of the actual mission itself, we define a smaller volume--let's call it the 99% envelope--and this defines the boundaries of all our testing and verification. So, SQA, SRA and SVV all apply to that portion of the mission within the 99% envelope.

 

A "mission" would be the actual flight. At any given instant in time, there is a point in the TEFO that defines the "mission event state". This point traces out the "mission event timeline" which starts at say, ignition, and goes all the way to say, landing.

 

--What is the probability that LOA will occur because of SW?--

 

We can assume that SQA, SRA and SVV can (mostly) eliminate flaws and bugs within the 99% envelope.

 

But SW PRA, as I currently understand it, addresses the probability that the mission event timeline will drift out of the 99% envelope, maybe even outside the requirements envelope--AND result in LOA--because of a failure of the software to adequately respond to these boundary edge conditions.

 

NOTE: The software doesn't have to FAIL, necessarily. It could be doing exactly what it was designed to do. Remember, there are other dimensions: the Environment and the Operations of human beings.

 

THE BIG Q: Does anyone out there know of resources, people, books, projects or have personal experience applicable to this problem???

 

In the meantime, I will use this thread as kind of a blog to record whatever I find out.


--What is the probability that LOM (loss of mission) or LOC (loss of crew) will occur because of SW?--

 

We can assume that SQA, SRA and SVV can (mostly) eliminate flaws and bugs within the 99% envelope.

 

But SPRA, as I currently understand it, addresses the probability that the mission event timeline will drift out of the 99% envelope, maybe even outside the requirements envelope--AND result in LOM or LOC--because of a failure of the software to adequately respond to these boundary edge conditions.

 

NOTE: The software doesn't have to FAIL, necessarily. It could be doing exactly what it was designed to do. Remember, there are other dimensions: the Environment and the Operations of human beings.

 

Excellent question, Pyro, but I'm having trouble understanding the differentiation of SPRA from the collective others. I'm sure the answer is in the last paragraph I quoted above, but it's a bit elusive still. Can you give a practical example using, perhaps, the "Operations of human beings" dimension?

 

:bow:


I know this from nothing...

 

But if you want to gain some legitimacy, it may be worthwhile to look at the "accepted" approach for hardware, to wit: the basic idea on the hardware side is that you have a countable number of components, a mean time between failures (MTBF) for each, and a system analysis that takes criticality and redundancy into account to determine the time before system failure (ignoring whether it's mission or crew for the moment: that's an implementation detail!).

 

It would appear that "software is completely different," but I'll argue that it's not: "components" in software are "program modules", and while it's hard to assign a specific "MTBF" to a specific module since it's one of a kind, what you can do is pull out one of those weird computer science concepts that not many people pay attention to any more: the good ol' "function point."

 

While it's been a long time since I've gone near them, I'm pretty sure you can scare up statistics on "MTBF-per-function point" with appropriate analysis of the types of FPs and so on, and end up using a methodology that's exactly like what's done on the hardware side.
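For what it's worth, here's a rough Python sketch of how such a roll-up might look; the per-function-point failure rates below are made-up placeholders standing in for whatever published FP statistics you can actually scare up:

[code]
# Hypothetical "MTBF-per-function-point" data: failures per hour contributed by one
# function point of each type. These numbers are placeholders, not real statistics.
ASSUMED_FAILURES_PER_FP_HOUR = {
    "input": 1e-6, "output": 1e-6, "inquiry": 5e-7, "file": 2e-6, "interface": 3e-6,
}

def module_failure_rate(fp_counts):
    """Sum the per-hour failure contributions of one module's function points."""
    return sum(n * ASSUMED_FAILURES_PER_FP_HOUR[kind] for kind, n in fp_counts.items())

def mission_failure_probability(modules, mission_hours):
    """Treat modules like series hardware components: any one failing fails the mission."""
    total_rate = sum(module_failure_rate(fp) for fp in modules)
    return 1 - (1 - total_rate) ** mission_hours   # ~ 1 - e^(-rate*t) for small rates

# Example: two modules flown for 50 hours (counts purely illustrative)
p = mission_failure_probability(
    [{"input": 12, "output": 8, "file": 3}, {"inquiry": 20, "interface": 5}], 50)
[/code]

The point is only that, once you believe a failure rate per function point, the rest of the math is exactly the hardware-side roll-up.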

 

Now if I were actually an astronaut, I'd be scared to death of this kind of "analysis" ("That's a bunch of crap," is what Gus Grissom would probably say), but it's probably just as justifiable as the hardware stuff is...

 

You pulled it out of where? :bow:

Buffy


Thanks dudes and dudettes.

 

The most famous incident where SW killed someone was the Therac-25.

A software error passed through exhaustive testing, and never caused any problems until a rare sequence of operations enabled it.

 

SW PRA is different from the other analyses because the output of SW PRA (just like HW PRA) is to assign a number -- the probability that that particular piece of SW will fail or enable a failure, and cause damage. It's not trying to fix the SW or even test it. It's not a methodology of creating good SW. It's not even about FINDING bugs or flaws.

 

What is the probability (per mission or per hour of operation) that software will cause hurt or damage? That's it. And maybe, what is the probable number of serious bugs still existing?


The most famous incident where SW killed someone was the Therac-25.

A software error passed through exhaustive testing, and never caused any problems until a rare sequence of operations enabled it.

 

Great link!

 

SPRA is different from the other analyses because the output of SPRA (just like HW PRA) is to assign a number -- the probability that that particular piece of SW will fail or enable a failure, and cause damage. It's not trying to fix the SW or even test it. It's not a methodology of creating good SW. It's not even about FINDING bugs or flaws.

 

What is the probability (per mission or per hour of operation) that software will cause hurt or damage? That's it.

 

I understand now, at least partially. :bow:


How about this approach?

  1. Write a simulation of the spacecraft, human operators, and environment in which it operates
  2. Install a particular collection of software in the simulation
  3. Run the simulation many (N) times for a particular mission or collection of missions
  4. Count the number (F) of runs that result in LOM/LOC
  5. The estimated probability of LOM/LOC for any reason is P = F/N
  6. Vary the installed software by replacing a single component with new component C.
  7. Repeat steps 3-5. Assuming P is low, the estimated probability of C having caused the failure is [math]P_{\mbox{(C caused failure)}}=P_{\mbox{(with C)}}-P_{\mbox{(without C)}}[/math]

It’s a completely empirical approach – nothing about the software must be known. You can test completely bad or incompatible components. If they are critical, [math]P_{\mbox{(C caused failure)}}[/math] will be nearly 1.
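A minimal Monte Carlo sketch of steps 3-7, assuming a hypothetical run_simulation() supplied by the simulator of step 1 (all names are illustrative only):

[code]
def estimate_p_loss(software_bundle, n_runs, run_simulation):
    """Steps 3-5: fly the simulated mission n_runs times and count LOM/LOC outcomes."""
    failures = sum(1 for _ in range(n_runs)
                   if run_simulation(software_bundle) in ("LOM", "LOC"))
    return failures / n_runs                      # P = F / N

# Steps 6-7: swap component C in and out, then difference the two estimates.
# p_with_c    = estimate_p_loss(bundle_with_c,    n_runs, run_simulation)
# p_without_c = estimate_p_loss(bundle_without_c, n_runs, run_simulation)
# p_c_caused_failure = p_with_c - p_without_c
[/code]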


How about this approach?
  1. Write a simulation of the spacecraft, human operators, and environment in which it operates
  2. Install a particular collection of software in the simulation
    ...

It’s a completely empirical approach – nothing about the software must be known. ....

Thanks for the suggestion. It seems well thought out. But maybe here is a clue. My "bosses" have tried and failed to come up with a way of assigning risk to software (SW). :D :shrug: :)

 

Your approach, and Buffy's approach to hardware (HW) risk analysis, have been going on since at least the 1960's--nearly 50 years, maybe more. The empirical approach: put 10,000 widgets in the fottasite and run for 1,000 hours. 2,081 widgets failed. Probability of failure per hour is 2,081/10,000,000, or 0.00021, or 0.021%.

 

If the widget has to run for 50 hours during a mission, then Total Probability of failure is about 1 in 100.
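Written out, that arithmetic is just:

[math]p_{\mbox{hour}} = \frac{2081}{10000 \times 1000\ \mbox{hr}} \approx 0.00021[/math]

[math]P_{\mbox{mission}} = 1-(1-p_{\mbox{hour}})^{50} \approx 50\,p_{\mbox{hour}} \approx 0.01 \approx 1\ \mbox{in}\ 100[/math]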

 

But when you try to do SW that way, you can download 10,000 copies of it into a computer, and the computer HW may fail, but every copy of the SW will behave the same. They may all fail. They may all never fail. They may all fail once every 7 hours for unknown reasons. But there's no difference among identical copies of 1s and 0s, given that they are subjected to the same test. The empirical approach produces bad, nonsensical, untrustworthy or uninterpretable results.

 

My bosses have committed that they will, for the first time ever, come up with a way of assigning a probability of failure to SW. So far, it appears to me that they have tried and failed. :D The ball now appears to me to be in my court.

 

Now, my data may be incomplete as hell or just dead wrong. In case anybody is reading this post, I hereby disavow everything I have said so far. But I am left with the conclusion that SW PRA is not easy and not straightforward.


But when you try to do SW that way, you can download 10,000 copies of it into a computer, and the computer HW may fail, but every copy of the SW will behave the same.
Sure, and that's why counting function points and comparing them to a *huge* sample of software that's all different is such a useful thing: since software can be copied perfectly, there's no point in testing multiple copies, but it's perfectly reasonable to assume that your average programmer will produce an average of x bugs per function point of each type. Each program will have a unique combination of number and type of function points, so you can use that to determine the number of expected bugs, and their severity.

 

I know this has been going on for a long time, and the question is why the answer from the people who are doing it is still "I dunno"... The development managers I know have been keeping statistics like this (usually in lines of code rather than FPs) for ages just to rate programmers for when promotion/layoff time comes. Even if there are more complexities than I've mentioned here (and I know there are!), there's got to be enough data to start to give "fuzzy" numbers that are probably no less "justifiable" than the "test 2000 vacuum tubes" method...

But I am left with the conclusion that SW PRA is not easy and not straightforward.
No disagreement there! :D

 

Tell me you're really doing this just so they let you do Hypography at work... :D

 

Work expands to fill the available time, :)

Buffy


But I am left with the conclusion that SW PRA is not easy and not straightforward.
Confidence bolstered by the failure of the almighty google search engine to identify S[W][ ]PRA with anything more appropriate than “software project risk analysis”, “solid waste planning”, or (my favorite) the Scottish Plastic & Rubber Association, I’m left with the suspicion that SPRA is especially difficult because it’s a discipline you NASA folk are making up independently of the larger world – claims of its 20 year antiquity notwithstanding. :)

 

Not to disparage the invention of acronyms and disciplines to accompany them (some of the high points of my career have involved the invention of acronyms :D), but a lot of definition will be necessary before anybody can understand or contribute much in detail to any conversation about it.

 

The number and scope of questions that spring to mind in an effort to create this definition overwhelm my ability to render them very coherently, so I’ll just let fly with one:

 

Does the software failure have to be a "sin of commission or omission" – e.g., turning on/off or failing to turn on/off some piece of the spacecraft at a critical moment such that all goes boom – or are "sins of ignorance" – e.g., not detecting and alarming or reacting to an anomalous situation – also grounds to conclude that failure is due to software? If so, arguably any failure can be cast as a failure of software: for example, the 1986 Challenger disaster could have been made survivable had sensors and software detected and reacted to the SRB abnormalities by separating the orbiter from them prior to the explosion.

 

If this is not the case, what is to prevent the probability of failure due to software from being made effectively zero by not having any software, even though such a vehicle would almost certainly have much lower performance and a much greater probability of failure due to non-software causes?


Excellent question, Pyro, but I'm having trouble understanding the differentiation of SPRA from the collective others. I'm sure the answer is in the last paragraph I quoted above, but it's a bit elusive still. Can you give a practical example using, perhaps, the "Operations of human beings" dimension?

 

:D

 

I sent this to Pyro yesterday.

 

These men have just finished placing solid steel pillars in concrete to stop vehicles from parking on the pavement outside a downtown sports bar. They are cleaning up at the end of the day. And..........

 

Pyro responded with the following:

You see, the mission failed!! Loss of Vehicle (LOV). Beautiful

But the truck did not fail.

The barricades did not fail.

The operations performed by the people were correct.

 

And yet! LOV. This is the problem we face.


 

If this is not the case, what is to prevent the probability of failure due to software from being made effective zero by not having any software, even though such a vehicle would almost certainly have much lower performance and much greater probability of failure due to non-software causes?

 

Beautiful point.

 

I'll reference this related quote again:

The question of whether computers can think is like the question of whether submarines can swim

- Edsger Wybe Dijkstra

 

In my experience, defining a limit like this is very useful, though it annoys the hell out of the guys that live by their "credibility" factor. In conversation it's something like asking, "OK, well, let's consider homicide in this case..." It does a great job of showing there really is a limit even when others thought there was not one. It's just not one that falls within the "common sense" boundary.

 

Common Sense is not so common

~Voltaire

 

And really, I bet it is those "beyond common sense" items that Pyro is trying to detect. The truth is we are all very ignorant relative to those who will live 60 years into the future, in much the same way that those guys putting up the steel posts were ignorant relative to the guy with the camera 60 feet up. I can guarantee that our two fellows will never make that mistake a 2nd time. But the challenge seems to be in figuring out what we don't know, so that we can avoid that 1st mistake in real life.

 

It is very helpful to have a mentor (much as Hypography, or Google Sets, offers) that can advise us and give us that view from 60 feet up: "Have you considered ____?" -- e.g., "Have you considered how you will get the truck out when you are done?"

 

It is also helpful to think similarly to chess "several moves ahead".

1) Bring up vehicle

2) Get vehicle into good work position out of traffic so it won't get hit

3) Mix cement

4) Install poles

5) Pack up gear

6) Drive home.... oh wait a minute

 

A classic flaw for some of us is not to "think it all the way through". And it can be especially difficult when you have a few million decision points to "think all the way through".

 

It is an accomplishment to make something foolproof,

because fools are so ingenious.

- N. Kohn


I know this from nothing...
    ...
It would appear that "software is completely different," but I'll argue that it's not: "components" in software are "program modules", and while it's hard to assign a specific "MTBF" to a specific module since it's one of a kind, what you can do is pull out one of those weird computer science concepts that not many people pay attention to any more: the good ol' "function point."
    ...

Buffy

 

Excellent, excellent point. I had talked to Pyro about looking across the code en masse for things like IF, LOOP, I/O, etc., but had not realized that each of those things really is a Function Point.

 

I went looking on the RUP site for a list of what coding objects are Function Points but couldn't find the list. I see your link points to a paper for $19. Does it include that list? If so (and it's the only place that lists them) I will pony up the dough :)


Symby:

 

Congratulations on getting me to fall out of my chair from laughing at your photo! I now have to get my margarita and go out to the hot tub to soothe my poor tailbone!

 

You will find that Function Point Analysis is one of those black arts of computer science, where "those who tell don't know, and those who know don't tell," and they all want you to cough up a ton of money for consulting. It's been so long since I've thought about them that I can't really point you to a good source, but good luck! I do hope you guys have some R&D budget to spend on this! :D

 

Onngh Yanng,

Buffy


You might find the Predictive Dynamix slide show of interest - especially slides 11-21 on the different kinds of results that can be obtained with various forms of analysis. In talking to their lead programmer, though, I realized that really what you are trying to figure out is which dots are Y and which dots are N -- or, more accurately, what their score is. Once they have been scored, these other forms of analysis can be put on top to find a predictive model of the behavior.

 

But I am thinking that if some negative conditions can be found, then the predictive model might be able to estimate what other conditions might also create a negative outcome.

 

From what I have learned from the guys coding them, neural nets for process control are run in 2 directions. First they create a simulator for the existing environment: for these X inputs you get Y outputs.

 

Then you turn it around and train a reverse model... for Y outputs you need X inputs.

 

Then you toss the model an optimal value of Y that you want and the reverse model gives you the inputs you need to get there.

 

In your case you are looking for the conditions that will get you a negative result. Then test your code on those input values and debug for them.
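A rough sketch of that forward/reverse idea, assuming scikit-learn is on hand (the model choice and the stand-in data are purely illustrative):

[code]
import numpy as np
from sklearn.neural_network import MLPRegressor

# Stand-in data: X are operating conditions, Y is an observed outcome/risk score.
X = np.random.rand(500, 4)
Y = (X[:, 0] - X[:, 2]).reshape(-1, 1)

forward = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000).fit(X, Y.ravel())  # X -> Y
reverse = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000).fit(Y, X)          # Y -> X

# Hand the reverse model the outcome you care about -- here a "bad" score -- and it
# proposes the input conditions that tend to produce it; those are the conditions
# worth throwing at the real code and debugging against.
bad_outcome = np.array([[0.9]])
suspect_inputs = reverse.predict(bad_outcome)
[/code]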

 

As Buffy said... you pulled that out from where? :rolleyes:


You will find that Function Point Analysis is one of those black arts of computer science, where "those who tell don't know, and those who know don't tell"...
    ...

Buffy

 

Sounds like building a good chess program. The key is what the developer identifies as a good move vs a bad move - more technically how he scores the move. And you have identified one of the best ways I have seen through the years... identify the metadata - the abstraction. That magical set of spectacles that lets one see patterns in the structure.

 

