
Cell Architecture



Ok, there has been some discussion about Cell architecture on the forums lately, nothing too big and serious, because people don't realize just how cool Cell is. So I've taken it upon myself to try to explain Cell - not in too much detail, but for a user who knows what computers are and how they operate at a basic level, this should be an enjoyable read.

 

All the information here is taken from http://www.blachford.info/computer/Cells/Cell0.html , the only truly informative resource about how Cell-powered workstations might look in the near future - well, aside from the patent from 2002 which the author gets the information from. As he says, that actually needs to be deciphered, because it was "written by a robot lawyer running Gentoo in text mode" :)

 

Ok, first of all, if you have read anything about Sony's PlayStation 3 gaming console, scheduled to come out in early 2006, and actually read its specs, you have heard of the Cell processor, and maybe even heard that it is supposed to be faster than a PC-architecture processor due to its use of vector processing units. That is a start. An alliance formed by Sony, Toshiba and IBM has been spending billions of dollars on this project: IBM is building two 65nm fabrication facilities, Sony paid IBM hundreds of millions to set up an assembly line for Cell processors for the PS3, and the research on Cell has cost hundreds of millions of dollars, so you can see that something big is about to go off.

 

So what is Cell?

 

Cell is an architecture for high performance distributed computing. It is composed of hardware and software Cells. Software Cells consist of data and programs (known as apulets); these are sent out to the hardware Cells where they are computed and the results returned.

 

This architecture is not fixed in any way. If you have a computer, a PS3 and an HDTV which all have Cell processors, they can co-operate on problems. They've been talking about this sort of thing for years of course, but the Cell is actually designed to do it. I for one quite like the idea of watching "Contact" on my TV while a PS3 sits in the background churning through a SETI@home [SETI] unit every 5 minutes. If you know how long a SETI unit takes, your jaw should have just hit the floor; suffice to say, Cells are very, very fast [Calc].

 

It can go further though, there's no reason why your system can't distribute software Cells over a network or even all over the world. The Cell is designed to fit into everything from PDAs up to servers so you can make an ad-hoc Cell computer out of completely different systems.

 

Scaling is just one capability of Cell; the individual systems are going to be potent enough on their own. The single unit of computation in a Cell system is called a Processing Element (PE), and even an individual PE is one hell of a powerful processor: they have a theoretical computing capability of 250 GFLOPS (billion floating point operations per second) [GFLOPS]. In the computing world quoted figures (bandwidth, processing, throughput) are often theoretical maximums and rarely if ever met in real life. Cell may be unusual in that, given the right type of problem, it may actually be able to get close to its maximum computational figure.

 

Cell architecture:

 

http://www.hypography.com/scienceforums/attachment.php?attachmentid=157&stc=1&thumb=1

The PPE or Processor Unit (PU)

As we now know, the PU is so far destined to be a 64-bit "Power Architecture", multi-thread, multi-core processor. Power Architecture is a catch-all term describing both PowerPC and POWER processors. Currently there are only 3 CPUs which fit this description: the POWER5, the POWER4 and the PowerPC 970 (aka G5), which is itself a derivative of the POWER4.

 

The IBM press release indicates the Cell processor is "Multi-thread, multi-core", but since the APUs are almost certainly not multi-threaded it looks like the PU may be based on a POWER5 core - the very same core expected to appear in Apple machines in the form of the G6 [G6] in the not too distant future. IBM have acknowledged such a chip is in development but, as if to confuse us, call it a "next generation 970".

 

There is of course the possibility that IBM have developed a completely different 64-bit CPU which they've never mentioned before. This isn't a far fetched idea as it is exactly the sort of thing IBM tend to do; for example, the 440 CPU used in the BlueGene supercomputer is still called a 440 but is very different from the chip you find in embedded systems.

 

If the PU is based on a POWER design don't expect it to run at a high clock speed; POWER cores tend to be rather power hungry, so it may be clocked down to keep power consumption in check.

 

The PlayStation 3 is touted to have 4 Cells, so a system could potentially have 4 POWER5-based cores. This sounds pretty amazing until you realise that the PUs are really just controllers - the real action is in the APUs...

 

SPE or Attached Processing Unit (APU)

 

http://www.hypography.com/scienceforums/attachment.php?attachmentid=158&stc=1&thumb=1

The first thing you notice on the diagram is the absence of Cache, and there is a good reason for it:

 

Conventional Cache

Conventional CPUs perform all their operations in registers which are directly read from or written to main memory. Operating directly on main memory is hundreds of times slower, so caches (a fast on-chip memory of sorts) are used to hide the cost of going to or from main memory. Caches work by storing part of the memory the processor is working on; if you are working on a 1MB piece of data it is likely only a small fraction of this (perhaps a few hundred bytes) will be present in cache. There are cache designs which can store more or even all of the data, but these are not used as they are too expensive or too slow.

 

If data being worked on is not present in the cache the CPU stalls and has to wait for this data to be fetched. This essentially halts the processor for hundreds of cycles. It is estimated that even high end server CPUs (POWER, Itanium, typically with very large fast caches) spend anything up to 80% of their time waiting for memory.
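
To make the cost of a miss concrete, here is a small illustration of my own (not from the article): chasing pointers through an array that fits in cache versus one that has to come from main memory. Sizes and timings are machine dependent, but the gap per access is typically huge.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Follow a random cycle through the array; every load depends on the
   previous one, so each hop pays the full memory latency. */
static size_t chase(const size_t *next, size_t hops)
{
    size_t i = 0;
    for (size_t h = 0; h < hops; h++)
        i = next[i];
    return i;                      /* returned so the loop isn't optimised away */
}

/* Build a single random cycle over n slots (Sattolo's algorithm). */
static size_t *make_cycle(size_t n)
{
    size_t *next = malloc(n * sizeof *next);
    if (!next) exit(1);
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    return next;
}

int main(void)
{
    const size_t hops = 20u * 1000 * 1000;
    size_t *small = make_cycle(4 * 1024);          /* ~32 KB of pointers: cache resident */
    size_t *large = make_cycle(8 * 1024 * 1024);   /* ~64 MB on a 64-bit machine: main memory */

    clock_t t0 = clock(); size_t a = chase(small, hops);
    clock_t t1 = clock(); size_t b = chase(large, hops);
    clock_t t2 = clock();

    printf("in cache:    %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("from memory: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(small); free(large);
    return (int)(a + b) & 0;                       /* keep the results "used" */
}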

 

Dual-core CPUs will become common soon and these usually have to share the cache. Additionally, if either of the cores or other system components try to access the same memory address, the data in the cache may become out of date and thus needs to be updated (made coherent).

 

Supporting all this complexity requires logic and takes time, and this limits the speed at which a conventional system can access memory; the more processors there are in a system, the more complex this problem becomes. Cache design in conventional CPUs speeds up memory access, but compromises are made to get it to work.

 

APU local memory - no cache

To remove the complexity associated with cache design and to increase performance, the Cell designers took the radical approach of not including any cache at all. Instead they use a series of local memories; there are 8 of these, 1 in each APU.

 

The APUs operate on registers which are read from or written to the local memory. This local memory can access main memory in blocks of 1024 bits but the APUs cannot act directly on main memory.
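
To picture what "the APUs cannot act directly on main memory" means in practice, here's a sketch of my own in plain C (the names, and the memcpy standing in for DMA, are made up - this is not the real Cell toolchain): pull one 128-byte block into a private local store, work on the copy, then write the block back.

#include <stdio.h>
#include <string.h>

#define BLOCK_BYTES 128                       /* 1024-bit transfer unit */
#define BLOCK_FLOATS (BLOCK_BYTES / sizeof(float))

static float local_store[BLOCK_FLOATS];       /* the APU's private local memory */

/* stand-ins for DMA transfers between main memory and the local store */
static void dma_get(float *local, const float *main_mem) { memcpy(local, main_mem, BLOCK_BYTES); }
static void dma_put(float *main_mem, const float *local) { memcpy(main_mem, local, BLOCK_BYTES); }

/* scale a large buffer in "main memory" one 128-byte block at a time */
static void apu_scale(float *main_mem, size_t nfloats, float k)
{
    for (size_t off = 0; off < nfloats; off += BLOCK_FLOATS) {
        dma_get(local_store, main_mem + off);           /* main memory -> local store */
        for (size_t i = 0; i < BLOCK_FLOATS; i++)
            local_store[i] *= k;                        /* compute only on the local copy */
        dma_put(main_mem + off, local_store);           /* local store -> main memory */
    }
}

int main(void)
{
    float data[1024];
    for (int i = 0; i < 1024; i++) data[i] = (float)i;
    apu_scale(data, 1024, 2.0f);
    printf("data[3] = %g\n", data[3]);        /* prints 6 */
    return 0;
}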

 

By not using a caching mechanism the designers have removed the need for a lot of the complexity which goes along with a cache. The local memory can only be accessed by the individual APU, there is no coherency mechanism directly connected to the APU or local memory.

 

This may sound like an inflexible system which will be complex to program, and it most likely is, but this system will deliver data to the APU registers at a phenomenal rate. If 2 registers can be moved per cycle to or from the local memory, it will in its first incarnation deliver 147 Gigabytes per second. That's for a single APU; the aggregate bandwidth for all local memories will be over a Terabyte per second - no CPU in the consumer market has a cache which even gets close to that figure. The APUs need to be fed with data, and by using a local memory based design the Cell designers have provided plenty of it.
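
For what it's worth, the arithmetic behind that figure works out if you assume 128-bit registers and the roughly 4.6GHz clock mentioned later in this post (the register width is my assumption, it isn't stated here): 2 registers x 16 bytes x 4.6 billion cycles per second ≈ 147 GB/s per APU, and 8 APUs x 147 GB/s ≈ 1.2 TB/s aggregate.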

 

Stream Processing

 

A big difference in Cells from normal CPUs is the ability of the APUs in a Cell to be chained together to act as a stream processor [stream]. A stream processor takes data and processes it in a series of steps. Each of these steps can be performed by one or more APUs.

 

A Cell processor can be set up to perform streaming operations in a sequence, with one or more APUs working on each step. To do stream processing, an APU reads data from an input into its local memory, performs the processing step, then writes the result to a pre-defined part of RAM; the second APU then takes the data just written, processes it and writes to a second part of RAM. This sequence can use many APUs, and APUs can read or write different blocks of RAM depending on the application. If the computing power is not enough, the APUs in other Cells can also be used to form an even longer chain.
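
As a rough picture of that chaining (a toy example of my own, not anything from the patent), think of each step as a function reading one block of RAM and writing the next. On real hardware each stage would run on its own APU and the stages would overlap in time; here they simply run one after the other.

#include <stdio.h>
#include <math.h>

#define N 256

static float input[N], block_a[N], block_b[N];   /* blocks of "main RAM" */

/* step 1: for example, remove a DC offset */
static void apu_stage1(const float *in, float *out)
{
    float mean = 0.0f;
    for (int i = 0; i < N; i++) mean += in[i];
    mean /= N;
    for (int i = 0; i < N; i++) out[i] = in[i] - mean;
}

/* step 2: for example, take the magnitude */
static void apu_stage2(const float *in, float *out)
{
    for (int i = 0; i < N; i++) out[i] = fabsf(in[i]);
}

int main(void)
{
    for (int i = 0; i < N; i++) input[i] = 1.0f + sinf(i * 0.1f);
    apu_stage1(input, block_a);      /* "APU 1": input   -> RAM block A */
    apu_stage2(block_a, block_b);    /* "APU 2": block A -> RAM block B */
    printf("block_b[0] = %f\n", block_b[0]);
    return 0;
}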

 

Stream processing does not generally require large memory bandwidth, but Cell will have it anyway. According to the patent each Cell will have access to 64 Megabytes directly via 8 bank controllers (it indicates this as an "ideal"; the maximum may be higher). If the stream processing is set up to use blocks of RAM in different banks, the different APUs processing the stream can be reading and writing simultaneously to the different blocks.

 

It is when multiple memory banks are being used and the APUs are working on compute-heavy streaming applications that the Cell will be working hardest. It's in these applications that the Cell may get close to its theoretical maximum performance and perform over an order of magnitude more calculations per second than any desktop processor currently available.

 

If overclocked sufficiently (over 3.0GHz) and using some very optimised code (SSE assembly), 5 dual-core Opterons directly connected via HyperTransport should be able to achieve a similar level of performance in stream processing as a single Cell. Admittedly this is purely theoretical, and it depends on the Cell achieving its performance goals and a "perfect" application being used, but it does demonstrate the sort of processing capability the Cell potentially has.

 

The PlayStation 3 is expected to have 4 Cells.

 

General purpose desktop CPUs are not designed for high performance vector processing. They all have vector units on board in the shape of SSE or Altivec, but these are integrated on the die and have to share the CPU's resources. The APUs are dedicated high speed vector processors, and with their own local memory they don't need to share anything other than main memory. Add to this the fact there are 8 of them and you can see why their computational capacity is so large.
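
As a concrete picture of what a vector unit buys you (my own toy example, not Cell code): plain C can't express the actual SIMD instructions, but handling four floats per iteration shows the shape of the work a 128-bit SSE or AltiVec unit does in a single instruction.

#include <stdio.h>

#define N 1024

/* scalar: one multiply-add per loop iteration */
static void saxpy_scalar(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* "vector-style": four independent multiply-adds per iteration, the same
   group of work a 128-bit vector unit performs as one instruction */
static void saxpy_4wide(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i += 4) {
        y[i + 0] = a * x[i + 0] + y[i + 0];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
}

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }
    saxpy_scalar(y, x, 2.0f, N);
    saxpy_4wide(y, x, 2.0f, N);
    printf("y[10] = %g\n", y[10]);   /* 1 + 2*10 + 2*10 = 41 */
    return 0;
}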

 

Such a large performance difference may sound completely ludicrous but it's not without precedent; in fact if you own a reasonably modern graphics card your existing system is capable of a lot more than you think:

 

"For example, the nVIDIA GeForce 6800 Ultra, recently released, has been observed to reach 40 GFlops in fragment processing. In comparison, the theoretical peak performance of the Intel 3GHz Pentium4 using SSE instructions is only 6GFlops." [GPU]

 

Actually, here's something not from the article: in the wonderful world of Linux there is already a project that utilises the vector processors on the new NVidia and ATI cards - you can make gcc hand work off to the processor on your video card. And since most of the tasks gcc asks the processor to do are exactly the kinds of tasks vector processors are good at, I've heard amazing stories of packages that take hours to compile on a 3GHz P4 finishing in literally minutes... (No, I don't think there is anything that is impossible to do to your computer with Linux...)

The DMAC

The DMAC (Direct Memory Access Controller) is a very important part of the Cell as it acts as a communications hub. The PU doesn't issue instructions directly to the APUs but rather issues them to the DMAC, which takes the appropriate actions; this makes sense as the actions usually involve loading or saving data. It also removes the need for direct connections between the PU and APUs.

 

As the DMAC handles all data going into or out of the Cell it needs to communicate via a very high bandwidth bus system. The patent does not specify the exact nature of this bus other than saying it can be either a normal bus or a packet-switched network. The packet-switched network will take up more silicon but will also have higher bandwidth; I expect they've gone with the latter since this bus will need to transfer tens of Gigabytes per second. What we do know from the patent is that this bus is huge - it is specified at a whopping 1024 bits wide.

 

At the time the patent was written it appears the architecture for the DMAC had not been fully worked out so as well as two potential bus designs the DMAC itself has different designs. Distributed and centralised architectures for the DMAC are both mentioned.

 

It's clear to me that the DMAC is one of the most important parts of the Cell design. It doesn't do processing itself but has to contend with tens of Gigabytes of data flowing through it at any one time to many different destinations; if speculation is correct the PS3 will have a 100 GByte/second memory interface, and if this is spread over 4 Cells each DMAC will need to handle at least 25 Gigabytes per second. It also has to handle the memory protection scheme and be able to issue memory access orders, as well as handling communication between the PU and APUs. It needs to be not only fast but will also be a highly complex piece of engineering.

 

 

Memory

As with everything else in the Cell architecture the memory system is designed for raw speed; it will have both low latency and very high bandwidth. As mentioned previously, memory is accessed in blocks of 1024 bits. The reason for this is not mentioned in the patent but I have a theory:

 

While this may reduce flexibility it also decreases memory access latency - the single biggest factor holding back computers today. The reason it's faster is that the finer the address resolution, the more complex the logic and the longer a look-up takes. The actual look-up may be insignificant on the memory chip, but each look-up requires a transaction which involves sending an address from the bank controller to the memory device, and this takes time. This time is significant in itself as there is one per memory access, but what's worse is that every bit of address resolution doubles the number of look-ups required.

 

If you have 512MB in your PC your RAM look-up resolution is 29 bits* (2^29 bytes = 512MB). However, the system will read a minimum of 64 bits (8 bytes) at a time, so the resolution is really 26 bits. The PC will probably read more than this, so you can probably really say 23 bits.

 

* Note: I'm not counting I/O or graphics address space which will require an extra bit or two.

 

In the Cell design there are 8 banks of 8MB each and if the minimum read is 1024 bits the resolution is 13 bits. An additional 3 bits are used to select the bank but this is done on-chip so will have little impact. Each bit doubles the number of memory look-ups so the PC will be doing a thousand times more memory look-ups per second than the Cell does. The Cell's memory busses will have more time free to transfer data and thus will work closer to their maximum theoretical transfer rate. I'm not sure my theory is correct but CPU caches use a similar trick.

 

What is not theoretical is the fact the Cell will use very high speed memory connections - Sony and Toshiba licensed 3.2GHz memory technology from Rambus in 2003 [Rambus]. If each Cell has a total bandwidth of 25.6 Gigabytes per second, each bank transfers data at 3.2 Gigabytes per second. Even so, the buses are not large (64 data pins for all 8 banks); this is important as it keeps chip manufacturing costs down.

 

100 Gigabytes per second sounds huge until you consider that top end graphics cards are already in the region of 50 Gigabytes per second; doubling over a couple of years sounds fairly reasonable. But these are just theoretical figures and never get reached. Assuming the system I described above is used, the bandwidth on the Cell should be much closer to its theoretical figure than competing systems, and thus it will perform better.

 

APUs may need to access memory from different Cells, especially if a long stream is set up, so the Cells include a high speed interconnect. Details of this are not known other than that it transfers data at 6.4 Gigabits/second per wire. I expect there will be buses of these between each Cell to facilitate the high speed transfer of data between them. This technology sounds not entirely unlike HyperTransport, though the implementation may be very different.

 

In addition to this a switching system has been devised so that if more than 4 Cells are present they too can have fast access to memory. This system may be used in Cell based workstations. It's not clear how more than 8 Cells will communicate, but I imagine the system could be extended to handle more. IBM have announced that a single rack based workstation will be capable of up to 16 TeraFlops; they'll need 64 Cells for this sort of performance, so they have obviously found some way of connecting them.

 

Memory Protection

The memory system also has a memory protection scheme implemented in the DMAC. Memory is divided into "sandboxes" and a mask used to determine which APU or APUs can access it. This checking is performed in the DMAC before any access is performed, if an APU attempts to read or write the wrong sandbox the memory access is forbidden.
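
A toy version of that check might look something like this (my own sketch; the patent's actual tables and masks are more involved): each sandbox carries a bit mask saying which APUs may touch it, and the DMAC tests every request against the mask before letting it through.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_SANDBOXES 8

struct sandbox {
    uint64_t base;      /* start address of the region    */
    uint64_t size;      /* length of the region in bytes  */
    uint8_t  apu_mask;  /* bit i set => APU i may access it */
};

static struct sandbox table[NUM_SANDBOXES] = {
    { 0x000000, 0x800000, 0x03 },   /* sandbox 0: APUs 0 and 1 only */
    { 0x800000, 0x800000, 0xFF },   /* sandbox 1: any APU           */
    /* remaining entries unused in this toy */
};

/* DMAC-style check: is APU 'apu' allowed to touch address 'addr'? */
static bool access_allowed(int apu, uint64_t addr)
{
    for (int i = 0; i < NUM_SANDBOXES; i++) {
        if (addr >= table[i].base && addr < table[i].base + table[i].size)
            return (table[i].apu_mask >> apu) & 1;
    }
    return false;   /* not in any sandbox: forbidden */
}

int main(void)
{
    printf("APU 1 -> 0x1000:   %s\n", access_allowed(1, 0x1000)   ? "ok" : "denied");
    printf("APU 5 -> 0x1000:   %s\n", access_allowed(5, 0x1000)   ? "ok" : "denied");
    printf("APU 5 -> 0x900000: %s\n", access_allowed(5, 0x900000) ? "ok" : "denied");
    return 0;
}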

 

Existing CPUs include a hardware memory protection system but it is a lot more complex than this. They use page tables which indicate the use of blocks of RAM and also indicate if the data is in RAM or on disc. These tables can become large and don't fit on the CPU all at once, which means that in order to read a memory location the CPU may first have to read a page table from memory and then read data in from disc - all before the data required is read.

 

In the Cell the APU can either issue a memory access or not, the table is held in a special SRAM in the DMAC and is never flushed. This system may lack flexibility but is very simple and consistently very fast.

 

This simple system most likely only applies to the APUs, I expect the PU will have a conventional memory protection system.

 

 

Software Cells

 

Software Cells are containers which hold data and programs called apulets, as well as other data and instructions required to get the apulet running (memory required, number of APUs used, etc.). The cell contains source, destination and reply address fields; the nature of these depends on the network in use, so software Cells can be sent around to different hardware Cells. There are also network independent addresses which define the specific Cell exactly. This allows you to, say, send a software Cell to a hardware Cell in a specific computer on a network.

 

The APUs use virtual addresses but these are mapped to a real address as soon as DMA commands are issued. The software Cell contains these DMA commands which retrieve data from memory to process, if APUs are set up to process streams the Cell will contain commands which describe where to read data from and where to write results to. Once set up, the APUs are "kicked" into action.
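
Purely as a mental model (the field names and types here are mine, guessed from the description above, not the patent's actual layout), a software Cell might be shaped something like this:

#include <stdint.h>
#include <stddef.h>

struct dma_command {           /* "fetch this block, then start"   */
    uint64_t main_addr;        /* where in main memory             */
    uint32_t local_addr;       /* where in the APU's local store   */
    uint32_t bytes;
};

struct software_cell {
    /* routing: depends on the network the cell travels over */
    uint64_t source_addr;
    uint64_t dest_addr;
    uint64_t reply_addr;
    uint64_t cell_id;              /* network-independent, identifies the exact cell */

    /* what it needs to run */
    uint32_t apus_required;
    uint32_t memory_required;

    /* the work itself */
    const uint8_t *apulet_code;    /* program for the APU(s)              */
    size_t         apulet_size;
    struct dma_command loads[8];   /* data to pull in before the "kick"   */
    struct dma_command stores[8];  /* where the results go afterwards     */
};

int main(void)
{
    struct software_cell cell = {
        .dest_addr = 0x02,                 /* hardware cell #2 on this network */
        .apus_required = 2,
        .memory_required = 1 << 20,        /* 1 MB */
    };
    (void)cell;                            /* in reality: serialise and send it */
    return 0;
}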

 

It's not clear how this system will operate in practice, but it would appear to include some adaptivity so as to allow Cells to appear and disappear on a network.

 

This system is in effect a basic Operating System but could be implemented as a layer within an existing OS. There's no reason to believe Cell will have any limitations regarding which Operating Systems can run.

 

One of the main points of the entire Cell architecture is parallel processing. Software Cells can be sent pretty much anywhere and don't depend on a specific transport means. The ability of software Cells to run on hardware Cells determined at runtime is a key feature of the Cell architecture. Want more computing power? Plug in a few more Cells and there you are. If you have a bunch of Cells sitting around talking to each other via WiFi connections, the system can use that to distribute software Cells for processing. The system was not designed to act like a big iron machine, that is, it is not arranged around a single shared or closely coupled set of memories. All the memory may be addressable, but each Cell has its own memory and they'll work most efficiently in their own memory, or at least in small groups of Cells where fast inter-links allow the memory to be shared.

 

Going above this number of Cells isn't described in detail but the mechanism present in the software Cells to make use of whatever networking technology is in use allows ad-hoc arrangements of Cells to be made without having to worry about rewriting software to take account of different network types.

 

The parallel processing system essentially takes a lot of complexity which would normally be handled by hardware and moves it into software. This usually slows things down, but the benefit is flexibility: you give the system a set of software Cells to compute and it figures out how to distribute them itself. If your system changes (Cells added or removed) the OS should take care of this without user or programmer intervention.

 

Writing software for parallel processing is usually highly difficult, and this helps get around the problem. You still, of course, have to parallelise the program into cells, but once that's done you don't have to worry whether you have one Cell or ten.

 

In the future, instead of having multiple discrete computers you'll have multiple computers acting as a single system. Upgrading will not mean replacing an old system anymore, it'll mean enhancing it. What's more your "computer" may in reality also include your PDA, TV and Camcorder all co-operating and acting as one.

 

---snap---

 

This is it for now, I'll work more on it eventually. It's a good read, so read it :)

 

(Edit: Also, the reason I stopped here is that this gives you a thorough understanding of Cell; the post below just describes implementation, software, and why Cell is better and might beat the x86 architecture. This part is the more factual one, while what's below is more philosophical, debatable, maybe-a-possibility type of material. Oh, and I added pictures :) )


I reckon Cell sounds like a really good idea, but it looks like Sony has just moved hardware complexity onto the shoulders of software development. This might be all well and good for companies with close ties to Sony and its PS3, who will have a better chance of harnessing the power of Cell, but what about companies that write once for the lowest common denominator and then port many times? Is Cell a good 'base' platform to develop for?

 

There's also the argument about a compatible ISA. As we know, x86 is so strong because of its backward compatibility - which is what's kept other ISAs out of the market so far. Intel's future multi-core 32nm manufacturing process sounds a lot like Cell at a high level, so it looks like they're planning something a lot more advanced than what Cell is at the moment. Not to mention Intel already have a massive market share, so this might actually be the deciding factor in Cell's uphill battle.

 

I reckon Cell could excel in games because those developers are said to not mind the burden of deathmarching :), but outside of that I'm not too excited about Cell (even though Cell-like processing power from offloading instructions to other Cell processors is a really neat idea). But hey, I don't think I know enough to know :)

 

With that said, I'm wondering whether Intel's future dual and multi-core platforms will bring enough parallelism and performance for more realistic physics and AI, or whether AGEIA's PPU will bring Cell-like architecture to the PC platform. Time will tell, I suppose.

 

Also, TI's MVP (320C8x) is a multi-core CPU that sounds a lot like the Cell architecture as well, so maybe it's nothing new (I don't know) -> http://focus.ti.com/docs/military/catalo/general/general.fhtml?templateld=5603&path=templatedata/cm/miligen


Well geko, software development is actually better with Cell, as now there will be no such thing as op code or p-code; all languages will be compatible with all operating systems, as the compilers will compile the code directly into machine code, with no intermediate "OS" junk added to your programs like today. So software compiled for Linux will work on OS X and Windows and Solaris and any other OS in the world; the only reason it will not is if it has any kind of interaction with core OS files. I should probably continue where I left off, so:

 

--snap--

 

The Biggest Misconception (as shown above by geko)

 

The Cell is not a fancy graphics chip, it is intended for general purpose computing. As if to confirm this the graphics hardware in the PlayStation 3 is being provided by Nvidia [Nvidia]. The APUs are not truly general purpose like normal microprocessors but the Cell makes up for this by virtue of including a PU which is a normal PowerPC microprocessor.

 

Games

 

Games are an obvious target; the Cell was designed for a games console, so if they don't work well there's something wrong! The Cell designers have concentrated on raw computing power and not on graphics; as such we will see hardware functions moved into software and much more flexibility being available to developers. Will the PS3 be the first console to get real-time ray traced games?

 

3D Graphics

 

Again, this is a field the Cell was largely designed for, so expect it to do well here. Graphics is an "embarrassingly parallel", vectorisable and streamable problem, so all the APUs will be in full use, and the more Cells you use the faster the graphics will be. There is a lot of research into different advanced graphics techniques these days, and I expect Cells will be used heavily for these and will enable these techniques to make their way into the mainstream. If you think graphics are good already you're in for something of a surprise.

 

Video

 

Image manipulations can be vectorised, and this can be shown to great effect in Photoshop. Video processing can similarly be accelerated, and Apple will be using the capabilities of existing GPUs (Graphics Processor Units) to accelerate video processing in "Core Image"; Cell will almost certainly be able to accelerate anything GPUs can handle.

 

Video encoding and decoding can also be vectorised so expect format conversions and mastering operations to benefit greatly from a Cell. I expect Cells will turn up in a lot of professional video hardware.

 

Audio

 

Audio is one of those areas where you can never have enough power. Today's electronic musicians have multiple virtual synthesisers each of which has multiple voices. Then there's traditionally synthesised, sampled and real instruments. All of these need to be handled and have their own processing needs, that's before you put different effects on each channel. Then you may want global effects and compression per channel and final mixing. Many of these processes can be vectorised. Cell will be an absolute dream for musicians and yet another headache for synthesiser manufacturers who have already seen PCs encroaching on their territory.

 

DSP (Digital Signal Processing)

 

The primary algorithm used in DSP is the FFT (Fast Fourier Transform), which breaks a signal up into individual frequencies for further processing. The FFT is a highly vectorisable algorithm and is used so much that many vector units and microprocessors contain instructions especially for accelerating it.
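
For anyone who hasn't met it, this is all the transform does - split a signal into its frequency components. The direct O(n²) form below (my own illustration; not optimised, not vectorised) makes the idea visible, while a real FFT computes the same result far faster.

#include <stdio.h>
#include <math.h>

#define N 64

int main(void)
{
    const double PI = 3.14159265358979323846;
    double signal[N], re[N], im[N];

    /* test signal: a single sine wave at frequency bin 5 */
    for (int n = 0; n < N; n++)
        signal[n] = sin(2.0 * PI * 5.0 * n / N);

    /* direct DFT: X[k] = sum over n of x[n] * e^(-2*pi*i*k*n/N) */
    for (int k = 0; k < N; k++) {
        re[k] = im[k] = 0.0;
        for (int n = 0; n < N; n++) {
            double angle = -2.0 * PI * k * n / N;
            re[k] += signal[n] * cos(angle);
            im[k] += signal[n] * sin(angle);
        }
    }

    /* the energy shows up at bin 5 (and its mirror at N-5) */
    for (int k = 0; k < N / 2; k++)
        if (sqrt(re[k] * re[k] + im[k] * im[k]) > 1.0)
            printf("strong component at bin %d\n", k);
    return 0;
}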

 

There are thousands of different DSP applications and most of them can be streamed, so Cell can be used for many of them. Once prices and power consumption have come down, expect the Cell to be used in all manner of different consumer and industrial devices.

 

SETI

 

A perfect example of a DSP application, again based on FFTs; a Cell will boost my SETI@home [SETI] score no end! As mentioned elsewhere, I estimate a set of 4 Cells will complete a unit in under 5 minutes [Calc]. Numerous other distributed applications will also benefit from the Cell.

 

Scientific

 

For conventional (non vectorisable) applications this system will be at least as fast as 4 PowerPC 970s with a fast memory interface. For vectorisable algorithms performance will go onto another planet. A potential problem however will be the relatively limited memory capability (this may be PlayStation 3 only, the Cell may be able to address larger memories). It is possible that even a memory limited Cell could be used perfectly well by streaming data into and out of the I/O unit.

 

GPUs are already used for scientific computation and Cell will be likely be useable in the same areas: "Many kinds of computations can be accelerated on GPUs including sparse linear system solvers, physical simulation, linear algebra operations, partial difference equations, fast Fourier transform, level-set computation, computational geometry problems, and also non-traditional graphics, such as volume rendering, ray-tracing, and flow visualization."[GPU]

 

Super Computing

 

Many modern supercomputers use clusters of commodity PCs because they are cheap and powerful. You currently need in the region of 250 PCs to even get onto the top 500 supercomputer list [Top500]. It should take just 8 Cells to get onto the list and 560 to take the lead*. This is one area where backwards compatibility is completely unimportant and will be one of the first areas to fall, expect Cell based machines to rapidly take over the Top 500 list from PC based clusters.
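
Those figures are just the theoretical peaks multiplied out: 8 Cells x 250 GFLOPS = 2 TFLOPS to get onto the list, and 560 x 250 GFLOPS = 140 TFLOPS to take the lead (real sustained numbers would of course be lower).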

 

There are other supercomputing applications which require large amounts of interprocess communication and do not run well in clusters. The Top500 list does not measure these separately, but this is an area where big iron systems do well and Cray rules; PC clusters don't even get a look-in. The Cells have high speed communication links which make them ideal for such systems, although additional engineering will be required for large numbers of Cells. So not only may Cells take over from PC clusters, expect them to do well here too.

 

If the Cell has a 64 bit Multiply-add instruction (I'd be very surprised if this wasn't present) it'll take 8000 of them to get a PetaFlop**. That record will be very difficult to beat.

 

** Based on theoretical values, in reality you'd need more Cells depending on the efficiency.
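
The arithmetic behind that, as far as I can tell (my reading - the article doesn't spell it out): a 64-bit multiply-add would presumably run at half the single precision rate, i.e. about 125 GFLOPS per Cell, and 8,000 x 125 GFLOPS = 1 PetaFlop.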

 

Servers

 

This is one area which does not strike me as being terribly vectorisable; indeed XML and similar processing are unlikely to be helped by the APUs at all, though the memory architecture may help (which is useful given how amazingly inefficient XML is). However, servers generally do a lot of work in their database backend.

 

Commercial databases with real life data sets have been studied and found to benefit from running on GPUs. You can also expect these to be accelerated by Cells. So yes, even servers can benefit from Cells.

 

Stream Processing Applications

 

A big difference from normal CPUs is the ability of the APUs in a cell to be chained together to act as a stream processor [stream]. A stream processor takes a flow of data and processes it in a series of steps. Each of these steps can be performed by a different APU or even different APUs on different Cells.

 

An Example: A Digital TV Receiver

To give an example of stream processing, take a set top box for watching digital TV. This is a much more complex process than just playing an MPEG movie, as a whole host of additional steps are involved. This is what needs to be done before you can watch the latest episode of Star Trek; here's an outline of the processes involved:

 

* COFDM demodulation

* Error correction

* Demultiplexing

* De-scrambling

* MPEG video decode

* MPEG audio decode

* Video scaling

* Display construction

* Contrast & Brightness processing

 

These tasks are typically performed using a combination of custom hardware and dedicated DSPs. They can be done in software but it'll take a very powerful CPU if not several of them to do all the processing - and that's just for standard definition MPEG2. HDTV with H.264 will require considerably more processing power. General purpose CPUs tend not to be very efficient so it is generally easier and cheaper to use custom chips, although highly expensive to develop they are cheap when produced in high volumes and consume miniscule amounts of power.

 

These tasks are vectorisable and working in a sequence are of course streamable. A Cell processor could be set-up to perform these operations in a sequence with one or more APUs working on each step, this means there is no need for custom chip development and new standards can be supported in software. The power of a Cell is such that it is likely that a single Cell will be capable of doing all the processing necessary, even for High definition standards. Toshiba intend on using the Cell for HDTVs.
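
Just to show the shape of such a set-up (a toy of my own - the stage bodies are empty placeholders, and on real hardware each stage would be an APU running concurrently rather than a function called in turn):

#include <stdio.h>
#include <string.h>

#define BLOCK 4096
typedef void (*stage_fn)(const unsigned char *in, unsigned char *out);

/* placeholder stages: real ones would do COFDM demodulation, error
   correction, demultiplexing, MPEG decode, scaling, and so on */
static void demodulate(const unsigned char *in, unsigned char *out) { memcpy(out, in, BLOCK); }
static void correct   (const unsigned char *in, unsigned char *out) { memcpy(out, in, BLOCK); }
static void demux     (const unsigned char *in, unsigned char *out) { memcpy(out, in, BLOCK); }
static void decode    (const unsigned char *in, unsigned char *out) { memcpy(out, in, BLOCK); }
static void scale     (const unsigned char *in, unsigned char *out) { memcpy(out, in, BLOCK); }

int main(void)
{
    stage_fn pipeline[] = { demodulate, correct, demux, decode, scale };
    const int stages = sizeof pipeline / sizeof pipeline[0];

    /* one RAM block per hand-off between stages, as in the streaming set-up */
    static unsigned char ram[6][BLOCK];
    memset(ram[0], 0xAB, BLOCK);                 /* pretend this is received data */

    for (int s = 0; s < stages; s++)
        pipeline[s](ram[s], ram[s + 1]);         /* stage s: block s -> block s+1 */

    printf("output byte: 0x%02X\n", ram[stages][0]);
    return 0;
}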

 

So, everything vectorised will see a giant leap in performance; everything that isn't is still an open question due to the vagueness of the patent.

 

The Sincerest Form of Flattery is Theft

 

20 years ago an engineer called Jay Miner who had been working on video games (he designed the Atari 2600 chip) decided to do something better and produce a desktop computer which combined a video game chipset with a workstation CPU. The prototype was called Lorraine and it was eventually released to the market as the Commodore Amiga. The Amiga had hardware accelerated high colour screens, a GUI based multitasking OS, multiple sampled sound channels and a fast 32 bit CPU. At the time PCs had screens displaying text, a speaker which beeped and they ran MSDOS on a 16 bit CPU. The Amiga went on to sell in millions but the manufacturer went bankrupt in 1994.

 

Like many other platforms which were patently superior to it, the Amiga was swept aside by the PC.

 

The PC has seen off every competitor that has crossed paths with it, no matter how good the OS or hardware. The Amiga in 1985 was years ahead of the PC; it took more than 5 years for the PC to catch up with the hardware and 10 years to catch up with the OS. Yet the PC still won, as it did against every other platform. The PC has been able to do this because of a huge software base and its ability to steal its competitors' clothes; low prices and high performance were not a factor until much later. If you read the description of the Amiga I gave again you'll find it also describes a modern PC. The Amiga may have introduced specialised chips for graphics acceleration and multitasking to the desktop world, but now all computers have them.

 

In the case of the Amiga it was not the hardware or the price which beat it. It was the vast MSDOS software base which prevented it from getting into the business market; Commodore's ability to shoot themselves in the foot finished them off. NeXT came along next with even better hardware and an even better Unix based OS, but they couldn't dent the PC either. It was the next to be dispatched, and again the PC later caught up and stole all its best features; it took 13 years to bring memory protection to the consumer level PC.

 

Cell V's x86

 

This looks like a battle no one can win. x86 has won all of its battles because when Intel and AMD pushed the x86 architecture they managed to produce very high performance processors, and in their volumes they could sell them at low prices. When x86 came up against faster RISC competitors it was able to use the very same RISC technologies to close the speed gap to the point where there was no significant advantage in going with RISC.

 

Three of what were once important RISC families have also been dispatched to the great fab in the sky. Even Intel's own Itanium has been beaten out of the low / mid server space by the Opteron. Sun have been burned as well: they cancelled the next in the UltraSPARC line, bought in radical new designs, and now sell the Opteron which threatened to eclipse their low end. Only POWER seems to be holding its own, but that's because IBM has the resources to pour into it to keep it competitive, and it's in the high end market which x86 has never managed to penetrate and may not scale to.

 

For Intel and AMD's processors, Cell presents a completely different kind of competition to what has gone before. The speed difference is so great that nothing short of a complete overhaul of the x86 architecture will be able to bring it even close performance wise. Changes are not unheard of in x86 land, but neither Intel nor AMD appear to be planning a change even nearly radical enough to catch up. That said, Intel recently gained access to many of Nvidia's patents [intel+Nvidia] and are talking about having dozens of cores per chip, so who knows what Santa Clara are brewing. [Project Z]

 

Multicore processors are coming to the x86 world soon from both Intel and AMD [MultiCore], but high speed x86 CPUs typically have high power requirements. In order to fit 2 Opteron cores on a single chip AMD have had to reduce the clock rate to keep them from requiring over a hundred watts, and Intel are doing the same for the Pentium 4. The Pentium-M however is a (mostly) high performance low power part and will go into multi-core devices much more easily than the P4; expect to see chips with 2 cores arriving, followed by 4 and 8 core designs over the next few years.

 

Cell will accelerate many commonly used applications by ludicrous proportions compared to PCs. Intel could put 10 cores on a chip and they'd match neither its performance nor its price. The APUs are dedicated vector processors; x86 cores are not. The x86 cores will no doubt include SSE vector units, but these are no match for even a single APU.

 

Then there's the parallel nature of Cell. If you want more computing power, simply add another Cell; the OS will take care of distributing the software Cells to the second or third etc. processor. Try that on a PC: yes, many OSs will support multiple processors, but many applications do not and will need to be modified accordingly - a process which will take many, many years. Cell applications will be written to be scalable from the very beginning, as that's how the system works.

 

Cell V's Software

 

The main problem with competing with the PC is not the CPU, it's the software. A new CPU, no matter how powerful, is no use without software. The PC has always won because it's always had plenty of software, and this has allowed it to see off its competitors no matter how powerful they were or what advantages they had at the time. The market for high performance systems is very limited; it's the low end systems which sell.

 

Cell has the power and it will be cheap. But can it challenge the PC without software? The answer to this question would once have been simple, but the PC market has changed over time and for a number of reasons Cell is now a threat:

 

The first reason is Linux. Linux has shown that alternative operating systems can break into the PC software market against Windows; the big difference with Linux, though, is that it is cross platform. If the software you need runs on Linux, switching hardware platforms is no problem, as much of the software will still run on different CPUs.

 

The second reason is cost. Other platforms have often used expensive custom components and have been made in smaller numbers. This has put their cost above that of PCs, putting them at an immediate disadvantage. Cell may be expensive initially, but once Sony and Toshiba's fabs ramp up it will be manufactured in massive volumes, forcing prices down; the fact it's going into the PS3 and TVs is an obvious help for getting the massive volumes that will be required. IBM will also be making Cells, and many companies use IBM's silicon process technologies, so if truly vast numbers of Cells were required Samsung, Chartered, Infineon and even AMD could manufacture them (provided they had a license of course).

 

The third reason is power: the vast majority of PCs these days don't need the power they provide, and Cell will only accentuate this because it will be able to offload most of the intensive stuff to the APUs. What this means is that if you do need to run a specific piece of software you can emulate it. This would have been impossibly slow once, but most PC CPUs are already more than fast enough, and with today's advanced JIT based emulators you might not even notice the difference.

 

The reason many high end PCs are purchased is to accelerate many of the very tasks the Cell will accelerate. You'll also find these power users are more interested in the tools than the platform; apart from games, these are not areas over which Microsoft has any hold. Given the sheer amount of acceleration a Cell (or set of Cells) can deliver, I can see many power users being happy to jump platforms if the software they want is ported or can be emulated.

 

Cell is going to be cheap, powerful, and able to run many of the same operating systems, and if all else fails it can emulate a PC with little noticeable difference, so software and price will not be a problem. Availability will also not be a problem - you can buy PlayStations anywhere. This time round the traditional advantages the PC has held over other systems will not be present; it will have no advantage in performance, software or price. That is not to say that the Cell will walk in and just take over, it's not that simple.

 

Attack

 

IBM plan on selling workstations based on the Cell but I don't expect they'll be cheap or sold in any numbers to anyone other than PlayStation developers.

 

Cell will not just appear in exotic workstations and PlayStations though; I also expect they'll turn up in desktop computers of one kind or another (for instance, I know Genesi are considering doing one). When they do, they're going to turn the PC business upside down.

 

Even with a single Cell it will outgun top end multiprocessor PCs many times over. That's gotta hurt, and it will hurt; Cell is going to effectively make traditional general purpose microprocessors obsolete.

 

Infection inside

Of course this won't happen overnight, and there's nothing to stop PC makers from including a Cell processor on a PCI / PCIe card or even on the motherboard. Microsoft may be less than interested in supporting a competitor, but that doesn't mean drivers couldn't be written and support added by the STI partners. Once this is done, developers will be able to make use of the Cell in PC applications, and this is where it'll get very interesting. With computationally intensive processing moved to the Cell there will be no need for a PC to include a fast x86; a low cost slow one will do just fine.

 

Some companies however will want to cut costs further, and there's a way to do that. The Cell includes at least a PowerPC 970 grade CPU, so it'll be a reasonably fast processor. Since there is no need for a fast x86 processor, why not just emulate one? Removing the x86 and its support chips from a PC will give big cost savings. An x86 computer without an x86 sounds a bit weird, but that's never stopped Transmeta, who do exactly that; perhaps Transmeta could even provide the x86 emulation technology, since they're already thinking of getting out of chip manufacturing [Transmeta].

 

Cell is a very, very powerful processor. It's also going to become cheap. I fully expect it'll be quite possible to (eventually) build a low cost PC based around a Cell and sell it for a few hundred dollars. If all goes well will Dell sell Cells?

 

Game on

You could argue gamers will still drive PC performance up, but Sony could always pull a fast one and produce a PS3 on a card for the PC. Since it would not depend on the PC's computational or memory resources, it's irrelevant how weak or strong they are. Sony could produce a card which turns even the lowest performance PC into a high end gaming machine. If such a product sold in large numbers, studios already developing for the PS3 may decide they do not need to develop a separate version for the PC; the resulting effect on the PC games market could be catastrophic.

 

While you could use an emulated OS it's always preferable to have a native OS. There's always Linux; however, Linux isn't really a consumer OS and seems to be having something of a struggle becoming one. There is however another very much consumer ready OS which already runs on a "Power Architecture" CPU: OS X.

 

 

Actually, Linux already runs on the PowerPC architecture, and it arguably does even better on PowerPC than OS X does, not having a large, shiny, very resource hungry GUI. OS X can be run from within Linux with no emulation, no lag, nothing (I had the experience of playing with this at LinuxWorld on a G4), which opens up endless possibilities for OS specific applications: not only are you able to run any Linux apps, you can run OS X ones simply by starting OS X, installing the app and using it with zero lag or performance loss (because, once again, OS X runs natively and uses the machine's hardware rather than translating any calls :) ). I guess we'll see how that plays out in the Cell world though...

 

 

Cell V's Apple

 

The Cell could be Apple's nemesis or their saviour; they are the obvious candidate company to use the Cell. It's perfect for them as it will accelerate all the applications their primary customer base uses, and whatever core it uses, the PU will be PowerPC compatible. Cells will not accelerate everything, so they could use them as co-processors in their own machines beside a standard G5 / G6 [G6], getting the best of both worlds.

 

The Core Image technology due to appear in OS X "Tiger" already uses GPUs (Graphics Processor Units) for things other than 3D computations and this same technology could be retargeted at the Cell's APUs. Perhaps that's why it was there in the first place...

 

If other companies use Cell to produce computers there is no obvious consumer OS to use, so with OS X Apple have - for the second time - the chance to become the new Microsoft. Will they take it? If an industry springs up of Cell based computers, not doing so could be very dangerous. When the OS and CPU are different between the Mac and PC there is (well, was) a big gap between systems to jump, and a price differential can be justified. If there's a sizeable number of low cost machines capable of running OS X, the price differential may prove too much. I doubt even that would be a knockout blow for Apple, but it would certainly be bad news (even the PC hasn't managed a knockout).

PC manufacturers don't really care which components they use or which OS they run; they just want to sell PCs. If Apple was to "think different" on OS X licensing and get hardware manufacturers using Cells, perhaps they could turn Microsoft's clone army against their masters. I'm sure many companies would be only too happy to be released from Microsoft's iron grip, especially if Apple was to undercut them, which they could do easily given the 400%+ margins Microsoft makes on their OS. Licensing OS X wouldn't necessarily destroy Apple's hardware business; there'll always be a market for cooler high end systems [Alien]. Apple also now has a substantial software base, and part of this could be used to give added value to their hardware in a similar manner to today. Everyone else would just have to pay for it as usual.

 

In "The Future of Computing" [Future] I argued that the PC industry would come under threat from low cost computers from the far east. The basis of the argument was that in the PC industry Microsoft and Intel both enjoy very large margins. I argued that it's perfectly feasible to make a low cost computer which is "fast enough" for most people's needs, and running Linux there would be no Microsoft tax; provided the system could do what most people need to do, it could be made and sold at a sufficiently low price that it would attack the market from below.

 

A Cell based system running OS X could be nearly as cheap (depending on the price Apple want to charge for OS X), but with Cell's sheer power it would exceed the power of even the most powerful PCs. This system could sell like hot cakes, and if it's sufficiently low cost it could be used to sell into the low cost markets which PC makers are now beginning to exploit. There is a huge opportunity for Apple here; I think they'd be stark raving mad not to take it - because if they don't, someone else will - Microsoft already have PowerPC experience with the Xbox2 OS...

 

Cell will have a performance advantage over the PC and will be able to use the PC's advantages as well. With Apple's help it could also run what is arguably the best OS on the market today, at a low price point. The new Mac mini already looks like it's going to sell like hot cakes, imagine what it could do equipped with a Cell...

 

 

In short, it would be very stupid of Apple not to take the opportunity to license their OS for Cell, if they can port it fast enough to beat Microsoft (which they should, because Microsoft's systems are not compatible with the PowerPC architecture, so they would need to spend loads of time working out how to get them running on that platform). If Apple chokes Microsoft in the OS market, it might be the end of MS's superpower status, and Apple will lead the way to the future. (And personally I have no problem with that; OS X is a great OS - it's BSD with a large, shiny hat of a GUI (a little too shiny for me, but I'm sure I won't mind running it too much). I just hope they don't screw up and repeat the mistake they already made in allowing Microsoft to use Apple's technologies to choke Apple.)

 

 

The PC Retaliates: Cell V's GPU

 

The PC does have a weapon with which to respond, the GPU (Graphics Processor Unit). On computational power GPUs will be the only real competitors to the Cell.

 

GPUs have always been massively more powerful than general purpose processors [PC + GPU][GPU], but since programmable shaders were introduced this power has become available to developers, and although it is designed specifically for graphics, some have been using it for other purposes. Future generations of shaders promise even more general purpose capabilities [DirectX Next].

 

GPUs operate in a similar manner to the Cell in that they contain a number of parallel vector processors called vertex or pixel shaders, these are designed to process a stream of vertices of 3D objects or pixels but many other compute heavy applications can be modified to run instead [EE-GPU].

 

With aggressive competition between ATI and Nvidia the GPUs are only going to get faster and now "SLI" technology is being used again to pair GPUs together to produce even more computational power.

 

GPUs will provide the only viable competition to the Cell but even then for a number of reasons I don't think they will be able to catch the Cell.

 

Cell is designed from the ground up to be more general purpose than GPUs, the APUs are not graphics specific so adapting non 3D algorithms will likely mean less work for developers.

 

Cell has the main general purpose PU sharing the same fast memory as the APUs. This is distinct from PCs, where GPUs have their own high speed memory and can only access main system memory via the AGP bus. PCI Express should speed this up, but even this will be limited by the bus being shared with the CPU. Additionally, vendors may not fully support the PCI Express specification; existing GPUs are very slow at moving data from the GPU to main memory.

 

There is another reason I don't think Nvidia or ATI will be able to match the Cell's performance anytime soon. Last time around the PC rapidly caught up with and surpassed the PS2; I think it is one of Sony's aims this time to make that very difficult, and as such Cell has been designed in a highly aggressive manner.

 

The Cray Factor

 

 

 

The "Cray factor" is something to which Intel, AMD, Nvidia and ATI may have no answer.

 

What is apparent from the patent is the approach the designers have taken in developing the Cell architecture. There are many compromises that can be made when designing a system like this; in almost every case the designers have not compromised and have gone for performance, even if the job of the programmers has been made considerably more difficult.

 

The Cell design is very different from modern microprocessors; seemingly irremovable parts have been changed radically or removed altogether. A rule fundamental to modern computing - abstraction - is abandoned altogether: no JITs here, you get direct access to the hardware. This is a highly aggressive design strategy, much more aggressive than you'll find in any other system; even in its heyday the Alpha processor's design was nowhere near this aggressive. In their quest for pure, unadulterated, raw performance the designers have devised a processor which can only be compared to something designed by Seymour Cray [Cray].

 

To understand why the Cell will be so difficult to catch you have to understand a battle which started way back in the 1960s.

 

From the 60s to the 90s IBM and Cray battled each other in trying to build the fastest computers. Cray won pretty much every time, he raised the performance bar to the point that the only machines which eventually beat Cray's designs were newer Cray designs.

 

IBM made flexible business machines; Cray went for less flexible and less feature rich designs in the quest for ultimate speed. If you look at what is planned for future GPUs [DirectX Next] it is very evident they are going for a flexible-features approach - exactly as you'd expect from a system designed by a software company. They are going to be using virtual memory on the GPU and already use a cache for the most commonly used data; in fact GPUs look like they are rapidly becoming general purpose CPUs.

 

The Cell approach is the same as Cray's. Virtual memory takes up space and delays access to data. Virtual memory is present in the Cell architecture but not at runtime; the OS keeps addresses virtual until a software Cell is executed, at which point the real addresses are used for getting to and from memory. Cell also has memory protection, but in a limited and simple fashion: a small on-chip memory holds a table indicating which APU can access which memory block. It's small and never flushed, which means it's also very fast.

 

CPUs and GPUs use a cache memory to hide access to main memory; Cray didn't bother with cache and just made the main memory super fast. Cell uses the same approach: there is no cache in the APUs, only a small but very fast local memory. The local RAM does not need coherency and is directly addressable; the programmer will always know what is present because they had to specify the load. Because of this reduced complexity and the smaller size, the local RAM will be very fast, much faster than cache. If it can transfer 2 (128 bit) registers per cycle at the clock speed they have achieved (4.6GHz) it'll be working at 147 Gigabytes per second - and it'll never have a cache miss...

 

The aggressiveness in the design of the Cell architecture means that it is going to be very, very difficult to produce a comparably performing part. x86 has no hope of getting there; they would ultimately need to duplicate the Cell design in order to match it. GPUs will also have a hard time: they are currently at a 10-fold clock speed disadvantage, generate large amounts of heat, and the highest performance parts are made in tiny numbers compared to the numbers Cell will be made in. It will require a complete rethink of GPU design to get even close to the Cell's clock rate.

 

The Cell designers have not made their chips out of gallium arsenide or dipped them in a bath of fluorinert so they're not quite as aggressive as Seymour Cray, but then again there's always the PlayStation 4...

 

The Alternative

There is the possibility that some company out there will produce a high power multi-core vector processor using a different design philosophy. This could be done and may get close to the Cell's power. It is possible because the Cell has been designed for a high clock rate, and this imposes some limitations on the design. If an alternative used a lower clock rate, it could use slower and, more importantly, smaller transistors. This means the number of vector units included could be increased and, more importantly, the amount of on-chip memory could be made much greater. These would make up for the lower clock rate, and the smaller memory bandwidth necessary would allow slower but lower cost RAM.

 

This may not be as powerful as the Cell but could get fairly close due to the processors being better fed with all the additional RAM. Power consumption would be lower than Cell's, and the scalability wouldn't be needed for all markets. There are plenty of companies in the embedded space who stand to lose a lot from the Cell, so we may see this sort of design coming from that sector. The companies in the PC CPU and VPU markets are certainly capable of this sort of design, but how it could be made to work in the existing PC architecture is open to question.

 

The Result

 

 

 

Cell represents the largest threat the PC has ever faced. The PC can't use its traditional advantage of software because the Cell can run the same software. It can't get an advantage in price or volume as Cell will also be made in huge volumes. Lastly, it can't compete on the basis of Cell being proprietary, because it's being made by a set of companies and they can sell to anyone; x86 is no less proprietary than Cell. It looks like the PC may have finally met its match.

 

The effect on Microsoft is more difficult to judge; if Cells take off, MS will have difficulty supporting them as they will not allow the same level of control. Because Cells are a distributed architecture, you could end up using a Windows machine as a client and having everything else running Linux or some other OS. Multiple machines not running Windows? I don't think that's something Microsoft is going to like.

 

Then there's also the issue that the main computations may be performed by the Cell, with Windows essentially providing an interface. Porting the interface may take time, but anything which runs on the Cells themselves is separate and will not need porting to different OSs - software cells are OS agnostic. I can't see Microsoft liking this either.

 

Nothing is certain, and it's not even clear whether going up against the PC is something the STI partners are interested in. But we can be sure Cell and the PC will eventually clash in one way or another.

 

However, even if Cell does take over as the dominant architecture it's going to do so in a process which will take many years or even decades. And there are areas where Cells may not have any particular advantage over PCs, so irrespective of the outcome you can be sure x86 will still be around for a very, very long time.

 

--End--

 

And again, I do not take credit for the above. I have added and removed things here and there, but mainly the text has stayed the same, and all the points I thought were interesting and crucial are in there. So enjoy the read, and again the source is: http://www.blachford.info/computer/Cells/Cell0.html



Thanks Alexander for the history on Cell Architecture and pre-knowledge on PS3. I want one!

 

Actually, aerospace has been using the Cell and custom Cell architecture for about a decade. I was on a project recently where we were getting computing power exceeding 1 TFLOP (1 tera, i.e. trillion, floating point operations per second) in a surface area of less than 1 square centimeter. In fact you can find more information by doing a Google search using the string: "cell architecture" "data flow" (include the quotes for a better search). We used data flow in our Cell concept. Though we did find routing problems mapping FFT computation components onto the Cell(s), and found utilization to be typically less than 33%. This made it a requirement to always have more Cells than you think you need. Sorry I have no more details, as being more specific would divulge company proprietary data.

 

Maddog


That's really cool maddog. I think the beauty of Cell on the market would be that you would no longer need a huge, very powerful PC on a space station; you could instead put up a small Cell-powered computer with a constant connection to Earth, and then use the network on the ground to do a lot of the computing for you, with you...

