I am interested to know if anyone is attacking the bottleneck of shared memory between multiple cores. If you are, or you know of anyone who is, please leave a message here.
The focus of this problem is that most operations are more memory-intensive than CPU-intensive. For example, a CPU with 20 cores scanning a database in RAM will work only as fast as a single-core CPU, because databases are designed to optimize CPU usage at the expense of memory footprint.
(Don't be shy about this. I am going to add an article about it on Intel's website soon enough, so it will be public knowledge anyway.)
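To make the scanning example concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function name `scan_seconds` and all the bandwidth figures are made up for illustration, and it assumes the scan rate is simply capped by whichever is slower, the cores or the shared memory bus.

```python
def scan_seconds(data_bytes, cores, per_core_bytes_per_s, memory_bytes_per_s):
    """Estimate the time to scan data_bytes held in RAM (simplified model)."""
    # Rate the cores could sustain if memory were infinitely fast.
    compute_rate = cores * per_core_bytes_per_s
    # All cores share one memory bus, so it caps the combined rate.
    effective_rate = min(compute_rate, memory_bytes_per_s)
    return data_bytes / effective_rate


GB = 10**9
data = 100 * GB

# Hypothetical figures: one core can consume 10 GB/s,
# and the memory bus delivers 10 GB/s in total.
one_core = scan_seconds(data, 1, 10 * GB, 10 * GB)
twenty_cores = scan_seconds(data, 20, 10 * GB, 10 * GB)
speedup = one_core / twenty_cores

print(one_core, twenty_cores, speedup)
```

Under these assumed numbers a single core already saturates the bus, so the other 19 cores add nothing. Real hardware has caches, multiple memory channels and NUMA effects that this model ignores.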
Hi Asaf,
The problem is well known, and relates to the way the parallel work is distributed among the cores. It is always a question of how well locality and reuse of data can be achieved. It is also important to be able to reuse data loaded or updated on one core by other cores without coherency problems and without reloading from main memory.
You should look at the example of a L1 shared memory solution for 100's of cores on the same silicon by www.plurality.com
One of the biggest problems is that when you create threads to match the number of available processing cores, the virtual machine / common runtime, the underlying garbage collector and the operating system all add memory pressure, because each thread needs memory. Therefore, it is easy to thrash the L2 cache when you are working with huge amounts of memory.
For example, in my book "C# 2008 and 2005 threaded programming: Beginner's Guide", I developed many examples related to huge image processing algorithms. The main problem that I faced when working with these algorithms in real life was that the Core 2 Quad CPU I used for performance tests had two L2 cache memories (4 MB each), each one shared by only two of the four cores.
I achieved the best performance by transforming a memory-intensive algorithm into a pipelined algorithm with separate memory-intensive and CPU-intensive stages. First I force a garbage collection, then I prepare all the data for the algorithm, then I split the data among the new threads (a well optimized operating system like FreeBSD with Mono C# works great, Windows... not as well), and then each thread runs its CPU-intensive part, with most of its data in the L2 cache it shares with the other thread/core. I/O should be done first, and GC first, too.
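The phases described above can be sketched roughly as follows. This is a Python illustration, not the code from the book: `split_into_chunks`, `brighten` and `run_pipeline` are hypothetical names, the "image" is just a list of integers, and Python's `gc` module stands in for the .NET garbage collector. Note also that CPython threads do not speed up CPU-bound work because of the global interpreter lock, so only the ordering of the phases is the point here.

```python
import gc
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, parts):
    """Split data into `parts` contiguous slices of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def brighten(chunk):
    """Hypothetical CPU-bound step that touches only its own chunk."""
    return [min(255, pixel + 40) for pixel in chunk]


def run_pipeline(pixels, workers=4):
    # Phase 1: I/O and data preparation finish before any worker starts
    # (here the data is assumed to be in memory already).
    chunks = split_into_chunks(pixels, workers)

    # Phase 2: collect garbage now, then keep the collector out of the hot phase.
    was_enabled = gc.isenabled()
    gc.collect()
    gc.disable()
    try:
        # Phase 3: one contiguous chunk per worker, no shared mutable state.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(brighten, chunks))
    finally:
        if was_enabled:
            gc.enable()

    # Phase 4: merge the per-worker results in their original order.
    return [pixel for part in results for pixel in part]


pixels = [i % 256 for i in range(10_000)]
output = run_pipeline(pixels)
```

Contiguous chunks are used so that each worker walks its own region of memory instead of interleaving accesses with the others.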
The great problem is GC. GC in Java, C# and managed C++ is the great enemy of multicore programming and memory intensive algorithms.
A completely shared L2/L3 cache is really helpful for this kind of algorithm too.
There are many code examples that you can download from the book's webpage: http://www.packtpub.com/beginners-guide-for-C-sharp-2008-and-2005-t...
You'll find this and my explanation of the performance waterfall, to optimize multicores as much as possible.
By the way, congratulations on this community initiative!!
I've tried .Net Parallel Extensions and Visual C# 4.0 (2010) CTP (Community Technology Preview). I've also included a lot of information and samples in my book "C# 2008 and 2005 threaded programming" about .Net Parallel Extensions (Chapter 11: Coding with .Net Parallel Extensions).
Microsoft engineers are doing a really good job with Parallel Extensions. However, it is still a CTP and you still have to combine .Net Parallel Extensions with threaded models inherited from .Net 2.0 and 3.0. You cannot use Parallel Extensions in a batch medical imaging application. Nevertheless, you can test them to prepare for Visual C# 4.0 (2010).
One of the most interesting issues about Parallel Extensions is that you can (and you must) control many parameters. Most developers who aren't involved in multicore programming think that .Net Parallel Extensions will be a silver bullet. Unfortunately, they won't be. However, they are going to simplify many tasks. One of the most interesting things is the possibility to work with Queues and Lists already prepared for multithreading, avoiding locks. They are fantastic.
The most difficult part of parallel programming is coordinating and managing multiple threads, and preparing the code to avoid side effects and locks. Locks are horrible and create a lot of problems. Using .Net Parallel Extensions with some extra knowledge of parallel programming techniques will allow developers to avoid locks and side effects more easily than ever. However, developers must learn a lot of topics, like the ones explained in my book.
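As an illustration of working with a queue that is already prepared for multithreading, here is a small producer-consumer sketch. It uses Python's standard `queue.Queue` as a stand-in, not the Parallel Extensions API, and the `producer`, `consumer` and `SENTINEL` names are made up for this example. `queue.Queue` does use a lock internally; the point is that the calling code contains no explicit locks.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def producer(q, items):
    """Push every item onto the queue, then signal that no more will come."""
    for item in items:
        q.put(item)  # blocks while the bounded queue is full
    q.put(SENTINEL)


def consumer(q, results):
    """Pull items until the sentinel arrives; only this thread touches results."""
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item * item)


work = queue.Queue(maxsize=8)  # bounded, so the producer cannot run far ahead
results = []

threads = [
    threading.Thread(target=producer, args=(work, range(10))),
    threading.Thread(target=consumer, args=(work, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

With a single consumer the results come out in the order the items were produced. With several consumers you would need one sentinel per consumer and the order would no longer be guaranteed.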
There is a great book published by Joe Duffy (the Parallel Extensions leader). Every serious C# developer interested in highly threaded applications should read Joe's book. http://www.amazon.com/Concurrent-Programming-Windows-Microsoft-Deve...
This book is a bible for every C# fan.
My book dedicates one entire chapter to certain topics about .Net Parallel Extensions:
Chapter 11: Coding with .NET Parallel Extensions
Parallelizing loops using .NET extensions
Time for action – Downloading and installing the .NET Parallel Extensions
No silver bullet
Time for action – Downloading and installing the imaging library
Time for action – Creating an independent class to run in parallel without side effects
Counting and showing blobs while avoiding side effects
Time for action – Running concurrent nebula finders using a parallelized loop
Using a parallelized ForEach loop
Coding with delegates in parallelized loops
Working with a concurrent queue
Controlling exceptions in parallelized loops
Time for action – Showing the results in the UI
Combining delegates with a BackgroundWorker
Retrieving elements from a concurrent queue in a producer-consumer scheme
Time for action – Providing feedback to the UI using a producer-consumer scheme
Creating an asynchronous task combined with a synchronous parallel loop
Time for action – Invoking a UI update from a task
Providing feedback when each job is finished
Using lambda expressions to simplify the code
Parallelizing loops with ranges
Parallelizing queries
Time for action – Parallelized counter
Parallelizing LINQ queries with PLINQ
Specifying the degree of parallelism for PLINQ
Parallelizing statistics and multiple queries
Summary
However, Microsoft has a lot of work to do to provide developers with a final stable version. I'm sure they will be able to do that, because many well-known gurus are working on their amazing team.
I'll be researching what is new in .Net 4.0 and Parallel Extensions in their final versions.
I completely agree that locks are bad, and I also agree that libraries and infrastructure are the solution to most of our problems. The fact is that web servers and databases are parallel infrastructures that have existed for years, and developers use them with a single-task process (such as Asp.Net, PHP, etc.).
PLINQ is an excellent start, as are parallel libraries such as true parallel linked lists. I keep surprising people when I tell them that the multithreaded version of STL is slower than the single-threaded version.
The problem with the .Net Parallel Extensions is that they are supposed to support seamless parallel work, but they do require a good understanding of the system. For example, you shouldn't use wait states, such as opening a file, in parallel, or the internal pool is flooded. Another problem is that a local variable / object in a function behaves like a global variable when used inside a parallel loop. What about canceling tasks: do you cancel the whole collection of iterations? What about a loop inside a loop? How can you repeat an operation?
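One common way around the shared-variable problem is to keep every piece of state local to the loop body and combine the partial results afterwards. The sketch below shows that pattern in Python with `concurrent.futures`. It is not Parallel Extensions code, and `count_matches` and the sample data are invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(chunk, wanted):
    """Loop body with no side effects: all state is local to this call."""
    local_count = 0  # private to the call, never shared between tasks
    for value in chunk:
        if value == wanted:
            local_count += 1
    return local_count


data = [i % 7 for i in range(70_000)]
chunks = [data[i:i + 10_000] for i in range(0, len(data), 10_000)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Each task returns its own partial count instead of updating a shared one.
    partial_counts = list(pool.map(lambda c: count_matches(c, 3), chunks))

# The only combining step runs on one thread, after all tasks have finished.
total = sum(partial_counts)
print(total)
```

Had the tasks incremented one counter declared outside the loop, every increment would need a lock or risk lost updates. Returning partial results avoids both.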
My intuition is that it will go away in a few years and we will go back to the old model of parallel infrastructure and serial user-code. Today databases, web-servers, clustering, and kernel are all parallel and written by specialists. The only problem that we have today is that we do not have the right infrastructure.
I completely agree with your point of view. That's exactly the problem. .Net Parallel Extensions aren't a silver bullet.
Do you remember the 32-bit revolution? Applications were going to be faster than ever... In those days, 16-bit applications performed faster than their 32-bit counterparts.
The same happens with the multicore revolution. If the applications aren't optimized by qualified software engineers, they will run slower than their single threaded counterparts.
In recent years, developers using high-level programming languages forgot about the underlying hardware. In order to take full advantage of multicore, they need to begin learning hardware again.
A programmer using .Net Parallel Extensions without learning some parallel programming basics can create a horrible application, because .Net Parallel Extensions do not automate parallelization. They help, but you need to use them with great care.
For example, in my book, I talked about .Net Parallel Extensions in Chapter 11. Why? Because the reader needs to understand 10 chapters before using .Net Parallel Extensions. I am horrified when I read short blogs talking about .Net Parallel Extensions without explaining the fundamental issues that come first.
QA teams are going to have a lot of work with multicore applications.
However, there is great potential when qualified software engineers with the right training take advantage of multicore CPUs, combining native threads, excellent (and I mean excellent) object-oriented designs and parallel extensions and libraries.
Again, there is no silver bullet. Of course, marketing staff will say that .Net Parallel Extensions and .Net 4.0 will be the silver bullet. However, that is not true.
As a personal experience, I saw a highly threaded Java application running on a 32-CPU Sun server more slowly than on an old Pentium CPU. It produced 1 lock per instruction per CPU!!! The team was a great single-core development team. Nevertheless, they had no idea of parallel programming! However, when it was optimized, it performed 20 times faster than its single-CPU version.