Threads cannot be implemented as a library.

Abstract: In many environments, multithreaded code is written in a language that was originally designed without thread support (e.g. C), to which a library of threading primitives was subsequently added.

This paper argues that a library of threading primitives is not sufficient to guarantee correctness; the compiler itself must be aware of the existence of threads.

This paper reviews the pthreads approach, explaining how and why it appears to work. It then discusses three distinct deficiencies in this approach which can lead to subtly incorrect code. Each of these failures is likely to be very intermittent and hard to expose during testing.

Any language supporting concurrency must specify the semantics of multithreaded execution. Most fundamentally, it must specify the memory model, i.e. which assignments to a variable by one thread can be seen by a concurrently running thread.

Sequential Consistency: http://en.wikipedia.org/wiki/Sequential_consistency

Example program: starting from an initial state in which all variables are zero, one thread executes

x = 1; r1 = y;

while another executes

y = 1; r2 = x;

Sequential consistency requires that at least one of r1 and r2 have the value 1 when execution completes.

All realistic programming language implementations supporting true concurrency allow both r1 and r2 to remain zero in the example above. This is because both the compiler and the hardware may reorder memory operations.
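The example can be run directly with pthreads. A minimal sketch (thread and function names are mine, not the paper's); note that the outcome is nondeterministic, which is exactly why such bugs are hard to expose in testing:

```c
#include <pthread.h>

/* Shared variables, all initially zero. */
int x = 0, y = 0;
int r1 = 0, r2 = 0;

static void *thread1(void *arg) { x = 1; r1 = y; return NULL; }
static void *thread2(void *arg) { y = 1; r2 = x; return NULL; }

/* Run the two threads once. Under sequential consistency at least one
   of r1, r2 must end up 1; real compilers and hardware also permit
   r1 == 0 && r2 == 0, because the store and the subsequent load in
   each thread may be reordered. */
void run_once(void) {
    pthread_t t1, t2;
    x = y = r1 = r2 = 0;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```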

What is a compile-time barrier in Linux? Is it asm volatile ("" : : : "memory")? It is used to prevent compile-time reordering of memory accesses. Note that this construct is GCC-specific.
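A compile-time barrier emits no machine instruction at all; it only stops the compiler from caching or reordering memory accesses across it. A minimal sketch (the publish/flag scenario is my illustration, not from the paper):

```c
/* GCC/Clang compile-time barrier: emits no instruction, but tells the
   compiler it may not keep memory values in registers across this
   point, nor reorder memory accesses around it. */
#define barrier() __asm__ __volatile__("" ::: "memory")

int data = 0;
int flag = 0;

/* Publish data, then set flag. The barrier keeps the *compiler* from
   reordering or merging the two stores; the *hardware* may still
   reorder them, which requires a real memory fence. */
void publish(int value) {
    data = value;
    barrier();
    flag = 1;
}
```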

What are memory barriers? Are they the lfence, sfence and mfence instructions on x86 processors?
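On x86, lfence orders loads, sfence orders stores, and mfence orders all memory operations. Portable code usually reaches them through compiler builtins rather than inline assembly; a sketch using the GCC/Clang atomic builtins (variable names are mine):

```c
int payload = 0;
int ready = 0;

/* __atomic_thread_fence(__ATOMIC_SEQ_CST) is a full hardware memory
   fence; on x86, GCC typically emits an mfence (or an equivalent
   locked instruction) for it. Unlike the compile-time barrier, this
   also constrains the processor's reordering. */
void store_with_fence(int v) {
    payload = v;
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* full barrier */
    ready = 1;
}
```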

Correctness issues with a thread library: these arise because the compiler is not aware of threads.

Concurrent Modifications:  The compiler may transform an instruction sequence into a different sequence that preserves sequential correctness. Such a transformation can nevertheless introduce data races when the instructions are executed by multiple threads.

Please go through the example defined in the same section.
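The paper's example is along these lines: a rewrite that is perfectly correct for single-threaded code but introduces a write that the source never performs. A sketch (function names are mine):

```c
/* Original source: y is written only when x == 1. */
int original(int x, int y) {
    if (x == 1) ++y;
    return y;
}

/* A sequentially-equivalent transformation a compiler could make:
   speculatively increment, then undo. Single-threaded results are
   identical, but y is now written even when x != 1 -- a data race if
   another thread reads or writes y concurrently, relying on the fact
   that this thread leaves y alone when x != 1. */
int transformed(int x, int y) {
    ++y;
    if (x != 1) --y;
    return y;
}
```

Run single-threaded, the two versions are indistinguishable, which is why the transformation looks legal to a thread-unaware compiler.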

Rewriting of adjacent data:  This happens when a store is performed on a variable that is never mentioned in the source code. Bit-fields in C/C++ structures are particularly exposed to this, because no bit-wide store instructions are available on the hardware; updating one field requires loading and rewriting the surrounding word.

This introduction of extra store operations can cause data races even if the data is protected by synchronization primitives. Is that true?

Please go through the example defined in the same section.
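The mechanism can be demonstrated deterministically by simulating the interleaving by hand. Assuming two fields packed into one 32-bit word (the layout and names are my illustration), an update to field a is compiled as load-word / splice-bits / store-word, so a concurrent completed update to field b can be silently overwritten:

```c
#include <stdint.h>

/* Two logical bit-fields packed into one 32-bit word:
   bits 0..16 hold a (17 bits), bits 17..31 hold b (15 bits). */
static uint32_t word = 0;

static uint32_t get_a(uint32_t w) { return w & 0x1FFFF; }
static uint32_t get_b(uint32_t w) { return w >> 17; }

/* What "a = v" compiles to when no 17-bit store exists: take the
   whole word, splice in the new bits, produce the whole word. */
static uint32_t splice_a(uint32_t w, uint32_t v) {
    return (w & ~(uint32_t)0x1FFFF) | (v & 0x1FFFF);
}
static uint32_t splice_b(uint32_t w, uint32_t v) {
    return (w & 0x1FFFF) | ((v & 0x7FFF) << 17);
}

/* Simulate the racy interleaving step by step: thread 1 loads the
   word, thread 2 then updates b and completes, and thread 1 writes
   its stale copy back -- thread 2's store to b is lost, even though
   thread 2 never touched a. */
uint32_t simulate_lost_update(void) {
    word = 0;
    uint32_t t1_copy = word;        /* thread 1: load whole word    */
    word = splice_b(word, 5);       /* thread 2: b = 5 (completes)  */
    word = splice_a(t1_copy, 3);    /* thread 1: store stale word   */
    return word;
}
```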

How can byte-wide stores avoid the race? Can we really resolve the bit-fields issue with byte-wide stores? Do we need to pad fields to byte boundaries to use byte-wide stores effectively?
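If the hardware provides byte-wide stores (x86 does), giving each field its own byte-aligned storage means a store to one member touches only that byte and cannot clobber its neighbour. A sketch of the trade-off (names are mine):

```c
/* Instead of bit-fields sharing one word, pad each flag out to a full
   byte. This spends memory to buy isolation: assuming byte stores
   exist on the target, writing one member compiles to a single byte
   store that leaves the adjacent member untouched. */
struct flags {
    unsigned char ready;  /* was a 1-bit field */
    unsigned char done;   /* was a 1-bit field */
};

struct flags g_flags = {0, 0};

/* One byte store; g_flags.done is never read or written here. */
void set_ready(struct flags *f) { f->ready = 1; }
```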

Register Promotion: Data operations perform better on register operands than on memory operands. The compiler exploits this fact to optimize code, but such optimizations can introduce store operations that do not appear in the source code and thus lead to races.

Please go through the example defined in the same section. Notice that in this scenario, this transformation is indeed an "optimization".
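The paper's register-promotion example is along these lines: a shared variable is only touched while a (conditionally taken) lock is held, but promotion keeps it in a register across the loop, spilling and reloading around the opaque lock calls. A sketch with stub lock functions standing in for pthread_mutex_lock/unlock (all names are mine):

```c
/* Stubs standing in for pthread_mutex_lock/pthread_mutex_unlock:
   opaque calls the compiler must assume may touch any memory. */
static void lock(void)   {}
static void unlock(void) {}

static int x = 0;  /* shared; guarded by the lock when mt is set */

/* Original loop: x is only accessed while the lock is held (when mt). */
int sum_original(int mt, int n) {
    for (int i = 0; i < n; ++i) {
        if (mt) lock();
        x += i;
        if (mt) unlock();
    }
    return x;
}

/* After register promotion: x lives in register r, spilled/reloaded
   around the opaque calls. Sequentially equivalent -- but x is now
   loaded and stored even when mt is false and no lock is held.
   Those extra unprotected accesses can race with other threads. */
int sum_promoted(int mt, int n) {
    int r = x;                          /* extra unlocked load  */
    for (int i = 0; i < n; ++i) {
        if (mt) { x = r; lock(); r = x; }
        r += i;
        if (mt) { x = r; unlock(); r = x; }
    }
    x = r;                              /* extra unlocked store */
    return x;
}
```

Single-threaded, the two versions always agree, which is what makes the promotion look like a pure win to a thread-unaware compiler.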

Why is compiler support necessary for the correctness of parallel programs? Is it because of the optimizations and instruction reordering performed by the compiler?

Why is there a need to revise the C++ language standard and other environments that rely on library-based threads? Is it to introduce language-level constructs that let the compiler optimize correctly in the presence of threads?

Performance: 

What is the cost associated with mutual exclusion primitives? Is it due to the atomic memory update, or to the memory barrier operation?

What is double-checked locking?

http://en.wikipedia.org/wiki/Double-checked_locking
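The classic pattern, sketched in C with stub lock functions (names are mine). It is broken as written: without memory fences, the compiler or hardware may reorder the publishing store before the initializing stores, so another thread's unlocked first check can observe a non-NULL pointer to a not-yet-initialized object.

```c
#include <stdlib.h>

struct singleton { int value; };

static struct singleton *instance = NULL;

static void lock(void)   { /* stands in for pthread_mutex_lock   */ }
static void unlock(void) { /* stands in for pthread_mutex_unlock */ }

/* Double-checked locking: take the lock only on the slow path, to
   avoid paying for synchronization on every call. Broken as written:
   the store to instance may be reordered before the store to
   p->value, so a concurrent caller can return a half-built object. */
struct singleton *get_instance(void) {
    if (instance == NULL) {          /* first check: no lock held */
        lock();
        if (instance == NULL) {      /* second check: under lock  */
            struct singleton *p = malloc(sizeof *p);
            p->value = 42;           /* may be reordered after ...   */
            instance = p;            /* ... this publishing store    */
        }
        unlock();
    }
    return instance;
}
```

Single-threaded (and hence in this test) it behaves correctly, which again illustrates why the bug is so intermittent.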

Expensive Synchronization: All of the paper's performance figures show that non-synchronized code performs better than synchronized code.

How does synchronization reduce the performance benefit of multiprocessor systems? Is it due to the synchronization primitives themselves, or to the optimizations the compiler must forgo in their presence?

Why do spin-locks perform better than mutexes? Is it because they avoid context switches?
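A spin-lock keeps the waiting thread on the CPU instead of descheduling it, so for short critical sections it avoids the context-switch and wakeup cost of a blocking mutex (at the price of burning CPU while waiting). A minimal test-and-set sketch using the GCC/Clang __sync builtins (names are mine):

```c
/* Test-and-set spin-lock built on GCC/Clang __sync builtins. */
static volatile int lock_word = 0;

static void spin_lock(void) {
    /* Atomically set the word to 1 and get its old value; loop while
       it was already 1. The waiter spins on the CPU instead of
       sleeping -- cheap for short critical sections, wasteful for
       long ones. */
    while (__sync_lock_test_and_set(&lock_word, 1))
        ;  /* spin */
}

static void spin_unlock(void) {
    __sync_lock_release(&lock_word);  /* store 0 with release semantics */
}

static int counter = 0;

void add_locked(int n) {
    spin_lock();
    counter += n;
    spin_unlock();
}
```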