Transactional Memory Coherence and Consistency

In this model, transactions are _always_ the basic unit of
parallel work, communication, memory coherence and memory reference consistency.

TCC hardware combines all writes from each transaction into a single packet
and broadcasts this packet to the permanent shared memory state atomically
as a large block. This significantly reduces cache coherence traffic.

TCC provides automatic, hardware-controlled rollback of speculative
transactions and therefore resolves any correctness violations that may occur
when several processors attempt to read and write the same data simultaneously.

TCC is an alternative to both the message-passing and shared-memory models.

Message-passing systems support relatively simple hardware configurations,
such as clusters of workstations, but make programmers work hard to
take advantage of the hardware.

Does a message-passing system implicitly synchronize processors? Is it because
applications/modules are designed so that nothing is shared
between them?

Shared memory adds additional hardware to provide programmers with the illusion
of a single shared memory common to all processors, avoiding or minimizing the
problem of manual data distribution. This is accomplished by tracking shared
cache lines as they move through the system, either via a snoopy bus coherence
protocol over a shared bus or via a directory-based coherence mechanism.
Either way, this is expensive.

What is the granularity at which communication events like cache updates happen in shared memory model? Is it at the level of individual load and store instructions?

Both models therefore have drawbacks: message passing makes software
design difficult, while shared memory requires complex hardware for an only
slightly simpler programming model.

TCC claims that it provides a communication model that avoids
memory consistency issues present in SHM and reduces the need for hardware
to support frequent, latency sensitive coherence requests for individual
cache lines at load/store granularity.
It claims to retain the advantages of MPP-like synchronization.

What is a speculative transaction? Is it a transaction which can later be
rolled back at commit time?

Each transaction produces a block of writes called the write state, which is
committed to shared memory only as an atomic unit. Once the transaction is
complete, hardware must arbitrate system-wide for permission to commit
its writes. After this permission is granted, the processor can take
advantage of high system interconnect bandwidth to simply broadcast
all writes for the entire transaction out as one large packet to the
rest of the system.
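As a rough sketch of this commit step (all names and structures here are hypothetical, not the paper's actual hardware), stores are buffered locally and reach shared memory only as one packet under exclusive commit permission:

```python
import threading

class TCCNode:
    def __init__(self, shared_memory):
        self.shared = shared_memory
        self.write_state = {}   # address -> value, buffered locally

    def store(self, addr, value):
        # Stores are buffered until commit; nothing reaches shared memory.
        self.write_state[addr] = value

    def commit(self, bus):
        # Arbitrate system-wide for commit permission, then broadcast
        # the whole write state as one large packet.
        with bus:                       # models exclusive commit permission
            packet = dict(self.write_state)
            self.shared.update(packet)  # atomic from other nodes' view
        self.write_state.clear()
        return packet

shared = {}
bus = threading.Lock()
node = TCCNode(shared)
node.store(0x10, 1)
node.store(0x20, 2)
node.commit(bus)
print(shared)  # {16: 1, 32: 2}
```

The point of the sketch: no coherence traffic occurs per store; the only inter-node communication is the single commit packet.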

How does this model minimize latency compared with the simple shared-memory
model? Is it because the entire write set is broadcast instead of individual
cache-line stores?

How does this entire-transaction commit provide inherent synchronization? Can
we treat a critical section as a transaction?

Consistency: A processor that reads data that is subsequently updated by
another processor's commit, before it can commit itself, is forced to
violate and roll back in order to enforce this model.
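This consistency rule can be sketched as read-set snooping (a minimal toy model; the class and method names are my own, not the paper's):

```python
class SpeculativeNode:
    def __init__(self):
        self.read_set = set()      # addresses speculatively read
        self.write_state = {}      # addresses speculatively written
        self.violated = False

    def load(self, addr, shared):
        self.read_set.add(addr)
        # Reads see our own buffered writes first, then shared memory.
        return self.write_state.get(addr, shared.get(addr))

    def snoop_commit(self, packet):
        # Another processor committed 'packet' before we could commit.
        # If it touched anything we read, our execution is stale.
        if self.read_set & packet.keys():
            self.rollback()

    def rollback(self):
        self.read_set.clear()
        self.write_state.clear()
        self.violated = True       # transaction will re-execute

shared = {0xA: 0}
node = SpeculativeNode()
node.load(0xA, shared)             # speculative read of 0xA
node.snoop_commit({0xA: 42})       # someone else committed to 0xA first
print(node.violated)  # True
```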

Coherence: No need for locks, transactions provide atomicity.

Programming Model: Transaction breaks should never be inserted in the middle
of a critical section. Unlike conventional parallelization, other errors
only cause reduced performance, not incorrect execution.

What kind of errors in this model can cause performance degradation? Can we
say that a transaction equivalent to one store would cause performance
degradation? Is it related to identification of transactions in the code base?

Divide into transactions: The first step in creating a parallel program
using TCC is to coarsely divide the program into blocks of code that can run
concurrently on different processors.

Specify order: The programmer can optionally specify an ordering between
transactions to maintain a program order that must be enforced. There are
places in parallel applications where certain transactions must complete
before others. This can be addressed by assigning hardware managed phase
numbers to each transaction. At any point in time, only transactions from
the oldest phase present in the system are allowed to commit.
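The phase-number mechanism might be sketched like this (a toy arbiter under my own naming; the real system does this in hardware):

```python
import heapq

class CommitArbiter:
    def __init__(self):
        self.pending = []    # min-heap of (phase, txn_id)

    def register(self, phase, txn_id):
        heapq.heappush(self.pending, (phase, txn_id))

    def try_commit(self, phase, txn_id):
        # Grant commit only to transactions from the oldest phase
        # still present in the system.
        oldest_phase = self.pending[0][0]
        if phase != oldest_phase:
            return False                    # must wait for older phases
        self.pending.remove((phase, txn_id))
        heapq.heapify(self.pending)
        return True

arb = CommitArbiter()
arb.register(0, "T1")
arb.register(1, "T2")
print(arb.try_commit(1, "T2"))  # False: phase 0 still present
print(arb.try_commit(0, "T1"))  # True
print(arb.try_commit(1, "T2"))  # True: phase 0 has drained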

Performance tuning: After transactions are selected and ordered, the program
can be run in parallel. The TCC system can automatically provide informative
feedback about where violations occur in the program and allows the programmer
to perform further optimizations.

How does this model allow programmers to focus on performance instead of
correctness? Ans: correctness is automatically handled by hardware.

Why can't the TCC model scale infinitely? The TCC model is similar to
directory-based cache coherence except at a coarser granularity, and so has
similar scalability limitations.

Basic TCC system: Individual processor nodes within a TCC system must have
features to provide speculative buffering of memory references. Each node
consists of a core plus its own local cache hierarchy, and each cache line
maintains read/modified bits that implement transactional semantics.

Cache lines with read or modified bits set may not be flushed from the local
cache hierarchy before the transaction commits.
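A minimal sketch of this per-line state (hypothetical structure, not the actual hardware layout):

```python
class CacheLine:
    def __init__(self, addr):
        self.addr = addr
        self.data = 0
        self.read = False      # set on a speculative load
        self.modified = False  # set on a speculative store

    def can_evict(self):
        # Lines holding speculative state are pinned in the local
        # hierarchy until the transaction commits.
        return not (self.read or self.modified)

line = CacheLine(0x40)
print(line.can_evict())  # True: no speculative state yet
line.read = True         # transaction loads from this line
print(line.can_evict())  # False: must stay until commit
```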

What if cache capacity constraints force cache entries to be flushed out in
the middle of a transaction? "Commit permission" helps: once you have
commit permission, you don't need to maintain read/modified bits.

Can a processor hold commit permission for an arbitrary period of time? If so,
how does it impact performance? (Performance degradation due
to long commit latencies.)

Improvement suggestions:
- Double buffering: double-buffer the transactional bits, so a
  successive transaction can set these bits while the older bits are
  being flushed. Figure 3 shows the various combinations.

- Hardware-controlled transactions: In the base TCC system, programmers
  explicitly mark all transaction boundaries. It is also possible for
  hardware to mark transaction boundaries automatically, while avoiding
  programmer-marked critical sections. Hardware can also merge small
  transactions into a single large transaction, and can insert barriers
  into unordered transactions. How do barriers help? They help avoid
  starvation: a long transaction that makes no forward progress because
  it gets into an infinite loop of violations and restarts.

- The programmer or compiler can provide hints to the hardware to reduce
  the need for buffering for broadcast. 

- Is there a need to buffer stack memory updates? Probably not: stack data is
  processor-private, so cache snooping is not required.

I/O handling: The key constraint is that a transaction can't violate and
roll back after input is obtained.
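One way to picture this constraint (a toy sketch under my own naming, not the paper's I/O mechanism): before consuming input, the transaction drains its speculative state, so a later violation can never "un-read" the input.

```python
import threading

class Node:
    def __init__(self, shared):
        self.shared = shared
        self.write_state = {}   # speculative, buffered writes

    def commit(self, bus):
        with bus:  # exclusive commit permission
            self.shared.update(self.write_state)
        self.write_state.clear()

def read_input(node, bus, device_queue):
    # A transaction must not roll back after input is consumed, so
    # commit all speculative state before touching the device.
    node.commit(bus)
    return device_queue.pop(0)

shared, bus = {}, threading.Lock()
node = Node(shared)
node.write_state[0x8] = 7
value = read_input(node, bus, [99])
print(shared, value)  # {8: 7} 99
```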

Simulation Results: Figures 4 and 5

Why does equake_s perform better than equake_l? Is it due to a higher number
of violations with the larger transaction size?

Buffering requirements for typical transactions: The most significant
hardware cost of this system is in the addition of the speculative buffer
support to the local cache hierarchy. It is critical that the amount of state
read and/or written by an average transaction be small enough to be buffered
on chip.

Limited Bus Bandwidth: Figures 8 and 9

Figure 8 shows the average number of addresses that must be broadcast every cycle.

Figure 9 shows the average number of bytes that must be broadcast every cycle.

How do invalidate and update protocols differ? Does the choice affect bandwidth?

Other Limited Hardware: These results show how a real system will behave in
practice, where issues like finite bus bandwidth, a reduced number of read-state
bits, limited buffering, and the time to handle the various protocol overheads
can all be significant limiting factors on speedup.

Figure 11:

Why does fft perform badly with low bandwidth? Is it because of its large
write state?

Figure 12:

How does protocol overhead like commit permission affect the performance of
the radix application? Is it because radix has small transactions but the
timing overhead of acquiring commit permission is high?