Skip to content

History Correlation

Eric Voskuil edited this page Apr 18, 2017 · 8 revisions

Address receive (output) and spend (input) records are indexed by the server by 20 byte address hash. Each record contains either an receive or spend point. A point is the transaction hash and index within the transaction of the output or input respectively. Each record also contains a 8 byte value. The output.value field contains the amount of satoshis spent by the output. The input.value field is a compressed reference to the output that the input spends. These values are returned by the blockchain.fetch_history3 command of the query service.

The input.value is logically the output point. However a point would require 34 bytes of storage per indexed input. In order to compress storage and to provide uniform record size for inputs and outputs (which improves performance), the input's reference to the output that it spends is compressed to 8 bytes. This value is referred to as a checksum, and is computed from the output point. By computing the checksum for each output in a response an API user can correlate each input to an output that it spends. An unspent output has a checksum with no correlating input.

Checksum Computation

The following function computes the checksum of an output.

uint64_t point::checksum() const
{
    // Reserve 49 bits for the tx hash and 15 bits (32768) for the input index.
    static constexpr uint64_t mask = 0xffffffffffff8000;

    // Use an arbitrary offset to the middle of the hash.
    const auto tx = from_little_endian_unsafe<uint64_t>(hash_.begin() + 12);
    const auto index = static_cast<uint64_t>(index_);

    const auto tx_upper_49_bits = tx & mask;
    const auto index_lower_15_bits = index & ~mask;
    return tx_upper_49_bits | index_lower_15_bits;
}

Checksum Collision

It is not possible for any two outputs of the same transaction to have a checksum collision. Assuming randomly spread values for the middle 49 bits of distinct transaction hashes, the probability of a collision between any two outputs is 1 in 2^49. In the case of collision between two or more outputs in a query result it is not possible to correlate inputs with the conflicting outputs. Despite the unlikely occurrence, clients should handle collision as an error state.

Unconfirmed Points

Points of unconfirmed transactions, including those mined in blocks in a weak chain, are not indexed. To obtain unconfirmed address information subscribe to the address.

Partial Indexing

If the server indexes the entire blockchain no points at or below the current top block will be missing. However the server may start address indexing at any height.

Things can get weird if a server is restarted with a higher index start height than that for which it had previously indexed records. This would produce gaps in the history.

Unindexed Points

It is possible for points to be missing in the case where the server does not index the entire blockchain. In this case is not possible to determine if any of the oldest points are missing without prior knowledge of their existence. It is also possible that an input can be indexed with no corresponding output, since the output can be at a lower height than the input. In this case an uncorrelated input would result. This can occur with any number of inputs and/or outputs, including one of each. Lack of correlation always indicates missing history due to nonzero server index start height.

Clone this wiki locally