Lite Client Message Retrieval Proposal

Introduction
This is a proposal to introduce a mechanism into the Bitmessage protocol that will allow 'lite' Bitmessage clients to retreive messages that have been sent to them without downloading all the messages in a stream.

Summary of the proposal

 * Lite clients can determine which messages they need to dowload using 'prefix filters'.
 * Prefix filters allow the owner of each Bitmessage address to determine the balance between anonymity and efficiency that they will have when retrieving messages.
 * Users who wish to retain the highest level of anonymity (by downloading all messages in a stream) are not affected by this change. Messages to them are 'tagged' with completely random data.
 * Users who do not care about anonymity can specify a very precise prefix for each of their addresses, allowing them to download only messages destined for them.
 * Other users can achieve a balance between anonymity and convenience.

Reasoning behind the proposal
Currently, Bitmessage clients such as PyBitmessage act as full nodes in the Bitmessage peer-to-peer network. This involves transmitting, receiving, and processing a large amount of data. This workload may be manageable for more powerful devices, such as laptops, desktops, and servers, but it is typically too great for less powerful devices such as mobile phones and tablets to reasonably handle. This is particularly true in cases where mobile internet connections mean that bandwidth is very expensive. Because of this, a high proportion of the computing devices in use today are not suitable to act as full nodes in the Bitmessage peer-to-peer network. Therefore, in order for the users of these devices to be able to use Bitmessage, there must be some way of doing so that does not involve their device acting as a full node in the network.

One approach to solving this problem is the creation of 'lite' clients. These are Bitmessage clients which do not do all the processing and transmission work that full nodes do. Instead, the burden of this work is shifted to servers, which supply lite clients with the data they need and relay data sent by lite clients to the rest of the Bitmessage network. Thankfully the design of Bitmessage allows this to be implemented in such a way that very little security is lost. All message encryption and decryption can be done locally by the lite client, and it can retain sole control over its addresses and cryptographic keys. Data provided by servers can be cryptographically authenticated as being correct, and data sent to servers can be disseminated to several at once to ensure that it reaches the rest of the network.

However, this arrangement does require one significant compromise. In order for 'lite' clients such as mobile phones to receive messages, there needs to be some way for them to identify messages that have been sent to their addresses. Full nodes in Bitmessage do not require this because they receive and automatically attempt to decrypt all messages in the streams (segments of the network) which they are a part of. As described above however, the work required for this approach is too great for many devices, and therefore there must be some way for lite clients to identify which messages are destined for them, so that they can request those messages from a server.

Proposed changes
The pubkey of every hashed address contains two new fields: a 'prefix length' field and a 'prefix' field. The value of the prefix length field is defined by the owner of the pubkey. The prefix field is set randomly. The value of the prefix length field defines what balance of efficiency and anonymity the owner of that address wants. If the value of the prefix length field is zero, then the prefix field can be ignored entirely, giving maximum anonymity. On the other extreme, if the value of the prefix length field is 32 (the maximum), then the full 32 bits of the prefix should be used, giving maximum efficiency in retrieving data. Each additional bit used halves the proportion of the total data set that you will have to request in order to receive data destined for you.

All msgs have a 'prefix nonce' as part of their unencrypted data. This prefix nonce is always of a fixed length (32 bits) to avoid leaking information. When sending a msg to an address, the sending client examines the destination address's pubkey and sets the first n bits of the prefix nonce to match the prefix found in the pubkey. The remainder of the prefix nonce is set randomly.

When a lite client wishes to retrieve msgs from the network, it makes a request to a server for all msgs which have a prefix nonce where the first n bits matches the prefix of the lite client's address(s).

Pubkey object
Under this proposal, the encrypted part of a pubkey object would be composed as follows. This can be compared to the current specification found at https://bitmessage.org/wiki/Protocol_specification#pubkey.

Msg object
Under this proposal, a msg object would be composed as follows. This can be compared to the current specification found at https://bitmessage.org/wiki/Protocol_specification#msg.

Example 1
If Alice does not care about anonymity, she can have her address's pubkey specify that the full 32 bits of the prefix field should be used. This means that all msgs sent to her will effectively be uniquely tagged for her address (there are roughly 4 billion possible prefix values). This will allow Alice to only download msgs that are destined for her, giving her maximum usability and convenience at the expense of anonymity.

Example 2
If Bob wants to achieve a balance between anonymity and convenience, he can have his address's pubkey specify that 4 bits of the prefix field should be used, e.g 0110. This means that all msgs sent to his address will have a prefix nonce beginning with 0110. A random distribution of prefixes in a stream should mean that roughly 6.25% of msgs in a stream will have a prefix nonce beginning with 0110. Therefore Bob will be able to download roughly 6.25% of the msgs in a stream and still be guaranteed to receive all msgs destined for him.

Example 3
If Carol wants maximum anonymity then she can have her address's pubkey specify that zero bits of the prefix field should be used. This means that msgs sent to her will contain no identifying information, only a totally random prefix nonce. Carol will have to download and attempt to decrypt all msgs in a stream in order to be sure of receiving all msgs destined for her.

Other object types
Prefix filtering is not required for broadcasts, pubkeys, or getpubkeys. These object types are already tagged with an identifier.

Lite client pubkey retrieval
How can lite clients retrieve pubkeys without telling the servers who they are talking to? In the short to medium term (until streams are implemented), it will be perfectly viable for lite clients to simply download all pubkeys in stream 1. A pubkey is 392 bytes in size. If we assume an average of 5 addresses per user, we could support a 10x increase in the number of active users (from roughly 1,000 to 10,000), and lite clients would still only have to download 19MB worth of pubkey data. This is only a rough approximation, but it seems reasonably clear that the amount of data to download and store would be acceptable.

Lite client pubkey dissemination
How can lite clients respond to getpubkey requests? Lite clients can do this in two ways: 1. They can make periodic requests to full node servers to check for any getpubkeys for any of their addresses and then respond to any getpubkeys they receive by re-disseminating the relevant pubkeys. 2. They can periodically re-disseminate the pubkeys for each of their addresses, with the re-dissemiantion period determined by the time to live of the pubkeys they create. This will ensure that the pubkeys for all the lite client's addresses are always available in the stream(s) they are a part of.

Non-hashed addresses
If non-hashed addresses are introduced in the future, this scheme will not apply to them, as they will not have pubkeys. Users of non-hashed addresses will still need to download and attempt to decrypt all msgs in each stream in which they have an address. This makes sense though, as the whole point of introducing non-hashed addresses is to give people the ability to choose greater anonymity at the expense of convenience.

Streams
The system described in this proposal is intended to be complementary to streams (i.e. both should be implemented). It is not intended to replace them.

Possible refinement: Daily changing prefix values
While the system described above gives a reasonable balance between anonymity and efficiency, it has some weaknesses. In particular, users sending messages to addresses which specify a non-zero prefix length value will tend to leak some information about who they are communicating with. By observing the prefix nonces of msgs sent by a particular client, an attacker may be able build up information about how many different addresses a given user is communicating with, who they are communicating with more often, and who they do not communicate with for certain periods of time.

One possible way to mitigate this problem would be to modify the prefix filtering system so that the prefix values used to 'tag' msgs vary according to a time value. Instead of tagging msgs with the first n bits of the prefix defined in the destination address's pubkey, msgs could be tagged with the first n bits of a hash value. The hash value would be calculated from the first n bits of the prefix concatenated with a time value. The result of this would be that the identifying bits of the prefix nonce of a msg would change every so often in a predictable way, with the frequency of change defined by the time value used.

For example, if the time value is defined by the protocol as 00:00 on the current day in Unix time, then the identifying prefix bits used to tag msgs will be different each day, in a way that can be easily and reliably calculated by both the sender and the receiver of msgs. The result of this is that rather than all msgs ever sent to a given address having the same identifying prefix bits, all msgs sent to a given address on a given day have the same identifying prefix bits.

A rough pseudocode description of the procedure to calculate the prefixNonce for a msg follows: unixTime = getUnixTime remainderSeconds = unixTime mod 86,400 timeValue = unixTime - remainderSeconds hash = sha512(sha512(prefix[0-prefixLength] + timeValue)) identifyingBits = hash[0-prefixLength] prefixNonce = identifyingBits + randomBits

Advantages:
 * The information leaked by a client sending msgs to addresses which specify a non-zero prefix length value is greatly diminished, as the identifying bits in the prefix nonces of sent msgs change each day. Some information is still leaked, particularly when an attacker already knows the destination address(es) of sent msgs and can therefore calculate what the identifying prefix bits will be each day, but the overall amount of information leaked is substantially reduced.

Disadvantages:
 * When catching up with the network, lite clients must make a separate request for messages for each day in the time period that they need to check.