In this commit, we add a new atomic bool to only permit a single gossip
backlog goroutine per peer. If we get a new request that needs a backlog
while we're still processing another, then we'll drop that request.
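A minimal sketch of that guard using Go's `atomic.Bool`; the identifiers here (`backlogInFlight`, `maybeSendBacklog`) are illustrative, not necessarily the actual ones:
```go
package sketch

import "sync/atomic"

type gossipSyncer struct {
	// backlogInFlight is set while a backlog goroutine is running for
	// this peer's syncer.
	backlogInFlight atomic.Bool
}

// maybeSendBacklog starts at most one backlog goroutine at a time. If one
// is already running, the new request is simply dropped.
func (g *gossipSyncer) maybeSendBacklog(process func()) {
	if !g.backlogInFlight.CompareAndSwap(false, true) {
		// A backlog is still being processed: drop this request.
		return
	}

	go func() {
		defer g.backlogInFlight.Store(false)
		process()
	}()
}
```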
In this commit, we introduce an asynchronous processing queue for
GossipTimestampRange messages in the GossipSyncer. This change addresses
a critical issue where the gossiper could block indefinitely when
processing timestamp range messages during periods of high load.
Previously, when a peer sent a GossipTimestampRange message, the
gossiper would synchronously call ApplyGossipFilter, which could block
on semaphore acquisition, database queries, and rate limiting. This
synchronous processing created a bottleneck where the entire peer
message processing pipeline would stall, potentially causing timeouts
and disconnections.
The new design adds a timestampRangeQueue channel with a capacity of 1
message and a dedicated goroutine for processing these messages
asynchronously. This follows the established pattern used for other
message types in the syncer. When the queue is full, we drop messages
and log a warning rather than blocking indefinitely, providing graceful
degradation under extreme load conditions.
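A minimal sketch of the pattern, built around the `timestampRangeQueue` channel mentioned above; the handler names (`queueTimestampRange`, `timestampRangeHandler`) are illustrative:
```go
package sketch

import "github.com/lightningnetwork/lnd/lnwire"

type gossipSyncer struct {
	// Capacity 1: at most one GossipTimestampRange may be pending.
	timestampRangeQueue chan *lnwire.GossipTimestampRange
	quit                chan struct{}
}

// queueTimestampRange enqueues the message without ever blocking the
// peer's read loop, dropping it if one is already pending.
func (g *gossipSyncer) queueTimestampRange(msg *lnwire.GossipTimestampRange) {
	select {
	case g.timestampRangeQueue <- msg:
	default:
		// Queue full: drop and log a warning rather than stalling.
	}
}

// timestampRangeHandler is the dedicated goroutine that drains the queue
// and applies the gossip filter asynchronously.
func (g *gossipSyncer) timestampRangeHandler(
	apply func(*lnwire.GossipTimestampRange)) {

	for {
		select {
		case msg := <-g.timestampRangeQueue:
			apply(msg)
		case <-g.quit:
			return
		}
	}
}
```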
For any method that takes a context and contains a select that listens on
the subsystem's quit channel, we should also listen on the ctx, so that
we don't need to worry about whether the context is derived internally or
externally.
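A minimal sketch of that select pattern, with a hypothetical helper and error:
```go
package sketch

import (
	"context"
	"errors"
)

// waitForShutdown returns as soon as either the subsystem's quit channel
// closes or the caller's context is cancelled, whichever happens first.
func waitForShutdown(ctx context.Context, quit <-chan struct{}) error {
	select {
	case <-quit:
		return errors.New("subsystem exiting")
	case <-ctx.Done():
		return ctx.Err()
	}
}
```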
The `GossipSyncer` makes various calls to the `ChannelGraphTimeSeries`
interface which threads through to the graph DB. So in preparation for
threading context through to all the methods on that interface, we
update the GossipSyncer accordingly by passing contexts through.
Two `context.TODO()`s are added in this commit. They will be removed in
the upcoming commits.
In this commit, we revamp the old message based rate limiting. First, we
move to meter by bytes/s instead of messages/s. The old logic had an
error in that it limited groups of message replies, instead of each
message. With this new approach, we'll use the newly added
SerializedSize method to implement fine-grained bandwidth metering.
We need to pick two values: the burst rate and the msg bytes rate. The
burst rate is the maximum amount that can be sent in a given period of
time. We need to set this above 65 KB, the maximum message size,
otherwise the largest messages could never be sent. The bucket starts
with this many tokens (bytes). As those are depleted, the tokens are
refilled at the msg bytes rate.
As conservative values, we've chosen 200 KB as the burst rate, and 100
KB/s as the limit.
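A rough sketch of byte-based metering with those values, using golang.org/x/time/rate as one possible token-bucket implementation (whether the real code uses this exact package is an assumption here):
```go
package sketch

import (
	"context"

	"golang.org/x/time/rate"
)

const (
	// Refill rate: 100 KB/s worth of reply bytes.
	msgBytesPerSecond = 100 * 1024

	// Burst: the bucket holds at most 200 KB. This must exceed the
	// 65 KB maximum message size, or the largest messages could never
	// acquire enough tokens to be sent.
	msgBytesBurst = 200 * 1024
)

func newGossipLimiter() *rate.Limiter {
	return rate.NewLimiter(rate.Limit(msgBytesPerSecond), msgBytesBurst)
}

// sendMetered blocks until the bucket holds msgSize tokens (bytes), then
// invokes send. msgSize would come from the message's SerializedSize.
func sendMetered(ctx context.Context, l *rate.Limiter, msgSize int,
	send func() error) error {

	if err := l.WaitN(ctx, msgSize); err != nil {
		return err
	}
	return send()
}
```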
Here we introduce the access manager, which has caches that determine
the access control status of our peers. Peers that have had their
funding transaction confirm with us are protected. Peers that only have
pending-open channels with us are granted temporary access, which can
be revoked. The rest of the peers are granted restricted access.
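A rough sketch of the three access tiers as a decision function; the names are illustrative, not the actual accessman API:
```go
package sketch

type peerAccess uint8

const (
	// accessRestricted: no channels with us.
	accessRestricted peerAccess = iota

	// accessTemporary: only pending-open channels; may be revoked.
	accessTemporary

	// accessProtected: at least one confirmed funding transaction.
	accessProtected
)

// classifyPeer derives a peer's access tier from its channel state with us.
func classifyPeer(hasConfirmedChan, hasPendingOpenChan bool) peerAccess {
	switch {
	case hasConfirmedChan:
		return accessProtected
	case hasPendingOpenChan:
		return accessTemporary
	default:
		return accessRestricted
	}
}
```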
Previously, we would set the state of the syncer after sending the msg,
which has the following flow,
1. In state `queryNewChannels`, we send the msg `QueryShortChanIDs`.
2. Once the msg is sent, we change to state `waitingQueryChanReply`.
But there's no guarantee the remote won't reply in between the two
steps. When that happens, our syncer would still be in state
`queryNewChannels`, causing the following error,
```
[ERR] DISC gossiper.go:873: Process query msg from peer [Alice] got unexpected msg *lnwire.ReplyShortChanIDsEnd received in state queryNewChannels
```
To fix it, we now make sure the state is updated before sending the msg.
The mocked peer used here blocks on `sendToPeer`, which is not the
behavior of `SendMessageLazy` of `lnpeer.Peer`. To reflect reality, we
now make sure `sendToPeer` is non-blocking in the tests.
This commit fixes the following race,
1. syncer(state=syncingChans) sends QueryChannelRange
2. remote peer replies ReplyChannelRange
3. ProcessQueryMsg fails to process the remote peer's msg as its state
is neither waitingQueryChanReply nor waitingQueryRangeReply.
4. syncer marks its new state waitingQueryChanReply, but too late.
The historical sync will now fail, and the syncer will be stuck in this
state. What's worse, it cannot forward channel announcements to other
connected peers now, as it will skip broadcasting during the initial
graph sync.
This is now fixed to make sure the following two steps are atomic,
1. syncer(state=syncingChans) sends QueryChannelRange
2. syncer marks its new state waitingQueryChanReply.
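A minimal sketch of combining the state transition and the send under one lock; identifiers are illustrative rather than the syncer's actual fields:
```go
package sketch

import "sync"

type syncerState uint8

const (
	queryNewChannels syncerState = iota
	waitingQueryChanReply
)

type gossipSyncer struct {
	mu    sync.Mutex
	state syncerState
}

// sendQueryShortChanIDs updates the state and sends the message under the
// same lock, so any reply handled afterwards observes the new state.
func (g *gossipSyncer) sendQueryShortChanIDs(send func() error) error {
	g.mu.Lock()
	defer g.mu.Unlock()

	g.state = waitingQueryChanReply
	return send()
}

// syncState reads the state under the same lock used by the sender above.
func (g *gossipSyncer) syncState() syncerState {
	g.mu.Lock()
	defer g.mu.Unlock()

	return g.state
}
```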
In preparation for adding the new ChannelAnnouncement2 message along
with a ChannelAnnouncement interface, we rename the existing message to
ChannelAnnouncement1.
This commit hooks up the banman to the gossiper:
- peers that are banned and don't have a channel with us will get
disconnected until they are unbanned.
- peers that are banned and have a channel with us won't get
disconnected, but we will ignore their channel announcements until
they are no longer banned. Note that this only disables gossip of
announcements to us and still allows us to open channels to them.
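A rough sketch of those two rules as predicates; purely illustrative, not the banman's actual API:
```go
package sketch

// shouldDisconnect: banned peers that have no channel with us are
// disconnected until they are unbanned.
func shouldDisconnect(banned, hasChanWithUs bool) bool {
	return banned && !hasChanWithUs
}

// shouldIgnoreChanAnn: channel announcements from banned peers are ignored
// for as long as the ban lasts, even if the peer stays connected.
func shouldIgnoreChanAnn(banned bool) bool {
	return banned
}
```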
ChanUpdate timestamps are now restricted so that they cannot be more
than two weeks into the future. Moreover, channels whose timestamps in
the ReplyChannelRange msg are both either too far in the past or too
far in the future are not queried.
This commit also fixes the unit tests.
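A minimal sketch of the two-week bound, with illustrative names; the past-horizon parameter is an assumption:
```go
package sketch

import "time"

// maxFutureSkew bounds how far into the future a ChannelUpdate timestamp
// may lie before it is rejected.
const maxFutureSkew = 2 * 7 * 24 * time.Hour

// timestampSane reports whether ts is neither more than two weeks in the
// future nor older than the given past horizon.
func timestampSane(ts time.Time, pastHorizon time.Duration) bool {
	now := time.Now()
	if ts.After(now.Add(maxFutureSkew)) {
		return false
	}
	return !ts.Before(now.Add(-pastHorizon))
}
```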
In order to prep for allowing TLV extensions for the `ReplyChannelRange`
and `QueryChannelRange` messages, we'll need to remove the struct
embedding as is. If we don't remove this, then we'll attempt to decode
TLV extensions from both the embedded and outer struct.
All relevant call sites have been updated to reflect this minor change.
Rather than performing this call in the SyncManager, we give each
gossipSyncer the ability to mark the first sync completed. This permits
pinned syncers to contribute towards the rpc-level synced_to_graph
value, allowing the value to be true after the first pinned syncer or
regular syncer completes. Unlike regular syncers, pinned syncers can
proceed in parallel, possibly decreasing the waiting time if consumers
rely on this field before proceeding to load their application.
A pinned syncer is an ActiveSyncer that is configured to always remain
active for the lifetime of the connection. Pinned syncers do not count
towards the total NumActiveSyncers count, whose members are rotated
periodically. This feature allows nodes to more tightly synchronize
their routing tables by ensuring they are always receiving gossip from a
distinguished subset of peers.
Modifies syncer.replyChanRangeQuery method to use the LastBlockHeight
method on the query. LastBlockHeight safely calculates the ending
block height and prevents an overflow of start_block + num_blocks.
Prior to this change, query messages that had a start_block +
num_blocks that overflows uint32_max would return zero results in the
reply message.
Tests are added to verify the fix and ensure proper start and end values
are supplied to the channel graph filter.
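A rough sketch of the kind of saturating calculation described, not necessarily the exact LastBlockHeight implementation:
```go
package sketch

import "math"

// lastBlockHeight returns the final block covered by a query starting at
// firstBlock and spanning numBlocks, clamping at the uint32 maximum so
// that start_block + num_blocks cannot wrap around.
func lastBlockHeight(firstBlock, numBlocks uint32) uint32 {
	end := uint64(firstBlock) + uint64(numBlocks)
	if end > math.MaxUint32 {
		return math.MaxUint32
	}
	return uint32(end)
}
```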
If the provided ChainHash in a QueryChannelRange message does not match
that of our current chain, then we should send a blank response, rather
than reply with channels for the wrong chain.
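A minimal sketch of that blank response, assuming the post-refactor lnwire field layout; Complete is left at zero since we maintain no information for the queried chain:
```go
package sketch

import (
	"github.com/btcsuite/btcd/chaincfg/chainhash"
	"github.com/lightningnetwork/lnd/lnwire"
)

// blankReplyForUnknownChain builds a ReplyChannelRange that carries zero
// channels, echoing the queried range but for a chain we do not track.
func blankReplyForUnknownChain(query *lnwire.QueryChannelRange,
	ourChain chainhash.Hash) (*lnwire.ReplyChannelRange, bool) {

	if query.ChainHash == ourChain {
		// Known chain: the caller should build a real reply instead.
		return nil, false
	}

	return &lnwire.ReplyChannelRange{
		ChainHash:        query.ChainHash,
		FirstBlockHeight: query.FirstBlockHeight,
		NumBlocks:        query.NumBlocks,
		// Complete stays 0: we maintain no info for this chain.
	}, true
}
```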
We move from our legacy way of interpreting ReplyChannelRange messages
which was incorrect. Previously, we'd rely on the Complete field of the
ReplyChannelRange message to determine when our peer had sent all of
their replies. Now, we properly adhere to the specification by
interpreting the block ranges of these messages as intended.
Due to the large number of nodes deployed with the previous method, we
still detect when we are communicating with one of them and fall back to
the old interpretation, so that we remain able to sync with them for
backwards compatibility.
In order to properly adhere to the spec, when handling a
QueryChannelRange message, we must reply with a series of
ReplyChannelRange messages that, when taken together, cover the
entirety of the requested block range.
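Under the new interpretation, a peer is done replying once the block ranges of its replies collectively reach the end of the queried range; a rough sketch of that check:
```go
package sketch

// queryAnswered reports whether a reply covering
// [replyFirst, replyFirst+replyNum) completes a query covering
// [queryFirst, queryFirst+queryNum), i.e. whether the replies received
// so far now reach the end of the requested range.
func queryAnswered(queryFirst, queryNum, replyFirst, replyNum uint32) bool {
	queryEnd := uint64(queryFirst) + uint64(queryNum)
	replyEnd := uint64(replyFirst) + uint64(replyNum)

	return replyEnd >= queryEnd
}
```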
In this commit we fix a bug in `lnd` that could cause other
implementations which implement a strict version of the spec to
disconnect when trying to sync their channel graph using the gossip
query feature. Before this commit, we would embed the original
`QueryChannelRange` request in the response, causing some clients to
reject the response as the `FirstBlockHeight` and `NumBlocks` fields
would be identical for each chunk of the response.
In order to remedy this, we now properly set these two fields with each
returned chunk. Note that even after this commit, we keep our existing
behavior surrounding the `Complete` field as is. Otherwise, current
`lnd` clients which rely on this field (rather than the two
aforementioned fields) wouldn't be able to properly detect when a set of
responses to their query was "complete".
Partially fixes #3728.
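A rough sketch of stamping each chunk with its own range; the helper and its parameters are illustrative:
```go
package sketch

// chunkFields computes the FirstBlockHeight and NumBlocks for one reply
// chunk. nextHeight is where coverage must resume (the query's first
// block for the first chunk, or just past the previous chunk otherwise),
// and lastHeight is the highest block referenced by this chunk's short
// channel IDs. Taken together, the chunks then tile the queried range
// instead of all repeating the query's own first_blocknum and
// number_of_blocks.
func chunkFields(nextHeight, lastHeight uint32) (first, num uint32) {
	first = nextHeight
	num = lastHeight - nextHeight + 1

	return first, num
}
```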