This commit fixes a critical bug where the channelGraphSyncer goroutine
would enter an endless loop when context cancellation or peer disconnect
errors occurred during the syncingChans or queryNewChannels states.
The root cause was that state handler functions (handleSyncingChans and
synchronizeChanIDs) did not return errors to the main goroutine loop.
When these functions encountered fatal errors like context cancellation,
they would log the error and return early without changing the syncer's
state. This caused the main loop to immediately re-enter the same state
handler, encounter the same error, and loop indefinitely while spamming
error logs.
The fix makes error handling explicit by having state handlers return
errors. The main channelGraphSyncer loop now checks these errors and
exits cleanly when fatal errors occur. We return any error (not just
context cancellation) because fatal errors can manifest in multiple
forms: context.Canceled, ErrGossipSyncerExiting from the rate limiter,
lnpeer.ErrPeerExiting from Brontide, or network errors like connection
closed. This approach matches the error handling pattern already used in
other goroutines like replyHandler.
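As a rough sketch of the pattern (with simplified state and handler names, not the exact lnd identifiers), the loop now looks something like this:

```
package sketch

import (
	"context"
	"log"
)

type syncerState int

const (
	syncingChans syncerState = iota
	waitingQueryRangeReply
)

type gossipSyncer struct {
	state syncerState
}

// handleSyncingChans now returns fatal errors (context cancellation, peer
// exiting, network errors) instead of only logging them.
func (g *gossipSyncer) handleSyncingChans(ctx context.Context) error {
	// ... send QueryChannelRange, etc. (elided).
	return ctx.Err()
}

// channelGraphSyncer is the main state-machine loop. Any error from a state
// handler terminates the goroutine rather than letting it re-enter the same
// failing state forever.
func (g *gossipSyncer) channelGraphSyncer(ctx context.Context) {
	for {
		switch g.state {
		case syncingChans:
			if err := g.handleSyncingChans(ctx); err != nil {
				log.Printf("gossip syncer exiting: %v", err)
				return
			}
			g.state = waitingQueryRangeReply

		case waitingQueryRangeReply:
			// ... remaining states elided.
			return
		}
	}
}
```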
In preparation for adding a NodeAnnouncement2 struct along with a
NodeAnnouncement interface, this commit renames the existing
NodeAnnouncement struct to NodeAnnouncement1.
In this commit, we update ApplyGossipFilter to leverage the new
iterator-based UpdatesInHorizon method. The key innovation here is using
iter.Pull2 to create a pull-based iterator that allows us to check if
any updates exist before launching the background goroutine.
This approach provides several benefits over the previous implementation.
First, we avoid the overhead of launching a goroutine when there are no
updates to send, which was previously unavoidable without materializing
the entire result set. Second, we maintain lazy loading throughout the
sending process, only pulling messages from the database as they're
needed for transmission.
The implementation uses Pull2 to peek at the first message, determining
whether to proceed with sending updates. If updates exist, ownership of
the iterator is transferred to the goroutine, which continues pulling
and sending messages until exhausted. This design ensures memory usage
remains bounded regardless of the number of updates being synchronized.
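A hedged sketch of the peek-then-hand-off shape (not the exact lnd code; the updates iterator and send helper here are illustrative stand-ins):

```
package sketch

import "iter"

type wireMsg interface{}

// applyGossipFilter peeks at the first update via a pull iterator. If no
// updates exist, we return without launching a goroutine; otherwise
// ownership of the iterator moves to the goroutine, which keeps pulling
// lazily until the sequence is exhausted.
func applyGossipFilter(
	updates iter.Seq2[wireMsg, error],
	send func(wireMsg) error) error {

	next, stop := iter.Pull2(updates)

	firstMsg, err, ok := next()
	if !ok {
		// No updates in the horizon: nothing to send.
		stop()
		return nil
	}
	if err != nil {
		stop()
		return err
	}

	go func() {
		defer stop()

		if send(firstMsg) != nil {
			return
		}
		for {
			msg, err, ok := next()
			if !ok || err != nil {
				return
			}
			if send(msg) != nil {
				return
			}
		}
	}()

	return nil
}
```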
In this commit, we complete the iterator conversion work started in PR
10128 by threading the iterator pattern through to the higher-level
UpdatesInHorizon method. This change converts the method from returning
a fully materialized slice of messages to returning a lazy iterator that
yields messages on demand.
The new signature uses iter.Seq2 to allow error propagation during
iteration, eliminating the need for a separate error return value. This
approach enables callers to handle errors as they occur during iteration
rather than failing upfront.
The implementation now lazily processes channel and node updates,
yielding them as they're generated rather than accumulating them in
memory. This maintains the same ordering guarantees (channels before
nodes) while significantly reducing memory pressure when dealing with
large update sets during gossip synchronization.
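A hedged sketch of the new shape and how a caller consumes it (the real method lives on lnd's graph layer and takes its horizon from the remote peer's gossip filter; the types here are simplified):

```
package sketch

import (
	"iter"
	"time"
)

type wireMsg interface{}

// UpdatesInHorizon lazily yields channel updates first and node
// announcements second, surfacing errors through the second element of the
// pair instead of a separate return value.
func UpdatesInHorizon(start, end time.Time) iter.Seq2[wireMsg, error] {
	return func(yield func(wireMsg, error) bool) {
		// 1. Stream channel updates within [start, end] (elided).
		// 2. Then stream node announcements within [start, end].
		// Each item is yielded as it is read, so the full result set
		// is never materialized in memory.
	}
}

// A caller handles errors as they occur during iteration:
func consume(send func(wireMsg) error, start, end time.Time) error {
	for msg, err := range UpdatesInHorizon(start, end) {
		if err != nil {
			return err
		}
		if err := send(msg); err != nil {
			return err
		}
	}
	return nil
}
```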
In this commit, we add a new atomic bool to only permit a single gossip
backlog goroutine per peer. If we get a new request that needs a backlog
while we're still processing the other, then we'll drop that request.
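A minimal sketch of the single-backlog guard (field and method names are illustrative rather than the exact lnd ones):

```
package sketch

import "sync/atomic"

type syncer struct {
	backlogActive atomic.Bool
}

// maybeStartBacklog starts at most one backlog goroutine at a time; if one
// is already running, the new request is dropped and false is returned.
func (s *syncer) maybeStartBacklog(run func()) bool {
	if !s.backlogActive.CompareAndSwap(false, true) {
		// A backlog sync is already in flight; drop this request.
		return false
	}

	go func() {
		defer s.backlogActive.Store(false)
		run()
	}()

	return true
}
```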
In this commit, we introduce an asynchronous processing queue for
GossipTimestampRange messages in the GossipSyncer. This change addresses
a critical issue where the gossiper could block indefinitely when
processing timestamp range messages during periods of high load.
Previously, when a peer sent a GossipTimestampRange message, the
gossiper would synchronously call ApplyGossipFilter, which could block
on semaphore acquisition, database queries, and rate limiting. This
synchronous processing created a bottleneck where the entire peer
message processing pipeline would stall, potentially causing timeouts
and disconnections.
The new design adds a timestampRangeQueue channel with a capacity of 1
message and a dedicated goroutine for processing these messages
asynchronously. This follows the established pattern used for other
message types in the syncer. When the queue is full, we drop messages
and log a warning rather than blocking indefinitely, providing graceful
degradation under extreme load conditions.
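A sketch of the queue-and-drop pattern (handler and field names are illustrative):

```
package sketch

import "log"

type gossipTimestampRange struct{ firstTimestamp, timestampRange uint32 }

type syncer struct {
	timestampRangeQueue chan *gossipTimestampRange
	quit                chan struct{}
}

func newSyncer() *syncer {
	s := &syncer{
		// A capacity of 1 buffers a single pending filter update.
		timestampRangeQueue: make(chan *gossipTimestampRange, 1),
		quit:                make(chan struct{}),
	}
	go s.timestampRangeHandler()
	return s
}

// queueTimestampRange never blocks the peer's read loop: if the queue is
// already full, the message is dropped with a warning.
func (s *syncer) queueTimestampRange(msg *gossipTimestampRange) {
	select {
	case s.timestampRangeQueue <- msg:
	default:
		log.Println("WRN: dropping GossipTimestampRange, queue is full")
	}
}

// timestampRangeHandler applies the gossip filter asynchronously so slow
// semaphore, database, or rate-limiter work cannot stall the pipeline.
func (s *syncer) timestampRangeHandler() {
	for {
		select {
		case msg := <-s.timestampRangeQueue:
			_ = msg // ApplyGossipFilter(msg) in the real code.
		case <-s.quit:
			return
		}
	}
}
```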
For any method that takes a context and contains a select that listens on
the system's quit channel, we should also listen on the ctx. That way we
don't need to worry about whether the context is derived internally or
supplied externally.
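A minimal sketch of the pattern: a select that previously waited only on the quit channel now also honours the caller's context (the error name follows the one mentioned earlier in this series):

```
package sketch

import (
	"context"
	"errors"
)

var errGossipSyncerExiting = errors.New("gossip syncer exiting")

type syncer struct {
	quit chan struct{}
}

// waitUntilDone exits on either the caller's context or the subsystem quit
// channel, whichever fires first.
func (s *syncer) waitUntilDone(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-s.quit:
		return errGossipSyncerExiting
	}
}
```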
The `GossipSyncer` makes various calls to the `ChannelGraphTimeSeries`
interface which threads through to the graph DB. So in preparation for
threading context through to all the methods on that interface, we
update the GossipSyncer accordingly by passing contexts through.
Two `context.TODO()`s are added in this commit. They will be removed in
the upcoming commits.
In this commit, we revamp the old message based rate limiting. First, we
move to meter by bytes/s instead of messages/s. The old logic had an
error in that it limited groups of message replies, instead of each
message. With this new approach, we'll use the newly added
SerializedSize method to implement fine grained bandwidth metering.
We need to pick two values: the burst rate and the msg bytes rate. The
burst rate is the maximum amount that can be sent in a given period of
time. We need to set this above 65 KB, the maximum message size, otherwise
a maximum-size message could never be sent. The bucket starts with this
many tokens (bytes).
As those are depleted, the amount of tokens is refilled at the msg
bytes rate.
As conservative values, we've chosen 200 KB as the burst rate, and 100
KB/s as the limit.
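A hedged sketch of byte-based metering using golang.org/x/time/rate with the values above; the real code derives the message size from the newly added SerializedSize method:

```
package sketch

import (
	"context"

	"golang.org/x/time/rate"
)

const (
	// The burst must exceed the ~65 KB maximum wire message size,
	// otherwise a maximum-size message could never be sent.
	byteBurst = 200 * 1024 // 200 KB

	// Tokens (bytes) are refilled at this rate once depleted.
	byteRate = 100 * 1024 // 100 KB/s
)

var msgBytesLimiter = rate.NewLimiter(rate.Limit(byteRate), byteBurst)

// waitForBandwidth blocks until msgSize byte tokens are available, or the
// context is cancelled.
func waitForBandwidth(ctx context.Context, msgSize int) error {
	return msgBytesLimiter.WaitN(ctx, msgSize)
}
```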
Here we introduce the access manager, which has caches that determine the
access control status of our peers. Peers whose funding transaction has
confirmed with us are protected. Peers that only have pending-open
channels with us have temporary access, which can be revoked. The rest of
the peers are granted restricted access.
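A simplified sketch of the three access levels described above (identifiers are illustrative, not the exact lnd names):

```
package sketch

type peerAccessStatus int

const (
	// peerStatusRestricted: peers with no channels to us get restricted
	// access by default.
	peerStatusRestricted peerAccessStatus = iota

	// peerStatusTemporary: peers with only pending-open channels; this
	// access can be revoked.
	peerStatusTemporary

	// peerStatusProtected: peers whose funding transaction has confirmed
	// with us.
	peerStatusProtected
)
```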
Previously, we would set the state of the syncer after sending the msg,
which has the following flow,
1. In state `queryNewChannels`, we send the msg `QueryShortChanIDs`.
2. Once the msg is sent, we change to state `waitingQueryChanReply`.
But there's no guarantee the remote won't reply back in between the two
steps. When that happens, our syncer would still be in state
`queryNewChannels`, causing the following error,
```
[ERR] DISC gossiper.go:873: Process query msg from peer [Alice] got unexpected msg *lnwire.ReplyShortChanIDsEnd received in state queryNewChannels
```
To fix it, we now make sure the state is updated before sending the msg.
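A minimal sketch of the reordering (identifiers are simplified stand-ins, not the exact lnd code):

```
package sketch

type syncerState int

const (
	queryNewChannels syncerState = iota
	waitingQueryChanReply
)

type queryShortChanIDs struct{ shortChanIDs []uint64 }

type syncer struct {
	state      syncerState
	sendToPeer func(msg any) error
}

// querySCIDs transitions to waitingQueryChanReply *before* sending, so a
// ReplyShortChanIDsEnd that races the send is still accepted.
func (g *syncer) querySCIDs(ids []uint64) error {
	g.state = waitingQueryChanReply
	return g.sendToPeer(&queryShortChanIDs{shortChanIDs: ids})
}
```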
The mocked peer used here blocks on `sendToPeer`, which is not the
behavior of `SendMessageLazy` on `lnpeer.Peer`. To reflect reality, we now
make sure `sendToPeer` is non-blocking in the tests.
This commit fixes the following race,
1. syncer(state=syncingChans) sends QueryChannelRange
2. remote peer replies ReplyChannelRange
3. ProcessQueryMsg fails to process the remote peer's msg as its state
is neither waitingQueryChanReply nor waitingQueryRangeReply.
4. syncer marks its new state waitingQueryRangeReply, but too late.
The historical sync will then fail, and the syncer will be stuck in this
state. What's worse, it cannot forward channel announcements to other
connected peers, as it will skip the broadcasting during the initial
graph sync.
This is now fixed to make sure the following two steps are atomic,
1. syncer(state=syncingChans) sends QueryChannelRange
2. syncer marks its new state waitingQueryRangeReply.
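A minimal sketch of making the send and the state transition atomic with respect to ProcessQueryMsg; the mutex-based guard here is illustrative:

```
package sketch

import "sync"

type syncerState int

const (
	syncingChans syncerState = iota
	waitingQueryRangeReply
)

type syncer struct {
	mu    sync.Mutex
	state syncerState
	send  func(msg any) error
}

// sendQueryChannelRange sends the query and marks the new state under one
// lock, so an incoming ReplyChannelRange can never be processed while the
// syncer still appears to be in syncingChans.
func (g *syncer) sendQueryChannelRange(query any) error {
	g.mu.Lock()
	defer g.mu.Unlock()

	if err := g.send(query); err != nil {
		return err
	}
	g.state = waitingQueryRangeReply
	return nil
}
```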
In preparation for adding the new ChannelAnnouncement2 message along
with a ChannelAnnouncement interface, we rename the existing message to
ChannelAnnouncement1.
This commit hooks up the banman to the gossiper:
- peers that are banned and don't have a channel with us will get
disconnected until they are unbanned.
- peers that are banned and have a channel with us won't get
disconnected, but we will ignore their channel announcements until
they are no longer banned. Note that this only disables gossip of
announcements to us and still allows us to open channels to them.
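A simplified sketch of the gossiper-side policy described above (helper names are illustrative):

```
package sketch

type remotePeer struct {
	banned     bool
	hasChannel bool
}

// shouldDisconnect: banned peers that share no channel with us are
// disconnected until they are unbanned.
func shouldDisconnect(p remotePeer) bool {
	return p.banned && !p.hasChannel
}

// shouldIgnoreChanAnn: banned peers that do have a channel with us stay
// connected, but their channel announcements are ignored while the ban
// lasts. This only disables gossip towards us; we can still open channels
// to them.
func shouldIgnoreChanAnn(p remotePeer) bool {
	return p.banned && p.hasChannel
}
```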
ChanUpdate timestamps are now restricted so that they cannot be more than
two weeks in the future. Moreover, channels whose timestamps in the
ReplyChannelRange msg are both either too far in the past or too far in
the future are not queried. The unit tests are fixed accordingly.
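A minimal sketch of the two timestamp checks; the two-week bound follows the text, and the helper names are illustrative:

```
package sketch

import "time"

const maxFutureSkew = 14 * 24 * time.Hour // two weeks

// validUpdateTimestamp rejects ChanUpdate timestamps that lie more than
// two weeks in the future.
func validUpdateTimestamp(ts, now time.Time) bool {
	return !ts.After(now.Add(maxFutureSkew))
}

// shouldQueryChannel skips a channel from a ReplyChannelRange when both of
// its timestamps are either too far in the past or too far in the future.
func shouldQueryChannel(ts1, ts2, now time.Time, horizon time.Duration) bool {
	tooOld := func(t time.Time) bool { return t.Before(now.Add(-horizon)) }
	tooNew := func(t time.Time) bool { return t.After(now.Add(maxFutureSkew)) }

	if tooOld(ts1) && tooOld(ts2) {
		return false
	}
	if tooNew(ts1) && tooNew(ts2) {
		return false
	}
	return true
}
```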
In order to prep for allowing TLV extensions for the `ReplyChannelRange`
and `QueryChannelRange` messages, we'll need to remove the struct
embedding as is. If we don't remove this, then we'll attempt to decode
TLV extensions from both the embedded and outer struct.
All relevant call sites have been updated to reflect this minor change.
Rather than performing this call in the SyncManager, we give each
gossipSyncer the ability to mark the first sync completed. This permits
pinned syncers to contribute towards the rpc-level synced_to_graph
value, allowing the value to be true after the first pinned syncer or
regular syncer completes. Unlike regular syncers, pinned syncers can
proceed in parallel possibly decreasing the waiting time if consumers
rely on this field before proceeding to load their application.
A pinned syncer is an ActiveSyncer that is configured to always remain
active for the lifetime of the connection. Pinned syncers do not count
towards the total NumActiveSyncers, which are rotated periodically.
This feature allows nodes to more tightly synchronize their routing
tables by ensuring they are always receiving gossip from a distinguished
subset of peers.
Modifies syncer.replyChanRangeQuery method to use the LastBlockHeight
method on the query. LastBlockHeight safely calculates the ending
block height and prevents an overflow of start_block + num_blocks.
Prior to this change, query messages that had a start_block +
num_blocks that overflows uint32_max would return zero results in the
reply message.
Tests are added to cover the fix and ensure proper start and end values
are supplied to the channel graph filter.
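A hedged sketch of the overflow-safe calculation the text describes (the real method is defined on the lnwire query message; the types here are simplified):

```
package sketch

import "math"

type queryChannelRange struct {
	firstBlockHeight uint32
	numBlocks        uint32
}

// lastBlockHeight computes the final block covered by the query in uint64
// space and clamps at the uint32 maximum, so first_block + num_blocks can
// no longer wrap around and yield an empty reply.
func (q *queryChannelRange) lastBlockHeight() uint32 {
	end := uint64(q.firstBlockHeight) + uint64(q.numBlocks) - 1
	if end > math.MaxUint32 {
		return math.MaxUint32
	}
	return uint32(end)
}
```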
If the provided ChainHash in a QueryChannelRange message does not match
that of our current chain, then we should send a blank response, rather
than reply with channels for the wrong chain.
We move from our legacy way of interpreting ReplyChannelRange messages
which was incorrect. Previously, we'd rely on the Complete field of the
ReplyChannelRange message to determine when our peer had sent all of
their replies. Now, we properly adhere to the specification by
interpreting the block ranges of these messages as intended.
Due to the large number of nodes deployed with the previous method, we
still maintain and detect when we are communicating with them, such that
we are still able to sync with them for backwards compatibility.
In order to properly adhere to the spec, when handling a
QueryChannelRange message, we must reply with a series of
ReplyChannelRange messages, that when consumed together cover the
entirety of the block range requested.
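A hedged sketch of covering the requested span with several replies; the real code sizes each reply by how many short channel IDs fit in a single message, so the fixed blocksPerReply used here is purely illustrative:

```
package sketch

type replyChannelRange struct {
	firstBlockHeight uint32
	numBlocks        uint32
}

// coverQueryRange splits the queried span into replies whose block ranges,
// taken together, cover the entire [firstHeight, firstHeight+numBlocks)
// window requested by the peer.
func coverQueryRange(firstHeight, numBlocks,
	blocksPerReply uint32) []replyChannelRange {

	if blocksPerReply == 0 {
		blocksPerReply = 1
	}

	var replies []replyChannelRange
	for covered := uint32(0); covered < numBlocks; {
		span := blocksPerReply
		if numBlocks-covered < span {
			span = numBlocks - covered
		}
		replies = append(replies, replyChannelRange{
			firstBlockHeight: firstHeight + covered,
			numBlocks:        span,
		})
		covered += span
	}
	return replies
}
```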