Last week on Massa – team working hard for new episode release

andrei

Hello everyone !

Last week we finished readying testnet 14 for release. This testnet contains a lot of changes (see https://github.com/massalabs/massa/pull/2862/files ) and is therefore highly experimental.
We went through unit testing and labnet testing last week. We corrected bugs until the system appeared stable, and released TEST.14.0 on Friday.
On Saturday around 2:20am UTC the majority of nodes crashed.

Initial investigations on Saturday and Sunday showed that there were at least three problems:
(1) there were deserialization errors on the peer lists being received, which caused node disconnections
(2) nodes rejected many incoming blocks as following invalid PoS draws, and banned the nodes that sent them
(3) the proof-of-stake subsystem reported that a required cycle was not drawn (which triggered the crash)
None of those had been observed on the labnet.

Follow-up investigations this Monday showed that problem (2) was caused by a faulty proof-of-stake bootstrap edge case that we are still working on repairing fully (see https://github.com/massalabs/massa/pull/2997/files ).

We also discovered that the peer list deserialization error (1) happened when a peer batch with the maximum number of peers (exactly 10000) was being received and deserialized.
This was corrected promptly: https://github.com/massalabs/massa/pull/2994
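
To illustrate how a failure can appear only at exactly the maximum batch size, here is a minimal Python sketch. It is not the Massa code: the wire format (a u32 count followed by one u32 per peer), the function names and the `inclusive_limit` flag are all hypothetical, and the off-by-one comparison is only one plausible cause of such a boundary error, not necessarily what the linked fix addressed.

```python
import struct

MAX_PEERS = 10_000  # maximum peer batch size mentioned in the post


def serialize_peers(peers):
    """Hypothetical wire format: u32 count, then one u32 per peer."""
    return struct.pack(f">I{len(peers)}I", len(peers), *peers)


def deserialize_peers(data, inclusive_limit=True):
    """Decode a peer batch, rejecting batches above the size limit.

    inclusive_limit=False mimics a hypothetical off-by-one check that
    also rejects a batch of exactly MAX_PEERS peers.
    """
    (count,) = struct.unpack_from(">I", data, 0)
    too_many = count > MAX_PEERS if inclusive_limit else count >= MAX_PEERS
    if too_many:
        raise ValueError(f"peer batch too large: {count}")
    return list(struct.unpack_from(f">{count}I", data, 4))
```

With the strict check, a batch of 9999 peers decodes fine while a batch of exactly 10000 is rejected, which is the kind of case that tests using smaller peer lists would not exercise.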

The 10000 peers originally came from people keeping their old peer lists from previous testnets.

The mystery was: if deserializing the 10000 peers failed, how did any of the bootstrap nodes get 10000 peers in the first place ?

This was due to an interaction with problem (2), which caused some nodes to ban some of their peers and end up announcing complementary sets of 9999 peers or fewer.
That way, the bootstrap nodes ended up with 10000 peers each, and disconnected from each other due to (1) whenever they announced those 10000 peers to each other.
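
The way complementary announcements can add up to a full list can be sketched as follows. This is an illustration under assumptions, not the actual peer-management logic: `merge_announced` and its cap handling are hypothetical, and peers are represented as plain integers.

```python
MAX_PEERS = 10_000  # maximum peer list size mentioned in the post


def merge_announced(known, announced, cap=MAX_PEERS):
    """Hypothetical merge: add announced peers to the known set, up to cap."""
    merged = set(known)
    for peer in announced:
        if len(merged) >= cap:
            break
        merged.add(peer)
    return merged


all_peers = set(range(MAX_PEERS))
# Two nodes that each banned a different peer announce 9999 peers each,
# so both announcements stay below the failing size.
announce_a = all_peers - {0}
announce_b = all_peers - {1}

bootstrap = merge_announced(merge_announced(set(), announce_a), announce_b)
print(len(bootstrap))  # 10000
```

Each announcement alone deserializes without trouble, but the receiving node ends up holding 10000 peers and will then send a full-size batch itself.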

The complete loss of synchrony between nodes caused each node to end up isolated. Because an isolated node only observed the blocks it produced itself, block finality became too slow from its point of view (longer than 1 cycle).

Final cycles not being available in a timely manner prevented the proof-of-stake system from performing draws, which triggered (3).

This complicated interaction of edge cases in apparently unrelated modules explains why we did not see this in the labnet tests.

This Tuesday will be dedicated to correcting the remaining problems with the proof-of-stake bootstrap.

TEST.14.1 will be released as soon as everything is corrected.
Of course, scoring will be postponed until everything is ready.

Thank you for your understanding !
