
Fix tokenizer with DDP #2633

Merged
Adel-Moumen merged 39 commits into speechbrain:develop from TParcollet:fix_tokenizer_ddp on Aug 9, 2024

@TParcollet
Collaborator

torch.distributed.barrier has a behavior that may cause our tokenizer (and maybe other parts of the toolkit) to fail. A barrier is not tied to the place in the code where it is called: any call to torch.distributed.barrier lets every other process that is waiting at a torch.distributed.barrier continue, even if that process is waiting at a different call site. This means that a torch.distributed.barrier inside an if statement that only the main process enters can lift a barrier that the other processes reached elsewhere. This is what may happen with our tokenizer. An example is the following code:

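    import time

    import torch
    from speechbrain.utils.distributed import if_main_process

    # Assumes the process group is already initialized (e.g. a DDP run with 2 processes).
    if if_main_process():
        time.sleep(5)
        print("Barrier main")
        # Only the main process reaches this barrier...
        torch.distributed.barrier()
        time.sleep(5)
        print("End main process")
    else:
        print("First barrier second process")
        # ...yet it releases this one, so the line below prints
        # before "End main process".
        torch.distributed.barrier()
        print("End first barrier second process")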
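The same mismatch can be reproduced in a single process. The sketch below is only an analogy, not the toolkit's code: it assumes Python's `threading.Barrier` as a stand-in for `torch.distributed.barrier` (both count arrivals rather than call sites), and the function name `simulate_mismatched_barriers` and the result keys are hypothetical. A timeout is used so that the barrier without a partner fails instead of hanging.

```python
import threading


def simulate_mismatched_barriers(timeout=0.5):
    """Two threads play the roles of two ranks sharing one barrier."""
    barrier = threading.Barrier(2)      # stand-in for a 2-process group
    work_done = threading.Event()       # set when "main" finishes its work
    other_moved_on = threading.Event()  # set when the other rank is released
    result = {}

    def main_rank():
        # Barrier inside a main-only branch, as in the example above.
        barrier.wait()
        # Main is still "working": it finishes only after the other rank
        # has already been released.
        other_moved_on.wait(timeout)
        work_done.set()
        try:
            # The barrier that was meant to pair with the other rank.
            barrier.wait(timeout=timeout)
            result["main_second_barrier"] = "passed"
        except threading.BrokenBarrierError:
            # Nobody else arrives; without a timeout this would hang.
            result["main_second_barrier"] = "no partner"

    def other_rank():
        # Meant to block until main has finished its work.
        barrier.wait()
        result["other_saw_work_done"] = work_done.is_set()
        other_moved_on.set()

    threads = [threading.Thread(target=main_rank),
               threading.Thread(target=other_rank)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(simulate_mismatched_barriers())
```

The other rank is released by the main-only barrier before the work is done, and the barrier intended to synchronize both ranks is left without a partner.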

@TParcollet added the bug (Something isn't working) label on Aug 1, 2024

@pplantinga
Collaborator

We should avoid using torch.distributed in favor of our internal speechbrain distributed methods which should already handle most cases.

@TParcollet
Collaborator Author

@pplantinga that is not the case in this example. The SentencePiece class uses run_on_main, yet the problem can still happen.

Collaborator

@Adel-Moumen left a comment

lgtm.

@Adel-Moumen merged commit e88b9e8 into speechbrain:develop on Aug 9, 2024