Developers often store secrets inside comments. The current tokenization approach lets us identify a token as a comment, but we cannot look inside it and break its contents into semantic tokens.
Potential solutions:
- "Decomment" the code by removing comment markers such as "//" or "#". This is potentially a bad idea: comment text is not required to be syntactically valid in the given language, and usually is not. It may also break tokenization of the entire file.
- Perform a separate (regex-based) analysis of all comments found in the code after the tokenization stage.
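
A minimal sketch of the second option, under stated assumptions: it uses Python's standard `tokenize` module as a stand-in for the project's own tokenizer, and `SECRET_RE` and `find_secrets_in_comments` are hypothetical names with an illustrative pattern, not a vetted rule set. The source is tokenized normally and the regex is applied only to tokens already classified as comments.

```python
import io
import re
import tokenize

# Hypothetical pattern: a key-like name, then ':' or '=', then a value of 8+ characters.
SECRET_RE = re.compile(
    r"(?i)\b(password|passwd|secret|token|api[_-]?key)\b\s*[:=]\s*['\"]?([^\s'\"]{8,})"
)


def find_secrets_in_comments(source: str):
    """Return (line_number, key, value) for each suspicious match inside a comment."""
    findings = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Only comment tokens are inspected; code and string literals are skipped here.
        if tok.type != tokenize.COMMENT:
            continue
        for match in SECRET_RE.finditer(tok.string):
            findings.append((tok.start[0], match.group(1), match.group(2)))
    return findings


sample = (
    "import os\n"
    "# password = hunter2hunter2\n"
    "x = 1  # TODO: rotate api_key: 'AKIA1234567890'\n"
    "name = 'token = not_a_comment'\n"
)

for line_no, key, value in find_secrets_in_comments(sample):
    print(line_no, key, value)
```

Because the comment text is never re-tokenized as code, the file's token stream stays intact, which avoids the breakage risk of the "decomment" option. The trade-off is that detection quality depends entirely on the regex patterns used.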