Document data quality requirements and define rules for measuring quality. Create a reference for success and targets to keep the project in check along the way. Set statistical checks on the data, and set standards for quality control and completeness.
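As a minimal sketch of such a statistical check, the snippet below measures field completeness against a target; the field names and the 95% threshold are illustrative assumptions, not values from the text.

```python
# Sketch: a completeness check for one field of a record set.
# The "email" field and 0.95 target below are assumed examples.
def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records) if records else 0.0

def meets_target(records, field, target=0.95):
    """True if the field's completeness meets the quality target."""
    return completeness(records, field) >= target

records = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Alan", "email": ""},
    {"name": "Grace", "email": "grace@example.com"},
    {"name": "Edsger", "email": "e@example.com"},
]
print(completeness(records, "email"))  # 3 of 4 filled -> 0.75
print(meets_target(records, "email"))  # below the 0.95 target -> False
```

A real pipeline would run such checks per data set to sort the sets that meet the standard from those that need cleaning.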
Create a strategy. Outline a plan for data quality that supports ongoing operations and data management. Identify the data sets that already meet your quality standard and those that need to be cleaned, and identify possible solutions with a plan for implementation.
6. Removal of expressions: Textual data (usually speech transcripts) may contain human expressions such as [laughing], [crying], or [audience paused]. These expressions are usually not relevant to the content of the speech and should be removed.
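Step 6 can be sketched with a single regular expression; this assumes the expressions always appear in square brackets, as in the examples above.

```python
import re

def remove_expressions(text):
    """Strip bracketed expressions like [laughing] from a transcript."""
    # Remove any [...] span, then collapse the leftover double spaces.
    cleaned = re.sub(r"\[[^\]]*\]", "", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(remove_expressions("Thank you [Audience paused] for coming."))
# -> "Thank you for coming."
```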
7. Split attached words: Words run together without spaces (common in hashtags and camel-cased text) should be split into their normal forms using simple rules and regex.
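One simple rule for step 7 is to insert a space at each lowercase-to-uppercase boundary; "ILoveData" below is an assumed example, not one from the text.

```python
import re

def split_attached(word):
    """Split a camel-cased word into its normal space-separated form."""
    # Break before an uppercase letter that follows a lowercase letter,
    # or before an uppercase letter that starts a new word after a run
    # of capitals (e.g. the "P" in "HTMLParser").
    return re.sub(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", word)

print(split_attached("ILoveData"))  # -> "I Love Data"
```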
8. Slang lookup: Slang terms should be transformed into standard words so that the free text is consistent.
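Step 8 is usually just a dictionary replacement; the lookup table below is a tiny assumed sample, not a standard resource.

```python
# Assumed sample slang table; a real pipeline would load a larger one.
SLANG = {"luv": "love", "gr8": "great", "thx": "thanks"}

def expand_slang(text):
    """Replace each slang token with its standard form, if known."""
    return " ".join(SLANG.get(tok.lower(), tok) for tok in text.split())

print(expand_slang("I luv this gr8 product"))
# -> "I love this great product"
```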
9. Standardizing words: Sometimes words are not in their proper form (for example, elongated words like "soooo good"). Simple rules and regular expressions can help resolve these cases.
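For step 9, a common rule is to collapse runs of three or more repeated letters; the examples here are assumed.

```python
import re

def standardize(word):
    """Collapse runs of 3+ repeated letters down to a single letter."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

print(standardize("amazzzzing"))  # -> "amazing"
```

Collapsing to a single letter can over-correct words that legitimately contain doubles, so a real pipeline would typically follow this with a dictionary check.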
10. Removal of URLs: URLs and hyperlinks in text data such as reviews should be removed.
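Step 10 can be handled with a simple pattern; this sketch assumes links start with http(s):// or www. and contain no spaces.

```python
import re

def remove_urls(text):
    """Drop URLs and hyperlinks from review text."""
    cleaned = re.sub(r"(https?://\S+|www\.\S+)", "", text)
    # Collapse the double spaces left behind by the removal.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(remove_urls("Great phone, see https://example.com/review for details"))
# -> "Great phone, see for details"
```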
11. Grammar checking: Grammar checking is largely learning-based: models are trained on large amounts of well-formed text and then used for grammar correction. Many online tools are available for this purpose.
12. Spelling correction: Misspellings are common in natural language text. Companies like Google and Microsoft have achieved a decent level of accuracy in automated spell correction. Algorithms such as Levenshtein distance and dictionary lookup, or other modules and packages, can be used to fix these errors.
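The Levenshtein-distance-plus-dictionary-lookup approach mentioned in step 12 can be sketched as follows; the word list is a tiny assumed sample.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(word, dictionary):
    """Return the dictionary word closest to `word` by edit distance."""
    return min(dictionary, key=lambda w: levenshtein(word, w))

words = ["quality", "complete", "standard", "correct"]
print(correct("qualty", words))  # -> "quality"
```

Production spell checkers add frequency weighting and candidate pruning on top of this basic idea, since scanning a full dictionary per word is slow.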