Fix Your Strings!
Strings are one of the most used types in databases; they can store pretty much any data and don’t enforce any rules on the inserted input. This talk showcases the risk of using strings and how to mitigate them to ensure data quality.
Abstract
Strings are one of the most used types in databases; they can store pretty much any data and don’t enforce any rules on the inserted input. Yet too much freedom sometimes leads to inconsistencies: is it Aivan or Aiven? Øyvind or Oyvind? Wine or Whine? These seemingly small differences can have bad side-effects, causing lookups to fail and incorrect aggregation results to be returned. Luckily all is not lost: PostgreSQL has some features that can help us make sense of the chaos.
In this talk you will learn what PostgreSQL has to offer: starting with pattern matching, passing by regular expressions, and ending with more advanced functionality exposed by the fuzzystrmatch and unaccent extensions. I’ll demonstrate what tools can help you fixing string inconsistencies and how to avoid making the same mistakes again in the future. This session is recommended for anyone who deeply cares about their (string) data quality.
Useful Links
- Array Functions
- Date Functions
- Pattern Matching
- Domain
- Email Domain
- Unaccent
- FuzzyStrMatch
- Aiven
- GitHub repository with the notebook