Stay clean, stay healthy, stay focused! Some words on data cleaning

Ever seen a dog barking purposefully up a big tree, little realizing that there are no monkeys or cats up there, and carrying on until he starts frothing? A big show of strength. But does it serve any need? Most likely none.

Something similar happens when you use unreliable data, whether for marketing or for collating information for reports. You could be way off the mark and very close to demise. The practice involves taking information from questionable sources with gaps such as wrong or missing entries, spam traps, and misspelt names and designations, and failing to take into account major changes on the ground such as mergers and winding-up.

Where a marketing or promotional strategy is planned on the basis of this kind of information, the failure of a campaign is the least that can happen. In the short run, it may mean difficulty in retaining customers and loss of face due to wrongly directed efforts. In the long run, it may lead to reduced business. Another fallout of this misdirected effort may be that standardized promotional e-mail is perceived as spam and ignored entirely or, if sent too often, results in complete blacklisting at the ISP’s end.

A scary thought indeed. What then is the solution?

Restraints and restrictions. Though these may be perceived as taking away liberties, they are the best bet against dirty data.

Some of the time-tested ways of going about it are:

  • Know what you are looking for, i.e. correctly formulated questions result in correct responses. Question formats should be such that you always get correct answers, for example by asking for an e-mail ID to be entered twice.
  • Set up processes such that one person’s entry gets checked by another, reducing the possibility of errors from the very outset. Additionally, use external, specialized agencies to recheck details.
  • Scanning for wrong or missing entries should be an ongoing process with regular audits every quarter, rather than an ad hoc activity.
  • When comparing sales data, see to it that it comes from reliable sources and is of reasonable size, i.e. neither so small as to be insignificant nor so large as to become unwieldy.
  • Sales, like all other economic activity, are affected by economic factors such as downturns and upturns. The data being viewed should take these factors into account so that it reflects the current situation.
  • Just because the result of an exercise differs from your perception, do not assume that the data itself is wrong. Accept it.
  • Typos and similar shortcomings can be rectified by putting people on the job to check and correct the data entry by entry.
  • To stop sending e-mails that bounce back, call up the party and get the correct details. This task can itself be farmed out to another entity on a contract basis.
  • The facility to unsubscribe should be taken seriously, and unsubscribe requests should be honored.
  • E-mail bounce-backs should be tracked such that more than one bounce from an address is taken as a warning not to persist.
  • Obey the CAN-SPAM rules.
  • Use social media and specialized sites such as LinkedIn to verify facts, whether an e-mail ID, a telephone number or details of the person being approached.
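
A few of the points above (entering an e-mail ID twice, treating repeated bounce-backs as a warning, and honoring unsubscribe requests) can also be enforced in software. What follows is only a rough Python sketch and not a finished tool: the function names, the loose address pattern and the threshold of one bounce are all illustrative assumptions, and it further assumes that addresses can safely be compared in lower case.

```python
import re
from collections import Counter

# Assumed threshold: more than one bounce is treated as a warning not to persist.
MAX_BOUNCES = 1

# Deliberately loose pattern (an assumption): it catches obvious typos only,
# not every invalid address.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_email(address):
    """Trim whitespace and lowercase so equivalent entries compare equal."""
    return address.strip().lower()


def confirm_email(first_entry, second_entry):
    """Return the normalized address if both entries match and look valid, else None."""
    first = normalize_email(first_entry)
    second = normalize_email(second_entry)
    if first != second or not EMAIL_PATTERN.match(first):
        return None
    return first


def build_send_list(contacts, bounce_log, unsubscribed):
    """Drop duplicates, unsubscribed addresses and addresses that bounced too often."""
    bounce_counts = Counter(normalize_email(a) for a in bounce_log)
    blocked = {normalize_email(a) for a in unsubscribed}
    send_list = []
    seen = set()
    for address in contacts:
        email = normalize_email(address)
        if email in seen or email in blocked:
            continue  # already listed, or the person asked to unsubscribe
        if bounce_counts[email] > MAX_BOUNCES:
            continue  # repeated bounces: stop sending and verify by other means
        seen.add(email)
        send_list.append(email)
    return send_list
```

Here `confirm_email` would sit behind a sign-up form that asks for the address twice, while `build_send_list` would run before each mailing. Addresses dropped for repeated bounces are the natural candidates for the phone call suggested in the list.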

The above list gives only a few time-tested ways of cleaning up dirty data; they are by no means the only ones. Your own work in the field can bring the best insights into how to improve the data being used.

Happy sweeping!
