Oz du Soleil is an Excel developer, blogger, trainer, co-host of ExcelTV, raconteur, Excel MVP, and co-author of the soon-to-be-published Guerrilla Data Analysis 2nd Ed. (with Mr. Excel, Bill Jelen). He is committed to empowering individuals to manage their own data.
Tell us your data story: How did you first become interested in working with data, and what path did you take to where you are today?
I was working at a place where customers kept calling with problems.
- A certification wasn’t awarded
- Exam completions hadn’t been reported to the regulating body
I sensed the real stress these people were under and the upheaval this was causing for them. I got fed up and started asking questions and digging. The culprits turned out to be bad data and old reports.
Problem: Why didn’t Sam get his certificate of completion?
- People had 3 years to finish a curriculum.
- Sam started the curriculum 5 years ago and stopped.
- 2 years ago he started again with the very first course and did finish.
- The database report that listed the completions was written to start at the date of the first course and look forward 3 years. It wasn’t written to start from today and look backward 3 years.
- Duplicate records would cause a completion to be missed by the report.
This is where even more reality sets in:
Updating the completion report would require an expensive consultant and take a long time. So, I started working with data dumps. Raw rows and columns of data that I would clean, and massage until I had an accurate list of completions. It took 3 days every month just for that.
At the time, I was a master of sorting, filtering, IF statements and manually coloring cells. I had no clue about VLOOKUP, pivot tables, SUMIF, conditional formatting, none of that. But my results were always right. Incredibly slow, but always correct.
Eventually, my director, Troy Berry, recognized that I had a skill for snooping out problems. He moved me around to various positions until I was basically his data guy. That led to other roles where there were data-related problems.
I loved it. I didn’t even think of this as a skill. I saw it as helping customers and solving fun riddles.
After all these years, for me, it’s always been about the people who are reliant on data being accurate. I am quick to say that I don’t care if a person uses Excel. If they manage data with crayon on a wall and payroll is met, and hotel rooms are cleaned on schedule, and the fleet of vehicles have accurate maintenance records, and warehouse inventory is forecast accurately, then fine. Use the crayon on a wall.
In 2008, a bunch of us were laid off and I did various things, including found a nonprofit, but Excel would always show up. So, I’ve been a consultant working with entities that have data issues.
I started my blog several years ago because I needed web presence. It started with Excel tips and tricks. But that bored me, and I realized that being analyst requires more than tips and tricks. It can mean having the mettle to spend 3 days cleaning data. It means having people question your results and not getting defensive; it means being the unknown superhero for a whole lot of people who’ll never know that you were key in steering their professional life away from a catastrophe.
What excites you most in the world of data literacy?
- The PASS Conference
- The Excel MVP Program
- The things that Johann Odu and John Persico are doing with ModelOff
All of the things that heighten Excel’s visibility and capabilities, and responds to the people who’ve been writing Excel’s obituary for the past 20 years. Also, I had a chance to meet with Excel developers last year for 3 hours. WOW! They really were interested in what I had to say, and in the work that I shared with them. One thing that has me excited about PASS is this inclusion of more tools than SQL. I’m interested in learning more, meeting other developers and seeing where our worlds overlap and there they can work together. You’re lowering a drawbridge across the moats between our castles. And ExcelTV is bringing fun and swagger! All good stuff!
As an Excel MVP, what one new formula would you tell someone who is new to working with data to put at the top of his/her to-learn list this year?
Whether you’re working on data structure, business intelligence, or analytics, focus on the business story and the part enhanced through your efforts.
Someone who’s new to working with data. New feature/function/formula: TABLES! Tables keep your data tight, the table references make formulas easier to read.Tables aren’t used enough.
What are businesses or analysts that ignore the value of Excel as a data-prep tool missing out on?
Let’s separate this:
- Businesses that don’t know what Excel can do
- Businesses that have decided that Excel is not the solution for them
- Businesses that ignore Excel’s ability to help them
The ones that ignore Excel’s value can end up working really hard or doing nothing. Whatever is supposed to be going on, it ain’t going on. Grant-writing, web analysis, keeping track of good customers, maintaining appropriate inventory levels. Not happening.
But here’s what I’ve seen. Businesses that consciously ignore Excel will explore a lot of options as if looking for the silver bullet to throw their money at. In those situations, what’s really weak is the People or Processes in the People – Processes – Tools trinity.
There’s something bigger going on.
What’s your favorite tip for cleaning data? How about your top pet-peeve?
Favorite tip: Slow down and develop a preparation strategy. Be clear of the starting point, the ending point, and what’s at stake. Look. A victory is a victory. If you can stay on a bull for 8 seconds, it doesn’t matter if you made it look easy or did it hanging off to one side. So, rather than using a huge formula and getting lost in parentheses and mysterious errors, step back and verbally articulate what the goal is, and possible steps. Then slowly work do it.
With data cleansing, this is critical because, let’s say you just extracted data from a pdf. Those things are a mess. Stop. Loot for patterns. See if there’s stuff that you can delete right away. See if the data that needs to be cleaned can be broken up into groups.
Pet peeve: Having the wrong person in ownership of the data. I could mention things like poor spreadsheet development or uncontrolled cleansing efforts. But let’s think about People – Processes – Tools. Messy data inside unestablished processes DEMANDS the right person who’s ready to get up on that bull and ride for the full 8 seconds. People who need step-by-step instructions aren’t the people who are going to hunt down 10 misspellings for ‘Chamomile’ and help you decide whether or not to end the product line.
You work with and educate many “de facto analysts”: folks who find themselves managing data without a lot of experience. What knowledge or skills do you focus on first and foremost?
a) Check data quality before starting analysis. Uncover:
- Multiple records
- Incomplete data
- Data that you just don’t need and can get it out of your way
- Single fields that need to be split out
b) Understanding how the data got bad in the first place: did it go wrong at the data entry point? Is the report bad? Is the bad data the result of a one-time copy-paste, or is it baked into the overall process and needs to be addressed?
c) Understanding or digging into the unexpected
- After you’ve developed a relationship with the data, and notice that something is weird, do whatever it takes to either verify that the numbers are weird but correct, or something is wrong.
- This requires an understanding of the entire process:
- What other data can I get to compare?
- Who can I ask?
d) Accept that data is never 100% clean. Determine what’s “clean enough.” Don’t keep fiddling around making miniscule improvements and miss critical deadlines.
Who is your data hero and why?
That’s a tough one because it can be answered in so many ways. I say: Phuong Ly and Claudia Núñez
Here’s the deal. There are folks whom I have huge admiration for. Mike Girvin, Bill Jelen are at the top. At ModelOff last year, I told Bill that I wanted to be like him: travelling the world, sharing a passion and keeping data clean. And I have huge respect for the Excel developers at Microsoft. They are working on the best software ever, and listen to the complaining about what it’s still not doing.
Phuong Ly and Claudia Núñez are my data heroes. They’ve been travelling the US and Mexico doing hackathons where journalists and data people are brought together from Friday morning to Sunday night to pick a project to work on. Their focus is immigration but here’s the point to get:
They saw that journalists could write warm stories that have very little real-world impact. A warm story will move a few citizens. But to get politicians, corporations, foundations and big money involved, there’s got to be data that tells a compelling story, and the journalists needs to understand where questions about the data might come up. They need greater data literacy.
On the other side, the data people aren’t in the streets and getting a sense of what matters to people and what’s at stake … and presenting data it in a way that moves the right people.
So, Claudia and Phong got MacArthur money and other sponsors to support them in getting out in the world, hosting these hackathons and bringing data folks and journalists closer to speaking each other’s language.
There are lots of people who need something like this. Public school teachers are at the mercy of the folks who possess and understand the data. And let me share one last interesting thing:
I talked with a journalist who had fought for months to get data from her local public school system. When they finally sent it, the data seemed to purposefully be a raw mess. Well ... she got the data. Now, who’s going to get up on that bull and ride the full 8 seconds?
Learn more with Oz: Don’t miss the chance to meet Oz and attend his breakout session, The Pain and Glory of Data Preparation at the PASS Business Analytics Conference in Santa Clara, CA, April 20-22. In the meantime, you can catch him interviewing Denise McInerney, VP of Marketing for PASS, on