A data lake is a large storage repository and processing engine. Data lakes provide “massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs”.

The term was coined by James Dixon, chief technology officer of Pentaho. Dixon initially used the term to contrast with “data mart”, a smaller repository of interesting attributes extracted from the raw data. He wrote: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” Dixon argued that data marts have several inherent problems and that data lakes are the optimal solution to them.