List of datasets for machine-learning research

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in the field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.^[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.^[2]^[3]^[4]^[5]

Many organizations, including governments, publish and share their datasets. Based on their licenses, the datasets are classified as open data or non-open data.

Datasets from various governmental bodies are presented in List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as an open API. They are organized into various types and subtypes.

List of categories used to sort datasets

Type	Subtypes
Specific category	Finance, Economics, Commerce, Societal, Health, Academy, Sports, Food, Agriculture, Travel, Geospatial, Political, Consumer, Transport, Logistics, Environmental, Real-Estate, Legal, Entertainment, Energy, Hospitality
Scope	Supranational Union, National, Subnational, Municipality, Urban, Rural
Language	Mandarin Chinese, Spanish, English, Arabic, Hindi, Bengali
Type	Tabular, Graph, Text, Image, Sound, Video
Usage	Training, validating, and testing
File-Formats	CSV, JSON, XML, KML, GeoJSON, Shapefile, GML
Licenses	Creative-Commons, GPL, Other Non-Open data licenses
Last-Updated	Last-Hour, Last-Day, Last-Week, Last-Month, Last-Year
File-Size	Minimum, Maximum, Range
Status	Verified, In-Preparation, Deactivated (or Deprecated)
Number of records	100s, 1000s, 10000s, 100000s, Millions
Number of variables	Less than 10, 10s, 100s, 1000s, 10000s
Services	Individual, Aggregation
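
As a rough illustration of how such types and subtypes can be used to narrow a catalog, the sketch below filters a small in-memory list of dataset records by facet. The `DatasetRecord` class, its field names, the `filter_catalog` helper, and the three catalog entries are all hypothetical; they are not taken from any particular portal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry; fields loosely follow the table's rows."""
    name: str
    category: str     # e.g. "Health", "Transport"
    data_type: str    # e.g. "Tabular", "Graph", "Image"
    file_format: str  # e.g. "CSV", "GeoJSON"
    license: str      # e.g. "Creative-Commons"
    n_records: int    # number of records in the dataset


def filter_catalog(records, **facets):
    """Return the records whose attributes equal every requested facet value."""
    unknown = set(facets) - set(DatasetRecord.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown facet(s): {sorted(unknown)}")
    return [
        record for record in records
        if all(getattr(record, key) == value for key, value in facets.items())
    ]


# Made-up entries, for illustration only.
CATALOG = [
    DatasetRecord("clinic-visits", "Health", "Tabular", "CSV", "Creative-Commons", 12_000),
    DatasetRecord("road-network", "Transport", "Graph", "GeoJSON", "Creative-Commons", 450_000),
    DatasetRecord("crop-photos", "Agriculture", "Image", "JSON", "Other", 8_500),
]

matches = filter_catalog(CATALOG, license="Creative-Commons", data_type="Tabular")
print([record.name for record in matches])
```

Exact-match filtering keeps the sketch short; a real portal would typically also support ranges (for file size or record count) and free-text search.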

Data portals are classified by their type of license. Portals based on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.

List of open data portals

Portal-Name	License	List of Installations of the Portal	Typical Usages
Comprehensive Knowledge Archive Network (CKAN)	AGPL	https://ckan.github.io/ckan-instances/ https://github.com/sebneu/ckan_instances/blob/master/instances.csv	Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
DKAN	GPL	https://getdkan.org/community	Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
Dataverse	Apache	https://dataverse.org/installations https://dataverse.org/metrics	Data Management Solution for Research Institutes
DSpace	BSD	https://registry.lyrasis.org/	Data Management Solution for Research Institutes
OpenML	BSD	https://www.openml.org/search?type=data&sort=runs&status=active	Data Management Solution to share datasets, algorithms, and experiment results through APIs
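
Several of these portals can be queried programmatically. The sketch below shows one way this might look for CKAN, whose Action API provides a `package_search` endpoint under `/api/3/action/`. The portal address `https://portal.example.org`, the helper names, and the sample response are assumptions made for illustration; the response is hand-written to resemble the general shape of a search result and is not real data. No network request is made.

```python
import json
from urllib.parse import urlencode


def package_search_url(base_url, query, rows=10):
    """Build a CKAN Action API `package_search` URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def resources_by_format(response_text, wanted_format):
    """Collect (dataset name, resource URL) pairs for one file format."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("the portal reported an unsuccessful request")
    pairs = []
    for dataset in payload["result"]["results"]:
        for resource in dataset.get("resources", []):
            # The format field may be missing or empty, so fall back to "".
            if (resource.get("format") or "").upper() == wanted_format.upper():
                pairs.append((dataset["name"], resource["url"]))
    return pairs


# Hand-written sample shaped like a search response (hypothetical content).
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "air-quality-2020", "resources": [
                {"format": "CSV", "url": "https://portal.example.org/files/air.csv"},
                {"format": "JSON", "url": "https://portal.example.org/files/air.json"},
            ]},
            {"name": "bus-stops", "resources": [
                {"format": "GeoJSON", "url": "https://portal.example.org/files/stops.geojson"},
            ]},
        ],
    },
})

print(package_search_url("https://portal.example.org/", "air quality", rows=5))
print(resources_by_format(SAMPLE_RESPONSE, "csv"))
```

In practice the URL would be fetched with an HTTP client and the response body passed to `resources_by_format`; the other portals in the table expose their own, different APIs.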

List of portals suitable for multiple types of applications

Academic Torrents	https://academictorrents.com
Amazon Datasets	https://registry.opendata.aws/
Awesome Public Datasets Collection	https://github.com/awesomedata/awesome-public-datasets
data.world	https://data.world/datasets/machine-learning
Datahub – Core Datasets	https://datahub.io/docs/core-data
DataONE	https://www.dataone.org/
DataPortals	https://dataportals.org/
Datasetlist.com	https://www.datasetlist.com
Global Open Data Index – Open Knowledge Foundation	https://index.okfn.org/ Archived 25 May 2020 at the Wayback Machine
Google Dataset Search	https://datasetsearch.research.google.com/
Hugging Face	https://huggingface.co/docs/datasets/
IBM's Data Asset Exchange	https://developer.ibm.com/exchanges/data/
Jupyter – Tutorial Data	https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.doc
Kaggle	https://www.kaggle.com/datasets
Machine learning datasets	https://macgence.com/data-sets-and-cataloges/
Major Smart Cities with Open Data	https://rlist.io/l/major-smart-cities-with-open-data-portals
Microsoft Datasets	https://msropendata.com/datasets
Open Data Inception	https://opendatainception.io/
Opendatasoft	https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en
OpenDOAR	https://v2.sherpa.ac.uk/opendoar/
OpenML	https://www.openml.org/search?type=data
Papers with Code	https://paperswithcode.com/datasets
Penn Machine Learning Benchmarks	https://github.com/EpistasisLab/pmlb/tree/master/datasets
Public APIs	https://github.com/public-apis/public-apis
Registry of Open Access Repositories	http://roar.eprints.org/
REgistry of REsearch Data REpositories	https://www.re3data.org/
UCI Machine Learning Repository	http://mlr.cs.umass.edu/ml/ Archived 26 June 2020 at the Wayback Machine
Speech Dataset	https://www.shaip.com/offerings/speech-data-catalog/
Visual Data Discovery	https://visualdata.io/discovery

List of portals suitable for a specific subtype of applications

Image data

Main article: List of datasets in computer vision and image processing