Internal Site
- 36大数据
- About云开发: focused on big data and cloud technology
- MATLAB Chinese Forum (MATLAB中文论坛)
- SegmentFault
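- UDN Enterprise Internet Technology Community
- UML Software Engineering Organization - 火龙果 Software Engineering
- Alibaba Middleware Team Blog
- Concurrent Programming Network (并发编程网)
- Meituan-Dianping Technology Team
- Search Technology Blog - Taobao
- Taobao Database R&D Group
- Taobao Database R&D Group: Database Kernel Monthly Report
- Tencent Big Data: Big Data Academy
- Tencent Dev Developer Community
- Ctrip Technology Center: Technical Sharing
- Yunqi Community (云栖社区): Selected High-Quality Blog Posts
- Youzan Technology Team
- China Cloud Computing: Cloud Computing Resources and Exchange Center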
Foreign Site
- Airbnb
- AMPLab – UC Berkeley / Algorithms, Machines and People Lab
- Apache
- Apache Incubator
- arXiv.org e-Print archive
- AWS
- BuzzFeed
- CloudFlare
- Databricks: Making big data processing and analytics simple
- Databricks Community
- DB-Engines: Knowledge Base of Relational and NoSQL Database Management Systems
- Deep Learning
- DeveloperWorks: Technical Topics
- DMLC for Scalable and Reliable Machine Learning
- Dropbox
- Elastic · Revealing Insights from Data (Formerly Elasticsearch)
- Gartner: Technology Research
- GitHub
- Google Developers Blog
- HubSpot Product & Engineering
- ImageNet
- Kaggle: Your Home for Data Science
- Microsoft API and Reference Catalog
- Netflix
- Nginx
- O’Reilly
- OpenDNS
- Open Hub, the open source network
- Parse
- Pinterest
- Quora
- Spotify
- Stack Overflow
- TechNet Technical Resource Library
- TPC Benchmarks
- Uber
- viXra.org open e-Print archive
- W3Techs: World Wide Web Technology Surveys
- Yelp
Blog
- Alexander J. Smola
- Andrey Kurenkov’s Web World
- Colah’s blog
- Geoffrey E. Hinton
- Hellojavacases: WeChat official account website
- Java Performance Tuning Guide
- July: The Method of Structures, the Way of Algorithms (结构之法 算法之道)
- Jürgen Schmidhuber
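- Knight: focused on internet advertising, community platforms, resource download platforms, and computer graphics and imaging technology
- Lxw: Big Data Field (大数据田地), Hadoop/Hive/HBase/Spark/Java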
- MSDN Blogs
- Netkiller e-book series
- Sebastian Thrun
- Shai Shalev-Shwartz
- Yoshua Bengio
- Dong's Blog (董的博客): large-scale data processing, Hadoop, YARN, MapReduce, Spark, Mesos
- 花钱的年华
- 寒小阳: focused on machine learning and data mining
- 简单之美: big data
- Kaitao's Blog (开涛的博客)
- 李鼎 (哲良)
- 李社河: keep working hard, young man!
- 阮城锋
- 如果天空不死
- 星空: be someone who is prepared
- 小石头的码疯窝-ML DL CV
- 星星: algorithms, search, distributed systems
- 杨尚川: big data, search engines
- 张龙 (风中叶): exploring the unknown
Architecture
Project
- Accumulo
- Ambari
- Apex: Enterprise-grade unified stream and batch processing engine
- Atlas: Data Governance and Metadata framework for Hadoop
- Avro: a data serialization system.
- Alluxio: Open Source Memory Speed Virtual Distributed Storage
- Beam
- Canal: Alibaba's incremental subscription and consumption component for MySQL binlog
- Cassandra: Manage massive amounts of data, fast, without losing sleep
- Druid: A high-performance, column-oriented, distributed data store.
- Falcon: Feed management and data processing platform
- Flink: Scalable Batch and Stream Data Processing
- Flume
- Ganglia Monitoring System
- Generatedata: Random data generator in JS, PHP and MySQL
- Hadoop: An open-source software for reliable, scalable, distributed computing
- HAWQ: Apache Hadoop Native SQL
- HBase: A distributed, scalable, big data store.
- HBase Blog
- HBase ™ Reference Guide
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Hive Wiki
- Hue: Hadoop User Experience
- Impala: Real-time Query for Hadoop
- IPython
- Kafka: A high-throughput distributed messaging system.
- Keras: Deep Learning library for Theano and TensorFlow
- Kerberos: The Network Authentication Protocol
- KeystoneML
- Kaldi Speech Recognition Toolkit
- Kaldi: GitHub
- Knox: REST API Gateway for the Apache Hadoop Ecosystem
- Kudu: Fast Analytics on Fast Data
- Kylin: an open source Distributed Analytics Engine
- Lasagne: Lightweight library to build and train neural networks in Theano http://lasagne.readthedocs.org/
- MADlib: Big Data Machine Learning in SQL
- Mahout: Scalable machine learning and data mining
- Matplotlib: Python Plotting
- Metron: Real-Time Big Data Security
- Nginx: an HTTP and reverse proxy server
- NumPy
- Oozie: Apache Oozie Workflow Scheduler for Hadoop
- OpenCV: Open Source Computer Vision Library
- OpenCV: GitHub
- Otter: Alibaba's distributed database synchronization system (built for data centers located apart in China and the US)
- Pandas
- Parquet: a columnar storage format
- Pig
- Pivotal Extension Framework (PXF)
- Presto: Distributed SQL query engine for big data
- Quiver: Interactive convnet features visualization for Keras
- Ranger: Enable, monitor and manage comprehensive data security across the Hadoop platform
- Scikit-learn
- SciPy
- Shellinabox: Web based AJAX terminal emulator
- Slider: Dynamic YARN Applications
- Spark: Lightning-fast cluster computing
- Sqoop
- Storm: A free and open source distributed realtime computation system
- streamDM: Data Mining for Spark Streaming
- SymPy
- SystemML: Declarative Large-Scale Machine Learning
- Succinct: Enabling Queries on Compressed Data
- TensorFlow: an Open Source Software Library for Machine Intelligence
- TensorLayer: Deep Learning and Reinforcement Learning Library for TensorFlow
- Tesseract Open Source OCR Engine
- Tez
- TFLearn: Deep learning library featuring a higher-level API for TensorFlow
- Vert.x: a tool-kit for building reactive applications on the JVM.
- Zeppelin: A web-based notebook that enables interactive data analytics.
- ZooKeeper: A high-performance coordination service for distributed applications.
Resources
- Anaconda Cloud: Search
- Datahub
- Deep Learning Resources: NVIDIA Developer
- GitBook
- InfoQ Mini-Books
- Kaggle Datasets
- Python Extension Packages for Windows: Christoph Gohlke
- Read the Docs
- Seminar Schedule of Protein Structure Group
- Spark Packages
- Stanford Engineering Everywhere: Course
- Stanford University Explore Courses
- UCI Machine Learning Repository: Data Sets
Docs/Wiki
- ANACONDA Documentation
- Cloudera Product Documentation
- Conda documentation
- Deep Learning Tutorials
- Deep Learning: An MIT Press book
- Hortonworks Documentation
- MapR Documentation
- Pivotal Documentation
- Red Hat Product Documentation
- Spring Documentation
- Transwarp Download
- Stanford Machine Learning
- UFLDL Tutorial
Tools
Research/Reports
Conferences
- Computational Linguistics / NLP Conferences Calendar
- Conferences Archive: O’Reilly Media
- NIPS Conference
- Spark Summit: The premier event series of Apache Spark
- USENIX ATC Conferences
- USENIX NSDI Conferences
GitHub
- Deeplearning4j: Open-source, distributed deep learning for the JVM on Spark with GPUs
- Gliese581gg: Jinyoung Choi
- Iluwatar: Ilkka Seppälä
- Tobegit3hub: Storage(HBase, Ceph etc), IaaS(Linux, OpenStack etc) and Machine Learning with Kubernetes and TensorFlow.
- Ty4z2008: Jun Liao