Channel: Data Science Archive

Data Science Archive

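Voila is a new Jupyter plugin for visualization that turns a notebook directly into a standalone web app. I tried it and it works well, though it gets sluggish with large datasets. Personally I still prefer Plotly's Dash these days: it looks nicer, and the HTML it generates is easier to embed in other documentation pages such as Python Sphinx. Still, it's one more option: https://blog.jupyter.org/a-gallery-of-voil%C3%A0-examples-a2ce7ef99130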

A Gallery of Voilà Examples

Voilà is one of the latest addition to the Jupyter ecosystem, and can be used to turn notebooks into standalone applications and…

2.5K views小熊猫, edited 02:38

Data Science Archive

A resource list of data visualization style guidelines, carefully curated by the author. https://medium.com/data-visualization-society/style-guidelines-92ebe166addc
The spreadsheet is packed with detail, including how to choose the right chart and style. Some of the guides even specify the use case for each color. Quite interesting.
https://docs.google.com/spreadsheets/d/1F1gm5QLXh3USC8ZFx_M9TXYxmD-X5JLDD0oJATRTuIE/edit#gid=1679646668

What Are Data Visualization Style Guidelines?

Data visualization style guides are standards for formatting and designing representations of information.

3.8K views小熊猫, 18:18

Data Science Archive

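A friend on Twitter asked me today for introductory material on AutoML. I thought of this survey from 4Paradigm that I read a while back; they have been hosting the AutoML Challenge at KDD Cup/NIPS. It's also the best-written introductory survey I've read: submitted in November 2018 and last revised in January 2019, so it's both current and comprehensive. https://arxiv.org/abs/1810.13306
Much of the work in AutoML focuses on hyperparameter tuning. It often feels less vivid than CV/NLP work, but it has its own distinctive appeal and strong practical value.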

Automated Machine Learning: From Principles to Practices

Machine learning (ML) methods have been developing rapidly, but configuring and selecting proper methods to achieve a desired performance is increasingly difficult and tedious. To address this...

3.4K views小熊猫, 03:35

Data Science Archive

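Chip Huyen is a Vietnamese instructor at Stanford whom I like a lot. Her blog posts and courses are very high quality, and her projects are all interesting too. Here is her blog: https://huyenchip.com/
What I want to share this time, though, is her Twitter thread of miscellaneous notes on ML engineer/data scientist interviews. It's dense with information, and it looks like the thread will keep being updated until it's compiled into a book: https://twitter.com/chipro/status/1152077188985835521
The replies under each tweet are also well worth reading.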

3.6K views小熊猫, edited 06:29

Data Science Archive

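On the perennial question of parallelizing Pandas apply/groupby: I've always found dask awkward to use because of all the converting back and forth. I just found a simple, handy tool. https://github.com/nalepae/pandarallel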

GitHub - nalepae/pandarallel: A simple and efficient tool to parallelize Pandas operations on all available CPUs

A simple and efficient tool to parallelize Pandas operations on all available CPUs - nalepae/pandarallel

3.2K views小熊猫, 16:55

Data Science Archive

The experimental results for RAdam + LookAhead are still a bit odd; the picture isn't very clear. Here's an implementation using fastai. https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d
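
For readers unfamiliar with LookAhead, here is a minimal sketch of its update rule in plain Python. It is not the Ranger or fastai implementation: the inner optimizer is plain gradient descent standing in for RAdam, and the toy objective, function names, and default values are assumptions made for illustration.

```python
def sgd_step(params, grads, lr):
    """One plain gradient-descent step (a stand-in for the inner optimizer)."""
    return [p - lr * g for p, g in zip(params, grads)]


def lookahead_minimize(grad_fn, init, lr=0.1, k=5, alpha=0.5, outer_steps=50):
    """Minimize a function with LookAhead wrapped around plain gradient descent.

    grad_fn: returns the gradient (list of floats) at a given point.
    k: number of fast steps between slow-weight updates.
    alpha: interpolation factor for the slow weights.
    """
    slow = list(init)
    for _ in range(outer_steps):
        fast = list(slow)  # fast weights restart from the slow weights
        for _ in range(k):
            fast = sgd_step(fast, grad_fn(fast), lr)
        # Slow weights move a fraction alpha toward the final fast weights.
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


# Toy objective f(x, y) = (x - 3)^2 + 10 * (y + 1)^2 with its minimum at (3, -1).
def grad(p):
    x, y = p
    return [2 * (x - 3), 20 * (y + 1)]


result = lookahead_minimize(grad, init=[0.0, 0.0], lr=0.05)
```

The slow weights only ever move part of the way toward where the fast weights ended up, which is the stabilizing effect the combination with RAdam is meant to exploit.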

New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of both.

A new paper in part by the famed deep learning researcher Geoffrey Hinton introduces the LookAhead optimizer(“LookAhead optimizer: k steps…

3.9K views小熊猫, 02:31

Data Science Archive

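Last week, while building a CTR project from scratch, I systematically reviewed hyperparameter optimization methods and tools for models that are not complex DNNs, and found a new tool: Optuna https://github.com/pfnet/optuna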
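
As a point of reference for what a hyperparameter optimization tool automates, here is a minimal random-search sketch using only the standard library. It does not use Optuna's API; the search space, the toy loss, and every name below are hypothetical stand-ins for a real train-and-validate step.

```python
import random


def random_search(objective, space, n_trials=200, seed=0):
    """Sample each parameter uniformly from its (low, high) range; keep the best trial."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        value = objective(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value


# Hypothetical validation loss with its minimum at learning_rate=0.1, subsample=0.8.
def toy_loss(params):
    return (params["learning_rate"] - 0.1) ** 2 + (params["subsample"] - 0.8) ** 2


space = {"learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)}
best_params, best_value = random_search(toy_loss, space)
```

In practice the objective would train a model and return a validation metric, and a dedicated framework would replace the uniform sampling with a smarter strategy.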

GitHub - optuna/optuna: A hyperparameter optimization framework

A hyperparameter optimization framework. Contribute to optuna/optuna development by creating an account on GitHub.

3.6K views小熊猫, 17:13

Data Science Archive

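Recently, while doing dimensionality reduction with unsupervised methods, I found that for categorical features MCA sometimes works better than traditional PCA (though target encoding followed by ordinary PCA sometimes works well too). I've been using Prince for a while: simple to use, with good performance. https://github.com/MaxHalford/Prince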

GitHub - MaxHalford/prince: :crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA - GitHub - MaxHalford/prince: :crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, F...

3.4K views小熊猫, edited 10:59

Data Science Archive

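This evening a friend who saw the post asked me why categorical features need target encoding. It really depends on the model, but for the tree-based models commonly used on tabular data, OHE (one-hot encoding) tends to perform poorly. With xgboost you need to do target encoding yourself; CatBoost and LightGBM don't need it because categorical handling is built in. https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931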
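
To make target encoding concrete, here is a minimal sketch of a smoothed mean encoder in plain Python. The smoothing scheme, function names, and toy data are assumptions for illustration, not what any particular library does internally.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Return (mapping, global_mean), where mapping[cat] is a smoothed target mean."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Rare categories are pulled toward the global mean by the smoothing term.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def transform(categories, mapping, global_mean):
    """Encode categories; unseen categories fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Made-up data: a categorical feature and a binary click label.
cities = ["a", "a", "a", "b", "b", "c"]
clicked = [1, 1, 0, 0, 0, 1]
mapping, prior = fit_target_encoding(cities, clicked, smoothing=2.0)
encoded = transform(["a", "b", "c", "d"], mapping, prior)
```

Because the encoding uses the label, it should be fitted on training folds only and then applied to held-out rows; fitting on the full dataset leaks the target into the feature.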

Visiting: Categorical Features and Encoding in Decision Trees

When you have categorical features and you are using decision trees, you often have a major issue: how to deal with categorical features?

4.4K views小熊猫, 17:29

Data Science Archive

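On feature dimensionality reduction/selection: most EDA routines judge feature importance from the model's training loss. A simple and very effective alternative is feature permutation inside CV: shuffle the original features to get shadow features (you can also add some noise), then compare the two by z-score to judge importance, iterating to filter features. ESL II mentions this approach on page 593. The R package Boruta does this, and there's a Python version too: https://github.com/scikit-learn-contrib/boruta_py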
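
The shadow-feature idea can be sketched in a few lines of plain Python. This is a simplified illustration, not boruta_py: absolute Pearson correlation stands in for a model-based importance score, a raw hit count replaces the z-score comparison, and the data and names are made up.

```python
import random
import statistics


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_hits(features, target, n_iter=30, seed=0):
    """Count how often each real feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)  # shuffling breaks any link to the target
            shadow_scores.append(abs_corr(shadow, target))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, target) > threshold:
                hits[name] += 1
    return hits


data_rng = random.Random(42)
target = [data_rng.gauss(0, 1) for _ in range(200)]
features = {
    "signal": [y + data_rng.gauss(0, 0.1) for y in target],  # tied to the target
    "noise_1": [data_rng.gauss(0, 1) for _ in range(200)],   # unrelated
    "noise_2": [data_rng.gauss(0, 1) for _ in range(200)],   # unrelated
}
hits = shadow_hits(features, target)
```

A feature that keeps beating the best shadow is worth keeping, while one that wins only occasionally is indistinguishable from noise. A full implementation would recompute a model-based importance each round and apply a statistical test to the hit counts.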

GitHub - scikit-learn-contrib/boruta_py: Python implementations of the Boruta all-relevant feature selection method.

Python implementations of the Boruta all-relevant feature selection method. - scikit-learn-contrib/boruta_py

6.2K views小熊猫, 18:33

Data Science Archive

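I've spent this stretch interviewing and changing jobs. Now that things have mostly settled, I'll resume posting and collecting work-related material. Thanks to everyone who subscribes.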

3.4K views小熊猫, 09:31

Data Science Archive

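PTP is a framework from IBM built on PyTorch. Its coverage is fairly broad: CV and NLP are both included, there's a good range of pre-trained models, and it even ships many benchmarks and some ready-made higher-level model structures. Well suited to rapid experimentation. https://github.com/ibm/pytorchpipe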

GitHub - IBM/pytorchpipe: PyTorchPipe (PTP) is a component-oriented framework for rapid prototyping and training of computational…

PyTorchPipe (PTP) is a component-oriented framework for rapid prototyping and training of computational pipelines combining vision and language - GitHub - IBM/pytorchpipe: PyTorchPipe (PTP) is a co...

3.8K views小熊猫, 09:34

Data Science Archive

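A time series library that fills a gap in Python's tooling. It bundles nearly all the statistical methods you need and has the transforms you'd expect, such as Box-Cox, so you rarely need the lower-level DS packages. The API is scikit-learn compatible, and its usage and features match R's auto.arima, with more on top. https://github.com/alkaline-ml/pmdarima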
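
For reference, the Box-Cox transform mentioned above is simple enough to write out by hand. The sketch below is plain Python and is not the library's API; the function names and the sample values are made up for illustration.

```python
import math


def boxcox(x, lam):
    """Box-Cox transform of a strictly positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def inv_boxcox(y, lam):
    """Inverse of boxcox: map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


series = [112.0, 118.0, 132.0, 129.0, 121.0]  # made-up positive values
transformed = [boxcox(v, 0.5) for v in series]
restored = [inv_boxcox(v, 0.5) for v in transformed]
```

A typical workflow fits the model on the transformed series and applies the inverse to the forecasts to return them to the original scale.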

GitHub - alkaline-ml/pmdarima: A statistical library designed to fill the void in Python's time series analysis capabilities, including…

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function. - alkaline-ml/pmdarima

4.8K views小熊猫, edited 03:30

Data Science Archive

Code update for the Salesforce Research paper on commonsense reading comprehension from ACL 2019. It depends on huggingface's transformers, and the demo looks very good. https://github.com/salesforce/cos-e

GitHub - salesforce/cos-e: Commonsense Explanations Dataset and Code

Commonsense Explanations Dataset and Code. Contribute to salesforce/cos-e development by creating an account on GitHub.

5.5K views小熊猫, 07:36

Data Science Archive

The HuggingFace Transformers package has added several sets of Chinese pre-trained models, including BERT-wwm, RoBERTa-wwm, and XLNet, from HIT (Harbin Institute of Technology) and iFLYTEK. https://github.com/ymcui/Chinese-BERT-wwm/blob/master/README_EN.md

Chinese-BERT-wwm/README_EN.md at master · ymcui/Chinese-BERT-wwm

Pre-Training with Whole Word Masking for Chinese BERT（中文BERT-wwm系列模型） - ymcui/Chinese-BERT-wwm

6.0K views小熊猫, 08:32

Data Science Archive

A tokenizer from Huggingface, implemented in Rust. The speed really is impressive. https://github.com/huggingface/tokenizers

GitHub - huggingface/tokenizers: 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production - huggingface/tokenizers

5.9K views小熊猫, 07:52

Data Science Archive

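Several RNNs reimplemented at the CUDA level, with built-in Zoneout and DropConnect. I tried the Python and C++ APIs and they really are much faster, though the API doesn't expose many configurable parameters yet. https://github.com/lmnt-com/haste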
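
As a reminder of what Zoneout does, here is a minimal sketch of the rule applied to one hidden-state update, in plain Python. It is not haste's API; the function name, signature, and list-based state are assumptions for illustration.

```python
import random


def zoneout(h_prev, h_new, rate, training=True, rng=random):
    """Combine previous and candidate hidden states with Zoneout.

    During training each unit keeps its previous value with probability `rate`.
    At inference the two states are mixed by their expected proportions.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    if training:
        return [
            prev if rng.random() < rate else new
            for prev, new in zip(h_prev, h_new)
        ]
    return [rate * prev + (1.0 - rate) * new for prev, new in zip(h_prev, h_new)]


h_prev = [1.0, 2.0, 3.0]
h_new = [4.0, 5.0, 6.0]
train_state = zoneout(h_prev, h_new, rate=0.3, rng=random.Random(0))
eval_state = zoneout(h_prev, h_new, rate=0.5, training=False)
```

Unlike dropout, which zeroes units, Zoneout carries the previous value forward, so information and gradients can still flow through the units that were skipped.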

GitHub - lmnt-com/haste: Haste: a fast, simple, and open RNN library

Haste: a fast, simple, and open RNN library. Contribute to lmnt-com/haste development by creating an account on GitHub.

6.0K views小熊猫, 02:54

Data Science Archive

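Some thoughts on GBMs for tabular datasets: they are SOTA so far and should remain so for a while, but there will likely be more solutions that blend in shallow NNs to squeeze out further performance. An important reference is this one from two years ago: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629
Source: a tweet by CPMP and the discussion under it: https://twitter.com/JFPuget/status/1233379034425384960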
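
One simple way to blend a GBM with a shallow NN is a weighted average of their predicted probabilities, with the weight chosen on validation data. The sketch below is plain Python and is not taken from the linked solution; the labels, predictions, and function names are made up.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def blend(p_gbm, p_nn, weight):
    """Weighted average of two models' predicted probabilities."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_gbm, p_nn)]


def best_blend_weight(y_val, p_gbm, p_nn, steps=20):
    """Pick the GBM weight in [0, 1] that minimizes validation log loss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: log_loss(y_val, blend(p_gbm, p_nn, w)))


# Made-up validation labels and predictions from two hypothetical models.
y_val = [1, 0, 1, 0, 1, 0]
p_gbm = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]
p_nn = [0.7, 0.1, 0.8, 0.3, 0.9, 0.3]
w = best_blend_weight(y_val, p_gbm, p_nn)
blended = blend(p_gbm, p_nn, w)
```

The blend can only match or beat the better single model on the data used to pick the weight, so the chosen weight should be checked on a separate holdout.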

Porto Seguro’s Safe Driver Prediction

Predict if a driver will file an insurance claim next year.

7.1K views小熊猫, edited 05:51

Data Science Archive

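I expected this to be fluff, but as soon as I clicked through I discovered Phrasebank, which is great. Strongly recommended for current PhD students with writing needs. https://www.annaclemens.com/blog/16-free-tools-scientists-write-better-more-productively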

Researchers' Writing Academy -

19 Academic Writing Tools (that are completely free!)

Whether you're looking for an academic phrase finder, a collaborative academic writing software, a tool to stay focused on your writing or a writing project management app - I've got you covered!

7.8K views小熊猫, 17:16

Data Science Archive

I've recently picked time series back up and found a pretty good introductory text, so I'm about to start cramming. http://www.math.pku.edu.cn/teachers/lidf/course/atsa/atsanotes/html/_atsanotes/index.html

www.math.pku.edu.cn

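Lecture Notes on Applied Time Series Analysis

Lecture preparation material for the undergraduate course "Financial Time Series Analysis". Built with R's bookdown; the output format is bookdown::gitbook.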

8.1K views小熊猫, 03:48
