# Study Notes on the Altair Data Visualization Library

**Repository Path**: anson_xu/altair

## Basic Information

- **Project Name**: Study Notes on the Altair Data Visualization Library
- **Description**: No description available
- **Primary Language**: Python
- **License**: AGPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2021-01-04
- **Last Updated**: 2022-04-06

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

[Official website](https://altair-viz.github.io/)

# User Guide
## Specifying Data in Altair

Every top-level chart object (`Chart`, `LayerChart`, `VConcatChart`, `HConcatChart`, `RepeatChart`, and `FacetChart`) accepts a dataset as its first argument. The dataset can be specified in one of the following ways:
+ as a Pandas DataFrame
+ as a `Data` or related object (`UrlData`, `InlineData`, `NamedData`)
+ as a URL string pointing to a `json` or `csv` formatted file
+ as an object that supports the `__geo_interface__`

To specify data with a DataFrame:
```python
import altair as alt
import pandas as pd

data = pd.DataFrame({'x': ['A', 'B', 'C', 'D', 'E'],
                     'y': [5, 3, 6, 7, 2]})

alt.Chart(data).mark_bar().encode(
    x='x',
    y='y'
)
```
![1](./img/1.png) 

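When data is specified as a DataFrame, Altair determines the data type of each column automatically.  

To specify data with a `Data` object (here the encoding types are given explicitly):  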
```python
import altair as alt

data = alt.Data(values=[{'x': 'A', 'y': 5},
                        {'x': 'B', 'y': 3},
                        {'x': 'C', 'y': 6},
                        {'x': 'D', 'y': 7},
                        {'x': 'E', 'y': 2}])
alt.Chart(data).mark_bar().encode(
    x='x:O',  # specify ordinal data
    y='y:Q',  # specify quantitative data
)
```

To specify data by URL:  
```python
import altair as alt
from vega_datasets import data
url = data.cars.url

alt.Chart(url).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q'
)
```

![2](./img/2.png) 

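### Including Index Data
By design, Altair only accesses DataFrame columns, not the DataFrame index. Sometimes, though, relevant data lives in the index.  
To make the index available to the chart, use the `reset_index()` method to turn the index into a column:  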

```python
import numpy as np
import pandas as pd
rand = np.random.RandomState(0)

data = pd.DataFrame({'value': rand.randn(100).cumsum()},
                    index=pd.date_range('2018', freq='D', periods=100))
data.head()
```

```python
alt.Chart(data.reset_index()).mark_line().encode(
    x='index:T',
    y='value:Q'
)
```

![3](./img/3.png)

If the index has no `name` attribute set, the resulting column will be called `index`.  
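
The naming rule above can be checked directly in pandas. The following is a minimal sketch with made-up values; the index name `date` is an arbitrary choice for illustration, and `rename_axis` is just one way to set it.

```python
import pandas as pd

# Small frame with a DatetimeIndex that has no name.
df = pd.DataFrame(
    {'value': [1.0, 2.0, 3.0]},
    index=pd.date_range('2018-01-01', freq='D', periods=3),
)

# An unnamed index becomes a column called "index".
unnamed = df.reset_index()
print(list(unnamed.columns))

# Naming the index first gives the new column that name instead.
named = df.rename_axis('date').reset_index()
print(list(named.columns))
```

With a named index, the chart encoding would then refer to `date:T` rather than `index:T`.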

### Long-form vs Wide-form Data

+ wide-form data has one row per independent variable, with metadata recorded in the row and column labels.
+ long-form data has one row per observation, with metadata recorded within the table as values.

Altair's grammar works best with **long-form** data.  


**Wide-form data**   
```python
wide_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01'],
                          'AAPL': [189.95, 182.22, 198.08],
                          'AMZN': [89.15, 90.56, 92.64],
                          'GOOG': [707.00, 693.00, 691.48]})
```
Wide-form data has one row per independent variable.  

**Long-form data**  
```python
long_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01',
                                   '2007-10-01', '2007-11-01', '2007-12-01',
                                   '2007-10-01', '2007-11-01', '2007-12-01'],
                          'company': ['AAPL', 'AAPL', 'AAPL',
                                      'AMZN', 'AMZN', 'AMZN',
                                      'GOOG', 'GOOG', 'GOOG'],
                          'price': [189.95, 182.22, 198.08,
                                     89.15,  90.56,  92.64,
                                    707.00, 693.00, 691.48]})
print(long_form)
```

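Note that each row contains a single observation (the price) together with the metadata for that observation (the date and the company name). Importantly, the column and index labels no longer carry any useful metadata.   

With long-form data, the relevant data and metadata are stored in the table itself rather than in the row and column labels:  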
```python
alt.Chart(long_form).mark_line().encode(
  x='Date:T',
  y='price:Q',
  color='company:N'
)
```

![4](./img/4.png) 

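### Converting Between Long-form and Wide-form: Pandas
The DataFrame `melt` method converts wide-form data to long-form.  
The first argument to `melt` is the column or list of columns to treat as index variables; the remaining columns are combined into an indicator variable and a value variable, whose names can optionally be specified:   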
```python
wide_form.melt('Date', var_name='company', value_name='price')
```
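
To see what `melt` produces, here is a small self-contained sketch on a cut-down version of the table above. The reverse step with `pivot` is not covered in these notes; it is included only as a check that the round trip loses no data.

```python
import pandas as pd

wide_df = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01'],
                        'AAPL': [189.95, 182.22],
                        'AMZN': [89.15, 90.56]})

# Wide -> long: 'Date' stays as the identifier column; the remaining
# column labels become values in 'company', their cells go into 'price'.
long_df = wide_df.melt('Date', var_name='company', value_name='price')
print(long_df)

# Long -> wide again, to confirm that the two layouts hold the same data.
back = long_df.pivot(index='Date', columns='company', values='price').reset_index()
print(back)
```

Two dates and two companies give four rows in the long-form table, one per observation.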

### Converting Between Long-form and Wide-form: Fold Transform
```python
alt.Chart(wide_form).transform_fold(
    ['AAPL', 'AMZN', 'GOOG'],
    as_=['company', 'price']
).mark_line().encode(
    x='Date:T',
    y='price:Q',
    color='company:N'
)
```

### Generating Data
#### Sequence Generator
A numeric sequence can be generated with the `sequence()` function:  
```python
import altair as alt

# Note that the following generator is functionally similar to
# data = pd.DataFrame({'x': np.arange(0, 10, 0.1)})
data = alt.sequence(0, 10, 0.1, as_='x')

alt.Chart(data).transform_calculate(
    y='sin(datum.x)'
).mark_line().encode(
    x='x:Q',
    y='y:Q',
)
```
![5](./img/5.png) 
#### Graticule Generator
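Another type of data that is convenient to generate within the chart itself is the set of latitude/longitude lines on a geographic visualization, known as a graticule.   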

```python
import altair as alt

data = alt.graticule(step=[15, 15])

alt.Chart(data).mark_geoshape(stroke='black').project(
    'orthographic',
    rotate=[0, -45, 0]
)
```

![6](./img/6.png) 

#### Sphere Generator
```python
import altair as alt

sphere_data = alt.sphere()
grat_data = alt.graticule(step=[15, 15])

background = alt.Chart(sphere_data).mark_geoshape(fill='aliceblue')
lines = alt.Chart(grat_data).mark_geoshape(stroke='lightgrey')

alt.layer(background, lines).project('naturalEarth1')
```

![7](./img/7.png) 

## Encodings
In Altair, the mapping of visual properties to data columns is called an encoding, and it is usually expressed through the `Chart.encode()` method.  

For example, here we will visualize the cars dataset using four of the available encodings: `x` (the x-axis value), `y` (the y-axis value), `color` (the color of the marker), and `shape` (the shape of the point marker):   
```python
import altair as alt
from vega_datasets import data
cars = data.cars()

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
    shape='Origin'
)
```

![8](./img/8.png) 
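When the data is specified as a DataFrame, Altair automatically determines the appropriate data type for each encoding and creates suitable scales and legends to represent the data.  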

### Encoding Channels
**Position Channels** 
| Channel    | Altair Class     | Description                       |
| ------     | --------         | -------                           |
| x          | `X`              | The x-axis value                  |
| y          | `Y`              | The y-axis value                  |
| x2         | `X2`             | Second x value for ranges         |
| y2         | `Y2`             | Second y value for ranges         |
| longitude  | `Longitude`      | Longitude for geo charts          |
| latitude   | `Latitude`       | Latitude for geo charts           |
| longitude2 | `Longitude2`     | Second longitude value for ranges |
| latitude2  | `Latitude2`      | Second latitude value for ranges  |
| xError     | `XError`         | The x-axis error value            |
| yError     | `YError`         | The y-axis error value            |
| xError2    | `XError2`        | The second x-axis error value     |
| yError2    | `YError2`        | The second y-axis error value     |


**Mark Property Channels:**  
| Channel       | Altair Class         | Description                    |
| -------       | --------             | --------                       |
| color         | `Color`              | The color of the mark          |
| fill          | `Fill`               | The fill for the mark          |
| fillOpacity   | `FillOpacity`        | The opacity of the mark's fill |
| opacity       | `Opacity`            | The opacity of the mark        |
| shape         | `Shape`              | The shape of the mark          |
| size          | `Size`               | The size of the mark           |
| stroke        | `Stroke`             | The stroke of the mark         |
| strokeDash    | `StrokeDash`         | The stroke dash style          |
| strokeOpacity | `StrokeOpacity`      | The opacity of the line        |
| strokeWidth   | `StrokeWidth`        | The width of the line          |

**Text and Tooltip Channels** 
| Channel | Altair Class   | Description              |
| -----   | -----          | -----                    |
| text    | `Text`         | Text to use for the mark |
| key     | `Key`          | -                        |
| tooltip | `Tooltip`      | The tooltip value        |

**Hyperlink Channel** 
| Channel | Altair Class | Description          |
| ------  | ----------   | --------             |
| href    | `Href`       | Hyperlink for points |

**Level of Detail Channel:** 
| Channel | Altair Class | Description                     |
| ------  | -----------  | ----------                      |
| detail  | `Detail`     | Additional property to group by |

**Order Channels:** 

| Channel | Altair Class | Description                 |
| ------- | ------------ | -----------                 |
| order   | `Order`      | Sets the order of the marks |

**Facet Channels:** 

| Channel | Altair Class | Description                                    |
| ------- | ------------ | -----------                                    |
| column  | `Column`     | The column of a faceted plot                   |
| row     | `Row`        | The row of a faceted plot                      |
| facet   | `Facet`      | The row and/or column of a general faceted plot |


### Encoding Data Types
Altair recognizes five main data types:

| Data Type    | Shorthand Code | Description                       |
| ---------    | -------------- | -----------                       |
| quantitative | `Q`            | a continuous real-valued quantity |
| ordinal      | `O`            | a discrete ordered quantity       |
| nominal      | `N`            | a discrete unordered category     |
| temporal     | `T`            | a time or date value              |
| geojson      | `G`            | a geographic shape                |

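Here, discrete means that the values take separate, countable levels rather than varying continuously.

If no data type is specified, Altair defaults to `quantitative` for numeric data, `temporal` for date/time data, and `nominal` for string data. Note that these defaults are not always the right choice. 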
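
The default rule described above can be written down as a few lines of plain Python. This is only a loose sketch of the rule, not Altair's actual inference code; `default_type` is a hypothetical helper and the column values are made up.

```python
import datetime
import numbers

def default_type(values):
    """Hypothetical helper: guess an encoding type from the first value."""
    first = values[0]
    if isinstance(first, (datetime.date, datetime.datetime)):
        return 'temporal'
    if isinstance(first, numbers.Number):
        return 'quantitative'
    return 'nominal'

columns = {
    'Horsepower': [130.0, 165.0, 150.0],
    'Year': [datetime.date(1970, 1, 1), datetime.date(1971, 1, 1)],
    'Origin': ['USA', 'Europe', 'Japan'],
    'Cylinders': [8, 4, 6],
}

guessed = {name: default_type(values) for name, values in columns.items()}
print(guessed)
```

`Cylinders` comes out as quantitative because it is numeric, although an ordinal type (`'Cylinders:O'`) may describe a handful of cylinder counts better. This is the kind of case where the default should be overridden.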


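There are two ways to specify the data type, shown below: as a shorthand suffix on the field name, or through the `type` argument of the channel class: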
```python
alt.Chart(cars).mark_point().encode(
    x='Acceleration:Q',
    y='Miles_per_Gallon:Q',
    color='Origin:N'
)
```
```python
alt.Chart(cars).mark_point().encode(
    alt.X('Acceleration', type='quantitative'),
    alt.Y('Miles_per_Gallon', type='quantitative'),
    alt.Color('Origin', type='nominal')
)
```

### Effect of Data Type on Color Scales
The same data, with color encoded as three different data types:
```python
base = alt.Chart(cars).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
).properties(
    width=150,
    height=150
)

# Stack the three variants vertically; only the type of the color field differs.
alt.vconcat(
    base.encode(color='Cylinders:Q').properties(title='quantitative'),
    base.encode(color='Cylinders:O').properties(title='ordinal'),
    base.encode(color='Cylinders:N').properties(title='nominal')
)
```

![9](./img/9.png) 
### Effect of Data Type on Axis scales
The data type also affects the scale that is used and the characteristics of the marks:  
```python
pop = data.population.url

base = alt.Chart(pop).mark_bar().encode(
    alt.Y('mean(people):Q', title='total population')
).properties(
    width=200,
    height=200
)

alt.hconcat(
    base.encode(x='year:Q').properties(title='year=quantitative'),
    base.encode(x='year:O').properties(title='year=ordinal')
)
```

![10](./img/10.png) 
In Altair, quantitative scales always start at zero unless otherwise specified, while ordinal scales are limited to the values present in the data.  

### Encoding Channel Options
Each encoding channel allows for a number of additional options to be expressed; these can control things like axis properties, scale properties, headers and titles, binning parameters, aggregation, sorting, and many more.  
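The `X` and `Y` encodings accept the following options:
+ aggregate: the aggregation function for the field, such as `mean`, `sum`, `max`, or `count`
	+ [aggregate documentation](https://vega.github.io/vega-lite/docs/aggregate.html)
+ axis: an object defining the properties of the axis gridlines, ticks, and labels. If `null`, the axis for the encoding channel is removed.
	+ [axis documentation](https://vega.github.io/vega-lite/docs/axis.html)
+ band: for rect-based marks (`rect`, `bar`, and `image`), the mark size relative to the bandwidth of a band scale or time unit. If set to 1, the mark size equals the bandwidth or the time unit interval. If set to 0.5, the mark size is half of the bandwidth or the time unit interval.
+ bin: a flag for binning a `quantitative` field, an object defining binning parameters, or an indication that the data for the `x` or `y` channel was already binned before being imported into Vega-Lite (`"binned"`)
	+ If `true`, default binning parameters are applied.
	+ If `"binned"`, the data for the `x` (or `y`) channel is already binned. The bin-start field can be mapped to `x` (or `y`) and the bin-end field to `x2` (or `y2`). The scale and axis are formatted similarly to binning in Vega-Lite. To adjust the axis ticks based on the bin step, the axis `tickMinStep` property can also be set.
	+ [bin documentation](https://vega.github.io/vega-lite/docs/bin.html)
+ field: required. A string defining the name of the field from which to pull a data value, or an object defining iterated values from the `repeat` operator.
+ impute: an object defining the properties of the impute operation to be applied. The field value of the other positional channel is taken as the `key` of the `Impute` operation, and the field of the `color` channel is used as its groupby.
	+ [impute documentation](https://vega.github.io/vega-lite/docs/impute.html)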

### Binning and Aggregation
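Beyond simple channel encodings, Altair's visualizations are built on the concept of database-style grouping and aggregation; that is, the split-apply-combine abstraction that underpins many data analysis approaches.   
For example, building a histogram from a one-dimensional dataset involves splitting the data according to the bin each value falls in, aggregating the results within each bin using a count, and then combining the results into the final figure.   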
```python
alt.Chart(cars).mark_bar().encode(
	alt.X('Horsepower', bin=True),
	y='count()'
	# could also use alt.Y(aggregate='count', type='quantitative')
	)
```

![11](./img/11.png)
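
The split-apply-combine steps behind a histogram like the one above can be sketched in plain Python. This is only an illustration with made-up values and an assumed bin width of 50; it does not reproduce how Altair or Vega-Lite chooses bins internally.

```python
from collections import Counter

# Made-up horsepower values, not taken from the cars dataset.
values = [46, 52, 65, 70, 88, 90, 95, 110, 150, 165]
step = 50  # assumed bin width for this sketch

# Split: assign every value to the lower edge of its bin.
bin_starts = [(v // step) * step for v in values]

# Apply and combine: count the values that fall in each bin.
counts = dict(sorted(Counter(bin_starts).items()))
print(counts)
```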
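
Similarly, we can create a two-dimensional histogram using, for example, the size of points to indicate counts within the grid (sometimes called a "bubble plot"):  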
```python
alt.Chart(cars).mark_point().encode(
	alt.X('Horsepower', bin=True),
	alt.Y('Miles_per_Gallon',bin=True),
	size='count()',
	)
```

![12](./img/12.png) 
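There is no need, however, to limit aggregations to counts. For example, we can similarly create a plot in which the color of each point represents the mean of a third quantity, such as acceleration:   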
```python
alt.Chart(cars).mark_circle().encode(
	alt.X('Horsepower', bin=True),
	alt.Y('Miles_per_Gallon', bin=True),
	size='count()',
	color='average(Acceleration):Q'
	)
```

![13](./img/13.png)

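Besides `count` and `average`, Altair has many built-in aggregation functions; the available statistics are listed in the table below:  

| **Aggregate** | **Description**                                                            |
| ------------- | ---------------                                                            |
| argmin        | An input data object containing the minimum field value                    |
| argmax        | An input data object containing the maximum field value                    |
| average       | The mean (average) field value. Identical to `mean`                        |
| count         | The total count of data objects in the group                               |
| distinct      | The count of distinct field values                                         |
| max           | The maximum field value                                                    |
| mean          | The mean (average) field value                                             |
| median        | The median field value                                                     |
| min           | The minimum field value                                                    |
| missing       | The count of null or undefined field values                                |
| q1            | The lower quartile boundary of the values                                  |
| q3            | The upper quartile boundary of the values                                  |
| ci0           | The lower boundary of the bootstrapped 95% confidence interval of the mean |
| ci1           | The upper boundary of the bootstrapped 95% confidence interval of the mean |
| stderr        | The standard error of the field values                                     |
| stdev         | The sample standard deviation of the field values                          |
| sum           | The sum of the field values                                                |
| valid         | The count of field values that are not null or undefined                   |
| values        | The list of data objects in the group                                      |
| variance      | The sample variance of the field values                                    |
| variancep     | The population variance of the field values                                |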
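
Several of these aggregates are easy to confuse, in particular `variance` versus `variancep` and `stdev` versus `stderr`. The sketch below uses Python's standard `statistics` module on made-up numbers to show the usual definitions; it assumes the conventional formulas (sample statistics divide by n - 1, the standard error is the sample standard deviation divided by the square root of n) and does not call Altair.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up field values

mean = statistics.mean(values)            # mean / average
variance = statistics.variance(values)    # sample variance (divides by n - 1)
variancep = statistics.pvariance(values)  # population variance (divides by n)
stdev = statistics.stdev(values)          # sample standard deviation
stderr = stdev / math.sqrt(len(values))   # standard error of the mean

print(mean, variance, variancep, stdev, stderr)
```

The sample variance is always at least as large as the population variance for the same values, because it divides the same sum of squared deviations by a smaller number.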

### Encoding Shorthands
| **Shorthand**        | **Equivalent long-form**                              |
| --------             | ----------                                            |
| `x='name'`           | `alt.X('name')`                                       |
| `x='name:Q'`         | `alt.X('name', type='quantitative')`                  |
| `x='sum(name)'`      | `alt.X('name', aggregate='sum')`                      |
| `x='sum(name):Q'`    | `alt.X('name', aggregate='sum', type='quantitative')` |
| `x='count():Q'`      | `alt.X(aggregate='count', type='quantitative')`       |
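
The table can be read as a small grammar: an optional aggregate wrapped around the field name, followed by an optional `:` and a one-letter type code. The sketch below expands such strings with a regular expression. It is a simplified illustration, not Altair's own parser; `expand_shorthand` and `PATTERN` are hypothetical names, and only the forms listed in the table are handled.

```python
import re

TYPE_CODES = {'Q': 'quantitative', 'O': 'ordinal', 'N': 'nominal',
              'T': 'temporal', 'G': 'geojson'}

# Either "aggregate(field)" or a bare "field", then an optional ":X" type code.
PATTERN = re.compile(
    r'^(?:(?P<aggregate>\w+)\((?P<inner>\w*)\)|(?P<field>\w+))'
    r'(?::(?P<code>[QONTG]))?$'
)

def expand_shorthand(shorthand):
    """Hypothetical helper: turn a shorthand string into keyword arguments."""
    match = PATTERN.match(shorthand)
    if match is None:
        raise ValueError(f'unsupported shorthand: {shorthand!r}')
    spec = {}
    field = match.group('field') or match.group('inner')
    if field:
        spec['field'] = field
    if match.group('aggregate'):
        spec['aggregate'] = match.group('aggregate')
    if match.group('code'):
        spec['type'] = TYPE_CODES[match.group('code')]
    return spec

for text in ['name', 'name:Q', 'sum(name)', 'sum(name):Q', 'count():Q']:
    print(text, '->', expand_shorthand(text))
```

Note that `count()` carries no field name at all, which matches the last row of the table.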

### Ordering marks
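The `order` option and the `Order` channel control the order in which marks are drawn on the chart.  
For stacked marks, this controls the order of the components of the stack. Here, the elements of each bar are sorted alphabetically by the name of the nominal data in the color channel.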
```python
import altair as alt
from vega_datasets import data

barley = data.barley()
alt.Chart(barley).mark_bar().encode(
    x='variety:N',
    y='sum(yield):Q',
    color='site:N',
    # Stack the bar segments in ascending (alphabetical) order of site.
    order=alt.Order('site', sort='ascending')
)
```

![14](./img/14.png) 
The same approach works for other mark types, such as stacked area charts:
```python
import altair as alt
from vega_datasets import data

barley = data.barley()

alt.Chart(barley).mark_area().encode(
    x='variety:N',
    y='sum(yield):Q',
    color='site:N',
    order=alt.Order("site", sort="ascending")
)
```

![15](./img/15.png) 
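For line marks, the `order` channel encodes the order in which data points are connected. This is useful for creating a scatter plot that draws lines between the points using a field other than those on the x and y axes.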
```python
import altair as alt
from vega_datasets import data

driving = data.driving()

alt.Chart(driving).mark_line(point=True).encode(
    alt.X('miles', scale=alt.Scale(zero=False)),
    alt.Y('gas', scale=alt.Scale(zero=False)),
    order='year'
)
```

![16](./img/16.png) 
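
What the `order` channel changes for a line mark is only the sequence in which the points are joined. The plain-Python sketch below illustrates this with made-up records (not the real driving dataset); it assumes that, without an `order` encoding, the line follows the x field.

```python
# Made-up records for illustration only.
points = [
    {'year': 1, 'miles': 4000, 'gas': 2.4},
    {'year': 2, 'miles': 6000, 'gas': 2.2},
    {'year': 3, 'miles': 5000, 'gas': 3.0},
    {'year': 4, 'miles': 7000, 'gas': 2.6},
]

# Assumed default: join the points from left to right along the x field.
path_by_x = [p['year'] for p in sorted(points, key=lambda p: p['miles'])]

# With order='year': join the points chronologically instead.
path_by_year = [p['year'] for p in sorted(points, key=lambda p: p['year'])]

print(path_by_x)
print(path_by_year)
```

The two paths visit the same points, but the chronological one can double back along the x axis, which is what produces the looping shape of a connected scatter plot.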

### Sorting
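Specific channels can take a `sort` property, which determines the order of the scale used for the channel. A number of sort options are available:
+ `sort='ascending'` sorts the field's values in ascending order. For string data, this uses standard alphabetical order.
+ `sort='descending'` sorts the field's values in descending order.
+ Passing the name of an encoding channel to `sort`, such as `"x"` or `"y"`, sorts by that channel. An optional minus prefix sorts in descending order; for example, `sort='-x'` sorts by the `x` channel in descending order.
+ Passing a list to `sort` sets the display order of the encoding explicitly.
+ Passing an `EncodingSortField` object to `sort` orders the axis by the values of another field in the dataset.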
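
To make the last option more concrete, the ordering requested by an `EncodingSortField` with a given `op` can be sketched in plain Python: group the rows by category, reduce each group with the operation, and sort the categories by the result. The records and the helper `order_sites` are made up for illustration; this is not Altair code.

```python
from statistics import mean

# Made-up (site, yield) records.
records = [('Morris', 30.0), ('Morris', 50.0),
           ('Duluth', 28.0), ('Duluth', 29.0),
           ('Waseca', 20.0), ('Waseca', 70.0)]

def order_sites(records, op):
    """Hypothetical helper: sort sites by an aggregate of their yields."""
    groups = {}
    for site, value in records:
        groups.setdefault(site, []).append(value)
    return sorted(groups, key=lambda site: op(groups[site]))

by_mean = order_sites(records, mean)  # analogous to op='mean'
by_min = order_sites(records, min)    # analogous to op='min'

print(by_mean)
print(by_min)
```

The same categories end up in a different order depending on the operation, so the choice of `op` matters whenever the groups overlap.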

Example:  
```python
import altair as alt
from vega_datasets import data

barley = data.barley()

base = alt.Chart(barley).mark_bar().encode(
    y='mean(yield):Q',
    color=alt.Color('mean(yield):Q', legend=None)
).properties(width=100, height=100)

# Sort x in ascending order
ascending = base.encode(
    alt.X(field='site', type='nominal', sort='ascending')
).properties(
    title='Ascending'
)

# Sort x in descending order
descending = base.encode(
    alt.X(field='site', type='nominal', sort='descending')
).properties(
    title='Descending'
)

# Sort x in an explicitly-specified order
explicit = base.encode(
    alt.X(field='site', type='nominal',
          sort=['Duluth', 'Grand Rapids', 'Morris',
                'University Farm', 'Waseca', 'Crookston'])
).properties(
    title='Explicit'
)

# Sort according to encoding channel
sortchannel = base.encode(
    alt.X(field='site', type='nominal',
          sort='y')
).properties(
    title='By Channel'
)

# Sort according to another field
sortfield = base.encode(
    alt.X(field='site', type='nominal',
          sort=alt.EncodingSortField(field='yield', op='mean'))
).properties(
    title='By Yield'
)

alt.concat(
    ascending, descending, explicit,
    sortchannel, sortfield,
    columns=3
)
```

![17](./img/17.png) 

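To highlight the difference between sorting by channel and sorting by field, consider the following example, in which the data is not aggregated: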
```python
import altair as alt
from vega_datasets import data

barley = data.barley()
base = alt.Chart(barley).mark_point().encode(
    y='yield:Q',
).properties(width=200)

# Sort according to encoding channel
sortchannel = base.encode(
    alt.X(field='site', type='nominal',
          sort='y')
).properties(
    title='By Channel'
)

# Sort according to another field
sortfield = base.encode(
    alt.X(field='site', type='nominal',
          sort=alt.EncodingSortField(field='yield', op='min'))
).properties(
    title='By Min Yield'
)
sortchannel | sortfield
```

![18](./img/18.png) 
### Sorting Legends
While the examples above sort axes by specifying `sort` in the `X` and `Y` encodings, legends can be sorted by specifying `sort` in the color encoding:  
```python
alt.Chart(barley).mark_rect().encode(
    alt.X('mean(yield):Q', sort='ascending'),
    alt.Y('site:N', sort='descending'),
    alt.Color('site:N',
        sort=['Morris', 'Duluth', 'Grand Rapids',
              'University Farm', 'Waseca', 'Crookston']
    )
)
```

![19](./img/19.png) 
Here the y-axis is sorted in reverse alphabetical order, while the color legend follows the specified order, starting with `'Morris'`.

## Marks